March 5, 2025

ikayaniaamirshahzad@gmail.com

Deep mutational learning for the selection of therapeutic antibodies resistant to the evolution of Omicron variants of SARS-CoV-2


Design and construction of a high-distance Omicron BA.1 RBD library

A mutagenesis library was constructed based on BA.1, covering the entire 201 amino acid RBD region (positions 331–531 of SARS-CoV-2 S protein). To maximize the interrogated RBD sequence space, the library design was entirely synthetic and unbiased, as it did not consider evolutionary data or previous experimental findings. For the construction of the library, the RBD sequence was split into 11–12 fragments, each with an approximate length of 48 nucleotides (Supplementary Table 1). For a fragment of average length, 137 different single-stranded oligonucleotides (ssODN) were designed, where each ssODN had either zero, one or all possible combinations of two codons replaced by fully degenerate NNK codons (N = A, G, C or T; K = G or T) (Fig. 2a and Methods). In total, 6,298 ssODNs were used to construct the library. For each fragment, ssODNs were amplified using PCR to generate double-stranded DNA. Each fragment was flanked by recognition sites for the type II-S restriction enzyme BsmBI, thus enabling assembly into full-length RBD regions using Golden Gate assembly (GGA)33. GGA uses type II-S restriction enzymes capable of cleaving DNA outside their recognition sequence, thereby allowing the resulting DNA overhangs to have any sequence. Based on the overhangs, individual fragments were assembled by DNA ligase to full-length RBD sequences with high fidelity34,35. The restriction sites were eliminated during the process, thus enabling scarless assembly of full-length RBD sequences (Fig. 2b and Methods)34. This approach yielded approximately 98% correctly assembled RBD sequences (Supplementary Fig. 1). As GGA required four nucleotide homology between individual fragments for ligation, this led to portions of the sequence which needed to remain constant, thereby restricting library diversity36. To overcome this limitation, four staggered sub-libraries were designed and individually assembled. 
Using sub-library 1 as a reference, sub-library 2 was shifted by 12 nucleotides, sub-library 3 by 24 nucleotides and sub-library 4 by 36 nucleotides. These sub-libraries increased the mutational space covered by the RBD combinatorial mutagenesis library because, at the GGA homology region of any given sub-library, the remaining three sub-libraries can carry mutations (Fig. 2c). Considering all possibilities of combining fragments with either zero, one or two mutations, this design led to a theoretical library diversity of approximately 10^42 variants.
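The design figures quoted above can be checked with a little arithmetic. The sketch below assumes that a fragment of average length (48 nucleotides) corresponds to 16 codons and that the standard genetic code applies; the function names are illustrative and do not come from the paper.

```python
from itertools import product
from math import comb

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... (T, C, A, G).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def nnk_codons():
    """All codons matching NNK (N = any base, K = G or T)."""
    return [a + b + c for a in BASES for b in BASES for c in "GT"]


def designs_per_fragment(n_codons):
    """Number of ssODN designs with zero, one or two NNK-substituted codons."""
    return comb(n_codons, 0) + comb(n_codons, 1) + comb(n_codons, 2)


codons = nnk_codons()
encoded = {CODON_TABLE[c] for c in codons}
stops = sorted(c for c in codons if CODON_TABLE[c] == "*")

print(len(codons))                    # 32 NNK codons
print(len(encoded - {"*"}))           # 20 amino acids are reachable
print(stops)                          # ['TAG'] is the only stop codon in NNK
print(designs_per_fragment(48 // 3))  # 137 designs for a 16-codon fragment
```

Under these assumptions, 1 + 16 + 120 = 137 reproduces the number of ssODNs stated for a fragment of average length.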

Fig. 2: Construction of a high ED synthetic variant library based on Omicron BA.1 RBD.

a, The RBD sequence was split into 11–12 fragments, each on average 48 nucleotides in length. For each fragment, an ssODN library with either zero, one or two mutations was designed. b, To introduce mutations, NNK codons were tiled across the fragments (1). Each fragment was flanked by BsmBI sites (2). The ssODNs were flanked by primer binding sites for double-stranded synthesis through PCR (primers are represented by black arrows, and primer binding sites are peach coloured) (3). The type II-S restriction enzyme BsmBI gives rise to orthogonal four-nucleotide overhangs, which are used by a ligase to assemble individual fragments into full-length RBD sequences (4). c, The use of GGA for library construction required the presence of constant regions for ligation between fragments (in black), thereby restricting the library diversity. To overcome this drawback, four staggered sub-libraries were constructed. Due to limitations in sequencing length, it was further necessary to split the RBD into two separate libraries. The extent of seq-library A is indicated in orange and seq-library B in cyan. The primer binding sites for deep sequencing are indicated using orange and cyan arrows. d, Targeted sequencing of seq-libraries A and B showed comprehensive mutational coverage for both libraries. The same colour scheme as in c was used to indicate the extent of both libraries. e, To adjust the mutational rate of the library, different ratios of fragments with zero, one or two mutations (60%/20%/20%, 70%/15%/15% and 80%/10%/10%) were pooled, yielding libraries with an average number of mutations of 3.59, 2.07 and 1.41, respectively.

The current read length of Illumina sequencing does not allow coverage of the entire RBD with a single paired-end read. Therefore, two separate sequencing libraries (seq-libraries A and B) were individually constructed. Seq-libraries A and B possessed mutations in positions 331–475 and 386–531, respectively (Fig. 2c). The seq-libraries were constructed separately, but all subsequent steps were performed in a pooled fashion. For targeted sequencing, each seq-library was flanked by unique primer binding sites. Following deep sequencing, complete mutational coverage for each residue was observed in both seq-libraries (Fig. 2d). It is worth noting that the mutational frequency is somewhat variable across the seq-libraries, showing a marked decrease in mutations every 16 residues. The low mutational frequencies line up with the GGA homologies of sub-library 1. We hypothesize that when pooling the sub-libraries, sub-library 1 was more prominent than the other sub-libraries, and therefore fewer mutations are observed at these sites.

Next, to optimize the number of mutations per RBD sequence, a titration of the fragment assembly step was performed. Wild-type fragments (BA.1 sequence) and fragments with one and two mutations, respectively, were pooled in different ratios for assembly. Separately, assembly was performed with 60%, 70% and 80% of wild-type fragments, with the remaining percentage split evenly between fragments with one and two mutations. Deep sequencing of these libraries revealed a clear trend in mutational distribution based on the different ratios, highlighting the tunable nature of our approach (Fig. 2e). Based on these results, all subsequent work was carried out using the 60% wild-type library as it has the highest mean number of mutations, therefore allowing us to adequately model and profile extensively mutated Omicron sublineages.

Screening RBD libraries for ACE2 binding and antibody escape

Co-transformation of yeast cells (Saccharomyces cerevisiae, strain EBY100) using the PCR-amplified RBD library and linearized plasmid yielded more than 2 × 10^8 transformants (Methods). Yeast surface display of RBD variants was achieved through C-terminal fusion to Aga2 (ref. 37). Next, fluorescence-activated cell sorting (FACS) was used to isolate yeast cells expressing RBD variants that either retained binding or completely lost binding to dimeric soluble human ACE2 (Fig. 3a). It is worth noting that RBD variants with only partial binding to ACE2 were not isolated, as such intermediate populations could not be confidently classified as either binding or non-binding. Removing these variants is essential to obtain cleanly labelled datasets for training supervised machine learning models. As binding to ACE2 is a prerequisite for cell entry and subsequent viral replication, only the ACE2-binding population is biologically relevant. Thus, only this population was used in subsequent FACS steps to isolate RBD variants that either retained binding or completely lost binding (escape) to a panel of eight neutralizing antibodies (Fig. 3a,b, Supplementary Fig. 2 and Supplementary Table 2). The antibodies selected target different epitopes and are well characterized for their neutralizing activity to BA.1 and its sublineages, which provides a good internal control to assess the accuracy of our method38,39,40. The panel consists of the following antibodies: A23-58.1 (ref. 41), COV2-2196 (ref. 42), Brii-198 (ref. 43), ZCB11 (ref. 17), 2–7 (ref. 44), S2X259 (ref. 45), ADG20 (ref. 46) and S2H97 (ref. 20).

Fig. 3: Screening RBD libraries for ACE2 binding and antibody escape by yeast display and deep sequencing.

a,b, Workflow for sorting of yeast display RBD libraries and FACS dot plots for ACE2 (a) and antibodies Brii-198 and ZCB11 (b). Gating schemes correspond to binding and non-binding (escape) RBD variant populations. c,d, Heat maps depicting the binding score of each amino acid per position of full-length RBD following sorting and deep sequencing of libraries for ACE2 (c) and ZCB11 (d); a higher binding score indicates greater frequency in the binding population than in the non-binding population. Wild-type BA.1 residues are in grey. e, Heat maps for seq-libraries A and B depicting binding scores for ACE2 and antibodies of key mutations seen in major Omicron sublineage variants.

Following ACE2 and monoclonal antibody sorting, pure populations of RBD variants (binding and non-binding) were subjected to deep sequencing (Supplementary Table 3). During the sorting process, it was noted that antibodies COV2-2196 and 2–7 showed a weaker binding signal (Supplementary Fig. 2). This was especially pronounced in the case of antibody 2–7 and is likely due to the low affinity of this antibody to Omicron BA.1 RBD (Supplementary Fig. 3) and a generally low mutational resilience (Supplementary Fig. 4a). These factors contributed to the collection of fewer cells for those antibodies. Reads covering the RBD sequence were then extracted from the deep sequencing data, and heat maps were constructed depicting binding scores (relative amino acid frequencies per position in the RBD of binding versus non-binding variants) (Fig. 3c,d and Supplementary Figs. 4 and 5). The heat maps show nearly complete coverage of mutations across the RBD within all sorted populations. A heterogeneous distribution of mutations is observed for ACE2 binding, with no specific positions or mutations showing dominance (Fig. 3c). This agrees with previous studies that suggest the Q498R and N501Y mutations present in BA.1 show strong epistatic effects that compensate for many mutations that cause loss of binding47. By contrast, for certain antibodies, clear mutational patterns could be observed, including escape mutations that correspond with previous DMS studies (Fig. 3d,e and Supplementary Figs. 4 and 5). For example, RBD escape variants for Brii-198 are enriched for mutations in positions 346 and 452 (Supplementary Fig. 4d), which are present in BA.1 and BA.4/BA.5, respectively, and correspond to previous work that shows they drive a drastic loss of binding to Brii-198 (ref. 48). By contrast, enrichment of these escape mutations is not observed for antibody 2–7 (Supplementary Figs. 4 and 5), even though Brii-198 and 2–7 share a similar epitope. This suggests that the binding modalities of these two antibodies differ, which is also reflected by their difference in resistance to Omicron variants (for example, 2–7 shows strong binding to BA.2 and BA.4/BA.5, while Brii-198 does not bind BA.2.12 and BA.4/BA.5)10,39. Similarly, the F486V mutation, which has been demonstrated to reduce the neutralization potency of ZCB11 by over 2,000-fold10, is highly enriched in the RBD escape population (Fig. 3d,e). These mutations are also seen in A23-58.1 and COV2-2196, which bind to a similar epitope (Supplementary Figs. 4 and 5). Lastly, for ADG20, we observe a high enrichment of escape mutations at position 408 (Fig. 3e and Supplementary Figs. 4 and 5); this position is also mutated in BA.2 and BA.4/BA.5 variants, which have been shown to have drastically reduced neutralization by ADG20 (ref. 10). While heat map analysis allows specific mutational patterns to be linked with antibody escape profiles, the high-dimensional nature (and potentially higher-order impact) of combinatorial mutations is not reflected in this format. It is apparent that protein epistasis and combinatorial mutations can modify the effect of known escape mutations, either amplifying or reducing antibody binding. For example, individual RBD mutations (G339D, S371F, S373P, S375, K417N, N440K, G446S, S477N, T478K, E484A, Q493R, G496S, Q498R, N501Y, Y505H) in BA.1 and BA.1.1 do not enhance escape to COV2-2196, with each mutation causing an average fold reduction of 2.2, but together cause over 200-fold reduction in neutralization49. Conversely, the introduction of the single R493Q mutation in BA.2 substantially rescued the neutralizing activities of Brii-198, REGN10933, COV2-2196 and ZCB11 (ref. 10). Thus, while the heat maps indicate specific mutational contributions to antibody escape, other techniques such as deep learning are required to capture the high-dimensional nature of combinatorial mutations and generalize to future mutations.
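As a rough illustration of how a per-position binding score could be derived from sorted populations, the sketch below takes the log2 ratio of smoothed amino acid frequencies in binding versus non-binding sequences. The log2 form, the pseudocount and the function names are assumptions made for this example; the exact definition used for the heat maps is given in the paper's Methods.

```python
from collections import Counter
from math import log2

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def position_frequencies(sequences, pseudocount=1.0):
    """Per-position amino acid frequencies with additive smoothing."""
    length = len(sequences[0])
    total = len(sequences) + pseudocount * len(AMINO_ACIDS)
    freqs = []
    for pos in range(length):
        counts = Counter(seq[pos] for seq in sequences)
        freqs.append({aa: (counts[aa] + pseudocount) / total for aa in AMINO_ACIDS})
    return freqs


def binding_scores(binders, non_binders, pseudocount=1.0):
    """log2(frequency in binders / frequency in non-binders) per position and residue."""
    f_bind = position_frequencies(binders, pseudocount)
    f_escape = position_frequencies(non_binders, pseudocount)
    return [
        {aa: log2(fb[aa] / fe[aa]) for aa in AMINO_ACIDS}
        for fb, fe in zip(f_bind, f_escape)
    ]


# Toy three-residue sequences: V at the second position is enriched among escapers.
binders = ["AFK", "AFK", "AFR", "GFK"]
non_binders = ["AVK", "AVK", "AVR", "AFK"]
scores = binding_scores(binders, non_binders)
print(round(scores[1]["F"], 2), round(scores[1]["V"], 2))
```

A positive score marks a residue seen more often in the binding population, and a negative score marks one enriched among escape variants. As the text notes, such single-site scores cannot represent combinatorial effects.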

Deep learning ensemble models accurately predict ACE2 binding and antibody escape

To address the high dimensionality of our dataset and to understand epistatic effects between mutations in the full RBD mutational sequence space, which is far too vast to be comprehensively screened experimentally, we trained deep learning ensemble models. Deep sequencing data from FACS-isolated yeast populations underwent pre-processing and quality filtering before being used as training data for machine learning. In the datasets for all antibodies, using the BA.1 RBD sequence as a reference, the mean number of mutations ranged between edit distance (ED) two (ED2) and ED3, with a maximum of ED8 (Supplementary Figs. 6 and 7 and Methods). Following DNA-to-protein translation, one-hot encoding was performed to convert amino acid sequences into an input matrix for machine and deep learning models (Fig. 4a). Supervised machine learning models were trained to predict the probability (P) that a specific RBD sequence will bind to ACE2 or a given antibody. A higher P indicates that binding is more likely, whereas a lower P corresponds to non-binding (escape). The machine learning models tested included K-nearest neighbour, logistic regression, naive Bayes, support vector machines and random forests. In addition, as a baseline for deep learning models, a multilayer perceptron (MLP) model was also tested. Finally, we implemented a convolutional neural network (CNN) inspired by ProtCNN (ref. 50), which leverages residual neural network blocks and dilated convolutions to learn global information across the full RBD sequence (Fig. 4a).
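The encoding step can be sketched in a few lines. The example below assumes a 20-letter amino acid alphabet and substitution-only variants, so that edit distance reduces to the number of differing positions. The helper names and the toy peptide are illustrative only.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot_2d(sequence):
    """Encode a protein sequence as a len(sequence) x 20 matrix of 0/1 values."""
    matrix = []
    for aa in sequence:
        row = [0] * len(AMINO_ACIDS)
        row[AA_INDEX[aa]] = 1
        matrix.append(row)
    return matrix


def one_hot_flat(sequence):
    """Flatten the 2D encoding row by row into a single vector."""
    return [value for row in one_hot_2d(sequence) for value in row]


def substitution_distance(sequence, reference):
    """Number of differing positions between two equal-length sequences."""
    if len(sequence) != len(reference):
        raise ValueError("sequences must have equal length")
    return sum(a != b for a, b in zip(sequence, reference))


reference = "ACDEFGHK"  # arbitrary toy peptide, not an RBD sequence
variant = "ACDQFGHR"

matrix = one_hot_2d(variant)
print(len(matrix), len(matrix[0]))                # 8 20
print(len(one_hot_flat(variant)))                 # 160
print(substitution_distance(variant, reference))  # 2
```

With this alphabet, the 201-residue RBD would give a 201 × 20 matrix, or a vector of 4,020 values once flattened.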

Fig. 4: Training and testing of deep learning ensemble models for prediction of ACE2 binding and antibody escape based on full-length RBD sequences.

a, Deep sequencing data of sorted yeast display libraries are encoded by one-hot encoding and used to train CNN models with several dilated convolutional residual blocks. The models perform a final classification by predicting binding or non-binding to ACE2 or antibodies based on the encoded RBD sequence. b, Majority voting by an ensemble of models is used to determine the final label for each variant. c, Predicted labels of antibodies to well-characterized Omicron variants; colours indicate final labels; misclassifications are marked with an ‘X’; conflicting data for S2X259 binding to BQ.1 and BA.2.75 are marked with ‘*’.

Each model was trained using an 80/10/10 train–validate–test split of data. Inputs were one-hot encoded RBD sequences, with the CNN using a two-dimensional (2D) matrix and others using a 1D flattened vector. For initial benchmarking, a collection of different baseline machine learning models, as well as CNN and MLP deep learning models, were trained on each dataset with hyperparameter optimization through random search and were evaluated with fivefold cross validation based on several common metrics (accuracy, measure of predictive performance (F1), Matthews correlation coefficient (MCC), precision and recall). During training, class balancing was achieved by upsampling the minority class in the training set, while the test set remained unbalanced. Comparing performances of the baseline models, both extreme gradient boosting (XGBoost) and CNN models obtained the highest MCC scores across most of the antibodies and libraries (Supplementary Data 1). However, in one single condition, seq-library A for antibody S2X259, the CNN model vastly outperformed all of the others, with an MCC score 0.15 higher than the next best model (Supplementary Data 1). This suggests that depending on the antibody, the use of deep learning architectures is still crucial for learning complex interactions across larger distances, and thus we performed all subsequent work with this CNN architecture.
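Two details of this benchmarking setup are easy to make concrete: the Matthews correlation coefficient and the upsampling of the minority class in the training set. The sketch below is a plain-Python illustration with binary labels (1 = binding, 0 = escape). The function names are hypothetical, and random resampling with replacement is an assumption about how the balancing could be done.

```python
import random
from math import sqrt


def mcc(y_true, y_pred):
    """Matthews correlation coefficient for binary labels (0 or 1)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def upsample_minority(samples, labels, seed=0):
    """Resample the minority class with replacement until both classes are equal in size."""
    rng = random.Random(seed)
    by_class = {0: [], 1: []}
    for sample, label in zip(samples, labels):
        by_class[label].append(sample)
    minority = min(by_class, key=lambda k: len(by_class[k]))
    majority = 1 - minority
    if not by_class[minority]:
        raise ValueError("both classes must be present")
    deficit = len(by_class[majority]) - len(by_class[minority])
    extra = [rng.choice(by_class[minority]) for _ in range(deficit)]
    new_samples = samples + extra
    new_labels = labels + [minority] * deficit
    return new_samples, new_labels


train_x = ["s1", "s2", "s3", "s4", "s5"]
train_y = [1, 1, 1, 1, 0]
balanced_x, balanced_y = upsample_minority(train_x, train_y)
print(balanced_y.count(0), balanced_y.count(1))        # 4 4
print(round(mcc([1, 1, 1, 0], [1, 1, 0, 0]), 3))       # 0.577
```

Only the training set is resampled here; as in the text, the test labels would be left at their natural class ratio. MCC is informative in that setting because it uses all four cells of the confusion matrix.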

We next applied an exhaustive hyperparameter search on CNN models to optimize their performance (Supplementary Table 4). To prevent data leakage during training, the held-out test set was fixed, and multiple models were trained on different training–validation splits of the remaining dataset to make sure each model learned slightly different parameters of the data. When tested on the held-out test set, the final models yielded robust predictive performance up to an ED8 from the wild-type BA.1 sequence (Supplementary Fig. 8 and Supplementary Table 5).

For our final ensemble, we selected the three CNN models with the highest MCC scores from each library to generate the predicted labels for each RBD variant through majority voting (Fig. 4b). In short, each model outputs P of binding for each input sequence, and labels are assigned based on a threshold. An RBD variant was assigned a predicted ‘escape’ label if the ensemble models of either seq-library A or seq-library B predicted escape, and a predicted ‘binding’ label only if both ensemble models predicted binding. We tested the performance of the ensemble models using published experimental data of antibody binding (or neutralization) to Omicron sublineages10,38,39,48,51,52,53 (Supplementary Data 2). Where possible, we used published antibody affinity data39 and set the escape (non-binding) threshold to Kd > 100 nM, a limit that indicates considerable loss of binding. For some antibodies, such as ZCB11 and Brii-198, only neutralization data are reported, without affinity measurements10,48; in these cases, we used neutralization to define an escape threshold of half maximal inhibitory concentration (IC50) > 10 µg/ml. For our deep learning models, standard thresholds were used for classification: P > 0.5 as binding and P ≤ 0.5 as non-binding (escape). The models assigned accurate labels for 87.5% (63/72) of ACE2-RBD variant or antibody-RBD variant pairs, with two false positives and seven false negatives (Fig. 4c). The high number of false negatives suggests that the models tend to be conservative when predicting antibody binding, which may be preferable in the context of selecting optimal antibody therapies. Furthermore, six of the nine misclassifications occur when variants have >8 mutations from the parental BA.1 sequence (BA.4, BA.2.75, BQ.1), which may suggest that at high mutational loads, the combinatorial effects of mutations begin to exceed what the models can reliably predict. Finally, there have been contradictory reports for binding and neutralization with the antibody S2X259: one publication reports low half maximal effective concentration (EC50) and IC50 values across all variants up to BQ.1 (ref. 39), whereas a second study10 reports a much higher IC50 of >10 µg/ml to BA.2. Our model predictions support the latter publication, as we predict a loss of binding of S2X259 to all variants beyond BA.2.
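The labelling rule described above (a threshold at P = 0.5, a majority vote within each seq-library ensemble, and a conservative combination across the two libraries) can be written compactly. This is a minimal sketch with made-up probabilities, and the function names are illustrative.

```python
def vote(probabilities, threshold=0.5):
    """Majority vote: 'binding' if most models give P above the threshold."""
    votes = sum(p > threshold for p in probabilities)
    return "binding" if votes > len(probabilities) / 2 else "escape"


def final_label(probs_lib_a, probs_lib_b, threshold=0.5):
    """Escape if either seq-library ensemble votes escape; binding only if both vote binding."""
    label_a = vote(probs_lib_a, threshold)
    label_b = vote(probs_lib_b, threshold)
    return "binding" if label_a == label_b == "binding" else "escape"


# Three model outputs per seq-library (hypothetical values).
print(final_label([0.9, 0.8, 0.4], [0.7, 0.6, 0.95]))  # binding
print(final_label([0.9, 0.8, 0.4], [0.7, 0.3, 0.2]))   # escape

# Reported accuracy: 63 correct labels out of 72 pairs.
print(f"{63 / 72:.1%}")  # 87.5%
```

Because a single escape vote from either library decides the final label, this rule leans towards predicting escape, which fits the observation that most of the nine errors (seven) were false negatives.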

Designing antibody combinations by predicting resistance to synthetic Omicron lineages

After validating the performance of CNN models on test and validation data, we next deployed them to evaluate the resistance of antibodies to viral evolution. While antibody breadth is normally evaluated retroactively based on neutralization or binding to previously observed variants, here we aimed to leverage this machine-learning-guided protein engineering approach to prospectively characterize and assess the breadth of antibodies against Omicron variants that may emerge in the future. This was achieved by generating synthetic lineages stemming from BA.1. As the potential sequence space of combinatorial RBD mutations is exceedingly massive, it was necessary to reduce this to a relevant subspace, and therefore mutational probabilities were calculated across the RBD using SARS-CoV-2 genome sequencing data (available on Global Initiative on Sharing Avian Influenza Data, GISAID (www.gisaid.org)) and used to generate synthetic lineages that mimic natural mutational frequencies. Starting with the BA.1 sequence, mutational frequencies from 2021 and 2022 were used to generate ten sets of 250,000 synthetic RBD sequences through six rounds of in silico evolution, where the 100 variants with the highest predicted score for ACE2 binding (averaged across the ensemble CNN models) in each round were used as seed sequences for the next round of mutations. Next, the ensemble deep learning models were used to predict antibody binding or escape for the synthetic RBD variants. This provides an estimation of each individual antibody’s binding breadth in the generated sequence space and thus correlates with resistance to prospective Omicron lineages (Fig. 5a,b and Supplementary Fig. 9).
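The round-based generation of synthetic lineages can be sketched as a simple select-and-mutate loop. Everything specific in the example below is an assumption made for illustration: the toy parent peptide, the per-position weights (standing in for GISAID-derived mutational probabilities), the uniform choice of substituted residue and the placeholder scoring function that takes the place of the ensemble CNN prediction of ACE2 binding. The numbers are scaled down from the 250,000 sequences per set described in the text.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def mutate(sequence, position_weights, rng):
    """Substitute one residue, choosing the position according to its weight."""
    pos = rng.choices(range(len(sequence)), weights=position_weights, k=1)[0]
    new_aa = rng.choice([aa for aa in AMINO_ACIDS if aa != sequence[pos]])
    return sequence[:pos] + new_aa + sequence[pos + 1:]


def evolve(parent, position_weights, score_fn, rounds=6,
           offspring_per_round=1000, n_seeds=100, seed=0):
    """Mutate seed sequences each round and keep the top scorers as the next seeds."""
    rng = random.Random(seed)
    seeds = [parent]
    generated = set()
    for _ in range(rounds):
        children = {
            mutate(rng.choice(seeds), position_weights, rng)
            for _ in range(offspring_per_round)
        }
        generated |= children
        # Sort on (score, sequence) so that ties are broken deterministically.
        ranked = sorted(children, key=lambda s: (score_fn(s), s), reverse=True)
        seeds = ranked[:n_seeds]
    return generated


parent = "ACDEFGHIKLMN"  # arbitrary toy peptide, not the BA.1 RBD
weights = [1.0] * len(parent)
weights[3] = 5.0         # hypothetical mutational hotspot


def toy_score(seq):
    """Placeholder for a predicted ACE2-binding score (similarity to the parent)."""
    return sum(a == b for a, b in zip(seq, parent))


lineage = evolve(parent, weights, toy_score, offspring_per_round=200, n_seeds=20)
max_ed = max(sum(a != b for a, b in zip(s, parent)) for s in lineage)
print(len(lineage), max_ed)
```

Since each round adds at most one substitution to a seed, six rounds bound the distance from the parent at six mutations in this simplified version.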

Fig. 5: Evaluating antibody breadth on synthetic Omicron lineages.

a, Example of a synthetic lineage tree of sequences generated containing mutations unseen in major Omicron variants, with heat map indicating the deep learning predictions of binding or escape for individual antibodies. VOCs, variants of concern. b, Total mean predicted breadth of individual antibodies and combinations on synthetic lineages generated from 2022 mutational probabilities. c, The fraction (%) of sequences bound by individual antibodies at different ED from BA.1. d, Phylogenetic tree of ZCB11 escape variants containing the 20 highest-scoring mutations, with antibody recapture indicated. e, Sequence logos show the mutations in the top 25 positions with greatest KL divergence in ZCB11 escape variants at ED6, and number of sequences re-captured by Brii-198, ADG20 and A23-58.1. (Higher recapture indicates a more complementary antibody). f, The top 50 predicted mutations ranked by their escape scores (Methods) from the generated synthetic lineages, with new mutations seen in the BA.2.86 variant highlighted.

As several of the clinically used antibody therapies for COVID-19 consisted of a cocktail of two antibodies (such as LY-CoV555 + LY-CoV016, REGN10933 + REGN10987 and COV2-2130 + COV2-2196), we also determined antibody breadth across all two-way combinations. For the 2022-based synthetic lineages, ZCB11 showed the greatest predicted breadth, followed by A23-58.1, Brii-198 and ADG20 (Fig. 5b). The predicted coverage of ZCB11 corresponds well with experimental measurements that show it maintains high affinities and neutralization to several Omicron variants (BA.2, BA.4/5)10. Similarly, Brii-198 and A23-58.1 have been shown to bind BA.2, BA.2.12 and BA.2.75 variants40, aligning with the predictions of their relatively high breadth. Examining the breadth of each antibody as a function of ED revealed differing profiles: ZCB11 and Brii-198 maintain high breadth at larger ED (>ED4), while A23-58.1 and ADG20 have substantially lower breadth at large ED (Fig. 5c). The predicted breadth of several antibodies was substantially different for synthetic lineages generated using 2021 mutational probabilities. For example, the breadth of ADG20 is substantially higher, as it is predicted to bind 16% more variants, while the breadth of Brii-198 and of A23-58.1 is reduced by 11% in each case (Supplementary Fig. 10). This suggests that correctly anticipating antigenic drift and changes in mutational frequencies plays an important role in determining breadth predictions.

It is worth noting that the breadth of antibody combinations is not simply additive. For example, while Brii-198 ranks lower than A23-58.1 in total breadth, Brii-198 provides more complementary coverage to ZCB11 (Brii-198 binds to more variants that escape ZCB11), resulting in an overall increase in variant coverage in a simulated cocktail. Thus, when designing a cocktail, to select the best complementary antibody, both the quantity and additional qualities (such as mutational patterns) of escape variant lineages that are ‘re-captured’ by an antibody must be considered (Fig. 5d). Examining the distribution of escape variants for ZCB11 at ED6, where it sees its most substantial breadth reduction, the three other highly ranked antibodies (A23-58.1, Brii-198 and ADG20) re-establish coverage (predicted binding) over unique lineages, with Brii-198 re-capturing the greatest number of high-distance escape variants to ZCB11 (Fig. 5d,e). Taking a closer look at the mutations within the re-captured sequences, only ADG20 and Brii-198 cover and mitigate variants that include the key F486V mutation (for example, BA.4/5). Furthermore, Brii-198 covers the most diverse sequences that contain additional critical mutations at the F486 position, in addition to the surrounding residues in this epitope (Fig. 5e). Thus, while any of the three antibodies would be complementary to ZCB11 by nature of targeting a different epitope10, our breadth analysis aids in identifying the most complementary antibody based on RBD variant coverage.
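The non-additivity of combination breadth follows directly from set overlap, as the toy example below shows. The antibody names and bound-variant sets are invented for illustration and do not correspond to any antibody in the panel. Breadth is taken here as the fraction of variants predicted to be bound by at least one antibody in the combination.

```python
from itertools import combinations


def breadth(bound_sets, antibodies, n_variants):
    """Fraction of variants bound by at least one antibody in the combination."""
    covered = set().union(*(bound_sets[ab] for ab in antibodies))
    return len(covered) / n_variants


def recaptured(bound_sets, primary, partner, all_variants):
    """Variants that escape the primary antibody but are bound by the partner."""
    escapes_primary = all_variants - bound_sets[primary]
    return escapes_primary & bound_sets[partner]


all_variants = set(range(1, 11))  # ten hypothetical synthetic variants
bound_sets = {
    "mAb_X": {1, 2, 3, 4, 5, 6, 7},  # broadest single antibody
    "mAb_Y": {1, 2, 3, 4, 5, 8},     # broader than mAb_Z on its own
    "mAb_Z": {1, 2, 8, 9, 10},       # narrower, but complementary to mAb_X
}

for ab in bound_sets:
    print(ab, breadth(bound_sets, [ab], len(all_variants)))

for pair in combinations(bound_sets, 2):
    print(pair, breadth(bound_sets, pair, len(all_variants)))

print(sorted(recaptured(bound_sets, "mAb_X", "mAb_Y", all_variants)))  # [8]
print(sorted(recaptured(bound_sets, "mAb_X", "mAb_Z", all_variants)))  # [8, 9, 10]
```

In this toy case mAb_Z has the lowest individual breadth (0.5) yet gives full coverage when paired with mAb_X, whereas the individually broader mAb_Y raises the combined breadth only to 0.8.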

To quantify how individual mutations drive antibody escape, an escape score (S_m) was computed for each mutation (m) within the synthetic lineages. This metric is a normalized product of the number of antibodies escaped by a given mutation and the mutation’s frequency within the lineage (Methods). When examining individual RBD mutations across the synthetic lineages (Fig. 5f), T523P had the highest escape score. By comparison, DMS results showed that mutations at position 523 have a slightly negative influence on RBD protein expression level19, which may explain its low occurrence in natural variants, having only been observed in 70 sequences in the GISAID database. Furthermore, the combination of D339R, F486A and T523P mutations in the simulated BA.1 lineages caused the most antibody escape among mutations not previously observed in major variants (Fig. 5f). Of these, positions 339 and 486 are mutated in BA.2.75 and XBB and their sublineages. The top 50 mutations with the highest escape scores include K356T and R403K, which are present in the recently reported and highly mutated BA.2.86 variant and had not been previously reported in any other major variant (Fig. 5f). In addition, positions V445 and N481 were also mutated in BA.2.86. Taken together, this suggests that DML-derived escape scores may reveal mutations or positions that emerge in future variants.
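One possible reading of this escape score is sketched below; the exact formula is defined in the paper's Methods and may differ. Here the number of antibodies escaped by a mutation is assumed to be the mean number of antibodies predicted to be escaped by the sequences that carry it, the frequency is the fraction of lineage sequences carrying it, and the product is normalized to the highest-scoring mutation. The mutation labels and counts are placeholders.

```python
from collections import defaultdict


def escape_scores(lineage):
    """Normalized (mean antibodies escaped x carrier frequency) per mutation.

    lineage: list of (set_of_mutations, n_antibodies_escaped), one entry per sequence.
    """
    carriers = defaultdict(list)
    for mutations, n_escaped in lineage:
        for mutation in mutations:
            carriers[mutation].append(n_escaped)

    raw = {}
    for mutation, escaped_counts in carriers.items():
        mean_escaped = sum(escaped_counts) / len(escaped_counts)
        frequency = len(escaped_counts) / len(lineage)
        raw[mutation] = mean_escaped * frequency

    top = max(raw.values(), default=0)
    return {m: (score / top if top else 0.0) for m, score in raw.items()}


# Four hypothetical synthetic sequences with placeholder mutation labels.
toy_lineage = [
    ({"mutA"}, 4),
    ({"mutA", "mutB"}, 2),
    ({"mutB"}, 0),
    (set(), 0),
]
scores = escape_scores(toy_lineage)
for mutation, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(mutation, round(score, 3))
```

Under this reading, a mutation ranks highly only if it is both common in the synthetic lineages and carried by sequences that escape many antibodies.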


