KK Jensen, M Andreatta, P Marcatili (2018) Improved methods for predicting peptide binding affinity to MHC class II molecules
B Reynisson, B Alvarez, S Paul (2020) NetMHCpan-4.1 and NetMHCIIpan-4.0: improved predictions of MHC antigen presentation by concurrent motif deconvolution and integration of MS MHC eluted ligand data
U Jüse, M Arntzen, P Højrup (2011) Assessing high affinity binding to HLA-DQ2.5 by a novel peptide library based approach
T Sturniolo, E Bono, J Ding (1999) Generation of tissue-specific and promiscuous HLA ligand databases using DNA microarrays and virtual HLA class II matrices
X Zhou, D Wang, P Krähenbühl (2019) Objects as points
H Zeng, DK Gifford (2019) Quantification of uncertainty in peptide-MHC binding prediction improves high-affinity peptide selection for therapeutic design
A Dosovitskiy, L Beyer, A Kolesnikov (2020) An image is worth 16x16 words: transformers for image recognition at scale
PP Nanaware, MM Jurewicz, JD Leszyk (2019) HLA-DO modulates the diversity of the MHC-II self-peptidome
S Paul, RV Kolla, J Sidney (2013) Evaluating the immunogenicity of protein drugs by applying in vitro MHC binding data and the immune epitope database and analysis resource
S Pokharel, P Pratyush, M Heinzinger (2022) Improving protein succinylation sites prediction using embeddings from protein language model
U Jüse, Y Wal, F Koning (2010) Design of new high-affinity peptide ligands for human leukocyte antigen-DQ2 using a positional scanning peptide library
J Jumper, R Evans, A Pritzel (2021) Highly accurate protein structure prediction with AlphaFold
BD Huisman, Z Dai, DK Gifford (2022) A high-throughput yeast display approach to profile pathogen proteomes for MHC-II binding
F Teufel, JJA Armenteros, AR Johansen (2022) SignalP 6.0 predicts all five types of signal peptides using protein language models
A Elnaggar, M Heinzinger, C Dallago (2022) ProtTrans: toward understanding the language of life through self-supervised learning
M Nielsen, C Lundegaard, O Lund (2007) Prediction of MHC class II binding affinity using SMM-align, a novel stabilization matrix alignment method
CG Rappazzo, BD Huisman, ME Birnbaum (2020) Repertoire-scale determination of class II MHC peptide binding via yeast display improves antigen prediction
C King, EN Garza, R Mazor (2014) Removing T-cell epitopes with computational protein design
M Tan, QV Le (2021) EfficientNetV2: smaller models and faster training
S Paul, A Grifoni, B Peters (2020) Major histocompatibility complex binding, eluted ligands, and immunogenicity: benchmark testing and predictions
J Cheng, K Bendjama, K Rittner (2021) BERTMHC: improved MHC–peptide class II interaction prediction with transformer and multiple instance learning
J Long, E Shelhamer, T Darrell (2014) Fully convolutional networks for semantic segmentation
XM Shao, R Bhattacharya, J Huang (2020) High-throughput prediction of MHC class I and II neoantigens with MHCnuggets
A Vaswani, N Shazeer, N Parmar (2017) Attention is all you need
J Sidney, S Southwood, C Moore (2013) Measurement of MHC/peptide interactions by gel filtration or monoclonal antibody capture
C Marquet, M Heinzinger, T Olenyi (2022) Embeddings from protein language models predict conservation and variant effects
H Schellekens (2005) Factors influencing the immunogenicity of therapeutic proteins
L Yin, LJ Stern (2014) Measurement of peptide binding to MHC class II molecules by fluorescence polarization
S Buus, A Sette, SM Colon (1986) Isolation and characterization of antigen-Ia complexes involved in T cell recognition
B Peters, H-H Bui, S Frankild (2006) A community resource benchmarking predictions of peptide binding to MHC-I molecules
A Rives, J Meier, T Sercu (2021) Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences
GJ Wolbink, LA Aarden, BAC Dijkmans (2009) Dealing with immunogenicity of biologicals: assessment and clinical relevance
R You, W Qu, H Mamitsuka (2022) DeepMHCII: a novel binding core-aware deep interaction model for accurate MHC-II peptide binding affinity prediction
Antibody Therapeutics, 2023, Vol. 6, No. 2, 137–146
https://doi.org/10.1093/abt/tbad011
Advance Access Publication on 14 May 2023

TransMHCII: a novel MHC-II binding prediction model built using a protein language model and an image classifier

Xin Yu, Christopher Negron, Lili Huang and Geertruida Veldman
Biotherapeutics Discovery, AbbVie Bioresearch Center, 100 Research Drive, Worcester, MA 01605, USA

Received: January 28, 2023; Revised: April 18, 2023; Accepted: May 9, 2023

ABSTRACT

The emergence of deep learning models such as AlphaFold2 has revolutionized the structure prediction of proteins. Nevertheless, much remains unexplored, especially on how we utilize structure models to predict biological properties. Herein, we present a method using features extracted from protein language models (PLMs) to predict the major histocompatibility complex class II (MHC-II) binding affinity of peptides. Specifically, we evaluated a novel transfer learning approach where the backbone of our model was interchanged with architectures designed for image classification tasks. Features extracted from several PLMs (ESM1b, ProtXLNet or ProtT5-XL-UniRef) were passed into image models (EfficientNet v2b0, EfficientNet v2m or ViT-16). The optimal pairing of the PLM and image classifier resulted in the final model TransMHCII, outperforming NetMHCIIpan 3.2 and NetMHCIIpan 4.0-BA on the receiver operating characteristic area under the curve, balanced accuracy and Jaccard scores. The architecture innovation may facilitate the development of other deep learning models for biological problems.

Statement of Significance: To our knowledge, TransMHCII is the first multi-class classification model for MHC-II binding prediction. This is also the first study that utilizes PLM embeddings to predict peptide/MHC-II binding. In addition, it is the first attempt to integrate PLMs with image classifiers for biological property prediction.
KEYWORDS: immunogenicity; MHC-II; transfer learning; protein language model; deep learning

To whom correspondence should be addressed. Xin Yu. Email: email@example.com.
© The Author(s) 2023. Published by Oxford University Press on behalf of Antibody Therapeutics. All rights reserved. For permissions, please e-mail: jour- firstname.lastname@example.org.

INTRODUCTION

Early and accurate assessment of immunogenicity risk remains a major challenge for the development of antibody therapeutics. Immunogenicity can impact drug safety, efficacy and pharmacokinetic properties. Clinical immunogenicity is influenced by many factors, including sequence and structure features, posttranslational modifications, contaminants (e.g. denatured or aggregated proteins caused by suboptimal preparation or storage conditions), formulation, dosing regimen and genetic background of patients.

One of the strategies for in silico immunogenicity risk assessment is the prediction of MHC-II epitopes. It involves predicting the binding affinity of a peptide to one or more MHC-II alleles (a.k.a. BA models), the probability of detecting a peptide in mass spectrometry (MS) eluted ligand assays (a.k.a. EL models) or a combination of both. The affinity of a peptide can be experimentally determined via unlabeled, radiolabeled or fluorescently labeled assays using purified MHC-II proteins, cell lysates or live cells [4–9]. Some of these techniques suffer from peptide degradation (e.g. live-cell assays) or low throughput (e.g. plasmon resonance). Currently, the most accurate and reproducible gold standard remains the classical radiolabeled competition assay using purified MHC-II proteins [4, 10, 11]. This technique has been widely used to generate a large amount of data for training various MHC-II BA models.

Earlier MHC-II BA models such as Sturniolo and SMM-align rely on position-specific scoring matrices [12, 13]. Their performance and allele coverage were surpassed by later models based on machine learning and deep learning networks. The deep learning models were either trained on binding data only (such as PUFFIN, DeepMHCII, MHCnuggets and NetMHCIIpan 3.2) or on a combination of binding and EL data where the EL data were used to inform binding predictions (such as NetMHCIIpan 4.0 and BERTMHC).

Despite the advancement, there are several caveats. First, many of these models were trained as regressors using data from the classical radiolabeled assays, or, in some cases, a mixture of different types of assays, to predict a single IC50 value given a peptide–allele pair. When experimentally tested using other types of binding assays, especially those absent in the training dataset, there could be a large discrepancy between the predicted and experimental IC50 [4, 21]. For example, NetMHCII, which performed well on the standard radiolabeled assays, generated predictions that were significantly different from the experimental data in a fluorescent-labeled assay. Second, despite being trained as regressors, these models were often benchmarked as binary classifiers using a 500-nM threshold. This is not ideal because the loss function for training regressors, such as mean squared error (MSE) or mean absolute error (MAE), is distinct from that required for training classifiers, such as binary cross-entropy (BCE). This results in a mismatch between the training algorithm and the benchmark metrics.

Recently, there has been a surge of protein language models (PLMs) that utilize transformer architectures. By learning to predict masked amino acids from the context of entire sequences, these models generate residue-level representations of the biochemical properties, structural features and evolutionary information about the proteins. Several PLMs, including the AlphaFold 2, ESM1b, ProtT5-XL-UniRef and ProtXLNet, have reached state-of-the-art performance when benchmarked in secondary structure prediction tasks. Outputs from their intermediate layers (a.k.a. hidden layers) can be extracted and potentially used as inputs to train models designed for tasks other than those in the benchmark studies. This process is often called embedding or feature extraction in transfer learning. In the computer vision field, embeddings from certain image classifiers can be used as inputs to train task-specific architectures [27, 28]. Similar ideas may apply to PLMs. Indeed, PLM embeddings have been used in a few studies to build models that predict genetic variant effects, succinylation or signal peptide cleavage. Nevertheless, there is still a lack of studies on how to systematically apply PLM embeddings to predict peptide binding.

We present a novel MHC-II BA model that was trained using PLM embeddings. We paired embeddings from ESM1b, ProtXLNet or ProtT5-XL-UniRef with nearly intact architectures from three image classifiers (EfficientNet v2b0, EfficientNet v2m or ViT-16) that were originally developed for classifying images of ordinary objects on ImageNet [32–34]. The optimal pairing of the PLM (ProtT5-XL-UniRef) and image classifier (EfficientNet v2b0) yields a robust multi-class classifier, TransMHCII.

RESULTS

Statistics of the training dataset, validation dataset and model architecture

We obtained the classical dataset from Wang et al. (a.k.a. the 44 k dataset) and the training dataset of NetMHCIIpan 4.0-BA. Only the binding affinity portion of the NetMHCIIpan 4.0-BA dataset was used. After cleaning, the final dataset consisted of 56 unique human MHC-II alleles and 111 564 unique peptide–allele pairs. The majority of sequences were 15-mers, and the distribution of the four affinity bins (bin 0: 0–50 nM; bin 1: 50–500 nM; bin 2: 500–5000 nM and bin 3: 5000–50 000 nM) is ∼18, 23, 28 and 31%, respectively (Supplementary Fig. 1A). Using a random seed, this dataset was randomly split into an 85% training set (N = 94 829) and a 15% validation set (N = 16 735). Training and hyperparameter tuning were performed only on the training set. The validation set was used as a hold-out set during training and hyperparameter tuning to prevent model overfitting.

The models were constructed by pairing embeddings from a PLM (ESM1b, ProtXLNet or ProtT5-XL-UniRef) with an image classifier (EfficientNet v2b0, EfficientNet v2m or ViT-16). The three PLMs were selected because they each represent a type of PLM: encoder only (ESM1b), decoder only (ProtXLNet) and encoder plus decoder (ProtT5-XL-UniRef). The three image classifiers were selected because they represent the current state-of-the-art image classification models. For comparison, two conventional protein embedding methods (one-hot and BLOSUM62) were used as controls for the PLMs. A ResNet block (a.k.a. vanilla network) was used as a control for the image classifiers. The architecture is shown in Fig. 1, and the training and validation sets are provided in Supplementary Table 1.

Figure 1. Architecture of the models. Embeddings from PLMs were paired with a ResNet block (A) or an image classifier (B) for comparison. In some cases, conventional protein embeddings such as one-hot and BLOSUM62 were used in place of the PLM embeddings.

Results from the training process

A total of 11 models were created and trained for 100 epochs. The models varied in size, with vanilla networks being the smallest (0.2-M parameters) and the ViT-16 the largest (86-M parameters, Table 1). To ensure comparability, the same training parameters were applied to all models. Among the parameters, the learning rate scheduler is especially important for training efficiency. We tested several schedulers (data not shown) and eventually designed a scheduler with a linear warmup, a stable plateau at 1e-3, followed by a cosine decay (Supplementary Fig. 2A). When used together with the Adam optimizer, it yielded a smooth training process for the ProtT5-XL-UniRef + EfficientNet v2b0 model, with accuracy eventually reaching 0.966 for the training set and 0.997 for the validation set (Supplementary Fig. 2B). Nevertheless, not all the combinations performed similarly. For example, training of one-hot + EfficientNet v2m (Supplementary Fig. 2C) and ESM1b + ViT-16 (Supplementary Fig. 2D) ceased to progress after ∼20 epochs, with validation accuracy only at 0.5–0.6. For ESM1b + vanilla, accuracy improved for the training set but was unsteady for the validation set, suggesting that the model was severely overfit (Supplementary Fig. 2E). Nonproductive training runs such as these were terminated before reaching 100 epochs.

For ProtT5-XL-UniRef + EfficientNet v2b0, it took about 4 h to complete 100 epochs of training on the 4xGPU specified in the Methods section, which was relatively efficient (Supplementary Fig. 3). The time to complete training depends on factors such as the size of the model (i.e. the number of parameters in Table 1) and the size of the input matrix. Because all peptides were padded to a maximum length of 20 amino acids, the size of the input matrix reflected the number of features generated by the embedding method (20 for one-hot or BLOSUM62, 1024 for ProtXLNet or ProtT5-XL-UniRef and 1280 for ESM1b). Unsurprisingly, ESM1b + EfficientNet v2m took the longest time (∼22 h) to train, because of its larger model size and input size, whereas BLOSUM62 + EfficientNet v2b0 took a little more than 1 h, because of its smaller model size and input size.

Performance on the validation set

As shown in Table 1, the pairing of the PLM and image classifier is critical to model performance. Vanilla networks displayed low performance with both non-PLM and PLM embeddings. For EfficientNet v2m, the accuracy was relatively high with ESM1b embedding but low with one-hot embedding. The opposite was observed for ViT-16. Nevertheless, for EfficientNet v2b0, all embeddings performed well. The validation accuracy and Jaccard score of EfficientNet v2b0 models surpassed those of the models built with alternative image classifiers (P < 0.001, Fig. 2). This suggested that EfficientNet v2b0 was a better choice among the three image classifiers for this task. Among the EfficientNet v2b0 models, PLM embeddings produced higher balanced accuracy (mean 0.995) than the non-PLM embeddings (mean 0.942, P < 0.001, Fig. 2A). A similar conclusion can be drawn for the Jaccard score (Fig. 2B). This confirmed that PLM embedding indeed improved model performance.

In addition to these models, we also built a support vector machine (SVM) classifier by averaging the per-residue features generated by PLM to produce per-peptide features. The accuracy of the SVM model was low, with only 0.35 on the training set and 0.33 on the validation set (data not shown). This confirmed the robustness of our method.

Performance on the test set

An independent test set (Supplementary Table 2) containing 38 unique human MHC-II alleles and 21 424 unique peptide–allele pairs was collected from an IEDB weekly benchmark database (20161231–20220902). The majority of sequences were 15-mers, and the distribution of the four affinity bins is ∼19, 28, 28 and 25%, respectively (Supplementary Fig. 1B). The test set was used to evaluate two literature models (NetMHCIIpan 3.2 and NetMHCIIpan 4.0-BA), as well as our EfficientNet v2b0 models. Unlike our models, both NetMHCIIpan models were regressors that predict IC50 values instead of affinity bins. As a result, we first looked at the area under the curve (AUC) of the receiver operating characteristic (ROC) curve, a metric often used in the literature. The results are displayed in Table 2 and Fig. 3A.
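Because the NetMHCIIpan models output an IC50 value while the classifiers in this study output four bin probabilities, an ROC AUC comparison needs both outputs reduced to a single binder score. The sketch below shows one way this could be done and is not taken from the paper: it assumes a binder is a peptide with measured IC50 at or below 500 nM, that the classifier score is the summed probability of bins 0 and 1, and that the regressor score is the negated predicted IC50. The function `roc_auc` and all numbers are hypothetical toy values.

```python
# Assumption: binders are peptides with measured IC50 <= 500 nM.
BINDER_THRESHOLD_NM = 500.0


def roc_auc(labels, scores):
    """ROC AUC as the probability that a random positive outscores a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    # Ties between a positive and a negative count as half a win.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy measured affinities (nM) and the binary labels derived from them.
measured_ic50 = [20.0, 300.0, 450.0, 800.0, 6000.0, 20000.0]
labels = [1 if ic50 <= BINDER_THRESHOLD_NM else 0 for ic50 in measured_ic50]

# Hypothetical four-bin probabilities from a multi-class classifier.
bin_probs = [
    [0.80, 0.15, 0.04, 0.01],
    [0.10, 0.60, 0.25, 0.05],
    [0.05, 0.30, 0.50, 0.15],
    [0.05, 0.40, 0.45, 0.10],
    [0.01, 0.04, 0.25, 0.70],
    [0.00, 0.02, 0.18, 0.80],
]
# Assumed binder score: P(bin 0) + P(bin 1), i.e. probability of IC50 <= 500 nM.
classifier_scores = [row[0] + row[1] for row in bin_probs]

# Hypothetical regressor predictions (nM); a lower IC50 means a stronger binder,
# so the value is negated to obtain a score that increases with binding.
predicted_ic50 = [35.0, 900.0, 700.0, 400.0, 9000.0, 15000.0]
regressor_scores = [-value for value in predicted_ic50]

auc_classifier = roc_auc(labels, classifier_scores)
auc_regressor = roc_auc(labels, regressor_scores)
print(f"classifier AUC: {auc_classifier:.3f}, regressor AUC: {auc_regressor:.3f}")
```

The pairwise formulation is quadratic in the number of samples and is used here only for readability; a rank-based implementation would be preferable for a test set of 21 424 pairs.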
As on the validation set, ProtT5-XL-UniRef + EfficientNet v2b0 emerged as the best-performing model with an AUC of 0.908 on the test set, which is higher than that of NetMHCIIpan 3.2 (mean 0.900, P < 0.001) and NetMHCIIpan 4.0-BA (mean 0.871, P < 0.001). Even though the ProtXLNet embedded model had an AUC similar to that of the one-hot embedded model, on average, the non-PLM models (one-hot and BLOSUM62) had lower AUC than the three PLM models (0.889 vs. 0.898, P < 0.001).

Balanced accuracy and Jaccard are widely used metrics for multi-class classifiers. When evaluated using these two metrics, ProtT5-XL-UniRef + EfficientNet v2b0 significantly outperformed NetMHCIIpan 3.2 and NetMHCIIpan 4.0-BA, improving the scores by ∼0.2 points (Table 2 and Fig. 3B and C). The non-PLM models on average had lower balanced accuracy (0.673 vs. 0.690) and Jaccard (0.510 vs. 0.532) than the PLM-embedded models (P < 0.001 for both metrics). This observation, along with the improvement in the ROC AUC score, confirmed that PLM embedding is beneficial for model performance.

To fully characterize model performance, we analyzed another metric: top2 accuracy. Top2 accuracy is the fraction of predictions in which the correct label is among the two labels with the highest predicted probabilities. This metric is not available for binary classifiers or regressors such as the NetMHCIIpan models. As shown in Fig. 3D, all five models reached a top2 accuracy score > 0.82. Interestingly, on this metric, the BLOSUM62 embedded model appeared to perform better than the ProtT5-XL-UniRef embedded model, and the ProtXLNet embedded model was less robust than all other models. The non-PLM models on average performed better (mean 0.850) than PLM models (mean 0.842, P < 0.001). These data suggest that whether PLM embedding improves model performance needs to be evaluated in the context of the specific metrics.
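The three multi-class metrics discussed above can be illustrated on a toy example. The following is a minimal pure-Python sketch, not the evaluation code used in the study (the table notes state that sklearn was used). It assumes that balanced accuracy is the mean per-class recall and that the Jaccard score is macro-averaged over classes; the averaging mode actually used is not stated in the text. The function names and toy data are hypothetical.

```python
from collections import Counter


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall over the classes present in y_true."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return sum(hits[c] / n for c, n in totals.items()) / len(totals)


def macro_jaccard(y_true, y_pred):
    """Unweighted mean over classes of TP / (TP + FP + FN) (assumed averaging)."""
    scores = []
    for c in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        scores.append(tp / (tp + fp + fn))
    return sum(scores) / len(scores)


def top2_accuracy(y_true, probs):
    """Fraction of samples whose true bin is among the two most probable bins."""
    hits = 0
    for t, row in zip(y_true, probs):
        ranked = sorted(range(len(row)), key=lambda k: row[k], reverse=True)
        hits += t in ranked[:2]
    return hits / len(y_true)


# Toy data: true affinity bins (0 = strongest binder, 3 = non-binder)
# and hypothetical predicted probabilities for the four bins.
y_true = [0, 1, 2, 3, 3, 1]
probs = [
    [0.70, 0.20, 0.05, 0.05],
    [0.40, 0.35, 0.15, 0.10],
    [0.05, 0.15, 0.30, 0.50],
    [0.05, 0.05, 0.20, 0.70],
    [0.10, 0.50, 0.30, 0.10],
    [0.10, 0.60, 0.20, 0.10],
]
# The predicted bin is the one with the highest probability.
y_pred = [max(range(len(row)), key=row.__getitem__) for row in probs]

print("balanced accuracy:", balanced_accuracy(y_true, y_pred))
print("macro Jaccard:", macro_jaccard(y_true, y_pred))
print("top2 accuracy:", top2_accuracy(y_true, probs))
```

In this toy case the second and third peptides are assigned to a neighbouring bin, so they count as errors for balanced accuracy and Jaccard but as hits for top2 accuracy, which is why top2 accuracy is always at least as high as plain accuracy.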
Notably, when evaluated on the validation set and the test set, ProtXLNet embedding had lower performance than ProtT5-XL-UniRef embedding across all metrics. This is consistent with the result from the benchmark study for these two PLMs.

The ProtT5-XL-UniRef + EfficientNet v2b0 model was renamed to TransMHCII. While TransMHCII outperformed the NetMHCIIpan models on the test dataset (balanced accuracy: TransMHCII 0.696, NetMHCIIpan 3.2 0.508, NetMHCIIpan 4.0-BA 0.459), its prediction performance appeared to have dropped when compared with the results on the validation set (balanced accuracy: test set 0.696, validation set 0.997). The model was exposed to neither the validation set nor the test dataset during the training. However, the validation set originated mostly from studies on classical radiolabeled assays [19, 35], whereas the test dataset was from the IEDB database, which contains data from different types of assays performed by different labs. The variance of data could be much larger when different types of assays or different experimental protocols are involved. This could be a reason for the lower performance of our models on the test dataset. It might also explain the lower performance of the NetMHCIIpan models compared with their original published results.

Table 1. Performance of models on the training set (N = 94 829) and validation set (a hold-out dataset of N = 16 735). Rows are non-PLM controls (one-hot, BLOSUM62) and PLMs (ESM1b, ProtXLNet and ProtT5-XL-UniRef). Columns are image classifiers (EfficientNet v2b0, EfficientNet v2m and ViT-16) and their control, Vanilla, which was a simple ResNet block.

| Embedding | Vanilla: Params | Vanilla: Train set | Vanilla: Val. set | EfficientNet v2b0: Params | EfficientNet v2b0: Train set | EfficientNet v2b0: Val. set | EfficientNet v2m: Params | EfficientNet v2m: Train set | EfficientNet v2m: Val. set | ViT-16: Params | ViT-16: Train set | ViT-16: Val. set |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| One-hot | 0.2 M | 0.504∗ | 0.504∗ | 6 M | 0.857 | 0.944 ± 0.002 | 53 M | 0.533∗ | 0.574∗ | 86 M | 0.820 | 0.873 ± 0.002 |
| BLOSUM62 | | | | 6 M | 0.855 | 0.939 ± 0.002 | | | | | | |
| ESM1b | 0.2 M | 0.476∗ | 0.471∗ | 6 M | 0.944 | 0.991 ± 0.001 | 53 M | 0.908 | 0.957 ± 0.001 | 86 M | 0.537∗ | 0.522∗ |
| ProtXLNet | | | | 6 M | 0.965 | 0.996 ± 0.001 | | | | | | |
| ProtT5-XL-UniRef | | | | 6 M | 0.966 | 0.997 ± 0.001 | | | | | | |

Params: number of trainable parameters in the model, in millions (M). For the training set, the default categorical accuracy in Keras was adjusted by hard-coded class weights computed from sklearn and reported here. For the validation set, the balanced accuracy was computed directly using the sklearn package post-training and reported here with its standard deviations. ∗ Nonproductive training was stopped before reaching 100 epochs, because of a lack of improvement in model performance on the validation set. The PLM–image classifier pair that resulted in the best performance on the validation set (bold in the original table) was ProtT5-XL-UniRef + EfficientNet v2b0, which was renamed to TransMHCII.

Figure 2. Performance of models on the validation set, which is a hold-out dataset of N = 16 735. Balanced accuracy (A) and the Jaccard score (B) are shown for models that completed 100 epochs of training. Dashed lines represent the results from an untrained dummy model that makes random predictions. Asterisks represent the significance level from the t-test. ∗∗∗: P ≤ 0.001.

Table 2. Performance of models on the test dataset (N = 21 424).

| Model | ROC AUC | Balanced accuracy | Jaccard | Top2 accuracy |
|---|---|---|---|---|
| NetMHCIIpan 3.2 | 0.900 ± 0.002 | 0.508 ± 0.003 | 0.355 ± 0.003 | |
| NetMHCIIpan 4.0-BA | 0.871 ± 0.002 | 0.459 ± 0.003 | 0.310 ± 0.003 | |
| One-hot + EfficientNet v2b0 | 0.884 ± 2.22e−16 | 0.664 ± 0.003 | 0.499 ± 0.004 | 0.846 ± 0.002 |
| BLOSUM62 + EfficientNet v2b0 | 0.892 ± 2.22e−16 | 0.682 ± 0.003 | 0.521 ± 0.004 | 0.853 ± 0.002 |
| ESM1b + EfficientNet v2b0 | 0.903 ± 2.22e−16 | 0.695 ± 0.003 | 0.537 ± 0.004 | 0.850 ± 0.002 |
| ProtXLNet + EfficientNet v2b0 | 0.884 ± 3.33e−16 | 0.680 ± 0.003 | 0.520 ± 0.004 | 0.828 ± 0.003 |
| ProtT5-XL-UniRef + EfficientNet v2b0 | 0.908 ± 2.22e−16 | 0.696 ± 0.003 | 0.540 ± 0.004 | 0.849 ± 0.002 |

To calculate the top2 accuracy, the model must produce probability estimates for each of the four affinity bins. Therefore, NetMHCIIpan 3.2 and NetMHCIIpan 4.0-BA models were excluded from this analysis. The PLM–image classifier pair that resulted in the best performance on the test dataset (bold in the original table) was ProtT5-XL-UniRef + EfficientNet v2b0, which was renamed to TransMHCII.

Per-allele analysis on the validation set and the test dataset

For TransMHCII, we analyzed its per-allele balanced accuracy and top2 accuracy on the validation set and the test set. The results are shown in Fig. 4. The performance was robust on the validation set, which is composed mostly of radiolabeled assay data. However, performance dropped when a mixture of assay types and data sources was allowed, as in the test set. In the latter case, the top2 accuracy always appeared higher than its corresponding balanced accuracy, reaching ∼0.8 or 0.9 for most alleles. As a result, top2 prediction could be a helpful readout when the model is used to predict outcomes from assays that deviate from the classical radiolabeled assays.

Figure 3. Performance of models on the test dataset (N = 21 424).
Models built using EfficientNet v2b0 were compared against literature models NetMHCIIpan 3.2 and NetMHCIIpan 4.0-BA. (A) ROC AUC curve. FPR: false-positive rate. TPR: true-positive rate. (B) Balanced accuracy. (C) Jaccard score. In (B) and (C), the black bars represent the NetMHCIIpan models, and the purple bars represent models from this study. (D) Top2 accuracy. Dashed lines represent an untrained dummy model that generates random predictions. T-tests were conducted for TransMHCII against NetMHCIIpan 3.2, TransMHCII against NetMHCIIpan 4.0-BA and non-PLM embedded models against PLM embedded models. Asterisks represent significance levels: ∗∗∗: P ≤ 0.001.

Figure 4. Per-allele analysis of the balanced accuracy and top2 accuracy of TransMHCII on the validation set and test set. The color of the blocks in the heatmap represents the score. Alleles missing from the test set were plotted as empty areas.

DISCUSSION

In the literature, MHC-II BA models were often trained as regressors to predict a single IC50 value given a peptide and an MHC-II allele. There are two challenges with this approach. First, IC50 values are tied to a specific assay. When the experimental procedures change, the assay sensitivity and dynamic range might also change, generating different IC50 values. For example, live cell assays are more prone to peptide degradation than assays using purified proteins. Radioactive probes might be more sensitive than fluorescent probes, because of less interference from biological components and light scattering. Therefore, the interpretation of a predicted IC50 value is limited to the context of the specific experimental procedure from which the training data were generated. In addition, there is often a lack of confidence interval estimates for the predicted IC50 value. Second, when the data are log-transformed for training, loss functions such as MAE yield the same value when the ratio of the true IC50 and predicted IC50 is the same. For example, the loss calculated when the true IC50 is 250 nM and the predicted IC50 is 50 nM is the same as the loss calculated when the true IC50 is 50 000 nM and the predicted IC50 is 10 000 nM. When the loss is the same, the model puts in the same amount of effort to optimize weights. However, the latter case might not deserve much effort because regardless of whether IC50 is really 50 000 or 10 000 nM, this peptide is most likely a non-binder.

To address these challenges, we developed TransMHCII, the first-in-literature multi-class classifier for MHC-II binding prediction. The model has several advantages. First, instead of predicting a single IC50 value, it predicts the probability of the IC50 falling into one of the four ranges. To a certain degree, it allows for data variance because of experimental errors or differences in laboratory procedures. Second, the thresholds (50, 500 and 5000 nM) were arbitrarily chosen to separate the strong, medium, weak and non-binders. In this way, the model focuses on differentiating different categories of binders, instead of correcting a specific IC50 value. Notably, the actual affinity threshold at which 90% of epitopes are retrieved differs by allele. For a given allele, one can refer to experimental data to determine what threshold to use and sum up the predicted probabilities of one or more bins. Third, sometimes the model struggles to decide, as reflected by similar probabilities for two adjacent bins. In this case, top2 prediction can be used to generate an estimate.

There are a few caveats in our model. First, it was not trained on EL data. EL models predict the probability that a peptide is detected in the MAPPs assay. This is an important aspect of MHC-II epitope prediction. Studies suggest that high abundance could compensate for low binding affinity and that low abundance of a peptide that binds with high affinity might also be immunogenic. There are two possible workarounds. The first option is to extend our current architecture, train on both BA and EL data and generate two predictions (BA and EL) from the unified model. This is similar to the NetMHCIIpan 4.0 model, which has a BA output (NetMHCIIpan 4.0-BA model) and an EL output (NetMHCIIpan 4.0-EL model). Another option is to use TransMHCII with a separate EL model. Peptides that are predicted with both strong binding and a high likelihood for processing will then be flagged for further immunogenicity risk assessment. Second, TransMHCII can only predict peptide binding to one of the 56 human MHC-II alleles available in the training dataset. Each allele is a combination of an MHC-II α chain and an MHC-II β chain. Given the large numbers of the different α and β chains in the repertoire, our collection only represents a small fraction of all possible αβ combinations. We can expand the capability of the model to predict binding to MHC-II alleles that are missing from the training dataset. This can be done by replacing the integer-encoded alleles with pseudo-sequence encoded alleles, as demonstrated by NetMHCIIpan 3.2. The PLM model will then be used to embed both the input peptide and the pseudo-sequence of the alleles. Regardless, our current collection covers all 27 alleles that are recommended for population analysis in antibody immunogenicity risk assessment.

A notable innovation of this study is the integration of PLMs with image classifiers. Because the PLMs are developed to predict protein structures, and the image classifiers are trained to tell apart common objects, neither was designed to predict MHC-II binding per se. The strategy we pursued is an example of transfer learning, where models trained in one knowledge domain are used to construct models for another domain. When designing such a strategy, a fundamental question is which architectures are transferrable. For the PLMs, studies confirm that their embeddings encode biophysical, structural and homology features. These features impact the binding affinity of a peptide to an MHC-II allele. As a result, PLM embeddings are not only transferrable but also, in general, beneficial for model performance as we demonstrated. The image classifiers recognize spatial patterns. For example, when trained on images of animals, lower layers recognize granular details such as color patterns or edge junctions, whereas upper layers capture local motifs such as the face of the dog or the leg of the bird. This is reminiscent of earlier MHC-II BA models based on position matrices. These models assign a score to each amino acid in the peptide according to its location and type (i.e. the "granular details"). A 9-mer window then glides along the sequence to capture local motifs that have the highest propensity for binding. From this aspect, two distant concepts (image classifiers and MHC-II epitopes) are connected. Another hint is that the architectures of NetMHCIIpan 3.2 and 4.0-BA heavily rely on convolutional networks, which share the same building blocks (i.e. 2D convolutional layers) as the image classifiers. These observations inspired us to use transfer learning techniques to construct MHC-II BA models based on PLM and image classifiers.

Can we use the PLM–image classifier architecture to build models that predict other biological properties? In the paper where ProtT5-XL-UniRef was released, the authors constructed a convolutional network that takes PLM embedding as input to generate predictions on the subcellular location of proteins. Therefore, it seems that our strategy could be applicable to other problems. The main advantage of using an image classifier, as opposed to coding from scratch a complex convolutional network, is that one can easily replace and modify pre-existing image classifiers in search of the optimal pairing. This allows convenient model optimization, saving time and resources. In addition, each type of image classifier displays certain unique design patterns that distinguish it from other image classifiers. As we begin to utilize PLM embeddings for biological property prediction, it would be useful to understand what design patterns are better suited for what kind of tasks. In this way, we might be able to utilize architectural designs from others' work to build models for similar tasks.

In summary, we developed a robust MHC-II BA model, TransMHCII, using PLM embedding and an image classifier. The model could be useful for immunogenicity risk assessment, and its architecture innovation may inspire the development of other deep learning models for antibody discovery.

MATERIALS AND METHODS

Datasets

To compare our method with NetMHCIIpan, one of the top-performing MHC-II BA models in the literature, we obtained the dataset from Wang et al. (a.k.a. the 44 k dataset) and the training dataset of NetMHCIIpan 4.0-BA. The 44 k dataset contains binding data from classical radiolabeled assays. The training dataset of NetMHCIIpan 4.0-BA has two parts: the binding data and the EL data. Because cells express multiple MHC-II alleles, processing EL data requires an additional motif deconvolution step. As a result, only the binding data of the NetMHCIIpan 4.0-BA training set was retained. Since 86% of the 44 k dataset overlaps with the binding data of NetMHCIIpan 4.0-BA, the combined dataset was further cleaned using the following criteria. First, for each duplicated peptide–allele pair that has less than a 3-fold difference in IC50 values, an average IC50 was computed and retained for that pair. Beyond this threshold, the data were discarded. Second, alleles that have fewer than 20 data points were removed. Finally, to reduce computation time, we retained only peptides that have 20 amino acids or fewer, which represent 99.8% of the dataset before length filtering. The final dataset contains 111 564 unique peptide–allele pairs and 56 unique human MHC-II alleles. A random seed was applied to divide this dataset into a training set and a validation set at an 85:15 ratio. The model was trained and tuned using only the training set. The validation set was used as a hold-out dataset to monitor the training progress, so that training was terminated before the model overfit or when the performance plateaued. An independent test dataset was obtained from the IEDB weekly benchmark collection (20161231–20220902), cleaned using the same method and filtered to retain only alleles available in the training set. The final test set contains 21 424 unique peptide–allele pairs and 38 unique human MHC-II alleles.

Model architecture

Models were built using the TensorFlow framework in Python. Peptide sequences were fed into three PLM models (ESM1b, ProtXLNet and ProtT5-XL-UniRef) for embedding. The output embedding matrix was processed into a shape compatible with 2D convolutional networks and then used as inputs for one of the three image classifiers: EfficientNet v2b0, EfficientNet v2m and ViT-16 [32, 33]. The core architectures of image classifiers were retained, except that the image pre-processing layers and top classifier heads were replaced by PLM embedding and a small network that classifies the peptide–allele IC50 into one of the four bins: 0–50, 50–500, 500–5000 and 5000–50 000 nM (capped). Alleles were converted into integers,

provided by AbbVie. AbbVie participated in the interpretation of data, review and approval of the publication.

DATA AVAILABILITY STATEMENT

Datasets are provided in the Supplementary Materials. Source code is provided in GitHub repository https://github.com/xinyu-dev/TransMHCII.
passed onto a trainable embedding layer for encoding and then used as the second input for the classifier. For comparison, conventional one-hot or BLOSUM62 ETHICS AND CONSENT STATEMENT embedding was used in place of the PLM embeddings. A small ResNet block (a.k.a. vanilla model) was used in place Consent was not required. of the image classifiers in some cases. A total of 11 models were built. Details of the architecture are shown in Fig. 1 and the supplementary source code. ANIMAL RESEARCH STATEMENT Not applicable. Model training Weights for the image classifiers were initialized to the AUTHOR CONTRIBUTIONS default ImageNet weights. Weights for the vanilla model Xin Yu (Conceptualization-Lead, Data curation-Lead, were randomly initialized. All models were trained for 100 Formal analysis-Lead, Investigation-Lead, Methodology- epochs with a batch size of 64 or 128. Early stopping Lead, Software-Lead, Validation-Lead, Visualization- was activated if validation loss based on categorical cross Lead, Writing—original draft-Lead, Writing—review & entropy did not improve by at least 0.0001 for 10 consecu- editing-Lead), Christopher Negron (Conceptualization- tive epochs. An Adam optimizer  was used along with Supporting, Writing—original draft-Supporting, Writing— a learning rate scheduler comprised of three stages: linear review & editing-Supporting), Lili Huang (Conceptualization- warmup, plateau and cosine decay. Categorical accuracy Lead, Funding acquisition-Lead, Resources-Lead,  was used as the training metric, with class weights Supervision-Lead, Writing—original draft-Supporting, applied to adjust for the difference in the proportions of Writing—review & editing-Supporting) and Geertruida label classes. A synchronous, multi-GPU distributed train- Veldman (Conceptualization-Equal, Funding acquisition- ing strategy was used on four 16G NVIDIA T4 GPUs. 
Lead, Resources-Lead, Supervision-Lead, Writing— original draft-Supporting, Writing—review & editing- Model evaluation and data analysis Supporting) The validation set was evaluated on balanced accuracy  and the Jaccard score . The test dataset was evaluated ACKNOWLEDGEMENTS on four metrics: balanced accuracy, Jaccard, top2 accuracy and ROC AUC. The ROC AUC was produced at a 500-nM We would like to thank the following AbbVie colleagues: threshold, by summing up the probabilities from the first Divya Pathania, Swati Gupta, Alexander Ibraghimov and two bins (0–50 and 50–500 nM). Metrics were calculated Yunhee Jeong for their insights and experimental support using the corresponding built-in functions in either sklearn that have been instrumental to this work. We would like to or Tensorflow package. Standard deviations were calcu- thank Meha Chhaya, Alayna George Thompson, Daniel lated using 1000 bootstrap samples . Statistical testing Serna, Domenick Kennedy, Felipe Rodriguez, Christian was conducted using the statsmodels package. All figures Grant and Ruchi Srivastava for their valuable advice. We were plotted using the matplotlib and plotly packages. would also like to thank Brian Martin for his technical advice and computational platform support. SUPPLEMENTARY DATA Supplementary Data are available at ABT Online. REFERENCES 1. Wolbink, GJ, Aarden, LA, Dijkmans, BAC. Dealing with immunogenicity of biologicals: assessment and clinical relevance. Curr Opin Rheumatol 2009; 21: 211–5. FUNDING 2. Schellekens, H. Factors influencing the immunogenicity of therapeutic proteins. Nephrol Dial Transplant 2005; 20: vi3–9. This work was fully funded by AbbVie Inc. 3. Paul, S, Kolla, RV, Sidney, J et al. Evaluating the immunogenicity of protein drugs by applying in vitro MHC binding data and the immune epitope database and analysis resource. Clin Dev Immunol CONFLICT OF INTEREST STATEMENT 2013; 2013:1–7. 4. Paul, S, Grifoni, A, Peters, B et al. 
Major histocompatibility All authors are current employees of AbbVie. The design, complex binding, eluted ligands, and immunogenicity: benchmark study conduct and financial support for this research were testing and predictions. Front Immunol 2020; 10: 3151. 146 Antibody Therapeutics, 2023 5. Huisman, BD, Dai, Z, Gifford, DK et al. A high-throughput yeast 27. Long, J, Shelhamer, E, Darrell, T. Fully convolutional networks for display approach to profile pathogen proteomes for MHC-II semantic segmentation In Proceedings of the IEEE conference on binding. Elife 2022; 11: e78589. computer vision and pattern recognition 2014; 3431–40. https://doi.o 6. Nanaware, PP, Jurewicz, MM, Leszyk, JD et al. HLA-DO rg/10.48550/arxiv.1411.4038. modulates the diversity of the MHC-II self-peptidome. Mol Cell 28. Zhou, X, Wang, D, Krähenbühl, P. Objects as points Arxiv. 2019. Proteomics 2019; 18: 490–503. https://doi.org/10.48550/arxiv.1904.07850. 7. Jüse, U, Arntzen, M, Højrup, P et al. Assessing high affinity binding 29. Marquet, C, Heinzinger, M, Olenyi, T et al. Embeddings from to HLA-DQ2.5 by a novel peptide library based approach. Bioorg protein language models predict conservation and variant effects. Med Chem 2011; 19: 2470–7. Hum Genet 2022; 141: 1629–47. 8. Jüse, U, van de Wal, Y, Koning, F et al. Design of new high-affinity 30. Pokharel, S, Pratyush, P, Heinzinger, M et al. Improving protein peptide ligands for human leukocyte antigen-DQ2 using a positional succinylation sites prediction using embeddings from protein scanning peptide library. Hum Immunol 2010; 71: 475–81. language model. Sci Rep 2022; 12: 16933. 9. Sidney, J, Southwood, S, Moore, C et al. Measurement of 31. Teufel, F, Armenteros, JJA, Johansen, AR et al. SignalP 6.0 predicts MHC/peptide interactions by gel filtration or monoclonal antibody all five types of signal peptides using protein language models. Nat capture. Curr Protoc Immunol 2013; 100: 18.3.1–18.3.36. Biotechnol 2022; 40: 1023–5. 10. 
Peters, B, Bui, H-H, Frankild, S et al. A community resource 32. Tan, M, Le, QV. EfficientNetV2: smaller models and faster training benchmarking predictions of peptide binding to MHC-I molecules. In International conference on machine learning, pp. 10096–106. PLoS Comput Biol 2006; 2: e65. PMLR, 2021. https://doi.org/10.48550/arxiv.2104.00298. 11. Buus, S, Sette, A, Colon, SM et al. Isolation and characterization of 33. Dosovitskiy, A, Beyer, L, Kolesnikov, A et al. An image is worth antigen-la complexes involved in T cell recognition. Cell 1986; 47: 16x16 words: transformers for image recognition at scale Arxiv. 1071–7. 2020. https://doi.org/10.48550/arxiv.2010.11929. 12. Sturniolo, T, Bono, E, Ding, J et al. Generation of tissue-specific 34. ImageNet. Image Net. 2022. https://www.image-net.org/. and promiscuous HLA ligand databases using DNA microarrays 35. Wang, P, Sidney, J, Kim, Y et al. Peptide binding predictions for and virtual HLA class II matrices. Nat Biotechnol 1999; 17: HLA DR, DP and DQ molecules. Bmc Bioinformatics 2010; 11: 555–61. 568–8. 13. Nielsen, M, Lundegaard, C, Lund, O. Prediction of MHC class II 36. Image Classification on ImageNet. Papers with Code. 2022. https:// binding affinity using SMM-align, a novel stabilization matrix paperswithcode.com/sota/image-classification-on-imagenet. alignment method. Bmc Bioinformatics 2007; 8: 238. 37. Eddy, SR. Where did the BLOSUM62 alignment score matrix come 14. King, C, Garza, EN, Mazor, R et al. Removing T-cell epitopes with from? Nat Biotechnol 2004; 22: 1035–6. computational protein design. Proc National Acad Sci 2014; 111: 38. Andreatta,M,Trolle,T,Yan, Z et al. An automated benchmarking 8577–82. platform for MHC class II binding prediction methods. 15. Zeng, H, Gifford, DK. Quantification of uncertainty in Bioinformatics 2017; 34: 1522–8. peptide-MHC binding prediction improves high-affinity peptide 39. Burrin, JM. New techniques in metabolic bone disease. 1990; 82–91. 
selection for therapeutic design. Cell Syst 2019; 9: 159–66.e3. https://doi.org/10.1016/b978-0-7236-0898-1.50010-1. 16. You, R, Qu, W, Mamitsuka, H et al. DeepMHCII: a novel binding 40. Abelin, JG, Keskin, DB, Sarkizova, S et al. Mass spectrometry core-aware deep interaction model for accurate MHC-II peptide profiling of HLA-associated Peptidomes in mono-allelic cells binding affinity prediction. Bioinformatics 2022; 38: i220–8. enables more accurate epitope prediction. Immunity 2017; 46: 17. Shao, XM, Bhattacharya, R, Huang, J et al. High-throughput 315–26. prediction of MHC class I and II Neoantigens with MHCnuggets. 41. Shiina, T, Hosomichi, K, Inoko, H et al. The HLA genomic loci Cancer Immunol Res 2020; 8: 396–408. map: expression, interaction, diversity and disease. J Hum Genet 18. Jensen, KK, Andreatta, M, Marcatili, P et al. Improved methods for 2009; 54: 15–39. predicting peptide binding affinity to MHC class II molecules. 42. Greenbaum, J, Sidney, J, Chung, J et al. Functional classification of Immunology 2018; 154: 394–406. class II human leukocyte antigen (HLA) molecules reveals seven 19. Reynisson, B, Alvarez, B, Paul, S et al. NetMHCpan-4.1 and different supertypes and a surprising degree of repertoire sharing NetMHCIIpan-4.0: improved predictions of MHC antigen across supertypes. Immunogenetics 2011; 63: 325–35. presentation by concurrent motif deconvolution and integration of 43. Zeiler, MD, Fergus, R. Visualizing and understanding convolutional MS MHC eluted ligand data. Nucleic Acids Res 2020; 48: gkaa379. networks In Computer Vision-ECCV 2014: 13th European 20. Cheng, J, Bendjama, K, Rittner, K et al. BERTMHC: improved Conference, Zurich, Switzerland, September 6–12, 2014, Proceedings, MHC–peptide class II interaction prediction with transformer and Part I 13, pp. 818–33. Springer International Publishing, 2014. multiple instance learning. Bioinformatics 2021; 37: btab422. https://doi.org/10.48550/arxiv.1311.2901. 21. Yin, L, Stern, LJ. 
Measurement of peptide binding to MHC class II 44. Kingma, DP, Ba, J. Adam: a method for stochastic optimization molecules by fluorescence polarization. Curr Protoc Immunol 2014; Arxiv. 2014. https://doi.org/10.48550/arxiv.1412.6980. 106: 5.10.1–12. 45. Grandini, M, Bagli, E, Visani, G. Metrics for multi-class 22. Rappazzo, CG, Huisman, BD, Birnbaum, ME. Repertoire-scale classification: an overview Arxiv. 2020. https://doi.org/10.48550/arxi determination of class II MHC peptide binding via yeast display v.2008.05756. improves antigen prediction. Nat Commun 2020; 11: 4414. 46. Brodersen, KH, Ong, CS, Stephan, KE et al. The balanced accuracy 23. Vaswani, A, Shazeer, N, Parmar, N et al. Attention is all You need and its posterior distribution. 2010 20th International Conference on Advances in neural information processing systems 2017; 30. https:// Pattern Recognition. pp. 3121–24. IEEE, 2010. https://doi.o doi.org/10.48550/arxiv.1706.03762. rg/10.1109/icpr.2010.764. 24. Rives, A, Meier, J, Sercu, T et al. Biological structure and function 47. Bertels, J, Eelbode, T, Berman, M et al. Optimizing the Dice Score emerge from scaling unsupervised learning to 250 million protein and Jaccard Index for Medical Image Segmentation: Theory and sequences. Proc Natl Acad Sci U S A 2021; 118: e2016239118. Practice. In: Medical Image Computing and Computer Assisted 25. Jumper, J, Evans, R, Pritzel, A et al. Highly accurate protein Intervention–MICCAI 2019. MICCAI 2019. Lecture Notes in structure prediction with AlphaFold. Nature 2021; 596: 583–9. Computer Science,Vol 11765. Springer, Cham, 2019. https://doi.o 26. Elnaggar, A, Heinzinger, M, Dallago, C et al. ProtTrans: toward rg/10.1007/978-3-030-32245-8_11. understanding the language of life through self-supervised learning. 48. Hesterberg, T. Bootstrap: bootstrap. Wiley Interdiscip Rev Comput Ieee T Pattern Anal 2022; 44: 7112–27. Statistics 2011; 3: 497–526.
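
The data cleaning and labeling rules described under Datasets, Model architecture and Model evaluation can be illustrated with a short Python sketch. This is not the authors' code (which is in the GitHub repository named in the Data Availability Statement). The function names, the use of an arithmetic mean, the definition of fold difference as max/min, the assignment of edge values to the lower bin, the hand-written ROC AUC, and the peptide and allele strings are all assumptions made for illustration.

```python
from collections import defaultdict

# Upper edges (nM) of the first three IC50 bins named in the text:
# 0-50, 50-500, 500-5000 and 5000-50 000 nM (capped).
BIN_EDGES_NM = (50.0, 500.0, 5000.0)
IC50_CAP_NM = 50000.0


def merge_duplicates(records, max_fold=3.0):
    """Average IC50 per (peptide, allele) pair.

    Pairs whose measurements differ by max_fold or more are dropped.
    Assumptions: fold difference = max / min, arithmetic mean, IC50 > 0.
    """
    grouped = defaultdict(list)
    for peptide, allele, ic50 in records:
        grouped[(peptide, allele)].append(ic50)
    merged = {}
    for key, values in grouped.items():
        if max(values) / min(values) < max_fold:
            merged[key] = sum(values) / len(values)
    return merged


def ic50_to_bin(ic50_nm):
    """Map an IC50 in nM to a class index 0..3.

    Assumption: a value on a shared edge (e.g. 50 nM) goes to the lower bin.
    """
    capped = min(ic50_nm, IC50_CAP_NM)
    for index, upper in enumerate(BIN_EDGES_NM):
        if capped <= upper:
            return index
    return len(BIN_EDGES_NM)


def binder_score(class_probs):
    """Score for the 500-nM threshold: sum of the first two bin probabilities."""
    return class_probs[0] + class_probs[1]


def roc_auc(labels, scores):
    """ROC AUC via pairwise comparison (Mann-Whitney form); ties count half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            wins += 1.0 if p > n else 0.5 if p == n else 0.0
    return wins / (len(positives) * len(negatives))


# Hypothetical records: (peptide, allele label, IC50 in nM).
records = [
    ("PEPTIDEA", "DRB1_0101", 40.0),
    ("PEPTIDEA", "DRB1_0101", 60.0),     # within 3-fold: averaged
    ("PEPTIDEB", "DRB1_0101", 100.0),
    ("PEPTIDEB", "DRB1_0101", 900.0),    # 9-fold apart: pair discarded
    ("PEPTIDEC", "DRB1_0401", 70000.0),  # single value, capped when binned
]
merged = merge_duplicates(records)
bins = {key: ic50_to_bin(value) for key, value in merged.items()}
```

In practice the paper reports using the built-in metric functions of sklearn or TensorFlow; the pairwise `roc_auc` above only keeps the sketch free of dependencies.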
Antibody Therapeutics – Oxford University Press
Published: May 14, 2023
Keywords: immunogenicity; MHC-II; transfer learning; protein language model; deep learning