Open Access

Incorporating structural features to improve the prediction and understanding of pathogenic amino acid substitutions

Yao Xiong1,2, Jing-Bo Zhou1, Ke An1, Wei Han1, Tao Wang3, Zhi-Qiang Ye1,3,*, Yun-Dong Wu1,3,4,*
State Key Laboratory of Chemical Oncogenomics, Peking University Shenzhen Graduate School, 518055 Shenzhen, Guangdong, China
Assisted Reproduction Center, Northwest Women’s and Children’s Hospital, 710003 Xi’an, Shaanxi, China
Shenzhen Bay Laboratory, 518055 Shenzhen, Guangdong, China
College of Chemistry and Molecular Engineering, Peking University, 100871 Beijing, China
DOI: 10.52586/5036 Volume 26 Issue 12, pp.1422-1433
Submitted: 06 July 2021 Revised: 09 October 2021
Accepted: 21 October 2021 Published: 30 December 2021
*Corresponding Author(s): Zhi-Qiang Ye, Yun-Dong Wu
Copyright: © 2021 The author(s). Published by BRI. This is an open access article under the CC BY 4.0 license (

Background: The wide application of gene sequencing has produced numerous amino acid substitutions (AASs) of unknown significance, making it challenging to predict and understand their pathogenicity. Although various prediction methods have been proposed, most are sequence-based and offer little insight into molecular mechanisms from the perspective of protein structure. Moreover, prediction performance still needs to be improved. Methods: We trained a random forest (RF) prediction model, AAS3D-RF, that combines sequence-based and three-dimensional (3D) structure-based features to explore the relationship between diseases and AASs. Results: AAS3D-RF was trained on more than 14,000 AASs with 21 selected features. On two independent testing datasets, it achieved an accuracy (ACC) between 0.811 and 0.839 and a Matthews correlation coefficient (MCC) between 0.591 and 0.684, outperforming seven existing tools. In addition, AAS3D-RF includes two unique structure-based features, context-dependent substitution score (CDSS) and environment-dependent residue contact energy (ERCE), which can be used to interpret whether pathogenic AASs introduce incompatibilities into the protein structural microenvironment. Conclusion: AAS3D-RF serves as a valuable tool for both predicting and understanding pathogenic AASs.
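For readers less familiar with the two metrics reported in the Results, the following is a minimal sketch of how ACC and MCC are conventionally computed from a binary confusion matrix. It is not the authors' evaluation code. The function names, the label encoding (1 = pathogenic, 0 = neutral), and the toy labels are assumptions made purely for illustration.

```python
import math


def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP, FN for binary labels (assumed: 1 = pathogenic, 0 = neutral)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def accuracy(tp, tn, fp, fn):
    """ACC = (TP + TN) / (TP + TN + FP + FN)."""
    total = tp + tn + fp + fn
    return (tp + tn) / total if total else 0.0


def mcc(tp, tn, fp, fn):
    """MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)).

    Returns 0.0 when the denominator is zero, a common convention.
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Toy labels, invented for illustration only (not data from the paper).
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1]

counts = confusion_counts(y_true, y_pred)
print("ACC:", accuracy(*counts))  # 0.75
print("MCC:", mcc(*counts))       # 0.5
```

Unlike ACC, MCC uses all four confusion-matrix cells, so it remains informative when pathogenic and neutral classes are imbalanced, which is one reason both are commonly reported together.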

Key words

Amino acid substitution; Single-nucleotide variant; Pathogenic; Protein structure; Machine learning


Share and Cite
Yao Xiong, Jing-Bo Zhou, Ke An, Wei Han, Tao Wang, Zhi-Qiang Ye, Yun-Dong Wu. Incorporating structural features to improve the prediction and understanding of pathogenic amino acid substitutions. Frontiers in Bioscience-Landmark. 2021. 26(12); 1422-1433.