Integrated Machine Learning Analysis of HLA-Peptide Binding and Its Association with Diseases
Date
2024-12-10
Abstract
The immune system’s ability to identify and respond to pathogenic threats depends on the intricate interactions between Human Leukocyte Antigen (HLA) molecules and peptide antigens. This work improves the understanding of HLA-peptide binding and its association with disease through integrative machine learning frameworks. Specifically, it addresses critical gaps in computational modeling of these interactions and explores their implications for immunology, vaccine development, and personalized medicine. Central to this research is the development of the MISTIC (Model-Informed Feature Selection through Importance and Contribution) and FFTpro frameworks. The approach combines FFTpro's Fourier Transform-based encoding of amino acid sequences with Support Vector Machine (SVM) classifiers and MISTIC's novel feature selection techniques. By integrating these methods, the framework identifies key determinants of binding specificity, offering a biologically interpretable pathway to enhance prediction accuracy across diverse HLA alleles. The dissertation utilizes a comprehensive dataset curated from the Immune Epitope Database (IEDB), comprising both HLA Class I and II molecules. Through innovative encoding strategies, such as the Fourier Transform of BLOSUM-embedded amino acid properties, the study overcomes challenges associated with variable peptide lengths and the extensive polymorphism of HLA genes. This encoding not only ensures a uniform data representation but also preserves essential biological information, which makes it well suited to machine learning applications. A key feature of the MISTIC framework is its ability to rank features using a combination of two approaches: feature sensitivity to the SVM objective function and direct feature contributions to the decision boundary. These two rankings are integrated into a consensus ranking that highlights critical determinants of binding.
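To make the idea of a length-independent Fourier encoding concrete, the following is a minimal sketch, not the FFTpro implementation. It makes several assumptions that the abstract does not confirm: each residue is mapped to a single Kyte-Doolittle hydrophobicity value (standing in for the BLOSUM embedding), the signal is zero-padded to a fixed number of points, and only the magnitudes of the non-redundant Fourier coefficients are kept. The function name `encode_peptide` and the parameter `n_points` are hypothetical.

```python
import cmath

# Kyte-Doolittle hydrophobicity scale, used here as a one-dimensional
# stand-in for the BLOSUM-based embedding mentioned in the abstract.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def encode_peptide(peptide, n_points=32):
    """Return a fixed-length spectral feature vector for one peptide.

    The peptide is converted to a numeric signal, zero-padded to
    n_points, and transformed with a discrete Fourier transform.
    The output always has n_points // 2 + 1 magnitudes, regardless
    of the peptide length (assumed design choice, not from the source).
    """
    signal = [KYTE_DOOLITTLE[aa] for aa in peptide.upper()]
    if len(signal) > n_points:
        raise ValueError("peptide is longer than n_points")
    signal = signal + [0.0] * (n_points - len(signal))

    features = []
    # For a real-valued signal, coefficients above n_points // 2 are
    # complex conjugates of the lower ones, so they carry no extra magnitude.
    for k in range(n_points // 2 + 1):
        coeff = sum(
            x * cmath.exp(-2j * cmath.pi * k * t / n_points)
            for t, x in enumerate(signal)
        )
        features.append(abs(coeff))
    return features


# A 9-mer and a 15-mer yield vectors of identical length.
short_vec = encode_peptide("GILGFVFTL")
long_vec = encode_peptide("PKYVKQNTLKLATGM")
print(len(short_vec), len(long_vec))
```

Under these assumptions, peptides of any length up to `n_points` map to vectors of the same size, so they can be passed to a single SVM classifier. A multi-channel embedding such as BLOSUM rows would apply the same transform per channel and concatenate the results.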
Comparisons with established tools like NetMHCpan and ESM2-NN demonstrate that MISTIC consistently outperforms these methods in both accuracy and interpretability. Beyond binding prediction, this work explores the connections between HLA alleles and disease susceptibility. By clustering alleles based on their feature selection and attribution profiles, the study uncovers patterns that link similar HLA binding characteristics to shared disease associations. These findings provide insights that could improve the understanding of immune responses and reveal potential targets for therapeutic development. Additionally, a novel attribution method inspired by Integrated Gradients is introduced to improve the interpretability of SVM-based models, offering a fresh perspective on the role of HLA polymorphisms in health and disease. This research makes substantial contributions to computational immunology, particularly in advancing machine learning frameworks tailored to the complexities of HLA biology. The findings could pave the way for more effective vaccine designs and immunotherapies, and could lead to novel tools for studying autoimmune and infectious diseases. By emphasizing interpretability alongside predictive power, the dissertation sets a foundation for leveraging computational insights in clinical and research contexts, bridging the gap between bioinformatics and translational medicine.
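As an illustration of how an Integrated Gradients style attribution can be applied to an SVM, the sketch below computes standard Integrated Gradients for the decision function of an RBF-kernel SVM. It is not the dissertation's attribution method, whose details the abstract does not give. The support vectors, dual coefficients, bias, kernel width, and all-zero baseline are hypothetical toy values chosen only so the example runs, and the path integral is approximated with a midpoint Riemann sum.

```python
import math


def rbf_decision(x, support_vectors, dual_coefs, bias, gamma):
    """Decision value f(x) = sum_i c_i * exp(-gamma * ||x - s_i||^2) + b."""
    total = bias
    for sv, coef in zip(support_vectors, dual_coefs):
        sq_dist = sum((xi - si) ** 2 for xi, si in zip(x, sv))
        total += coef * math.exp(-gamma * sq_dist)
    return total


def rbf_gradient(x, support_vectors, dual_coefs, gamma):
    """Analytic gradient of the RBF decision function with respect to x."""
    grad = [0.0] * len(x)
    for sv, coef in zip(support_vectors, dual_coefs):
        sq_dist = sum((xi - si) ** 2 for xi, si in zip(x, sv))
        kernel = math.exp(-gamma * sq_dist)
        for j in range(len(x)):
            grad[j] += coef * kernel * (-2.0 * gamma * (x[j] - sv[j]))
    return grad


def integrated_gradients(x, baseline, support_vectors, dual_coefs, gamma, steps=200):
    """Approximate Integrated Gradients along the straight path baseline -> x."""
    avg_grad = [0.0] * len(x)
    for m in range(steps):
        alpha = (m + 0.5) / steps  # midpoint of the m-th sub-interval
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        grad = rbf_gradient(point, support_vectors, dual_coefs, gamma)
        for j in range(len(x)):
            avg_grad[j] += grad[j] / steps
    return [(xi - b) * g for xi, b, g in zip(x, baseline, avg_grad)]


# Hypothetical toy model (not fitted to any data).
SUPPORT_VECTORS = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
DUAL_COEFS = [0.8, -0.5, 0.3]
BIAS = -0.1
GAMMA = 0.5

x = [0.9, 0.2]
baseline = [0.0, 0.0]
attributions = integrated_gradients(x, baseline, SUPPORT_VECTORS, DUAL_COEFS, GAMMA)

# Completeness check: attributions should sum to f(x) - f(baseline).
delta = rbf_decision(x, SUPPORT_VECTORS, DUAL_COEFS, BIAS, GAMMA) - rbf_decision(
    baseline, SUPPORT_VECTORS, DUAL_COEFS, BIAS, GAMMA
)
print(attributions, sum(attributions), delta)
```

The completeness property, where the per-feature attributions sum to the change in decision value between the baseline and the input, is what makes this family of methods attractive for interpreting which encoded features drive a binding prediction. With a fitted scikit-learn `SVC`, the toy values would plausibly be replaced by its `support_vectors_`, `dual_coef_`, and `intercept_` attributes.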