Explainable and Interpretable Machine Learning of Structure-Function Relationships for Membrane-Active Peptides in Drug Discovery: Enhancing Therapeutic Specificity
Abstract
Membrane-active peptides, particularly those with antimicrobial, anticancer, and other therapeutic properties, offer a promising alternative to traditional drug treatments. However, accurate predictive models are essential to maximize their effectiveness while limiting undesirable effects such as hemolysis and improving solubility. This dissertation focuses on developing data-driven models for activity prediction of membrane-active peptides (MAPs), with the ultimate goal of designing therapeutics with enhanced specificity. A key innovation in this work is the use of fast Fourier transform (FFT)-based features, which capture the periodicities and order of amino acids in peptide sequences without requiring a sequence alignment, leading to a more detailed understanding of the sequence properties that enable these peptides to interact with biological membranes. These FFT-based features are not specific to MAP activity and potentially have broad utility for predicting other peptide properties, such as solubility and hemolytic potential, since these inherent structural periodicities could contribute to various types of protein structures and activities. To keep our models interpretable, we have incorporated a feature selection framework that identifies the most contributive features and uses only those in the models, allowing for high predictive accuracy while minimizing model complexity. By focusing on a small number of critical features, these models offer valuable insights into the sequence characteristics that are most influential in determining MAP activities, making them highly interpretable and practical for therapeutic peptide design. Support vector machines (SVMs) were employed because of their ability to handle complex, non-linear relationships, and the models developed in this work demonstrate high performance, robustness, and reliability.
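As a minimal sketch of how alignment-free, FFT-based sequence features of this kind can be computed, the example below maps each residue to a number, removes the mean, and takes the normalized power spectrum of the zero-padded signal. The choice of the Kyte-Doolittle hydropathy scale, the padding length `n_fft`, and the function names `fft_features` and `dominant_period` are illustrative assumptions, not the encoding or feature set used in this dissertation.

```python
import numpy as np

# Kyte-Doolittle hydropathy values (assumed encoding, for illustration only).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def fft_features(seq, n_fft=64):
    """Return the normalized power spectrum of a peptide's hydropathy signal.

    The signal is zero-padded to n_fft (sequences longer than n_fft are
    truncated by np.fft.rfft), so every peptide yields a vector of the
    same length (n_fft // 2 + 1) without any sequence alignment.
    """
    signal = np.array([KD[res] for res in seq], dtype=float)
    signal -= signal.mean()  # drop the zero-frequency (composition) offset
    power = np.abs(np.fft.rfft(signal, n=n_fft)) ** 2
    total = power.sum()
    return power / total if total > 0 else power


def dominant_period(seq, n_fft=64):
    """Return the period (in residues) of the strongest non-zero frequency."""
    power = fft_features(seq, n_fft)
    freqs = np.fft.rfftfreq(n_fft)  # cycles per residue, 0 to 0.5
    k = int(np.argmax(power[1:])) + 1  # skip the zero-frequency bin
    return 1.0 / freqs[k]


if __name__ == "__main__":
    # Hypothetical peptide with strictly alternating hydrophobic/charged residues.
    peptide = "LKLKLKLKLKLKLKLK"
    print(fft_features(peptide).shape)
    print(dominant_period(peptide))
```

A fixed-length vector like this could then be passed to a feature selection step and an SVM classifier. For an idealized amphipathic alpha-helix, the spectrum would be expected to peak near a period of about 3.6 residues rather than the period of 2 produced by the alternating toy sequence here.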
Extensive cross-validation and blind test evaluations show that the models achieve performance competitive with state-of-the-art approaches while remaining simple and interpretable. This interpretability sets them apart from more complex models and offers a clear advantage in therapeutic peptide design. The models are notable for using a minimal number of features while still matching or outperforming more complex state-of-the-art models. This is particularly relevant in drug discovery, where identifying meaningful predictive features directly influences both the speed and accuracy of therapeutic development. Although the ultimate goal of this research is to facilitate the design of MAPs with high specificity, the primary contribution lies in the development of powerful and computationally efficient predictive models. These models offer a practical and effective solution for advancing peptide-based drug discovery, enabling the identification of MAPs with optimal therapeutic potential while minimizing undesired properties such as hemolytic activity and improving solubility. The design aspect of this dissertation is positioned as a long-term outcome, supported by the high performance and reliability of the predictive models developed here.