Machine learning has shown great potential in functional protein design. A recent article published in Nature Biotechnology surveys the approaches and models used in this area of research.
Protein design models use a combination of sequences, structures, and functional labels to generate novel protein designs. Each of these modalities has its own advantages and limitations, which depend on how much data is available, how much human knowledge and intervention is required, and how close the modality is to the desired function.
The majority of machine learning models for protein design can be grouped into three categories based on the representation of proteins, the training data, and the underlying algorithm’s learning objective. A unifying probabilistic framework is used to facilitate comparisons between the different methods.
Sequence-based models form the first category and can be further divided into two groups. The first group, sequence-only models, learns a generative model of primary structure (the amino acid sequence) by training on a large collection of protein sequences. These models capture the biochemical constraints implicit in the training set. The second group, conditional sequence models, conditions the generative process on additional information such as taxonomic group or gene ontology annotations, which gives more control over the generated sequences.
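As a rough illustration of a sequence-only model, the sketch below fits a site-independent model (one amino acid distribution per position) to a handful of equal-length sequences, then scores and samples sequences from it. This is a deliberately simple stand-in and not a method taken from the article; the toy sequences, the pseudocount, and all function names are assumptions made for the example.

```python
import math
import random
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def fit_site_independent(seqs, pseudocount=1.0):
    """Estimate one amino acid distribution per position from equal-length sequences."""
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all training sequences must have the same length")
    total = len(seqs) + pseudocount * len(AMINO_ACIDS)
    model = []
    for pos in range(length):
        counts = Counter(s[pos] for s in seqs)
        # The pseudocount keeps unseen residues at a small non-zero probability.
        model.append({a: (counts[a] + pseudocount) / total for a in AMINO_ACIDS})
    return model

def log_likelihood(model, seq):
    """Log-probability of a sequence under the site-independent model."""
    return sum(math.log(site[a]) for site, a in zip(model, seq))

def sample(model, rng):
    """Draw one new sequence, position by position."""
    return "".join(
        rng.choices(AMINO_ACIDS, weights=[site[a] for a in AMINO_ACIDS])[0]
        for site in model
    )

# Hypothetical toy "family"; real models train on very large sequence collections.
toy_family = ["MKVLA", "MKILA", "MRVLA", "MKVLG"]
model = fit_site_independent(toy_family)
rng = random.Random(0)
new_seq = sample(model, rng)
print(new_seq, log_likelihood(model, new_seq))
```

Sequences that resemble the training set receive a higher log-likelihood than unrelated ones, which is the sense in which such a model captures the constraints present in its training data.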
Sequence-label models form the second category and require a sufficiently large number of functional labels for training. Supervised predictors in this category predict the functional label of a given protein sequence and can be used to prioritize candidate designs by their predicted properties. Label-conditioned generative models, on the other hand, approximate the joint probability of a sequence and its functional label, which allows new sequences to be generated conditioned on a desired property.
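The prioritization step can be sketched with any supervised predictor. The example below uses a nearest-neighbor regressor over Hamming distance, chosen only because it fits in a few lines; it is not the approach described in the article. The labeled sequences, their activity values, and the function names are all hypothetical.

```python
def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))

def predict_label(labeled, seq, k=2):
    """Predict a label as the mean label of the k closest labeled sequences."""
    nearest = sorted(labeled, key=lambda item: hamming(item[0], seq))[:k]
    return sum(label for _, label in nearest) / len(nearest)

def prioritize(labeled, candidates, top_n=1, k=2):
    """Rank candidate designs by predicted label and keep the top_n."""
    scored = sorted(
        ((predict_label(labeled, c, k), c) for c in candidates),
        reverse=True,
    )
    return [c for _, c in scored[:top_n]]

# Hypothetical measurements: (sequence, measured activity).
labeled = [("MKVLA", 0.9), ("MKILA", 0.8), ("MRVLG", 0.2), ("MRILG", 0.1)]
candidates = ["MKVLG", "MRVLA", "MKIIA", "MRILA"]

shortlist = prioritize(labeled, candidates, top_n=1)
print(shortlist)
```

In practice the predictor would be trained on many more labels, and the shortlist would be sent for experimental validation rather than trusted directly.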
Structure-based models comprise the third category. These models can be used for structure prediction, structure generation, inverse folding, or holistic design. Structure prediction models predict the tertiary structure of a protein from its primary structure. Structure generation models, such as those built on generative adversarial networks and variational autoencoders, learn the probability distribution of protein structures. Inverse folding models condition the generative process on a 3D structure to generate a corresponding protein sequence. Holistic design approaches model the joint probability of sequence and structure and are used to generate new designs.
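The three categories can be summarized by the distribution each model type targets. The notation here is an assumption introduced for this summary and may differ from the article's: x denotes a sequence, c a conditioning annotation (for example a taxonomic group or gene ontology term), y a functional label, and s a structure.

```latex
\begin{align*}
\text{sequence-only:} \quad & p(x) \\
\text{conditional sequence:} \quad & p(x \mid c) \\
\text{supervised sequence-label predictor:} \quad & p(y \mid x) \\
\text{label-conditioned generative:} \quad & p(x, y), \qquad p(x \mid y) = \frac{p(x, y)}{p(y)} \\
\text{structure prediction:} \quad & p(s \mid x) \\
\text{structure generation:} \quad & p(s) \\
\text{inverse folding:} \quad & p(x \mid s) \\
\text{holistic design:} \quad & p(x, s)
\end{align*}
```

Written this way, the model types differ mainly in which variables are generated and which are conditioned on.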
Choosing the appropriate model architecture depends on the design objective and the available data. Generative models are useful for sampling new proteins resembling the training data, while sequence-label architectures prioritize a subset of variants for experimental validation. Supervised generative models offer finer control over the sampling process. It is crucial to ensure a robust experimental pipeline to measure the properties of the generated proteins accurately.
Overall, machine learning for functional protein design is a rapidly evolving field. The range of available models and approaches allows researchers to design novel proteins with desired functions. By combining machine learning with an understanding of protein structure and function, researchers are extending what protein design can achieve and opening up new possibilities across application domains.
References:
1. Machine learning for functional protein design. Nature Biotechnology.