Scientists have developed PLAS-20k, an extended dataset of protein-ligand interactions for machine learning applications in drug discovery. It expands the earlier PLAS-5k dataset and comprises 97,500 independent simulations of 19,500 different protein-ligand complexes, five per complex. By incorporating dynamic features, the researchers obtain more accurate binding affinity predictions than docking scores provide. The dataset also separates strong binders from weak ones effectively and offers insights into how well the ligands adhere to Lipinski’s Rule of Five. To support its use, the OnionNet model has been retrained on it and is provided as a baseline for predicting binding affinities.
Computational methods, such as high-throughput docking and molecular dynamics (MD) simulations, have emerged as efficient alternatives to traditional high-throughput screening in drug discovery. These methods significantly reduce the time, cost, and resources required for physical experiments. However, existing docking methods predict binding affinities with limited accuracy because they sample protein and ligand conformations only sparsely and rely on approximate scoring functions. MD simulations, by contrast, capture the dynamic properties of protein-ligand interactions and predict binding affinities more accurately. Screening a large number of molecules with MD simulations is computationally expensive, however, which makes it impractical for large-scale predictions.
Machine learning (ML) has become a powerful tool in drug development, with successful applications in areas including virtual screening, prediction of binding sites, and protein folding. ML models have been developed to predict protein-ligand binding affinity from static 3D structures in the Protein Data Bank. However, these models often lack dynamic features that provide important insights into the binding process. MD simulations can reveal the dynamic behavior of biomolecules and capture both the long-range and short-range interactions involved in binding events. To enhance the accuracy of ML models, larger datasets that include such dynamic features are needed.
The PLAS-20k dataset addresses this need with a diverse collection of 19,500 protein-ligand complexes. For each complex it provides the binding affinity, its non-covalent interaction components, and the accompanying trajectories for machine learning applications. The dataset was evaluated by comparing binding affinities calculated with the molecular mechanics/Poisson-Boltzmann surface area (MMPBSA) method against experimentally determined values and against docking scores. The complexes were also categorized into strong and weak binders to analyze the range of binding strengths. Furthermore, the dataset allows the ligands to be assessed against Lipinski’s Rule of Five, which provides insights into their drug-like properties.
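To make the Lipinski assessment concrete, here is a minimal Python sketch of a Rule of Five check. It uses the commonly cited thresholds (molecular weight at most 500 Da, logP at most 5, at most 5 hydrogen-bond donors, at most 10 hydrogen-bond acceptors) and the common convention that one violation is tolerated. The class and function names, the tolerance setting, and the two example ligands are illustrative assumptions, not part of the PLAS-20k release, and the descriptor values are assumed to have been computed beforehand with a cheminformatics toolkit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LigandDescriptors:
    """Precomputed descriptors for one ligand (hypothetical container)."""
    molecular_weight: float  # in daltons
    log_p: float             # octanol-water partition coefficient
    h_bond_donors: int
    h_bond_acceptors: int


def lipinski_violations(ligand: LigandDescriptors) -> list[str]:
    """Return the names of the Rule of Five criteria the ligand violates."""
    violations = []
    if ligand.molecular_weight > 500:
        violations.append("molecular_weight")
    if ligand.log_p > 5:
        violations.append("log_p")
    if ligand.h_bond_donors > 5:
        violations.append("h_bond_donors")
    if ligand.h_bond_acceptors > 10:
        violations.append("h_bond_acceptors")
    return violations


def passes_lipinski(ligand: LigandDescriptors, max_violations: int = 1) -> bool:
    """Assumed convention: a ligand passes with at most one violation."""
    return len(lipinski_violations(ligand)) <= max_violations


# Made-up descriptor values, for illustration only.
small_ligand = LigandDescriptors(320.4, 2.1, 2, 5)
large_ligand = LigandDescriptors(812.9, 6.3, 7, 14)

print(passes_lipinski(small_ligand), lipinski_violations(small_ligand))
print(passes_lipinski(large_ligand), lipinski_violations(large_ligand))
```

Setting `max_violations=0` gives the stricter reading in which every criterion must hold; which reading the PLAS-20k authors applied is not stated here.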
The availability of the PLAS-20k dataset is expected to accelerate drug discovery and design processes through data-driven approaches. The dataset empowers researchers to explore and apply ML techniques more effectively, leading to advancements in hit identification, lead optimization, and de novo molecular design. By incorporating dynamic features into the dataset, researchers can improve the accuracy and reliability of ML models in predicting binding affinities. The PLAS-20k dataset represents a significant step towards leveraging the dynamic nature of biomolecular systems and driving innovation in drug development.