A recent study published in npj Computational Materials examines how accurate machine learning force fields can be built by fusing experimental and simulation data. Machine learning (ML)-based force fields have been gaining traction for their ability to bridge classical interatomic potentials and quantum-level precision across a range of spatiotemporal scales. However, limited and flawed training data can yield models that disagree with well-established experimental findings or remain accurate only for a narrow set of properties.
In this research, the authors took a novel approach, leveraging both Density Functional Theory (DFT) calculations and experimentally measured mechanical properties and lattice parameters to train an ML potential for titanium. By fusing data from multiple sources, the study demonstrated high accuracy across all target objectives simultaneously, surpassing models trained on a single data source. Notably, the technique corrected known inaccuracies of DFT functionals with respect to experimental properties while leaving off-target properties largely unaffected, and in most cases slightly improved.
The significance of Molecular Dynamics (MD) simulations in materials science is well recognized, both for expediting materials discovery and for understanding existing materials. Nevertheless, the accuracy-efficiency trade-off in conventional approaches poses limitations: ab initio MD offers superior accuracy at high computational cost, whereas classical force-field-based MD sacrifices accuracy for efficiency. ML approaches, particularly ML potentials, present a promising solution by constructing potential energy models with flexible, data-driven functional forms, in principle overcoming the accuracy-efficiency trade-off. However, the success of ML potentials hinges heavily on the quality and diversity of the training data, which may come from simulations, experiments, or a combination of both.
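To make the idea of a "potential energy model with a flexible functional form" concrete, here is a minimal, hypothetical sketch (not from the paper): reference energies from a Lennard-Jones dimer stand in for ab initio data, Gaussian radial basis functions serve as descriptors, and ridge regression fits the basis weights. Real ML potentials use far richer descriptors and many-body models; all names and parameter values here are illustrative assumptions.

```python
import numpy as np

# "Ground truth" standing in for ab initio data: a Lennard-Jones dimer energy.
def lj_energy(r, eps=1.0, sigma=1.0):
    x = (sigma / r) ** 6
    return 4.0 * eps * (x**2 - x)

def descriptors(r, centers, width=0.15):
    # Gaussian radial basis functions of the interatomic distance:
    # the flexible, data-driven "functional form" of the learned potential.
    return np.exp(-((r[:, None] - centers[None, :]) ** 2) / (2 * width**2))

rng = np.random.default_rng(0)
r_train = rng.uniform(0.9, 2.5, size=200)   # training distances
e_train = lj_energy(r_train)                # reference energies

centers = np.linspace(0.8, 2.6, 30)
X = descriptors(r_train, centers)

# Ridge regression: closed-form fit of basis weights to reference energies.
lam = 1e-6
w = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ e_train)

# Evaluate the learned potential on held-out distances.
r_test = np.linspace(1.0, 2.4, 50)
e_pred = descriptors(r_test, centers) @ w
rmse = np.sqrt(np.mean((e_pred - lj_energy(r_test)) ** 2))
print(f"test RMSE: {rmse:.4f}")
```

The key point the sketch illustrates is that no fixed analytic form (harmonic, Lennard-Jones, etc.) is imposed in advance; the model's shape is determined by the data, which is what lets ML potentials approach ab initio accuracy at near-classical cost.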
While simulations provide detailed atomic configurations for training ML potentials (bottom-up learning), generating accurate and expansive ab initio training data is challenging. Researchers therefore often resort to less accurate DFT calculations, since the gold-standard CCSD(T) method is computationally infeasible for large datasets. Moreover, selecting diverse, non-redundant training data presents a further hurdle, with specialized datasets required depending on the target application.
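A common way to tackle the diversity-selection hurdle is farthest-point sampling over structural descriptors, which greedily picks the candidate most dissimilar from everything already selected. The sketch below is a hypothetical illustration, not the paper's method: random clustered vectors stand in for descriptor fingerprints (e.g., SOAP-like vectors) of redundant MD snapshots.

```python
import numpy as np

def farthest_point_sampling(features, n_select, seed=0):
    # Greedily select structures that maximize the distance to the
    # nearest already-selected structure (i.e., the most "novel" ones).
    rng = np.random.default_rng(seed)
    n = len(features)
    selected = [int(rng.integers(n))]              # random first pick
    d_min = np.linalg.norm(features - features[selected[0]], axis=1)
    for _ in range(n_select - 1):
        nxt = int(np.argmax(d_min))                # most novel candidate
        selected.append(nxt)
        d_new = np.linalg.norm(features - features[nxt], axis=1)
        d_min = np.minimum(d_min, d_new)           # update nearest distances
    return selected

rng = np.random.default_rng(1)
# 500 candidate structures described by 8-dimensional fingerprints,
# deliberately clustered into 5 groups to mimic redundant MD snapshots.
centers = rng.normal(size=(5, 8))
feats = np.vstack([c + 0.05 * rng.normal(size=(100, 8)) for c in centers])

picks = farthest_point_sampling(feats, n_select=10)
print(picks)
```

Because each pick maximizes the minimum distance to the current selection, the first few picks land in distinct clusters, so a small labeled set already spans the configuration space instead of oversampling one basin.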
Conversely, training ML potentials on experimental data (top-down learning) taps a rich source of information: although experimental measurements are arduous to obtain and prone to error, each observation carries high information content per sample, which benefits the training process. Challenges arise, however, in running forward simulations and backpropagating through them, especially for time-dependent properties. Despite the distinct difficulties of bottom-up and top-down learning, fusing simulation and experimental data emerges as an effective strategy for training ML potentials with higher accuracy and broader applicability.
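The data-fusion idea can be sketched as a single loss with both a bottom-up term (match simulated energies) and a top-down term (match an experimental observable). The toy example below is an assumption-laden illustration, not the study's actual procedure: a Morse pair potential stands in for the ML potential, the "DFT" energies carry a deliberate systematic shift in equilibrium distance, and the "experimental" bond length pulls the fit back toward the measured value.

```python
import numpy as np

# Toy ML potential: a Morse curve whose equilibrium distance r0 is trained.
def morse(r, r0, D=1.0, a=2.0):
    return D * ((1.0 - np.exp(-a * (r - r0))) ** 2 - 1.0)

# Bottom-up data: "DFT" energies with a systematic error in the
# equilibrium distance (r0 = 1.05 instead of the true 1.00).
r_dft = np.linspace(0.8, 1.6, 40)
e_dft = morse(r_dft, r0=1.05)

# Top-down data: "experimental" equilibrium bond length.
r_exp = 1.00

def fused_loss(r0, w_exp=5.0):
    loss_dft = np.mean((morse(r_dft, r0) - e_dft) ** 2)  # match DFT energies
    loss_exp = (r0 - r_exp) ** 2                         # match experiment
    return loss_dft + w_exp * loss_exp

# Coarse 1-D "training": grid search over the single parameter.
grid = np.linspace(0.90, 1.20, 3001)
r0_fused = grid[np.argmin([fused_loss(g) for g in grid])]
r0_dft_only = grid[np.argmin([fused_loss(g, w_exp=0.0) for g in grid])]
print(f"DFT-only fit: r0 = {r0_dft_only:.3f}")
print(f"Fused fit:    r0 = {r0_fused:.3f}")
```

The DFT-only fit reproduces the biased value, while the fused fit lands between the DFT and experimental values, mirroring how the study's combined training corrects DFT inaccuracies against measured properties. The weight `w_exp` is the knob balancing the two data sources.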
Overall, the study underscores the potential of integrating diverse data sources to enhance the accuracy and versatility of ML potentials, paving the way for accelerated material research and innovation. The findings highlight the importance of a comprehensive training approach that accounts for the strengths and limitations of both simulation and experimental data to develop highly accurate ML force fields for various materials.