A recent study published in Nature Biotechnology raised interesting points about the potential of AI-generated data and the distraction of ChatGPT being regarded as a ‘scientist.’
One of the study's key arguments is that the protein folding problem is an outlier among scientific challenges because of the specific way it can be defined and measured, and because high-quality data are available for it. Although today's biological databases are small compared with the vast text corpora used to train large language models, the study suggests that the rapid growth of whole-genome sequencing will soon provide biological data at a scale that rivals those corpora.
As genome sequencing becomes more affordable and the clinical applications of genomic data expand, fully sequencing entire populations, such as the roughly 300 million people of the United States, becomes increasingly plausible. Because individuals differ from a reference genome at only a small fraction of positions, each genome of about 3 billion base pairs can be stored compactly as roughly 30 million variant bases, so a population-scale collection would be comparable in scale to the roughly 400-terabyte Common Crawl corpus used to train large language models. The challenge is to harness genomic data at this scale for machine learning while navigating privacy concerns.
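To make the scale concrete, here is a back-of-the-envelope calculation in Python using the figures quoted above; the 2-bits-per-base encoding is an assumption made purely for illustration, and real variant formats such as VCF carry considerably more metadata.

```python
# Rough estimate of the size of a population-scale genomic dataset,
# using the round figures quoted above (assumptions, not measurements).

POPULATION = 300_000_000               # individuals (US population, as quoted)
VARIANT_BASES_PER_GENOME = 30_000_000  # bases capturing one genome's
                                       # differences from a reference (as quoted)
BITS_PER_BASE = 2                      # A/C/G/T fits in 2 bits (assumption)

total_bases = POPULATION * VARIANT_BASES_PER_GENOME
total_terabytes = total_bases * BITS_PER_BASE / 8 / 1e12

print(f"Total variant bases: {total_bases:.2e}")          # 9.00e+15
print(f"Approximate size:    {total_terabytes:,.0f} TB")  # ~2,250 TB
# Roughly a couple of petabytes -- within an order of magnitude of the
# ~400 TB Common Crawl figure quoted above.
```

Storing every genome at its full 3-billion-base resolution would inflate this total by about a factor of 100, which is why representing each genome by its differences from a reference matters for the comparison.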
Despite these hurdles, there are at least four potential paths forward for building large-scale machine learning models on massive genomic data, each of which could advance both genomics and AI. How researchers navigate the use of such extensive biological data for training AI models while respecting privacy will be worth watching.
In conclusion, the intersection of AI-generated data and biological research presents real opportunities for scientific advancement: by harnessing genomic data at population scale, researchers could overcome these challenges and open new possibilities at the interface of artificial intelligence and genomics.