A new study by Yale researchers sheds light on the importance of large datasets when testing machine learning models designed to link brain structure with behavior. The findings, published in the journal Nature Human Behaviour, have implications for future research in psychology.
Machine learning models play a crucial role in identifying patterns between brain structure or function and cognitive attributes like attention or symptoms of depression. By establishing these links, researchers aim to better understand how the brain influences these attributes and potentially predict cognitive challenges based on brain imaging data alone.
However, the effectiveness of these models depends heavily on the size of the datasets used for training and testing. When datasets are too small, models may appear less capable than they actually are, and their results may be inaccurate or fail to generalize across the population.
To address this issue, researchers have started subjecting machine learning models to more rigorous testing by evaluating their performance on separate datasets provided by other researchers. While this approach enhances the robustness of brain-behavior relationships identified by the models, it also highlights the importance of dataset size in achieving statistical power.
Analyzing data from six neuroimaging studies, the researchers found that both training and testing datasets need to be relatively large to ensure sufficient statistical power. Most published studies in the field had datasets that were too small, compromising the power of their findings.
The study revealed that for measures with small effect sizes, such as attention problems, datasets of hundreds to thousands of individuals may be necessary to detect significant relationships between brain structure and behavior. As more neuroimaging datasets become available, researchers are encouraged to test their models on separate, large datasets to enhance the reliability and generalizability of their findings.
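To give a sense of why small effects call for samples in the hundreds, the sketch below estimates how many participants are needed to detect a correlation of a given size with a standard two-sided test. It is a textbook approximation based on the Fisher z-transformation, not the method used in the study. The function name `required_n`, the correlation values, the 5% significance level, and the 80% power target are all illustrative assumptions.

```python
from math import atanh, ceil
from statistics import NormalDist


def required_n(r, alpha=0.05, power=0.80):
    """Approximate sample size needed to detect a correlation r.

    Uses the Fisher z-transformation approximation for a two-sided test.
    The defaults (alpha=0.05, power=0.80) are common conventions,
    not values taken from the study.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the test
    z_beta = NormalDist().inv_cdf(power)           # quantile for the target power
    return ceil(((z_alpha + z_beta) / atanh(r)) ** 2 + 3)


# Illustrative effect sizes only; smaller correlations need far more participants.
for r in (0.1, 0.2, 0.3):
    print(f"r = {r}: about {required_n(r)} participants")
```

Under these assumptions, the required sample grows roughly with the inverse square of the effect size, so halving the correlation roughly quadruples the number of participants needed.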
In conclusion, the study underscores that brain-behavior machine learning models need large datasets for testing as well as for training. By planning for dataset size and conducting external testing on sizable datasets, researchers can improve the reproducibility and reliability of their findings in neuroimaging research.