Title: The Role of Data Quality in Shaping the Outcomes of Machine Learning and AI
Data quality has always played a crucial role in the domains of data science, machine learning (ML), and artificial intelligence (AI). According to Kjell Carlsson, head of data science strategy at Domino Data Lab, while the importance of data quality has been acknowledged for a long time, there is now a growing awareness and discussion surrounding it, particularly in the context of generative AI.
Although techniques like feature engineering and ensemble modeling can partially compensate for insufficient or inadequate training data, the quality of input data ultimately determines the upper limit of a model’s potential performance.
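This ceiling effect can be illustrated with a minimal sketch. The simulation below is hypothetical (the function name, noise rate, and sample size are all assumptions, not from the article): even a model that always predicts the true label cannot score better than the labels themselves allow once a fraction of them is corrupted.

```python
import random

def simulate_noise_ceiling(n=10_000, noise_rate=0.2, seed=0):
    # Hypothetical illustration: a fraction of the labels is flipped (noisy),
    # and even an oracle that knows every true label is scored against them.
    rng = random.Random(seed)
    correct = 0
    for _ in range(n):
        true_label = rng.randint(0, 1)
        # Observed label is wrong with probability noise_rate.
        observed = true_label if rng.random() > noise_rate else 1 - true_label
        prediction = true_label  # a perfect model still predicts the truth
        if prediction == observed:
            correct += 1
    return correct / n

print(simulate_noise_ceiling())  # measured accuracy stays near 1 - noise_rate
```

No amount of feature engineering or ensembling moves that ceiling; only cleaner labels do.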
When it comes to AI and ML initiatives in business, ensuring data quality becomes a critical factor for success. While it is possible to build a poor model with high-quality data, the quality of the data strongly influences the possibilities and capabilities of the models.
Companies utilize AI models for specific purposes, which require training with tailored and relevant datasets. Therefore, it is essential to consider the end system that will utilize the data when deciding what data to acquire and use. Carlsson emphasizes that without clearly defining the purpose of the data, it is difficult to determine the desired level of quality.
Due to the importance of data relevance and specificity, widely used but highly general models like GPT-4 may not always be the best fit for enterprise use cases. Models trained on vast but nonspecific datasets might not possess a representative sample of conversations, tasks, and relevant data for a specific industry or organizational workflow.
Rather than categorizing data as good or bad, data quality should be seen as a relative characteristic that is closely tied to the real-world purpose of the model. Even if a dataset is comprehensive, unique, and well structured, it might prove useless if it cannot produce the necessary predictions for a planned use case.
To illustrate this, Carlsson shares an example from a previous project involving an electronic health record platform. Despite having extensive data on how doctors used the platform, his team was unable to predict when a customer would leave the service. The decision to switch services was made by practice managers who didn't directly use the platform, so their behavior was never tracked in the data.
Thus, it is possible to have high-quality data that is completely useless for a particular purpose. This highlights the importance of aligning data quality with the intended use case.
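The same mismatch can be sketched in a few lines. This toy simulation is an assumption-laden stand-in for Carlsson's example (the variable names and thresholds are hypothetical): the observed feature is genuinely high-quality, but the churn decision is driven by an actor the data never captures, so any model built on that feature lands at chance level.

```python
import random

def chance_level_churn(n=10_000, seed=1):
    # Hypothetical sketch: rich usage data exists for doctors, but churn
    # is decided by practice managers, whose behavior is unobserved.
    rng = random.Random(seed)
    correct = 0
    for _ in range(n):
        doctor_usage = rng.random()           # observed, well-measured feature
        manager_decision = rng.randint(0, 1)  # unobserved driver of churn
        prediction = 1 if doctor_usage > 0.5 else 0  # model sees only usage
        if prediction == manager_decision:
            correct += 1
    return correct / n

print(chance_level_churn())  # hovers around 0.5, no better than guessing
```

The fix in such cases is not more of the same data but different data, capturing the behavior of whoever actually makes the decision.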
While training AI models effectively requires significant resources and effort, industry-specific datasets have become readily accessible. In the financial sector, sites such as Data.gov and the American Economic Association offer datasets with macroeconomic information on employment, economic output, and trade within the United States. The official websites of the International Monetary Fund and the World Bank also provide datasets covering global financial markets and institutions.
Many of these datasets are freely available to enterprises. Just as ChatGPT was trained on text gathered from websites, articles, and online forums, enterprises can be expected to search online and explore data marketplaces for the information needed to enhance their models.
In conclusion, data quality significantly influences the outcomes of machine learning and AI. Techniques can partially compensate for insufficient data, but a model's potential performance ultimately depends on the quality of its input data. Businesses must therefore carefully weigh the relevance and specificity of their data and ensure it aligns with the intended purpose of their AI models. By doing so, they can improve data quality and enhance the accuracy and effectiveness of their machine learning initiatives.