Cleanlab, a startup that specializes in data curation solutions for large language models (LLMs) used in enterprise AI, has raised $5 million in seed funding. The investment round was led by Bain Capital Ventures, demonstrating strong support for Cleanlab’s mission to address the challenge of dirty data in machine learning.
Cleanlab, founded by Curtis Northcutt, Jonas Mueller, and Anish Athalye, has developed an open-source product that identifies and cleans incorrect labels in data. This innovative approach has the potential to significantly enhance the performance of machine learning models, which often struggle due to poor data quality.
“The dirty secret of machine learning is that your model is only as good as your data,” explained Curtis Northcutt, Cleanlab’s CEO. “If you have incorrect labels in your data, which is a common issue, it can negatively impact your model’s performance.”
Data curation is typically a time-consuming, manual process that demands extensive resources from data teams. Cleanlab aims to automate and simplify this process using a method called confident learning, which Northcutt invented during his PhD at MIT.
Confident learning estimates the joint distribution of the observed (noisy) labels and the unknown true labels, then uses that estimate to identify and correct likely label errors in the dataset. It also provides accuracy estimates for labels and examples, offering a confidence score for each label.
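To make the idea concrete, here is a minimal sketch of the counting step behind confident learning, written in plain Python. It is a simplified illustration, not the Cleanlab library's API: the function names (class_thresholds, confident_joint) and the toy data are assumptions made for this example, and the calibration steps described in the confident learning research are omitted. It assumes a classifier has already produced out-of-sample predicted probabilities for each example.

```python
def class_thresholds(labels, pred_probs, num_classes):
    """Per-class threshold: the mean predicted probability of class k
    over the examples whose given label is k."""
    sums = [0.0] * num_classes
    counts = [0] * num_classes
    for y, p in zip(labels, pred_probs):
        sums[y] += p[y]
        counts[y] += 1
    # A class with no labeled examples gets a threshold of 1.0 (hypothetical choice).
    return [sums[k] / counts[k] if counts[k] else 1.0 for k in range(num_classes)]


def confident_joint(labels, pred_probs, num_classes):
    """Count examples by (given label, confidently predicted label).

    Returns the count matrix and the indices of examples whose confident
    prediction disagrees with the given label (likely label errors).
    """
    thresholds = class_thresholds(labels, pred_probs, num_classes)
    counts = [[0] * num_classes for _ in range(num_classes)]
    suspects = []
    for i, (y, p) in enumerate(zip(labels, pred_probs)):
        # Classes the model predicts at or above that class's threshold.
        confident = [k for k in range(num_classes) if p[k] >= thresholds[k]]
        if not confident:
            continue  # the model is not confident about any class; skip
        guess = max(confident, key=lambda k: p[k])
        counts[y][guess] += 1
        if guess != y:
            suspects.append(i)
    return counts, suspects


# Toy data (made up for illustration): six examples, two classes.
labels = [0, 0, 0, 1, 1, 1]
pred_probs = [
    [0.90, 0.10],
    [0.80, 0.20],
    [0.20, 0.80],  # labeled 0, but the model is confident it is class 1
    [0.10, 0.90],
    [0.30, 0.70],
    [0.85, 0.15],  # labeled 1, but the model is confident it is class 0
]

counts, suspects = confident_joint(labels, pred_probs, num_classes=2)
total = sum(sum(row) for row in counts)
joint = [[c / total for c in row] for row in counts]  # rough joint estimate
print(counts, suspects, joint)
```

Dividing the counts by their total gives a rough estimate of the joint distribution of given and true labels, and the off-diagonal entries point to the examples most worth reviewing.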
Cleanlab offers two products: Cleanlab Open Source and Cleanlab Studio. Cleanlab Open Source is a free Python library that enables users to apply confident learning to their datasets. Cleanlab Studio, on the other hand, is a cloud-based SaaS product that offers a user-friendly interface and advanced features for data curation. It seamlessly integrates with popular LLM frameworks and platforms like Hugging Face Transformers, Google Cloud AI Platform, Amazon SageMaker, Microsoft Azure Machine Learning, and IBM Watson.
Cleanlab has already garnered over 10,000 users for its open-source project and boasts more than 100 customers for its cloud product. Its clientele includes Fortune 500 companies, government agencies, research institutions, and startups from various industries such as e-commerce, healthcare, social media, education, entertainment, and finance.
The $5 million in seed funding will be used to expand Cleanlab’s team, scale its product development efforts, and grow its customer base. CEO Curtis Northcutt expressed excitement about partnering with Bain Capital Ventures, which is known for its investments in AI startups.
Bain Capital Ventures partner Aaref Hilaly and principal Rak Garg praised Cleanlab’s team, technology, and vision. They emphasized that Cleanlab is addressing a significant and underserved problem in the enterprise AI space.
“Cleanlab is the leading solution for data curation for LLMs, which is a massive unaddressed need in the enterprise. Data curation is crucial for model performance and reliability, and with Cleanlab’s open-source approach, users gain more control and an easier-to-adopt product. We are thrilled to back Curtis and his co-founders Jonas and Anish, who have built an exceptional product and a community around confident learning,” said Hilaly.
Garg added that Cleanlab is part of a broader investment focus on artificial intelligence at Bain Capital Ventures. The firm has also invested in other AI startups this year, including Contextual AI, EvenUp, and Unstructured.
Cleanlab’s data curation solution aligns with the growing demand for enterprise AI solutions, particularly for LLMs. Gartner has predicted that 69% of the routine work currently done by managers will be fully automated by 2024, and LLMs are likely to play a role in activities such as scheduling, reporting, and decision-making. However, data quality and curation remain major obstacles to the widespread adoption and deployment of LLMs in enterprises.
Cleanlab’s solution helps overcome these challenges, enabling enterprises to unlock the full potential of LLMs across various use cases and applications. By leveraging Cleanlab, organizations can enhance the quality and reliability of their datasets and models, streamline data curation processes, and ensure ethical and responsible use of LLMs. Additionally, Cleanlab provides a competitive advantage and helps enterprises derive value from their data assets.