IBM has recently patented a new method that aims to revolutionize the training process for large language models (LLMs) in the enterprise sector. Deep learning models such as GenAI chatbots require vast amounts of data for effective training, which poses challenges in terms of effort, compliance, and cost. To address this, IBM is looking to use synthetic data to satisfy the data requirements of AI models.
The tech giant’s system for synthetic data generation creates simulated data that mimics authentic data from real users through a method known as Large-Scale Alignment for Chatbots (LAB). This approach systematically generates synthetic data tailored to the specific tasks that developers want their chatbots to perform, streamlining the training process for LLMs.
IBM acknowledges that the quality and relevance of data significantly affect the effectiveness of AI models, and it recognizes the bottleneck posed by the need for accurate, representative training data. The LAB method aims to mitigate these challenges by continuously integrating new knowledge and capabilities into the model without erasing what it has already learned, yielding a wealth of processed data for training purposes.
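The patent describes this continual integration only at a high level. One common way to add capabilities without erasing earlier ones is phased fine-tuning with a replay buffer, where each new training phase rehearses a sample of earlier data. The sketch below illustrates that general idea only; the `model.fine_tune` hook and the replay fraction are assumptions for illustration, not IBM’s implementation.

```python
# Illustrative only: phased fine-tuning with a replay buffer, so capabilities
# learned in earlier phases are rehearsed while new synthetic data is added.
import random

def phased_training(model, phases, replay_fraction=0.2, seed=0):
    """Train phase by phase; each phase mixes new data with a sample of old data."""
    rng = random.Random(seed)
    seen = []  # examples from completed phases
    for new_data in phases:
        k = min(len(seen), int(replay_fraction * len(new_data)))
        batch = new_data + rng.sample(seen, k)  # new data plus a replay sample
        rng.shuffle(batch)
        model.fine_tune(batch)  # hypothetical fine-tuning hook on the model object
        seen.extend(new_data)
    return model
```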
A key component of the new data generation method is a taxonomy-based approach that classifies data into various categories and subcategories. IBM’s taxonomy segregates instruction data into three main categories: knowledge, foundational skills, and compositional skills. By mapping out the existing skills and knowledge of the chatbot, developers can identify gaps that need to be filled to enhance the model’s performance.
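As an illustration only (the patent text does not include code), such a taxonomy can be pictured as a tree whose leaves are individual skills or knowledge areas. The `TaxonomyNode` class and the sample branches below are hypothetical names, not drawn from the patent; only the three top-level categories come from IBM’s description.

```python
# Hypothetical sketch of a LAB-style taxonomy; node names below the three
# top-level categories are illustrative, not taken from IBM's patent.
from dataclasses import dataclass, field

@dataclass
class TaxonomyNode:
    name: str
    children: list["TaxonomyNode"] = field(default_factory=list)
    covered: bool = False  # does the base model already handle this leaf?

    def leaves(self):
        """Yield every leaf node; each leaf is one target task for data generation."""
        if not self.children:
            yield self
        else:
            for child in self.children:
                yield from child.leaves()

# The three main categories of instruction data named in the patent.
root = TaxonomyNode("instruction_data", [
    TaxonomyNode("knowledge", [TaxonomyNode("company_policies")]),
    TaxonomyNode("foundational_skills", [TaxonomyNode("arithmetic")]),
    TaxonomyNode("compositional_skills", [TaxonomyNode("email_drafting")]),
])

# Gap analysis: any leaf the model does not yet cover is a candidate
# for synthetic data generation.
gaps = [leaf.name for leaf in root.leaves() if not leaf.covered]
print(gaps)  # ['company_policies', 'arithmetic', 'email_drafting']
```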
Furthermore, IBM’s patent highlights the deployment of a second LLM, referred to as a teacher model, which generates instructions based on a question-answer framework. The teacher model plays a crucial role in refining the synthetic data, producing instructions for each category while maintaining quality control. This progressive training approach lets the AI model build on its existing knowledge base, much as human learning progresses.
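A minimal sketch of what such a teacher-driven loop might look like, assuming a generic `teacher.generate(prompt)` completion call and a simple length-based quality gate (both hypothetical; the patent does not specify an API):

```python
# Hypothetical teacher-model loop; `teacher` stands in for any LLM client
# exposing a generate(prompt) -> str call, and the quality gate is illustrative.
from dataclasses import dataclass

@dataclass
class QAPair:
    question: str
    answer: str

def parse_qa_pairs(text: str) -> list[QAPair]:
    """Split teacher output with 'Q:'/'A:' labelled lines into question-answer pairs."""
    pairs, question = [], None
    for line in text.splitlines():
        if line.startswith("Q:"):
            question = line[2:].strip()
        elif line.startswith("A:") and question:
            pairs.append(QAPair(question, line[2:].strip()))
            question = None
    return pairs

def generate_for_leaf(teacher, leaf, n=5) -> list[QAPair]:
    """Ask the teacher model for Q/A pairs targeting one leaf task, then filter."""
    prompt = (
        f"Write {n} question-answer pairs exercising the skill '{leaf.name}'. "
        "Label each line 'Q:' or 'A:'."
    )
    pairs = parse_qa_pairs(teacher.generate(prompt))  # assumed completion API
    # Simple stand-in for quality control: drop pairs with trivially short answers.
    return [p for p in pairs if len(p.answer) > 20]

# Usage: iterate every leaf of the taxonomy sketched earlier.
# training_data = [p for leaf in root.leaves() for p in generate_for_leaf(teacher, leaf)]
```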
The use of synthetic data also offers privacy benefits, as it allows human behaviors, interactions, and choices to be modeled without exposing real user records. Synthetic data is not without risk, however: in industries like healthcare and finance, synthetic records that track real user data too closely can reintroduce the very privacy and compliance concerns the technique is meant to avoid.
To test the effectiveness of the LAB method, IBM Research generated a synthetic dataset of 1.2 million instructions and used it to train two open-source LLMs. Both models performed comparably to or better than state-of-the-art chatbots across various benchmarks. IBM also used the synthetic data to enhance its enterprise-focused Granite models on IBM watsonx.
IBM attributes the success of the LAB method to two key factors. First, the teacher model can generate synthetic examples from every leaf node of the taxonomy, giving broad coverage of the target tasks. Second, the method incorporates new skills and knowledge into the base LLM without requiring the teacher model to assimilate that information first. This streamlined approach eliminates the need for an all-powerful teacher model that imparts its capabilities to the base model.
The company’s patent suggests a potential surge in demand for AI services and points to the profitability of offering AI support to enterprises building their own models. By using synthetic data, IBM aims to provide a less resource-intensive alternative to collecting authentic user data, making AI development more efficient.
In conclusion, IBM’s innovative approach to synthetic data generation marks a significant milestone in AI training processes. By leveraging the LAB method and taxonomy-based classification, the tech giant aims to enhance the efficiency and effectiveness of AI models, paving the way for advancements in enterprise AI development.