On Wednesday, Databricks released Dolly 2.0, the first freely available, instruction-following large language model (LLM) licensed for commercial use and fine-tuned on a human-generated data set. It represents a spark in the language model universe, providing a foundation for building ChatGPT-style competitors.
Databricks, an American enterprise software company founded in 2013 by the creators of Apache Spark, aims to let organizations create and customize LLMs without paying for API access or sharing data with third parties.
Dolly 2.0 is a 12-billion-parameter model based on EleutherAI's Pythia model family and fine-tuned exclusively on databricks-dolly-15k, a data set crowdsourced from Databricks employees. This fine-tuning gives it capabilities closer to those of OpenAI's ChatGPT, which can answer questions properly and engage in realistic conversation.
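For readers who want to try the model themselves, the sketch below shows one way to run it with the Hugging Face transformers library. It assumes the weights are published on the Hugging Face Hub as `databricks/dolly-v2-12b` and that the custom instruction-following pipeline referenced on the model card is available via `trust_remote_code`; treat it as an illustrative sketch rather than an official recipe.

```python
# Minimal sketch: running Dolly 2.0 through Hugging Face transformers.
# Assumes the weights live at "databricks/dolly-v2-12b" and that enough
# GPU/CPU memory is available for a 12B-parameter model.
import torch
from transformers import pipeline

generate_text = pipeline(
    model="databricks/dolly-v2-12b",   # assumed Hub identifier for the 12B model
    torch_dtype=torch.bfloat16,        # roughly halves memory use versus fp32
    trust_remote_code=True,            # pulls the model card's instruction pipeline
    device_map="auto",                 # spread layers across available devices
)

# Ask an instruction-style question, the kind of prompt the fine-tuning data targets.
result = generate_text("Explain the difference between nuclear fission and fusion.")
print(result)
```

Because the weights are freely downloadable, the same snippet runs entirely on local infrastructure, with no API key or usage agreement beyond the model's license.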
In March of this year, Databricks began its journey with the release of Dolly 1.0, which was hampered because its training data included ChatGPT outputs, obliging users to adhere to OpenAI's terms of service.
The Databricks team then took on the colossal task of building a new data set to enable a commercially usable LLM: roughly 15,000 demonstrations crowdsourced from more than 5,000 employees, who were motivated through an internal contest. The data-generation tasks included open Q&A, closed Q&A, summarizing information from Wikipedia, brainstorming, classification, and creative writing.
The data set, model weights, and training code were released under a Creative Commons license that permits commercial use, modification, and extension. That sets it apart from OpenAI's ChatGPT, which requires users to pay for API access, and Meta's LLaMA, which is only partially open source and forbids commercial use.
AI researcher Simon Willison called the launch of Dolly 2.0 a "really big deal" and praised Databricks for the instruction-tuning data set, created by more than 5,000 Databricks employees and openly released under a Creative Commons license.
The potential of Dolly 2.0 is considerable: it could spark a new wave of open source language models free from proprietary restrictions and limits on commercial use. With further refinement, these fine-tuned models may even run on consumer-class machines.
Databricks LLC is a software company founded by the original creators of Apache Spark, an open-source distributed computing platform for processing large data sets. It provides a web-based platform for developing and running big-data workloads, with support for a variety of languages, libraries, APIs, and other technologies.
Simon Willison is a software developer and AI researcher who experiments with open source language models, including Dolly. Willison's comments on the release of Dolly 2.0 stoked anticipation for the potential of open source language models, summed up in his words: "Even if Dolly 2 isn't good, I expect we'll see a bunch of new projects using that training data soon. And some of those might produce something really useful."
The Dolly 2.0 weights are available on Hugging Face, and the databricks-dolly-15k data set is free to download from GitHub. It is an exciting time for large language models, with freely available, open source AI opening up a wide range of possibilities.
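As a brief, hedged example of working with the data itself, the snippet below loads the records with the Hugging Face datasets library. It assumes the data set is mirrored on the Hugging Face Hub as `databricks/databricks-dolly-15k` and that each record carries instruction, context, response, and category fields; the raw file downloaded from GitHub should expose the same fields as JSON keys.

```python
# Minimal sketch: inspecting databricks-dolly-15k with the `datasets` library.
# Assumes the data set is mirrored on the Hugging Face Hub under this name.
from collections import Counter
from datasets import load_dataset

dolly = load_dataset("databricks/databricks-dolly-15k", split="train")

print(len(dolly), "records")

# Count how many examples fall into each task category
# (open Q&A, closed Q&A, summarization, brainstorming, classification, ...).
print(Counter(example["category"] for example in dolly))

# Show one record: an instruction, an optional reference context, and a response.
sample = dolly[0]
print(sample["instruction"])
print(sample["context"])
print(sample["response"])
```

Because the license permits modification and commercial use, the same few lines are all that is needed to start filtering, extending, or remixing the data for a new fine-tuning project.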