This article discusses RedPajama, a collaborative project between Together, Ontocord.ai, ETH DS3Lab, Stanford CRFM, Hazy Research, and the MILA Québec AI Institute to create leading open-source large language models. The project began with the release of a 1.2-trillion-token dataset that follows the LLaMA recipe, enabling any organization to pre-train models that can be permissively licensed.
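For teams that want to inspect or build on the corpus, it can be sampled incrementally rather than downloaded whole. The sketch below is an illustration only: it assumes the data is published on the Hugging Face Hub under the repository id togethercomputer/RedPajama-Data-1T, with per-source subsets and a "text" field per document; the exact identifiers should be confirmed against the project's release notes.

```python
# Minimal sketch: streaming a slice of the RedPajama corpus with the
# Hugging Face `datasets` library. The repository id and subset name
# below are assumptions, not confirmed identifiers.
from datasets import load_dataset

# Streaming avoids downloading the full multi-terabyte corpus up front.
ds = load_dataset(
    "togethercomputer/RedPajama-Data-1T",  # assumed Hub repository id
    "arxiv",                               # assumed name of one source subset
    split="train",
    streaming=True,
)

# Inspect the first few documents; the "text" key is assumed here.
for i, example in enumerate(ds):
    print(example["text"][:200])
    if i >= 2:
        break
```

Streaming mode returns an iterable over individual documents, so a researcher can audit or filter the data before committing storage to a full download.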
This is the same procedure the creators of the LLaMA model followed; however, they did not release their dataset. The RedPajama team reproduced the recipe and recreated the dataset from scratch to give organizations full access to open-source language models. Vipul Ved Prakash, founder and CEO of Together and previously co-founder of Cloudmark and Topsy, emphasized the importance of providing commercially viable open-source models.
The open-source AI debate has recently stirred up conversations about competition among corporations and ethical concerns. Companies such as OpenAI insist that access needs to be governed for organizations to maintain their lead. On the other hand, Databricks released Dolly 2.0, the first open, instruction-following LLM licensed for commercial use.
The RedPajama project attempts to address both of these perspectives, as the models are open source yet commercially viable. The hope is that the availability of the data and training scripts will lead to broader research into improving AI models and applications. In addition, the models are trained on openly available data and should not reproduce their training data.
Lastly, the team working on this project, including Chris Ré (co-founder of Together, Stanford associate professor, and co-founder of SambaNova, Snorkel.ai, and Factory) and Vipul Ved Prakash, is pushing for open access to large language models and software systems. This gives organizations both big and small equitable access to the same tools, which might otherwise be out of reach.