Researchers at the University of California, Berkeley, have shed light on potential ethical and legal issues associated with training ChatGPT, a language model created by OpenAI. Chang, Cramer, Soni, and Bamman published their paper, titled “Speak, Memory: An Archaeology of Books Known to ChatGPT/GPT-4,” on April 28 on the arXiv preprint server.
Their report highlighted that OpenAI's models are trained on a wide range of copyrighted material, which can introduce bias into the analyses those models produce. Chang noted that science fiction and fantasy books make up a high percentage of the memorized material, skewing results in one direction. This raises questions about the validity of those results and, as Chang argues, underscores the need for transparency about the data used for training.
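One way to probe whether a model has memorized a book is a cloze-style test: mask a character's name in a passage and see whether the model can fill it back in. The sketch below is a hypothetical, simplified illustration of that idea, not the authors' actual experimental code; the `query_model` callable is a stand-in for a real LLM call, and the toy passage and toy model exist only for demonstration.

```python
# Hedged sketch of a name-cloze memorization probe. All names here
# (make_cloze, score_guesses, query_model) are illustrative, not from
# the paper's codebase.

def make_cloze(passage: str, name: str, mask: str = "[MASK]") -> str:
    """Replace every occurrence of a character name with a mask token."""
    return passage.replace(name, mask)

def score_guesses(items, query_model, mask: str = "[MASK]") -> float:
    """Fraction of masked passages where the model recovers the exact name.

    items: iterable of (passage, character_name) pairs.
    query_model: callable mapping a masked prompt to the model's guess.
    """
    hits = 0
    total = 0
    for passage, name in items:
        prompt = make_cloze(passage, name, mask)
        guess = query_model(prompt)
        hits += int(guess.strip().lower() == name.lower())
        total += 1
    return hits / total

if __name__ == "__main__":
    # Toy passage in the public domain, used purely for illustration.
    items = [
        ("Call me Ishmael. Some years ago, Ishmael went to sea.", "Ishmael"),
    ]
    # A toy "model" that always answers "Ishmael"; a real probe would
    # send the prompt to an actual language model here.
    accuracy = score_guesses(items, lambda prompt: "Ishmael")
    print(accuracy)  # 1.0 for this toy model
```

A high fill-in accuracy on passages from a particular book would suggest that book appeared in the training data, which is the kind of signal the researchers use to argue for disclosure of training sources.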
The researchers concluded that for OpenAI's models to reach their full potential, the public needs to know what information and sources are included in, or excluded from, the training data. Knowing which books an AI was trained on is crucial to addressing such hidden bias. They suggested the use of open models that disclose the materials used in training.
In addition, legal challenges may arise in the near future over whether copying text for training qualifies as “fair use” and whether multiple, similar outputs produced for different parties can each receive copyright protection. The copyrightability of machine-generated language is also likely to be tested in court.
The University of California, Berkeley, is a premier public research university. Established in 1868, UC Berkeley is renowned for its academic programs, its faculty, and its research impact on an international scale. Kent Chang is a doctoral student at UC Berkeley's School of Information whose research focuses on natural language processing and cultural analytics.
Mackenzie Cramer is a graduate student at UC Berkeley who specializes in natural language processing and machine learning. Sandeep Soni is a postdoctoral researcher at UC Berkeley whose work applies natural language processing to computational social science. David Bamman is a professor in UC Berkeley's School of Information whose research focuses on natural language processing and cultural analytics.