Chatbots have become a popular channel for customers seeking assistance. Yet their performance has been a mixed bag of helpful and nonsensical answers, which makes evaluations of chatbots unreliable. Research teams at the University of California, Berkeley have created an experiment called the “Chatbot Arena” aimed at improving chatbot quality. The platform lets anyone chat anonymously with two AI models simultaneously and then vote for a favorite. The website uses large language models (LLMs) packaged by LMSYS Org, which operates within AI research and computer science departments. The site also includes smaller models created by individuals and has drawn about 40,000 participants since its launch in April.
Research into LLMs highlights how much a model’s usefulness depends on human preferences and on the task at hand. Hao Zhang, one of the Berkeley professors leading the experiment, explains that the team started the initiative to train different versions of its AI model, which is based on Meta’s LLaMA model. The team also wanted to standardize the evaluation process to encourage the development and adoption of generative AI tools. The experiment’s leaderboard, based on the Elo rating system, offers a standard mechanism for rating the models’ performance.
Currently, GPT-4, ChatGPT’s most advanced model, leads the other models with an Elo rating of 1,225. Two versions of Claude, made by Anthropic, rank second and third, with ratings of 1,195 and 1,153, respectively. However, the rankings may change as the technology and the AI models improve. Models from ChatGPT and Microsoft Bing rank highly, with PaLM 2, the model behind Google Bard, following close behind.
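To give a sense of what those ratings mean, here is a minimal Python sketch of the textbook Elo formula. It is not Chatbot Arena's actual code: the function names and the update factor `k=32` are assumptions chosen for illustration, and the article does not say which parameters the leaderboard uses. Under the standard formula, a 30-point gap such as 1,225 versus 1,195 corresponds to an expected score of only about 0.54 for the higher-rated model, so the top models are closer than the ranking alone suggests.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score of A against B under the standard Elo formula."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_ratings(rating_a: float, rating_b: float,
                   score_a: float, k: float = 32.0) -> tuple[float, float]:
    """Return the new (A, B) ratings after one head-to-head vote.

    score_a is 1.0 if A wins the vote, 0.0 if A loses, 0.5 for a tie.
    k is a hypothetical update factor; the leaderboard's value is not given.
    """
    expected_a = expected_score(rating_a, rating_b)
    delta = k * (score_a - expected_a)
    # Points gained by one model are lost by the other.
    return rating_a + delta, rating_b - delta


if __name__ == "__main__":
    gpt4, claude = 1225.0, 1195.0  # ratings quoted in the article
    print(round(expected_score(gpt4, claude), 3))
    # One vote in which the lower-rated model is preferred:
    print(update_ratings(gpt4, claude, score_a=0.0))
```

Because each vote moves points from the loser to the winner, an upset (the lower-rated model winning) shifts the ratings more than an expected result does.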
The appeal of LLMs lies in their ability to extract usable information from the web and use it to generate their own content. However, concerns about data privacy and the need to incentivize high-quality, human-created content remain pertinent. Zhang highlights the importance of AI regulation and data quality, saying, “If they don’t incentivize people to create good materials, how can you guarantee they will improve the quality of life?”