Title: Open-Source Chatbots vs. ChatGPT: Chasing the Hype without Substance
Introduction:
The race to replicate OpenAI’s groundbreaking chatbot, ChatGPT, has produced a wave of new chatbots from both big-tech companies and the open-source community. Many of these developers, however, are resorting to shortcuts and exaggerated claims to grab attention. One popular shortcut is training chatbots on data generated by ChatGPT itself. Recently, OpenChat, an open-source alternative, boasted of surpassing ChatGPT’s performance on the Vicuna GPT-4 Benchmark. A closer look suggests those claims may not hold up.
Not for Commercial Use:
OpenChat is built on top of LLaMA-13B, a model Meta released for research purposes only. As a result, OpenChat cannot be used commercially, which already weakens its case as a practical ChatGPT alternative. The fine-tuning data also deserves scrutiny: the LLaMA-based model was fine-tuned on only about 6,000 of the roughly 90,000 conversations available on ShareGPT, an online hub for sharing outputs generated by ChatGPT and GPT-4.
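To make the scale of that selection concrete, below is a minimal sketch of picking a small subset of ShareGPT-style conversations for supervised fine-tuning. The file name, record schema, and filtering criteria are assumptions made for illustration; OpenChat’s actual selection pipeline is not reproduced here.

```python
import json

# Hypothetical input file: ShareGPT dumps are typically distributed as a JSON
# list of records like {"id": ..., "conversations": [{"from": "human" | "gpt",
# "value": "..."}]}. The exact schema can vary between dumps.
SHAREGPT_PATH = "sharegpt_90k.json"

with open(SHAREGPT_PATH) as f:
    conversations = json.load(f)

def is_high_quality(conv, min_turns=2, min_chars=200):
    """Illustrative filter only: OpenChat's real selection criteria are not
    public in this level of detail, so this stands in for 'keep a small,
    curated subset' rather than reproducing their pipeline."""
    turns = conv.get("conversations", [])
    if len(turns) < min_turns:
        return False
    return sum(len(t.get("value", "")) for t in turns) >= min_chars

# Keep roughly 6,000 of the ~90,000 conversations, mirroring the ratio
# described above.
subset = [c for c in conversations if is_high_quality(c)][:6000]
print(f"Kept {len(subset)} of {len(conversations)} conversations")
```

The striking point is the ratio: over 90% of the available data is discarded, so the resulting model’s behavior hinges almost entirely on how that small subset is chosen.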
Flawed Evaluation Metrics:
The Vicuna GPT-4 Benchmark primarily rewards the style, not the informativeness, of the generated content. Moreover, the metric is GPT-based: GPT-4 acts as the judge, and a judge model tends to rate answers that mimic its own style more generously, so models trained on ChatGPT or GPT-4 data receive inflated scores, rendering the benchmarking process unreliable. Hugging Face, a prominent platform, has observed similar discrepancies between the benchmark results published by other open-source models and those models’ performance on Hugging Face’s own benchmarks.
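To illustrate where this bias comes from, here is a minimal sketch of the “LLM as judge” pattern that GPT-4-judged benchmarks such as Vicuna’s rely on. The prompt wording below is invented for illustration and is not the benchmark’s actual template; the sketch assumes the openai Python client (v1 interface) with an API key set in the environment.

```python
# Sketch of pairwise "LLM as judge" scoring, assuming openai>=1.0.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Illustrative judge prompt, not the Vicuna benchmark's real template.
JUDGE_PROMPT = """You are an impartial judge. Rate each assistant's answer
to the question on a 1-10 scale, then output the scores as "A: x, B: y".

Question: {question}

Assistant A: {answer_a}

Assistant B: {answer_b}"""

def judge(question: str, answer_a: str, answer_b: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(
                question=question, answer_a=answer_a, answer_b=answer_b
            ),
        }],
        temperature=0,  # keep scoring as deterministic as the API allows
    )
    return response.choices[0].message.content
```

Because the judge scores free-form text, a fluent answer that reads like GPT-4’s own writing tends to be rated higher than a terse but accurate one, which is exactly the bias described above.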
False Hype and Style Imitation:
Experts have criticized the trend of imitating ChatGPT by training models on ChatGPT-generated output, branding it false progress. These models often excel at mimicking the chatbot’s style, and may even improve on the narrow tasks they were fine-tuned for; when assessed across a broad range of general tasks, however, ChatGPT proves to be the superior assistant.
Transition to MT-bench and Disappointing Results:
In response to the criticism, the researchers behind OpenChat switched to MT-bench to test its performance. There, OpenChat scored significantly worse than the GPT-3.5-based ChatGPT, reinforcing the concerns about evaluating models on the Vicuna GPT-4 Benchmark.
Quality Data Drives Success:
The message emerging from this discourse is the undeniable importance of high-quality training data. OpenAI’s ChatGPT stands out because it was trained on a unique and powerful dataset that its competitors lack. While the open-source community strives to replicate ChatGPT’s success, training on ChatGPT’s synthetic output may not be the most effective route, and OpenAI’s dataset is not without its own baggage: the company has already faced multiple lawsuits over training its models on data scraped from the internet.
Conclusion:
The claims made by models trained on ChatGPT data often fail to live up to expectations once they are re-evaluated on independent benchmarks. OpenAI’s ChatGPT remains the frontrunner in the realm of chatbots, demonstrating its superiority across a wide range of tasks. Whatever the hype surrounding these new models, the importance of high-quality data cannot be overstated. OpenAI’s proprietary dataset has played a pivotal role in its success, making it difficult for open-source alternatives to replicate its achievements.