The AI model behind ChatGPT, developed by OpenAI, appears to be experiencing a noticeable decline in performance, raising concerns among researchers and the AI community. Researchers from Stanford University and UC Berkeley published a paper revealing that the underlying AI models, GPT-3.5 and GPT-4, show significant variation in their behavior over time. In particular, GPT-4, the more advanced multimodal model capable of understanding images as well as text, performed worse on several tasks than it had previously.
The tasks used to assess the models' capabilities included solving math problems, generating code, responding to sensitive questions, and visual reasoning, providing a broad evaluation of their abilities. The results were far from impressive: GPT-4's accuracy at identifying prime numbers plummeted from 97.6% in March to a shocking 2.4% in June. It also made more formatting mistakes in code generation and was less willing to answer sensitive questions.
The research does not offer a clear explanation for the decline. Ethan Mollick, a professor of innovation at Wharton, expressed uncertainty about whether OpenAI is even aware of the degradation. The AI community has certainly taken notice, however, with ongoing debates on OpenAI's developer forum about the declining quality of responses.
This decline in GPT-4's performance is problematic for OpenAI because GPT-4 underlies the more advanced version of ChatGPT available to paying subscribers. OpenAI aims to outpace its competitors with its most advanced large language model, and diminishing response quality undermines that goal.
OpenAI has disputed the idea of GPT-4 becoming less capable, with Peter Welinder, VP of product at OpenAI, tweeting that each new version of the model is smarter than its predecessor. However, this latest research suggests otherwise.
Matei Zaharia, CTO at Databricks and co-author of the research paper, highlighted the difficulty in managing the quality of AI models’ responses. Model developers face challenges in detecting changes or preventing a loss of capabilities when tuning their models for new features.
While some experts, like Princeton professor Arvind Narayanan, have pointed out possible limitations in the evaluation methods and tasks used in the research, the concerns about GPT-4's quality persist. OpenAI must address these concerns to maintain confidence in its AI models and stay ahead in a competitive landscape.
As the AI community continues to raise questions about GPT-4's declining performance, OpenAI needs to provide answers and reassurance. The evidence presented in this research paper suggests that OpenAI may need to reevaluate its stance on the model's capabilities. The challenge lies in managing the quality and performance of AI models consistently, ensuring they meet the expectations of both developers and users.