ChatGPT’s Performance Shifts Over Time, Stanford Study Shows Potential Decline
In recent months, OpenAI’s ChatGPT has been at the forefront of generative AI, setting the standard for human-like conversational experiences. However, a recent study by researchers from Stanford University and UC Berkeley suggests that ChatGPT’s performance may have declined on some tasks.
The paper, titled “How Is ChatGPT’s Behavior Changing over Time?”, examines the behavior and capabilities of different versions of ChatGPT, specifically the March and June 2023 releases of GPT-4 and GPT-3.5. The researchers aimed to measure how these large language models (LLMs) drift over time by assessing their performance across several task categories.
The study documents contrasting performance and behavior between the two snapshots of each model. The researchers selected tasks to cover a diverse set of capabilities, and they found significant differences in performance and behavior between the March and June versions, with some tasks regressing markedly.
One area of focus was the models’ ability to solve math problems. In March, GPT-4 achieved high accuracy by following chain-of-thought prompts and arriving at correct answers. By June, however, the model appeared to ignore the chain-of-thought instruction, producing incorrect responses. GPT-3.5 showed the opposite pattern: it initially gave wrong answers but improved substantially by June.
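For context, a chain-of-thought prompt simply asks the model to reason step by step before giving its final answer. Below is a minimal sketch of how such a query might be issued through the OpenAI Python library; the model name and the prime-checking question are illustrative of the style of task the paper describes, not the researchers’ exact setup.

```python
# Minimal sketch of a chain-of-thought query via the OpenAI Python library.
# The model name and prompt wording are illustrative, not the study's exact setup.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

question = "Is 17077 a prime number? Think step by step and then answer [Yes] or [No]."

response = client.chat.completions.create(
    model="gpt-4",  # the study compared March and June snapshots of GPT-4 and GPT-3.5
    messages=[{"role": "user", "content": question}],
    temperature=0,  # near-deterministic output makes behavior drift easier to spot
)

print(response.choices[0].message.content)
```

Re-running the same fixed prompts against different snapshots of a model is essentially how this kind of drift study works: if the June snapshot stops emitting the step-by-step reasoning the March snapshot produced, that shows up directly in the transcripts and in the accuracy numbers.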
According to the researchers, GPT-4’s accuracy on this task plummeted from 97.6% in March to just 2.4% in June, while GPT-3.5’s accuracy rose from 7.4% to 86.8% over the same period. The researchers also noted a shift in verbosity: GPT-4’s responses became more compact, while GPT-3.5’s grew by about 40%. They attribute these disparities partly to drift in how each model responds to chain-of-thought prompts.
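Measuring this kind of drift is straightforward in principle: run the same labeled questions against each snapshot, then compare the fraction answered correctly and the average response length. Here is a minimal sketch of that scoring step, assuming the responses have already been collected; the helper and the crude substring match are hypothetical, and the paper’s actual scoring may differ.

```python
# Hypothetical scorer for collected model responses against gold labels.
# Accuracy and mean response length are the two drift metrics discussed above.
from statistics import mean


def score_responses(responses: list[str], labels: list[str]) -> tuple[float, float]:
    """Return (accuracy, mean response length in characters)."""
    correct = sum(
        1 for reply, label in zip(responses, labels)
        if label.lower() in reply.lower()  # crude match; real scoring may differ
    )
    return correct / len(labels), mean(len(reply) for reply in responses)


# Example: two snapshots of the same model answering the same yes/no questions.
march = ["Step 1: 17077 is not divisible by ... so the answer is [Yes]", "... [No]"]
june = ["[No]", "[No]"]
labels = ["[Yes]", "[No]"]

for name, replies in [("March", march), ("June", june)]:
    acc, length = score_responses(replies, labels)
    print(f"{name}: accuracy={acc:.1%}, avg length={length:.0f} chars")
```

The verbosity metric falls out of the same loop, which is why the study can report the accuracy swing and the roughly 40% change in GPT-3.5’s response length from a single evaluation run.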
Additionally, the researchers examined how the models handle sensitive questions. In March, both models gave detailed explanations when declining prompts containing discriminatory elements; by June, both models refused the same queries outright, with little or no explanation.
The study has drawn attention on Reddit, where users have offered a mix of reactions and theories about the findings. While further benchmarks are needed to validate the results and to test whether they extend to other platforms, such as Bing Chat, it would be unwise to dismiss these initial results.
Notably, Microsoft’s Bing Chat, which is built on OpenAI’s GPT-4, has faced issues of its own, with users reporting rudeness and incorrect responses. Microsoft has moved to address these problems with a steady stream of updates and improvements.
The debate over ChatGPT’s shifting performance raises broader questions about the reliability, accuracy, and capabilities of AI-powered chatbots. The findings from Stanford University and UC Berkeley shed light on the evolving nature of language models, their strengths, and their potential weaknesses. It remains to be seen how OpenAI and other companies will address these concerns and improve the user experience of AI chatbots going forward.