Study Shows ChatGPT Quality Is Highly Volatile

First published: August 10, 2023

Number of Updates: 1

Researchers asked ChatGPT to carry out several standard tasks between March 2023 and June 2023 and found that the quality of the output changed substantially.

The study examined GPT-3.5 and GPT-4 on four tasks to measure how model output evolved over time: solving math problems, answering sensitive or dangerous questions, generating code, and performing visual reasoning.

The models performed better in June than in March in some areas, but not in all. For example, GPT-4 identified prime numbers with 97.6% accuracy in March, but with only 2.4% accuracy in June. More importantly, the outcomes demonstrate the volatility of generative AI solutions.

While the study concludes that quality changed rapidly over the four-month period, it doesn’t speculate on the causes, primarily because little information is available from OpenAI about updates made to the models. The researchers are therefore unable to identify whether the changes in output stem from changes to the data used to train the model or from changes to the model itself.

The researchers measured the quality of math output by the accuracy of the responses, the quality of code generation by the fraction of generated code that was directly executable, and the quality of visual reasoning by the rate of exact matches.
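
As a rough illustration of these three measurements, here is a minimal Python sketch. It is not the researchers' evaluation code. The function names are hypothetical, and because this article does not describe how executability was checked, the sketch simply treats a snippet as "directly executable" if it compiles as Python, which is a weaker stand-in for actually running it.

```python
def exact_match_rate(outputs, references):
    """Fraction of outputs that equal their reference exactly.

    Serves for both math accuracy and visual-reasoning exact match.
    """
    if len(outputs) != len(references):
        raise ValueError("outputs and references must have the same length")
    if not references:
        raise ValueError("need at least one example")
    return sum(o == r for o, r in zip(outputs, references)) / len(references)


def is_directly_executable(source):
    """Assumption: approximate 'directly executable' as 'compiles as Python'."""
    try:
        compile(source, "<generated>", "exec")
        return True
    except (SyntaxError, ValueError):
        return False


def executable_fraction(snippets):
    """Fraction of generated snippets that pass the compile check."""
    if not snippets:
        raise ValueError("need at least one snippet")
    return sum(is_directly_executable(s) for s in snippets) / len(snippets)


# Toy data, invented for illustration only.
math_score = exact_match_rate(["yes", "no", "yes"], ["yes", "yes", "yes"])

fence = "`" * 3  # a Markdown code fence
plain = "print('hi')"
wrapped = fence + "python\n" + plain + "\n" + fence
code_score = executable_fraction([plain, wrapped])  # 0.5
```

In this toy data the snippet wrapped in Markdown fences fails the compile check, so even cosmetic changes in how a model formats its answer can lower an executability score.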

In the case of answering sensitive questions, the researchers were trying to see if the models would provide harmful outputs including social biases or personal information when prompted by user input. Overall, the models performed better in June than in March.

Ultimately, the researchers suggest that “this highlights the need to continuously evaluate and assess the behavior of LLMs in production applications.” They recommend that anyone who has integrated LLM services into their applications run similar quality analyses regularly to limit the impact of model changes on their solution and on downstream systems.
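
One way such a recurring check might look is sketched below in Python, assuming a team already records a score between 0 and 1 per task on a fixed test set. The function name `find_regressions`, the task labels, and the 0.05 tolerance are all hypothetical choices for illustration, not something the study prescribes.

```python
def find_regressions(baseline, current, tolerance=0.05):
    """Return tasks whose score dropped by more than `tolerance`.

    `baseline` and `current` map task names to scores in [0, 1].
    Tasks missing from `current` are skipped rather than flagged.
    """
    return {
        task: (old, current[task])
        for task, old in baseline.items()
        if task in current and old - current[task] > tolerance
    }


# The prime-number figures come from the article; the second task and
# its scores are made up to show an improvement that is not flagged.
march = {"prime_numbers": 0.976, "example_task": 0.50}
june = {"prime_numbers": 0.024, "example_task": 0.60}

regressions = find_regressions(march, june)
# {'prime_numbers': (0.976, 0.024)}
```

Running a comparison like this on a schedule, against the same prompts each time, would surface a drop such as the prime-number one before it reaches downstream systems.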

The research team is from Stanford University and UC Berkeley.

Beyond this study, users have also noticed changes in ChatGPT and have discussed a perceived decline in quality on Twitter, in ChatGPT Facebook groups, and on OpenAI’s community platform.