Highlights:

  • In March, GPT-4 correctly solved 97.6% of the mathematical problems it was assigned; by June, its accuracy had fallen to 2.4%.
  • Between March and June, the proportion of queries GPT-4 answered with “directly executable” code, meaning code that runs without modification, fell by more than 40%.

According to a recent research paper, GPT-4, OpenAI LP’s most powerful artificial intelligence model, may have become less capable at performing certain tasks.

Ars Technica recently reported on the paper’s findings. The paper, written by three researchers from Stanford University and the University of California, Berkeley, was originally published on July 18. Following its publication, several AI experts questioned whether GPT-4 has become less accurate.

The paper’s authors assessed GPT-4’s reasoning abilities by having it complete a series of tasks twice: once in March and again three months later, in June. They then compared the outcomes of the two runs.

One subset of the tasks the researchers assigned to GPT-4 required the AI to solve mathematical problems. In March, it correctly solved 97.6% of the questions. By June, that figure had dropped to 2.4%.

The paper’s authors believe the decline might be due to “drifts of chain-of-thoughts’ effects.”

When the researchers asked GPT-4 to tackle math problems, they interacted with the model using chain-of-thought prompting: rather than requesting only an answer, they also asked the model for a step-by-step explanation of its reasoning. This technique has been shown to improve the accuracy of language models.
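For illustration, here is a minimal sketch of how a plain prompt differs from a chain-of-thought prompt. The wording is hypothetical; the study’s actual prompts are not reproduced here.

```python
# Hypothetical prompt wording; illustrative only, not the study's prompts.
question = "Is 17077 a prime number?"

# A plain prompt asks only for the final answer.
direct_prompt = f"{question} Answer 'yes' or 'no'."

# A chain-of-thought prompt also asks the model to lay out its reasoning,
# which tends to improve accuracy on problems like this one.
cot_prompt = f"{question} Think step by step, then answer 'yes' or 'no'."

print(direct_prompt)
print(cot_prompt)
```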

The researchers hypothesize that the observed change in GPT-4’s accuracy may be attributable to the chain-of-thought prompts. In one test, they posed a chain-of-thought query asking the model to determine whether 17,077 is a prime number. In March, GPT-4 gave the correct answer along with a step-by-step breakdown of its reasoning. Three months later, it gave an incorrect answer without providing a breakdown.
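For reference, the correct answer is easy to verify programmatically. The trial-division check below is an illustrative sketch, not the researchers’ evaluation code; it confirms that 17,077 is indeed prime.

```python
import math

def is_prime(n: int) -> bool:
    """Trial division: test every divisor up to the integer square root of n."""
    if n < 2:
        return False
    for d in range(2, math.isqrt(n) + 1):
        if n % d == 0:
            return False
    return True

print(is_prime(17077))  # True: 17,077 has no divisor up to isqrt(17077) == 130
```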

The researchers also evaluated GPT-4’s accuracy on other categories of tasks. Some of the evaluations required the model to write software code. Between March and June, the proportion of queries GPT-4 answered with “directly executable” code, meaning code that runs without modification, fell by more than 40%.
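One plausible way to operationalize “directly executable” is simply to try running the model’s raw response. The `directly_executable` helper below is a hypothetical sketch of such a check, not the paper’s actual test harness.

```python
def directly_executable(response: str) -> bool:
    """Return True if the raw response runs as Python without modification.

    Note: exec() runs arbitrary code, so a real harness would sandbox this.
    """
    try:
        exec(compile(response, "<llm-response>", "exec"), {})
        return True
    except Exception:
        return False

print(directly_executable("print(2 + 2)"))                  # True
print(directly_executable("```python\nprint(2 + 2)\n```"))  # False: fences are not Python
```

Under a check like this, a response whose code is correct but wrapped in prose or Markdown formatting still counts as a failure.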

Some AI experts have expressed doubts about the paper’s findings. Arvind Narayanan, a professor of computer science at Princeton University, noted that the fact that GPT-4’s generated code could not be run directly did not necessarily mean the code was less accurate. In some cases, the code could not be executed because GPT-4’s responses also contained explanatory prose.

Software engineer Simon Willison concurred. “A decent portion of their criticism involves whether or not code output is wrapped in Markdown backticks or not,” Willison told Ars Technica. Backticks are used to format software code in Markdown.
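If Willison’s objection is right, much of the measured decline could be recovered with light post-processing, such as stripping Markdown fences before running the code. The `strip_markdown_fences` helper below is a hypothetical sketch, not part of the study.

```python
import re

def strip_markdown_fences(response: str) -> str:
    """Return the code inside a Markdown fence if one is present;
    otherwise return the response unchanged."""
    match = re.search(r"```(?:\w+)?\n(.*?)```", response, re.DOTALL)
    return match.group(1) if match else response

wrapped = "```python\nprint(2 + 2)\n```"
print(strip_markdown_fences(wrapped))  # prints: print(2 + 2)
```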

Logan Kilpatrick, OpenAI’s head of developer relations, stated, “The team is aware of the reported regressions and looking into it.” Peter Welinder, the AI startup’s vice president of product and partnerships, stated, “No, we haven’t made GPT-4 dumber. Quite the opposite.”

The paper also evaluated GPT-3.5, an earlier OpenAI model with fewer capabilities. Between March and June, the researchers found, that model’s accuracy did not decline but instead improved: the rate at which GPT-3.5 correctly solved math problems rose from 7.4% to 86.8%.