New Study Shows GPT-4’s Performance Declining On Certain Tasks

Stanford University and UC Berkeley paper shows that GPT-4 is no longer as good at solving maths problems, generating code and answering questions

Visual reasoning. (a) Overall performance. For both GPT-4 and GPT-3.5, there was a 2% improvement of the exact match rate from March to June. The generation length remained roughly the same. More than 60% generation changed from March to June. (b) An example query and the corresponding responses. While overall GPT-4 became better over time, it was worse on this particular query. It gave the correct grid in March but the wrong one in June.

Although GPT-4 was quite impressive when it launched, some observers have noticed a decrease in its accuracy and effectiveness. These observations have been circulating online for several months, including on the OpenAI forums.

A study by researchers at Stanford University and UC Berkeley indicates that, on several tasks, GPT-4 has not become more accurate with subsequent updates but has actually become worse.

The study, titled ‘How Is ChatGPT’s Behavior Changing over Time?’, compared the March and June 2023 versions of GPT-4 and of its predecessor, GPT-3.5.

When researchers tested both models on a dataset of 500 problems asking whether a given number is prime, GPT-4 achieved 97.6% accuracy in March, with 488 correct answers. In June, after GPT-4 had received updates, its accuracy dropped to just 2.4%, with only 12 correct answers.
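The percentages quoted above follow directly from the counts of correct answers. As a quick sanity check, here is a minimal Python sketch; the function name `accuracy_percent` is ours for illustration and does not come from the study.

```python
def accuracy_percent(correct: int, total: int) -> float:
    """Return accuracy as a percentage, rounded to one decimal place."""
    return round(100 * correct / total, 1)


# Counts reported in the article for GPT-4 on the 500-problem dataset.
march = accuracy_percent(488, 500)
june = accuracy_percent(12, 500)

print(f"March: {march}%  June: {june}%")
print(f"Drop: {round(march - june, 1)} percentage points")
```

The drop is measured in percentage points, not as a relative percentage of the March figure.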

The researchers also used chain-of-thought prompting, asking GPT-4 to reason step by step about the question “Is 17,077 a prime number?” GPT-4 not only gave the incorrect answer “No” (17,077 is in fact prime), but also failed to explain its reasoning, according to the researchers.
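Readers who want to confirm the correct answer themselves can do so with plain trial division. The sketch below is our own illustration, not the researchers' evaluation code; `is_prime` is a hypothetical helper name.

```python
from math import isqrt


def is_prime(n: int) -> bool:
    """Trial division: test odd divisors up to the integer square root of n."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2  # 2 is the only even prime
    for d in range(3, isqrt(n) + 1, 2):
        if n % d == 0:
            return False
    return True


print(is_prime(17077))  # True: no divisor exists up to isqrt(17077) == 130
```

Trial division is slow for very large numbers, but for a five-digit input it needs at most a few dozen checks.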

GPT-3.5 giving better responses

GPT-4 is currently available to developers and ChatGPT Plus subscribers. A user who puts the same question to GPT-3.5 through the free ChatGPT research preview, however, not only gets the correct answer but also receives a thorough explanation of the mathematical process.

The researchers also found a decline in GPT-4’s code generation performance. On a dataset of 50 easy problems from LeetCode, the share of generated code that was directly executable dropped from 52% in March to just 10% in June.

Adding to the concern, Twitter commentator @svpino pointed to rumours that OpenAI could be employing “smaller and specialized GPT-4 models that act similarly to a large model but are less expensive to run.”

This cheaper and faster approach could reduce the quality of GPT-4’s responses at a critical moment, when OpenAI has numerous major organisations relying on its technology.

It could be nothing (big)

However, some believe that a change in behaviour doesn’t necessarily mean a decrease in GPT-4’s capability.

The study itself acknowledges this, noting that “A model that has a capability may or may not display that capability in response to a particular prompt.” In other words, getting the desired outcome might require the user to try different prompts.

When GPT-4 was first revealed, OpenAI explained that it had used Microsoft Azure AI supercomputers to train the language model for six months. The company claimed that this resulted in a 40% higher chance of generating the “desired information from user prompts.”

The company’s CEO, Sam Altman, recently expressed his disappointment in a tweet after the Federal Trade Commission launched an investigation into whether ChatGPT breaches consumer protection laws.

“We’re transparent about the limitations of our technology, especially when we fall short. And our capped-profits structure means we aren’t incentivized to make unlimited returns,” tweeted Altman.

As OpenAI becomes increasingly engaged in the politics of AI regulation and discussions about the risks of AI, the most it can do for its users is provide a brief look behind the scenes to help them understand why the AI they pay to use isn’t behaving in the way a good chatbot should.

Written By

Isa Muhammad is a writer and video game journalist covering many aspects of entertainment media including the film industry. He's steadily writing his way to the sharp end of journalism and enjoys staying informed. If he's not reading, playing video games or catching up on his favourite TV series, then he's probably writing about them.
