
Article: Shouldn’t AI Be Getting Smarter with Age?




The general expectation is that AI models should get smarter over time. In a recent publication, however, Stanford University AI researcher James Zou showed that OpenAI's massively popular models GPT-3.5 and GPT-4, which power ChatGPT, have declined in performance on some tasks instead of improving. Zou and his fellow researchers, Lingjiao Chen and Matei Zaharia, from Stanford and UC Berkeley, published their evaluation of the GPTs in a paper titled “How Is ChatGPT’s Behavior Changing over Time?” on the preprint server arXiv. The authors tested the two LLMs on eight performance measures over a three-month span, during which the models changed significantly.

 

In one measure, Zou and colleagues looked at the two GPTs’ ability to solve math problems. One question they put to the GPTs was whether a number such as 17077 is prime. A prime number is a whole number greater than 1 that can be divided, with no remainder, only by 1 and itself. For example, the number 6 can be divided by 1, 2, 3, or 6 with no remainder, so 6 is not a prime number. The number 7, on the other hand, can be divided only by 1 or 7 with no remainder, making it a prime number. On this task, GPT-4's accuracy fell from 84% in the earlier version to 52% in the later one. In comparison, GPT-3.5 improved from 50% to 76% accuracy over the same period. Another metric used to evaluate the two GPTs involved answering sensitive questions, such as which political party a given politician, for example Texas State Rep. Philip Cortez, belongs to. Both GPTs performed poorly, giving a correct answer at best one time in five. Over time, GPT-4 fell to a one-in-twenty chance of providing the correct answer. A third test challenged the GPTs to write computer code to solve a programming problem. Both GPTs declined over time in their ability to generate accurate code in the correct format. In one case described in the paper, the GPTs added uncalled-for quotation marks that rendered the computer code unusable.
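To make the prime-number question concrete, here is a minimal sketch of how a conventional program answers it. It uses simple trial division, which is an illustrative choice and not a method taken from the paper; the function name `is_prime` is likewise just for illustration.

```python
import math


def is_prime(n: int) -> bool:
    """Return True if n is prime, using trial division up to sqrt(n)."""
    if n < 2:
        return False  # 0, 1, and negative numbers are not prime
    if n < 4:
        return True  # 2 and 3 are prime
    if n % 2 == 0:
        return False  # even numbers above 2 are not prime
    # Only odd divisors up to the integer square root need checking.
    for divisor in range(3, math.isqrt(n) + 1, 2):
        if n % divisor == 0:
            return False
    return True


# The two examples from the text, plus the number posed to the GPTs.
for number in (6, 7, 17077):
    print(number, is_prime(number))
```

A deterministic check like this gives the same answer every time, which is what makes it a useful yardstick for grading a model's answers.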

 

ChatGPT has been widely embraced, and people continue to find innovative uses for its summarization, question answering, translation, and computer code generation capabilities. People expect the GPTs to improve over time, but the work of Zou and colleagues suggests that as these models are updated, the results are mixed, with improvements on some tasks and a loss of quality on others. Companies and individuals who rely on ChatGPT for their work should set up recurring tests in mathematics, question answering, and coding, such as the ones described by Zou, to ensure they are not getting poorer-quality information over time. The bigger question is why these models, as people interact with them, appear to become less rather than more intelligent. Are we seeing a moderating effect where the model moves towards simply average intelligence?
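The recurring tests recommended above could look something like the following sketch. Everything in it is an assumption made for illustration: `ask_model` stands in for whatever function sends a prompt to the model you use, `fake_model` is a stub so the example runs without any API, and the baseline and tolerance values are placeholders you would set from your own earlier runs.

```python
from typing import Callable, Iterable, Tuple


def accuracy(ask_model: Callable[[str], str],
             cases: Iterable[Tuple[str, str]]) -> float:
    """Return the fraction of (prompt, expected) cases answered correctly."""
    case_list = list(cases)
    if not case_list:
        raise ValueError("at least one test case is required")
    correct = sum(
        1
        for prompt, expected in case_list
        if ask_model(prompt).strip().lower() == expected.lower()
    )
    return correct / len(case_list)


def has_regressed(current: float, baseline: float,
                  tolerance: float = 0.05) -> bool:
    """Flag a drop of more than `tolerance` below the recorded baseline."""
    return current < baseline - tolerance


# Hypothetical stub standing in for a real model call.
def fake_model(prompt: str) -> str:
    return "yes" if "17077" in prompt else "no"


# A tiny fixed test set; a real one would be much larger.
CASES = [
    ("Is 17077 a prime number? Answer yes or no.", "yes"),
    ("Is 6 a prime number? Answer yes or no.", "no"),
    ("Is 7 a prime number? Answer yes or no.", "yes"),
]

BASELINE = 0.84  # placeholder: accuracy recorded on an earlier run

score = accuracy(fake_model, CASES)
print(f"accuracy: {score:.2f}, regressed: {has_regressed(score, BASELINE)}")
```

Running the same fixed set of questions on a schedule and comparing each score with a stored baseline is the simplest way to notice the kind of drift the paper describes.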




Dr. Smith’s career in scientific and information research spans the areas of bioinformatics, artificial intelligence, toxicology, and chemistry. He has published a number of peer-reviewed scientific papers. He has worked over the past seventeen years developing advanced analytics, machine learning, and knowledge management tools to enable research and support high-level decision making. Tim completed his Ph.D. in Toxicology at Cornell University and a Bachelor of Science in chemistry from the University of Washington.


His book is available on Amazon in paperback and Kindle formats.






 
 
 


