India, June 3 -- When Anthropic released Claude 4 a week ago, the artificial intelligence (AI) company said the models set "new standards for coding, advanced reasoning, and AI agents", citing leading scores on SWE-bench Verified, a benchmark that measures performance on real-world software engineering tasks. OpenAI similarly claims its o3 and o4-mini models deliver the best scores on certain benchmarks, as does Mistral for its open-source Devstral coding model.

AI companies flexing comparative test scores has become a common theme.

The world of technology has long obsessed over synthetic benchmark scores. Processor performance, memory bandwidth, storage speed, graphics performance: the examples are plentiful, and they are often used to judge whether a PC or a smartphone ...