India, June 3 -- When Anthropic released Claude 4 a week ago, the artificial intelligence (AI) company said the models set "new standards for coding, advanced reasoning, and AI agents", citing leading scores on SWE-bench Verified, a benchmark that measures performance on real-world software engineering tasks. OpenAI similarly claims its o3 and o4-mini models deliver the best scores on certain benchmarks, as does Mistral for its open-source Devstral coding model.

AI companies flexing comparative test scores has become a common theme.

The world of technology has long obsessed over synthetic benchmark scores. Processor performance, memory bandwidth, storage speed, graphics performance: the examples are plentiful, and they are often used to judge whether a PC or a smartphone ...