A new artificial intelligence (AI) model has just achieved human-level results on a test designed to measure general intelligence: OpenAI's o3 system scored 85% on the ARC-AGI benchmark, on a par with the average human score. It also scored well on a very difficult mathematics test.
Creating artificial general intelligence, or AGI, is the stated goal of all the major AI research labs.
At first glance, OpenAI appears to have at least made a significant step towards this goal.
While scepticism remains, many AI researchers and developers feel something just changed.
For many, the prospect of AGI now seems more real, urgent and closer than anticipated.
Generalisation and intelligence
To understand what the o3 result means, you need to know what the ARC-AGI test measures: an AI system's sample efficiency, meaning how many examples of a novel situation the system needs to see before it figures out how that situation works. An AI system like ChatGPT (GPT-4) is not very sample efficient. It was trained on millions of examples of human text, and the result is pretty good at common tasks. It is bad at uncommon tasks, because it has less data (fewer samples) about those tasks.
This capacity to generalise, accurately solving previously unseen problems from limited samples of data, is widely considered a necessary, even fundamental, element of intelligence.
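To picture sample efficiency as a number, here is a minimal sketch in Python. The target threshold, the accuracy figures and the examples_needed function are all invented for illustration; this is not how ARC-AGI actually scores models.

```python
# Toy illustration: sample efficiency asks how many examples of a new
# task a learner needs before it reaches some target accuracy.
# Fewer examples needed = more sample efficient.

def examples_needed(accuracy_after, target=0.9):
    """accuracy_after[n] = accuracy once the learner has seen n examples.
    Returns the first n at which the target accuracy is reached."""
    for n, accuracy in enumerate(accuracy_after):
        if accuracy >= target:
            return n
    return None  # the learner never adapted

sample_efficient_learner = [0.2, 0.7, 0.95]             # adapts after 2 examples
sample_inefficient_learner = [0.1, 0.2, 0.3, 0.5, 0.9]  # needs 4

print(examples_needed(sample_efficient_learner))    # 2
print(examples_needed(sample_inefficient_learner))  # 4
```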
Grids and patterns

The ARC-AGI benchmark tests for sample-efficient adaptation using small grid puzzles, in which the AI must figure out the pattern that turns one grid into another.

An example task from the ARC-AGI benchmark test.

Each question gives three examples to learn from. The AI system then needs to figure out the rules that generalise from the three examples to a fourth. These are a lot like the IQ tests you might remember from school.
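To make the task format concrete, here is a minimal sketch in Python. The grids and the shift_right rule are invented toy stand-ins, far simpler than real ARC-AGI tasks, which use coloured grids of varying sizes.

```python
# A toy ARC-style task: each training pair shows an input grid and
# the output grid produced by some hidden transformation rule.
# 0 = empty cell, 1 = filled cell.
train_pairs = [
    ([[1, 0, 0, 0]], [[0, 1, 0, 0]]),  # example 1
    ([[0, 1, 0, 0]], [[0, 0, 1, 0]]),  # example 2
    ([[1, 1, 0, 0]], [[0, 1, 1, 0]]),  # example 3
]

def shift_right(grid):
    """Candidate rule: every filled cell moves one column to the right."""
    out = [[0] * len(row) for row in grid]
    for r, row in enumerate(grid):
        for c, cell in enumerate(row):
            if cell and c + 1 < len(row):
                out[r][c + 1] = 1
    return out

# The rule must reproduce all three training examples...
assert all(shift_right(x) == y for x, y in train_pairs)
# ...and is then applied to the unseen fourth grid.
print(shift_right([[1, 0, 1, 0]]))  # -> [[0, 1, 0, 1]]
```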
Weak rules and adaptation

We don't know exactly how OpenAI has done it, but the results suggest the o3 model is highly adaptable: from just a few examples, it finds rules that can be generalised. To figure out a pattern, we shouldn't make any unnecessary assumptions, or be more specific than we really have to be. In theory, identifying the weakest rules that do the job maximises the ability to adapt to new situations.
What do we mean by the weakest rules?
The technical definition is complicated, but weaker rules are usually ones that can be described in simpler statements.
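As a crude illustration of "weaker means simpler to state", here is a short Python sketch. Counting words is only a toy proxy for the formal definition, and both rules are invented.

```python
# Two invented rules that both fit the same observed examples.
# Rule A is weak and general; rule B is strong and over-specific.
rule_a = "every filled cell moves one column right"
rule_b = ("the cell at row 0, column 0 moves to row 0, column 1, "
          "and the cell at row 0, column 1 moves to row 0, column 2")

def description_length(rule):
    """Crude proxy for rule weakness: how many words the rule needs."""
    return len(rule.split())

# Preferring the shorter description bakes in fewer assumptions,
# so the chosen rule is more likely to carry over to unseen examples.
weakest = min([rule_a, rule_b], key=description_length)
print(weakest)  # -> "every filled cell moves one column right"
```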
Searching chains of thought?
It seems unlikely that OpenAI deliberately optimised the o3 system to find weak rules. However, to succeed at the ARC-AGI tasks, it must be finding them. French AI researcher Francois Chollet, who designed the benchmark, believes o3 searches through different chains of thought describing steps to solve the task. It would then choose the best according to some loosely defined rule, or heuristic.
You can think of these chains of thought as programs that fit the examples. There could be thousands of seemingly equally valid programs generated, and the heuristic decides which one to use. It could be "choose the weakest" or "choose the simplest".
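Here is a hedged sketch of what such a generate-and-select step could look like. The candidate programs, complexity scores and training pairs are all invented for illustration; nothing here is OpenAI's actual method.

```python
# Hypothetical generate-and-select loop: keep only candidate programs
# that reproduce every training example, then break the tie with a
# heuristic. Complexity scores stand in for how simply each rule can
# be stated.

train_pairs = [
    ([[1, 0, 0, 0]], [[0, 1, 0, 0]]),
    ([[0, 1, 0, 0]], [[0, 0, 1, 0]]),
    ([[1, 1, 0, 0]], [[0, 1, 1, 0]]),
]

lookup = {repr(x): y for x, y in train_pairs}  # memorises the examples
candidates = [
    # (description, complexity, program)
    ("shift every filled cell one column right", 1,
     # rotates each row right; equivalent to a shift here, since the
     # last column of every training grid is empty
     lambda g: [row[-1:] + row[:-1] for row in g]),
    ("replay the three training examples verbatim", 3,
     lambda g: lookup.get(repr(g))),
]

def consistent(program, pairs):
    """Keep a program only if it reproduces every training example."""
    return all(program(x) == y for x, y in pairs)

valid = [c for c in candidates if consistent(c[2], train_pairs)]
# Both programs fit the examples equally well; the heuristic
# "choose the simplest" breaks the tie in favour of the weak rule.
best = min(valid, key=lambda c: c[1])
print(best[0])  # -> "shift every filled cell one column right"
```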
However, if o3 works like Google's AlphaGo, the system that searched through possible sequences of moves to beat the world Go champion, then OpenAI may simply have had an AI create the heuristic. That was the process for AlphaGo: Google trained a model to rate different sequences of moves as better or worse than others.
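In code, the difference is only where the heuristic comes from. The sketch below assumes a hypothetical score_model standing in for a trained rating network like AlphaGo's; it is not a real API, and the toy scorer here is not learned at all.

```python
# AlphaGo-style alternative: instead of a hand-written rule such as
# "choose the simplest", a second model is trained to rate candidates,
# and the search keeps whichever one it scores highest.

def pick_best(candidates, score_model):
    """Return the candidate the learned heuristic rates highest."""
    return max(candidates, key=score_model)

def score_model(chain_of_thought):
    # A trained rating model would be learned from examples of better
    # and worse solutions; this toy version just prefers shorter chains.
    return -len(chain_of_thought)

chains = [
    "try every cell pairing, then special-case rows 0 and 1 ...",
    "shift every filled cell one column right",
]
print(pick_best(chains, score_model))  # -> the shorter chain
```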
What we still don't know
The question then is: is this really closer to AGI?
If that is how o3 works, then the underlying model might not be much better than previous models.
The concepts the model learns from language might not be any more suitable for generalisation than before; we may instead just be seeing a more generalisable chain of thought, found through the extra step of training a heuristic specialised for this test.
The proof, as always, will be in the pudding.
Almost everything about o3 remains unknown.
When o3 is finally released, we will have a much better idea of whether it is roughly as adaptable as an average human. If it is, we will require new benchmarks for AGI itself and serious consideration of how it ought to be governed.
If not, then this will still be an impressive result.
However, everyday life will remain much the same.