OpenAI’s o3 reasoning model has shattered benchmarks with an impressive 87.5% score, but an estimated price tag of $30,000 or more per task to test its capabilities raises questions about the true cost of A.I. innovation.
Measuring the intelligence of artificial intelligence is a difficult task. This is why the tech industry has developed benchmarks like ARC-AGI, which test the capabilities of new technology through visual tasks that are challenging for A.I. models.
In December, OpenAI’s o3 reasoning model became the first A.I. system to pass the ARC-AGI test with an 87.5 percent score. This achievement highlights the model’s ability to pause and consider numerous potential answers before responding with the most accurate one.
OpenAI is a research organization focused on developing and advancing artificial intelligence (AI) technologies.
Founded in 2015, the company has made significant contributions to natural language processing, computer vision, and robotics.
Its mission is to ensure that AI benefits humanity as a whole.
OpenAI's work includes developing large-scale language models like GPT-3, which has been used in applications such as text generation and content creation.
The organization also focuses on safety and governance of AI systems, acknowledging the potential risks associated with advanced technologies.
The cost of testing OpenAI’s o3 model was initially estimated at around $3,400 per task by the Arc Prize Foundation, which administers the ARC-AGI benchmark. However, recent estimates suggest the actual cost could be as much as ten times higher.

The new pricing for OpenAI’s o1-pro model, which costs ten times as much to run as o1, has been used to estimate the cost of testing o3. That puts the potential cost of running o3 at upwards of $30,000 per task.
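The revised figure follows from simple scaling: multiply the original per-task estimate by o1-pro's price premium over o1. A minimal sketch of that arithmetic, using the article's estimated dollar figures rather than any published OpenAI pricing:

```python
# Back-of-the-envelope estimate of o3's per-task testing cost.
# Both inputs are the article's estimates, not official pricing.

initial_estimate_per_task = 3_400  # dollars, Arc Prize Foundation's original figure
o1_pro_multiplier = 10             # o1-pro is roughly 10x more costly to run than o1

revised_estimate_per_task = initial_estimate_per_task * o1_pro_multiplier
print(f"Revised estimate: ~${revised_estimate_per_task:,} per task")
# Revised estimate: ~$34,000 per task
```

The result lands just above the "$30,000 per task" figure cited, consistent with the foundation's "upwards of" framing.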
The Arc Prize Foundation’s benchmark, ARC-AGI, relies on a series of puzzles that track how close A.I. systems are to human-level intelligence. It examines whether an A.I. system can adapt to new problems and learn new task-specific skills.
Recent A.I. releases have gotten closer to the 100 percent mark on ARC-AGI, but they’ve largely been stumped by a newer version of the test released last month, known as ARC-AGI-2. This version contains tasks that are even more difficult for A.I. systems and is designed especially to challenge models that specialize in reasoning.
Artificial intelligence (AI) has a rich history dating back to 1950, when computer scientist Alan Turing proposed the Turing Test.
Since then, AI research has accelerated with significant advancements in machine learning, natural language processing, and deep learning.
Today, AI is integrated into various industries, including healthcare, finance, and transportation, improving efficiency and decision-making processes.
The high cost of running OpenAI’s o3 model highlights both the challenge of measuring machine intelligence and the importance of benchmarks like ARC-AGI. As A.I. technology continues to evolve, it will be crucial to address these challenges and develop more efficient evaluation methods.
- observer.com | OpenAI’s o3 Reasoning Models Are Extremely Expensive to Run