
Grok 4 Dominates 1st Day Of AI Chess Tournament, Advances With Gemini 2.5 Pro, o4-mini, o3


The inaugural day of the AI chess exhibition match in Google’s new Kaggle Game Arena project saw dominating performances by four of the Large Language Models (LLMs) on display. Four 4-0 sweeps took Gemini 2.5 Pro, o4-mini, Grok 4, and o3 to the Semifinals after they defeated Claude 4 Opus, DeepSeek R1, Gemini 2.5 Flash, and Kimi k2.

The event continues on Wednesday, August 6, starting at 1 p.m. ET / 19:00 CEST / 10:30 p.m. IST.

Kaggle Arena Chess Exhibition Tournament Bracket 


The Kaggle Game Arena is a new project by Kaggle, a company owned by Google, and the world’s leading community and platform for data scientists and machine learning practitioners. The arena is Kaggle’s initiative to explore how LLMs like Gemini, ChatGPT, DeepSeek, and others perform in a dynamic and competitive environment.

According to Google, this experiment can serve as a powerful indicator of the LLMs’ general problem-solving abilities. Google believes the LLMs’ play can provide insight into the future of AI’s strategic intelligence and its path toward Artificial General Intelligence (AGI).

To inaugurate their arena, Kaggle has chosen to host an AI chess exhibition tournament. To do so, they’ve partnered with DeepMind, another one of Google’s companies, which is no stranger to chess—in fact, they revolutionized the chess world with AlphaZero in 2017. Kaggle has provided the Game Arena, a trusted and neutral stadium where the AIs are competing, while DeepMind has designed the scientifically rigorous tournament in which the AIs are participating. 

What makes this tournament unique is that, unlike other computer-vs.-computer tournaments, this event does not involve chess-playing engines. Instead, the LLMs are “general-purpose” models created for writing, coding, and reasoning about the world. Their playing strength, then, is nowhere near that of dedicated chess engines.

However, as pointed out earlier, this experiment will let us catch a glimpse of the AIs’ way of thinking, which will help us understand the way they “see” and approach complex problems.

The tournament itself is a single-elimination knockout bracket. Eight of the world’s leading LLMs are participating: Gemini 2.5 Pro, Gemini 2.5 Flash, o3, o4-mini, Claude 4 Opus, Grok 4, DeepSeek R1, and Kimi k2. Using DeepMind’s “harness” (the universal controller the AIs use to “see” the positions and make moves), each LLM gets four total attempts to produce a legal move. If all four attempts fail, it forfeits the game.
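For readers curious about the mechanics, here is a minimal sketch of what such a four-attempt retry loop might look like. It uses the python-chess library for legality checking; the ask_model_for_move callback is a hypothetical stand-in for the LLM and is not part of the actual Kaggle/DeepMind harness:

```python
import chess

MAX_ATTEMPTS = 4  # per the tournament rules described above


def play_turn(board: chess.Board, ask_model_for_move) -> bool:
    """Ask the model for a move, allowing up to four attempts.

    ask_model_for_move is a hypothetical callback that takes the current
    position (as a FEN string) and the attempt number, and returns a move
    in SAN notation, e.g. "Nf3". Returns True if a legal move was played,
    False if the model exhausted its attempts and forfeits.
    """
    for attempt in range(1, MAX_ATTEMPTS + 1):
        candidate = ask_model_for_move(board.fen(), attempt)
        try:
            # parse_san raises a ValueError subclass for illegal or
            # unparseable moves
            move = board.parse_san(candidate)
        except ValueError:
            continue  # illegal move: let the model try again
        board.push(move)
        return True
    return False  # four failed attempts: the game is forfeited
```

In the games below, every loss by "illegal move" corresponds to the False branch of a loop like this one: four consecutive attempts that never produced a playable move.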

Now that we’ve discussed the format and goals of the tournament, it’s time to dive into the games. You can watch IM Levy Rozman‘s video recap below, or read the full report:


Kimi k2 0-4 o3

Despite every match finishing 4-0, this was by far the least balanced of them all. The match between Kimi k2 and o3 finished early, as none of the games lasted more than eight moves. All four games finished after Kimi k2 failed to find a legal move four times in a row—forfeiting each game. Though Kimi k2’s chess performance was underwhelming, it was also very revealing. 

Judging by Kimi k2’s annotations, the AI seemed able to follow opening theory for a few moves. When that was the case, it had no trouble playing good moves. However, as soon as it “lost track” of theory, disaster became far more likely, and in Kimi k2’s case it always came too soon.

It’s hard to tell why Kimi k2 struggled. At times, it could clearly see where each piece was located, but it forgot how pieces moved:

In other games, it couldn’t even read the position correctly. Below, you can see the pieces Kimi k2 misplaced in game two:

Either way, Kimi k2’s mistakes gave o3 an easy win and a ticket to the semifinals, which o3 will play against its “little brother,” the winner of the next match we’ll discuss.

DeepSeek R1 0-4 o4-mini

The match between OpenAI’s o4-mini and DeepSeek R1 was, in many ways, bizarre. If you isolated the first few moves of every game, you’d think you were watching two strong players competing. But at some point, the quality of play would collapse.

This trend would continue throughout the match. A few good opening moves would be followed by hallucination and a series of blunders. Despite this, o4-mini managed to checkmate twice in this match—an impressive feat, considering how troublesome it is for AIs to see the entire board.

Gemini 2.5 Pro 4-0 Claude 4 Opus

The Gemini 2.5 Pro vs. Claude 4 Opus match was the only one with more games finishing by checkmate than by illegal-move forfeits. However, it’s not clear how much better Gemini 2.5 Pro really is at chess, and how much of its victory was actually due to Claude 4 Opus’s poor play.

An interesting moment happened in game four of the match, when Gemini 2.5 Pro had a 32-point material advantage with, among other pieces, two queens on the board. Despite all its firepower, it still hung some pieces on its way to checkmating the opponent:

But one interesting game to analyze is the first of the match. Both AIs held their own and played good moves up until move nine. At that point, Claude 4 Opus, playing Black, made a hasty decision and played 10…g5, blundering a pawn and shattering its king’s shelter, which expedited its defeat. The comments from each AI are revealing:

Grok 4 4-0 Gemini 2.5 Flash 

The strongest performance today came from Grok 4. Beyond scoring four full points, Grok 4 also played the best chess so far. Granted, Gemini 2.5 Flash made its opponent’s life easier by blundering multiple pieces. However, unlike the other AIs, Grok 4 seemed very intentional about identifying and capitalizing on undefended pieces.

A good example comes from our Game of the Day, which GM Rafael Leitao will analyze below:

Grok 4’s strong showing caught the attention of the tech world, including its creator. In a brief exchange on X, Elon Musk, who famously said chess is too simple, struck again:

So far, LLMs have displayed three key weaknesses when it comes to playing chess: visualizing the entire board, understanding how pieces interact with each other, and making legal moves (which is often a result of the first two shortcomings mentioned). For now, Grok 4 has shown that it doesn’t share the same limitations.

It will be interesting to see if the AIs’ weaknesses and strengths hold throughout the entire tournament. To find out, tune in to the live broadcast of the semifinals tomorrow!

The action continues tomorrow with the semifinals. Watch the event live!

The Kaggle Game Arena AI Chess Exhibition Tournament, which takes place from August 5-7, is an event organized by Google in their new Kaggle Game Arena, where some of the world’s leading Large Language Models (LLMs) compete in a series of chess games. The LLMs compete in a single-elimination bracket. 




