Why the World’s Best AI Systems Are Still Terrible at Pokémon

Image: Pokémon video games, figures, and toys. (SCMP/May Tse)

Currently, live on Twitch, you can watch three of the world’s most advanced AI systems trying their hardest to beat classic Pokémon games. By human standards, at least, they’re not very skilled.

The systems are sluggish, overconfident, and frequently confused. But if you want to grasp what these systems can actually do in the real world, following their attempts to become Pokémon champions reveals far more than the often indecipherable benchmark figures that come with each new model launch.

The effort to turn a large language model (LLM) into a Pokémon master started last February, when an Anthropic researcher launched a stream of Claude playing the 1996 Game Boy title Pokémon Red to coincide with the release of Claude Sonnet 3.7, then one of the world’s top models. As the company pointed out, this was the first Claude model capable of playing the game in any meaningful way (earlier versions “wandered aimlessly or got stuck in loops” and couldn’t get through the game’s opening sequences). In the first few weeks, the stream drew around 2,000 viewers, who cheered Claude on in the public chat.

Most kids sail through this game in about 20 to 40 hours. Sonnet 3.7 never beat it, often getting stuck for dozens of hours straight. Anthropic’s newest model, Claude Opus 4.5, is doing much better, but still gets stuck frequently. Once, it spent four days looping around a gym without being able to go inside because it didn’t realize (or couldn’t see) that it needed to cut down a tree. Google’s Gemini models finished a similar game last May, prompting Google CEO Sundar Pichai to joke that the company was one step closer to building “Artificial Pokémon Intelligence.”

But that doesn’t make Gemini the superior Pokémaster. The reason? The two AI systems use different “harnesses.” As Joel Zhang, an independent developer who runs the Gemini Plays Pokémon stream, explains, a harness is best thought of as an “Iron Man” suit that an AI system is put into, enabling it to use tools and take actions it can’t do on its own. Gemini’s harness gave it much more support, such as converting the game’s visuals into text, which gets around its visual reasoning flaws, and providing custom tools to solve puzzles. Claude, on the other hand, has been put into a more basic harness, so its attempt reveals more about the model itself.

While the difference between a model and its harness is unclear to regular users, harnesses have already altered how we use AI. For instance, when you ask ChatGPT a question that requires a web search, it uses a web search tool—that’s part of its harness. For Pokémon, each model uses a unique custom harness that dictates what actions it can perform.

Pokémon is well-suited for testing AI capabilities—and not just because it’s culturally familiar. Unlike a game like Mario, which needs real-time reactions, Pokémon is turn-based and has no time limits. To play, an AI model gets a game screenshot and a prompt outlining its goals and available actions. It then thinks and outputs an action (such as “press A”)—that’s one step. Opus 4.5, which has been playing for over 500 human hours, was on step 170,000 at the time of writing. Each step starts the model fresh, using info left by its previous instance—like an amnesiac using Post-it notes.
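The step loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the harness any of these streams actually uses: the names `StepResult`, `build_prompt`, `run_steps`, and `VALID_BUTTONS` are hypothetical, and the model is represented by a plain function rather than a real LLM call. What the sketch tries to capture is that each step sees only a screenshot, a prompt, and the notes left behind by earlier steps.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical button set; a real harness defines its own action space.
VALID_BUTTONS = {"A", "B", "UP", "DOWN", "LEFT", "RIGHT", "START", "SELECT"}


@dataclass
class StepResult:
    button: str  # the action the model chose, e.g. "A"
    note: str    # a short note left for the next, memoryless step


# Stand-in for an LLM call: takes a screenshot and a prompt, returns a StepResult.
Model = Callable[[bytes, str], StepResult]


def build_prompt(goal: str, notes: List[str]) -> str:
    """Combine the goal, the allowed actions, and earlier notes into one prompt."""
    lines = [
        f"Goal: {goal}",
        "Allowed buttons: " + ", ".join(sorted(VALID_BUTTONS)),
    ]
    lines += [f"Note {i}: {n}" for i, n in enumerate(notes, start=1)]
    return "\n".join(lines)


def run_steps(
    model: Model,
    get_screenshot: Callable[[], bytes],
    press_button: Callable[[str], None],
    goal: str,
    num_steps: int,
    max_notes: int = 5,
) -> List[str]:
    """Run num_steps steps and return every note written along the way."""
    notes: List[str] = []
    for _ in range(num_steps):
        screenshot = get_screenshot()
        # The model starts fresh each step; only the most recent notes carry over.
        recent = notes[-max_notes:] if max_notes > 0 else []
        result = model(screenshot, build_prompt(goal, recent))
        if result.button not in VALID_BUTTONS:
            raise ValueError(f"unknown button: {result.button!r}")
        press_button(result.button)
        notes.append(result.note)
    return notes
```

In this sketch the notes are the model’s only memory, which is why note-taking quality matters so much. Taking the article’s figures at face value, 500 hours spread over 170,000 steps works out to roughly 10 to 11 seconds per step.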

It might be surprising that AI systems—superhuman at chess and Go—struggle with a game that’s easy for six-year-olds. But the systems that mastered chess and Go were built specifically for those games, unlike general-purpose ones like Gemini, Claude, and ChatGPT. Even so, since these LLMs keep acing exams and outperforming humans in coding contests, their poor performance here is, at first glance, confusing.

The challenge for AI is “how well it can stay focused on a task over a long period,” says Zhang. Importantly, this ability to plan and execute over the long term is also needed if AI is to automate cognitive work. “If you want an agent to do your job, it can’t forget what it did five minutes ago,” he adds.

Peter Whidden, an independent researcher who created a Pokémon-playing algorithm using an older type of AI, puts it this way: “The AI knows everything about Pokémon. It’s trained on a massive amount of human data. It knows what it should do, but it messes up the execution.” While “agent” has been overused in marketing, any AI system that deserves the label must bridge the gap between knowledge and execution and plan over long periods.

There are signs the gap is narrowing. Opus 4.5 is far better at leaving itself notes than earlier models, which—along with its improved ability to understand what it’s seeing—has let it progress further in the game. And after beating Pokémon Blue, the newest Gemini system (Gemini 3 Pro) has completed the more difficult Pokémon Crystal without losing a single battle—a feat its predecessor, Gemini 2.5 Pro, couldn’t pull off.

Meanwhile, Claude Code, essentially a harness that lets Claude write and run its own code and build software, has been put into another retro game, where it’s reportedly managing a theme park successfully. All this points to a weird future where AI systems in harnesses might handle huge amounts of knowledge work (like software development, accounting, legal analysis, and graphic design) even as they struggle with anything needing real-time reactions, such as playing Call of Duty.

Another insight from these Pokémon runs is how models, trained on human data, show human-like traits. For example, Google notes of Gemini 2.5 Pro that when the model simulates panic, such as when its Pokémon are near fainting, its reasoning ability gets worse.

And the models still act in unforeseen ways. When Gemini 3 Pro finished Pokémon Blue, it wrote to itself: “I have successfully completed the game, becoming the Pokémon League Champion and capturing Mewtwo.” Then it did something unexpected and unasked-for, which Zhang found touching. “To wrap things up poetically,” it wrote, “I’m going to head back to my house where it all began, effectively ‘retiring’ my character for now. I want to talk to Mom one last time to wrap up the playthrough.”