Researchers at the University of California San Diego have turned to tabletop roleplaying games to test the limits of Large Language Models, finding that while they can follow rules, they often prioritise dramatic flair over tactical survival.
If you have ever played in a Dungeons & Dragons group where the Paladin insists on delivering a rousing speech while the party's rogue is actively bleeding out, you may find the latest research from the University of California San Diego strangely validating. It turns out that when Artificial Intelligence tries to play D&D, it develops exactly the same habit.

In a new paper titled “Setting the DC: Tool-Grounded D&D Simulations to Test LLM Agents,” researchers argue that the complex, open-ended nature of TTRPGs makes them the ideal testing ground for AI. Unlike chess or Go, D&D requires long-term planning, team coordination, and strict adherence to a ruleset that, let’s be honest, even veteran human players argue about.
The Dungeon Crawl Simulation
The team created “D&D Agents,” a simulator that pits AI against AI in a closed-loop combat environment. In these scenarios, Large Language Models take on every role: the Dungeon Master, the players, and the monsters.
To keep things fair and to stop the AI from simply inventing convenient outcomes, the system used a “tool-grounded” approach. This means the AI couldn’t just narrate “I hit the goblin”; it had to use a code-based tool to roll the virtual dice and check the game state.
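The paper doesn't publish its tool code, but the idea is easy to sketch. The snippet below is an illustrative mock-up, not the researchers' actual API: the names (`roll_d20`, `GameState`, `attack`) and numbers are assumptions chosen to show the principle that the model must resolve outcomes through code rather than narration.

```python
import random

def roll_d20() -> int:
    """The model can't narrate an outcome; it must call a tool to get a roll."""
    return random.randint(1, 20)

class GameState:
    """Ground truth the simulator holds, outside the model's control."""
    def __init__(self, hp: int, ac: int):
        self.hp = hp  # hit points remaining
        self.ac = ac  # armour class: the number the attack roll must meet

def attack(target: GameState, attack_bonus: int, damage: int) -> str:
    """Resolve an attack via the tool and report the verified game state."""
    if target.hp <= 0:
        # The state check the models sometimes ignored in the study
        return "target is already dead"
    roll = roll_d20() + attack_bonus
    if roll >= target.ac:
        target.hp -= damage
        return f"hit (rolled {roll}), target hp now {target.hp}"
    return f"miss (rolled {roll})"

goblin = GameState(hp=7, ac=13)
print(attack(goblin, attack_bonus=4, damage=5))
```

Because the dice and hit points live in code the model can only query, "I hit the goblin" is just flavour text until the tool confirms it.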
The researchers tested three major models across 27 different combat scenarios, ranging from a goblin ambush to a cave battle. The results might surprise those following the “AI Wars” closely. Claude 3.5 Haiku took the crown, proving the most reliable at following instructions and using game tools correctly. GPT-4o followed closely behind, while DeepSeek-V3 trailed significantly, struggling more often with the game’s logic.
Roleplaying over Roll-playing
Perhaps the most “Geek Native” finding of the study was that the models often struggled to separate “narrative flavour” from “tactical reality.”
The paper notes several instances of “quirky behaviour” where the models attempted to imbue the simulation with personality, even when it wasn’t tactically sound. Goblins would stop mid-fight to taunt players with lines like “Heh! Shiny man’s gonna bleed!” Warlocks became overly dramatic in mundane situations.
Most amusingly, the researchers observed Paladins delivering heroic speeches “for no reason” as they stepped into the line of fire. It seems that in training these models to be helpful and conversational, we have accidentally created the ultimate “Roleplay > Min-Max” gamer.
However, the study also highlighted the limitations of the current tech: the models still hallucinated game state. In one example, a model correctly checked an enemy's hit points, saw they were at 0 (dead), and then decided to attack the corpse anyway.
Why this matters
While this is an academic paper rather than a commercial product announcement, it has implications for the future of Virtual Tabletops. Companies are already racing to integrate AI Game Masters and NPCs into their platforms.
This research suggests that while AI can handle the “maths” of Pathfinder or D&D, it still struggles with the “common sense” required to run a coherent long-term campaign without getting distracted by its own dramatic narration.
For now, if you want a tactical challenge that doesn’t involve a robot monologuing at you, you are still better off gathering a group of human friends, grabbing the core rulebooks from Barnes & Noble, or downloading a community-written module from the DMsGuild. And if you really want to see a tactical battle, you can always build it yourself with the new Dungeons & Dragons Lego sets.
The full paper, “Setting the DC,” is available on OpenReview (PDF Link).
Photo by 2H Media on Unsplash.