Summary
When we launchedAI Diplomacyearlier this month, we were excited to share what we felt was an innovative AI benchmark, built as a game that anyone could watch and enjoy. The response from readers and the AI research community has been fantastic, and today we’re revealing the details of howAlex Duffy’s labor of love came together.—Michael ReillyWas this newsletter forwarded to you?Sign upto get it in your inbox."Could you play the game given this context?" That simple question transformed our barely workingAI demointo something 50,000 people watched live on Twitch and millions would see around the world.I builtAI Diplomacyalong with my friend and developerTyler Marquesbecause we strongly believe in the power of benchmarks to help us learn about AI and shape its development. By modifying the classic strategy game Diplomacy into a game that AIs could play against one another, we felt we could accomplish two goals at once: We’d be able to measure how models work in a complex environment, and do it in a way that can appeal to regular people (everyone loves games!).After a few weeks of toil, though, we were struggling to make it work. The large language models (LLMs) couldn’t handle the vast amount of game data we were throwing at them. ThenSam Paech, an AI researcher and one of our collaborators, posed that fateful question about context, and our mistake snapped into focus: We were thinking too much like machines. Instead of force-feeding the LLM’s game data, we needed totell them a story.The story of our team’s first-hand experience building the game AI Diplomacy is also about the broader lessons we learned on how to effectivelycommunicate knowledge about the worldto LLMs—something that is crucial to building complex agentic AI systems. 
The process was instructive on a number of levels:

- We learned (the hard way) why orchestrating knowledge is so critical.
- We learned how to turn a fragile demo into a robust, production-grade system (including music, visuals, and synthetically voiced narration).
- We'll share several practical insights into building intuitive, approachable benchmarks.

The most important of these lessons was the first one: thinking carefully about how to tell models what they need to know, as Sam put it. This is orchestrating knowledge, or engineering context.

Context engineering is communication

Context engineering is less like operating a scientific instrument and more like mastering a musical one. At its core is the deeply human skill of communicating clearly. We can now use the same skill that enables us to tell great stories around a fire to talk to an LLM and turn our dream into reality.

That was my approach to building AI Diplomacy, but it came with a number of obstacles. Early on, a big one was figuring out how to represent the complex visual map of Diplomacy as text. Our first attempt was to faithfully convert tables of every territory on the map of 1901 Europe, every one of the players' units, and every possible move into sprawling lists that grew longer each turn. The models choked. Games froze mid-move. None of the AIs could form a coherent strategy; they just took random actions, many of which violated basic game rules. It was clear we hadn't communicated our intentions well at all.

Then Sam asked the question that changed everything for the project: "Could you play the game given this context?" The answer was obviously no. We'd been speaking to the AI in database queries when it needed the story behind the game, in the same way we understood it. So we rebuilt everything in plain text: clear summaries of who controlled what, which territories connected where, and what moves mattered right now.

Then we asked ourselves: What other context do we as humans think about while playing?
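To make the shift from tables to story concrete, here is a minimal sketch of rendering a power's position as plain prose instead of raw lists. The data structures and the function name are illustrative assumptions for this example, not the project's actual code.

```python
# Illustrative sketch: render a power's board position as plain prose
# instead of raw tables. All names and structures are assumptions for
# demonstration, not the actual AI Diplomacy code.

def narrative_context(power: str, state: dict) -> str:
    """Summarize who controls what and which moves matter, as a story."""
    centers = state["centers"][power]
    lines = [
        f"You are playing {power}.",
        f"You control {len(centers)} supply centers: {', '.join(centers)}.",
    ]
    for unit in state["units"][power]:
        moves = ", ".join(state["adjacency"][unit["loc"]])
        lines.append(f"Your {unit['kind']} in {unit['loc']} can move to: {moves}.")
    return "\n".join(lines)

state = {
    "centers": {"FRANCE": ["Paris", "Marseilles", "Brest"]},
    "units": {"FRANCE": [{"kind": "army", "loc": "Paris"}]},
    "adjacency": {"Paris": ["Burgundy", "Picardy", "Gascony"]},
}
print(narrative_context("FRANCE", state))
```

The point is not the formatting details but the framing: the model reads a short, self-contained account of its situation rather than an ever-growing dump of game tables.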
How we relate to other players, whether as allies, enemies, or neutral parties, was an obvious one for tracking trustworthiness. So we gave the models relationships they could update each turn, scoring their opponents from -2 (enemy) to 2 (ally). We also created private diaries for models to clearly define their goals and reflect on past moves, negotiations, and outcomes. From these, we had the system generate summaries calling out what happened each turn. The AI players could read these to understand the latest moves and decide what to do next.

The diary and relationships between Gemini 2.5 Flash (France) and o3 (Italy) as they plot to betray each other. All screenshots courtesy of the author.

If you're curious to look under the hood at the actual prompts and code, it's all open source. I'd recommend browsing the DeepWiki distillation of our GitHub repo. It's a great way to get a decent understanding at a glance.

Diagram explaining the prompt engineering system, from https://deepwiki.com/Alx-AI/AI_Diplomacy/2.4-prompt-engineering-system.

Another key component of the project was the interface.

Making it real and accessible

If context engineering is about clear communication between you and the model, the interface is about clear communication between the system and everyone else. The interface is so important to achieving broad appeal that in some ways it is the product. So making sure the project was as engaging and approachable as possible was a priority.

With that in mind, it felt like our launch needed to be a live stream on Twitch. I thought about how fascinated I was by the more than 1 million people engaging in the social experiment Twitch Plays Pokémon, or how delightful it was to watch AI try to generate a blocky, rough, knockoff version of Seinfeld, forever. Tyler quickly transformed my clunky initial attempts into a polished, interactive 3D interface in the browser with the library Three.js so anyone could follow along easily(ish), with live streaming built in.

The cherry on top was audio.
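Before moving on: the relationship scores and private diaries described earlier in this section can be sketched roughly as follows. The class, field, and method names are hypothetical, chosen for illustration rather than taken from the repo; only the -2 (enemy) to 2 (ally) scale comes from the text above.

```python
from dataclasses import dataclass, field

# Hypothetical sketch of per-power memory: relationship scores on the
# -2 (enemy) to 2 (ally) scale, plus a private diary. Names are
# illustrative, not the project's actual code.

LABELS = {-2: "enemy", -1: "unfriendly", 0: "neutral", 1: "friendly", 2: "ally"}

@dataclass
class PowerMemory:
    relationships: dict = field(default_factory=dict)  # other power -> score
    diary: list = field(default_factory=list)          # private turn reflections

    def update_relationship(self, other: str, delta: int) -> None:
        # Clamp to the -2..2 scale so scores stay interpretable.
        score = self.relationships.get(other, 0) + delta
        self.relationships[other] = max(-2, min(2, score))

    def log(self, turn: str, note: str) -> None:
        self.diary.append(f"[{turn}] {note}")

    def summary(self) -> str:
        # A turn summary the model could read before deciding its next move.
        rel = "; ".join(f"{p}: {LABELS[s]}" for p, s in self.relationships.items())
        return f"Relationships: {rel}\nRecent diary:\n" + "\n".join(self.diary[-3:])

france = PowerMemory()
france.update_relationship("ITALY", 2)
france.update_relationship("GERMANY", -3)  # clamps to -2
france.log("Spring 1901", "Agreed to a DMZ in Piedmont; Germany is massing in Burgundy.")
print(france.summary())
```

Keeping the scale coarse (five labeled levels instead of free-floating numbers) is the design choice that matters here: it gives the model a vocabulary it can reason with turn over turn.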
If this was going to be a live stream, I wanted it to be something people could broadcast in the background. I'm no audio expert, but I'm passionate about sound: I've played piano for 20 years and minored in music. I spent about six hours in the AI music generation tool Udio, a couple more with open-source mixing software, and ended up with a 20-minute loopable space ballad, "Diplomacy Toujours," that I was proud of.

Progress of UI development.

Lessons learned

We learned a lot of practical technical lessons building this, including how:

- Inference speed creates practical constraints on AI systems
- Structured outputs hamper model creativity and performance
- Step-by-step reasoning dramatically reduces hallucinations
- Context quality reveals distinct model personalities