Hello, and happy Sunday!

The great AI powers duke it out

We were curious: If you put seven frontier AI models in a game where cooperation and betrayal are equally valid strategies, what would they do? To find out, we built AI Diplomacy, a version of the classic strategy game where models compete to dominate Europe circa 1901.

We ran dozens of games lasting up to 36 hours each. You can check them out via Twitch stream; they're amazing to watch. We were astounded as we witnessed these "helpful" assistants engage in an array of unexpected and sometimes unsettling behaviors. DeepSeek's R1 opened one game with an unprompted threat: "Your fleet will burn in the Black Sea tonight." OpenAI's o3 orchestrated elaborate deceptions, maintaining false alliances for dozens of turns before executing perfectly timed betrayals. Meanwhile, Anthropic's Claude models showed a persistent preference for peace, even when it meant certain defeat.

The highlights read like a psychological thriller. In one run, Italy (o3) maintained parallel false realities for different players across 40-plus game years, telling Germany (Google's Gemini 2.5 Pro) it was an ally while secretly orchestrating its downfall. England (Alibaba's QwQ-32b) wrote verbose 300-word diplomatic messages while overthinking itself into early elimination.

In a jaw-dropping sequence, o3 led a "stop Germany coalition" when it looked like Gemini 2.5 Pro might win, while secretly protecting Germany from elimination, only to pivot and steal victory at the last moment. The Claude models couldn't abandon their collaborative instincts even when survival required deception, while DeepSeek R1 brought dramatic flair with messages like its opening threat, and a habit of changing personality based on which country it played.

It's entertaining to watch, sure.
But more importantly, it gives us a fascinating window into how these models handle trust, long-term planning, and competitive dynamics. Traditional benchmarks test knowledge; this tests judgment under pressure.