Games have long been a proving ground for new AI advancements — from Deep Blue’s victory over chess grandmaster Garry Kasparov, to AlphaGo’s mastery of Go, to Pluribus out-bluffing the best humans in poker. But truly useful, versatile agents will need to go beyond just moving pieces on a board. Can we build more effective and flexible agents that can use language to negotiate, persuade, and work with people to achieve strategic goals the way humans do?

Today, we’re announcing a breakthrough toward building AI that has mastered these skills. We’ve built an agent – CICERO – that is the first AI to achieve human-level performance in the popular strategy game Diplomacy. CICERO demonstrated this by playing on webDiplomacy.net, an online version of the game, where it achieved more than double the average score of the human players and ranked in the top 10 percent of participants who played more than one game.
Diplomacy has been viewed for decades as a near-impossible grand challenge in AI because it requires players to master the art of understanding other people’s motivations and perspectives; make complex plans and adjust strategies; and then use natural language to reach agreements with other people, convince them to form partnerships and alliances, and more. CICERO is so effective at using natural language to negotiate with people in Diplomacy that human players often preferred working with CICERO over other human participants.
Unlike chess and Go, Diplomacy is a game about people rather than pieces. If an agent can't recognize that someone is likely bluffing or that another player would see a certain move as aggressive, it will quickly lose the game. Likewise, if it doesn't talk like a real person – showing empathy, building relationships, and speaking knowledgeably about the game – it won't find other players willing to work with it.
The key to our achievement was developing new techniques at the intersection of two completely different areas of AI research: strategic reasoning, as used in agents like AlphaGo and Pluribus, and natural language processing, as used in models like GPT-3, BlenderBot 3, LaMDA, and OPT-175B. CICERO can deduce, for example, that later in the game it will need the support of one particular player, and then craft a strategy to win that person’s favor – and even recognize the risks and opportunities that that player sees from their particular point of view.
We’ve open-sourced the code and published a paper to help the wider AI community use CICERO to spur further progress in human-AI cooperation. You can also visit the CICERO website to learn more about the project and see the agent in action. Interested researchers can submit a proposal to the CICERO RFP to gain access to the data.
Under the hood: How we built CICERO
At the heart of CICERO is a controllable dialogue model for Diplomacy coupled with a strategic reasoning engine. At each point in the game, CICERO looks at the game board and its conversation history, and models how the other players are likely to act. It then forms a plan and uses it to control a language model that generates free-form dialogue, informing other players of its plans and proposing actions for them that coordinate well with its own.
Controllable dialogue
To build a controllable dialogue model, we started with a 2.7 billion parameter BART-like language model pre-trained on text from the internet and fine-tuned on over 40,000 human games on webDiplomacy.net. We developed techniques to automatically annotate messages in the training data with corresponding planned moves in the game, so that at inference time we can control dialogue generation to discuss specific desired actions for the agent and its conversation partners. For example, if our agent is playing as France, conditioning the dialogue model on a plan involving England supporting France into Burgundy might yield a message to England like, “Hi England! Are you willing to support me into Burgundy this turn?” Controlling generation in this manner allows CICERO to ground its conversations in a set of plans that it develops and revises over time to better negotiate. This helps the agent coordinate with and persuade other players more effectively.
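To make the conditioning idea concrete, here is a minimal sketch of intent-conditioned generation. The prompt format and the off-the-shelf BART checkpoint below are stand-ins chosen purely for illustration; the fine-tuned 2.7B model and the actual annotation scheme are described in the paper.

```python
# A minimal sketch of intent-conditioned message generation. The prompt format
# and the off-the-shelf "facebook/bart-large" checkpoint are illustrative
# stand-ins, not CICERO's fine-tuned dialogue model.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large")
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-large")

def generate_message(game_state: str, dialogue_history: str, intent: str) -> str:
    """Generate a message conditioned on the current plan ("intent")."""
    prompt = (
        f"STATE: {game_state}\n"
        f"HISTORY: {dialogue_history}\n"
        f"INTENT: {intent}\n"  # e.g. "FRANCE: A PAR - BUR; ENGLAND: F ENG S A PAR - BUR"
        "MESSAGE TO ENGLAND:"
    )
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True)
    output_ids = model.generate(**inputs, max_new_tokens=60, do_sample=True, top_p=0.9)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```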
Step 1: Using the board state and current dialogue, CICERO makes an initial prediction of what everyone will do.
Step 2: CICERO iteratively refines that prediction using planning, and then uses those predictions to form an intent for itself and its partner.
Step 3: It generates several candidate messages based on the board state, dialogue, and its intents.
Step 4: It filters the candidate messages to reduce nonsense, maximize value, and ensure consistency with its intents. (A schematic of this loop is sketched below.)
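Putting the four steps together, here is a schematic of what a single CICERO turn could look like. Every component name and method signature below is a hypothetical placeholder rather than part of the released CICERO code; the point is only to show how prediction, planning, generation, and filtering fit together.

```python
# A schematic of the four-step turn loop above. All components (strategy,
# generator, filters) are hypothetical placeholders for illustration.
def play_turn(board, dialogue, strategy, generator, filters, num_candidates=8):
    # Step 1: initial prediction of every player's policy from board + dialogue.
    anchor_policies = strategy.predict_policies(board, dialogue)
    # Step 2: refine those predictions with planning, then pick intents for
    # CICERO and its conversation partner.
    refined_policies = strategy.plan(board, anchor_policies)
    intents = strategy.choose_intents(refined_policies)
    # Step 3: draft several candidate messages conditioned on board, dialogue,
    # and the chosen intents.
    candidates = [generator.sample(board, dialogue, intents) for _ in range(num_candidates)]
    # Step 4: keep only candidates that pass every filter, then send the one
    # the strategy engine scores highest.
    viable = [m for m in candidates
              if all(f.accepts(m, board, dialogue, intents) for f in filters)]
    best_message = max(viable, key=lambda m: strategy.message_value(m, intents), default=None)
    return intents, best_message
```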
We further improve dialogue quality using several filtering mechanisms – such as classifiers trained to distinguish between human and model-generated text – that ensure that our dialogue is sensible, consistent with the current game state and previous messages, and strategically sound.
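As one illustration of such a filter, the sketch below scores a candidate message with a human-vs-model text classifier and rejects anything below a threshold. The backbone model, the pairing of context and candidate, and the threshold are assumptions made for this example; in practice such a classifier would be trained on human versus model-generated Diplomacy messages.

```python
# A sketch of one possible filter: a binary classifier scoring whether a
# candidate message looks human-written given its context. The "roberta-base"
# backbone, the untrained classification head, and the 0.5 threshold are all
# illustrative assumptions, not details of CICERO's actual filters.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

clf_tokenizer = AutoTokenizer.from_pretrained("roberta-base")
clf_model = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=2)

def passes_filter(context: str, candidate: str, threshold: float = 0.5) -> bool:
    """Return True if the candidate message is scored as plausibly human-written."""
    inputs = clf_tokenizer(context, candidate, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = clf_model(**inputs).logits
    p_human = torch.softmax(logits, dim=-1)[0, 1].item()  # class 1 = "human-like"
    return p_human >= threshold
```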
Dialogue-aware strategy & planning
Past superhuman agents in adversarial games like chess, Go, and poker were created through self-play reinforcement learning (RL) – having the agent learn optimal policies by playing millions of games against copies of itself. However, games involving cooperation require modeling what humans will actually do in real life, rather than modeling what they should do if they were perfect copies of the bot. In particular, we want CICERO to make plans that are consistent with its dialogue with other players.
The classic approach to human modeling is supervised learning, where the agent is trained with labeled data such as a database of human players’ actions in past games. However, relying purely on supervised learning to choose actions based on past dialogue results in an agent that is relatively weak and highly exploitable. For example, a player could tell the agent, "I'm glad we agreed that you will move your unit out of Paris!" Since similar messages appear in the training data only when an agreement was reached, the agent might indeed move its unit out of Paris even if doing so is a clear strategic blunder.
To fix this, CICERO runs an iterative planning algorithm that balances dialogue consistency with rationality. The agent first predicts everyone's policy for the current turn based on the dialogue it has shared with other players, and also predicts what other players think the agent's policy will be. It then runs a planning algorithm we developed called piKL, which iteratively improves these predictions by trying to choose new policies that have higher expected value given the other players' predicted policies, while also trying to keep the new predictions close to the original policy predictions. We found that piKL better models human play and leads to better policies for the agent compared to supervised learning alone.
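The sketch below illustrates the core idea behind piKL on a toy two-player matrix game: each player's policy is repeatedly updated toward a best response against the other's current policy, while an anchor-regularization term (with weight lam) keeps it close to a policy learned from human play. This is a simplified toy, not the exact update rule or the Diplomacy action space used in the paper.

```python
# A toy illustration of the idea behind piKL on a two-player matrix game:
# policies move toward higher expected value against the other player's
# current policy, while staying close to a human-like "anchor" policy.
import numpy as np

def pikl_toy(payoff_1, payoff_2, anchor_1, anchor_2, lam=1.0, iters=100):
    pi_1, pi_2 = anchor_1.copy(), anchor_2.copy()
    avg_1, avg_2 = np.zeros_like(pi_1), np.zeros_like(pi_2)
    for _ in range(iters):
        # Expected value of each action against the other player's current policy.
        q_1 = payoff_1 @ pi_2        # payoff_1[a1, a2]: player 1's payoff
        q_2 = payoff_2.T @ pi_1      # payoff_2[a1, a2]: player 2's payoff
        # Anchor-regularized response: proportional to anchor * exp(value / lam).
        pi_1 = anchor_1 * np.exp(q_1 / lam); pi_1 /= pi_1.sum()
        pi_2 = anchor_2 * np.exp(q_2 / lam); pi_2 /= pi_2.sum()
        avg_1 += pi_1; avg_2 += pi_2
    return avg_1 / iters, avg_2 / iters
```

In this toy, a larger lam keeps play close to the human-like anchor (and hence to what the agent's dialogue implies), while a smaller lam lets the policy drift further toward pure value maximization.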
Generating natural, purposeful dialogue
In Diplomacy, how a player talks to other people can be even more important than how they move their pieces. CICERO is able to speak clearly and persuasively when strategizing with other players. For example, in one demonstration game CICERO asked one player for immediate support on one part of the board while pressing another to consider an alliance later in the game.
In these exchanges, CICERO tries to execute its strategy by proposing moves to three different players. In the second dialogue, the agent is able to tell the other player why they should cooperate and how it will be mutually beneficial. In the third, CICERO is both soliciting information and laying the groundwork for future moves.
Where there is still room for improvement
It is important to recognize that CICERO also sometimes generates inconsistent dialogue that can undermine its objectives. In the example below, where CICERO was playing as Austria, the agent contradicts its first message, which asked Italy to move to Venice. While our suite of filters aims to detect these sorts of mistakes, it is not perfect.
Diplomacy as a sandbox for advancing human-AI interaction
The emergence of goal-oriented dialogue systems in a game that involves both cooperation and competition raises important social and technical challenges in aligning AI with human intentions and objectives. Diplomacy provides a particularly interesting environment for studying this because playing the game requires wrestling with conflicting objectives and translating those complex goals into natural language. As a simple example, a player might choose to compromise on short-term gains in order to maintain an alliance, on the chance that this ally will help them into an even better position on the next turn.
While we’ve made significant headway in this work, both the ability to robustly align language models with specific intentions and the technical (and normative) challenge of deciding on those intentions remain open and important problems. By open-sourcing the CICERO code, we hope that AI researchers can continue to build on our work in a responsible manner. We have taken early steps toward detecting and removing toxic messages in this new domain by using our dialogue model for zero-shot classification. We hope Diplomacy can serve as a safe sandbox to advance research in human-AI interaction.
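For readers who want a concrete starting point, the sketch below shows what zero-shot message screening can look like. As a stand-in it uses a generic NLI-based zero-shot classifier; the approach described above instead repurposes CICERO's own dialogue model, which is not what is shown here, and the labels and threshold are illustrative assumptions.

```python
# A minimal sketch of zero-shot toxicity screening using a generic NLI-based
# zero-shot classifier as a stand-in (not CICERO's dialogue-model approach).
from transformers import pipeline

detector = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

def is_toxic(message: str, threshold: float = 0.8) -> bool:
    result = detector(message, candidate_labels=["toxic or abusive", "civil"])
    scores = dict(zip(result["labels"], result["scores"]))
    return scores["toxic or abusive"] >= threshold
```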
Future directions
While CICERO is only capable of playing Diplomacy, the technology behind this achievement is relevant to many real-world applications. Controlling natural language generation via planning and RL could, for example, ease communication barriers between humans and AI-powered agents. For instance, today's AI assistants excel at simple question-answering tasks, like telling you the weather, but what if they could maintain a long-term conversation with the goal of teaching you a new skill? Alternatively, imagine a video game in which the non-player characters (NPCs) could plan and converse like people do — understanding your motivations and adapting the conversation accordingly — to help you on your quest of storming the castle.
We’re excited about the potential for future advances in these areas and seeing how others build on our research.