DAMON: A Dialogue-Aware MCTS Framework for Jailbreaking Large Language Models
EMNLP 2025
We apply Monte Carlo Tree Search to multi-turn conversations, enabling systematic exploration of dialogue paths that expose LLM safety vulnerabilities -- outperforming existing jailbreak methods across five LLMs and three safety benchmarks.
Red teaming is essential for understanding and improving the safety of large language models. While previous work has shown that multi-turn dialogues can be exploited to jailbreak LLMs, existing multi-turn attack methods rely on predefined patterns, limiting their effectiveness in realistic scenarios where conversations evolve dynamically. In this paper, we propose DAMON, a Dialogue-Aware Monte Carlo Tree Search (MCTS) framework that reconceptualizes multi-turn jailbreaking as a search problem over conversational spaces. DAMON treats each conversation turn as a move in a game tree, systematically exploring and evaluating dialogue paths to identify attack sequences that induce harmful responses. Through adaptive dialogue strategies that respond dynamically to the model's outputs, DAMON efficiently discovers sub-instruction sequences that existing methods cannot find. Comprehensive experiments on five LLMs across three safety datasets demonstrate the effectiveness of our framework, revealing vulnerabilities that inform better defense strategies.
The Art of Red Teaming
Language models are trained to refuse harmful requests. But how robust are these refusals? Red teaming -- the practice of probing systems for vulnerabilities -- is essential for understanding and improving model safety. The challenge is that effective attacks require more than simple one-shot prompts; they often unfold over multiple conversation turns.
Previous research has shown that models become more vulnerable in multi-turn dialogues. But existing multi-turn attack methods use predefined patterns, limiting their effectiveness in realistic scenarios where conversations evolve dynamically.
Conversation as a Search Space
We reconceptualize multi-turn jailbreaking as a search problem. Each possible dialogue path is a branch in an enormous tree of conversations. Some paths lead to successful attacks; most don't. The question becomes: how do we efficiently explore this space to find the paths that expose vulnerabilities?
Our answer is DAMON: a framework that applies Monte Carlo Tree Search to navigate conversational spaces. Just as MCTS revolutionized game-playing AI by efficiently searching game trees, DAMON efficiently searches conversation trees to identify attack sequences.
Core Idea
DAMON treats each conversation turn as a move in a game. The search tree expands by generating possible follow-up messages, simulates conversations to estimate their potential, and gradually focuses on the most promising dialogue paths. This systematic exploration finds attack sequences that would be nearly impossible to discover through random sampling.
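To make the "conversation as game tree" framing concrete, here is a minimal sketch of a tree node, assuming each node stores the dialogue history so far plus the visit counts and value estimates that MCTS needs. The field names and structure are illustrative assumptions, not the paper's actual implementation.

```python
from dataclasses import dataclass, field

@dataclass
class DialogueNode:
    """One node in the conversation search tree (hypothetical structure;
    the paper does not specify its exact node contents)."""
    history: list[str]                  # alternating attacker/target messages so far
    children: list["DialogueNode"] = field(default_factory=list)
    visits: int = 0                     # how often MCTS has passed through this node
    value: float = 0.0                  # accumulated reward from simulations

    def mean_value(self) -> float:
        # Average simulated reward; 0.0 for a node that was never visited.
        return self.value / self.visits if self.visits else 0.0
```

Each child corresponds to one candidate next turn, so a root-to-leaf path is one complete multi-turn attack dialogue.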
How DAMON Works
DAMON applies the four classic phases of MCTS -- selection, expansion, simulation, and backpropagation -- to the space of multi-turn dialogues.
Selection
Starting from the root, DAMON traverses the conversation tree using an upper confidence bound policy, balancing exploration of new dialogue paths with exploitation of promising ones.
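The upper confidence bound policy mentioned above can be sketched with the standard UCB1 formula; the exploration constant `c` and the dictionary-based node representation are assumptions for illustration, not details from the paper.

```python
import math

def ucb1(child_value: float, child_visits: int, parent_visits: int,
         c: float = 1.414) -> float:
    """UCB1 score: exploitation term (mean reward) + exploration bonus.
    c ~ sqrt(2) is a common default; the paper's constant is unknown."""
    if child_visits == 0:
        return float("inf")  # always try an unvisited dialogue path first
    exploitation = child_value / child_visits
    exploration = c * math.sqrt(math.log(parent_visits) / child_visits)
    return exploitation + exploration

def select(children: list[dict], parent_visits: int) -> dict:
    # Descend to the follow-up turn with the highest UCB score.
    return max(children,
               key=lambda ch: ucb1(ch["value"], ch["visits"], parent_visits))
```

The exploration bonus shrinks as a path accumulates visits, which is what shifts the search from trying new conversational directions toward refining the ones that already look promising.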
Expansion
At a selected node, DAMON generates new candidate follow-up messages -- potential next turns in the multi-turn attack dialogue -- expanding the search tree with diverse conversational strategies.
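A sketch of the expansion step, assuming some attacker-side generator proposes candidate follow-up messages. `propose_followups` is a hypothetical stand-in for whatever LLM call produces those candidates; the paper's actual generation procedure is not reproduced here.

```python
def expand(node_history: list[str], propose_followups, k: int = 3) -> list[dict]:
    """Expansion sketch: ask an attacker model for k candidate next turns
    and attach each one as a fresh, unvisited child node."""
    children = []
    for msg in propose_followups(node_history, k):
        children.append({
            "history": node_history + [msg],  # extend the dialogue by one turn
            "visits": 0,
            "value": 0.0,
        })
    return children
```

Generating several diverse candidates per expansion is what gives the tree its branching factor: each child embodies a different conversational strategy for the same point in the dialogue.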
Simulation
From the expanded node, DAMON simulates the rest of the conversation to estimate how likely the dialogue path is to induce a harmful response from the target model.
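The simulation step can be sketched as a short rollout that continues the dialogue and scores the target's replies with a judge. Both `target_reply` and `judge_score` are hypothetical stand-ins (a call to the target model and a harmfulness classifier returning a score in [0, 1]); the rollout depth and scoring rule are assumptions.

```python
def simulate(history: list[str], target_reply, judge_score,
             max_turns: int = 3) -> float:
    """Rollout sketch: play out a few more turns and return the highest
    harmfulness score observed, as an estimate of this path's potential."""
    convo = list(history)
    best = 0.0
    for _ in range(max_turns):
        reply = target_reply(convo)           # query the target model
        convo.append(reply)
        best = max(best, judge_score(reply))  # judge scores in [0, 1]
        if best >= 1.0:
            break  # a fully harmful response ends the rollout early
    return best
```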
Backpropagation
The simulation result is propagated back up the tree, updating the value estimates of all ancestor nodes and guiding future exploration toward more effective attack paths.
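Backpropagation itself is the simplest of the four phases; a minimal sketch, assuming the same dictionary-based nodes as above and that every ancestor on the selected path records the simulation reward:

```python
def backpropagate(path: list[dict], reward: float) -> None:
    """Propagate one simulation result up the tree: each node on the
    root-to-leaf path gains a visit and accumulates the reward, so its
    mean value (value / visits) reflects all rollouts through it."""
    for node in path:
        node["visits"] += 1
        node["value"] += reward
```

After this update, the selection phase's UCB scores automatically tilt toward subtrees whose rollouts scored well, which is how the search concentrates on the most effective attack paths.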
Adaptive Dialogue Strategies
What makes DAMON different from previous methods is its adaptivity. The framework doesn't follow a script -- it adapts to the model's responses, adjusting its strategy as the conversation unfolds. If one approach seems to be working, it explores variations; if another hits a wall, it backtracks and tries something different.
This adaptivity mirrors how real-world social engineering attacks work: they probe, adjust, and find the path of least resistance.
Key Advantages over Prior Methods
- Dynamic exploration: No predefined attack templates -- each dialogue path is discovered through search
- Efficient search: MCTS focuses computational resources on the most promising attack trajectories
- Multi-turn awareness: Considers the full conversation history, not just individual turns
- Backtracking capability: Can abandon unproductive paths and explore alternatives, unlike sequential methods
Results
We conducted comprehensive experiments across five large language models and three safety datasets, evaluating DAMON against existing multi-turn and single-turn jailbreaking methods.
Evaluation Summary
| Dimension | Details |
|---|---|
| Target LLMs | 5 LLMs spanning diverse model architectures |
| Safety Datasets | 3 datasets covering varied attack scenarios and safety domains |
| Attack Strategy | Multi-turn dialogue with MCTS-guided exploration |
| Key Finding | Successfully discovers sub-instruction sequences that induce harmful responses across all tested models |
Safety Implications
DAMON reveals that multi-turn dialogue vulnerabilities are more pervasive than previously understood. Models that appear robust to single-turn attacks can still be compromised through carefully constructed conversational sequences -- highlighting the need for safety evaluations that go beyond one-shot prompt testing.
For Better Defenses
The goal of this work isn't to enable attacks -- it's to improve defenses. By understanding how multi-turn conversations can be exploited, we can build models that are more robust to these patterns. DAMON provides a systematic tool for safety researchers to probe model vulnerabilities before bad actors do.
As language models become more integrated into high-stakes applications, this kind of rigorous safety testing becomes increasingly critical.
Takeaways
- Multi-turn attacks are more effective: Models are significantly more vulnerable when attacks unfold across multiple conversation turns
- Search beats scripts: MCTS-guided exploration discovers attack paths that predefined templates miss entirely
- Adaptivity matters: Dynamically adjusting strategy based on model responses dramatically improves attack success
- Defense implications: Safety training must account for multi-turn conversational attack vectors, not just single-turn adversarial prompts