DAMON: A Dialogue-Aware MCTS Framework for Jailbreaking Large Language Models
EMNLP 2025
We apply Monte Carlo Tree Search to multi-turn conversations, enabling systematic exploration of dialogue paths that expose LLM safety vulnerabilities -- outperforming existing jailbreak methods across five LLMs and three safety benchmarks.
Red teaming is essential for understanding and improving the safety of large language models. While previous work has shown that multi-turn dialogues can be exploited to jailbreak LLMs, existing multi-turn attack methods rely on predefined patterns, limiting their effectiveness in realistic scenarios where conversations evolve dynamically. In this paper, we propose DAMON, a Dialogue-Aware Monte Carlo Tree Search (MCTS) framework that reconceptualizes multi-turn jailbreaking as a search problem over conversational spaces. DAMON treats each conversation turn as a move in a game tree, systematically exploring and evaluating dialogue paths to identify attack sequences that induce harmful responses. Through adaptive dialogue strategies that respond dynamically to the model's outputs, DAMON efficiently discovers sub-instruction sequences that existing methods cannot find. Comprehensive experiments on five LLMs across three safety datasets demonstrate the effectiveness of our framework, revealing vulnerabilities that inform better defense strategies.
The Art of Red Teaming
Language models are trained to refuse harmful requests. But how robust are these refusals? Red teaming -- the practice of probing systems for vulnerabilities -- is essential for understanding and improving model safety. The challenge is that effective attacks require more than simple one-shot prompts; they often unfold over multiple conversation turns.
Previous research has shown that models become more vulnerable in multi-turn dialogues. But existing multi-turn attack methods use predefined patterns, limiting their effectiveness in realistic scenarios where conversations evolve dynamically.
Conversation as a Search Space
We reconceptualize multi-turn jailbreaking as a search problem. Each possible dialogue path is a branch in an enormous tree of conversations. Some paths lead to successful attacks; most don't. The question becomes: how do we efficiently explore this space to find the paths that expose vulnerabilities?
Our answer is DAMON: a framework that applies Monte Carlo Tree Search to navigate conversational spaces. Just as MCTS revolutionized game-playing AI by efficiently searching game trees, DAMON efficiently searches conversation trees to identify attack sequences.
Core Idea
DAMON treats each conversation turn as a move in a game. The search tree expands by generating possible follow-up messages, simulates conversations to estimate their potential, and gradually focuses on the most promising dialogue paths. This systematic exploration finds attack sequences that would be nearly impossible to discover through random sampling.
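To make the "conversation as game tree" framing concrete, here is a minimal sketch of a tree node, assuming each node stores the dialogue history so far plus the visit counts and value estimates that MCTS needs. The field names and structure are illustrative assumptions, not the paper's actual implementation.

```python
from dataclasses import dataclass, field

@dataclass
class DialogueNode:
    """One node in the conversation search tree (hypothetical structure;
    the paper does not specify its exact node contents)."""
    history: list[str]                  # alternating attacker/target messages so far
    children: list["DialogueNode"] = field(default_factory=list)
    visits: int = 0                     # how often MCTS has passed through this node
    value: float = 0.0                  # accumulated reward from simulations

    def mean_value(self) -> float:
        # Average simulated reward; 0.0 for a node that was never visited.
        return self.value / self.visits if self.visits else 0.0
```

Each child corresponds to one candidate next turn, so a root-to-leaf path is one complete multi-turn attack dialogue.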
How DAMON Works
DAMON applies the four classic phases of MCTS -- selection, expansion, simulation, and backpropagation -- to the space of multi-turn dialogues.
Selection
Starting from the root, DAMON traverses the conversation tree using an upper confidence bound policy, balancing exploration of new dialogue paths with exploitation of promising ones.
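The upper confidence bound policy mentioned above can be sketched with the standard UCB1 formula; the exploration constant `c` and the dictionary-based node representation are assumptions for illustration, not details from the paper.

```python
import math

def ucb1(child_value: float, child_visits: int, parent_visits: int,
         c: float = 1.414) -> float:
    """UCB1 score: exploitation term (mean reward) + exploration bonus.
    c ~ sqrt(2) is a common default; the paper's constant is unknown."""
    if child_visits == 0:
        return float("inf")  # always try an unvisited dialogue path first
    exploitation = child_value / child_visits
    exploration = c * math.sqrt(math.log(parent_visits) / child_visits)
    return exploitation + exploration

def select(children: list[dict], parent_visits: int) -> dict:
    # Descend to the follow-up turn with the highest UCB score.
    return max(children,
               key=lambda ch: ucb1(ch["value"], ch["visits"], parent_visits))
```

The exploration bonus shrinks as a path accumulates visits, which is what shifts the search from trying new conversational directions toward refining the ones that already look promising.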
Expansion
At a selected node, DAMON generates new candidate follow-up messages -- potential next turns in the multi-turn attack dialogue -- expanding the search tree with diverse conversational strategies.
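A sketch of the expansion step, assuming some attacker-side generator proposes candidate follow-up messages. `propose_followups` is a hypothetical stand-in for whatever LLM call produces those candidates; the paper's actual generation procedure is not reproduced here.

```python
def expand(node_history: list[str], propose_followups, k: int = 3) -> list[dict]:
    """Expansion sketch: ask an attacker model for k candidate next turns
    and attach each one as a fresh, unvisited child node."""
    children = []
    for msg in propose_followups(node_history, k):
        children.append({
            "history": node_history + [msg],  # extend the dialogue by one turn
            "visits": 0,
            "value": 0.0,
        })
    return children
```

Generating several diverse candidates per expansion is what gives the tree its branching factor: each child embodies a different conversational strategy for the same point in the dialogue.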
Simulation
From the expanded node, DAMON simulates the rest of the conversation to estimate how likely the dialogue path is to induce a harmful response from the target model.
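The simulation step can be sketched as a short rollout that continues the dialogue and scores the target's replies with a judge. Both `target_reply` and `judge_score` are hypothetical stand-ins (a call to the target model and a harmfulness classifier returning a score in [0, 1]); the rollout depth and scoring rule are assumptions.

```python
def simulate(history: list[str], target_reply, judge_score,
             max_turns: int = 3) -> float:
    """Rollout sketch: play out a few more turns and return the highest
    harmfulness score observed, as an estimate of this path's potential."""
    convo = list(history)
    best = 0.0
    for _ in range(max_turns):
        reply = target_reply(convo)           # query the target model
        convo.append(reply)
        best = max(best, judge_score(reply))  # judge scores in [0, 1]
        if best >= 1.0:
            break  # a fully harmful response ends the rollout early
    return best
```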
Backpropagation
The simulation result is propagated back up the tree, updating the value estimates of all ancestor nodes and guiding future exploration toward more effective attack paths.
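Backpropagation itself is the simplest of the four phases; a minimal sketch, assuming the same dictionary-based nodes as above and that every ancestor on the selected path records the simulation reward:

```python
def backpropagate(path: list[dict], reward: float) -> None:
    """Propagate one simulation result up the tree: each node on the
    root-to-leaf path gains a visit and accumulates the reward, so its
    mean value (value / visits) reflects all rollouts through it."""
    for node in path:
        node["visits"] += 1
        node["value"] += reward
```

After this update, the selection phase's UCB scores automatically tilt toward subtrees whose rollouts scored well, which is how the search concentrates on the most effective attack paths.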
Adaptive Dialogue Strategies
What makes DAMON different from previous methods is its adaptivity. The framework doesn't follow a script -- it adapts to the model's responses, adjusting its strategy as the conversation unfolds. If one approach seems to be working, it explores variations; if another hits a wall, it backtracks and tries something different.
This adaptivity mirrors how real-world social engineering attacks work: they probe, adjust, and find the path of least resistance.
Key Advantages over Prior Methods
- Dynamic exploration: No predefined attack templates -- each dialogue path is discovered through search
- Efficient search: MCTS focuses computational resources on the most promising attack trajectories
- Multi-turn awareness: Considers the full conversation history, not just individual turns
- Backtracking capability: Can abandon unproductive paths and explore alternatives, unlike sequential methods
Results
We conducted comprehensive experiments across five large language models and three safety datasets, evaluating DAMON against existing multi-turn and single-turn jailbreaking methods.
Evaluation Summary
| Dimension | Details |
|---|---|
| Target LLMs | 5 LLMs spanning diverse model architectures |
| Safety Datasets | 3 datasets covering varied attack scenarios and safety domains |
| Attack Strategy | Multi-turn dialogue with MCTS-guided exploration |
| Key Finding | Successfully discovers sub-instruction sequences that induce harmful responses across all tested models |
Safety Implications
DAMON reveals that multi-turn dialogue vulnerabilities are more pervasive than previously understood. Models that appear robust to single-turn attacks can still be compromised through carefully constructed conversational sequences -- highlighting the need for safety evaluations that go beyond one-shot prompt testing.
For Better Defenses
The goal of this work isn't to enable attacks -- it's to improve defenses. By understanding how multi-turn conversations can be exploited, we can build models that are more robust to these patterns. DAMON provides a systematic tool for safety researchers to probe model vulnerabilities before bad actors do.
As language models become more integrated into high-stakes applications, this kind of rigorous safety testing becomes increasingly critical.
Takeaways
- Multi-turn attacks are more effective: Models are significantly more vulnerable when attacks unfold across multiple conversation turns
- Search beats scripts: MCTS-guided exploration discovers attack paths that predefined templates miss entirely
- Adaptivity matters: Dynamically adjusting strategy based on model responses dramatically improves attack success
- Defense implications: Safety training must account for multi-turn conversational attack vectors, not just single-turn adversarial prompts