ADV-LLM is a self-tuning method that creates adversarial language models capable of highly effective jailbreak attacks. It achieves nearly 100% attack success rates (ASR) on open-source models like Llama2 and Llama3, with strong transferability to closed-source models (99% ASR on GPT-3.5, 49% on GPT-4). Compared to previous state-of-the-art methods, ADV-LLM significantly outperforms in both effectiveness and computational efficiency. Beyond attack performance, it also contributes to safety alignment research by generating high-quality adversarial datasets.
Table 1: Comparison of SOTA jailbreak methods.
Figure 1: The pipline of training and deploying ADV-LLM. It begins by designing a better starting point, followed by iterative self-tuning to progressively improve jailbreak ability. Once trained, ADV-LLM acts as an adversarial attacker that generates effective suffixes for any given harmful query.
ADV-LLM is a novel iterative self-tuning framework that transforms a standard pretrained LLM into an adversarial suffix generator capable of bypassing safety alignment. The method begins with carefully designed starting point (suffix initialization and target refinement), reducing the search space and improving the likelihood of successful jailbreaks.
Figure 2: We initialize a starting suffix and refine the target to make the jailbreaking easier.
Once we have a better starting point. ADV-LLM iteratively undergoes two phase to fine-tunes itself:
This process gradually improves the model’s ability to generate effective adversarial suffixes without relying on external data or gradient information from the target model.
Figure 3: ADV-LLM iteratively generates data for self-tuning.
{Template}/{LlamaGuard}/{GPT-4} %
.
ADV-LLM significantly outperforms all baselines:
Table 2: The ASR of ADV-LLMs compared with search-based methods.
Table 3: The ASR of ADV-LLMs compared with LLM-based methods.
ADV-LLM achieves strong cross-model transferability - an essential property for real-world jailbreak evaluations:
Table 4: The transferability of ADV-LLMs.
ADV-LLM generalizes effectively to new and diverse harmful query formats:
Table 5: The generalizability of ADV-LLMs.
The harmful prompts generated by ADV-LLM is hard to detect.
Table 6: Perplexity and ASR against perplexity defense of ADV-LLMs compared with AmpleGCGs.
[GCG] Zou et al. (2023). Universal and Transferable Adversarial Attacks on Aligned Language Models.
[I-GCG] Jia et al. (2024). Improved Techniques for Optimization-Based Jailbreaking on Large Language Models.
[AutoDAN] Liu et al. (2023). AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models.
[COLD-Attack] Guo et al. (2024). COLD-Attack: Jailbreaking LLMs with Stealthiness and Controllability.
[BEAST] Sadasivan et al. (2024). Fast Adversarial Attacks on Language Models In One GPU Minute.
[AmpleGCG] Liao and Sun (2024). AmpleGCG: Learning a Universal and Transferable Generative Model of Adversarial Suffixes for Jailbreaking Both Open and Closed LLMs.
Chung-En Sun, Xiaodong Liu, Weiwei Yang, Tsui-Wei Weng, Hao Cheng, Aidan San, Michel Galley, and Jianfeng Gao. "Iterative Self-Tuning LLMs for Enhanced Jailbreaking Capabilities." NAACL 2025.
@article{advllm,
title = {Iterative Self-Tuning LLMs for Enhanced Jailbreaking Capabilities},
author = {Chung-En Sun, Xiaodong Liu, Weiwei Yang, Tsui-Wei Weng, Hao Cheng, Aidan San, Michel Galley, Jianfeng Gao},
journal= {NAACL},
year = {2025}
}