Iterative Self-Tuning LLMs for
Enhanced Jailbreaking Capabilities

1UC San Diego, 2UVA, 3Microsoft Research
NAACL 2025 (Oral)
Paper | Code

Abstract

ADV-LLM is a self-tuning method that creates adversarial language models capable of highly effective jailbreak attacks. It achieves nearly 100% attack success rate (ASR) on open-source models such as Llama2 and Llama3, with strong transferability to closed-source models (99% ASR on GPT-3.5 and 49% on GPT-4). ADV-LLM substantially outperforms previous state-of-the-art methods in both attack effectiveness and computational efficiency. Beyond attack performance, it also contributes to safety alignment research by generating high-quality adversarial datasets.


Motivation: Evaluating LLM Safety Alignment

Table placeholder

Table 1: Comparison of SOTA jailbreak methods.


ADV-LLM: Iterative Self-Tuning Framework

Overview figure

Figure 1: The pipeline for training and deploying ADV-LLM. It begins by designing a better starting point, followed by iterative self-tuning to progressively improve jailbreak ability. Once trained, ADV-LLM acts as an adversarial attacker that generates effective suffixes for any given harmful query.

ADV-LLM is a novel iterative self-tuning framework that transforms a standard pretrained LLM into an adversarial suffix generator capable of bypassing safety alignment. The method begins with a carefully designed starting point (suffix initialization and target refinement), which reduces the search space and improves the likelihood of successful jailbreaks.

Overview figure

Figure 2: We initialize a starting suffix and refine the target to make jailbreaking easier.

Once this better starting point is in place, ADV-LLM iteratively fine-tunes itself in two alternating phases: generating candidate suffixes and collecting the successful ones as training data, then fine-tuning on that data before the next round of generation.

This process gradually improves the model’s ability to generate effective adversarial suffixes without relying on external data or gradient information from the target model.

Overview figure

Figure 3: ADV-LLM iteratively generates data for self-tuning.
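The loop in Figure 3 can be read as a generic self-training cycle: sample candidate suffixes, keep those that succeed against the target model, and fine-tune on them before the next round. The sketch below is a schematic rendering of that cycle, not the authors' implementation; `generate_suffixes`, `elicits_jailbreak`, and `finetune` are hypothetical placeholders standing in for the paper's decoding, success-checking, and training steps.

```python
from dataclasses import dataclass, field


@dataclass
class AdvLLM:
    """Schematic stand-in for the adversarial suffix generator being tuned."""
    training_data: list = field(default_factory=list)

    def generate_suffixes(self, query: str, n: int) -> list[str]:
        # Placeholder: sample n candidate suffixes for the given query.
        return [f"<candidate suffix {i} for: {query}>" for i in range(n)]

    def finetune(self, examples: list[tuple[str, str]]) -> None:
        # Placeholder: update the model on (query, successful suffix) pairs.
        self.training_data.extend(examples)


def elicits_jailbreak(query: str, suffix: str) -> bool:
    # Placeholder for the success check against the target model
    # (e.g., the target does not refuse). Always False in this stub.
    return False


def self_tune(model: AdvLLM, queries: list[str],
              iterations: int = 3, n_samples: int = 8) -> AdvLLM:
    """One possible reading of the iterative self-tuning cycle in Figure 3."""
    for _ in range(iterations):
        # Phase 1: generate candidates and keep the successful ones as data.
        successes = [
            (q, s)
            for q in queries
            for s in model.generate_suffixes(q, n_samples)
            if elicits_jailbreak(q, s)
        ]
        # Phase 2: fine-tune the model on its own successful suffixes.
        model.finetune(successes)
    return model
```

Because the loop only consumes the model's own successful generations, it needs no external adversarial dataset and no gradients from the target model, matching the description above.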


Experiments and Results

Settings

Experiment 1 – Attack Success Rate

ADV-LLM significantly outperforms all baselines:

ASR search-based

Table 2: The ASR of ADV-LLMs compared with search-based methods.

ASR generation-based

Table 3: The ASR of ADV-LLMs compared with LLM-based methods.

Experiment 2 – Transferability

ADV-LLM achieves strong cross-model transferability - an essential property for real-world jailbreak evaluations:

Transferability figure

Table 4: The transferability of ADV-LLMs.

Experiment 3 – OOD Generalizability

ADV-LLM generalizes effectively to new and diverse harmful query formats:

Generalizability figure

Table 5: The generalizability of ADV-LLMs.

Experiment 4 – Stealthiness

The adversarial prompts generated by ADV-LLM are hard to detect.

Stealthiness figure

Table 6: Perplexity and ASR under a perplexity defense for ADV-LLMs compared with AmpleGCGs.
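A common instantiation of the perplexity defense referenced in Table 6 scores each incoming prompt with a small reference language model and rejects prompts whose perplexity exceeds a threshold. The sketch below, assuming the Hugging Face `transformers` GPT-2 model and a purely hypothetical threshold value, shows how such a filter might be wired up; it is illustrative and not the exact defense evaluated in the paper.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

# Small reference LM used only to score prompts, not to answer them.
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()


@torch.no_grad()
def perplexity(prompt: str) -> float:
    """Token-level perplexity of the prompt under GPT-2."""
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    # Passing the inputs as labels yields the average next-token cross-entropy.
    loss = model(ids, labels=ids).loss
    return float(torch.exp(loss))


def passes_filter(prompt: str, threshold: float = 500.0) -> bool:
    """Reject prompts whose perplexity exceeds the (hypothetical) threshold."""
    return perplexity(prompt) <= threshold
```

Low-perplexity, natural-sounding suffixes slip through such a filter, which is why Table 6 reports both the perplexity of the generated suffixes and the ASR that survives the defense.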


Conclusion


References

[GCG] Zou et al. (2023). Universal and Transferable Adversarial Attacks on Aligned Language Models.

[I-GCG] Jia et al. (2024). Improved Techniques for Optimization-Based Jailbreaking on Large Language Models.

[AutoDAN] Liu et al. (2023). AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models.

[COLD-Attack] Guo et al. (2024). COLD-Attack: Jailbreaking LLMs with Stealthiness and Controllability.

[BEAST] Sadasivan et al. (2024). Fast Adversarial Attacks on Language Models In One GPU Minute.

[AmpleGCG] Liao and Sun (2024). AmpleGCG: Learning a Universal and Transferable Generative Model of Adversarial Suffixes for Jailbreaking Both Open and Closed LLMs.


Cite this Work

Chung-En Sun, Xiaodong Liu, Weiwei Yang, Tsui-Wei Weng, Hao Cheng, Aidan San, Michel Galley, and Jianfeng Gao. "Iterative Self-Tuning LLMs for Enhanced Jailbreaking Capabilities." NAACL 2025.

@inproceedings{advllm,
  title     = {Iterative Self-Tuning LLMs for Enhanced Jailbreaking Capabilities},
  author    = {Sun, Chung-En and Liu, Xiaodong and Yang, Weiwei and Weng, Tsui-Wei and Cheng, Hao and San, Aidan and Galley, Michel and Gao, Jianfeng},
  booktitle = {NAACL},
  year      = {2025}
}