Iterative Self-Tuning LLMs for
Enhanced Jailbreaking Capabilities

Chung-En Sun¹, Xiaodong Liu³, Weiwei Yang³, Tsui-Wei Weng¹, Hao Cheng³, Aidan San², Michel Galley³, Jianfeng Gao³

¹UC San Diego , ²UVA , ³Microsoft Research

NAACL 2025 (Oral)

Abstract

ADV-LLM is a self-tuning method that creates adversarial language models capable of highly effective jailbreak attacks. It achieves nearly 100% attack success rates (ASR) on open-source models like Llama2 and Llama3, with strong transferability to closed-source models (99% ASR on GPT-3.5, 49% on GPT-4). Compared to previous state-of-the-art methods, ADV-LLM significantly outperforms in both effectiveness and computational efficiency. Beyond attack performance, it also contributes to safety alignment research by generating high-quality adversarial datasets.

Motivation: Evaluating LLM Safety Alignment

While automatic jailbreak attacks have emerged as a way to test model robustness, existing techniques are either inefficient or ineffective against well-aligned models like Llama2 and Llama3.
Moreover, these methods rely heavily on expensive optimization procedures or pre-generated datasets, limiting their scalability and real-world applicability.
Our Goal: Develop an efficient, scalable, and highly effective method for crafting jailbreak prompts—one that can improve attack success rates with minimal computational cost and reveal deeper insights into LLM safety vulnerabilities.

Table 1: Comparison of SOTA jailbreak methods.

ADV-LLM: Iterative Self-Tuning Framework

Figure 1: The pipline of training and deploying ADV-LLM. It begins by designing a better starting point, followed by iterative self-tuning to progressively improve jailbreak ability. Once trained, ADV-LLM acts as an adversarial attacker that generates effective suffixes for any given harmful query.

ADV-LLM is a novel iterative self-tuning framework that transforms a standard pretrained LLM into an adversarial suffix generator capable of bypassing safety alignment. The method begins with carefully designed starting point (suffix initialization and target refinement), reducing the search space and improving the likelihood of successful jailbreaks.

Figure 2: We initialize a starting suffix and refine the target to make the jailbreaking easier.

Once we have a better starting point. ADV-LLM iteratively undergoes two phase to fine-tunes itself:

Suffix Sampling (Phase 1): The model generates candidate suffixes for a given harmful query and selects successful ones based on whether they elicit unsafe responses from a target victim model.
Knowledge Updating (Phase 2): The model is fine-tuned on the accumulated successful suffixes.

This process gradually improves the model’s ability to generate effective adversarial suffixes without relying on external data or gradient information from the target model.

Figure 3: ADV-LLM iteratively generates data for self-tuning.

Experiments and Results

Settings

Dataset: 520 harmful queries from AdvBench collected by GCG attack.
Models: Vicuna, Guanaco, Mistral, Llama2, and Llama3.
Evaluation Metrics:
- Template Check – detects refusal patterns such as “I cannot” or “As an AI”.
- LlamaGuard Check – a safety classifier that flags unsafe content in model outputs.
- GPT-4 Check – determines whether the response is both harmful and provides a detailed, actionable answer.
A successful attack must bypass all three checks to be considered fully effective. Results are reported as {Template}/{LlamaGuard}/{GPT-4} %.
Baselines: Search-based methods (GCG, I-GCG, AutoDAN, Cold-Attack, Beast) and LLM-based methods (AmpleGCG).

Experiment 1 – Attack Success Rate

ADV-LLM significantly outperforms all baselines:

With just one attempt per query (greedy decoding), it achieves higher attack success rates across all models.
Using 50 attempts (group beam search), it reaches nearly 100% ASR on all open-source models.

Table 2: The ASR of ADV-LLMs compared with search-based methods.

Table 3: The ASR of ADV-LLMs compared with LLM-based methods.

Experiment 2 – Transferability

ADV-LLM achieves strong cross-model transferability - an essential property for real-world jailbreak evaluations:

Despite training only on Llama3, ADV-LLM achieves 99% ASR on GPT-3.5 and 49% ASR on GPT-4.

Table 4: The transferability of ADV-LLMs.

Experiment 3 – OOD Generalizability

ADV-LLM generalizes effectively to new and diverse harmful query formats:

We evaluate ADV-LLM on MaliciousInstruct, a set of harmful queries not seen during training. ADV-LLM demonstrates strong generalization and maintains high performance on this unseen dataset.

Table 5: The generalizability of ADV-LLMs.

Experiment 4 – Stealthiness

The harmful prompts generated by ADV-LLM is hard to detect.

ADV-LLM consistently produces suffixes with lower perplexity, even without repetition tricks.
ADV-LLM can evade perplexity-based defenses, which flag unnatural or low-fluency prompts.

Table 6: Perplexity and ASR against perplexity defense of ADV-LLMs compared with AmpleGCGs.

Conclusion

ADV-LLM has following benefits:

Highly Effective Jailbreak Attack

Achieve ≈100% ASR on well-aligned LLMs such as Llama2 and Llama3, outperforming existing strong attack method by more than 50%.

Transfer effectively to closed-source models
- Reach 99% ASR on GPT-3.5 and 49% ASR on GPT-4.
Generalize well to unseen harmful queries
- The performance of ADV-LLM does not drop on MaliciousInstruct.
Strong Stealthiness
- Generated suffixes have lower perplexity and successfully bypass perplexity-based defenses.
Compute-efficient and scalable
- Provide a valuable tool for studying LLM safety alignment.

References

[GCG] Zou et al. (2023). Universal and Transferable Adversarial Attacks on Aligned Language Models.

[I-GCG] Jia et al. (2024). Improved Techniques for Optimization-Based Jailbreaking on Large Language Models.

[AutoDAN] Liu et al. (2023). AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models.

[COLD-Attack] Guo et al. (2024). COLD-Attack: Jailbreaking LLMs with Stealthiness and Controllability.

[BEAST] Sadasivan et al. (2024). Fast Adversarial Attacks on Language Models In One GPU Minute.

[AmpleGCG] Liao and Sun (2024). AmpleGCG: Learning a Universal and Transferable Generative Model of Adversarial Suffixes for Jailbreaking Both Open and Closed LLMs.

Cite this Work

Chung-En Sun, Xiaodong Liu, Weiwei Yang, Tsui-Wei Weng, Hao Cheng, Aidan San, Michel Galley, and Jianfeng Gao. "Iterative Self-Tuning LLMs for Enhanced Jailbreaking Capabilities." NAACL 2025.

@article{advllm,
  title  = {Iterative Self-Tuning LLMs for Enhanced Jailbreaking Capabilities},
  author = {Chung-En Sun, Xiaodong Liu, Weiwei Yang, Tsui-Wei Weng, Hao Cheng, Aidan San, Michel Galley, Jianfeng Gao},
  journal= {NAACL},
  year   = {2025}
}

Iterative Self-Tuning LLMs forEnhanced Jailbreaking Capabilities