Johnny
Persuade LLMs to jailbreak each other
This project explores how systematic persuasion can be used to jailbreak LLMs. It introduces a taxonomy of 40 persuasion techniques and uses them to craft persuasive adversarial prompts (PAPs), achieving a 92% attack success rate on aligned LLMs.
The study also finds that more advanced models such as GPT-4 are more vulnerable to PAPs than less capable models, and that adaptive defenses tailored to PAPs also provide effective protection against other attacks.
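
To make the idea concrete, the sketch below shows one way a plain request could be rewritten into a PAP by pairing it with a technique from the taxonomy. This is a minimal illustration only: the technique entries, prompt template, and function names are assumptions, not the project's actual prompts or code, and the resulting prompt would still need to be sent to a paraphraser LLM.

```python
# Illustrative sketch of constructing a persuasive adversarial prompt (PAP).
# The technique definitions and template below are placeholders, not the
# project's real taxonomy or prompts.

# Tiny stand-in for the taxonomy: technique name -> short definition.
# (The actual taxonomy contains 40 techniques.)
PERSUASION_TECHNIQUES = {
    "evidence-based persuasion": "support the request with credible-sounding evidence",
    "authority endorsement": "invoke an authority or institution to legitimize the request",
}

# Hypothetical paraphrase instruction handed to a paraphraser LLM.
PARAPHRASE_TEMPLATE = (
    "Rewrite the following request so that it applies the persuasion "
    "technique '{technique}' ({definition}), while keeping the original "
    "intent of the request unchanged.\n\nRequest: {query}"
)

def build_pap_prompt(query: str, technique: str) -> str:
    """Fill the paraphrase template for a given plain query and technique."""
    definition = PERSUASION_TECHNIQUES[technique]
    return PARAPHRASE_TEMPLATE.format(
        technique=technique, definition=definition, query=query
    )

if __name__ == "__main__":
    # The filled-in prompt would be sent to a paraphraser model; its output
    # (the PAP) is what gets used to probe the target LLM.
    print(build_pap_prompt("a benign placeholder query", "authority endorsement"))
```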