NeurIPS 2024 · Past · Safety & alignment

Red Teaming GenAI: What Can We Learn from Adversaries?

Red Teaming GenAI Workshop @ NeurIPS'24

Submission deadline: Sep 21, 2024, 21:59 UTC
(Imported from OpenReview; check the official website for extensions.)
Submission portal: OpenReview
Notes: Topics were auto-suggested and may be imprecise — edits welcome.

Accepted papers (38)

Fetched from OpenReview (v2) on 2026-06-10.

A Formal Framework for Assessing and Mitigating Emergent Security Risks in Generative AI Models: Bridging Theory and Dynamic Risk Mitigation
Aviral Srivastava, Sourav Panda · PDF
A Realistic Threat Model for Large Language Model Jailbreaks
Valentyn Boreiko, Alexander Panfilov, Vaclav Voracek, Matthias Hein, Jonas Geiping · PDF
Advancing Adversarial Suffix Transfer Learning on Aligned Large Language Models
Hongfu Liu, Yuxi Xie, Ye Wang, Michael Shieh · PDF
Adversarial Negotiation Dynamics in Generative Language Models
Arinbjörn Kolbeinsson, Benedikt Kolbeinsson · PDF
Algorithmic Oversight for Deceptive Reasoning
Ege Onur Taga, Mingchen Li, Yongqi Chen, Samet Oymak · PDF
Alpaca against Vicuna: Using LLMs to Uncover Memorization of LLMs
Aly M. Kassem, Omar Mahmoud, Niloofar Mireshghallah, Hyunwoo Kim, Yulia Tsvetkov, Yejin Choi, Sherif Saad, Santu Rana · PDF
An Adversarial Perspective on Machine Unlearning for AI Safety
Jakub Łucki, Boyi Wei, Yangsibo Huang, Peter Henderson, Florian Tramèr, Javier Rando · PDF
Attack Atlas: A Practitioner's Perspective on Challenges and Pitfalls in Red Teaming GenAI
Ambrish Rawat, Stefan Schoepf, Giulio Zizzo, Giandomenico Cornacchia, Muhammad Zaid Hameed, Kieran Fraser, Erik Miehling, Beat Buesser, Elizabeth M. Daly, Mark Purcell, Prasanna Sattigeri, Pin-Yu Chen, Kush R. Varshney · PDF
Between the Bars: Gradient-based Jailbreaks are Bugs that induce Features
Kaivalya Hariharan, Uzay Girit · PDF
Code-Switching Red-Teaming: LLM Evaluation for Safety and Multilingual Understanding
Haneul Yoo, Yongjin Yang, Hwaran Lee · PDF
CopyBench: Measuring Literal and Non-Literal Reproduction of Copyright-Protected Text in Language Model Generation
Tong Chen, Akari Asai, Niloofar Mireshghallah, Sewon Min, James Grimmelmann, Yejin Choi, Hannaneh Hajishirzi, Luke Zettlemoyer, Pang Wei Koh · PDF
Curiosity-driven Red teaming for Large Language Models
Zhang-Wei Hong, Idan Shenfeld, Tsun-Hsuan Wang, Yung-Sung Chuang, Aldo Pareja, James R. Glass, Akash Srivastava, Pulkit Agrawal · PDF
Decoding Biases: An Analysis of Automated Methods and Metrics for Gender Bias Detection in Language Models
Shachi H. Kumar, Saurav Sahay, Sahisnu Mazumder, Eda Okur, Ramesh Manuvinakurike, Nicole Marie Beckage, Hsuan Su, Hung-yi Lee, Lama Nachman · PDF
Decompose, Recompose, and Conquer: Multi-modal LLMs are Vulnerable to Compositional Adversarial Attacks in Multi-Image Queries
Julius Broomfield, George Ingebretsen, Reihaneh Iranmanesh, Sara Pieri, Ethan Kosak-Hine, Tom Gibbs, Reihaneh Rabbany, Kellin Pelrine · PDF
Diverse and Effective Red Teaming with Auto-generated Rewards and Multi-step Reinforcement Learning
Alex Beutel, Kai Yuanqing Xiao, Johannes Heidecke, Lilian Weng · PDF
Does Refusal Training in LLMs Generalize to the Past Tense?
Maksym Andriushchenko, Nicolas Flammarion · PDF
Failures to Find Transferable Image Jailbreaks Between Vision-Language Models
Rylan Schaeffer, Dan Valentine, Luke Bailey, James Chua, Zane Durante, Cristobal Eyzaguirre, Joe Benton, Brando Miranda, Henry Sleight, Tony Tong Wang, John Hughes, Rajashree Agrawal, Mrinank Sharma, Scott Emmons, Sanmi Koyejo, Ethan Perez · PDF
Forget to Flourish: Leveraging Machine-Unlearning on Pretrained Language Models for Privacy Leakage
Md Rafi Ur Rashid, Jing Liu, Toshiaki Koike-Akino, Shagufta Mehnaz, Ye Wang · PDF
iART - Imitation guided Automated Red Teaming
Sajad Mousavi, Desik Rengarajan, Ashwin Ramesh Babu, Vineet Gundecha, Avisek Naug, Sahand Ghorbanpour, Ricardo Luna Gutierrez, Antonio Guillen, Paolo Faraboschi, Soumyendu Sarkar · PDF
Infecting LLM Agents via Generalizable Adversarial Attack
Weichen Yu, Kai Hu, Tianyu Pang, Chao Du, Min Lin, Matt Fredrikson · PDF
Interactive Semantic Interventions for VLMs: Breaking VLMs with Human Ingenuity
Lukas Klein, Kenza Amara, Carsten T. Lüth, Hendrik Strobelt, Mennatallah El-Assady, Paul F Jaeger · PDF
Keep on Swimming: Real Attackers Only Need Partial Knowledge of a Multi-Model System
Julian Collado, Kevin Stangl · PDF
Large Language Model Detoxification: Data and Metric Solutions
SungJoo Byun, Hyopil Shin · PDF
Learning Diverse Attacks on Large Language Models for Robust Red-Teaming and Safety Tuning
Seanie Lee, Minsu Kim, Lynn Cherif, David Dobre, Juho Lee, Sung Ju Hwang, Kenji Kawaguchi, Gauthier Gidel, Yoshua Bengio, Nikolay Malkin, Moksh Jain · PDF
Lessons From Red Teaming 100 Generative AI Products
Blake Bullwinkel, Amanda J. Minnich, Shiven Chawla, Gary David Lopez Munoz, Martin Pouliot, Whitney Maxwell, Joris de Gruyter, Katherine Pratt, Saphir Qi, Nina Chikanov, Roman Lutz, Raja Sekhar Rao Dheekonda, Bolor-Erdene Jagdagdorj, Rich Lundeen, Sam Vaughan, Victoria Westerhoff, Pete Bryan, Ram Shankar Siva Kumar, Yonatan Zunger, Mark Russinovich · PDF
LLM Defenses Are Not Robust to Multi-Turn Human Jailbreaks Yet
Nathaniel Li, Ziwen Han, Ian Steneker, Willow E. Primack, Riley Goodside, Hugh Zhang, Zifan Wang, Cristina Menghini, Summer Yue · PDF
LLM-Assisted Red Teaming of Diffusion Models through "Failures Are Fated, But Can Be Faded"
Som Sagar, Aditya Taparia, Ransalu Senanayake · PDF
MedAIScout: Automated Retrieval of Known Machine Learning Vulnerabilities in Medical Applications
Athish Pranav Dharmalingam, Gargi Mitra · PDF
Plentiful Jailbreaks with String Compositions
Brian R.Y. Huang · PDF
Rethinking LLM Memorization through the Lens of Adversarial Compression
Avi Schwarzschild, Zhili Feng, Pratyush Maini, Zachary Chase Lipton, J Zico Kolter · PDF
SAGE-RT: Synthetic Alignment data Generation for Safety Evaluation and Red Teaming
Anurakt Kumar, Divyanshu Kumar, Jatan Loya, Nitin Aravind Birur, Tanay Baswa, Sahil Agarwal, Prashanth Harshangi · PDF
Semantic Membership Inference Attack against Large Language Models
Hamid Mozaffari, Virendra Marathe · PDF
SkewAct: Red Teaming Large Language Models via Activation-Skewed Adversarial Prompt Optimization
Hanxi Guo, Siyuan Cheng, Guanhong Tao, Guangyu Shen, Zhuo Zhang, Shengwei An, Kaiyuan Zhang, Xiangyu Zhang · PDF
Stability Evaluation of Large Language Models via Distributional Perturbation Analysis
Jiashuo Liu, Jiajin Li, Peng Cui, Jose Blanchet · PDF
Steganography in Large Language Models: Investigating Emergence and Mitigations
Yohan Mathew, Robert McCarthy, Ollie Matthews, Joan Velja, Nandi Schoots, Dylan Cope · PDF
Text-Diffusion Red-Teaming of Large Language Models: Unveiling Harmful Behaviors with Proximity Constraints
Jonathan Nöther, Adish Singla, Goran Radanovic · PDF
TOFU: A Task of Fictitious Unlearning for LLMs
Pratyush Maini, Zhili Feng, Avi Schwarzschild, Zachary Chase Lipton, J Zico Kolter · PDF
What Features in Prompts Jailbreak LLMs? Investigating the Mechanisms Behind Attacks
Nathalie Maria Kirch, Severin Field, Stephen Casper · PDF
