NeurIPS 2024 · Past · Safety & alignment

Red Teaming GenAI: What Can We Learn from Adversaries?

Red Teaming GenAI Workshop @ NeurIPS'24

Submission deadline: Sep 21, 2024, 21:59 UTC (imported from OpenReview; check the website for extensions)
Submission portal: OpenReview
Notes: Auto-imported from the OpenReview venue record on 2026-06-10. Please verify and enrich (topics are keyword-guessed).

Accepted papers (38)

Fetched from OpenReview (v2) on 2026-06-10.

  1. A Formal Framework for Assessing and Mitigating Emergent Security Risks in Generative AI Models: Bridging Theory and Dynamic Risk Mitigation

    Aviral Srivastava, Sourav Panda · PDF
  2. A Realistic Threat Model for Large Language Model Jailbreaks

    Valentyn Boreiko, Alexander Panfilov, Vaclav Voracek, Matthias Hein, Jonas Geiping · PDF
  3. Advancing Adversarial Suffix Transfer Learning on Aligned Large Language Models

    Hongfu Liu, Yuxi Xie, Ye Wang, Michael Shieh · PDF
  4. Adversarial Negotiation Dynamics in Generative Language Models

    Arinbjörn Kolbeinsson, Benedikt Kolbeinsson · PDF
  5. Algorithmic Oversight for Deceptive Reasoning

    Ege Onur Taga, Mingchen Li, Yongqi Chen, Samet Oymak · PDF
  6. Alpaca against Vicuna: Using LLMs to Uncover Memorization of LLMs

    Aly M. Kassem, Omar Mahmoud, Niloofar Mireshghallah, Hyunwoo Kim, Yulia Tsvetkov, Yejin Choi, Sherif Saad, Santu Rana · PDF
  7. An Adversarial Perspective on Machine Unlearning for AI Safety

    Jakub Łucki, Boyi Wei, Yangsibo Huang, Peter Henderson, Florian Tramèr, Javier Rando · PDF
  8. Attack Atlas: A Practitioner's Perspective on Challenges and Pitfalls in Red Teaming GenAI

    Ambrish Rawat, Stefan Schoepf, Giulio Zizzo, Giandomenico Cornacchia, Muhammad Zaid Hameed, Kieran Fraser, Erik Miehling, Beat Buesser, Elizabeth M. Daly, Mark Purcell, Prasanna Sattigeri, Pin-Yu Chen, Kush R. Varshney · PDF
  9. Between the Bars: Gradient-based Jailbreaks are Bugs that induce Features

    Kaivalya Hariharan, Uzay Girit · PDF
  10. Code-Switching Red-Teaming: LLM Evaluation for Safety and Multilingual Understanding

    Haneul Yoo, Yongjin Yang, Hwaran Lee · PDF
  11. CopyBench: Measuring Literal and Non-Literal Reproduction of Copyright-Protected Text in Language Model Generation

    Tong Chen, Akari Asai, Niloofar Mireshghallah, Sewon Min, James Grimmelmann, Yejin Choi, Hannaneh Hajishirzi, Luke Zettlemoyer, Pang Wei Koh · PDF
  12. Curiosity-driven Red teaming for Large Language Models

    Zhang-Wei Hong, Idan Shenfeld, Tsun-Hsuan Wang, Yung-Sung Chuang, Aldo Pareja, James R. Glass, Akash Srivastava, Pulkit Agrawal · PDF
  13. Decoding Biases: An Analysis of Automated Methods and Metrics for Gender Bias Detection in Language Models

    Shachi H. Kumar, Saurav Sahay, Sahisnu Mazumder, Eda Okur, Ramesh Manuvinakurike, Nicole Marie Beckage, Hsuan Su, Hung-yi Lee, Lama Nachman · PDF
  14. Decompose, Recompose, and Conquer: Multi-modal LLMs are Vulnerable to Compositional Adversarial Attacks in Multi-Image Queries

    Julius Broomfield, George Ingebretsen, Reihaneh Iranmanesh, Sara Pieri, Ethan Kosak-Hine, Tom Gibbs, Reihaneh Rabbany, Kellin Pelrine · PDF
  15. Diverse and Effective Red Teaming with Auto-generated Rewards and Multi-step Reinforcement Learning

    Alex Beutel, Kai Yuanqing Xiao, Johannes Heidecke, Lilian Weng · PDF
  16. Does Refusal Training in LLMs Generalize to the Past Tense?

    Maksym Andriushchenko, Nicolas Flammarion · PDF
  17. Failures to Find Transferable Image Jailbreaks Between Vision-Language Models

    Rylan Schaeffer, Dan Valentine, Luke Bailey, James Chua, Zane Durante, Cristobal Eyzaguirre, Joe Benton, Brando Miranda, Henry Sleight, Tony Tong Wang, John Hughes, Rajashree Agrawal, Mrinank Sharma, Scott Emmons, Sanmi Koyejo, Ethan Perez · PDF
  18. Forget to Flourish: Leveraging Machine-Unlearning on Pretrained Language Models for Privacy Leakage

    Md Rafi Ur Rashid, Jing Liu, Toshiaki Koike-Akino, Shagufta Mehnaz, Ye Wang · PDF
  19. iART - Imitation guided Automated Red Teaming

    Sajad Mousavi, Desik Rengarajan, Ashwin Ramesh Babu, Vineet Gundecha, Avisek Naug, Sahand Ghorbanpour, Ricardo Luna Gutierrez, Antonio Guillen, Paolo Faraboschi, Soumyendu Sarkar · PDF
  20. Infecting LLM Agents via Generalizable Adversarial Attack

    Weichen Yu, Kai Hu, Tianyu Pang, Chao Du, Min Lin, Matt Fredrikson · PDF
  21. Interactive Semantic Interventions for VLMs: Breaking VLMs with Human Ingenuity

    Lukas Klein, Kenza Amara, Carsten T. Lüth, Hendrik Strobelt, Mennatallah El-Assady, Paul F Jaeger · PDF
  22. Keep on Swimming: Real Attackers Only Need Partial Knowledge of a Multi-Model System

    Julian Collado, Kevin Stangl · PDF
  23. Large Language Model Detoxification: Data and Metric Solutions

    SungJoo Byun, Hyopil Shin · PDF
  24. Learning Diverse Attacks on Large Language Models for Robust Red-Teaming and Safety Tuning

    Seanie Lee, Minsu Kim, Lynn Cherif, David Dobre, Juho Lee, Sung Ju Hwang, Kenji Kawaguchi, Gauthier Gidel, Yoshua Bengio, Nikolay Malkin, Moksh Jain · PDF
  25. Lessons From Red Teaming 100 Generative AI Products

    Blake Bullwinkel, Amanda J. Minnich, Shiven Chawla, Gary David Lopez Munoz, Martin Pouliot, Whitney Maxwell, Joris de Gruyter, Katherine Pratt, Saphir Qi, Nina Chikanov, Roman Lutz, Raja Sekhar Rao Dheekonda, Bolor-Erdene Jagdagdorj, Rich Lundeen, Sam Vaughan, Victoria Westerhoff, Pete Bryan, Ram Shankar Siva Kumar, Yonatan Zunger, Mark Russinovich · PDF
  26. LLM Defenses Are Not Robust to Multi-Turn Human Jailbreaks Yet

    Nathaniel Li, Ziwen Han, Ian Steneker, Willow E. Primack, Riley Goodside, Hugh Zhang, Zifan Wang, Cristina Menghini, Summer Yue · PDF
  27. LLM-Assisted Red Teaming of Diffusion Models through "Failures Are Fated, But Can Be Faded"

    Som Sagar, Aditya Taparia, Ransalu Senanayake · PDF
  28. MedAIScout: Automated Retrieval of Known Machine Learning Vulnerabilities in Medical Applications

    Athish Pranav Dharmalingam, Gargi Mitra · PDF
  29. Plentiful Jailbreaks with String Compositions

    Brian R.Y. Huang · PDF
  30. Rethinking LLM Memorization through the Lens of Adversarial Compression

    Avi Schwarzschild, Zhili Feng, Pratyush Maini, Zachary Chase Lipton, J Zico Kolter · PDF
  31. SAGE-RT: Synthetic Alignment data Generation for Safety Evaluation and Red Teaming

    Anurakt Kumar, Divyanshu Kumar, Jatan Loya, Nitin Aravind Birur, Tanay Baswa, Sahil Agarwal, Prashanth Harshangi · PDF
  32. Semantic Membership Inference Attack against Large Language Models

    Hamid Mozaffari, Virendra Marathe · PDF
  33. SkewAct: Red Teaming Large Language Models via Activation-Skewed Adversarial Prompt Optimization

    Hanxi Guo, Siyuan Cheng, Guanhong Tao, Guangyu Shen, Zhuo Zhang, Shengwei An, Kaiyuan Zhang, Xiangyu Zhang · PDF
  34. Stability Evaluation of Large Language Models via Distributional Perturbation Analysis

    Jiashuo Liu, Jiajin Li, Peng Cui, Jose Blanchet · PDF
  35. Steganography in Large Language Models: Investigating Emergence and Mitigations

    Yohan Mathew, Robert McCarthy, Ollie Matthews, Joan Velja, Nandi Schoots, Dylan Cope · PDF
  36. Text-Diffusion Red-Teaming of Large Language Models: Unveiling Harmful Behaviors with Proximity Constraints

    Jonathan Nöther, Adish Singla, Goran Radanovic · PDF
  37. TOFU: A Task of Fictitious Unlearning for LLMs

    Pratyush Maini, Zhili Feng, Avi Schwarzschild, Zachary Chase Lipton, J Zico Kolter · PDF
  38. What Features in Prompts Jailbreak LLMs? Investigating the Mechanisms Behind Attacks

    Nathalie Maria Kirch, Severin Field, Stephen Casper · PDF