NeurIPS 2024 Past Safety & alignment
Red Teaming GenAI: What Can We Learn from Adversaries?
Red Teaming GenAI Workshop @ NeurIPS'24
- Submission deadline: Sep 21, 2024, 21:59 UTC (imported from OpenReview; check the website for extensions)
- Submission portal: OpenReview
- Notes: Auto-imported from the OpenReview venue record on 2026-06-10. Please verify and enrich (topics are keyword-guessed).
Accepted papers (38)
Fetched from OpenReview (v2) on 2026-06-10.
- A Formal Framework for Assessing and Mitigating Emergent Security Risks in Generative AI Models: Bridging Theory and Dynamic Risk Mitigation
- A Realistic Threat Model for Large Language Model Jailbreaks
- Advancing Adversarial Suffix Transfer Learning on Aligned Large Language Models
- Adversarial Negotiation Dynamics in Generative Language Models
- Algorithmic Oversight for Deceptive Reasoning
- Alpaca against Vicuna: Using LLMs to Uncover Memorization of LLMs
- An Adversarial Perspective on Machine Unlearning for AI Safety
- Attack Atlas: A Practitioner's Perspective on Challenges and Pitfalls in Red Teaming GenAI
- Between the Bars: Gradient-based Jailbreaks are Bugs that induce Features
- Code-Switching Red-Teaming: LLM Evaluation for Safety and Multilingual Understanding
- CopyBench: Measuring Literal and Non-Literal Reproduction of Copyright-Protected Text in Language Model Generation
- Curiosity-driven Red teaming for Large Language Models
- Decoding Biases: An Analysis of Automated Methods and Metrics for Gender Bias Detection in Language Models
- Decompose, Recompose, and Conquer: Multi-modal LLMs are Vulnerable to Compositional Adversarial Attacks in Multi-Image Queries
- Diverse and Effective Red Teaming with Auto-generated Rewards and Multi-step Reinforcement Learning
- Does Refusal Training in LLMs Generalize to the Past Tense?
- Failures to Find Transferable Image Jailbreaks Between Vision-Language Models
- Forget to Flourish: Leveraging Machine-Unlearning on Pretrained Language Models for Privacy Leakage
- iART - Imitation guided Automated Red Teaming
- Infecting LLM Agents via Generalizable Adversarial Attack
- Interactive Semantic Interventions for VLMs: Breaking VLMs with Human Ingenuity
- Keep on Swimming: Real Attackers Only Need Partial Knowledge of a Multi-Model System
- Large Language Model Detoxification: Data and Metric Solutions
- Learning Diverse Attacks on Large Language Models for Robust Red-Teaming and Safety Tuning
- Lessons From Red Teaming 100 Generative AI Products
- LLM Defenses Are Not Robust to Multi-Turn Human Jailbreaks Yet
- LLM-Assisted Red Teaming of Diffusion Models through "Failures Are Fated, But Can Be Faded"
- MedAIScout: Automated Retrieval of Known Machine Learning Vulnerabilities in Medical Applications
- Plentiful Jailbreaks with String Compositions
- Rethinking LLM Memorization through the Lens of Adversarial Compression
- SAGE-RT: Synthetic Alignment data Generation for Safety Evaluation and Red Teaming
- Semantic Membership Inference Attack against Large Language Models
- SkewAct: Red Teaming Large Language Models via Activation-Skewed Adversarial Prompt Optimization
- Stability Evaluation of Large Language Models via Distributional Perturbation Analysis
- Steganography in Large Language Models: Investigating Emergence and Mitigations
- Text-Diffusion Red-Teaming of Large Language Models: Unveiling Harmful Behaviors with Proximity Constraints
- TOFU: A Task of Fictitious Unlearning for LLMs
- What Features in Prompts Jailbreak LLMs? Investigating the Mechanisms Behind Attacks