NeurIPS 2024, Past, Safety & alignment, Generative models
NeurIPS Safe Generative AI Workshop 2024
SafeGenAi
- Submission deadline: Oct 5, 2024, 08:00 UTC (imported from OpenReview; check the website for extensions)
- Submission portal: OpenReview
- Notes: Auto-imported from the OpenReview venue record on 2026-06-10. Please verify and enrich (topics are keyword-guessed).
Accepted papers (171)
Fetched from OpenReview (v2) on 2026-06-10.
-
$\textit{Who Speaks Matters}$: Analysing the Influence of the Speaker’s Ethnicity on Hate Classification
-
A Closer Look at System Message Robustness
-
A False Sense of Privacy: Evaluating Textual Data Sanitization Beyond Surface-level Privacy Leakage
-
A Probabilistic Generative Method for Safe Physical System Control Problems
-
A Three-Branch Checks-and-Balances Framework for Context-Aware Ethical Alignment of Large Language Models
-
Addressing Uncertainty in LLMs to Enhance Reliability in Generative AI
-
AdvBDGen: Adversarially Fortified Prompt-Specific Fuzzy Backdoor Generator Against LLM Alignment
-
Adversarial Prompt Evaluation: Systematic Benchmarking of Guardrails Against Prompt Input Attacks on LLMs
-
Adversarial Vulnerabilities in Large Language Models for Time Series Forecasting
-
AEGIS2.0: A Diverse AI Safety Dataset and Risks Taxonomy for Alignment of LLM Guardrails
-
AI Red Teaming through the Lens of Measurement Theory
-
An Examination of AI-Generated Text Detectors Across Multiple Domains and Models
-
An Undetectable Watermark for Generative Image Models
-
Anchored Optimization and Contrastive Revisions: Addressing Reward Hacking in Alignment
-
AnyPrefer: An Automatic Framework for Preference Data Synthesis
-
Applying Refusal-Vector Ablation to Llama 3.1 70B Agents
-
Applying Sparse Autoencoders to Unlearn Knowledge in Language Models
-
Auditing Empirical Privacy Protection of Private LLM Adaptations
-
Auto-Enhance: Towards a Meta-Benchmark to Evaluate AI Agents' Ability to Improve Other Agents
-
AutoDefense: Multi-Agent LLM Defense against Jailbreak Attacks
-
Buffer Overflow in Mixture of Experts
-
Can Editing LLMs Inject Harm?
-
Can Generative AI Solve Your In-Context Learning Problem? A Martingale Perspective
-
Can Knowledge Editing Really Correct Hallucinations?
-
Can LLMs Verify Arabic Claims? Evaluating the Arabic Fact-Checking Abilities of Multilingual LLMs
-
Can Safety Fine-Tuning Be More Principled? Lessons Learned from Cybersecurity
-
ChatBug: A Common Vulnerability of Aligned LLMs Induced by Chat Templates
-
Cheating Automatic LLM Benchmarks: Null Models Achieve High Win Rates
-
Choose Your Anchor Wisely: Effective Unlearning Diffusion Models via Concept Reconditioning
-
CodeUnlearn: Amortized Zero-Shot Machine Unlearning in Language Models Using Discrete Concept
-
Concept Denoising Score Matching for Responsible Text-to-Image Generation
-
Concept Unlearning for Large Language Models
-
Controllable Generation via Locally Constrained Resampling
-
CoS: Enhancing Personalization and Mitigating Bias with Context Steering
-
CPSample: Classifier Protected Sampling for Guarding Training Data During Diffusion
-
Cream: Consistency Regularized Self-Rewarding Language Models
-
Datasets for Navigating Sensitive Topics in Preference Data and Recommendations
-
Debiasing Large Vision-Language Models by Ablating Protected Attribute Representations
-
Decoding Diffusion: A Scalable Framework for Unsupervised Analysis of Latent Space Biases and Representations Using Natural Language Prompts
-
DeepInception: Hypnotize Large Language Model to Be Jailbreaker
-
Designing Physical-World Universal Attacks on Vision Transformers
-
Detecting Origin Attribution for Text-to-Image Diffusion Models in RGB and Beyond
-
Differential Privacy of Cross-Attention with Provable Guarantee
-
Differentially Private Attention Computation
-
Differentially Private Sequential Data Synthesis with Structured State Space Models and Diffusion Models
-
DiffTextPure: Defending Large Language Models with Diffusion Purifiers
-
Do LLMs estimate uncertainty well in instruction-following?
-
Does Refusal Training in LLMs Generalize to the Past Tense?
-
Does Safety Training of LLMs Generalize to Semantically Related Natural Prompts?
-
Dynamic Negative Guidance of Diffusion Models: Towards Immediate Content Removal
-
EchoQA: A Large Collection of Instruction Tuning Data for Echocardiogram Reports
-
Efficient and Effective Uncertainty Quantification for LLMs
-
Efficiently Identifying Watermarked Segments in Mixed-Source Texts
-
Energy-Based Conceptual Diffusion Model
-
EnsemW2S: Can an Ensemble of LLMs be Leveraged to Obtain a Stronger LLM?
-
Epistemic Integrity in Large Language Models
-
Exploring Memorization and Copyright Violation in Frontier LLMs: A Study of the New York Times v. OpenAI 2023 Lawsuit
-
Extracting Unlearned Information from LLMs with Activation Steering
-
Fair Image Generation from Pre-trained Models by Probabilistic Modeling
-
Fine-Tuning Large Language Models to Appropriately Abstain with Semantic Entropy
-
Gaussian Splatting Under Attack: Investigating Adversarial Noise in 3D Objects
-
GenAudit: Fixing Factual Errors in Language Model Outputs with Evidence
-
GRE Score: Generative Risk Evaluation for Large Language Models
-
GuardFormer: Guardrail Instruction Pretraining for Efficient SafeGuarding
-
H-Space Sparse Autoencoders
-
Hallucination Detox: Sensitive Neuron Dropout (SeND) for Large Language Model Training
-
HarmLevelBench: Evaluating Harm-Level Compliance and the Impact of Quantization on Model Alignment
-
Has My System Prompt Been Used? Large Language Model Prompt Membership Inference
-
HEARTS: A Holistic Framework for Explainable, Sustainable and Robust Text Stereotype Detection
-
Hidden in Plain Text: Emergence & Mitigation of Steganographic Collusion in LLMs
-
Hidden in the Noise: Two-Stage Robust Watermarking for Images
-
Honesty to Subterfuge: In-Context Reinforcement Learning Can Make Honest Models Reward Hack
-
How Easy is It to Fool Your Multimodal LLMs? An Empirical Analysis on Deceptive Prompt
-
How Many Van Goghs Does It Take to Van Gogh? Finding the Imitation Threshold
-
How new data pollutes LLM knowledge and how to dilute it
-
How to Determine the Preferred Image Distribution of a Black-Box Vision-Language Model?
-
HyperFace: Generating Synthetic Face Recognition Datasets by Exploring Face Embedding Hypersphere
-
Identifying and Addressing Delusions for Target-Directed Decision Making
-
Imitation guided Automated Red Teaming
-
Improving LLM Group Fairness on Tabular Data via In-Context Learning
-
IncogniText: Privacy-enhancing Conditional Text Anonymization via LLM-based Private Attribute Randomization
-
Inference, Fast and Slow: Reinterpreting VAEs for OOD Detection
-
Insights on Disagreement Patterns in Multimodal Safety Perception across Diverse Rater Groups
-
Instructional Segment Embedding: Improving LLM Safety with Instruction Hierarchy
-
Integrating Object Detection Modality into Visual Language Model for Enhanced Autonomous Driving Agent
-
Interactive Semantic Interventions for VLMs: A Human-in-the-Loop Investigation of VLM Failure
-
Interpretability of LLM Deception: Universal Motif
-
Investigating Annotator Bias in Large Language Models for Hate Speech Detection
-
Investigating Implicit Bias in Large Language Models: A Large-Scale Study of Over 50 LLMs
-
Investigating LLM Memorization: Bridging Trojan Detection and Training Data Extraction
-
Is What You Ask For What You Get? Investigating Concept Associations in Text-to-Image Models
-
Is Your Paper Being Reviewed by an LLM? Investigating AI Text Detectability in Peer Review
-
Jogging the Memory of Unlearned LLMs Through Targeted Relearning Attacks
-
Just rephrase it! Uncertainty estimation in closed-source language models via multiple rephrased queries
-
Justice or Prejudice? Quantifying Biases in LLM-as-a-Judge
-
Keep on Swimming: Real Attackers Only Need Partial Knowledge of a Multi-Model System
-
Language Models Can Articulate Their Implicit Goals
-
Large Language Model Benchmarks Do Not Test Reliability
-
Lexically-constrained automated prompt augmentation: A case study using adversarial T2I data
-
LLM Improvement for Jailbreak Defense: Analysis Through the Lens of Over-Refusal
-
LLM Targeted Underperformance Disproportionately Impacts Vulnerable Users
-
LoReUn: Data Itself Implicitly Provides Cues to Improve Machine Unlearning
-
Measuring Steerability in Large Language Models
-
MED: Exploring LLM Memorization of Encrypted Data
-
Memorization Detection Benchmark for Generative Image models
-
miniCodeProps: a Minimal Benchmark for Proving Code Properties
-
Mitigating Hallucinations in LVLMs via Summary-Guided Decoding
-
Mitigating Object Hallucination in Large Vision-Language Models via Image-Grounded Guidance
-
Mix Data or Merge Models? Optimizing for Performance and Safety in Multilingual Contexts
-
MMed-RAG: Versatile Multimodal RAG System for Medical Vision Language Models
-
MMLU-Pro+: Evaluating Higher-Order Reasoning and Shortcut Learning in LLMs
-
Model Editing as a Robust and Denoised variant of DPO: A Case Study on Toxicity
-
Model Manipulation Attacks Enable More Rigorous Evaluations of LLM Capabilities
-
Model Pairing Using Embedding Translation for Backdoor Attack Detection on Open-Set Classification Tasks
-
MU-Bench: A Multitask Multimodal Benchmark for Machine Unlearning
-
MultiVerse: Exposing Large Language Model Alignment Problems in Diverse Worlds
-
Network Inversion for Training-Like Data Reconstruction
-
NMT-Obfuscator Attack: Ignore a sentence in translation with only one word
-
On a Spurious Interaction between Uncertainty Scores and Answer Evaluation Metrics in Generative QA Tasks
-
On Calibration of LLM-based Guard Models for Reliable Content Moderation
-
Permute-and-Flip: An optimally stable and watermarkable decoder for LLMs
-
PoisonedParrot: Subtle Data Poisoning Attacks to Elicit Copyright-Infringing Content from Large Language Models
-
PopAlign: Population-Level Alignment for Fair Text-to-Image Generation
-
Pre-Training Multimodal Hallucination Detectors with Corrupted Grounding Data
-
Preserving Safety in Fine-Tuned Large Language Models: A Systematic Evaluation and Mitigation Strategy
-
Privacy Protection in Personalized Diffusion Models via Targeted Cross-Attention Adversarial Attack
-
Privacy-Preserving Large Language Model Inference via GPU-Accelerated Fully Homomorphic Encryption
-
Pruning for Robust Concept Erasing in Diffusion Models
-
Red Teaming Language-Conditioned Robot Models via Vision Language Models
-
Refusal Tokens: A Simple Way to Calibrate Refusals in Large Language Models
-
Representation Collapsing Problems in Vector Quantization
-
Retention Score: Quantifying Jailbreak Risks for Vision Language Models
-
Rethinking Adversarial Attacks as Protection Against Diffusion-based Mimicry
-
RLHS: Mitigating Misalignment in RLHF with Hindsight Simulation
-
Rule-Guided Language Model Alignment for Text Generation Management in Industrial Use Cases
-
Safe and Sound: Evaluating Language Models for Bias Mitigation and Understanding
-
Safe Decision Transformer with Learning-based Constraints
-
Safety-Aware Fine-Tuning of Large Language Models
-
SCIURus: Shared Circuits for Interpretable Uncertainty Representations in Language Models
-
Self-Preference Bias in LLM-as-a-Judge
-
Self-Supervised Bisimulation Action Chunk Representation for Efficient RL
-
Semantic Membership Inference Attack against Large Language Models
-
Shallow Diffuse: Robust and Invisible Watermarking through Low-Dimensional Subspaces in Diffusion Models
-
Simplicity Prevails: Rethinking Negative Preference Optimization for LLM Unlearning
-
Simulation System Towards Solving Societal-Scale Manipulation
-
Smoothed Embeddings for Robust Language Models
-
SolidMark: Evaluating Image Memorization in Generative Models
-
Steering Without Side Effects: Improving Post-Deployment Control of Language Models
-
Stronger Universal and Transfer Attacks by Suppressing Refusals
-
Targeted Unlearning with Single Layer Unlearning Gradient
-
Testing the Limits of Jailbreaking Defenses with the Purple Problem
-
The effect of fine-tuning on language model toxicity
-
The Empirical Impact of Data Sanitization on Language Models
-
The Impact of Inference Acceleration Strategies on Bias of Large Language Models
-
The Probe Paradigm: A Theoretical Foundation for Explaining Generative Models
-
The Structural Safety Generalization Problem
-
Token Highlighter: Inspecting and Mitigating Jailbreak Prompts for Large Language Models
-
Towards a Theory of AI Personhood
-
Towards Inference-time Category-wise Safety Steering for Large Language Models
-
Towards Resource Efficient and Interpretable Bias Mitigation in Natural Language Generation
-
Towards Safe and Honest AI Agents with Neural Self-Other Overlap
-
Towards Scalable Exact Machine Unlearning Using Parameter-Efficient Fine-Tuning
-
Universal Jailbreak Backdoors in Large Language Model Alignment
-
Unlearning in- vs. out-of-distribution data in LLMs under gradient-based methods
-
Variational Diffusion Unlearning: a variational inference framework for unlearning in diffusion models
-
Waste Not, Want Not; Recycled Gumbel Noise Improves Consistency in Natural Language Generation
-
Weak-to-Strong Confidence Prediction
-
What do we learn from inverting CLIP models?
-
What You See Is What You Get: Entity-Aware Summarization for Reliable Sponsored Search
-
Which LLMs are Difficult to Detect? A Detailed Analysis of Potential Factors Contributing to Difficulties in LLM Text Detection
-
Zer0-Jack: A memory-efficient gradient-based jailbreaking method for black box Multi-modal Large Language Models