ICML 2024 · Past · Safety & alignment · Generative models
ICML 2024 Next Generation of AI Safety Workshop
NextGenAISafety 2024
- Submission deadline: May 31, 2024, 12:59 UTC (imported from OpenReview; check the website for extensions)
- Submission portal: OpenReview
- Notes: Auto-imported from the OpenReview venue record on 2026-06-10. Please verify and enrich (topics are keyword-guessed).
Accepted papers (93)
Fetched from OpenReview (v2) on 2026-06-10.
- $\nabla \tau$: Gradient-based and Task-Agnostic Machine Unlearning
- A Geometric Framework for Understanding Memorization in Generative Models
- A Sim2Real Approach for Identifying Task-Relevant Properties in Interpretable Machine Learning
- A statistical framework for weak-to-strong generalization
- Accuracy on the wrong line: On the pitfalls of noisy data for OOD generalisation
- AdaptiveBackdoor: Backdoored Language Model Agents that Detect Human Overseers
- Adversarial Robustness Limits via Scaling-Law and Human-Alignment Studies
- Adversarial Training with Synthesized Data: A Path to Robust and Generalizable Neural Networks
- AI Agents with Formal Security Guarantees
- AI Alignment with Changing and Influenceable Reward Functions
- Alignment Calibration: Machine Unlearning for Contrastive Learning under Auditing
- AssistanceZero: Scalably Solving Assistance Games
- Attacking Large Language Models with Projected Gradient Descent
- Automatic Jailbreaking of the Text-to-Image Generative AI Systems
- Automatic Pseudo-Harmful Prompt Generation for Evaluating False Refusals in Large Language Models
- BELLS: A Framework Towards Future Proof Benchmarks for the Evaluation of LLM Safeguards
- Bias Transmission in Large Language Models: Evidence from Gender-Occupation Bias in GPT-4
- Black-Box Detection of Language Model Watermarks
- Can Editing LLMs Inject Harm?
- Can Go AIs be adversarially robust?
- Can Language Models Safeguard Themselves, Instantly and For Free?
- Can Watermarking Large Language Models Prevent Copyrighted Text Generation and Hide Training Data?
- Cascade Reward Sampling for Efficient Decoding-Time Alignment
- Catastrophic Goodhart: regularizing RLHF with KL divergence does not mitigate heavy-tailed reward misspecification
- Certifiably Robust RAG against Retrieval Corruption
- Certified Robustness in NLP Under Bounded Levenshtein Distance
- Chained Tuning Leads to Biased Forgetting
- Consistency Checks for Language Model Forecasters
- ContextCite: Attributing Model Generation to Context
- CoSy: Evaluating Textual Explanations of Neurons
- Deciphering the Definition of Adversarial Robustness for post-hoc OOD Detectors
- Decomposed evaluations of geographic disparities in text-to-image models
- DiffusionGuard: A Robust Defense Against Malicious Diffusion-based Image Editing
- Distillation based Robustness Verification with PAC Guarantees
- DiveR-CT: Diversity-enhanced Red Teaming with Relaxing Constraints
- Efficient Differentially Private Fine-Tuning of Diffusion Models
- Eliciting Black-Box Representations from LLMs through Self-Queries
- Enhancing Concept-based Learning with Logic
- Enhancing the Resilience of LLMs Against Grey-box Extractions
- Ethical-Lens: Curbing Malicious Usages of Open-Source Text-to-Image Models
- Explaining the Model, Protecting Your Data: Revealing and Mitigating the Data Privacy Risks of Post-Hoc Model Explanations via Membership Inference
- Exploiting LLM Quantization
- Exploring Scaling Trends in LLM Robustness
- Fairness Through Controlled (Un)Awareness in Node Embeddings
- Fairness through partial awareness: Evaluation of the addition of demographic information for bias mitigation methods
- FairPFN: Transformers Can do Counterfactual Fairness
- Generated Audio Detectors are Not Robust in Real-World Conditions
- Generating Potent Poisons and Backdoors from Scratch with Guided Diffusion
- Gone With the Bits: Benchmarking Bias in Facial Phenotype Degradation Under Low-Rate Neural Compression
- Hummer: Towards Limited Competitive Preference Dataset
- Improved Few-Shot Jailbreaking Can Circumvent Aligned Language Models and Their Defenses
- Improving the Efficiency of Self-Supervised Adversarial Training through Latent Clustering-based Selection
- In-Context Learning, Can It Break Safety?
- Is ChatGPT Transforming Academics' Writing Style?
- Is My Data Safe? Predicting Instance-Level Membership Inference Success for White-box and Black-box Attacks
- JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models
- Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks
- Large Language Models as Misleading Assistants in Conversation
- Leveraging Multi-Color Spaces as a Defense Mechanism Against Model Inversion Attack
- Lottery Ticket Adaptation: Mitigating Destructive Interference in LLMs
- Manipulating Feature Visualizations with Gradient Slingshots
- Marginal Fairness Sliced Wasserstein Barycenter
- Measuring Goal-Directedness
- Medical Unlearnable Examples: Securing Medical Data from Unauthorized Training via Sparsity-Aware Local Masking
- Mimicking User Data: On Mitigating Fine-Tuning Risks in Closed Large Language Models
- Models That Prove Their Own Correctness
- Neural Interactive Proofs
- On the Calibration of Conditional-Value-at-Risk
- On the Robustness of Neural Networks Quantization against Data Poisoning Attacks
- One-Shot Safety Alignment for Large Language Models via Optimal Dualization
- Open LLMs are Necessary for Private Adaptations and Outperform their Closed Alternatives
- OxonFair: A Flexible Toolkit for Algorithmic Fairness
- POST: A Framework for Privacy of Soft-prompt Transfer
- PrimeGuard: Safe and Helpful LLMs through Tuning-Free Routing
- Privacy Auditing of Large Language Models
- Private Attribute Inference from Images with Vision-Language Models
- ProFeAT: Projected Feature Adversarial Training for Self-Supervised Learning of Robust Representations
- Robust CLIP: Unsupervised Adversarial Fine-Tuning of Vision Embeddings for Robust Large Vision-Language Models
- Robust Knowledge Unlearning via Mechanistic Localizations
- Robustness Analysis of AI Models in Critical Energy Systems
- Rule Based Rewards for Fine-Grained LLM Safety
- Safer Reinforcement Learning by Going Off-policy: a Benchmark
- Talking Nonsense: Probing Large Language Models' Understanding of Adversarial Gibberish Inputs
- Towards Adaptive Attacks on Constrained Tabular Machine Learning
- Towards Adversarially Robust Vision-Language Models: Insights from Design Choices and Prompt Formatting Techniques
- Towards Safe Large Language Models for Medicine
- Tree of Attacks: Jailbreaking Black-Box LLMs Automatically
- Uncovering a Culture of AI Grassroots Experimentation by Boston City Employees: Safety Risks and Mitigation
- Unfamiliar Finetuning Examples Control How Language Models Hallucinate
- Using Large Language Models for Humanitarian Frontline Negotiation: Opportunities and Considerations
- Weak-to-Strong Jailbreaking on Large Language Models
- Why Has Predicting Downstream Capabilities of Frontier AI Models with Scale Remained Elusive?
- WildTeaming at Scale: From In-the-Wild Jailbreaks to (Adversarially) Safer Language Models