ICML 2024 · Past · Safety & alignment · Generative models

ICML 2024 Next Generation of AI Safety Workshop

NextGenAISafety 2024

Submission deadline: May 31, 2024, 12:59 UTC (imported from OpenReview; check the website for extensions)
Submission portal: OpenReview
Notes
Auto-imported from the OpenReview venue record on 2026-06-10 — please verify and enrich (topics are keyword-guessed).

Accepted papers (93)

Fetched from OpenReview (v2) on 2026-06-10.

  1. $\nabla \tau$: Gradient-based and Task-Agnostic Machine Unlearning

    Daniel Trippa, Cesare Campagnano, Maria Sofia Bucarelli, Gabriele Tolomei, Fabrizio Silvestri · PDF
  2. A Geometric Framework for Understanding Memorization in Generative Models

    Brendan Leigh Ross, Hamidreza Kamkari, Zhaoyan Liu, Tongzi Wu, George Stein, Gabriel Loaiza-Ganem, Jesse C. Cresswell · PDF
  3. A Sim2Real Approach for Identifying Task-Relevant Properties in Interpretable Machine Learning

    Eura Nofshin, Esther Brown, Brian Lim, Weiwei Pan, Finale Doshi-Velez · PDF
  4. A statistical framework for weak-to-strong generalization

    Seamus Somerstep, Felipe Maia Polo, Moulinath Banerjee, Yaacov Ritov, Mikhail Yurochkin, Yuekai Sun · PDF
  5. Accuracy on the wrong line: On the pitfalls of noisy data for OOD generalisation

    Amartya Sanyal, Yaxi Hu, Yaodong Yu, Yian Ma, Yixin Wang, Bernhard Schölkopf · PDF
  6. AdaptiveBackdoor: Backdoored Language Model Agents that Detect Human Overseers

    Heng Wang, Ruiqi Zhong, Jiaxin Wen, Jacob Steinhardt · PDF
  7. Adversarial Robustness Limits via Scaling-Law and Human-Alignment Studies

    Brian R. Bartoldson, James Diffenderfer, Konstantinos Parasyris, Bhavya Kailkhura · PDF
  8. Adversarial Training with Synthesized Data: A Path to Robust and Generalizable Neural Networks

    Reza Bayat, Irina Rish · PDF
  9. AI Agents with Formal Security Guarantees

    Mislav Balunovic, Luca Beurer-Kellner, Marc Fischer, Martin Vechev · PDF
  10. AI Alignment with Changing and Influenceable Reward Functions

    Micah Carroll, Davis Foote, Anand Siththaranjan, Stuart Russell, Anca Dragan · PDF
  11. Alignment Calibration: Machine Unlearning for Contrastive Learning under Auditing

    Yihan Wang, Yiwei Lu, Guojun Zhang, Franziska Boenisch, Adam Dziedzic, Yaoliang Yu, Xiao-Shan Gao · PDF
  12. AssistanceZero: Scalably Solving Assistance Games

    Cassidy Laidlaw, Eli Bronstein, Timothy Guo, Dylan Feng, Lukas Berglund, Justin Svegliato, Stuart Russell, Anca Dragan · PDF
  13. Attacking Large Language Models with Projected Gradient Descent

    Simon Geisler, Tom Wollschläger, M. H. I. Abdalla, Johannes Gasteiger, Stephan Günnemann · PDF
  14. Automatic Jailbreaking of the Text-to-Image Generative AI Systems

    Minseon Kim, Hyomin Lee, Boqing Gong, Huishuai Zhang, Sung Ju Hwang · PDF
  15. Automatic Pseudo-Harmful Prompt Generation for Evaluating False Refusals in Large Language Models

    Bang An, Sicheng Zhu, Ruiyi Zhang, Michael-Andrei Panaitescu-Liess, Yuancheng Xu, Furong Huang · PDF
  16. BELLS: A Framework Towards Future Proof Benchmarks for the Evaluation of LLM Safeguards

    Diego Dorn, Alexandre Variengien, Charbel-Raphael Segerie, Vincent Corruble · PDF
  17. Bias Transmission in Large Language Models: Evidence from Gender-Occupation Bias in GPT-4

    Kirsten Morehouse, Weiwei Pan, Juan Manuel Contreras, Mahzarin R. Banaji · PDF
  18. Black-Box Detection of Language Model Watermarks

    Thibaud Gloaguen, Nikola Jovanović, Robin Staab, Martin Vechev · PDF
  19. Can Editing LLMs Inject Harm?

    Canyu Chen, Baixiang Huang, Zekun Li, Zhaorun Chen, Shiyang Lai, Xiongxiao Xu, Jia-Chen Gu, Jindong Gu, Huaxiu Yao, Chaowei Xiao, Xifeng Yan, William Yang Wang, Philip Torr, Dawn Song, Kai Shu · PDF
  20. Can Go AIs be adversarially robust?

    Tom Tseng, Euan McLean, Kellin Pelrine, Tony Tong Wang, Adam Gleave · PDF
  21. Can Language Models Safeguard Themselves, Instantly and For Free?

    Dyah Adila, Changho Shin, Yijing Zhang, Frederic Sala · PDF
  22. Can Watermarking Large Language Models Prevent Copyrighted Text Generation and Hide Training Data?

    Michael-Andrei Panaitescu-Liess, Zora Che, Bang An, Yuancheng Xu, Pankayaraj Pathmanathan, Souradip Chakraborty, Sicheng Zhu, Tom Goldstein, Furong Huang · PDF
  23. Cascade Reward Sampling for Efficient Decoding-Time Alignment

    Bolian Li, Yifan Wang, Ananth Grama, Ruqi Zhang · PDF
  24. Catastrophic Goodhart: regularizing RLHF with KL divergence does not mitigate heavy-tailed reward misspecification

    Thomas Kwa, Drake Thomas, Adrià Garriga-Alonso · PDF
  25. Certifiably Robust RAG against Retrieval Corruption

    Chong Xiang, Tong Wu, Zexuan Zhong, David Wagner, Danqi Chen, Prateek Mittal · PDF
  26. Certified Robustness in NLP Under Bounded Levenshtein Distance

    Elias Abad Rocamora, Grigorios Chrysos, Volkan Cevher · PDF
  27. Chained Tuning Leads to Biased Forgetting

    Megan Ung, Alicia Yi Sun, Samuel Bell, Levent Sagun, Adina Williams · PDF
  28. Consistency Checks for Language Model Forecasters

    Abhimanyu Pallavi Sudhir, Alejandro Alvarez, Adam Shen, Daniel Paleka · PDF
  29. ContextCite: Attributing Model Generation to Context

    Benjamin Cohen-Wang, Harshay Shah, Kristian Georgiev, Aleksander Madry · PDF
  30. CoSy: Evaluating Textual Explanations of Neurons

    Laura Kopf, Philine Lou Bommer, Anna Hedström, Sebastian Lapuschkin, Marina MC Höhne, Kirill Bykov · PDF
  31. Deciphering the Definition of Adversarial Robustness for post-hoc OOD Detectors

    Peter Lorenz, Mario Ruben Fernandez, Jens Müller, Ullrich Koethe · PDF
  32. Decomposed evaluations of geographic disparities in text-to-image models

    Abhishek Sureddy, Dishant Padalia, Nandhinee Periyakaruppan, Oindrila Saha, Adina Williams, Adriana Romero-Soriano, Megan Richards, Polina Kirichenko, Melissa Hall · PDF
  33. DiffusionGuard: A Robust Defense Against Malicious Diffusion-based Image Editing

    June Suk Choi, Kyungmin Lee, Jongheon Jeong, Saining Xie, Jinwoo Shin, Kimin Lee · PDF
  34. Distillation based Robustness Verification with PAC Guarantees

    Patrick Indri, Peter Blohm, Anagha Athavale, Ezio Bartocci, Georg Weissenbacher, Matteo Maffei, Dejan Nickovic, Thomas Gärtner, Sagar Malhotra · PDF
  35. DiveR-CT: Diversity-enhanced Red Teaming with Relaxing Constraints

    Andrew Zhao, Quentin Xu, Matthieu Lin, Shenzhi Wang, Yong-jin Liu, Zilong Zheng, Gao Huang · PDF
  36. Efficient Differentially Private Fine-Tuning of Diffusion Models

    Jing Liu, Andrew Lowy, Toshiaki Koike-Akino, Kieran Parsons, Ye Wang · PDF
  37. Eliciting Black-Box Representations from LLMs through Self-Queries

    Dylan Sam, Marc Anton Finzi · PDF
  38. Enhancing Concept-based Learning with Logic

    Deepika Vemuri, Gautham Bellamkonda, Vineeth N. Balasubramanian · PDF
  39. Enhancing the Resilience of LLMs Against Grey-box Extractions

    Hanbo Huang, Yihan Li, Bowen Jiang, Bo Jiang, Lin Liu, Zhuotao Liu, Ruoyu Sun, Shiyu Liang · PDF
  40. Ethical-Lens: Curbing Malicious Usages of Open-Source Text-to-Image Models

    Yuzhu Cai, Sheng Yin, Yuxi Wei, Chenxin Xu, Weibo Mao, Felix Juefei-Xu, Siheng Chen, Yanfeng Wang · PDF
  41. Explaining the Model, Protecting Your Data: Revealing and Mitigating the Data Privacy Risks of Post-Hoc Model Explanations via Membership Inference

    Catherine Huang, Martin Pawelczyk, Himabindu Lakkaraju · PDF
  42. Exploiting LLM Quantization

    Kazuki Egashira, Mark Vero, Robin Staab, Jingxuan He, Martin Vechev · PDF
  43. Exploring Scaling Trends in LLM Robustness

    Nikolaus H. R. Howe, Michał Zając, Ian R. McKenzie, Oskar John Hollinsworth, Pierre-Luc Bacon, Adam Gleave · PDF
  44. Fairness Through Controlled (Un)Awareness in Node Embeddings

    Dennis Vetter, Jasper Forth, Gemma Roig, Holger Dell · PDF
  45. Fairness through partial awareness: Evaluation of the addition of demographic information for bias mitigation methods

    Chung Peng Lee, Rachel Hong, Jamie Heather Morgenstern · PDF
  46. FairPFN: Transformers Can do Counterfactual Fairness

    Jake Robertson, Noah Hollmann, Noor Awad, Frank Hutter · PDF
  47. Generated Audio Detectors are Not Robust in Real-World Conditions

    Soumya Shaw, Ben Nassi, Lea Schönherr · PDF
  48. Generating Potent Poisons and Backdoors from Scratch with Guided Diffusion

    Hossein Souri, Arpit Bansal, Hamid Kazemi, Liam H Fowl, Aniruddha Saha, Jonas Geiping, Andrew Gordon Wilson, Rama Chellappa, Tom Goldstein, Micah Goldblum · PDF
  49. Gone With the Bits: Benchmarking Bias in Facial Phenotype Degradation Under Low-Rate Neural Compression

    Tian Qiu, Arjun Nichani, Rasta Tadayon, Haewon Jeong · PDF
  50. Hummer: Towards Limited Competitive Preference Dataset

    Li Jiang, Yusen Wu, Junwu Xiong, Jingqing Ruan, Qingpei Guo, Zujie Wen, Jun Zhou, Xiaotie Deng · PDF
  51. Improved Few-Shot Jailbreaking Can Circumvent Aligned Language Models and Their Defenses

    Xiaosen Zheng, Tianyu Pang, Chao Du, Qian Liu, Jing Jiang, Min Lin · PDF
  52. Improving the Efficiency of Self-Supervised Adversarial Training through Latent Clustering-based Selection

    Somrita Ghosh, Yuelin Xu, Xiao Zhang · PDF
  53. In-Context Learning, Can It Break Safety?

    Sophie Xhonneux, David Dobre, Michael Noukhovitch, Jian Tang, Gauthier Gidel, Dhanya Sridhar · PDF
  54. Is ChatGPT Transforming Academics' Writing Style?

    Mingmeng Geng, Roberto Trotta · PDF
  55. Is My Data Safe? Predicting Instance-Level Membership Inference Success for White-box and Black-box Attacks

    Tobias Leemann, Bardh Prenkaj, Gjergji Kasneci · PDF
  56. JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models

    Patrick Chao, Edoardo Debenedetti, Alexander Robey, Maksym Andriushchenko, Francesco Croce, Vikash Sehwag, Edgar Dobriban, Nicolas Flammarion, George J. Pappas, Florian Tramèr, Hamed Hassani, Eric Wong · PDF
  57. Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks

    Maksym Andriushchenko, Francesco Croce, Nicolas Flammarion · PDF
  58. Large Language Models as Misleading Assistants in Conversation

    Betty Li Hou, Kejian Shi, Jason Phang, James Aung, Steven Adler, Rosie Campbell · PDF
  59. Leveraging Multi-Color Spaces as a Defense Mechanism Against Model Inversion Attack

    Sofiane Ouaari, Ali Burak Ünal, Mete Akgün, Nico Pfeifer · PDF
  60. Lottery Ticket Adaptation: Mitigating Destructive Interference in LLMs

    Ashwinee Panda, Berivan Isik, Xiangyu Qi, Sanmi Koyejo, Tsachy Weissman, Prateek Mittal · PDF
  61. Manipulating Feature Visualizations with Gradient Slingshots

    Dilyara Bareeva, Marina MC Höhne, Alexander Warnecke, Lukas Pirch, Klaus Robert Muller, Konrad Rieck, Kirill Bykov · PDF
  62. Marginal Fairness Sliced Wasserstein Barycenter

    Khai Nguyen, Hai Nguyen, Nhat Ho · PDF
  63. Measuring Goal-Directedness

    Matt MacDermott, James Fox, Francesco Belardinelli, Tom Everitt · PDF
  64. Medical Unlearnable Examples: Securing Medical Data from Unauthorized Training via Sparsity-Aware Local Masking

    Weixiang Sun, Yixin Liu, Zhiling Yan, Kaidi Xu, Lichao Sun · PDF
  65. Mimicking User Data: On Mitigating Fine-Tuning Risks in Closed Large Language Models

    Francisco Eiras, Aleksandar Petrov, Philip Torr, M. Pawan Kumar, Adel Bibi · PDF
  66. Models That Prove Their Own Correctness

    Noga Amit, Shafi Goldwasser, Orr Paradise, Guy N. Rothblum · PDF
  67. Neural Interactive Proofs

    Lewis Hammond, Sam Adam-Day · PDF
  68. On the Calibration of Conditional-Value-at-Risk

    Rajeev Verma, Volker Fischer, Eric Nalisnick · PDF
  69. On the Robustness of Neural Networks Quantization against Data Poisoning Attacks

    Yiwei Lu, Yihan Wang, Guojun Zhang, Yaoliang Yu · PDF
  70. One-Shot Safety Alignment for Large Language Models via Optimal Dualization

    Xinmeng Huang, Shuo Li, Edgar Dobriban, Osbert Bastani, Hamed Hassani, Dongsheng Ding · PDF
  71. Open LLMs are Necessary for Private Adaptations and Outperform their Closed Alternatives

    Vincent Hanke, Tom Blanchard, Franziska Boenisch, Iyiola Emmanuel Olatunji, Michael Backes, Adam Dziedzic · PDF
  72. OxonFair: A Flexible Toolkit for Algorithmic Fairness

    Eoin D. Delaney, Zihao Fu, Sandra Wachter, Brent Mittelstadt, Chris Russell · PDF
  73. POST: A Framework for Privacy of Soft-prompt Transfer

    Xun Wang, Jing Xu, Franziska Boenisch, Michael Backes, Adam Dziedzic · PDF
  74. PrimeGuard: Safe and Helpful LLMs through Tuning-Free Routing

    Blazej Manczak, Eric Lin, Eliott Zemour, Vaikkunth Mugunthan · PDF
  75. Privacy Auditing of Large Language Models

    Ashwinee Panda, Xinyu Tang, Milad Nasr, Christopher A. Choquette-Choo, Prateek Mittal · PDF
  76. Private Attribute Inference from Images with Vision-Language Models

    Batuhan Tömekçe, Mark Vero, Robin Staab, Martin Vechev · PDF
  77. ProFeAT: Projected Feature Adversarial Training for Self-Supervised Learning of Robust Representations

    Sravanti Addepalli, Priyam Dey, Venkatesh Babu Radhakrishnan · PDF
  78. Robust CLIP: Unsupervised Adversarial Fine-Tuning of Vision Embeddings for Robust Large Vision-Language Models

    Christian Schlarmann, Naman Deep Singh, Francesco Croce, Matthias Hein · PDF
  79. Robust Knowledge Unlearning via Mechanistic Localizations

    Phillip Huang Guo, Aaquib Syed, Abhay Sheshadri, Aidan Ewart, Gintare Karolina Dziugaite · PDF
  80. Robustness Analysis of AI Models in Critical Energy Systems

    Pantelis Dogoulis, Matthieu Jimenez, Maxime Cordy, Salah Ghamizi, Yves Le Traon · PDF
  81. Rule Based Rewards for Fine-Grained LLM Safety

    Tong Mu, Alec Helyar, Johannes Heidecke, Joshua Achiam, Andrea Vallone, Ian D Kivlichan, Molly Lin, Alex Beutel, John Schulman, Lilian Weng · PDF
  82. Safer Reinforcement Learning by Going Off-policy: a Benchmark

    Igor Kuznetsov · PDF
  83. Talking Nonsense: Probing Large Language Models' Understanding of Adversarial Gibberish Inputs

    Valeriia Cherepanova, James Zou · PDF
  84. Towards Adaptive Attacks on Constrained Tabular Machine Learning

    Thibault Simonetto, Salah Ghamizi, Maxime Cordy · PDF
  85. Towards Adversarially Robust Vision-Language Models: Insights from Design Choices and Prompt Formatting Techniques

    Rishika Bhagwatkar, Shravan Nayak, Reza Bayat, Alexis Roger, Daniel Z Kaplan, Pouya Bashivan, Irina Rish · PDF
  86. Towards Safe Large Language Models for Medicine

    Tessa Han, Aounon Kumar, Chirag Agarwal, Himabindu Lakkaraju · PDF
  87. Tree of Attacks: Jailbreaking Black-Box LLMs Automatically

    Anay Mehrotra, Manolis Zampetakis, Paul Kassianik, Blaine Nelson, Hyrum S Anderson, Yaron Singer, Amin Karbasi · PDF
  88. Uncovering a Culture of AI Grassroots Experimentation by Boston City Employees: Safety Risks and Mitigation

    Jude Ha, Audrey Xing-Yun Chang · PDF
  89. Unfamiliar Finetuning Examples Control How Language Models Hallucinate

    Katie Kang, Eric Wallace, Claire Tomlin, Aviral Kumar, Sergey Levine · PDF
  90. Using Large Language Models for Humanitarian Frontline Negotiation: Opportunities and Considerations

    Zilin Ma, Susannah Cheng Su, Nathan Zhao, Linn Bieske, Blake Bullwinkel, Yanyi Zhang, Jinglun Gao, Gekai Liao, Siyao Li, Ziqing Luo, Boxiang Wang, Zihan Wen, Yanrui Yang, Claude Bruderlein, Weiwei Pan · PDF
  91. Weak-to-Strong Jailbreaking on Large Language Models

    Xuandong Zhao, Xianjun Yang, Tianyu Pang, Chao Du, Lei Li, Yu-Xiang Wang, William Yang Wang · PDF
  92. Why Has Predicting Downstream Capabilities of Frontier AI Models with Scale Remained Elusive?

    Rylan Schaeffer, Hailey Schoelkopf, Brando Miranda, Gabriel Mukobi, Varun Madan, Adam Ibrahim, Herbie Bradley, Stella Biderman, Sanmi Koyejo · PDF
  93. WildTeaming at Scale: From In-the-Wild Jailbreaks to (Adversarially) Safer Language Models

    Liwei Jiang, Kavel Rao, Seungju Han, Allyson Ettinger, Faeze Brahman, Sachin Kumar, Niloofar Mireshghallah, Ximing Lu, Maarten Sap, Nouha Dziri, Yejin Choi · PDF