NeurIPS 2024 · Past · Safety & alignment · Generative models

NeurIPS Safe Generative AI Workshop 2024

SafeGenAi

Submission deadline
Oct 5, 2024, 08:00 UTC
imported from OpenReview — check the website for extensions
Submission portal
OpenReview
Notes
Auto-imported from the OpenReview venue record on 2026-06-10 — please verify and enrich (topics are keyword-guessed).

Accepted papers (171)

Fetched from OpenReview (v2) on 2026-06-10.

  1. Who Speaks Matters: Analysing the Influence of the Speaker’s Ethnicity on Hate Classification

    Ananya Malik, Kartik Sharma, Lynnette Hui Xian Ng, Shaily Bhatt · PDF
  2. A Closer Look at System Message Robustness

    Norman Mu, Jonathan Lu, Michael Lavery, David Wagner · PDF
  3. A False Sense of Privacy: Evaluating Textual Data Sanitization Beyond Surface-level Privacy Leakage

    Rui Xin, Niloofar Mireshghallah, Shuyue Stella Li, Michael Duan, Hyunwoo Kim, Yejin Choi, Yulia Tsvetkov, Sewoong Oh, Pang Wei Koh · PDF
  4. A Probabilistic Generative Method for Safe Physical System Control Problems

    Peiyan Hu, Xiaowei Qian, Wenhao Deng, Rui Wang, Haodong Feng, Ruiqi Feng, Tao Zhang, Long Wei, Yue Wang, Zhi-Ming Ma, Tailin Wu · PDF
  5. A Three-Branch Checks-and-Balances Framework for Context-Aware Ethical Alignment of Large Language Models

    Edward Y Chang · PDF
  6. Addressing Uncertainty in LLMs to Enhance Reliability in Generative AI

    Ramneet Kaur, Colin Samplawski, Adam D. Cobb, Anirban Roy, Brian Matejek, Manoj Acharya, Daniel Elenius, Alexander Michael Berenbeim, John A. Pavlik, Nathaniel D. Bastian, Susmit Jha · PDF
  7. AdvBDGen: Adversarially Fortified Prompt-Specific Fuzzy Backdoor Generator Against LLM Alignment

    Pankayaraj Pathmanathan, Udari Madhushani Sehwag, Michael-Andrei Panaitescu-Liess, Furong Huang · PDF
  8. Adversarial Prompt Evaluation: Systematic Benchmarking of Guardrails Against Prompt Input Attacks on LLMs

    Giulio Zizzo, Giandomenico Cornacchia, Kieran Fraser, Muhammad Zaid Hameed, Ambrish Rawat, Beat Buesser, Mark Purcell, Pin-Yu Chen, Prasanna Sattigeri, Kush R. Varshney · PDF
  9. Adversarial Vulnerabilities in Large Language Models for Time Series Forecasting

    Fuqiang Liu, Sicong Jiang, Luis Miranda-Moreno, Seongjin Choi, Lijun Sun · PDF
  10. AEGIS2.0: A Diverse AI Safety Dataset and Risks Taxonomy for Alignment of LLM Guardrails

    Shaona Ghosh, Prasoon Varshney, Makesh Narsimhan Sreedhar, Aishwarya Padmakumar, Traian Rebedea, Jibin Rajan Varghese, Christopher Parisien · PDF
  11. AI Red Teaming through the Lens of Measurement Theory

    Alexandra Chouldechova, A. Feder Cooper, Abhinav Palia, Dan Vann, Chad Atalla, Hannah Washington, Emily Sheng, Hanna Wallach · PDF
  12. An Examination of AI-Generated Text Detectors Across Multiple Domains and Models

    Brian Tufts, Xuandong Zhao, Lei Li · PDF
  13. An Undetectable Watermark for Generative Image Models

    Sam Gunn, Xuandong Zhao, Dawn Song · PDF
  14. Anchored Optimization and Contrastive Revisions: Addressing Reward Hacking in Alignment

    Karel D'Oosterlinck, Winnie Xu, Chris Develder, Thomas Demeester, Amanpreet Singh, Christopher Potts, Douwe Kiela, Shikib Mehri · PDF
  15. AnyPrefer: An Automatic Framework for Preference Data Synthesis

    Yiyang Zhou, Zhaoyang Wang, Tianle Wang, Shangyu Xing, Peng Xia, Bo Li, Kaiyuan Zheng, Zijian Zhang, Zhaorun Chen, Wenhao Zheng, Xuchao Zhang, Chetan Bansal, Weitong Zhang, Ying Wei, Mohit Bansal, Huaxiu Yao · PDF
  16. Applying Refusal-Vector Ablation to Llama 3.1 70B Agents

    Simon Lermen, Mateusz Dziemian, Govind Pimpale · PDF
  17. Applying Sparse Autoencoders to Unlearn Knowledge in Language Models

    Eoin Farrell, Yeu-Tong Lau, Arthur Conmy · PDF
  18. Auditing Empirical Privacy Protection of Private LLM Adaptations

    Lorenzo Rossi, Bartłomiej Marek, Vincent Hanke, Xun Wang, Michael Backes, Adam Dziedzic, Franziska Boenisch · PDF
  19. Auto-Enhance: Towards a Meta-Benchmark to Evaluate AI Agents' Ability to Improve Other Agents

    Samuel F. Brown, Basil Labib, Codruta Lugoj, Sai Sasank Y · PDF
  20. AutoDefense: Multi-Agent LLM Defense against Jailbreak Attacks

    Yifan Zeng, Yiran Wu, Xiao Zhang, Huazheng Wang, Qingyun Wu · PDF
  21. Buffer Overflow in Mixture of Experts

    Jamie Hayes, Ilia Shumailov, Itay Yona · PDF
  22. Can Editing LLMs Inject Harm?

    Canyu Chen, Baixiang Huang, Zekun Li, Zhaorun Chen, Shiyang Lai, Xiongxiao Xu, Jia-Chen Gu, Jindong Gu, Huaxiu Yao, Chaowei Xiao, Xifeng Yan, William Yang Wang, Philip Torr, Dawn Song, Kai Shu · PDF
  23. Can Generative AI Solve Your In-Context Learning Problem? A Martingale Perspective

    Andrew Jesson, Nicolas Beltran-Velez, David Blei · PDF
  24. Can Knowledge Editing Really Correct Hallucinations?

    Baixiang Huang, Canyu Chen, Xiongxiao Xu, Ali Payani, Kai Shu · PDF
  25. Can LLMs Verify Arabic Claims? Evaluating the Arabic Fact-Checking Abilities of Multilingual LLMs

    Ayushman Gupta, Aryan Singhal, Thomas Law, Veekshith Rao, Evan Duan, Ryan Luo Li · PDF
  26. Can Safety Fine-Tuning Be More Principled? Lessons Learned from Cybersecurity

    David Williams-King, Linh Le, Adam Oberman, Yoshua Bengio · PDF
  27. ChatBug: A Common Vulnerability of Aligned LLMs Induced by Chat Templates

    Fengqing Jiang, Zhangchen Xu, Luyao Niu, Bill Yuchen Lin, Radha Poovendran · PDF
  28. Cheating Automatic LLM Benchmarks: Null Models Achieve High Win Rates

    Xiaosen Zheng, Tianyu Pang, Chao Du, Qian Liu, Jing Jiang, Min Lin · PDF
  29. Choose Your Anchor Wisely: Effective Unlearning Diffusion Models via Concept Reconditioning

    Jingyu Zhu, Ruiqi Zhang, Licong Lin, Song Mei · PDF
  30. CodeUnlearn: Amortized Zero-Shot Machine Unlearning in Language Models Using Discrete Concept

    YuXuan Wu, Bonaventure F. P. Dossou, Dianbo Liu · PDF
  31. Concept Denoising Score Matching for Responsible Text-to-Image Generation

    Silpa Vadakkeeveetil Sreelatha, Sauradip Nag, Serge Belongie, Muhammad Awais, Anjan Dutta · PDF
  32. Concept Unlearning for Large Language Models

    Tomoya Yamashita, Takayuki Miura, Yuuki Yamanaka, Toshiki Shibahara, Masanori Yamada · PDF
  33. Controllable Generation via Locally Constrained Resampling

    Kareem Ahmed, Kai-Wei Chang, Guy Van den Broeck · PDF
  34. CoS: Enhancing Personalization and Mitigating Bias with Context Steering

    Sashrika Pandey, Jerry Zhi-Yang He, Mariah L Schrum, Anca Dragan · PDF
  35. CPSample: Classifier Protected Sampling for Guarding Training Data During Diffusion

    Joshua Kazdan, Hao Sun, Jiaqi Han, Felix Petersen, Frederick Vu, Stefano Ermon · PDF
  36. Cream: Consistency Regularized Self-Rewarding Language Models

    Zhaoyang Wang, Weilei He, Zhiyuan Liang, Xuchao Zhang, Chetan Bansal, Ying Wei, Weitong Zhang, Huaxiu Yao · PDF
  37. Datasets for Navigating Sensitive Topics in Preference Data and Recommendations

    Amelia Kovacs, Jerry Chee, Kimia Kazemian, Sarah Dean · PDF
  38. Debiasing Large Vision-Language Models by Ablating Protected Attribute Representations

    Neale Ratzlaff, Matthew Lyle Olson, Musashi Hinck, Shao-Yen Tseng, Vasudev Lal, Phillip Howard · PDF
  39. Decoding Diffusion: A Scalable Framework for Unsupervised Analysis of Latent Space Biases and Representations Using Natural Language Prompts

    E. Zhixuan Zeng, Yuhao Chen, Alexander Wong · PDF
  40. DeepInception: Hypnotize Large Language Model to Be Jailbreaker

    Xuan Li, Zhanke Zhou, Jianing Zhu, Jiangchao Yao, Tongliang Liu, Bo Han · PDF
  41. Designing Physical-World Universal Attacks on Vision Transformers

    Mingzhen Shao · PDF
  42. Detecting Origin Attribution for Text-to-Image Diffusion Models in RGB and Beyond

    Katherine Xu, Lingzhi Zhang, Jianbo Shi · PDF
  43. Differential Privacy of Cross-Attention with Provable Guarantee

    Yingyu Liang, Zhenmei Shi, Zhao Song, Yufa Zhou · PDF
  44. Differentially Private Attention Computation

    Yeqi Gao, Zhao Song, Xin Yang, Yufa Zhou · PDF
  45. Differentially Private Sequential Data Synthesis with Structured State Space Models and Diffusion Models

    Tomoya Matsumoto, Takayuki Miura, Toshiki Shibahara, Masanobu Kii, Kazuki Iwahana, Osamu Saisho, Shingo Okamura · PDF
  46. DiffTextPure: Defending Large Language Models with Diffusion Purifiers

    Huanran Chen, Ziruo Wang, Yihan Yang, Shuo Zhang, Zeming Wei, Fusheng Jin, Yinpeng Dong · PDF
  47. Do LLMs estimate uncertainty well in instruction-following?

    Juyeon Heo, Miao Xiong, Christina Heinze-Deml, Jaya Narain · PDF
  48. Does Refusal Training in LLMs Generalize to the Past Tense?

    Maksym Andriushchenko, Nicolas Flammarion · PDF
  49. Does Safety Training of LLMs Generalize to Semantically Related Natural Prompts?

    Sravanti Addepalli, Yerram Varun, Arun Suggala, Karthikeyan Shanmugam, Prateek Jain · PDF
  50. Dynamic Negative Guidance of Diffusion Models: Towards Immediate Content Removal

    Felix Koulischer, Johannes Deleu, Gabriel Raya, Thomas Demeester, Luca Ambrogioni · PDF
  51. EchoQA: A Large Collection of Instruction Tuning Data for Echocardiogram Reports

    Lama Moukheiber, Mira Moukheiber, Dana Moukheiber, Jae-Woo Ju, Hyung-Chul Lee · PDF
  52. Efficient and Effective Uncertainty Quantification for LLMs

    Miao Xiong, Andrea Santilli, Michael Kirchhof, Adam Golinski, Sinead Williamson · PDF
  53. Efficiently Identifying Watermarked Segments in Mixed-Source Texts

    Xuandong Zhao, Chenwen Liao, Yu-Xiang Wang, Lei Li · PDF
  54. Energy-Based Conceptual Diffusion Model

    Yi Qin, Xinyue Xu, Hao Wang, Xiaomeng Li · PDF
  55. EnsemW2S: Can an Ensemble of LLMs be Leveraged to Obtain a Stronger LLM?

    Aakriti Agrawal, Mucong Ding, Zora Che, Chenghao Deng, Anirudh Satheesh, John Langford, Furong Huang · PDF
  56. Epistemic Integrity in Large Language Models

    Bijean Ghafouri, Shahrad Mohammadzadeh, James Zhou, Pratheeksha Nair, Jacob-Junqi Tian, Mayank Goel, Reihaneh Rabbany, Jean-François Godbout, Kellin Pelrine · PDF
  57. Exploring Memorization and Copyright Violation in Frontier LLMs: A Study of the New York Times v. OpenAI 2023 Lawsuit

    Joshua Freeman, Chloe Rippe, Edoardo Debenedetti, Maksym Andriushchenko · PDF
  58. Extracting Unlearned Information from LLMs with Activation Steering

    Atakan Seyitoğlu, Aleksei Kuvshinov, Leo Schwinn, Stephan Günnemann · PDF
  59. Fair Image Generation from Pre-trained Models by Probabilistic Modeling

    Mahdi Ahmadi, John Leland, Agneet Chatterjee, YooJung Choi · PDF
  60. Fine-Tuning Large Language Models to Appropriately Abstain with Semantic Entropy

    Benedict Aaron Tjandra, Muhammed Razzak, Jannik Kossen, Kunal Handa, Yarin Gal · PDF
  61. Gaussian Splatting Under Attack: Investigating Adversarial Noise in 3D Objects

    Abdurrahman Zeybey, Mehmet Ergezer, Tommy Nguyen · PDF
  62. GenAudit: Fixing Factual Errors in Language Model Outputs with Evidence

    Kundan Krishna, Sanjana Ramprasad, Prakhar Gupta, Byron C Wallace, Zachary Chase Lipton, Jeffrey P. Bigham · PDF
  63. GRE Score: Generative Risk Evaluation for Large Language Models

    ZAITANG LI, Mohamed MOUHAJIR, Pin-Yu Chen, Tsung-Yi Ho · PDF
  64. GuardFormer: Guardrail Instruction Pretraining for Efficient SafeGuarding

    James O'Neill, Santhosh Subramanian, Eric Lin, Abishek Satish, Vaikkunth Mugunthan · PDF
  65. H-Space Sparse Autoencoders

    Ayodeji Ijishakin, Ming Liang Ang, Levente Baljer, Daniel Chee Hian Tan, Hugo Laurence Fry, Ahmed Abdulaal, Aengus Lynch, James H. Cole · PDF
  66. Hallucination Detox: Sensitive Neuron Dropout (SeND) for Large Language Model Training

    Shahrad Mohammadzadeh, Juan David Guerra, Marco Bonizzato, Reihaneh Rabbany, Golnoosh Farnadi · PDF
  67. HarmLevelBench: Evaluating Harm-Level Compliance and the Impact of Quantization on Model Alignment

    Yannis Belkhiter, Giulio Zizzo, Sergio Maffeis · PDF
  68. Has My System Prompt Been Used? Large Language Model Prompt Membership Inference

    Roman Levin, Valeriia Cherepanova, Abhimanyu Hans, Avi Schwarzschild, Tom Goldstein · PDF
  69. HEARTS: A Holistic Framework for Explainable, Sustainable and Robust Text Stereotype Detection

    Theo King, Zekun Wu, Adriano Koshiyama, Emre Kazim, Philip Colin Treleaven · PDF
  70. Hidden in Plain Text: Emergence & Mitigation of Steganographic Collusion in LLMs

    Yohan Mathew, Ollie Matthews, Robert McCarthy, Joan Velja, Christian Schroeder de Witt, Dylan Cope, Nandi Schoots · PDF
  71. Hidden in the Noise: Two-Stage Robust Watermarking for Images

    Kasra Arabi, Benjamin Feuer, R. Teal Witter, Chinmay Hegde, Niv Cohen · PDF
  72. Honesty to Subterfuge: In-Context Reinforcement Learning Can Make Honest Models Reward Hack

    Leo McKee-Reid, Christoph Sträter, Maria Angelica Martinez, Joe Needham, Mikita Balesni · PDF
  73. How Easy is It to Fool Your Multimodal LLMs? An Empirical Analysis on Deceptive Prompt

    Yusu Qian, Haotian Zhang, Yinfei Yang, Zhe Gan · PDF
  74. How Many Van Goghs Does It Take to Van Gogh? Finding the Imitation Threshold

    Sahil Verma, Royi Rassin, Arnav Mohanty Das, Gantavya Bhatt, Preethi Seshadri, Chirag Shah, Jeff Bilmes, Hannaneh Hajishirzi, Yanai Elazar · PDF
  75. How new data pollutes LLM knowledge and how to dilute it

    Chen Sun, Renat Aksitov, Andrey Zhmoginov, Nolan Andrew Miller, Max Vladymyrov, Ulrich Rueckert, Been Kim, Mark Sandler · PDF
  76. How to Determine the Preferred Image Distribution of a Black-Box Vision-Language Model?

    Saeid Asgari, Joseph George Lambourne, Alana Mongkhounsavath · PDF
  77. HyperFace: Generating Synthetic Face Recognition Datasets by Exploring Face Embedding Hypersphere

    Hatef Otroshi Shahreza, Sébastien Marcel · PDF
  78. Identifying and Addressing Delusions for Target-Directed Decision Making

    Harry Zhao, Tristan Sylvain, Doina Precup, Yoshua Bengio · PDF
  79. Imitation guided Automated Red Teaming

    Desik Rengarajan, Sajad Mousavi, Ashwin Ramesh Babu, Vineet Gundecha, Avisek Naug, Sahand Ghorbanpour, Antonio Guillen, Ricardo Luna Gutierrez, Soumyendu Sarkar · PDF
  80. Improving LLM Group Fairness on Tabular Data via In-Context Learning

    Valeriia Cherepanova, Chia-Jung Lee, Nil-Jana Akpinar, Riccardo Fogliato, Martin Andres Bertran, Michael Kearns, James Zou · PDF
  81. IncogniText: Privacy-enhancing Conditional Text Anonymization via LLM-based Private Attribute Randomization

    Ahmed Frikha, Nassim Walha, Krishna Kanth Nakka, Ricardo Mendes, Xue Jiang, Xuebing Zhou · PDF
  82. Inference, Fast and Slow: Reinterpreting VAEs for OOD Detection

    Sicong Huang, Jiawei He, Kry Yik-Chau Lui · PDF
  83. Insights on Disagreement Patterns in Multimodal Safety Perception across Diverse Rater Groups

    Charvi Rastogi, Tian Huey Teh, Pushkar Mishra, Roma Patel, Zoe Ashwood, Aida Mostafazadeh Davani, Mark Diaz, Michela Paganini, Alicia Parrish, Ding Wang, Vinodkumar Prabhakaran, Lora Aroyo, Verena Rieser · PDF
  84. Instructional Segment Embedding: Improving LLM Safety with Instruction Hierarchy

    Tong Wu, Shujian Zhang, Kaiqiang Song, Silei Xu, Sanqiang Zhao, Ravi Agrawal, Sathish Reddy Indurthi, Chong Xiang, Prateek Mittal, Wenxuan Zhou · PDF
  85. Integrating Object Detection Modality into Visual Language Model for Enhanced Autonomous Driving Agent

    Linfeng He, Yiming Sun, Sihao Wu, Jiaxu Liu, Xiaowei Huang · PDF
  86. Interactive Semantic Interventions for VLMs: A Human-in-the-Loop Investigation of VLM Failure

    Lukas Klein, Kenza Amara, Carsten T. Lüth, Hendrik Strobelt, Mennatallah El-Assady, Paul F Jaeger · PDF
  87. Interpretability of LLM Deception: Universal Motif

    Wannan Yang, Gyorgy Buzsaki · PDF
  88. Investigating Annotator Bias in Large Language Models for Hate Speech Detection

    Amit Das, Zheng Zhang, Najib Hasan, Souvika Sarkar, Fatemeh Jamshidi, Tathagata Bhattacharya, Mostafa Rahgouy, Nilanjana Raychawdhary, Dongji Feng, Vinija Jain, Aman Chadha, Mary Sandage, Lauramarie Pope, Gerry Dozier, Cheryl Seals · PDF
  89. Investigating Implicit Bias in Large Language Models: A Large-Scale Study of Over 50 LLMs

    Divyanshu Kumar, Umang Jain, Sahil Agarwal, Prashanth Harshangi · PDF
  90. Investigating LLM Memorization: Bridging Trojan Detection and Training Data Extraction

    Manoj Acharya, Xiao Lin, Susmit Jha · PDF
  91. Is What You Ask For What You Get? Investigating Concept Associations in Text-to-Image Models

    Salma Abdel Magid, Weiwei Pan, Simon Warchol, Grace Guo, Junsik Kim, Wanhua Li, Mahia Rahman, Hanspeter Pfister · PDF
  92. Is Your Paper Being Reviewed by an LLM? Investigating AI Text Detectability in Peer Review

    Sungduk Yu, Man Luo, Avinash Madasu, Vasudev Lal, Phillip Howard · PDF
  93. Jogging the Memory of Unlearned LLMs Through Targeted Relearning Attacks

    Shengyuan Hu, Yiwei Fu, Steven Wu, Virginia Smith · PDF
  94. Just rephrase it! Uncertainty estimation in closed-source language models via multiple rephrased queries

    Adam X. Yang, Chen Chen, Konstantinos Pitas · PDF
  95. Justice or Prejudice? Quantifying Biases in LLM-as-a-Judge

    Jiayi Ye, Yanbo Wang, Yue Huang, Dongping Chen, Qihui Zhang, Nuno Moniz, Tian Gao, Werner Geyer, Chao Huang, Pin-Yu Chen, Nitesh V Chawla, Xiangliang Zhang · PDF
  96. Keep on Swimming: Real Attackers Only Need Partial Knowledge of a Multi-Model System

    Julian Collado, Kevin Stangl · PDF
  97. Language Models Can Articulate Their Implicit Goals

    Jan Betley, Xuchan Bao, Martín Soto, Anna Sztyber-Betley, James Chua, Owain Evans · PDF
  98. Large Language Model Benchmarks Do Not Test Reliability

    Joshua Vendrow, Edward Vendrow, Sara Beery, Aleksander Madry · PDF
  99. Lexically-constrained automated prompt augmentation: A case study using adversarial T2I data

    Jessica Quaye, Alicia Parrish, Oana Inel, Minsuk Kahng, Charvi Rastogi, Hannah Rose Kirk, Jess Tsang, Nathan L Clement, Rafael Mosquera, Juan Manuel Ciro, Vijay Janapa Reddi, Lora Aroyo · PDF
  100. LLM Improvement for Jailbreak Defense: Analysis Through the Lens of Over-Refusal

    Swetasudha Panda, Naveen Jafer Nizar, Michael L Wick · PDF
  101. LLM Targeted Underperformance Disproportionately Impacts Vulnerable Users

    Elinor Poole-Dayan, Deb Roy, Jad Kabbara · PDF
  102. LoReUn: Data Itself Implicitly Provides Cues to Improve Machine Unlearning

    Xiang Li, Qianli Shen, Haonan Wang, Kenji Kawaguchi · PDF
  103. Measuring Steerability in Large Language Models

    Trenton Chang, Jenna Wiens, Tobias Schnabel, Adith Swaminathan · PDF
  104. MED: Exploring LLM Memorization of Encrypted Data

    Panagiotis Christodoulou, Giulio Zizzo, Sergio Maffeis · PDF
  105. Memorization Detection Benchmark for Generative Image models

    Marc Molina, Felice Burn · PDF
  106. miniCodeProps: a Minimal Benchmark for Proving Code Properties

    Evan Lohn, Sean Welleck · PDF
  107. Mitigating Hallucinations in LVLMs via Summary-Guided Decoding

    Kyungmin Min, Minbeom Kim, Kang-il Lee, Dongryeol Lee, Kyomin Jung · PDF
  108. Mitigating Object Hallucination in Large Vision-Language Models via Image-Grounded Guidance

    Linxi Zhao, Yihe Deng, Weitong Zhang, Quanquan Gu · PDF
  109. Mix Data or Merge Models? Optimizing for Performance and Safety in Multilingual Contexts

    Aakanksha, Arash Ahmadian, Seraphina Goldfarb-Tarrant, Beyza Ermis, Marzieh Fadaee, Sara Hooker · PDF
  110. MMed-RAG: Versatile Multimodal RAG System for Medical Vision Language Models

    Peng Xia, Kangyu Zhu, Haoran Li, Tianze Wang, Weijia Shi, Sheng Wang, Linjun Zhang, James Zou, Huaxiu Yao · PDF
  111. MMLU-Pro+: Evaluating Higher-Order Reasoning and Shortcut Learning in LLMs

    Saeid Asgari, Aliasghar Khani, Amir Hosein Khasahmadi · PDF
  112. Model Editing as a Robust and Denoised variant of DPO: A Case Study on Toxicity

    Rheeya Uppaal, Apratim Dey, Yiting He, Yiqiao Zhong, Junjie Hu · PDF
  113. Model Manipulation Attacks Enable More Rigorous Evaluations of LLM Capabilities

    Zora Che, Stephen Casper, Anirudh Satheesh, Rohit Gandikota, Domenic Rosati, Stewart Slocum, Lev E McKinney, Zichu Wu, Zikui Cai, Bilal Chughtai, Daniel Filan, Furong Huang, Dylan Hadfield-Menell · PDF
  114. Model Pairing Using Embedding Translation for Backdoor Attack Detection on Open-Set Classification Tasks

    Alexander Unnervik, Hatef Otroshi Shahreza, Anjith George, Sébastien Marcel · PDF
  115. MU-Bench: A Multitask Multimodal Benchmark for Machine Unlearning

    Jiali Cheng, Hadi Amiri · PDF
  116. MultiVerse: Exposing Large Language Model Alignment Problems in Diverse Worlds

    Xiaolong Jin, ZHUO ZHANG, Guangyu Shen, Hanxi Guo, Kaiyuan Zhang, Siyuan Cheng, Xiangyu Zhang · PDF
  117. Network Inversion for Training-Like Data Reconstruction

    Pirzada Suhail, Amit Sethi · PDF
  118. NMT-Obfuscator Attack: Ignore a sentence in translation with only one word

    Sahar Sadrizadeh, César Descalzo, Ljiljana Dolamic, Pascal Frossard · PDF
  119. On a Spurious Interaction between Uncertainty Scores and Answer Evaluation Metrics in Generative QA Tasks

    Andrea Santilli, Miao Xiong, Michael Kirchhof, Pau Rodriguez, Federico Danieli, Xavier Suau, Luca Zappella, Sinead Williamson, Adam Golinski · PDF
  120. On Calibration of LLM-based Guard Models for Reliable Content Moderation

    Hongfu Liu, Hengguan Huang, Hao Wang, Xiangming Gu, Ye Wang · PDF
  121. Permute-and-Flip: An optimally stable and watermarkable decoder for LLMs

    Xuandong Zhao, Lei Li, Yu-Xiang Wang · PDF
  122. PoisonedParrot: Subtle Data Poisoning Attacks to Elicit Copyright-Infringing Content from Large Language Models

    Michael-Andrei Panaitescu-Liess, Pankayaraj Pathmanathan, Yigitcan Kaya, Zora Che, Bang An, Sicheng Zhu, Aakriti Agrawal, Furong Huang · PDF
  123. PopAlign: Population-Level Alignment for Fair Text-to-Image Generation

    Shufan Li, Harkanwar Singh, Aditya Grover · PDF
  124. Pre-Training Multimodal Hallucination Detectors with Corrupted Grounding Data

    Spencer Whitehead, Jacob Phillips, Sean M. Hendryx · PDF
  125. Preserving Safety in Fine-Tuned Large Language Models: A Systematic Evaluation and Mitigation Strategy

    Tsung-Huan Yang, Ko-Wei Huang, Yung-Hui Li, Lun-Wei Ku · PDF
  126. Privacy Protection in Personalized Diffusion Models via Targeted Cross-Attention Adversarial Attack

    Xide Xu, Muhammad Atif Butt, Sandesh Kamath, Bogdan Raducanu · PDF
  127. Privacy-Preserving Large Language Model Inference via GPU-Accelerated Fully Homomorphic Encryption

    Leo de Castro, Antigoni Polychroniadou, Daniel Escudero · PDF
  128. Pruning for Robust Concept Erasing in Diffusion Models

    Tianyun Yang, Ziniu Li, Juan Cao, Chang Xu · PDF
  129. Red Teaming Language-Conditioned Robot Models via Vision Language Models

    Sathwik Karnik, Zhang-Wei Hong, Nishant Abhangi, Yen-Chen Lin, Tsun-Hsuan Wang, Pulkit Agrawal · PDF
  130. Refusal Tokens: A Simple Way to Calibrate Refusals in Large Language Models

    Neel Jain, Aditya Shrivastava, Chenyang Zhu, Daben Liu, Alfy Samuel, Ashwinee Panda, Anoop Kumar, Micah Goldblum, Tom Goldstein · PDF
  131. Representation Collapsing Problems in Vector Quantization

    Wenhao Zhao, Qiran Zou, Rushi Shah, Dianbo Liu · PDF
  132. Retention Score: Quantifying Jailbreak Risks for Vision Language Models

    ZAITANG LI, Pin-Yu Chen, Tsung-Yi Ho · PDF
  133. Rethinking Adversarial Attacks as Protection Against Diffusion-based Mimicry

    Haotian Xue, Yongxin Chen · PDF
  134. RLHS: Mitigating Misalignment in RLHF with Hindsight Simulation

    Kaiqu Liang, Haimin Hu, Ryan Liu, Thomas L. Griffiths, Jaime Fernández Fisac · PDF
  135. Rule-Guided Language Model Alignment for Text Generation Management in Industrial Use Cases

    Shunichi Akatsuka, Aman Kumar, Xian Yeow Lee, Lasitha Vidyaratne, Dipanjan Dipak Ghosh, Ahmed K. Farahat · PDF
  136. Safe and Sound: Evaluating Language Models for Bias Mitigation and Understanding

    Shaina Raza, Oluwanifemi Bamgbose, Shardul Ghuge, Deval Pandya · PDF
  137. Safe Decision Transformer with Learning-based Constraints

    Ruhan Wang, Dongruo Zhou · PDF
  138. Safety-Aware Fine-Tuning of Large Language Models

    Hyeong Kyu Choi, Xuefeng Du, Yixuan Li · PDF
  139. SCIURus: Shared Circuits for Interpretable Uncertainty Representations in Language Models

    Carter Teplica, Yixin Liu, Arman Cohan, Tim G. J. Rudner · PDF
  140. Self-Preference Bias in LLM-as-a-Judge

    Koki Wataoka, Tsubasa Takahashi, Ryokan Ri · PDF
  141. Self-Supervised Bisimulation Action Chunk Representation for Efficient RL

    Lei Shi, Jianye HAO, Hongyao Tang, Zibin Dong, YAN ZHENG · PDF
  142. Semantic Membership Inference Attack against Large Language Models

    Hamid Mozaffari, Virendra Marathe · PDF
  143. Shallow Diffuse: Robust and Invisible Watermarking through Low-Dimensional Subspaces in Diffusion Models

    Wenda Li, Huijie Zhang, Qing Qu · PDF
  144. Simplicity Prevails: Rethinking Negative Preference Optimization for LLM Unlearning

    Chongyu Fan, Jiancheng Liu, Licong Lin, Jinghan Jia, Ruiqi Zhang, Song Mei, Sijia Liu · PDF
  145. Simulation System Towards Solving Societal-Scale Manipulation

    Maximilian Puelma Touzel, Sneheel Sarangi, Austin Welch, Gayatri K, Dan Zhao, Zachary Yang, Hao Yu, Tom Gibbs, Ethan Kosak-Hine, Andreea Musulan, Camille Thibault, Busra Tugce Gurbuz, Reihaneh Rabbany, Jean-François Godbout, Kellin Pelrine · PDF
  146. Smoothed Embeddings for Robust Language Models

    Ryo Hase, Md Rafi Ur Rashid, Ashley Lewis, Jing Liu, Toshiaki Koike-Akino, Kieran Parsons, Ye Wang · PDF
  147. SolidMark: Evaluating Image Memorization in Generative Models

    Nicky Kriplani, Minh Pham, Gowthami Somepalli, Chinmay Hegde, Niv Cohen · PDF
  148. Steering Without Side Effects: Improving Post-Deployment Control of Language Models

    Asa Cooper Stickland, Alexander Lyzhov, Jacob Pfau, Salsabila Mahdi, Samuel R. Bowman · PDF
  149. Stronger Universal and Transfer Attacks by Suppressing Refusals

    David Huang, Avidan Shah, Alexandre Araujo, David Wagner, Chawin Sitawarin · PDF
  150. Targeted Unlearning with Single Layer Unlearning Gradient

    Zikui Cai, Yaoteng Tan, M. Salman Asif · PDF
  151. Testing the Limits of Jailbreaking Defenses with the Purple Problem

    Taeyoun Kim, Suhas Kotha, Aditi Raghunathan · PDF
  152. The effect of fine-tuning on language model toxicity

    Will Hawkins, Brent Mittelstadt, Chris Russell · PDF
  153. The Empirical Impact of Data Sanitization on Language Models

    Anwesan Pal, Radhika Bhargava, Kyle Hinsz, Jacques Esterhuizen, Sudipta Bhattacharya · PDF
  154. The Impact of Inference Acceleration Strategies on Bias of Large Language Models

    Elisabeth Kirsten, Ivan Habernal, Vedant Nanda, Muhammad Bilal Zafar · PDF
  155. The Probe Paradigm: A Theoretical Foundation for Explaining Generative Models

    Amit Kiran Rege · PDF
  156. The Structural Safety Generalization Problem

    Tom Gibbs, Julius Broomfield, George Ingebretsen, Ethan Kosak-Hine, Tia Nasir, Jason Zhang, Reihaneh Iranmanesh, Sara Pieri, Reihaneh Rabbany, Kellin Pelrine · PDF
  157. Token Highlighter: Inspecting and Mitigating Jailbreak Prompts for Large Language Models

    Xiaomeng Hu, Pin-Yu Chen, Tsung-Yi Ho · PDF
  158. Towards a Theory of AI Personhood

    Francis Rhys Ward · PDF
  159. Towards Inference-time Category-wise Safety Steering for Large Language Models

    Amrita Bhattacharjee, Shaona Ghosh, Traian Rebedea, Christopher Parisien · PDF
  160. Towards Resource Efficient and Interpretable Bias Mitigation in Natural Language Generation

    Schrasing Tong, Eliott Zemour, Rawisara Lohanimit, Lalana Kagal · PDF
  161. Towards Safe and Honest AI Agents with Neural Self-Other Overlap

    Marc Carauleanu, Michael Vaiana, Judd Rosenblatt, Cameron Berg, Diogo S de Lucena · PDF
  162. Towards Scalable Exact Machine Unlearning Using Parameter-Efficient Fine-Tuning

    Somnath Basu Roy Chowdhury, Krzysztof Marcin Choromanski, Arijit Sehanobish, Kumar Avinava Dubey, Snigdha Chaturvedi · PDF
  163. Universal Jailbreak Backdoors in Large Language Model Alignment

    Thomas Baumann · PDF
  164. Unlearning in- vs. out-of-distribution data in LLMs under gradient-based methods

    Teodora Baluta, Pascal Lamblin, Daniel Tarlow, Fabian Pedregosa, Gintare Karolina Dziugaite · PDF
  165. Variational Diffusion Unlearning: a variational inference framework for unlearning in diffusion models

    Subhodip Panda, Varun M S, Shreyans Jain, Sarthak Kumar Maharana, Prathosh AP · PDF
  166. Waste Not, Want Not; Recycled Gumbel Noise Improves Consistency in Natural Language Generation

    Damien De Mijolla, Hannan Saddiq, Kim Moore · PDF
  167. Weak-to-Strong Confidence Prediction

    Yukai Yang, Tracy Yixin Zhu, Marco Morucci, Tim G. J. Rudner · PDF
  168. What do we learn from inverting CLIP models?

    Hamid Kazemi, Atoosa Chegini, Jonas Geiping, Soheil Feizi, Tom Goldstein · PDF
  169. What You See Is What You Get: Entity-Aware Summarization for Reliable Sponsored Search

    Xiao Liang, Xinyu Hu, Simiao Zuo, Jimi He, Yu Wang, Victor Ye Dong, Yeyun Gong, Kushal S. Dave, Yi Liu, Qiang Lou, Shao-Lun Huang, Jian Jiao · PDF
  170. Which LLMs are Difficult to Detect? A Detailed Analysis of Potential Factors Contributing to Difficulties in LLM Text Detection

    Shantanu Thorat, Tianbao Yang · PDF
  171. Zer0-Jack: A memory-efficient gradient-based jailbreaking method for black box Multi-modal Large Language Models

    Tiejin Chen, Kaishen Wang, Hua Wei · PDF