NeurIPS 2024 · Past · Safety & alignment · Generative models

NeurIPS Safe Generative AI Workshop 2024

SafeGenAi

Submission deadline: Oct 5, 2024, 08:00 UTC
imported from OpenReview — check the website for extensions
Submission portal: OpenReview
Notes: Topics were auto-suggested and may be imprecise.

Accepted papers (171)

Fetched from OpenReview (v2) on 2026-06-10.

$\textit{Who Speaks Matters}$: Analysing the Influence of the Speaker’s Ethnicity on Hate Classification
Ananya Malik, Kartik Sharma, Lynnette Hui Xian Ng, Shaily Bhatt · PDF
A Closer Look at System Message Robustness
Norman Mu, Jonathan Lu, Michael Lavery, David Wagner · PDF
A False Sense of Privacy: Evaluating Textual Data Sanitization Beyond Surface-level Privacy Leakage
Rui Xin, Niloofar Mireshghallah, Shuyue Stella Li, Michael Duan, Hyunwoo Kim, Yejin Choi, Yulia Tsvetkov, Sewoong Oh, Pang Wei Koh · PDF
A Probabilistic Generative Method for Safe Physical System Control Problems
Peiyan Hu, Xiaowei Qian, Wenhao Deng, Rui Wang, Haodong Feng, Ruiqi Feng, Tao Zhang, Long Wei, Yue Wang, Zhi-Ming Ma, Tailin Wu · PDF
A Three-Branch Checks-and-Balances Framework for Context-Aware Ethical Alignment of Large Language Models
Edward Y Chang · PDF
Addressing Uncertainty in LLMs to Enhance Reliability in Generative AI
Ramneet Kaur, Colin Samplawski, Adam D. Cobb, Anirban Roy, Brian Matejek, Manoj Acharya, Daniel Elenius, Alexander Michael Berenbeim, John A. Pavlik, Nathaniel D. Bastian, Susmit Jha · PDF
AdvBDGen: Adversarially Fortified Prompt-Specific Fuzzy Backdoor Generator Against LLM Alignment
Pankayaraj Pathmanathan, Udari Madhushani Sehwag, Michael-Andrei Panaitescu-Liess, Furong Huang · PDF
Adversarial Prompt Evaluation: Systematic Benchmarking of Guardrails Against Prompt Input Attacks on LLMs
Giulio Zizzo, Giandomenico Cornacchia, Kieran Fraser, Muhammad Zaid Hameed, Ambrish Rawat, Beat Buesser, Mark Purcell, Pin-Yu Chen, Prasanna Sattigeri, Kush R. Varshney · PDF
Adversarial Vulnerabilities in Large Language Models for Time Series Forecasting
Fuqiang Liu, Sicong Jiang, Luis Miranda-Moreno, Seongjin Choi, Lijun Sun · PDF
AEGIS2.0: A Diverse AI Safety Dataset and Risks Taxonomy for Alignment of LLM Guardrails
Shaona Ghosh, Prasoon Varshney, Makesh Narsimhan Sreedhar, Aishwarya Padmakumar, Traian Rebedea, Jibin Rajan Varghese, Christopher Parisien · PDF
AI Red Teaming through the Lens of Measurement Theory
Alexandra Chouldechova, A. Feder Cooper, Abhinav Palia, Dan Vann, Chad Atalla, Hannah Washington, Emily Sheng, Hanna Wallach · PDF
An Examination of AI-Generated Text Detectors Across Multiple Domains and Models
Brian Tufts, Xuandong Zhao, Lei Li · PDF
An Undetectable Watermark for Generative Image Models
Sam Gunn, Xuandong Zhao, Dawn Song · PDF
Anchored Optimization and Contrastive Revisions: Addressing Reward Hacking in Alignment
Karel D'Oosterlinck, Winnie Xu, Chris Develder, Thomas Demeester, Amanpreet Singh, Christopher Potts, Douwe Kiela, Shikib Mehri · PDF
AnyPrefer: An Automatic Framework for Preference Data Synthesis
Yiyang Zhou, Zhaoyang Wang, Tianle Wang, Shangyu Xing, Peng Xia, Bo Li, Kaiyuan Zheng, Zijian Zhang, Zhaorun Chen, Wenhao Zheng, Xuchao Zhang, Chetan Bansal, Weitong Zhang, Ying Wei, Mohit Bansal, Huaxiu Yao · PDF
Applying Refusal-Vector Ablation to Llama 3.1 70B Agents
Simon Lermen, Mateusz Dziemian, Govind Pimpale · PDF
Applying Sparse Autoencoders to Unlearn Knowledge in Language Models
Eoin Farrell, Yeu-Tong Lau, Arthur Conmy · PDF
Auditing Empirical Privacy Protection of Private LLM Adaptations
Lorenzo Rossi, Bartłomiej Marek, Vincent Hanke, Xun Wang, Michael Backes, Adam Dziedzic, Franziska Boenisch · PDF
Auto-Enhance: Towards a Meta-Benchmark to Evaluate AI Agents' Ability to Improve Other Agents
Samuel F. Brown, Basil Labib, Codruta Lugoj, Sai Sasank Y · PDF
AutoDefense: Multi-Agent LLM Defense against Jailbreak Attacks
Yifan Zeng, Yiran Wu, Xiao Zhang, Huazheng Wang, Qingyun Wu · PDF
Buffer Overflow in Mixture of Experts
Jamie Hayes, Ilia Shumailov, Itay Yona · PDF
Can Editing LLMs Inject Harm?
Canyu Chen, Baixiang Huang, Zekun Li, Zhaorun Chen, Shiyang Lai, Xiongxiao Xu, Jia-Chen Gu, Jindong Gu, Huaxiu Yao, Chaowei Xiao, Xifeng Yan, William Yang Wang, Philip Torr, Dawn Song, Kai Shu · PDF
Can Generative AI Solve Your In-Context Learning Problem? A Martingale Perspective
Andrew Jesson, Nicolas Beltran-Velez, David Blei · PDF
Can Knowledge Editing Really Correct Hallucinations?
Baixiang Huang, Canyu Chen, Xiongxiao Xu, Ali Payani, Kai Shu · PDF
Can LLMs Verify Arabic Claims? Evaluating the Arabic Fact-Checking Abilities of Multilingual LLMs
Ayushman Gupta, Aryan Singhal, Thomas Law, Veekshith Rao, Evan Duan, Ryan Luo Li · PDF
Can Safety Fine-Tuning Be More Principled? Lessons Learned from Cybersecurity
David Williams-King, Linh Le, Adam Oberman, Yoshua Bengio · PDF
ChatBug: A Common Vulnerability of Aligned LLMs Induced by Chat Templates
Fengqing Jiang, Zhangchen Xu, Luyao Niu, Bill Yuchen Lin, Radha Poovendran · PDF
Cheating Automatic LLM Benchmarks: Null Models Achieve High Win Rates
Xiaosen Zheng, Tianyu Pang, Chao Du, Qian Liu, Jing Jiang, Min Lin · PDF
Choose Your Anchor Wisely: Effective Unlearning Diffusion Models via Concept Reconditioning
Jingyu Zhu, Ruiqi Zhang, Licong Lin, Song Mei · PDF
CodeUnlearn: Amortized Zero-Shot Machine Unlearning in Language Models Using Discrete Concept
YuXuan Wu, Bonaventure F. P. Dossou, Dianbo Liu · PDF
Concept Denoising Score Matching for Responsible Text-to-Image Generation
Silpa Vadakkeeveetil Sreelatha, Sauradip Nag, Serge Belongie, Muhammad Awais, Anjan Dutta · PDF
Concept Unlearning for Large Language Models
Tomoya Yamashita, Takayuki Miura, Yuuki Yamanaka, Toshiki Shibahara, Masanori Yamada · PDF
Controllable Generation via Locally Constrained Resampling
Kareem Ahmed, Kai-Wei Chang, Guy Van den Broeck · PDF
CoS: Enhancing Personalization and Mitigating Bias with Context Steering
Sashrika Pandey, Jerry Zhi-Yang He, Mariah L Schrum, Anca Dragan · PDF
CPSample: Classifier Protected Sampling for Guarding Training Data During Diffusion
Joshua Kazdan, Hao Sun, Jiaqi Han, Felix Petersen, Frederick Vu, Stefano Ermon · PDF
Cream: Consistency Regularized Self-Rewarding Language Models
Zhaoyang Wang, Weilei He, Zhiyuan Liang, Xuchao Zhang, Chetan Bansal, Ying Wei, Weitong Zhang, Huaxiu Yao · PDF
Datasets for Navigating Sensitive Topics in Preference Data and Recommendations
Amelia Kovacs, Jerry Chee, Kimia Kazemian, Sarah Dean · PDF
Debiasing Large Vision-Language Models by Ablating Protected Attribute Representations
Neale Ratzlaff, Matthew Lyle Olson, Musashi Hinck, Shao-Yen Tseng, Vasudev Lal, Phillip Howard · PDF
Decoding Diffusion: A Scalable Framework for Unsupervised Analysis of Latent Space Biases and Representations Using Natural Language Prompts
E. Zhixuan Zeng, Yuhao Chen, Alexander Wong · PDF
DeepInception: Hypnotize Large Language Model to Be Jailbreaker
Xuan Li, Zhanke Zhou, Jianing Zhu, Jiangchao Yao, Tongliang Liu, Bo Han · PDF
Designing Physical-World Universal Attacks on Vision Transformers
Mingzhen Shao · PDF
Detecting Origin Attribution for Text-to-Image Diffusion Models in RGB and Beyond
Katherine Xu, Lingzhi Zhang, Jianbo Shi · PDF
Differential Privacy of Cross-Attention with Provable Guarantee
Yingyu Liang, Zhenmei Shi, Zhao Song, Yufa Zhou · PDF
Differentially Private Attention Computation
Yeqi Gao, Zhao Song, Xin Yang, Yufa Zhou · PDF
Differentially Private Sequential Data Synthesis with Structured State Space Models and Diffusion Models
Tomoya Matsumoto, Takayuki Miura, Toshiki Shibahara, Masanobu Kii, Kazuki Iwahana, Osamu Saisho, Shingo Okamura · PDF
DiffTextPure: Defending Large Language Models with Diffusion Purifiers
Huanran Chen, Ziruo Wang, Yihan Yang, Shuo Zhang, Zeming Wei, Fusheng Jin, Yinpeng Dong · PDF
Do LLMs estimate uncertainty well in instruction-following?
Juyeon Heo, Miao Xiong, Christina Heinze-Deml, Jaya Narain · PDF
Does Refusal Training in LLMs Generalize to the Past Tense?
Maksym Andriushchenko, Nicolas Flammarion · PDF
Does Safety Training of LLMs Generalize to Semantically Related Natural Prompts?
Sravanti Addepalli, Yerram Varun, Arun Suggala, Karthikeyan Shanmugam, Prateek Jain · PDF
Dynamic Negative Guidance of Diffusion Models: Towards Immediate Content Removal
Felix Koulischer, Johannes Deleu, Gabriel Raya, Thomas Demeester, Luca Ambrogioni · PDF
EchoQA: A Large Collection of Instruction Tuning Data for Echocardiogram Reports
Lama Moukheiber, Mira Moukheiber, Dana Moukheiber, Jae-Woo Ju, Hyung-Chul Lee · PDF
Efficient and Effective Uncertainty Quantification for LLMs
Miao Xiong, Andrea Santilli, Michael Kirchhof, Adam Golinski, Sinead Williamson · PDF
Efficiently Identifying Watermarked Segments in Mixed-Source Texts
Xuandong Zhao, Chenwen Liao, Yu-Xiang Wang, Lei Li · PDF
Energy-Based Conceptual Diffusion Model
Yi Qin, Xinyue Xu, Hao Wang, Xiaomeng Li · PDF
EnsemW2S: Can an Ensemble of LLMs be Leveraged to Obtain a Stronger LLM?
Aakriti Agrawal, Mucong Ding, Zora Che, Chenghao Deng, Anirudh Satheesh, John Langford, Furong Huang · PDF
Epistemic Integrity in Large Language Models
Bijean Ghafouri, Shahrad Mohammadzadeh, James Zhou, Pratheeksha Nair, Jacob-Junqi Tian, Mayank Goel, Reihaneh Rabbany, Jean-François Godbout, Kellin Pelrine · PDF
Exploring Memorization and Copyright Violation in Frontier LLMs: A Study of the New York Times v. OpenAI 2023 Lawsuit
Joshua Freeman, Chloe Rippe, Edoardo Debenedetti, Maksym Andriushchenko · PDF
Extracting Unlearned Information from LLMs with Activation Steering
Atakan Seyitoğlu, Aleksei Kuvshinov, Leo Schwinn, Stephan Günnemann · PDF
Fair Image Generation from Pre-trained Models by Probabilistic Modeling
Mahdi Ahmadi, John Leland, Agneet Chatterjee, YooJung Choi · PDF
Fine-Tuning Large Language Models to Appropriately Abstain with Semantic Entropy
Benedict Aaron Tjandra, Muhammed Razzak, Jannik Kossen, Kunal Handa, Yarin Gal · PDF
Gaussian Splatting Under Attack: Investigating Adversarial Noise in 3D Objects
Abdurrahman Zeybey, Mehmet Ergezer, Tommy Nguyen · PDF
GenAudit: Fixing Factual Errors in Language Model Outputs with Evidence
Kundan Krishna, Sanjana Ramprasad, Prakhar Gupta, Byron C Wallace, Zachary Chase Lipton, Jeffrey P. Bigham · PDF
GRE Score: Generative Risk Evaluation for Large Language Models
ZAITANG LI, Mohamed MOUHAJIR, Pin-Yu Chen, Tsung-Yi Ho · PDF
GuardFormer: Guardrail Instruction Pretraining for Efficient SafeGuarding
James O' Neill, Santhosh Subramanian, Eric Lin, Abishek Satish, Vaikkunth Mugunthan · PDF
H-Space Sparse Autoencoders
Ayodeji Ijishakin, Ming Liang Ang, Levente Baljer, Daniel Chee Hian Tan, Hugo Laurence Fry, Ahmed Abdulaal, Aengus Lynch, James H. Cole · PDF
Hallucination Detox: Sensitive Neuron Dropout (SeND) for Large Language Model Training
Shahrad Mohammadzadeh, Juan David Guerra, Marco Bonizzato, Reihaneh Rabbany, Golnoosh Farnadi · PDF
HarmLevelBench: Evaluating Harm-Level Compliance and the Impact of Quantization on Model Alignment
Yannis Belkhiter, Giulio Zizzo, Sergio Maffeis · PDF
Has My System Prompt Been Used? Large Language Model Prompt Membership Inference
Roman Levin, Valeriia Cherepanova, Abhimanyu Hans, Avi Schwarzschild, Tom Goldstein · PDF
HEARTS: A Holistic Framework for Explainable, Sustainable and Robust Text Stereotype Detection
Theo King, Zekun Wu, Adriano Koshiyama, Emre Kazim, Philip Colin Treleaven · PDF
Hidden in Plain Text: Emergence & Mitigation of Steganographic Collusion in LLMs
Yohan Mathew, Ollie Matthews, Robert McCarthy, Joan Velja, Christian Schroeder de Witt, Dylan Cope, Nandi Schoots · PDF
Hidden in the Noise: Two-Stage Robust Watermarking for Images
Kasra Arabi, Benjamin Feuer, R. Teal Witter, Chinmay Hegde, Niv Cohen · PDF
Honesty to Subterfuge: In-Context Reinforcement Learning Can Make Honest Models Reward Hack
Leo McKee-Reid, Christoph Sträter, Maria Angelica Martinez, Joe Needham, Mikita Balesni · PDF
How Easy is It to Fool Your Multimodal LLMs? An Empirical Analysis on Deceptive Prompt
Yusu Qian, Haotian Zhang, Yinfei Yang, Zhe Gan · PDF
How Many Van Goghs Does It Take to Van Gogh? Finding the Imitation Threshold
Sahil Verma, Royi Rassin, Arnav Mohanty Das, Gantavya Bhatt, Preethi Seshadri, Chirag Shah, Jeff Bilmes, Hannaneh Hajishirzi, Yanai Elazar · PDF
How new data pollutes LLM knowledge and how to dilute it
Chen Sun, Renat Aksitov, Andrey Zhmoginov, Nolan Andrew Miller, Max Vladymyrov, Ulrich Rueckert, Been Kim, Mark Sandler · PDF
How to Determine the Preferred Image Distribution of a Black-Box Vision-Language Model?
Saeid Asgari, Joseph George Lambourne, Alana Mongkhounsavath · PDF
HyperFace: Generating Synthetic Face Recognition Datasets by Exploring Face Embedding Hypersphere
Hatef Otroshi Shahreza, Sébastien Marcel · PDF
Identifying and Addressing Delusions for Target-Directed Decision Making
Harry Zhao, Tristan Sylvain, Doina Precup, Yoshua Bengio · PDF
Imitation guided Automated Red Teaming
Desik Rengarajan, Sajad Mousavi, Ashwin Ramesh Babu, Vineet Gundecha, Avisek Naug, Sahand Ghorbanpour, Antonio Guillen, Ricardo Luna Gutierrez, Soumyendu Sarkar · PDF
Improving LLM Group Fairness on Tabular Data via In-Context Learning
Valeriia Cherepanova, Chia-Jung Lee, Nil-Jana Akpinar, Riccardo Fogliato, Martin Andres Bertran, Michael Kearns, James Zou · PDF
IncogniText: Privacy-enhancing Conditional Text Anonymization via LLM-based Private Attribute Randomization
Ahmed Frikha, Nassim Walha, Krishna Kanth Nakka, Ricardo Mendes, Xue Jiang, Xuebing Zhou · PDF
Inference, Fast and Slow: Reinterpreting VAEs for OOD Detection
Sicong Huang, Jiawei He, Kry Yik-Chau Lui · PDF
Insights on Disagreement Patterns in Multimodal Safety Perception across Diverse Rater Groups
Charvi Rastogi, Tian Huey Teh, Pushkar Mishra, Roma Patel, Zoe Ashwood, Aida Mostafazadeh Davani, Mark Diaz, Michela Paganini, Alicia Parrish, Ding Wang, Vinodkumar Prabhakaran, Lora Aroyo, Verena Rieser · PDF
Instructional Segment Embedding: Improving LLM Safety with Instruction Hierarchy
Tong Wu, Shujian Zhang, Kaiqiang Song, Silei Xu, Sanqiang Zhao, Ravi Agrawal, Sathish Reddy Indurthi, Chong Xiang, Prateek Mittal, Wenxuan Zhou · PDF
Integrating Object Detection Modality into Visual Language Model for Enhanced Autonomous Driving Agent
Linfeng He, Yiming Sun, Sihao Wu, Jiaxu Liu, Xiaowei Huang · PDF
Interactive Semantic Interventions for VLMs: A Human-in-the-Loop Investigation of VLM Failure
Lukas Klein, Kenza Amara, Carsten T. Lüth, Hendrik Strobelt, Mennatallah El-Assady, Paul F Jaeger · PDF
Interpretability of LLM Deception: Universal Motif
Wannan Yang, Gyorgy Buzsaki · PDF
Investigating Annotator Bias in Large Language Models for Hate Speech Detection
Amit Das, Zheng Zhang, Najib Hasan, Souvika Sarkar, Fatemeh Jamshidi, Tathagata Bhattacharya, Mostafa Rahgouy, Nilanjana Raychawdhary, Dongji Feng, Vinija Jain, Aman Chadha, Mary Sandage, Lauramarie Pope, Gerry Dozier, Cheryl Seals · PDF
Investigating Implicit Bias in Large Language Models: A Large-Scale Study of Over 50 LLMs
Divyanshu Kumar, Umang Jain, Sahil Agarwal, Prashanth Harshangi · PDF
Investigating LLM Memorization: Bridging Trojan Detection and Training Data Extraction
Manoj Acharya, Xiao Lin, Susmit Jha · PDF
Is What You Ask For What You Get? Investigating Concept Associations in Text-to-Image Models
Salma Abdel Magid, Weiwei Pan, Simon Warchol, Grace Guo, Junsik Kim, Wanhua Li, Mahia Rahman, Hanspeter Pfister · PDF
Is Your Paper Being Reviewed by an LLM? Investigating AI Text Detectability in Peer Review
Sungduk Yu, Man Luo, Avinash Madasu, Vasudev Lal, Phillip Howard · PDF
Jogging the Memory of Unlearned LLMs Through Targeted Relearning Attacks
Shengyuan Hu, Yiwei Fu, Steven Wu, Virginia Smith · PDF
Just rephrase it! Uncertainty estimation in closed-source language models via multiple rephrased queries
Adam X. Yang, Chen Chen, Konstantinos Pitas · PDF
Justice or Prejudice? Quantifying Biases in LLM-as-a-Judge
Jiayi Ye, Yanbo Wang, Yue Huang, Dongping Chen, Qihui Zhang, Nuno Moniz, Tian Gao, Werner Geyer, Chao Huang, Pin-Yu Chen, Nitesh V Chawla, Xiangliang Zhang · PDF
Keep on Swimming: Real Attackers Only Need Partial Knowledge of a Multi-Model System
Julian Collado, Kevin Stangl · PDF
Language Models Can Articulate Their Implicit Goals
Jan Betley, Xuchan Bao, Martín Soto, Anna Sztyber-Betley, James Chua, Owain Evans · PDF
Large Language Model Benchmarks Do Not Test Reliability
Joshua Vendrow, Edward Vendrow, Sara Beery, Aleksander Madry · PDF
Lexically-constrained automated prompt augmentation: A case study using adversarial T2I data
Jessica Quaye, Alicia Parrish, Oana Inel, Minsuk Kahng, Charvi Rastogi, Hannah Rose Kirk, Jess Tsang, Nathan L Clement, Rafael Mosquera, Juan Manuel Ciro, Vijay Janapa Reddi, Lora Aroyo · PDF
LLM Improvement for Jailbreak Defense: Analysis Through the Lens of Over-Refusal
Swetasudha Panda, Naveen Jafer Nizar, Michael L Wick · PDF
LLM Targeted Underperformance Disproportionately Impacts Vulnerable Users
Elinor Poole-Dayan, Deb Roy, Jad Kabbara · PDF
LoReUn: Data Itself Implicitly Provides Cues to Improve Machine Unlearning
Xiang Li, Qianli Shen, Haonan Wang, Kenji Kawaguchi · PDF
Measuring Steerability in Large Language Models
Trenton Chang, Jenna Wiens, Tobias Schnabel, Adith Swaminathan · PDF
MED: Exploring LLM Memorization of Encrypted Data
Panagiotis Christodoulou, Giulio Zizzo, Sergio Maffeis · PDF
Memorization Detection Benchmark for Generative Image models
Marc Molina, Felice Burn · PDF
miniCodeProps: a Minimal Benchmark for Proving Code Properties
Evan Lohn, Sean Welleck · PDF
Mitigating Hallucinations in LVLMs via Summary-Guided Decoding
Kyungmin Min, Minbeom Kim, Kang-il Lee, Dongryeol Lee, Kyomin Jung · PDF
Mitigating Object Hallucination in Large Vision-Language Models via Image-Grounded Guidance
Linxi Zhao, Yihe Deng, Weitong Zhang, Quanquan Gu · PDF
Mix Data or Merge Models? Optimizing for Performance and Safety in Multilingual Contexts
Aakanksha, Arash Ahmadian, Seraphina Goldfarb-Tarrant, Beyza Ermis, Marzieh Fadaee, Sara Hooker · PDF
MMed-RAG: Versatile Multimodal RAG System for Medical Vision Language Models
Peng Xia, Kangyu Zhu, Haoran Li, Tianze Wang, Weijia Shi, Sheng Wang, Linjun Zhang, James Zou, Huaxiu Yao · PDF
MMLU-Pro+: Evaluating Higher-Order Reasoning and Shortcut Learning in LLMs
Saeid Asgari, Aliasghar Khani, Amir Hosein Khasahmadi · PDF
Model Editing as a Robust and Denoised variant of DPO: A Case Study on Toxicity
Rheeya Uppaal, Apratim Dey, Yiting He, Yiqiao Zhong, Junjie Hu · PDF
Model Manipulation Attacks Enable More Rigorous Evaluations of LLM Capabilities
Zora Che, Stephen Casper, Anirudh Satheesh, Rohit Gandikota, Domenic Rosati, Stewart Slocum, Lev E McKinney, Zichu Wu, Zikui Cai, Bilal Chughtai, Daniel Filan, Furong Huang, Dylan Hadfield-Menell · PDF
Model Pairing Using Embedding Translation for Backdoor Attack Detection on Open-Set Classification Tasks
Alexander Unnervik, Hatef Otroshi Shahreza, Anjith George, Sébastien Marcel · PDF
MU-Bench: A Multitask Multimodal Benchmark for Machine Unlearning
Jiali Cheng, Hadi Amiri · PDF
MultiVerse: Exposing Large Language Model Alignment Problems in Diverse Worlds
Xiaolong Jin, ZHUO ZHANG, Guangyu Shen, Hanxi Guo, Kaiyuan Zhang, Siyuan Cheng, Xiangyu Zhang · PDF
Network Inversion for Training-Like Data Reconstruction
Pirzada Suhail, Amit Sethi · PDF
NMT-Obfuscator Attack: Ignore a sentence in translation with only one word
Sahar Sadrizadeh, César Descalzo, Ljiljana Dolamic, Pascal Frossard · PDF
On a Spurious Interaction between Uncertainty Scores and Answer Evaluation Metrics in Generative QA Tasks
Andrea Santilli, Miao Xiong, Michael Kirchhof, Pau Rodriguez, Federico Danieli, Xavier Suau, Luca Zappella, Sinead Williamson, Adam Golinski · PDF
On Calibration of LLM-based Guard Models for Reliable Content Moderation
Hongfu Liu, Hengguan Huang, Hao Wang, Xiangming Gu, Ye Wang · PDF
Permute-and-Flip: An optimally stable and watermarkable decoder for LLMs
Xuandong Zhao, Lei Li, Yu-Xiang Wang · PDF
PoisonedParrot: Subtle Data Poisoning Attacks to Elicit Copyright-Infringing Content from Large Language Models
Michael-Andrei Panaitescu-Liess, Pankayaraj Pathmanathan, Yigitcan Kaya, Zora Che, Bang An, Sicheng Zhu, Aakriti Agrawal, Furong Huang · PDF
PopAlign: Population-Level Alignment for Fair Text-to-Image Generation
Shufan Li, Harkanwar Singh, Aditya Grover · PDF
Pre-Training Multimodal Hallucination Detectors with Corrupted Grounding Data
Spencer Whitehead, Jacob Phillips, Sean M. Hendryx · PDF
Preserving Safety in Fine-Tuned Large Language Models: A Systematic Evaluation and Mitigation Strategy
Tsung-Huan Yang, Ko-Wei Huang, Yung-Hui Li, Lun-Wei Ku · PDF
Privacy Protection in Personalized Diffusion Models via Targeted Cross-Attention Adversarial Attack
Xide Xu, Muhammad Atif Butt, Sandesh Kamath, Bogdan Raducanu · PDF
Privacy-Preserving Large Language Model Inference via GPU-Accelerated Fully Homomorphic Encryption
Leo de Castro, Antigoni Polychroniadou, Daniel Escudero · PDF
Pruning for Robust Concept Erasing in Diffusion Models
Tianyun Yang, Ziniu Li, Juan Cao, Chang Xu · PDF
Red Teaming Language-Conditioned Robot Models via Vision Language Models
Sathwik Karnik, Zhang-Wei Hong, Nishant Abhangi, Yen-Chen Lin, Tsun-Hsuan Wang, Pulkit Agrawal · PDF
Refusal Tokens: A Simple Way to Calibrate Refusals in Large Language Models
Neel Jain, Aditya Shrivastava, Chenyang Zhu, Daben Liu, Alfy Samuel, Ashwinee Panda, Anoop Kumar, Micah Goldblum, Tom Goldstein · PDF
Representation Collapsing Problems in Vector Quantization
Wenhao Zhao, Qiran Zou, Rushi Shah, Dianbo Liu · PDF
Retention Score: Quantifying Jailbreak Risks for Vision Language Models
ZAITANG LI, Pin-Yu Chen, Tsung-Yi Ho · PDF
Rethinking Adversarial Attacks as Protection Against Diffusion-based Mimicry
Haotian Xue, Yongxin Chen · PDF
RLHS: Mitigating Misalignment in RLHF with Hindsight Simulation
Kaiqu Liang, Haimin Hu, Ryan Liu, Thomas L. Griffiths, Jaime Fernández Fisac · PDF
Rule-Guided Language Model Alignment for Text Generation Management in Industrial Use Cases
Shunichi Akatsuka, Aman Kumar, Xian Yeow Lee, Lasitha Vidyaratne, Dipanjan Dipak Ghosh, Ahmed K. Farahat · PDF
Safe and Sound: Evaluating Language Models for Bias Mitigation and Understanding
Shaina Raza, Oluwanifemi Bamgbose, Shardul Ghuge, Deval Pandya · PDF
Safe Decision Transformer with Learning-based Constraints
Ruhan Wang, Dongruo Zhou · PDF
Safety-Aware Fine-Tuning of Large Language Models
Hyeong Kyu Choi, Xuefeng Du, Yixuan Li · PDF
SCIURus: Shared Circuits for Interpretable Uncertainty Representations in Language Models
Carter Teplica, Yixin Liu, Arman Cohan, Tim G. J. Rudner · PDF
Self-Preference Bias in LLM-as-a-Judge
Koki Wataoka, Tsubasa Takahashi, Ryokan Ri · PDF
Self-Supervised Bisimulation Action Chunk Representation for Efficient RL
Lei Shi, Jianye HAO, Hongyao Tang, Zibin Dong, YAN ZHENG · PDF
Semantic Membership Inference Attack against Large Language Models
Hamid Mozaffari, Virendra Marathe · PDF
Shallow Diffuse: Robust and Invisible Watermarking through Low-Dimensional Subspaces in Diffusion Models
Wenda Li, Huijie Zhang, Qing Qu · PDF
Simplicity Prevails: Rethinking Negative Preference Optimization for LLM Unlearning
Chongyu Fan, Jiancheng Liu, Licong Lin, Jinghan Jia, Ruiqi Zhang, Song Mei, Sijia Liu · PDF
Simulation System Towards Solving Societal-Scale Manipulation
Maximilian Puelma Touzel, Sneheel Sarangi, Austin Welch, Gayatri K, Dan Zhao, Zachary Yang, Hao Yu, Tom Gibbs, Ethan Kosak-Hine, Andreea Musulan, Camille Thibault, Busra Tugce Gurbuz, Reihaneh Rabbany, Jean-François Godbout, Kellin Pelrine · PDF
Smoothed Embeddings for Robust Language Models
Ryo Hase, Md Rafi Ur Rashid, Ashley Lewis, Jing Liu, Toshiaki Koike-Akino, Kieran Parsons, Ye Wang · PDF
SolidMark: Evaluating Image Memorization in Generative Models
Nicky Kriplani, Minh Pham, Gowthami Somepalli, Chinmay Hegde, Niv Cohen · PDF
Steering Without Side Effects: Improving Post-Deployment Control of Language Models
Asa Cooper Stickland, Alexander Lyzhov, Jacob Pfau, Salsabila Mahdi, Samuel R. Bowman · PDF
Stronger Universal and Transfer Attacks by Suppressing Refusals
David Huang, Avidan Shah, Alexandre Araujo, David Wagner, Chawin Sitawarin · PDF
Targeted Unlearning with Single Layer Unlearning Gradient
Zikui Cai, Yaoteng Tan, M. Salman Asif · PDF
Testing the Limits of Jailbreaking Defenses with the Purple Problem
Taeyoun Kim, Suhas Kotha, Aditi Raghunathan · PDF
The effect of fine-tuning on language model toxicity
Will Hawkins, Brent Mittelstadt, Chris Russell · PDF
The Empirical Impact of Data Sanitization on Language Models
Anwesan Pal, Radhika Bhargava, Kyle Hinsz, Jacques Esterhuizen, Sudipta Bhattacharya · PDF
The Impact of Inference Acceleration Strategies on Bias of Large Language Models
Elisabeth Kirsten, Ivan Habernal, Vedant Nanda, Muhammad Bilal Zafar · PDF
The Probe Paradigm: A Theoretical Foundation for Explaining Generative Models
Amit Kiran Rege · PDF
The Structural Safety Generalization Problem
Tom Gibbs, Julius Broomfield, George Ingebretsen, Ethan Kosak-Hine, Tia Nasir, Jason Zhang, Reihaneh Iranmanesh, Sara Pieri, Reihaneh Rabbany, Kellin Pelrine · PDF
Token Highlighter: Inspecting and Mitigating Jailbreak Prompts for Large Language Models
Xiaomeng Hu, Pin-Yu Chen, Tsung-Yi Ho · PDF
Towards a Theory of AI Personhood
Francis Rhys Ward · PDF
Towards Inference-time Category-wise Safety Steering for Large Language Models
Amrita Bhattacharjee, Shaona Ghosh, Traian Rebedea, Christopher Parisien · PDF
Towards Resource Efficient and Interpretable Bias Mitigation in Natural Language Generation
Schrasing Tong, Eliott Zemour, Rawisara Lohanimit, Lalana Kagal · PDF
Towards Safe and Honest AI Agents with Neural Self-Other Overlap
Marc Carauleanu, Michael Vaiana, Judd Rosenblatt, Cameron Berg, Diogo S de Lucena · PDF
Towards Scalable Exact Machine Unlearning Using Parameter-Efficient Fine-Tuning
Somnath Basu Roy Chowdhury, Krzysztof Marcin Choromanski, Arijit Sehanobish, Kumar Avinava Dubey, Snigdha Chaturvedi · PDF
Universal Jailbreak Backdoors in Large Language Model Alignment
Thomas Baumann · PDF
Unlearning in- vs. out-of-distribution data in LLMs under gradient-based methods
Teodora Baluta, Pascal Lamblin, Daniel Tarlow, Fabian Pedregosa, Gintare Karolina Dziugaite · PDF
Variational Diffusion Unlearning: a variational inference framework for unlearning in diffusion models
Subhodip Panda, Varun M S, Shreyans Jain, Sarthak Kumar Maharana, Prathosh AP · PDF
Waste Not, Want Not; Recycled Gumbel Noise Improves Consistency in Natural Language Generation
Damien De Mijolla, Hannan Saddiq, Kim Moore · PDF
Weak-to-Strong Confidence Prediction
Yukai Yang, Tracy Yixin Zhu, Marco Morucci, Tim G. J. Rudner · PDF
What do we learn from inverting CLIP models?
Hamid Kazemi, Atoosa Chegini, Jonas Geiping, Soheil Feizi, Tom Goldstein · PDF
What You See Is What You Get: Entity-Aware Summarization for Reliable Sponsored Search
Xiao Liang, Xinyu Hu, Simiao Zuo, Jimi He, Yu Wang, Victor Ye Dong, Yeyun Gong, Kushal S. Dave, Yi Liu, Qiang Lou, Shao-Lun Huang, Jian Jiao · PDF
Which LLMs are Difficult to Detect? A Detailed Analysis of Potential Factors Contributing to Difficulties in LLM Text Detection
Shantanu Thorat, Tianbao Yang · PDF
Zer0-Jack: A memory-efficient gradient-based jailbreaking method for black box Multi-modal Large Language Models
Tiejin Chen, Kaishen Wang, Hua Wei · PDF


AdvBDGen: Adversarially Fortified Prompt-Specific Fuzzy Backdoor Generator Against LLM Alignment

Adversarial Prompt Evaluation: Systematic Benchmarking of Guardrails Against Prompt Input Attacks on LLMs

Adversarial Vulnerabilities in Large Language Models for Time Series Forecasting

AEGIS2.0: A Diverse AI Safety Dataset and Risks Taxonomy for Alignment of LLM Guardrails

AI Red Teaming through the Lens of Measurement Theory

An Examination of AI-Generated Text Detectors Across Multiple Domains and Models

An Undetectable Watermark for Generative Image Models

Anchored Optimization and Contrastive Revisions: Addressing Reward Hacking in Alignment

AnyPrefer: An Automatic Framework for Preference Data Synthesis

Applying Refusal-Vector Ablation to Llama 3.1 70B Agents

Applying Sparse Autoencoders to Unlearn Knowledge in Language Models

Auditing Empirical Privacy Protection of Private LLM Adaptations

Auto-Enhance: Towards a Meta-Benchmark to Evaluate AI Agents' Ability to Improve Other Agents

AutoDefense: Multi-Agent LLM Defense against Jailbreak Attacks

Buffer Overflow in Mixture of Experts

Can Editing LLMs Inject Harm?

Can Generative AI Solve Your In-Context Learning Problem? A Martingale Perspective

Can Knowledge Editing Really Correct Hallucinations?

Can LLMs Verify Arabic Claims? Evaluating the Arabic Fact-Checking Abilities of Multilingual LLMs

Can Safety Fine-Tuning Be More Principled? Lessons Learned from Cybersecurity

ChatBug: A Common Vulnerability of Aligned LLMs Induced by Chat Templates

Cheating Automatic LLM Benchmarks: Null Models Achieve High Win Rates

Choose Your Anchor Wisely: Effective Unlearning Diffusion Models via Concept Reconditioning

CodeUnlearn: Amortized Zero-Shot Machine Unlearning in Language Models Using Discrete Concept

Concept Denoising Score Matching for Responsible Text-to-Image Generation

Concept Unlearning for Large Language Models

Controllable Generation via Locally Constrained Resampling

CoS: Enhancing Personalization and Mitigating Bias with Context Steering

CPSample: Classifier Protected Sampling for Guarding Training Data During Diffusion

Cream: Consistency Regularized Self-Rewarding Language Models

Datasets for Navigating Sensitive Topics in Preference Data and Recommendations

Debiasing Large Vision-Language Models by Ablating Protected Attribute Representations

Decoding Diffusion: A Scalable Framework for Unsupervised Analysis of Latent Space Biases and Representations Using Natural Language Prompts

DeepInception: Hypnotize Large Language Model to Be Jailbreaker

Designing Physical-World Universal Attacks on Vision Transformers

Detecting Origin Attribution for Text-to-Image Diffusion Models in RGB and Beyond

Differential Privacy of Cross-Attention with Provable Guarantee

Differentially Private Attention Computation

Differentially Private Sequential Data Synthesis with Structured State Space Models and Diffusion Models

DiffTextPure: Defending Large Language Models with Diffusion Purifiers

Do LLMs estimate uncertainty well in instruction-following?

Does Refusal Training in LLMs Generalize to the Past Tense?

Does Safety Training of LLMs Generalize to Semantically Related Natural Prompts?

Dynamic Negative Guidance of Diffusion Models: Towards Immediate Content Removal

EchoQA: A Large Collection of Instruction Tuning Data for Echocardiogram Reports

Efficient and Effective Uncertainty Quantification for LLMs

Efficiently Identifying Watermarked Segments in Mixed-Source Texts

Energy-Based Conceptual Diffusion Model

EnsemW2S: Can an Ensemble of LLMs be Leveraged to Obtain a Stronger LLM?

Epistemic Integrity in Large Language Models

Exploring Memorization and Copyright Violation in Frontier LLMs: A Study of the New York Times v. OpenAI 2023 Lawsuit

Extracting Unlearned Information from LLMs with Activation Steering

Fair Image Generation from Pre-trained Models by Probabilistic Modeling

Fine-Tuning Large Language Models to Appropriately Abstain with Semantic Entropy

Gaussian Splatting Under Attack: Investigating Adversarial Noise in 3D Objects

GenAudit: Fixing Factual Errors in Language Model Outputs with Evidence

GRE Score: Generative Risk Evaluation for Large Language Models

GuardFormer: Guardrail Instruction Pretraining for Efficient SafeGuarding

H-Space Sparse Autoencoders

Hallucination Detox: Sensitive Neuron Dropout (SeND) for Large Language Model Training

HarmLevelBench: Evaluating Harm-Level Compliance and the Impact of Quantization on Model Alignment

Has My System Prompt Been Used? Large Language Model Prompt Membership Inference

HEARTS: A Holistic Framework for Explainable, Sustainable and Robust Text Stereotype Detection

Hidden in Plain Text: Emergence & Mitigation of Steganographic Collusion in LLMs

Hidden in the Noise: Two-Stage Robust Watermarking for Images

Honesty to Subterfuge: In-Context Reinforcement Learning Can Make Honest Models Reward Hack

How Easy is It to Fool Your Multimodal LLMs? An Empirical Analysis on Deceptive Prompt

How Many Van Goghs Does It Take to Van Gogh? Finding the Imitation Threshold

How new data pollutes LLM knowledge and how to dilute it

How to Determine the Preferred Image Distribution of a Black-Box Vision-Language Model?

HyperFace: Generating Synthetic Face Recognition Datasets by Exploring Face Embedding Hypersphere

Identifying and Addressing Delusions for Target-Directed Decision Making

Imitation guided Automated Red Teaming

Improving LLM Group Fairness on Tabular Data via In-Context Learning