ICLR 2024 · Past · Large language models · Safety & alignment · Privacy & security

ICLR 2024 Workshop on Secure and Trustworthy Large Language Models

SeT LLM @ ICLR 2024


Submission deadline: Feb 20, 2024, 23:59 UTC
Imported from OpenReview; check the website for extensions.
Submission portal: OpenReview
Notes: Topics were auto-suggested and may be imprecise — edits welcome.

Accepted papers (72)

Fetched from OpenReview (v2) on 2026-06-10.

A closer look at adversarial suffix learning for Jailbreaking LLMs
Zhe Wang, Yanjun Qi · PDF
An Assessment of Model-on-Model Deception
Julius Heitkoetter, Michael Gerovitch, Laker Newhouse · PDF
Are Large Language Models Bayesian? A Martingale Perspective on In-Context Learning
Fabian Falck, Ziyu Wang, Christopher C. Holmes · PDF
ArtPrompt: ASCII Art-based Jailbreak Attacks against Aligned LLMs
Fengqing Jiang, Zhangchen Xu, Luyao Niu, Zhen Xiang, Bhaskar Ramasubramanian, Bo Li, Radha Poovendran · PDF
Assessing Prompt Injection Risks in 200+ Custom GPTs
Jiahao Yu, Yuhang Wu, Dong Shu, Mingyu Jin, Sabrina Yang, Xinyu Xing · PDF
Assessing the Brittleness of Safety Alignment via Pruning and Low-Rank Modifications
Boyi Wei, Kaixuan Huang, Yangsibo Huang, Tinghao Xie, Xiangyu Qi, Mengzhou Xia, Prateek Mittal, Mengdi Wang, Peter Henderson · PDF
Attacking LLM Watermarks by Exploiting Their Strengths
Qi Pang, Shengyuan Hu, Wenting Zheng, Virginia Smith · PDF
Attacks on Third-Party APIs of Large Language Models
Wanru Zhao, Vidit Khazanchi, Haodi Xing, Xuanli He, Qiongkai Xu, Nicholas Donald Lane · PDF
Backward Chaining Circuits in a Transformer Trained on a Symbolic Reasoning Task
Jannik Brinkmann, Abhay Sheshadri, Victor Levoso, Paul Swoboda, Christian Bartelt · PDF
Bayesian reward models for LLM alignment
Adam X. Yang, Maxime Robeyns, Thomas Coste, Jun Wang, Haitham Bou Ammar, Laurence Aitchison · PDF
BEYOND FINE-TUNING: LORA MODULES BOOST NEAR-OOD DETECTION AND LLM SECURITY
Etienne Salimbeni, Francesco Craighero, Renata Khasanova, Milos Vasic, Pierre Vandergheynst · PDF
Break it, Imitate it, Fix it: Robustness by Generating Human-Like Attacks
Aradhana Sinha, Ananth Balashankar, Ahmad Beirami, Thi Avrahami, Jilin Chen, Alex Beutel · PDF
Calibrating Language Models With Adaptive Temperature Scaling
Johnathan Xie, Annie S Chen, Yoonho Lee, Eric Mitchell, Chelsea Finn · PDF
Can LLMs Separate Instructions From Data? And What Do We Even Mean By That?
Egor Zverev, Sahar Abdelnabi, Mario Fritz, Christoph H. Lampert · PDF
Character-level robustness should be revisited
Elias Abad Rocamora, Yongtao Wu, Fanghui Liu, Grigorios Chrysos, Volkan Cevher · PDF
Coercing LLMs to do and reveal (almost) anything
Jonas Geiping, Alex Stein, Manli Shu, Khalid Saifullah, Yuxin Wen, Tom Goldstein · PDF
CollabEdit: Towards Non-destructive Collaborative Knowledge Editing
Jiamu Zheng, Jinghuai Zhang, Futing Wang, Tianyu Du, Tao Lin · PDF
Decoding Compressed Trust: Scrutinizing the Trustworthiness of Efficient LLMs Under Compression
Junyuan Hong, Jinhao Duan, Chenhui Zhang, Zhangheng LI, Chulin Xie, Kelsey Lieberman, James Diffenderfer, Brian R. Bartoldson, AJAY KUMAR JAISWAL, Kaidi Xu, Bhavya Kailkhura, Dan Hendrycks, Dawn Song, Zhangyang Wang, Bo Li · PDF
Differentially Private Synthetic Data via Foundation Model APIs 2: Text
Chulin Xie, Zinan Lin, Arturs Backurs, Sivakanth Gopi, Da Yu, Huseyin A Inan, Harsha Nori, Haotian Jiang, Huishuai Zhang, Yin Tat Lee, Bo Li, Sergey Yekhanin · PDF
DUAW: Data-free Universal Adversarial Watermark against Stable Diffusion Customization
Xiaoyu Ye, Hao Huang, Jiaqi An, Yongtao Wang · PDF
Enhancing and Evaluating Logical Reasoning Abilities of Large Language Models
Shujie Deng, Honghua Dong, Xujie Si · PDF
Explorations of Self-Repair in Language Model
Cody Rushing, Neel Nanda · PDF
Exploring the Adversarial Capabilities of Large Language Models
Lukas Struppek, Minh Hieu Le, Dominik Hintersdorf, Kristian Kersting · PDF
Fight Back Against Jailbreaking via Prompt Adversarial Tuning
· PDF
Group Preference Optimization: Few-Shot Alignment of Large Language Models
Siyan Zhao, John Dang, Aditya Grover · PDF
GUARD: Role-playing to Generate Natural-language Jailbreakings to Test Guideline Adherence of Large Language Models
Haibo Jin, Ruoxi Chen, Andy Zhou, Yang Zhang, Haohan Wang · PDF
How many Opinions does your LLM have? Improving Uncertainty Estimation in NLG
Lukas Aichberger, Kajetan Schweighofer, Mykyta Ielanskyi, Sepp Hochreiter · PDF
How Susceptible are Large Language Models to Ideological Manipulation?
Kai Chen, Zihao He, Jun Yan, Taiwei Shi, Kristina Lerman · PDF
I'm not familiar with the name Harry Potter: Prompting Baselines for Unlearning in LLMs
Pratiksha Thaker, Yash Maurya, Virginia Smith · PDF
Initial Response Selection for Prompt Jailbreaking using Model Steering
Thien Q. Tran, Koki Wataoka, Tsubasa Takahashi · PDF
Is Your Jailbreaking Prompt Truly Effective for Large Language Models?
· PDF
Large Language Model Bias Mitigation from the Perspective of Knowledge Editing
Ruizhe Chen, Yichen Li, Zikai Xiao, Zuozhu Liu · PDF
Leveraging Context in Jailbreaking Attacks
Yixin Cheng, Markos Georgopoulos, Volkan Cevher, Grigorios Chrysos · PDF
LoRA Fine-tuning Efficiently Undoes Safety Training in Llama 2-Chat 70B
Simon Lermen, Charlie Rogers-Smith · PDF
MARS: Meaning-Aware Response Scoring for Uncertainty Estimation in Generative LLMs
Yavuz Faruk Bakman, Duygu Nur Yaldiz, Baturalp Buyukates, Chenyang Tao, Dimitrios Dimitriadis, Salman Avestimehr · PDF
Mechanistically analyzing the effects of fine-tuning on procedurally defined tasks
Samyak Jain, Robert Kirk, Ekdeep Singh Lubana, Robert P. Dick, Hidenori Tanaka, Edward Grefenstette, Tim Rocktäschel, David Krueger · PDF
On Fairness Implications and Evaluations of Low-Rank Adaptation of Large Models
Ken Liu, Zhoujie Ding, Berivan Isik, Sanmi Koyejo · PDF
On Prompt-Driven Safeguarding for Large Language Models
Chujie Zheng, Fan Yin, Hao Zhou, Fandong Meng, Jie Zhou, Kai-Wei Chang, Minlie Huang, Nanyun Peng · PDF
On Trojan Signatures in Large Language Models of Code
Aftab Hussain, Md Rafiqul Islam Rabin, Amin Alipour · PDF
Open Sesame! Universal Black-Box Jailbreaking of Large Language Models
Raz Lapid, Ron Langberg, Moshe Sipper · PDF
PANDORA: Detailed LLM Jailbreaking via Collaborated Phishing Agents with Decomposed Reasoning
Zhaorun Chen, Zhuokai Zhao, Wenjie Qu, Zichen Wen, Zhiguang Han, Zhihong Zhu, Jiaheng Zhang, Huaxiu Yao · PDF
PETA: PARAMETER-EFFICIENT TROJAN ATTACKS
Lauren Hong, Ting Wang · PDF
Preventing Memorized Completions through White-Box Filtering
· PDF
Privacy-preserving Fine-tuning of Large Language Models through Flatness
Tiejin Chen, Longchao Da, Huixue Zhou, Pingzhi Li, Kaixiong Zhou, Tianlong Chen, Hua Wei · PDF
Quantitative Certification of Knowledge Comprehension in LLMs
Isha Chaudhary, Vedaant V Jain, Gagandeep Singh · PDF
Rainbow Teaming: Open-Ended Generation of Diverse Adversarial Prompts
Mikayel Samvelyan, Sharath Chandra Raparthy, Andrei Lupu, Eric Hambro, Aram H. Markosyan, Manish Bhatt, Yuning Mao, Minqi Jiang, Jack Parker-Holder, Jakob Nicolaus Foerster, Tim Rocktäschel, Roberta Raileanu · PDF
Red Teaming GPT-4V: Are GPT-4V Safe Against Uni/Multi-Modal Jailbreak Attacks?
Shuo Chen, Zhen Han, Bailan He, Zifeng Ding, Wenqian Yu, Philip Torr, Volker Tresp, Jindong Gu · PDF
Retrieval Augmented Prompt Optimization
Yifan Sun, Jean-Baptiste Tien, Karthik lakshmanan · PDF
Robust Prompt Optimization for Defending Language Models Against Jailbreaking Attacks
Andy Zhou, Bo Li, Haohan Wang · PDF
SafeDecoding: Defending against Jailbreak Attacks via Safety-Aware Decoding
Zhangchen Xu, Fengqing Jiang, Luyao Niu, Jinyuan Jia, Bill Yuchen Lin, Radha Poovendran · PDF
Safer-Instruct: Aligning Language Models with Automated Preference Data
Taiwei Shi, Kai Chen, Jieyu Zhao · PDF
Self-Alignment of Large Language Models via Social Scene Simulation
Xianghe Pang, Shuo Tang, Rui Ye, Yuxin Xiong, Bolun Zhang, Yanfeng Wang, Siheng Chen · PDF
Self-evaluation and self-prompting to improve the reliability of LLMs
Alexandre Piché, Aristides Milios, Dzmitry Bahdanau, Christopher Pal · PDF
Sequence-Level Certainty Reduces Hallucination In Knowledge-Grounded Dialogue Generation
Yixin Wan, Fanyou Wu, Weijie Xu, Srinivasan H. Sengamedu · PDF
Shadow Alignment: The Ease of Subverting Safely-Aligned Language Models
Xianjun Yang, Xiao Wang, Qi Zhang, Linda Ruth Petzold, William Yang Wang, Xun Zhao, Dahua Lin · PDF
Shadowcast: Stealthy Data Poisoning Attacks Against Vision-Language Models
Yuancheng Xu, Jiarui Yao, Manli Shu, Yanchao Sun, Zichu Wu, Ning Yu, Tom Goldstein, Furong Huang · PDF
Simple Permutations Can Fool LLaMA: Permutation Attack and Defense for Large Language Models
Liang CHEN, Yatao Bian, Li Shen, Kam-Fai Wong · PDF
Single-pass detection of jailbreaking input in large language models
Leyla Naz Candogan, Yongtao Wu, Elias Abad Rocamora, Grigorios Chrysos, Volkan Cevher · PDF
Source-Aware Training Enables Knowledge Attribution in Language Models
Muhammad Khalifa, David Wadden, Emma Strubell, Honglak Lee, Lu Wang, Iz Beltagy, Hao Peng · PDF
Steering LLMs Towards Unbiased Responses: A Causality-Guided Debiasing Framework
· PDF
Tailoring Self-Rationalizers with Multi-Reward Distillation
Sahana Ramnath, Brihi Joshi, Skyler Hallinan, Ximing Lu, Liunian Harold Li, Aaron Chan, Jack Hessel, Yejin Choi, Xiang Ren · PDF
The Effect of Model Size on LLM Post-hoc Explainability via LIME
Henning Heyen, Amy Widdicombe, Noah Yamamoto Siegel, Philip Colin Treleaven, Maria Perez-Ortiz · PDF
TOFU: A Task of Fictitious Unlearning for LLMs
Pratyush Maini, Zhili Feng, Avi Schwarzschild, Zachary Chase Lipton, J Zico Kolter · PDF
Toward Robust Unlearning for LLMs
· PDF
Towards Principled Evaluations of Sparse Autoencoders for Interpretability and Control
Aleksandar Makelov, Georg Lange, Neel Nanda · PDF
TrustScore: Reference-Free Evaluation of LLM Response Trustworthiness
Danna Zheng, Danyang Liu, Mirella Lapata, Jeff Z. Pan · PDF
Walk the Talk? Measuring the Faithfulness of Large Language Model Explanations
· PDF
Watermark Stealing in Large Language Models
Nikola Jovanović, Robin Staab, Martin Vechev · PDF
Watermarks in the Sand: Impossibility of Strong Watermarking for Generative Models
Hanlin Zhang, Benjamin L. Edelman, Danilo Francati, Daniele Venturi, Giuseppe Ateniese, Boaz Barak · PDF
WatME: Towards Lossless Watermarking Through Lexical Redundancy
Liang CHEN, Yatao Bian, Yang Deng, Deng Cai, Shuaiyi Li, Peilin Zhao, Kam-Fai Wong · PDF
What's in Your "Safe" Data?: Identifying Benign Data that Breaks Safety
Luxi He, Mengzhou Xia, Peter Henderson · PDF
WinoViz: Probing Visual Properties of Objects Under Different States
Woojeong Jin, Tejas Srinivasan, Jesse Thomason, Xiang Ren · PDF
