ICLR 2024 · Past · Large language models · Safety & alignment · Privacy & security

ICLR 2024 Workshop on Secure and Trustworthy Large Language Models

SeT LLM @ ICLR 2024

Submission deadline
Feb 20, 2024, 23:59 UTC
imported from OpenReview — check the website for extensions
Submission portal
OpenReview
Notes
Auto-imported from the OpenReview venue record on 2026-06-10 — please verify and enrich (topics are keyword-guessed).

Accepted papers (72)

Fetched from OpenReview (v2) on 2026-06-10.

  1. A closer look at adversarial suffix learning for Jailbreaking LLMs

    Zhe Wang, Yanjun Qi · PDF
  2. An Assessment of Model-on-Model Deception

    Julius Heitkoetter, Michael Gerovitch, Laker Newhouse · PDF
  3. Are Large Language Models Bayesian? A Martingale Perspective on In-Context Learning

    Fabian Falck, Ziyu Wang, Christopher C. Holmes · PDF
  4. ArtPrompt: ASCII Art-based Jailbreak Attacks against Aligned LLMs

    Fengqing Jiang, Zhangchen Xu, Luyao Niu, Zhen Xiang, Bhaskar Ramasubramanian, Bo Li, Radha Poovendran · PDF
  5. Assessing Prompt Injection Risks in 200+ Custom GPTs

    Jiahao Yu, Yuhang Wu, Dong Shu, Mingyu Jin, Sabrina Yang, Xinyu Xing · PDF
  6. Assessing the Brittleness of Safety Alignment via Pruning and Low-Rank Modifications

    Boyi Wei, Kaixuan Huang, Yangsibo Huang, Tinghao Xie, Xiangyu Qi, Mengzhou Xia, Prateek Mittal, Mengdi Wang, Peter Henderson · PDF
  7. Attacking LLM Watermarks by Exploiting Their Strengths

    Qi Pang, Shengyuan Hu, Wenting Zheng, Virginia Smith · PDF
  8. Attacks on Third-Party APIs of Large Language Models

    Wanru Zhao, Vidit Khazanchi, Haodi Xing, Xuanli He, Qiongkai Xu, Nicholas Donald Lane · PDF
  9. Backward Chaining Circuits in a Transformer Trained on a Symbolic Reasoning Task

    Jannik Brinkmann, Abhay Sheshadri, Victor Levoso, Paul Swoboda, Christian Bartelt · PDF
  10. Bayesian reward models for LLM alignment

    Adam X. Yang, Maxime Robeyns, Thomas Coste, Jun Wang, Haitham Bou Ammar, Laurence Aitchison · PDF
  11. BEYOND FINE-TUNING: LORA MODULES BOOST NEAR-OOD DETECTION AND LLM SECURITY

    Etienne Salimbeni, Francesco Craighero, Renata Khasanova, Milos Vasic, Pierre Vandergheynst · PDF
  12. Break it, Imitate it, Fix it: Robustness by Generating Human-Like Attacks

    Aradhana Sinha, Ananth Balashankar, Ahmad Beirami, Thi Avrahami, Jilin Chen, Alex Beutel · PDF
  13. Calibrating Language Models With Adaptive Temperature Scaling

    Johnathan Xie, Annie S Chen, Yoonho Lee, Eric Mitchell, Chelsea Finn · PDF
  14. Can LLMs Separate Instructions From Data? And What Do We Even Mean By That?

    Egor Zverev, Sahar Abdelnabi, Mario Fritz, Christoph H. Lampert · PDF
  15. Character-level robustness should be revisited

    Elias Abad Rocamora, Yongtao Wu, Fanghui Liu, Grigorios Chrysos, Volkan Cevher · PDF
  16. Coercing LLMs to do and reveal (almost) anything

    Jonas Geiping, Alex Stein, Manli Shu, Khalid Saifullah, Yuxin Wen, Tom Goldstein · PDF
  17. CollabEdit: Towards Non-destructive Collaborative Knowledge Editing

    Jiamu Zheng, Jinghuai Zhang, Futing Wang, Tianyu Du, Tao Lin · PDF
  18. Decoding Compressed Trust: Scrutinizing the Trustworthiness of Efficient LLMs Under Compression

    Junyuan Hong, Jinhao Duan, Chenhui Zhang, Zhangheng Li, Chulin Xie, Kelsey Lieberman, James Diffenderfer, Brian R. Bartoldson, Ajay Kumar Jaiswal, Kaidi Xu, Bhavya Kailkhura, Dan Hendrycks, Dawn Song, Zhangyang Wang, Bo Li · PDF
  19. Differentially Private Synthetic Data via Foundation Model APIs 2: Text

    Chulin Xie, Zinan Lin, Arturs Backurs, Sivakanth Gopi, Da Yu, Huseyin A Inan, Harsha Nori, Haotian Jiang, Huishuai Zhang, Yin Tat Lee, Bo Li, Sergey Yekhanin · PDF
  20. DUAW: Data-free Universal Adversarial Watermark against Stable Diffusion Customization

    Xiaoyu Ye, Hao Huang, Jiaqi An, Yongtao Wang · PDF
  21. Enhancing and Evaluating Logical Reasoning Abilities of Large Language Models

    Shujie Deng, Honghua Dong, Xujie Si · PDF
  22. Explorations of Self-Repair in Language Models

    Cody Rushing, Neel Nanda · PDF
  23. Exploring the Adversarial Capabilities of Large Language Models

    Lukas Struppek, Minh Hieu Le, Dominik Hintersdorf, Kristian Kersting · PDF
  24. Fight Back Against Jailbreaking via Prompt Adversarial Tuning

    · PDF
  25. Group Preference Optimization: Few-Shot Alignment of Large Language Models

    Siyan Zhao, John Dang, Aditya Grover · PDF
  26. GUARD: Role-playing to Generate Natural-language Jailbreakings to Test Guideline Adherence of Large Language Models

    Haibo Jin, Ruoxi Chen, Andy Zhou, Yang Zhang, Haohan Wang · PDF
  27. How many Opinions does your LLM have? Improving Uncertainty Estimation in NLG

    Lukas Aichberger, Kajetan Schweighofer, Mykyta Ielanskyi, Sepp Hochreiter · PDF
  28. How Susceptible are Large Language Models to Ideological Manipulation?

    Kai Chen, Zihao He, Jun Yan, Taiwei Shi, Kristina Lerman · PDF
  29. I'm not familiar with the name Harry Potter: Prompting Baselines for Unlearning in LLMs

    Pratiksha Thaker, Yash Maurya, Virginia Smith · PDF
  30. Initial Response Selection for Prompt Jailbreaking using Model Steering

    Thien Q. Tran, Koki Wataoka, Tsubasa Takahashi · PDF
  31. Is Your Jailbreaking Prompt Truly Effective for Large Language Models?

    · PDF
  32. Large Language Model Bias Mitigation from the Perspective of Knowledge Editing

    Ruizhe Chen, Yichen Li, Zikai Xiao, Zuozhu Liu · PDF
  33. Leveraging Context in Jailbreaking Attacks

    Yixin Cheng, Markos Georgopoulos, Volkan Cevher, Grigorios Chrysos · PDF
  34. LoRA Fine-tuning Efficiently Undoes Safety Training in Llama 2-Chat 70B

    Simon Lermen, Charlie Rogers-Smith · PDF
  35. MARS: Meaning-Aware Response Scoring for Uncertainty Estimation in Generative LLMs

    Yavuz Faruk Bakman, Duygu Nur Yaldiz, Baturalp Buyukates, Chenyang Tao, Dimitrios Dimitriadis, Salman Avestimehr · PDF
  36. Mechanistically analyzing the effects of fine-tuning on procedurally defined tasks

    Samyak Jain, Robert Kirk, Ekdeep Singh Lubana, Robert P. Dick, Hidenori Tanaka, Edward Grefenstette, Tim Rocktäschel, David Krueger · PDF
  37. On Fairness Implications and Evaluations of Low-Rank Adaptation of Large Models

    Ken Liu, Zhoujie Ding, Berivan Isik, Sanmi Koyejo · PDF
  38. On Prompt-Driven Safeguarding for Large Language Models

    Chujie Zheng, Fan Yin, Hao Zhou, Fandong Meng, Jie Zhou, Kai-Wei Chang, Minlie Huang, Nanyun Peng · PDF
  39. On Trojan Signatures in Large Language Models of Code

    Aftab Hussain, Md Rafiqul Islam Rabin, Amin Alipour · PDF
  40. Open Sesame! Universal Black-Box Jailbreaking of Large Language Models

    Raz Lapid, Ron Langberg, Moshe Sipper · PDF
  41. PANDORA: Detailed LLM Jailbreaking via Collaborated Phishing Agents with Decomposed Reasoning

    Zhaorun Chen, Zhuokai Zhao, Wenjie Qu, Zichen Wen, Zhiguang Han, Zhihong Zhu, Jiaheng Zhang, Huaxiu Yao · PDF
  42. PETA: PARAMETER-EFFICIENT TROJAN ATTACKS

    Lauren Hong, Ting Wang · PDF
  43. Preventing Memorized Completions through White-Box Filtering

    · PDF
  44. Privacy-preserving Fine-tuning of Large Language Models through Flatness

    Tiejin Chen, Longchao Da, Huixue Zhou, Pingzhi Li, Kaixiong Zhou, Tianlong Chen, Hua Wei · PDF
  45. Quantitative Certification of Knowledge Comprehension in LLMs

    Isha Chaudhary, Vedaant V Jain, Gagandeep Singh · PDF
  46. Rainbow Teaming: Open-Ended Generation of Diverse Adversarial Prompts

    Mikayel Samvelyan, Sharath Chandra Raparthy, Andrei Lupu, Eric Hambro, Aram H. Markosyan, Manish Bhatt, Yuning Mao, Minqi Jiang, Jack Parker-Holder, Jakob Nicolaus Foerster, Tim Rocktäschel, Roberta Raileanu · PDF
  47. Red Teaming GPT-4V: Are GPT-4V Safe Against Uni/Multi-Modal Jailbreak Attacks?

    Shuo Chen, Zhen Han, Bailan He, Zifeng Ding, Wenqian Yu, Philip Torr, Volker Tresp, Jindong Gu · PDF
  48. Retrieval Augmented Prompt Optimization

    Yifan Sun, Jean-Baptiste Tien, Karthik Lakshmanan · PDF
  49. Robust Prompt Optimization for Defending Language Models Against Jailbreaking Attacks

    Andy Zhou, Bo Li, Haohan Wang · PDF
  50. SafeDecoding: Defending against Jailbreak Attacks via Safety-Aware Decoding

    Zhangchen Xu, Fengqing Jiang, Luyao Niu, Jinyuan Jia, Bill Yuchen Lin, Radha Poovendran · PDF
  51. Safer-Instruct: Aligning Language Models with Automated Preference Data

    Taiwei Shi, Kai Chen, Jieyu Zhao · PDF
  52. Self-Alignment of Large Language Models via Social Scene Simulation

    Xianghe Pang, Shuo Tang, Rui Ye, Yuxin Xiong, Bolun Zhang, Yanfeng Wang, Siheng Chen · PDF
  53. Self-evaluation and self-prompting to improve the reliability of LLMs

    Alexandre Piché, Aristides Milios, Dzmitry Bahdanau, Christopher Pal · PDF
  54. Sequence-Level Certainty Reduces Hallucination In Knowledge-Grounded Dialogue Generation

    Yixin Wan, Fanyou Wu, Weijie Xu, Srinivasan H. Sengamedu · PDF
  55. Shadow Alignment: The Ease of Subverting Safely-Aligned Language Models

    Xianjun Yang, Xiao Wang, Qi Zhang, Linda Ruth Petzold, William Yang Wang, Xun Zhao, Dahua Lin · PDF
  56. Shadowcast: Stealthy Data Poisoning Attacks Against Vision-Language Models

    Yuancheng Xu, Jiarui Yao, Manli Shu, Yanchao Sun, Zichu Wu, Ning Yu, Tom Goldstein, Furong Huang · PDF
  57. Simple Permutations Can Fool LLaMA: Permutation Attack and Defense for Large Language Models

    Liang Chen, Yatao Bian, Li Shen, Kam-Fai Wong · PDF
  58. Single-pass detection of jailbreaking input in large language models

    Leyla Naz Candogan, Yongtao Wu, Elias Abad Rocamora, Grigorios Chrysos, Volkan Cevher · PDF
  59. Source-Aware Training Enables Knowledge Attribution in Language Models

    Muhammad Khalifa, David Wadden, Emma Strubell, Honglak Lee, Lu Wang, Iz Beltagy, Hao Peng · PDF
  60. Steering LLMs Towards Unbiased Responses: A Causality-Guided Debiasing Framework

    · PDF
  61. Tailoring Self-Rationalizers with Multi-Reward Distillation

    Sahana Ramnath, Brihi Joshi, Skyler Hallinan, Ximing Lu, Liunian Harold Li, Aaron Chan, Jack Hessel, Yejin Choi, Xiang Ren · PDF
  62. The Effect of Model Size on LLM Post-hoc Explainability via LIME

    Henning Heyen, Amy Widdicombe, Noah Yamamoto Siegel, Philip Colin Treleaven, Maria Perez-Ortiz · PDF
  63. TOFU: A Task of Fictitious Unlearning for LLMs

    Pratyush Maini, Zhili Feng, Avi Schwarzschild, Zachary Chase Lipton, J Zico Kolter · PDF
  64. Toward Robust Unlearning for LLMs

    · PDF
  65. Towards Principled Evaluations of Sparse Autoencoders for Interpretability and Control

    Aleksandar Makelov, Georg Lange, Neel Nanda · PDF
  66. TrustScore: Reference-Free Evaluation of LLM Response Trustworthiness

    Danna Zheng, Danyang Liu, Mirella Lapata, Jeff Z. Pan · PDF
  67. Walk the Talk? Measuring the Faithfulness of Large Language Model Explanations

    · PDF
  68. Watermark Stealing in Large Language Models

    Nikola Jovanović, Robin Staab, Martin Vechev · PDF
  69. Watermarks in the Sand: Impossibility of Strong Watermarking for Generative Models

    Hanlin Zhang, Benjamin L. Edelman, Danilo Francati, Daniele Venturi, Giuseppe Ateniese, Boaz Barak · PDF
  70. WatME: Towards Lossless Watermarking Through Lexical Redundancy

    Liang Chen, Yatao Bian, Yang Deng, Deng Cai, Shuaiyi Li, Peilin Zhao, Kam-Fai Wong · PDF
  71. What's in Your "Safe" Data?: Identifying Benign Data that Breaks Safety

    Luxi He, Mengzhou Xia, Peter Henderson · PDF
  72. WinoViz: Probing Visual Properties of Objects Under Different States

    Woojeong Jin, Tejas Srinivasan, Jesse Thomason, Xiang Ren · PDF