ICLR 2025 Workshop on Building Trust in Language Models and Applications

BuildingTrust

Submission deadline
Feb 14, 2025, 11:59 UTC
imported from OpenReview — check the website for extensions
Submission portal
OpenReview
Notes
Auto-imported from the OpenReview venue record on 2026-06-10 — please verify and enrich (topics are keyword-guessed).

Accepted papers (97)

Fetched from OpenReview (v2) on 2026-06-10.

  1. A False Sense of Privacy: Evaluating Textual Data Sanitization Beyond Surface-level Privacy Leakage

    Rui Xin, Niloofar Mireshghallah, Shuyue Stella Li, Michael Duan, Hyunwoo Kim, Yejin Choi, Yulia Tsvetkov, Sewoong Oh, Pang Wei Koh · PDF
  2. A Flexible Large Language Models Guardrail Development Methodology Applied to Off-Topic Prompt Detection

    Gabriel Chua, Chan Shing Yee, Shaun Khoo · PDF
  3. A Generative Approach to LLM Harmfulness Detection with Red Flag Tokens

    Sophie Xhonneux, David Dobre, Mehrnaz Mofakhami, Leo Schwinn, Gauthier Gidel · PDF
  4. A Missing Testbed for LLM Pre-Training Membership Inference Attacks

    Mingjian Jiang, Ken Ziyu Liu, Sanmi Koyejo · PDF
  5. Adaptive Test-Time Intervention for Concept Bottleneck Models

    Matthew Shen, Aliyah R. Hsu, Abhineet Agarwal, Bin Yu · PDF
  6. AdvBDGen: A Robust Framework for Generating Adaptive and Stealthy Backdoors in LLM Alignment Attacks

    Pankayaraj Pathmanathan, Udari Madhushani Sehwag, Michael-Andrei Panaitescu-Liess, Furong Huang · PDF
  7. AegisLLM: Scaling Agentic Systems for Self-Reflective Defense in LLM Security

    Zikui Cai, Shayan Shabihi, Bang An, Zora Che, Brian R. Bartoldson, Bhavya Kailkhura, Tom Goldstein, Furong Huang · PDF
  8. AI Companions Are Not The Solution To Loneliness: Design Choices And Their Drawbacks

    Jonas B Raedler, Siddharth Swaroop, Weiwei Pan · PDF
  9. An Empirical Study on Prompt Compression for Large Language Models

    Zhang Zheng, Jinyi Li, Yihuai Lan, Xiang Wang, Hao Wang · PDF
  10. Analyzing Memorization in Large Language Models through the Lens of Model Attribution

    Tarun Ram Menta, Susmit Agrawal, Chirag Agarwal · PDF
  11. AntifakePrompt: Prompt-Tuned Vision-Language Models are Fake Image Detectors

    You-Ming Chang, Chen Yeh, Wei-Chen Chiu, Ning Yu · PDF
  12. Antipodal Pairing and Mechanistic Signals in Dense SAE Latents

    Alessandro Stolfo, Ben Peng Wu, Mrinmaya Sachan · PDF
  13. ASIDE: Architectural Separation of Instructions and Data in Language Models

    Egor Zverev, Evgenii Kortukov, Alexander Panfilov, Soroush Tabesh, Sebastian Lapuschkin, Wojciech Samek, Christoph H. Lampert · PDF
  14. Automated Capability Discovery via Model Self-Exploration

    Cong Lu, Shengran Hu, Jeff Clune · PDF
  15. Automated Feature Labeling with Token-Space Gradient Descent

    Julian Schulz, Seamus Fallows · PDF
  16. Automated Red Teaming with GOAT: the Generative Offensive Agent Tester

    Maya Pavlova, Erik Brinkman, Krithika Iyer, Vítor Albiero, Joanna Bitton, Hailey Nguyen, Cristian Canton Ferrer, Ivan Evtimov, Aaron Grattafiori · PDF
  17. BaxBench: Can LLMs Generate Correct and Secure Backends?

    Mark Vero, Niels Mündler, Victor Chibotaru, Veselin Raychev, Maximilian Baader, Nikola Jovanović, Jingxuan He, Martin Vechev · PDF
  18. Black-Box Adversarial Attacks on LLM-Based Code Completion

    Slobodan Jenko, Niels Mündler, Jingxuan He, Mark Vero, Martin Vechev · PDF
  19. Boosting Adversarial Robustness of Vision-Language Pre-training Models against Multimodal Adversarial Attacks

    Youze Wang, Wenbo Hu, Qin Li, Richang Hong · PDF
  20. Building Bridges, Not Walls: Advancing Interpretability by Unifying Feature, Data, and Model Component Attribution

    Shichang Zhang, Tessa Han, Usha Bhalla, Himabindu Lakkaraju · PDF
  21. Can Knowledge Graphs Make Large Language Models More Trustworthy? An Empirical Study Over Open-ended Question Answering

    Yuan Sui, Yufei He, Zifeng Ding, Bryan Hooi · PDF
  22. CleanGen: Mitigating Backdoor Attacks for Generation Tasks in Large Language Models

    Yuetai Li, Zhangchen Xu, Fengqing Jiang, Luyao Niu, Dinuka Sahabandu, Bhaskar Ramasubramanian, Radha Poovendran · PDF
  23. Conformal Structured Prediction

    Botong Zhang, Shuo Li, Osbert Bastani · PDF
  24. Diagnostic Uncertainty: Teaching Language Models to Describe Open-Ended Uncertainty

    Brian Sui, Jessy Lin, Michelle Li, Anca Dragan, Dan Klein, Jacob Steinhardt · PDF
  25. Disentangling Linguistic Features with Dimension-Wise Analysis of Vector Embeddings

    Saniya Karwa, Navpreet Singh · PDF
  26. Disentangling Sequence Memorization and General Capability in Large Language Models

    Gaurav Rohit Ghosal, Pratyush Maini, Aditi Raghunathan · PDF
  27. Do Multilingual LLMs Think In English?

    Lisa Schut, Yarin Gal, Sebastian Farquhar · PDF
  28. Dynaseal: A Backend-Controlled LLM API Key Distribution Scheme with Constrained Invocation Parameters

    Jiahao Zhao, Fan Wu, 南佳怡, 魏来, Yang YiChen · PDF
  29. Endive: A Cross-Dialect Benchmark for Fairness and Performance in Large Language Models

    Abhay Gupta, Jacob Cheung, Philip Meng, Shayan Sayyed, Austen Liao, Kevin Zhu, Sean O'Brien · PDF
  30. Enhancing LLM Robustness to Perturbed Instructions: An Empirical Study

    Aryan Agrawal, Lisa Alazraki, Shahin Honarvar, Marek Rei · PDF
  31. Evaluating Text Humanlikeness via Self-Similarity Exponent

    Ilya Pershin · PDF
  32. Evaluation of Large Language Models via Coupled Token Generation

    Nina L. Corvelo Benz, Stratis Tsirtsis, Eleni Straitouri, Ivi Chatzi, Ander Artola Velasco, Suhas Thejaswi, Manuel Gomez Rodriguez · PDF
  33. ExpProof: Operationalizing Explanations for Confidential Models with ZKPs

    Chhavi Yadav, Evan Laufer, Dan Boneh, Kamalika Chaudhuri · PDF
  34. Fast Proxies for LLM Robustness Evaluation

    Tim Beyer, Jan Schuchardt, Leo Schwinn, Stephan Günnemann · PDF
  35. FiDeLiS: Faithful Reasoning in Large Language Models for Knowledge Graph Question Answering

    Yuan Sui, Yufei He, Nian Liu, Xiaoxin He, Kun Wang, Bryan Hooi · PDF
  36. Finding Sparse Autoencoder Representations Of Errors In CoT Prompting

    Justin Theodorus, V Swaytha, Shivani Gautam, Adam Ward, Mahir Shah, Cole Blondin, Kevin Zhu · PDF
  37. GASP: Efficient Black-Box Generation of Adversarial Suffixes for Jailbreaking LLMs

    Advik Raj Basani, Xiao Zhang · PDF
  38. HaluEval-Wild: Evaluating Hallucinations of Language Models in the Wild

    Zhiying Zhu, Yiming Yang, Zhiqing Sun · PDF
  39. Has My System Prompt Been Used? Large Language Model Prompt Membership Inference

    Roman Levin, Valeriia Cherepanova, Abhimanyu Hans, Avi Schwarzschild, Tom Goldstein · PDF
  40. Hidden No More: Attacking and Defending Private Third-Party LLM Inference

    Arka Pal, Rahul Krishna Thomas, Louai Zahran, Erica Choi, Akilesh Potti, Micah Goldblum · PDF
  41. How Does Entropy Influence Modern Text-to-SQL Systems?

    Varun Kausika, Chris Lazar, Satya Saurabh Mishra, Saurabh Jha, Priyanka Pathak · PDF
  42. In-Context Meta Learning Induces Multi-Phase Circuit Emergence

    Gouki Minegishi, Hiroki Furuta, Shohei Taniguchi, Yusuke Iwasawa, Yutaka Matsuo · PDF
  43. Interpretable Steering of Large Language Models with Feature Guided Activation Additions

    Samuel Soo, Wesley Teng, Chandrasekaran Balaganesh, Tan Guoxian, Ming YAN · PDF
  44. Justified Trust in AI Fairness Assessment using Existing Metadata Entities

    Alpay Sabuncuoglu, Carsten Maple · PDF
  45. Language Models Use Trigonometry to Do Addition

    Subhash Kantamneni, Max Tegmark · PDF
  46. Latent Adversarial Training Improves the Representation of Refusal

    Alexandra Abbas, Nora Petrova, Hélios Lyons, Natalia Perez-Campanero · PDF
  47. Learning Automata from Demonstrations, Examples, and Natural Language

    Marcell Vazquez-Chanlatte, Karim Elmaaroufi, Stefan Witwicki, Matei Zaharia, Sanjit A. Seshia · PDF
  48. LLM Neurosurgeon: Targeted Knowledge Removal in LLMs using Sparse Autoencoders

    Kunal Patil, Dylan Zhou, Yifan Sun, Karthik Lakshmanan, Senthooran Rajamanoharan, Arthur Conmy · PDF
  49. LLMs Lost in Translation: M-ALERT Uncovers Cross-Linguistic Safety Gaps

    Felix Friedrich, Simone Tedeschi, Patrick Schramowski, Manuel Brack, Roberto Navigli, Huu Nguyen, Bo Li, Kristian Kersting · PDF
  50. LM Agents May Fail to Act on Their Own Risk Knowledge

    Yuzhi Tang, Tianxiao Li, Elizabeth Li, Chris J. Maddison, Honghua Dong, Yangjun Ruan · PDF
  51. MALIBU Benchmark: Multi-Agent LLM Implicit Bias Uncovered

    Ishwara Vasista, Imran Mirza, Cole Huang, Rohan Rajasekhara Patil, Aslihan Akalin, Kevin Zhu, Sean O'Brien · PDF
  52. Maybe I Should Not Answer That, but... Do LLMs Understand The Safety of Their Inputs?

    Maciej Chrabaszcz, Filip Szatkowski, Bartosz Wójcik, Jan Dubiński, Tomasz Trzcinski · PDF
  53. Measuring In-Context Computation Complexity via Hidden State Prediction

    Vincent Herrmann, Róbert Csordás, Jürgen Schmidhuber · PDF
  54. Mechanistic Anomaly Detection for "Quirky" Language Models

    David O. Johnston, Arkajyoti Chakraborty, Nora Belrose · PDF
  55. MFC-Bench: Benchmarking Multimodal Fact-Checking with Large Vision-Language Models

    Shengkang Wang, Hongzhan Lin, Ziyang Luo, Zhen Ye, Guang Chen, Jing Ma · PDF
  56. Mind the Gap: A Practical Attack on GGUF Quantization

    Kazuki Egashira, Robin Staab, Mark Vero, Jingxuan He, Martin Vechev · PDF
  57. MKA: Leveraging Cross-Lingual Consensus for Model Abstention

    Sharad Duwal · PDF
  58. Model Evaluations Need Rigorous and Transparent Human Baselines

    Kevin Wei, Patricia Paskov, Sunishchal Dev, Michael J Byun, Anka Reuel, Xavier Roberts-Gaal, Rachel Calcott, Evie Coxon, Chinmay Deshpande · PDF
  59. Monitoring LLM Agents for Sequentially Contextual Harm

    Chen Yueh-Han, Nitish Joshi, Yulin Chen, He He, Rico Angell · PDF
  60. No, Of Course I Can! Refusal Mechanisms Can Be Exploited Using Harmless Data

    Joshua Kazdan, Lisa Yu, Rylan Schaeffer, Chris Cundy, Sanmi Koyejo, Krishnamurthy Dj Dvijotham · PDF
  61. On-Premises LLM Deployment Demands a Middle Path: Preserving Privacy Without Sacrificing Model Confidentiality

    Hanbo Huang, Yihan Li, Bowen Jiang, Lin Liu, Bo Jiang, Ruoyu Sun, Zhuotao Liu, Shiyu Liang · PDF
  62. Patterns and Mechanisms of Contrastive Activation Engineering

    Yixiong Hao, Ayush Panda, Stepan Shabalin, Sheikh Abdur Raheem Ali · PDF
  63. Private Retrieval Augmented Generation with Random Projection

    Dixi Yao, Tian Li · PDF
  64. Privately Learning from Graphs with Applications in Fine-tuning Large Pretrained Models

    Haoteng Yin, Rongzhe Wei, Eli Chien, Pan Li · PDF
  65. Prune 'n Predict: Optimizing LLM Decision-making with Conformal Prediction

    Harit Vishwakarma, Thomas Cook, Alan Mishler, Niccolo Dalmasso, Natraj Raman, Sumitra Ganesh · PDF
  66. Pruning as a Defense: Reducing Memorization in Large Language Models

    Mansi Gupta, Nikhar Waghela, Sarthak Gupta, Shourya Goel, Sanjif Shanmugavelu · PDF
  67. Red Teaming for Trust: Evaluating Multicultural and Multilingual AI Systems in Asia-Pacific

    Akash Kundu, Adrianna Tan, Theodora Skeadas, Rumman Chowdhury, Sarah Amos · PDF
  68. Reliable and Efficient Amortized Model-based Evaluation

    Sang T. Truong, Yuheng Tu, Percy Liang, Bo Li, Sanmi Koyejo · PDF
  69. Rethinking Hallucinations: Correctness, Consistency, and Prompt Multiplicity

    Prakhar Ganesh, Reza Shokri, Golnoosh Farnadi · PDF
  70. Rethinking LLM Bias Probing Using Lessons from the Social Sciences

    Kirsten Morehouse, Siddharth Swaroop, Weiwei Pan · PDF
  71. SafeMERGE: Preserving Safety Alignment in Fine-Tuned Large Language Models via Selective Layer-Wise Model Merging

    Aladin Djuhera, Swanand Kadhe, Farhan Ahmed, Syed Zawad, Holger Boche · PDF
  72. Scalable Fingerprinting of Large Language Models

    Anshul Nasery, Jonathan Hayase, Creston Brooks, Peiyao Sheng, Himanshu Tyagi, Pramod Viswanath, Sewoong Oh · PDF
  73. Self-Ablating Transformers: More Interpretability, Less Sparsity

    Jeremias Lino Ferrao, Luhan Mikaelson, Keenan Pepper, Natalia Perez-Campanero · PDF
  74. Siege: Multi-Turn Jailbreaking of Large Language Models with Tree Search

    Andy Zhou, Ron Arel · PDF
  75. SPEX: Scaling Feature Interaction Explanations for LLMs

    Justin Singh Kang, Landon Butler, Abhineet Agarwal, Yigit Efe Erginbas, Ramtin Pedarsani, Bin Yu, Kannan Ramchandran · PDF
  76. Steering Fine-Tuning Generalization with Targeted Concept Ablation

    Helena Casademunt, Caden Juang, Samuel Marks, Senthooran Rajamanoharan, Neel Nanda · PDF
  77. StochasTok: Improving Fine-Grained Subword Understanding in LLMs

    Anya Sims, Cong Lu, Klara Kaleb, Jakob Nicolaus Foerster, Yee Whye Teh · PDF
  78. Systematic Evaluation of LLM-as-a-Judge in LLM Alignment Tasks: Explainable Metrics and Diverse Prompt Templates

    Hui Wei, Shenghua He, Tian Xia, Fei Liu, Andy Wong, Jingyang Lin, Mei Han · PDF
  79. Temporally Sparse Attack for Fooling Large Language Models in Time Series Forecasting

    Fuqiang Liu, Sicong Jiang · PDF
  80. The Differences Between Direct Alignment Algorithms are a Blur

    Alexey Gorbatovski, Boris Shaposhnikov, Viacheslav Sinii, Alexey Malakhov, Daniil Gavrilov · PDF
  81. The Fundamental Limits of LLM Unlearning: Complexity-Theoretic Barriers and Provably Optimal Protocols

    Aviral Srivastava · PDF
  82. The Jailbreak Tax: How Useful are Your Jailbreak Outputs?

    Kristina Nikolić, Luze Sun, Jie Zhang, Florian Tramèr · PDF
  83. The Steganographic Potentials of Language Models

    Artem Karpov, Tinuade Adeleke, Seong Hah Cho, Natalia Perez-Campanero · PDF
  84. Token-Level Adversarial Prompt Detection Based on Perplexity Measures and Contextual Information

    Zhengmian Hu, Gang Wu, Saayan Mitra, Ruiyi Zhang, Tong Sun, Heng Huang, Viswanathan Swaminathan · PDF
  85. ToolScan: A Benchmark For Characterizing Errors In Tool-Use LLMs

    Shirley Kokane, Ming Zhu, Tulika Manoj Awalgaonkar, Jianguo Zhang, Akshara Prabhakar, Thai Quoc Hoang, Zuxin Liu, Rithesh R N, Liangwei Yang, Weiran Yao, Juntao Tan, Zhiwei Liu, Huan Wang, Juan Carlos Niebles, Shelby Heinecke, Caiming Xiong, Silvio Savarese · PDF
  86. Top of the CLASS: Benchmarking LLM Agents on Real-World Enterprise Tasks

    Michael Wornow, Vaishnav Garodia, Vasilis Vassalos, Utkarsh Contractor · PDF
  87. Towards Effective Discrimination Testing for Generative AI

    Thomas P Zollo, Nikita Rajaneesh, Richard Zemel, Talia B. Gillis, Emily Black · PDF
  88. Towards Understanding Distilled Reasoning Models: A Representational Approach

    David D. Baek, Max Tegmark · PDF
  89. Towards Understanding Fine-Tuning Mechanisms of LLMs via Circuit Analysis

    Xu Wang, Yan Hu, Wenyu Du, Reynold Cheng, Benyou Wang, Difan Zou · PDF
  90. Understanding (Un)Reliability of Steering Vectors in Language Models

    Joschka Braun, Carsten Eickhoff, David Krueger, Seyed Ali Bahrainian, Dmitrii Krasheninnikov · PDF
  91. Unlearning Geo-Cultural Stereotypes in Multilingual LLMs

    Alireza Dehghanpour Farashah, Aditi Khandelwal, Negar Rostamzadeh, Golnoosh Farnadi · PDF
  92. Unlocking Hierarchical Concept Discovery in Language Models Through Geometric Regularization

    Ed Li, Junyu Ren · PDF
  93. Unnatural Languages Are Not Bugs but Features for LLMs

    Keyu Duan, Yiran Zhao, Zhili Feng, Jinjie Ni, Tianyu Pang, Qian Liu, Tianle Cai, Longxu Dou, Kenji Kawaguchi, Anirudh Goyal, J Zico Kolter, Michael Qizhe Shieh · PDF
  94. VideoJail: Exploiting Video-Modality Vulnerabilities for Jailbreak Attacks on Multimodal Large Language Models

    Wenbo Hu, Shishen Gu, Youze Wang, Richang Hong · PDF
  95. Why Are Web AI Agents More Vulnerable Than Standalone LLMs? A Security Analysis

    Jeffrey Yang Fan Chiang, Seungjae Lee, Jia-Bin Huang, Furong Huang, Yizheng Chen · PDF
  96. Why Do Multiagent Systems Fail?

    Melissa Z Pan, Mert Cemri, Lakshya A Agrawal, Shuyi Yang, Bhavya Chopra, Rishabh Tiwari, Kurt Keutzer, Aditya Parameswaran, Kannan Ramchandran, Dan Klein, Joseph E. Gonzalez, Matei Zaharia, Ion Stoica · PDF
  97. Working Memory Attack on LLMs

    Bibek Upadhayay, Vahid Behzadan, Amin Karbasi · PDF