ICLR 2025 · Past · Large language models

ICLR 2025 Workshop on Building Trust in Language Models and Applications

BuildingTrust

Submission deadline: Feb 14, 2025, 11:59 UTC
Imported from OpenReview; check the website for extensions.
Submission portal: OpenReview

Accepted papers (97)

Fetched from OpenReview (v2) on 2026-06-10.

A False Sense of Privacy: Evaluating Textual Data Sanitization Beyond Surface-level Privacy Leakage
Rui Xin, Niloofar Mireshghallah, Shuyue Stella Li, Michael Duan, Hyunwoo Kim, Yejin Choi, Yulia Tsvetkov, Sewoong Oh, Pang Wei Koh · PDF
A Flexible Large Language Models Guardrail Development Methodology Applied to Off-Topic Prompt Detection
Gabriel Chua, Chan Shing Yee, Shaun Khoo · PDF
A Generative Approach to LLM Harmfulness Detection with Red Flag Tokens
Sophie Xhonneux, David Dobre, Mehrnaz Mofakhami, Leo Schwinn, Gauthier Gidel · PDF
A Missing Testbed for LLM Pre-Training Membership Inference Attacks
Mingjian Jiang, Ken Ziyu Liu, Sanmi Koyejo · PDF
Adaptive Test-Time Intervention for Concept Bottleneck Models
Matthew Shen, Aliyah R. Hsu, Abhineet Agarwal, Bin Yu · PDF
AdvBDGen: A Robust Framework for Generating Adaptive and Stealthy Backdoors in LLM Alignment Attacks
Pankayaraj Pathmanathan, Udari Madhushani Sehwag, Michael-Andrei Panaitescu-Liess, Furong Huang · PDF
AegisLLM: Scaling Agentic Systems for Self-Reflective Defense in LLM Security
Zikui Cai, Shayan Shabihi, Bang An, Zora Che, Brian R. Bartoldson, Bhavya Kailkhura, Tom Goldstein, Furong Huang · PDF
AI Companions Are Not The Solution To Loneliness: Design Choices And Their Drawbacks
Jonas B Raedler, Siddharth Swaroop, Weiwei Pan · PDF
An Empirical Study on Prompt Compression for Large Language Models
Zhang Zheng, Jinyi Li, Yihuai Lan, Xiang Wang, Hao Wang · PDF
Analyzing Memorization in Large Language Models through the Lens of Model Attribution
Tarun Ram Menta, Susmit Agrawal, Chirag Agarwal · PDF
AntifakePrompt: Prompt-Tuned Vision-Language Models are Fake Image Detectors
You-Ming Chang, Chen Yeh, Wei-Chen Chiu, Ning Yu · PDF
Antipodal Pairing and Mechanistic Signals in Dense SAE Latents
Alessandro Stolfo, Ben Peng Wu, Mrinmaya Sachan · PDF
ASIDE: Architectural Separation of Instructions and Data in Language Models
Egor Zverev, Evgenii Kortukov, Alexander Panfilov, Soroush Tabesh, Sebastian Lapuschkin, Wojciech Samek, Christoph H. Lampert · PDF
Automated Capability Discovery via Model Self-Exploration
Cong Lu, Shengran Hu, Jeff Clune · PDF
Automated Feature Labeling with Token-Space Gradient Descent
Julian Schulz, Seamus Fallows · PDF
Automated Red Teaming with GOAT: the Generative Offensive Agent Tester
Maya Pavlova, Erik Brinkman, Krithika Iyer, Vítor Albiero, Joanna Bitton, Hailey Nguyen, Cristian Canton Ferrer, Ivan Evtimov, Aaron Grattafiori · PDF
BaxBench: Can LLMs Generate Correct and Secure Backends?
Mark Vero, Niels Mündler, Victor Chibotaru, Veselin Raychev, Maximilian Baader, Nikola Jovanović, Jingxuan He, Martin Vechev · PDF
Black-Box Adversarial Attacks on LLM-Based Code Completion
Slobodan Jenko, Niels Mündler, Jingxuan He, Mark Vero, Martin Vechev · PDF
Boosting Adversarial Robustness of Vision-Language Pre-training Models against Multimodal Adversarial Attacks
Youze Wang, Wenbo Hu, Qin Li, Richang Hong · PDF
Building Bridges, Not Walls: Advancing Interpretability by Unifying Feature, Data, and Model Component Attribution
Shichang Zhang, Tessa Han, Usha Bhalla, Himabindu Lakkaraju · PDF
Can Knowledge Graphs Make Large Language Models More Trustworthy? An Empirical Study Over Open-ended Question Answering
Yuan Sui, Yufei He, Zifeng Ding, Bryan Hooi · PDF
CleanGen: Mitigating Backdoor Attacks for Generation Tasks in Large Language Models
Yuetai Li, Zhangchen Xu, Fengqing Jiang, Luyao Niu, Dinuka Sahabandu, Bhaskar Ramasubramanian, Radha Poovendran · PDF
Conformal Structured Prediction
Botong Zhang, Shuo Li, Osbert Bastani · PDF
Diagnostic Uncertainty: Teaching Language Models to Describe Open-Ended Uncertainty
Brian Sui, Jessy Lin, Michelle Li, Anca Dragan, Dan Klein, Jacob Steinhardt · PDF
Disentangling Linguistic Features with Dimension-Wise Analysis of Vector Embeddings
Saniya Karwa, Navpreet Singh · PDF
Disentangling Sequence Memorization and General Capability in Large Language Models
Gaurav Rohit Ghosal, Pratyush Maini, Aditi Raghunathan · PDF
Do Multilingual LLMs Think In English?
Lisa Schut, Yarin Gal, Sebastian Farquhar · PDF
Dynaseal: A Backend-Controlled LLM API Key Distribution Scheme with Constrained Invocation Parameters
Jiahao Zhao, Fan Wu, 南佳怡, 魏来, Yang YiChen · PDF
Endive: A Cross-Dialect Benchmark for Fairness and Performance in Large Language Models
Abhay Gupta, Jacob Cheung, Philip Meng, Shayan Sayyed, Austen Liao, Kevin Zhu, Sean O'Brien · PDF
Enhancing LLM Robustness to Perturbed Instructions: An Empirical Study
Aryan Agrawal, Lisa Alazraki, Shahin Honarvar, Marek Rei · PDF
Evaluating Text Humanlikeness via Self-Similarity Exponent
Ilya Pershin · PDF
Evaluation of Large Language Models via Coupled Token Generation
Nina L. Corvelo Benz, Stratis Tsirtsis, Eleni Straitouri, Ivi Chatzi, Ander Artola Velasco, Suhas Thejaswi, Manuel Gomez Rodriguez · PDF
ExpProof: Operationalizing Explanations for Confidential Models with ZKPs
Chhavi Yadav, Evan Laufer, Dan Boneh, Kamalika Chaudhuri · PDF
Fast Proxies for LLM Robustness Evaluation
Tim Beyer, Jan Schuchardt, Leo Schwinn, Stephan Günnemann · PDF
FiDeLiS: Faithful Reasoning in Large Language Models for Knowledge Graph Question Answering
Yuan Sui, Yufei He, Nian Liu, Xiaoxin He, Kun Wang, Bryan Hooi · PDF
Finding Sparse Autoencoder Representations Of Errors In CoT Prompting
Justin Theodorus, V Swaytha, Shivani Gautam, Adam Ward, Mahir Shah, Cole Blondin, Kevin Zhu · PDF
GASP: Efficient Black-Box Generation of Adversarial Suffixes for Jailbreaking LLMs
Advik Raj Basani, Xiao Zhang · PDF
HaluEval-Wild: Evaluating Hallucinations of Language Models in the Wild
Zhiying Zhu, Yiming Yang, Zhiqing Sun · PDF
Has My System Prompt Been Used? Large Language Model Prompt Membership Inference
Roman Levin, Valeriia Cherepanova, Abhimanyu Hans, Avi Schwarzschild, Tom Goldstein · PDF
Hidden No More: Attacking and Defending Private Third-Party LLM Inference
Arka Pal, Rahul Krishna Thomas, Louai Zahran, Erica Choi, Akilesh Potti, Micah Goldblum · PDF
How Does Entropy Influence Modern Text-to-SQL Systems?
Varun Kausika, Chris Lazar, Satya Saurabh Mishra, Saurabh Jha, Priyanka Pathak · PDF
In-Context Meta Learning Induces Multi-Phase Circuit Emergence
Gouki Minegishi, Hiroki Furuta, Shohei Taniguchi, Yusuke Iwasawa, Yutaka Matsuo · PDF
Interpretable Steering of Large Language Models with Feature Guided Activation Additions
Samuel Soo, Wesley Teng, Chandrasekaran Balaganesh, Tan Guoxian, Ming YAN · PDF
Justified Trust in AI Fairness Assessment using Existing Metadata Entities
Alpay Sabuncuoglu, Carsten Maple · PDF
Language Models Use Trigonometry to Do Addition
Subhash Kantamneni, Max Tegmark · PDF
Latent Adversarial Training Improves the Representation of Refusal
Alexandra Abbas, Nora Petrova, Hélios Lyons, Natalia Perez-Campanero · PDF
Learning Automata from Demonstrations, Examples, and Natural Language
Marcell Vazquez-Chanlatte, Karim Elmaaroufi, Stefan Witwicki, Matei Zaharia, Sanjit A. Seshia · PDF
LLM Neurosurgeon: Targeted Knowledge Removal in LLMs using Sparse Autoencoders
Kunal Patil, Dylan Zhou, Yifan Sun, Karthik Lakshmanan, Senthooran Rajamanoharan, Arthur Conmy · PDF
LLMs Lost in Translation: M-ALERT Uncovers Cross-Linguistic Safety Gaps
Felix Friedrich, Simone Tedeschi, Patrick Schramowski, Manuel Brack, Roberto Navigli, Huu Nguyen, Bo Li, Kristian Kersting · PDF
LM Agents May Fail to Act on Their Own Risk Knowledge
Yuzhi Tang, Tianxiao Li, Elizabeth Li, Chris J. Maddison, Honghua Dong, Yangjun Ruan · PDF
MALIBU Benchmark: Multi-Agent LLM Implicit Bias Uncovered
Ishwara Vasista, Imran Mirza, Cole Huang, Rohan Rajasekhara Patil, Aslihan Akalin, Kevin Zhu, Sean O'Brien · PDF
Maybe I Should Not Answer That, but... Do LLMs Understand The Safety of Their Inputs?
Maciej Chrabaszcz, Filip Szatkowski, Bartosz Wójcik, Jan Dubiński, Tomasz Trzcinski · PDF
Measuring In-Context Computation Complexity via Hidden State Prediction
Vincent Herrmann, Róbert Csordás, Jürgen Schmidhuber · PDF
Mechanistic Anomaly Detection for "Quirky" Language Models
David O. Johnston, Arkajyoti Chakraborty, Nora Belrose · PDF
MFC-Bench: Benchmarking Multimodal Fact-Checking with Large Vision-Language Models
Shengkang Wang, Hongzhan Lin, Ziyang Luo, Zhen Ye, Guang Chen, Jing Ma · PDF
Mind the Gap: A Practical Attack on GGUF Quantization
Kazuki Egashira, Robin Staab, Mark Vero, Jingxuan He, Martin Vechev · PDF
MKA: Leveraging Cross-Lingual Consensus for Model Abstention
Sharad Duwal · PDF
Model Evaluations Need Rigorous and Transparent Human Baselines
Kevin Wei, Patricia Paskov, Sunishchal Dev, Michael J Byun, Anka Reuel, Xavier Roberts-Gaal, Rachel Calcott, Evie Coxon, Chinmay Deshpande · PDF
Monitoring LLM Agents for Sequentially Contextual Harm
Chen Yueh-Han, Nitish Joshi, Yulin Chen, He He, Rico Angell · PDF
No, Of Course I Can! Refusal Mechanisms Can Be Exploited Using Harmless Data
Joshua Kazdan, Lisa Yu, Rylan Schaeffer, Chris Cundy, Sanmi Koyejo, Krishnamurthy Dj Dvijotham · PDF
On-Premises LLM Deployment Demands a Middle Path: Preserving Privacy Without Sacrificing Model Confidentiality
Hanbo Huang, Yihan Li, Bowen Jiang, Lin Liu, Bo Jiang, Ruoyu Sun, Zhuotao Liu, Shiyu Liang · PDF
Patterns and Mechanisms of Contrastive Activation Engineering
Yixiong Hao, Ayush Panda, Stepan Shabalin, Sheikh Abdur Raheem Ali · PDF
Private Retrieval Augmented Generation with Random Projection
Dixi Yao, Tian Li · PDF
Privately Learning from Graphs with Applications in Fine-tuning Large Pretrained Models
Haoteng Yin, Rongzhe Wei, Eli Chien, Pan Li · PDF
Prune 'n Predict: Optimizing LLM Decision-making with Conformal Prediction
Harit Vishwakarma, Thomas Cook, Alan Mishler, Niccolo Dalmasso, Natraj Raman, Sumitra Ganesh · PDF
Pruning as a Defense: Reducing Memorization in Large Language Models
Mansi Gupta, Nikhar Waghela, Sarthak Gupta, Shourya Goel, Sanjif Shanmugavelu · PDF
Red Teaming for Trust: Evaluating Multicultural and Multilingual AI Systems in Asia-Pacific
Akash Kundu, Adrianna Tan, Theodora Skeadas, Rumman Chowdhury, Sarah Amos · PDF
Reliable and Efficient Amortized Model-based Evaluation
Sang T. Truong, Yuheng Tu, Percy Liang, Bo Li, Sanmi Koyejo · PDF
Rethinking Hallucinations: Correctness, Consistency, and Prompt Multiplicity
Prakhar Ganesh, Reza Shokri, Golnoosh Farnadi · PDF
Rethinking LLM Bias Probing Using Lessons from the Social Sciences
Kirsten Morehouse, Siddharth Swaroop, Weiwei Pan · PDF
SafeMERGE: Preserving Safety Alignment in Fine-Tuned Large Language Models via Selective Layer-Wise Model Merging
Aladin Djuhera, Swanand Kadhe, Farhan Ahmed, Syed Zawad, Holger Boche · PDF
Scalable Fingerprinting of Large Language Models
Anshul Nasery, Jonathan Hayase, Creston Brooks, Peiyao Sheng, Himanshu Tyagi, Pramod Viswanath, Sewoong Oh · PDF
Self-Ablating Transformers: More Interpretability, Less Sparsity
Jeremias Lino Ferrao, Luhan Mikaelson, Keenan Pepper, Natalia Perez-Campanero · PDF
Siege: Multi-Turn Jailbreaking of Large Language Models with Tree Search
Andy Zhou, Ron Arel · PDF
SPEX: Scaling Feature Interaction Explanations for LLMs
Justin Singh Kang, Landon Butler, Abhineet Agarwal, Yigit Efe Erginbas, Ramtin Pedarsani, Bin Yu, Kannan Ramchandran · PDF
Steering Fine-Tuning Generalization with Targeted Concept Ablation
Helena Casademunt, Caden Juang, Samuel Marks, Senthooran Rajamanoharan, Neel Nanda · PDF
StochasTok: Improving Fine-Grained Subword Understanding in LLMs
Anya Sims, Cong Lu, Klara Kaleb, Jakob Nicolaus Foerster, Yee Whye Teh · PDF
Systematic Evaluation of LLM-as-a-Judge in LLM Alignment Tasks: Explainable Metrics and Diverse Prompt Templates
Hui Wei, Shenghua He, Tian Xia, Fei Liu, Andy Wong, Jingyang Lin, Mei Han · PDF
Temporally Sparse Attack for Fooling Large Language Models in Time Series Forecasting
Fuqiang Liu, Sicong Jiang · PDF
The Differences Between Direct Alignment Algorithms are a Blur
Alexey Gorbatovski, Boris Shaposhnikov, Viacheslav Sinii, Alexey Malakhov, Daniil Gavrilov · PDF
The Fundamental Limits of LLM Unlearning: Complexity-Theoretic Barriers and Provably Optimal Protocols
Aviral Srivastava · PDF
The Jailbreak Tax: How Useful are Your Jailbreak Outputs?
Kristina Nikolić, Luze Sun, Jie Zhang, Florian Tramèr · PDF
The Steganographic Potentials of Language Models
Artem Karpov, Tinuade Adeleke, Seong Hah Cho, Natalia Perez-Campanero · PDF
Token-Level Adversarial Prompt Detection Based on Perplexity Measures and Contextual Information
Zhengmian Hu, Gang Wu, Saayan Mitra, Ruiyi Zhang, Tong Sun, Heng Huang, Viswanathan Swaminathan · PDF
ToolScan: A Benchmark For Characterizing Errors In Tool-Use LLMs
Shirley Kokane, Ming Zhu, Tulika Manoj Awalgaonkar, Jianguo Zhang, Akshara Prabhakar, Thai Quoc Hoang, Zuxin Liu, Rithesh R N, Liangwei Yang, Weiran Yao, Juntao Tan, Zhiwei Liu, Huan Wang, Juan Carlos Niebles, Shelby Heinecke, Caiming Xiong, Silvio Savarese · PDF
Top of the CLASS: Benchmarking LLM Agents on Real-World Enterprise Tasks
Michael Wornow, Vaishnav Garodia, Vasilis Vassalos, Utkarsh Contractor · PDF
Towards Effective Discrimination Testing for Generative AI
Thomas P Zollo, Nikita Rajaneesh, Richard Zemel, Talia B. Gillis, Emily Black · PDF
Towards Understanding Distilled Reasoning Models: A Representational Approach
David D. Baek, Max Tegmark · PDF
Towards Understanding Fine-Tuning Mechanisms of LLMs via Circuit Analysis
Xu Wang, Yan Hu, Wenyu Du, Reynold Cheng, Benyou Wang, Difan Zou · PDF
Understanding (Un)Reliability of Steering Vectors in Language Models
Joschka Braun, Carsten Eickhoff, David Krueger, Seyed Ali Bahrainian, Dmitrii Krasheninnikov · PDF
Unlearning Geo-Cultural Stereotypes in Multilingual LLMs
Alireza Dehghanpour Farashah, Aditi Khandelwal, Negar Rostamzadeh, Golnoosh Farnadi · PDF
Unlocking Hierarchical Concept Discovery in Language Models Through Geometric Regularization
Ed Li, Junyu Ren · PDF
Unnatural Languages Are Not Bugs but Features for LLMs
Keyu Duan, Yiran Zhao, Zhili Feng, Jinjie Ni, Tianyu Pang, Qian Liu, Tianle Cai, Longxu Dou, Kenji Kawaguchi, Anirudh Goyal, J Zico Kolter, Michael Qizhe Shieh · PDF
VideoJail: Exploiting Video-Modality Vulnerabilities for Jailbreak Attacks on Multimodal Large Language Models
Wenbo Hu, Shishen Gu, Youze Wang, Richang Hong · PDF
Why Are Web AI Agents More Vulnerable Than Standalone LLMs? A Security Analysis
Jeffrey Yang Fan Chiang, Seungjae Lee, Jia-Bin Huang, Furong Huang, Yizheng Chen · PDF
Why Do Multiagent Systems Fail?
Melissa Z Pan, Mert Cemri, Lakshya A Agrawal, Shuyi Yang, Bhavya Chopra, Rishabh Tiwari, Kurt Keutzer, Aditya Parameswaran, Kannan Ramchandran, Dan Klein, Joseph E. Gonzalez, Matei Zaharia, Ion Stoica · PDF
Working Memory Attack on LLMs
Bibek Upadhayay, Vahid Behzadan, Amin Karbasi · PDF
