Lock-LLM Workshop: Prevent Unauthorized Knowledge Use from Large Language Models

NeurIPS Lock-LLM Workshop 2025

Submission deadline: Sep 18, 2025, 23:59 UTC (imported from OpenReview; check the website for extensions)
Submission portal: OpenReview

Accepted papers (57)

Fetched from OpenReview (v2) on 2026-06-10.

A Granular Study of Safety Pretraining under Model Abliteration
Shashank Agnihotri, Jonas Jakubassa, Priyam Dey, Sachin Goyal, Bernt Schiele, Venkatesh Babu Radhakrishnan, Margret Keuper · PDF
AlignDP: Hybrid Differential Privacy with Rarity-Aware Protection for LLMs
Madhava Gaikwad · PDF
ARIAL: An Agentic Framework for Document VQA with Precise Answer Localization
Ahmad Mohammadshirazi, Pinaki Prasad Guha Neogi, Dheeraj Kulshrestha, Rajiv Ramnath · PDF
AutoAdv: Automated Adversarial Prompting for Multi-Turn Jailbreaking of Large Language Models
Andrew Zagula, Aashray Reddy, Nicholas Saban · PDF
Breaking Bad: Interpretability-Based Safety Audits of State-of-the-Art LLMs
Krishiv Agarwal, Ramneet Kaur, Colin Samplawski, Manoj Acharya, Anirban Roy, Daniel Elenius, Brian Matejek, Adam D. Cobb, Susmit Jha · PDF
Breaking Distortion-free Watermarks in Large Language Models
Shayleen Reynolds, Hengzhi He, Dung Daniel Ngo, Saheed Obitayo, Niccolo Dalmasso, Guang Cheng, Vamsi K. Potluru, Manuela Veloso · PDF
Can Editing LLMs Inject Harm?
Canyu Chen, Baixiang Huang, Zekun Li, Zhaorun Chen, Shiyang Lai, Xiongxiao Xu, Jia-Chen Gu, Jindong Gu, Huaxiu Yao, Chaowei Xiao, Xifeng Yan, William Yang Wang, Philip Torr, Dawn Song, Kai Shu · PDF
Collapse of Irrelevant Representations (CIR) Ensures Robust and Non-Disruptive LLM Unlearning
Filip Sondej, Yushi Yang · PDF
Compressed but Compromised? A Study of Jailbreaking in Compressed LLMs
Satya Sai Srinath Namburi GNVV, Alex James Boyd, Andrew Warrington · PDF
Context-Masked Meta-Prompting for Privacy-Preserving LLM Adaptation in Finance
Sayash Raaj Hiraou · PDF
Cross-Modal Attention Guided Unlearning in Vision-Language Models
Karuna Bhaila, Aneesh Komanduri, Minh-Hao Van, Xintao Wu · PDF
Cryptographic Fingerprinting for Medical AI: A Proof-of-Concept Approach to Protecting Healthcare ML Models from API Extraction
Saaketh Bhojanam, Sohum Mehta · PDF
Differentially Private In-Context Learning with Nearest Neighbor Search
Antti Koskela, Tejas Kulkarni, Laith Yousef Zumot · PDF
DistilLock: Safeguarding LLMs from Unauthorized Knowledge Distillation on the Edge
Asmita Mohanty, Gezheng Kang, Lei Gao, Murali Annavaram · PDF
Diverse and Efficient Red-Teaming for LLM Agents via Distilled Structured Reasoning
Kaiwen Zhou, Ahmed Elgohary, A S M Iftekhar · PDF
Does Machine Unlearning Truly Remove Knowledge?
Haokun Chen, Yueqi Zhang, Yuan Bi, Yao Zhang, Tong Liu, Jinhe Bi, Jian Lan, Claudia Grosser, Denis Krompaß, Jindong Gu, Nassir Navab, Volker Tresp · PDF
DOGe: Defensive Output Generation for LLM Protection Against Knowledge Distillation
Pingzhi Li, Zhen Tan, Yu-Chao Huang, Huaizhi Qu, Huan Liu, Tianlong Chen · PDF
Economic Confidentiality without Secrets: Making Intercepted LLM-Agent Communications Unusable
Bolaji Makinde · PDF
Editing as Unlearning: Are Knowledge Editing Methods Strong Baselines for Large Language Model Unlearning?
Zexi Li, Xiangzhu Wang, William F. Shen, Meghdad Kurmanji, Xinchi Qiu, Dongqi Cai, Chao Wu, Nicholas D. Lane · PDF
Evaluating and Mitigating Contextual Vulnerabilities in LLMs: An Architectural Approach to Resisting Multi-Turn Jailbreaks
Adarsh Kumarappan, Ananya Mujoo · PDF
Evaluating Privacy Leakage From In-Context Learning
Hongyi Li, James Flemings, YoungJune, Murali Annavaram · PDF
Exploiting the Experts: Unauthorized Compression in MoE-LLMs
Pinaki Prasad Guha Neogi, Ahmad Mohammadshirazi, Dheeraj Kulshrestha, Rajiv Ramnath · PDF
How to Make LLMs Safer? Detecting and Editing Key Heads in LLMs
Kuan-Lin Chu, Chung-En Sun, Tsui-Wei Weng · PDF
Indirect Prompt Injections: Are Firewalls All You Need, or Stronger Benchmarks?
Rishika Bhagwatkar, Kevin Kasa, Abhay Puri, Gabriel Huang, Irina Rish, Graham W. Taylor, Krishnamurthy Dj Dvijotham, Alexandre Lacoste · PDF
Jailbreak Distillation: Renewable Safety Benchmarking
Jingyu Zhang, Ahmed Elgohary, Xiawei Wang, A S M Iftekhar, Ahmed Magooda, Benjamin Van Durme, Daniel Khashabi, Kyle Jackson · PDF
Key-Conditioned Orthonormal Transform Gating (K-OTG): Multi-Key Access Control with Hidden-State Scrambling for LoRA-Tuned Models
Muhammad Haris Khan · PDF
LLMs can hide text in other text of the same length
Antonio Norelli, Michael M. Bronstein · PDF
LSMAS (LLM Security Modeling via Activation Steering)
Anthony Kuang, Ahmed Ismail, Ayo Akinkugbe, Kevin Zhu, Sean O'Brien · PDF
MacroBench: A Novel Testbed for Web Automation Scripts via Large Language Models
Hyunjun Kim, Sejong Kim · PDF
MarkTune: Advancing the Quality-Detectability Pareto Frontier of Open-Weight LM Watermarking
Yizhou Zhao, Steven Wu, Adam Block · PDF
MGA-VQA: Secure and Interpretable Graph-Augmented Visual Question Answering with Memory-Guided Protection Against Unauthorized Knowledge Use
Ahmad Mohammadshirazi, Pinaki Prasad Guha Neogi, Dheeraj Kulshrestha, Rajiv Ramnath · PDF
Model Immunization by Trapping Harmful Finetuning
Najibul Haque Sarker, Zaber Ibn Abdul Hakim, Alvi Md Ishmam, Chia-Wei Tang, Chris Thomas · PDF
No Question, No Passage, No Problem: Investigating Artifact Exploitation and Reasoning in Multiple-Choice Reading Comprehension
Anthony Cui, Rohan Raj Butani, Theodore Oltean · PDF
OML: A Primitive for Reconciling Open Access with Owner Control in AI Model Distribution
Zerui Cheng, Edoardo Contente, Benjamin Tsengel Finch, Oleg Aleksandrovich Golev, Jonathan Hayase, Andrew Miller, Niusha Moshrefi, Anshul Nasery, Sewoong Oh, Himanshu Tyagi, Pramod Viswanath · PDF
On the Relationship Between Neural Tangent Kernel Frobenius Distance and Distillation Sample Complexity
Arnav Sharma, Ahmed Wez, Karthik Srikumar · PDF
PASTRAL: Privacy-aware AST and TRansformer-based Anomalous command-Line detection
Xiayan Ji, Ecenaz Erdemir, Kyuhong Park, Bhavna Soman, Yi Fan · PDF
Permissioned LLMs: Enforcing Access Control in Large Language Models
Bargav Jayaraman, Virendra Marathe, Hamid Mozaffari, William F. Shen, Krishnaram Kenthapadi · PDF
Probe-Rewrite-Evaluate: A Workflow for Reliable Benchmarks and Quantifying Evaluation Awareness
Lang Xiong, Nishant Bhargava, Jeremy Chang, Jianhang Hong, Haihao Liu, Vasu Sharma, Kevin Zhu · PDF
Reasoning Models Can be Easily Hacked by Fake Reasoning Bias
Qian Wang, Zhenheng Tang, Nuo Chen, Wenxuan Wang, Bingsheng He · PDF
SafeKey: Amplifying Aha-Moment Insights for Safety Reasoning
Kaiwen Zhou, Xuandong Zhao, Gaowen Liu, Jayanth Srinivasa, Aosong Feng, Dawn Song, Xin Eric Wang · PDF
Safety Subspaces are Not Distinct: A Fine-Tuning Case Study
Kaustubh Ponkshe, Shaan Shah, Raghav Singhal, Praneeth Vepakomma · PDF
Scalable Fingerprinting of Large Language Models
Anshul Nasery, Jonathan Hayase, Creston Brooks, Peiyao Sheng, Himanshu Tyagi, Pramod Viswanath, Sewoong Oh · PDF
SeedPrints: Fingerprints Can Even Tell Which Seed Your Large Language Model Was Trained From
Yao Tong, Haonan Wang, Siquan Li, Kenji Kawaguchi, Tianyang Hu · PDF
Sell Data to AI Algorithms Without Revealing It: Secure Data Valuation and Sharing via Homomorphic Encryption
Michael Yang, Ruijiang Gao, Zhiqiang Zheng · PDF
Text-to-Image Models Leave Identifiable Signatures: Implications for Leaderboard Security
Ali Naseh, Anshuman Suri, Yuefeng Peng, Harsh Chaudhari, Alina Oprea, Amir Houmansadr · PDF
The Safety Gap Toolkit: Evaluating Hidden Dangers of Open-Source Models
Ann-Kathrin Dombrowski, Dillon Bowen, Adam Gleave, Chris Cundy · PDF
Towards Controlled LLM Unlearning
William F. Shen, Xinchi Qiu, Meghdad Kurmanji, Alex Iacob, Lorenzo Sani, Yihong Chen, Nicola Cancedda, Nicholas D. Lane · PDF
Towards Quantization-Adversarial Reparameterizations
Raine Ma · PDF
Towards Realistic Guarantees: A Probabilistic Certificate for SmoothLLM
Adarsh Kumarappan, Ayushi Mehrotra · PDF
Un-Distillable LLMs via Entropy-Perturbed Logits
Mithil Shah, Andrew Bae, Laksh Patel · PDF
Undistillable Open Language Models with Teacher Scrambling
Sebastian Dionicio, Aniq Elahi, Domenic Rosati, Hassan Sajjad · PDF
Unlearners Can Lie: Evaluating “Honesty” in LLM Unlearning
Renjie Gu, Jiazhen Du, Yihua Zhang, Sijia Liu · PDF
User Confidence-Fueled Stereotypes: Investigating Sycophantic Amplification of Implicit Bias in Language Models
Hannah You, Daniel Wang, Victor Chan, Mirabel Wang, Aslihan Akalin, Kevin Zhu · PDF
Who’s Your Judge? On the Detectability of LLM-Generated Judgments
Dawei Li, Zhen Tan, Chengshuai Zhao, Bohan Jiang, Baixiang Huang, Pingchuan Ma, Abdullah Alnaibari, Kai Shu, Huan Liu · PDF
Winter Soldier: Backdooring Language Models at Pre-Training with Indirect Data Poisoning
Wassim Bouaziz, Mathurin Videau, Nicolas Usunier, El-Mahdi El-Mhamdi · PDF
X-Teaming Evolutionary M2S: Automated Discovery of Multi-turn to Single-turn Jailbreak Templates
Hyunjun Kim, Junwoo Ha, Haon Park, Sangyoon Yu · PDF
Zero-Shot Embedding Drift Detection: A Lightweight Defense Against Prompt Injections in LLMs
Arjun Damerla, Anirudh Sekar, Rachel Sharma, Mrinal Agarwal, Jasmine Zhang, Akitsugu Tanaka · PDF
