ICLR 2025 Workshop on Building Trust in Language Models and Applications
BuildingTrust
- Submission deadline: Feb 14, 2025, 11:59 UTC (imported from OpenReview; check the website for extensions)
- Submission portal: OpenReview
- Notes: Auto-imported from the OpenReview venue record on 2026-06-10; please verify and enrich (topics are keyword-guessed).
Accepted papers (97)
Fetched from OpenReview (v2) on 2026-06-10.
- A False Sense of Privacy: Evaluating Textual Data Sanitization Beyond Surface-level Privacy Leakage
- A Flexible Large Language Models Guardrail Development Methodology Applied to Off-Topic Prompt Detection
- A Generative Approach to LLM Harmfulness Detection with Red Flag Tokens
- A Missing Testbed for LLM Pre-Training Membership Inference Attacks
- Adaptive Test-Time Intervention for Concept Bottleneck Models
- AdvBDGen: A Robust Framework for Generating Adaptive and Stealthy Backdoors in LLM Alignment Attacks
- AegisLLM: Scaling Agentic Systems for Self-Reflective Defense in LLM Security
- AI Companions Are Not The Solution To Loneliness: Design Choices And Their Drawbacks
- An Empirical Study on Prompt Compression for Large Language Models
- Analyzing Memorization in Large Language Models through the Lens of Model Attribution
- AntifakePrompt: Prompt-Tuned Vision-Language Models are Fake Image Detectors
- Antipodal Pairing and Mechanistic Signals in Dense SAE Latents
- ASIDE: Architectural Separation of Instructions and Data in Language Models
- Automated Capability Discovery via Model Self-Exploration
- Automated Feature Labeling with Token-Space Gradient Descent
- Automated Red Teaming with GOAT: the Generative Offensive Agent Tester
- BaxBench: Can LLMs Generate Correct and Secure Backends?
- Black-Box Adversarial Attacks on LLM-Based Code Completion
- Boosting Adversarial Robustness of Vision-Language Pre-training Models against Multimodal Adversarial attacks
- Building Bridges, Not Walls: Advancing Interpretability by Unifying Feature, Data, and Model Component Attribution
- Can Knowledge Graphs Make Large Language Models More Trustworthy? An Empirical Study Over Open-ended Question Answering
- CleanGen: Mitigating Backdoor Attacks for Generation Tasks in Large Language Models
- Conformal Structured Prediction
- Diagnostic Uncertainty: Teaching Language Models to Describe Open-Ended Uncertainty
- Disentangling Linguistic Features with Dimension-Wise Analysis of Vector Embeddings
- Disentangling Sequence Memorization and General Capability in Large Language Models
- Do Multilingual LLMs Think In English?
- Dynaseal: A Backend-Controlled LLM API Key Distribution Scheme with Constrained Invocation Parameters
- Endive: A Cross-Dialect Benchmark for Fairness and Performance in Large Language Models
- Enhancing LLM Robustness to Perturbed Instructions: An Empirical Study
- Evaluating Text Humanlikeness via Self-Similarity Exponent
- Evaluation of Large Language Models via Coupled Token Generation
- ExpProof: Operationalizing Explanations for Confidential Models with ZKPs
- Fast Proxies for LLM Robustness Evaluation
- FiDeLiS: Faithful Reasoning in Large Language Models for Knowledge Graph Question Answering
- Finding Sparse Autoencoder Representations Of Errors In CoT Prompting
- GASP: Efficient Black-Box Generation of Adversarial Suffixes for Jailbreaking LLMs
- HaluEval-Wild: Evaluating Hallucinations of Language Models in the Wild
- Has My System Prompt Been Used? Large Language Model Prompt Membership Inference
- Hidden No More: Attacking and Defending Private Third-Party LLM Inference
- How Does Entropy Influence Modern Text-to-SQL Systems?
- In-Context Meta Learning Induces Multi-Phase Circuit Emergence
- Interpretable Steering of Large Language Models with Feature Guided Activation Additions
- Justified Trust in AI Fairness Assessment using Existing Metadata Entities
- Language Models Use Trigonometry to Do Addition
- Latent Adversarial Training Improves the Representation of Refusal
- Learning Automata from Demonstrations, Examples, and Natural Language
- LLM Neurosurgeon: Targeted Knowledge Removal in LLMs using Sparse Autoencoders
- LLMs Lost in Translation: M-ALERT Uncovers Cross-Linguistic Safety Gaps
- LM Agents May Fail to Act on Their Own Risk Knowledge
- MALIBU Benchmark: Multi-Agent LLM Implicit Bias Uncovered
- Maybe I Should Not Answer That, but... Do LLMs Understand The Safety of Their Inputs?
- Measuring In-Context Computation Complexity via Hidden State Prediction
- Mechanistic Anomaly Detection for "Quirky" Language Models
- MFC-Bench: Benchmarking Multimodal Fact-Checking with Large Vision-Language Models
- Mind the Gap: A Practical Attack on GGUF Quantization
- MKA: Leveraging Cross-Lingual Consensus for Model Abstention
- Model Evaluations Need Rigorous and Transparent Human Baselines
- Monitoring LLM Agents for Sequentially Contextual Harm
- No, Of Course I Can! Refusal Mechanisms Can Be Exploited Using Harmless Data
- On-Premises LLM Deployment Demands a Middle Path: Preserving Privacy Without Sacrificing Model Confidentiality
- Patterns and Mechanisms of Contrastive Activation Engineering
- Private Retrieval Augmented Generation with Random Projection
- Privately Learning from Graphs with Applications in Fine-tuning Large Pretrained Models
- Prune 'n Predict: Optimizing LLM Decision-making with Conformal Prediction
- Pruning as a Defense: Reducing Memorization in Large Language Models
- Red Teaming for Trust: Evaluating Multicultural and Multilingual AI Systems in Asia-Pacific
- Reliable and Efficient Amortized Model-based Evaluation
- Rethinking Hallucinations: Correctness, Consistency, and Prompt Multiplicity
- Rethinking LLM Bias Probing Using Lessons from the Social Sciences
- SafeMERGE: Preserving Safety Alignment in Fine-Tuned Large Language Models via Selective Layer-Wise Model Merging
- Scalable Fingerprinting of Large Language Models
- Self-Ablating Transformers: More Interpretability, Less Sparsity
- Siege: Multi-Turn Jailbreaking of Large Language Models with Tree Search
- SPEX: Scaling Feature Interaction Explanations for LLMs
- Steering Fine-Tuning Generalization with Targeted Concept Ablation
- StochasTok: Improving Fine-Grained Subword Understanding in LLMs
- Systematic Evaluation of LLM-as-a-Judge in LLM Alignment Tasks: Explainable Metrics and Diverse Prompt Templates
- Temporally Sparse Attack for Fooling Large Language Models in Time Series Forecasting
- The Differences Between Direct Alignment Algorithms are a Blur
- The Fundamental Limits of LLM Unlearning: Complexity-Theoretic Barriers and Provably Optimal Protocols
- The Jailbreak Tax: How Useful are Your Jailbreak Outputs?
- The Steganographic Potentials of Language Models
- Token-Level Adversarial Prompt Detection Based on Perplexity Measures and Contextual Information
- ToolScan: A Benchmark For Characterizing Errors In Tool-Use LLMs
- Top of the CLASS: Benchmarking LLM Agents on Real-World Enterprise Tasks
- Towards Effective Discrimination Testing for Generative AI
- Towards Understanding Distilled Reasoning Models: A Representational Approach
- Towards Understanding Fine-Tuning Mechanisms of LLMs via Circuit Analysis
- Understanding (Un)Reliability of Steering Vectors in Language Models
- Unlearning Geo-Cultural Stereotypes in Multilingual LLMs
- Unlocking Hierarchical Concept Discovery in Language Models through Geometric Regularization
- Unnatural Languages Are Not Bugs but Features for LLMs
- VideoJail: Exploiting Video-Modality Vulnerabilities for Jailbreak Attacks on Multimodal Large Language Models
- Why Are Web AI Agents More Vulnerable Than Standalone LLMs? A Security Analysis
- Why Do Multiagent Systems Fail?
- Working Memory Attack on LLMs