ICLR 2024 | Past | Large language models | Safety & alignment | Privacy & security
ICLR 2024 Workshop on Secure and Trustworthy Large Language Models
SeT LLM @ ICLR 2024
- Submission deadline: Feb 20, 2024, 23:59 UTC (imported from OpenReview; check the website for extensions)
- Submission portal: OpenReview
- Notes: Auto-imported from the OpenReview venue record on 2026-06-10; please verify and enrich (topics are keyword-guessed).
Accepted papers (72)
Fetched from OpenReview (v2) on 2026-06-10.
- A closer look at adversarial suffix learning for Jailbreaking LLMs
- An Assessment of Model-on-Model Deception
- Are Large Language Models Bayesian? A Martingale Perspective on In-Context Learning
- ArtPrompt: ASCII Art-based Jailbreak Attacks against Aligned LLMs
- Assessing Prompt Injection Risks in 200+ Custom GPTs
- Assessing the Brittleness of Safety Alignment via Pruning and Low-Rank Modifications
- Attacking LLM Watermarks by Exploiting Their Strengths
- Attacks on Third-Party APIs of Large Language Models
- Backward Chaining Circuits in a Transformer Trained on a Symbolic Reasoning Task
- Bayesian reward models for LLM alignment
- BEYOND FINE-TUNING: LORA MODULES BOOST NEAR-OOD DETECTION AND LLM SECURITY
- Break it, Imitate it, Fix it: Robustness by Generating Human-Like Attacks
- Calibrating Language Models With Adaptive Temperature Scaling
- Can LLMs Separate Instructions From Data? And What Do We Even Mean By That?
- Character-level robustness should be revisited
- Coercing LLMs to do and reveal (almost) anything
- CollabEdit: Towards Non-destructive Collaborative Knowledge Editing
- Decoding Compressed Trust: Scrutinizing the Trustworthiness of Efficient LLMs Under Compression
- Differentially Private Synthetic Data via Foundation Model APIs 2: Text
- DUAW: Data-free Universal Adversarial Watermark against Stable Diffusion Customization
- Enhancing and Evaluating Logical Reasoning Abilities of Large Language Models
- Explorations of Self-Repair in Language Models
- Exploring the Adversarial Capabilities of Large Language Models
- Fight Back Against Jailbreaking via Prompt Adversarial Tuning
- Group Preference Optimization: Few-Shot Alignment of Large Language Models
- GUARD: Role-playing to Generate Natural-language Jailbreakings to Test Guideline Adherence of Large Language Models
- How many Opinions does your LLM have? Improving Uncertainty Estimation in NLG
- How Susceptible are Large Language Models to Ideological Manipulation?
- I'm not familiar with the name Harry Potter: Prompting Baselines for Unlearning in LLMs
- Initial Response Selection for Prompt Jailbreaking using Model Steering
- Is Your Jailbreaking Prompt Truly Effective for Large Language Models?
- Large Language Model Bias Mitigation from the Perspective of Knowledge Editing
- Leveraging Context in Jailbreaking Attacks
- LoRA Fine-tuning Efficiently Undoes Safety Training in Llama 2-Chat 70B
- MARS: Meaning-Aware Response Scoring for Uncertainty Estimation in Generative LLMs
- Mechanistically analyzing the effects of fine-tuning on procedurally defined tasks
- On Fairness Implications and Evaluations of Low-Rank Adaptation of Large Models
- On Prompt-Driven Safeguarding for Large Language Models
- On Trojan Signatures in Large Language Models of Code
- Open Sesame! Universal Black-Box Jailbreaking of Large Language Models
- PANDORA: Detailed LLM Jailbreaking via Collaborated Phishing Agents with Decomposed Reasoning
- PETA: PARAMETER-EFFICIENT TROJAN ATTACKS
- Preventing Memorized Completions through White-Box Filtering
- Privacy-preserving Fine-tuning of Large Language Models through Flatness
- Quantitative Certification of Knowledge Comprehension in LLMs
- Rainbow Teaming: Open-Ended Generation of Diverse Adversarial Prompts
- Red Teaming GPT-4V: Are GPT-4V Safe Against Uni/Multi-Modal Jailbreak Attacks?
- Retrieval Augmented Prompt Optimization
- Robust Prompt Optimization for Defending Language Models Against Jailbreaking Attacks
- SafeDecoding: Defending against Jailbreak Attacks via Safety-Aware Decoding
- Safer-Instruct: Aligning Language Models with Automated Preference Data
- Self-Alignment of Large Language Models via Social Scene Simulation
- Self-evaluation and self-prompting to improve the reliability of LLMs
- Sequence-Level Certainty Reduces Hallucination In Knowledge-Grounded Dialogue Generation
- Shadow Alignment: The Ease of Subverting Safely-Aligned Language Models
- Shadowcast: Stealthy Data Poisoning Attacks Against Vision-Language Models
- Simple Permutations Can Fool LLaMA: Permutation Attack and Defense for Large Language Models
- Single-pass detection of jailbreaking input in large language models
- Source-Aware Training Enables Knowledge Attribution in Language Models
- Steering LLMs Towards Unbiased Responses: A Causality-Guided Debiasing Framework
- Tailoring Self-Rationalizers with Multi-Reward Distillation
- The Effect of Model Size on LLM Post-hoc Explainability via LIME
- TOFU: A Task of Fictitious Unlearning for LLMs
- Toward Robust Unlearning for LLMs
- Towards Principled Evaluations of Sparse Autoencoders for Interpretability and Control
- TrustScore: Reference-Free Evaluation of LLM Response Trustworthiness
- Walk the Talk? Measuring the Faithfulness of Large Language Model Explanations
- Watermark Stealing in Large Language Models
- Watermarks in the Sand: Impossibility of Strong Watermarking for Generative Models
- WatME: Towards Lossless Watermarking Through Lexical Redundancy
- What's in Your "Safe" Data?: Identifying Benign Data that Breaks Safety
- WinoViz: Probing Visual Properties of Objects Under Different States