ICLR 2026, Past, Large language models, Datasets
ICLR 2026 Workshop on Navigating and Addressing Data Problems for Foundation Models
ICLR 2026 Workshop DATA-FM
- Submission deadline: Feb 8, 2026, 23:59 AoE (UTC−12). Imported from OpenReview; check the website for extensions.
- Submission portal: OpenReview
- Notes: Auto-imported from the OpenReview venue record on 2026-06-10; please verify and enrich (topics are keyword-guessed).
Accepted papers (124)
Fetched from OpenReview (v2) on 2026-06-10.
- [Short] A Formal Language Benchmark for LLMs
- [Short] Beyond Data Size: Exploring the Impact of Dataset Diversity and Density in Self-Distillation Learning
- [Short] Downstream Effects of Translation Scale with Language Difficulty
- [Short] DSL-Monkeys: Self-Generated In-Context Examples for Low-Resource GPU DSL Kernels
- [Short] Exploration into gradient-based coreset methods for targeted subset selection
- [Short] Few-Shot Cross-Table Data Mixture in Tabular In-Context Learning: Benefits, Failure Modes, and Alignment
- [Short] Less is More: On Data Redundancy in VLA Training
- [Short] Max It or Miss It: Benchmarking LLM On Solving Extremal Problems
- [Short] Motion Attribution for Video Generation
- [Short] RIFT: A RubrIc Failure Mode Taxonomy and Automated Diagnostics
- [Short] STRIDE: Training data attribution can be estimated in activation space
- [Short] Studying Memorization Dynamics in Large Language Models Across Pre-Training
- [Short] Towards Large-Scale Heterogeneous Data Organization for Scientific Foundation Models: A Nuclear Fusion Case Study
- [Short] Where Does Olmo Get Its Values?
- [Short] Active Learning for Scalable Data Selection in Instruction Tuning
- A Unified Theory of Random Projection for Influence Functions
- Actor-curator: A Principled Approach to Online Data Selection for RL Post-training
- AdaProb: Efficient Machine Unlearning via Adaptive Probability
- Adaptive Structured Transformation: Mitigating Distribution Shift in Dense Retrieval Through Training-Time Preprocessing
- Adversarial Arena: Crowdsourcing Data Generation through Interactive Competition
- AgenticMath: Enhancing LLM Reasoning via Agentic-based Math Data Generation
- AI Scientist Via Synthetic Task Scaling
- An Empirical Study on Noisy Data and LLM Pretraining Loss Divergence
- Anchored Decoding: Provably Reducing Copyright Risk for Any Language Model
- Are Easier or Harder Examples Better? Rethinking Data Selection for Reward Models and Preference Optimization
- ATLAS: Adaptive Transfer Scaling Laws for Multilingual Pretraining, Finetuning, and Decoding the Curse of Multilinguality
- Auditing Preference-Based Post-Training of LLMs via Strong Membership Inference Attacks
- Benign Overfitting in Adversarial Training for Vision Transformers
- Beyond Training for Cultural Awareness: The Role of Dataset Linguistic Structure in Large Language Models
- Bridging the Sim-to-real Gap in RF Localization with Large-Scale Synthetic Pretraining
- Combating Data Laundering in LLM Training
- Configuration-to-Performance Scaling Law with Neural Ansatz
- Context-Aware Criteria Generation with VLMs for Advertisement Ranking under Data Scarcity
- Conv-to-Bench: Evaluating Language Models Via User–Assistant Dialogues In Code Tasks
- Cram Less to Fit More: Training Data Pruning Improves Memorization of Facts
- Data Provenance for Image Auto-Regressive Generation
- Data Selection for Fine-tuning Vision Language Models via Cross Modal Alignment Trajectories
- DISCO: Diversifying Sample Condensation for Efficient Model Evaluation
- Do RDB Foundation Models Even Need Data?
- DSGym: A Standardized and Holistic Framework for Advancing Data Science Agents
- DUMP: Distribution-Level Curriculum Learning for RL-based LLM Post-training
- EnterpriseOps-Gym: Environments and Evaluations for Stateful Agentic Planning and Tool Use in Enterprise Settings
- EPSVec: Efficient and Private Synthetic Data Generation via Dataset Vectors
- Escaping Model Collapse via Synthetic Data Verification: Near-term Improvements and Long-term Convergence
- ESDAE: Evaluating Synthetic Data for Agent Evaluation
- Evaluating Frontier Agents on End-to-End Investment Banking Workflows
- Evaluating Language Models in Realistic Conversational Contexts
- Federated Agent Reinforcement Learning
- gen2seg: Generative Models Enable Generalizable Instance Segmentation
- Geometry-Preserving Coresets for Quantized Foundation Models in Remote Sensing
- GraphPFN: A Prior-Data Fitted Graph Foundation Model
- Greedy Information Projection for LLM Data Selection
- Guess the unified model: Domain and Linguistic Effects in Generated Images
- GUIrilla: A Scalable Framework for Automated Desktop UI Exploration
- Hierarchical Agenda Reasoning for Strategic Multi-Turn Dialogue Agents
- Hubble: a Model Suite to Advance the Study of LLM Memorization
- ImageNet-Think-250K: A Large-Scale Synthetic Dataset for Multimodal Reasoning for Vision Language Models
- In-Run Data Shapley for Adam Optimizer
- Induction Signatures Are Not Enough: A Matched-Compute Study of Load-Bearing Structure in In-Context Learning
- Inference-Time Distillation: Cost-Efficient Agents Without Fine-Tuning or Manual Prompt Engineering
- Infusion: Shaping Model Behavior by Editing Training Data via Influence Functions
- Is More Data Worth the Cost? Dataset Scaling Laws in a Tiny Attention-Only Decoder
- jina-vlm: Small Multilingual Vision Language Model
- Language Self-Play For Data-Free Training
- Learning from Synthetic Data Improves Multi-hop Reasoning
- LEGALMIDM: Use-Case-Driven Legal Domain Specialization for Korean Large Language Model
- Less is More: Adaptive Coverage Sampling for Synthetic Training Data
- Matched Data, Better Models: Target Aligned Data Filtering with Sparse Autoencoders
- Measuring Dataset Diversity from a Geometric Perspective
- Mix Early, Forget Less: Data Mixing During Pretraining Builds Resistance to Forgetting
- MixAtlas: Uncertainty-aware Data Mixture for Multimodal LLM Midtraining
- MixtureVitae: Open Web-Scale Pretraining Dataset With High Quality Instruction and Reasoning Data Built from Permissive-First Text Sources
- MobileKernelBench: Can LLMs Write Efficient Kernels for Mobile Devices?
- Motion Capture is Not the Target Domain: Scaling Synthetic Data for Learning Motion Representations
- Multi-Modal Data Mixtures via Latent Space Coupling for Vision-Language Model Training
- Multimodal Data Curation Through Ranked Retrieval
- Non-Local Data Attribution for On-policy Reinforcement Learning
- OASIS: Online Sample Selection for Continual Instruction Tuning
- Olmix: A Framework for Data Mixing Throughout LM Development
- On the Strengths and Weaknesses of Data for Open-Set Embodied Assistance
- Open LLM Projects Should Allocate More Compute for Data Than Training
- Optimal Splitting of Language Models from Mixtures to Specialized Domains
- OPUS: Towards Principled and Scalable Data Selection for Large Language Model Pre-training in Every Iteration
- OR-LLM-Bench: A Pipeline for Scalable and Verifiable Text-to-Optimization Synthesis
- Overcoming the Scarcity of Verifiable Reasoning Data with Decision Pivots
- PluRel: Synthetic Data unlocks Scaling Laws for Relational Foundation Models
- Positive Mining from LLM Seeds: A Semi-Supervised Graph Based Approach to Train Rare Event Classifiers
- Privacy Collapse: Benign Fine-Tuning Can Break Contextual Privacy in Language Models
- Private Linear Regression via a Down-Sensitivity to Privacy Reduction
- Privileged Information Distillation for Language Models
- propella-1: Multi-Property Document Annotation for LLM Data Curation at Scale
- Query-based Model Collaboration Enables Expert-level Clinical Text Augmentation
- RelBench v2: A Large-Scale Benchmark and Relational Data Repository
- Rescaled Influence Functions: Accurate Data Attribution in High Dimension
- Resource-Adaptive Federated Text Generation with Differential Privacy
- Rethinking Data Curation in LLM Training: Online Reweighting Offers Better Generalization than Offline Methods
- Rethinking Data Selection: The Importance of Coverage over Difficulty in Generative Fine-Tuning
- ROSER: Few-Shot Robotic Sequence Retrieval for Scalable Robot Learning
- RubricRobustness: Evaluating the Sensitivity of Rubrics-Based Benchmarks to Simple Perturbations
- Shape of Thought: When Distribution Matters More than Correctness in Reasoning Tasks
- SpreadsheetArena: Decomposing Preference in LLM Generation of Spreadsheet Workbooks
- Structured Captions Improve Prompt Adherence in Text-to-Image Models (Re-LAION-Caption 19M)
- SynQuE: Estimating Synthetic Dataset Quality Without Annotations
- Task Scarcity and Label Leakage in Relational Transfer Learning
- Test-Time Meta-Adaptation with Self-Synthesis
- The Capability Frontier: Benchmarks Miss 82% of Model Performance
- The Chicken and Egg Dilemma: Co-optimizing Data and Model Configurations for LLMs
- The Era of Real-World Human Interaction: RL from User Conversations
- The Silent Brush: Artistic Style Leakage in AI Art Generation
- The Viability Boundary of Differential Privacy
- Toward Cross-Lingual Quality Classifiers for Multilingual Pretraining Data Selection
- Toward Evaluating Model Collapse in LLMs: Insights from Continual Pretraining
- Train Smarter, Not Longer: Memorization-Guided Data Reuse for Efficient LLM Training
- TRIM: Token-Budgeted Data Mining for Instruction Tuning
- TutorBench: A Benchmark To Assess Tutoring Capabilities Of Large Language Models
- Understanding the Impact of Differentially Private Training on Memorization of Long-Tailed Data
- Unified Evaluation of Table Embedding Methods Across Multiple Benchmark Scenarios
- Unmasking LAION-5B: Age, Gender, Race, and Emotion Biases in Large-Scale Image Datasets
- Verifying the Verifiers: Failure Attribution for Benchmark Diagnostics and Training Data Curation
- Visual Compositional Tuning
- VULCAN: Where Agents Learn by Living in Simulated Tool Environments
- When do Score-Based Data Valuation Methods Work, and Why?
- Why Diffusion Language Models Struggle with Truly Parallel (Non-Autoregressive) Decoding?
- Winning the Pruning Gamble: A Unified Approach to Joint Sample and Token Pruning for Efficient Supervised Fine-Tuning