
ICLR 2024 Workshop on Data-centric Machine Learning Research (DMLR): Harnessing Momentum for Science

DMLR @ ICLR 2024


Submission deadline: Feb 9, 2024, 12:00 UTC
imported from OpenReview — check the website for extensions
Submission portal: OpenReview

Accepted papers (85)

Fetched from OpenReview (v2) on 2026-06-10.

AdaDemo: Data-Efficient Demonstration Expansion for Generalist Robotic Agent
Analyzing Diffusion Models on Synthesizing Training Datasets
Annotating Ambiguous Images: General Annotation Strategy for High-Quality Data with Real-World Biomedical Validation
Annotation Sensitivity: Drivers of Training Data Quality
Atomic Data Groups: An issue in train-test splits for the real world as demonstrated through digital hardware design
Autoregressive activity prediction for low-data drug discovery
Beyond Human Data: Scaling Self-Training for Problem-Solving with Language Models
Beyond Scale: The Diversity Coefficient as a Data Quality Metric for Variability in Natural Language Data
Bidirectional Long-Range Parser for Sequential Data Understanding
Birbal: An efficient 7B instruct-model fine-tuned with curated datasets
Building Scalable Video Understanding Benchmarks through Sports
Calibrated prediction of scarce adverse drug reaction labels with conditional neural processes
CLE-SMOTE: Addressing Extreme Imbalanced Data Classification with Contrastive Learning-Enhanced SMOTE
Coactive Learning for Large Language Models using Implicit User Feedback
Combining Time Series Modalities to Create Endpoint-driven Patient Records
Computational Copyright: Towards A Royalty Model for AI Music Generation Platforms
Corrective Machine Unlearning
CRUXEval: A Benchmark for Code Reasoning, Understanding and Execution
Data Distribution Valuation
Data-Efficient Multi-Modal Contrastive Learning: Prioritizing Data Quality over Quantity
Denoising Drug Discovery ADMET Data for Improved Regression Task Performance
Deploying Data Selection Techniques on Dynamic Datasets
Distributional Dataset Distillation with Subtask Decomposition
Empowering Large Language Models for Textual Data Augmentation
Enabling High Data Throughput Reinforcement Learning on GPUs: A Domain Agnostic Framework for Data-Driven Scientific Research
Enhanced Variational Autoencoder Estimation from Incomplete Data using Mixture Variational Families
Environment-adjusted Topic Models
Exploring the Efficacy of Meta-Learning: Unveiling Superior Data Diversity Utilization of MAML Over Pre-training
Feedback-guided Data Synthesis for Imbalanced Classification
Fractals as Pre-training Datasets for Anomaly Detection and Localization
From Categories to Classifier: Name-Only Continual Learning by Exploring the Web
FTFT: efficient and robust Fine-Tuning by transFerring Training Dynamics
Genetic Learning for Designing Sim-to-Real Data Augmentations
GitChameleon: Breaking the version barrier for code generation models
Graph Kernel Convolutions for Interpretable Classification
GRASP-GCN: Graph-Shape Prioritization for Neural Architecture Search under Distribution Shifts
H2O+: An Improved Framework for Hybrid Offline-and-Online RL with Dynamics Gaps
Heterogeneous Normal Classes Pose a Challenge for Anomaly Detection
Identifying Spurious Correlations Early in Training through the Lens of Simplicity Bias
Improving Semantic Segmentation Models through Synthetic Data Generation via Diffusion Models
Information Compensation: A Fix for Any-scale Dataset Distillation
Interpretable Graph Neural Networks for Tabular Data
Is a picture of a bird a bird? A mixed-methods approach to understanding diverse human perspectives and ambiguity in machine vision models
Is margin all you need? An extensive empirical study of deep active learning on tabular data
Language Models as Science Tutors
Learning Galaxy Intrinsic Alignment Correlations
Learning representations of learning representations
Learning to Rank for One-Round Active Learning
Lifelong Benchmarks: Efficient Model Evaluation in an Era of Rapid Progress
LLM-Guided Counterfactual Data Generation for Fairer AI
Long Is More for Alignment: A Simple but Tough-to-Beat Baseline for Instruction Fine-Tuning
Measuring Diversity in Datasets
Model-Based Data-Centric AI: Bridging the Divide Between Academic Ideals and Industrial Pragmatism
Multi-model evaluation with labeled and unlabeled data
On the Scalability of GNNs for Molecular Graphs
One Law, Many Languages: Benchmarking Multilingual Legal Reasoning for Judicial Support
OODRobustBench: a benchmark and large-scale analysis of adversarial robustness under distribution shift
Open Domain Generalization with a Single Network by Regularization Exploiting Pre-trained Features
PointSAGE : Mesh-independent superresolution approach to fluid flow predictions
PRE: Vision-Language Prompt Learning with Reparameterization Encoder
Pretraining Probabilistic Models for Scalable Precision Agriculture
Private Data Measurements for Decentralized Data Markets
Pushing the Decision Boundaries: Discovering New Classes in Audio Data
QualEval: Qualitative Evaluation for Model Improvement
Quantifying the Importance of Data Alignment in Downstream Model Performance
QuRating: Selecting High-Quality Data for Training Language Models
Re-evaluating Retrosynthesis Algorithms with Syntheseus
Retail-786k: a Large-Scale Dataset for Visual Entity Matching
Step-DAD: Semi-Amortized Policy-Based Bayesian Experimental Design
Style-Content Disentanglement Under Conditional Shift
The Science of Data Filtering: Data Curation cannot be Compute Agnostic
TOTEM: Tokenized Time Series Embeddings for General Time Series Analysis
Towards Algorithmic Fairness by means of Instance-level Data Re-weighting based on Shapley Values
Towards Efficient Active Learning in NLP via Pretrained Representations
Towards Grounded Visual Spatial Reasoning in Multi-Modal Vision Language Models
Towards Quantifying the Effect of Datasets for Benchmarking: A Look at Tabular Machine Learning
Towards Robust Data Pruning
Understanding the Robustness of Multi-modal Contrastive Learning to Distribution Shift
Unveiling the Intertwined Relationship Between Essential Sparsity and Robustness in Large Pre-trained Models
Urban Sound Propagation: a Benchmark for 1-Step Generative Modeling of Complex Physical Systems
Verified Training for Counterfactual Explanation Robustness under Data Shift
VTruST: Controllable value function based subset selection for Data-Centric Trustworthy AI
When is Off-Policy Evaluation Useful? A Data-Centric Perspective
WINDSET: Weather Insights and Novel Data for Systematic Evaluation and Testing
You can't handle the (dirty) truth: Data-centric insights improve pseudo-labeling
