ICLR 2025 · Past · Large language models · Datasets

ICLR 2025 Workshop on Navigating and Addressing Data Problems for Foundation Models


Submission deadline: Feb 8, 2025, 11:59 UTC (imported from OpenReview; check the website for extensions)
Submission portal: OpenReview
Notes: Auto-imported from the OpenReview venue record on 2026-06-10; please verify and enrich (topics are keyword-guessed).

Accepted papers (85)

Fetched from OpenReview (v2) on 2026-06-10.

  1. $f$-SCRUB: Unbounded Machine Unlearning Via $f$-divergences

    Amirhossein Bagheri, Radmehr Karimian, Gholamali Aminian · PDF
  2. A Missing Testbed for LLM Pre-Training Membership Inference Attacks

    Mingjian Jiang, Ken Ziyu Liu, Sanmi Koyejo · PDF
  3. A Versatile Influence Function for Data Attribution with Non-Decomposable Loss

    Junwei Deng, Weijing Tang, Jiaqi W. Ma · PDF
  4. Abg-SciQA: A dataset for Understanding and Resolving Ambiguity in Scientific Questions

    Tiejin Chen, Kuan-Ru Liou, Mithun Shivakoti, Aaryan Gaur, Pragya Kumari, Meiqi Guo, Hua Wei · PDF
  5. ADSO: Adaptive Data Mixture & Scale Optimization. A Multi-Scale Multi-Fidelity Bayesian Optimization Approach.

    Andrew Wei Tung Siah, Haozhe Chen, C. Daniel Guetta, Tianyi Peng, Hongseok Namkoong, Tzu-Ching Yen · PDF
  6. Adversarial Attacks on Data Attribution

    Xinhe Wang, Pingbang Hu, Junwei Deng, Jiaqi W. Ma · PDF
  7. Aioli: A Unified Optimization Framework for Language Model Data Mixing

    Mayee F Chen, Michael Y. Hu, Nicholas Lourie, Kyunghyun Cho, Christopher Re · PDF
  8. Approximations to worst-case data dropping: unmasking failure modes

    Jenny Y. Huang, David R. Burt, Yunyi Shen, Tin D. Nguyen, Tamara Broderick · PDF
  9. Be like a Goldfish, Don't Memorize! Mitigating Memorization in Generative LLMs

    Abhimanyu Hans, Yuxin Wen, Neel Jain, John Kirchenbauer, Hamid Kazemi, Prajwal Singhania, Siddharth Singh, Gowthami Somepalli, Jonas Geiping, Abhinav Bhatele, Tom Goldstein · PDF
  10. BenchAgents: Automated Benchmark Creation with Agent Interaction

    Natasha Butt, Varun Chandrasekaran, Neel Joshi, Besmira Nushi, Vidhisha Balachandran · PDF
  11. Beyond ordinary Lipschitz constraints: Differentially Private optimization with TNC

    Difei Xu, Meng Ding, Zihang Xiang, Jinhui Xu, Di Wang · PDF
  12. Blind Baselines Beat Membership Inference Attacks for Foundation Models

    Debeshee Das, Jie Zhang, Florian Tramèr · PDF
  13. Building Bridges, Not Walls: Advancing Interpretability by Unifying Feature, Data, and Model Component Attribution

    Shichang Zhang, Tessa Han, Usha Bhalla, Himabindu Lakkaraju · PDF
  14. Chameleon: A Flexible Data-mixing Framework for Language Model Pretraining and Finetuning

    Wanyun Xie, Francesco Tonin, Volkan Cevher · PDF
  15. Common Functional Decompositions Can Mis-attribute Differences in Outcomes Between Populations

    Manuel Quintero, William T. Stephenson, Advik Shreekumar, Tamara Broderick · PDF
  16. Context-Guided Responsible Data Augmentation with Diffusion Models

    Khawar Islam, Naveed Akhtar · PDF
  17. Context-Parametric Inversion: Why Instruction Finetuning Can Worsen Context Reliance

    Sachin Goyal, Christina Baek, J Zico Kolter, Aditi Raghunathan · PDF
  18. Contrastive Private Data Synthesis via Weighted Multi-PLM Fusion

    Tianyuan Zou, Yang Liu, Peng Li, Yufei Xiong, Jianqing Zhang, Jingjing Liu, Ye Ouyang, Xiaozhou Ye, Yaqin Zhang · PDF
  19. D3: A Large Dataset for Training Code Language Models to Act Diff-by-Diff

    Ulyana Piterbarg, Kanishk Gandhi, Lerrel Pinto, Noah Goodman, Rob Fergus · PDF
  20. Data Efficient Pre-training for Language Models: An Empirical Study of Compute Efficiency and Linguistic Competence

    Andreas Paraskeva, Max Johannes van Duijn, Maarten de Rijke, Suzan Verberne, Jan N. van Rijn · PDF
  21. Data Mixing Can Induce Phase Transitions in Knowledge Acquisition

    Xinran Gu, Kaifeng Lyu, Jiazheng Li, Jingzhao Zhang · PDF
  22. Data-Efficient Supervised Fine-Tuning of Language Models Using Optimal Design

    Rohan Deb, Kiran Koshy Thekumparampil, Kousha Kalantari, Gaurush Hiranandani, Shoham Sabach, Branislav Kveton · PDF
  23. Defending LVLMs Against Vision Attacks through Partial-Perception Supervision

    Qi Zhou, Tianlin Li, Qing Guo, Dongxia Wang, Yun Lin, Yang Liu, Jin Song Dong · PDF
  24. Demystifying Long Chain-of-Thought Reasoning in LLMs

    Edward Yeo, Yuxuan Tong, Xinyao Niu, Graham Neubig, Xiang Yue · PDF
  25. Differentially Private Synthetic Data via APIs 3: Using Simulators Instead of Foundation Model

    Zinan Lin, Tadas Baltrusaitis, Sergey Yekhanin · PDF
  26. Diversity Measurement and Subset Selection for Instruction Tuning Datasets

    Peiqi Wang, Yikang Shen, Zhen Guo, Matthew Stallone, Yoon Kim, Polina Golland, Rameswar Panda · PDF
  27. Domain-Specific Benchmarking of Vision-Language Models: A Task Augmentation Framework Using Metadata

    Tim Rädsch, Leon Mayer, Simon Pavicic, Ali Emre Kavur, Marcel Knopp, Barış Öztürk, Klaus Maier-Hein, Paul F Jaeger, Fabian Isensee, Annika Reinke, Lena Maier-Hein · PDF
  28. DUET: Optimizing Training Data Mixtures via Feedback from Unseen Evaluation Tasks

    Zhiliang Chen, Gregory Kang Ruey Lau, Chuan-Sheng Foo, Bryan Kian Hsiang Low · PDF
  29. Editable Concept Bottleneck Models

    Lijie Hu, Chenyang Ren, Zhengyu Hu, Hongbin Lin, Cheng-Long Wang, Zhen Tan, Weimin Lyu, Jingfeng Zhang, Hui Xiong, Di Wang · PDF
  30. Enhancing Interpretability in Generative AI Through Search-Based Data Influence Analysis

    Theodoros Aivalis, Iraklis A. Klampanos, Antonis Troumpoukis, Joemon M. Jose · PDF
  31. Enhancing Multilingual LLM Pretraining with Model-Based Data Selection

    Bettina Messmer, Vinko Sabolčec, Martin Jaggi · PDF
  32. Explaining Length Bias in LLM-Based Preference Evaluations

    Zhengyu Hu, Linxin Song, Jieyu Zhang, Zheyuan Xiao, Zhengyu Chen, Hui Xiong · PDF
  33. From Fairness to Truthfulness: Rethinking Data Valuation Design

    Dongyang Fan, Tyler J. Rotello, Sai Praneeth Karimireddy · PDF
  34. Generalizing from SIMPLE to HARD Visual Reasoning: Can We Mitigate Modality Imbalance in VLMs?

    Simon Park, Abhishek Panigrahi, Yun Cheng, Dingli Yu, Anirudh Goyal, Sanjeev Arora · PDF
  35. How much of my dataset did you use? Quantitative Data Usage Inference in Machine Learning

    Yao Tong, Jiayuan Ye, Sajjad Zarifzadeh, Reza Shokri · PDF
  36. Improving Influence-based Instruction Tuning Data Selection for Balanced Learning of Diverse Capabilities

    Qirun Dai, Dylan Zhang, Jiaqi W. Ma, Hao Peng · PDF
  37. Improving Multimodal Large Language Models in Low-Resource Language Contexts

    Yufei Gao, Feijiaying, Guohang Yan, Yunshi Lan · PDF
  38. Information-theoretic Quantification of Inherent Discrimination Bias in Training Data for Supervised Learning

    Sokrat Aldarmini, Mohamed S Nafea · PDF
  39. Investigating Memorization in Video Diffusion Models

    Chen Chen, Enhuai Liu, Daochang Liu, Mubarak Shah, Chang Xu · PDF
  40. KGGen: Text To Knowledge Graph

    Belinda Mo, Kyssen Yu, Joshua Kazdan, Proud Mpala, Lisa Yu, Chris Cundy, Charilaos Kanatsoulis, Sanmi Koyejo · PDF
  41. Language Model Preference Evaluation with Multiple Weak Evaluators

    Zhengyu Hu, Jieyu Zhang, Zhihan Xiong, Alexander Ratner, Hui Xiong, Ranjay Krishna · PDF
  42. Lightweight Dataset Pruning without Full Training via Example Difficulty and Prediction Uncertainty

    Yeseul Cho, Baekrok Shin, Changmin Kang, Chulhee Yun · PDF
  43. LoBAM: LoRA-Based Backdoor Attack on Model Merging

    Ming Yin, Jingyang Zhang, Jingwei Sun, Minghong Fang, Hai Helen Li, Yiran Chen · PDF
  44. MFC-Bench: Benchmarking Multimodal Fact-Checking with Large Vision-Language Models

    Shengkang Wang, Hongzhan Lin, Ziyang Luo, Zhen Ye, Guang Chen, Jing Ma · PDF
  45. MMA: Benchmarking Multi-Modal Large Language Model in Ambiguity Contexts

    Ru Wang, Selena Song, Liang Ding, Mingming Gong, Yusuke Iwasawa, Yutaka Matsuo, Jiaxian Guo · PDF
  46. Model Collapse in the Self-Consuming Chain of Diffusion Finetuning: A Novel Perspective from Quantitative Trait Modeling

    Youngseok Yoon, Dainong Hu, Iain Weissburg, Yao Qin, Haewon Jeong · PDF
  47. Nepotistically Trained Generative Image Models Collapse

    Maty Bohacek, Hany Farid · PDF
  48. NICE: Non-Differentiable Evaluation Metric-Based Data Selection for Instruction Tuning

    Jingtan Wang, Xiaoqiang Lin, Rui Qiao, Pang Wei Koh, Chuan-Sheng Foo, Bryan Kian Hsiang Low · PDF
  49. On the Power of Context-Enhanced Learning in LLMs

    Xingyu Zhu, Abhishek Panigrahi, Sanjeev Arora · PDF
  50. OpenRAG: Optimizing RAG End-to-End via In-Context Retrieval Learning

    Jiawei Zhou, Lei Chen · PDF
  51. PhantomWiki: On-Demand Datasets for Reasoning and Retrieval Evaluation

    Albert Gong, Kamilė Stankevičiūtė, Chao Wan, Anmol Kabra, Raphael Thesmar, Johann Lee, Julius Klenke, Carla P Gomes, Kilian Q Weinberger · PDF
  52. PiKE: Adaptive Data Mixing for Multi-Task Learning Under Low Gradient Conflicts

    Zeman Li, Yuan Deng, Peilin Zhong, Meisam Razaviyayn, Vahab Mirrokni · PDF
  53. Position: What's the next frontier for Data-centric AI? Data Savvy Agents!

    Nabeel Seedat, Jiashuo Liu, Mihaela van der Schaar · PDF
  54. Preserving Product Fidelity in Large Scale Image Recontextualization with Diffusion Models

    Ishaan Malhi, Praneet Dutta, Ellie Talius, Sally Ma, Brendan Driscoll, Krista Holden, Garima Pruthi, Arunachalam Narayanaswamy · PDF
  55. Privacy Attacks on Image AutoRegressive Models

    Antoni Kowalczuk, Jan Dubiński, Franziska Boenisch, Adam Dziedzic · PDF
  56. Privacy Auditing for Large Language Models with Natural Identifiers

    Lorenzo Rossi, Bartłomiej Marek, Franziska Boenisch, Adam Dziedzic · PDF
  57. Proper Dataset Valuation by Pointwise Mutual Information

    Shuran Zheng, Xuan Qi, Rui Ray Chen, Yongchan Kwon, James Zou · PDF
  58. Query-dependent Prompt Optimization via Multi-Loop Offline Reinforcement Learning

    Yilun Kong, Hangyu Mao, Qi Zhao, Bin Zhang, Jingqing Ruan, Li Shen, Yongzhe Chang, Xueqian Wang, Rui Zhao, Dacheng Tao · PDF
  59. RepFair-QGAN: Alleviating Representation Bias in Quantum Generative Adversarial Networks Using Gradient Clipping

    Kamil Sabbagh, Hadi Salloum, Yaroslav Kholodov · PDF
  60. Revisiting Multi-Modal LLM Evaluation

    Jian Lu, Shikhar Srivastava, Junyu Chen, Robik Singh Shrestha, Manoj Acharya, Kushal Kafle, Christopher Kanan · PDF
  61. Revisiting Semi-supervised Adversarial Training via Noise-aware Online Robust Distillation

    Tsung-Han Wu, Hung-Ting Su, Shang-Tse Chen, Winston H. Hsu · PDF
  62. Reward-Augmented Data Enhances Direct Preference Alignment of LLMs

    Shenao Zhang, Zhihan Liu, Boyi Liu, Yufeng Zhang, Yingxiang Yang, Yongfei Liu, Liyu Chen, Tao Sun, Zhaoran Wang · PDF
  63. RichSpace: Enriching Text-to-Video Prompt Space via Text Embedding Interpolation

    Yuefan Cao, Chengyue Gong, Xiaoyu Li, Yingyu Liang, Zhizhou Sha, Zhenmei Shi, Zhao Song · PDF
  64. Robust In-Context Learning via Multi-Armed Bandit-Based Partition Selection

    Varul Srivastava, Sankarshan Damle, Manisha Padala · PDF
  65. Rule-Based Rating and Selection of LLM Training Data

    Xiaomin Li, Mingye Gao, Zhiwei Zhang, Chang Yue, Hong Hu · PDF
  66. STAMP Your Content: Proving Dataset Membership via Watermarked Rephrasings

    Saksham Rastogi, Pratyush Maini, Danish Pruthi · PDF
  67. SubLIME*: Data Efficient Foundation Model Evaluation across Modalities, Languages and Benchmarks

    Mahammad Parwez Alam, Gayathri Saranathan, Cong Xu, Javier Aula-Blasco, Martin Foltin, Tarun Kumar, Soon Yee Wong, Suparna Bhattacharya · PDF
  68. Synthesizing Physical Backdoor Datasets: An Automated Framework Leveraging Deep Generative Models

    Sze Jue Yang, Chinh Duc La, Quang H Nguyen, Eugene Bagdasarian, Kok-Seng Wong, Anh Tuan Tran, Chee Seng Chan, Khoa D Doan · PDF
  69. Synthesizing Privacy-Preserving Text Data via Finetuning *without* Finetuning Billion-Scale LLMs

    Bowen Tan, Zheng Xu, Eric P. Xing, Zhiting Hu, Shanshan Wu · PDF
  70. Template Matters: Understanding the Role of Instruction Templates in Multimodal Language Model Evaluation and Training

    Shijian Wang, Linxin Song, Jieyu Zhang, Ryotaro Shimizu, Ao Luo, Li Yao, Cunjian Chen, Julian McAuley, Hanqian Wu · PDF
  71. The Delta Learning Hypothesis: Preference Tuning on Weak Data Can Yield Strong Gains

    Scott Geng, Hamish Ivison, Chun-Liang Li, Maarten Sap, Jerry Li, Ranjay Krishna, Pang Wei Koh · PDF
  72. The Emperor's New Clothes in Benchmarking? A Rigorous Examination of Mitigation Strategies for LLM Benchmark Data Contamination

    Yifan Sun, Han Wang, Dongbai Li, Gang Wang, Huan Zhang · PDF
  73. The surprising amount of arbitrariness in Shapley-value data valuation

    Hannah Diehl, Ashia C. Wilson · PDF
  74. Toward Efficient Influence Function: Dropout as a Compression Tool

    Yuchen Zhang, Mohammad Mohammadi Amiri · PDF
  75. Towards Comprehensive Preference Data Collection for Reward Modeling

    Yulan Hu, Qingyang Li, Sheng Ouyang, Ge Chen, Jinman Zhao, Yong Liu · PDF
  76. Towards Human-Guided, Data-Centric LLM Co-Pilots

    Evgeny Saveliev, Jiashuo Liu, Nabeel Seedat, Anders Boyd, Mihaela van der Schaar · PDF
  77. Towards Internet-Scale Training For Agents

    Brandon Trabucco, Gunnar A Sigurdsson, Robinson Piramuthu, Ruslan Salakhutdinov · PDF
  78. Tracing the Misuse of Personalized Textual Embeddings for Text-to-Image Models

    Weitao Feng, Jiyan He, Jie Zhang, Tianyi Wei, Wenbo Zhou, Qing Guo, Weiming Zhang, Tianwei Zhang, Nenghai Yu · PDF
  79. Training and Evaluating Language Models with Template-based Data Generation

    Yifan Zhang · PDF
  80. TsKAN: A Transparent Architecture for Improving the Interpretability of Multivariate Time Series Forecasting

    Zechuan Chen, TianMing Sha, Ziyi Tang, Keze Wang · PDF
  81. Understanding Private Learning From Feature Perspective

    Meng Ding, Mingxi Lei, Shaopeng Fu, Di Wang, Jinhui Xu · PDF
  82. Unlocking Post-hoc Dataset Inference with Synthetic Data

    Bihe Zhao, Pratyush Maini, Franziska Boenisch, Adam Dziedzic · PDF
  83. Unstable Unlearning: The Hidden Risk of Concept Resurgence in Diffusion Models

    Vinith Menon Suriyakumar, Rohan Alur, Ayush Sekhari, Manish Raghavan, Ashia C. Wilson · PDF
  84. Utilizing Language Models For Synthetic Knowledge Graph Generation

    Shuran Fu, Peihua Mai, Zhang Jingqi, Yan Pang · PDF
  85. Why Does Private Fine-Tuning Resist Differential Privacy Noise? A Representation Learning Perspective

    Yue Zhao, Xia Yutong, Chendi Wang · PDF