ICLR 2024 · Past · Large language models · Datasets

ICLR 2024 Workshop on Navigating and Addressing Data Problems for Foundation Models

DPFM 2024

Submission deadline
Feb 13, 2024, 01:30 UTC
imported from OpenReview — check the website for extensions
Submission portal
OpenReview
Notes
Auto-imported from the OpenReview venue record on 2026-06-10 — please verify and enrich (topics are keyword-guessed).

Accepted papers (49)

Fetched from OpenReview (v2) on 2026-06-10.

  1. A Tale of Tails: Model Collapse as a Change of Scaling Laws

    Yunzhen Feng, Elvis Dohmatob, Pu Yang, Francois Charton, Julia Kempe · PDF
  2. AdaDemo: Data-Efficient Demonstration Expansion for Generalist Robotic Agent

    Tongzhou Mu, Yijie Guo, Jie Xu, Ankit Goyal, Hao Su, Dieter Fox, Animesh Garg · PDF
  3. Augmenting Math Word Problems via Iterative Question Composing

    Haoxiong Liu, Yifan Zhang, Yifan Luo, Andrew C Yao · PDF
  4. Autonomous Data Selection with Language Models for Mathematical Texts

    Yifan Zhang, Yifan Luo, Yang Yuan, Andrew C Yao · PDF
  5. Beyond Human Data: Scaling Self-Training for Problem-Solving with Language Models

    Avi Singh, John D Co-Reyes, Rishabh Agarwal · PDF
  6. CollabEdit: Towards Non-destructive Collaborative Knowledge Editing

    Jiamu Zheng, Jinghuai Zhang, Futing Wang, Tianyu Du, Tao Lin · PDF
  7. Computational Copyright: Towards A Royalty Model for AI Music Generation Platforms

    Junwei Deng, Jiaqi Ma · PDF
  8. Cookbook: A framework for improving LLM generative abilities via programmatic data generating templates

    Avanika Narayan, Mayee F Chen, Kush Bhatia, Christopher Re · PDF
  9. CRUXEval: A Benchmark for Code Reasoning, Understanding and Execution

    Alex Gu, Baptiste Roziere, Hugh James Leather, Armando Solar-Lezama, Gabriel Synnaeve, Sida Wang · PDF
  10. Data Alignment for Zero-Shot Concept Generation in Dermatology AI

    Soham Gadgil, Mahtab Bigverdi · PDF
  11. DELE: Data Efficient LLM Evaluation

    Gayathri Saranathan, Mahammad Parwez Alam, James Lim, Suparna Bhattacharya, Soon Yee Wong, Martin Foltin, Cong Xu · PDF
  12. Distributional Dataset Distillation with Subtask Decomposition

    Tian Qin, Zhiwei Deng, David Alvarez-Melis · PDF
  13. Does Data Contamination Make a Difference? Insights from Intentionally Contaminating Pre-training Data For Language Models

    Minhao Jiang, Ken Liu, Ming Zhong, Rylan Schaeffer, Siru Ouyang, Jiawei Han, Sanmi Koyejo · PDF
  14. Don't Label Twice: Quantity Beats Quality when Comparing Binary Classifiers on a Budget

    Florian E. Dorner, Moritz Hardt · PDF
  15. Efficient Global Data Attribution for Diffusion Models

    · PDF
  16. Enhancing Data Quality in Federated Fine-Tuning of Large Language Models

    Wanru Zhao, Yaxin Du, Nicholas Donald Lane, Siheng Chen, Yanfeng Wang · PDF
  17. Evaluating Large Language Models in an Emerging Domain: A Pilot Study in Decentralized Finance

    Joshua Carter Pearlson, Xiaoyuan Liu, Chengsong Huang, Kripa Ann George, Dawn Song, Chenguang Wang · PDF
  18. Exploiting Cultural Biases via Homoglyphs in Text-to-Image Synthesis

    Lukas Struppek, Dominik Hintersdorf, Felix Friedrich, Manuel Brack, Patrick Schramowski, Kristian Kersting · PDF
  19. Follow My Instruction and Spill the Beans: Scalable Data Extraction from Retrieval-Augmented Generation Systems

    Zhenting Qi, Hanlin Zhang, Eric P. Xing, Sham M. Kakade, Himabindu Lakkaraju · PDF
  20. Hallucination Augmented Recitations for Language Models

    Abdullatif Köksal, Renat Aksitov, Chung-Ching Chang · PDF
  21. How to Craft Backdoors with Unlabeled Data Alone?

    Yifei Wang, Wenhan Ma, Stefanie Jegelka, Yisen Wang · PDF
  22. Improving Practical Counterfactual Fairness with Limited Causal Knowledge

    Zeyu Zhou, Ruqi Bai, David I. Inouye · PDF
  23. Incentivizing Inclusive Data Contributions in Personalized Federated Learning

    Enpei Zhang, Jingyi Chai, Rui Ye, Yanfeng Wang, Siheng Chen · PDF
  24. Intent-based Prompt Calibration: Enhancing prompt optimization with synthetic boundary cases

    Elad Levi, Eli Brosh, Matan Friedmann · PDF
  25. Label-free Neural Semantic Image Synthesis

    Jiayi Wang, Kevin Alexander Laube, Yumeng Li, Jan Hendrik Metzen, Shin-I Cheng, Julio Borges, Anna Khoreva · PDF
  26. LESS: Selecting Influential Data for Targeted Instruction Tuning

    Mengzhou Xia, Sadhika Malladi, Suchin Gururangan, Sanjeev Arora, Danqi Chen · PDF
  27. LongForm: Effective Instruction Tuning with Reverse Instructions

    Abdullatif Köksal, Timo Schick, Anna Korhonen, Hinrich Schuetze · PDF
  28. Model & Data Insights using Pre-trained Language Models

    Saeid Asgari, Aliasghar Khani, Amir Hosein Khasahmadi, Aditya Sanghi, Karl D.D. Willis, Ali Mahdavi Amiri · PDF
  29. Multimodal Dataset Upgrading: a New Challenge for Data Annotation

    Haiwen Huang, Dan Zhang, Andreas Geiger · PDF
  30. On the Scalability of GNNs for Molecular Graphs

    Maciej Sypetkowski, Frederik Wenkel, Farimah Poursafaei, Nia Dickson, Karush Suri, Philip Fradkin, Dominique Beaini · PDF
  31. OpenFedLLM: Training Large Language Models on Decentralized Private Data via Federated Learning

    Rui Ye, WenHao Wang, Jingyi Chai, Dihan Li, Zexi Li, Yinda Xu, Yaxin Du, Yanfeng Wang, Siheng Chen · PDF
  32. Peering Through Preferences: Unraveling Feedback Acquisition for Aligning Large Language Models

    Hritik Bansal, John Dang, Aditya Grover · PDF
  33. Perplexed by Perplexity: Perplexity-Based Pruning with Small Reference Models

    · PDF
  34. Pre-training Concept Frequency is predictive of CLIP Zero-shot Performance

    Vishaal Udandarao, Ameya Prabhu, Philip Torr, Adel Bibi, Samuel Albanie, Matthias Bethge · PDF
  35. Promises and Pitfalls of Generative Masked Language Modeling: Theoretical Framework and Practical Guidelines

    Yuchen Li, Alexandre Kirchmeyer, Aashay Mehta, Yilong Qin, Boris Dadachev, Kishore A Papineni, Sanjiv Kumar, Andrej Risteski · PDF
  36. Prompt Optimization with Logged Bandit Data

    Haruka Kiyohara, Yuta Saito, Daniel Yiming Cao, Thorsten Joachims · PDF
  37. QuRating: Selecting High-Quality Data for Training Language Models

    Alexander Wettig, Aatmik Gupta, Saumya Malik, Danqi Chen · PDF
  38. Rephrasing the Web: A Recipe for Compute and Data-Efficient Language Modeling

    Pratyush Maini, Skyler Seto, He Bai, David Grangier, Yizhe Zhang, Navdeep Jaitly · PDF
  39. Scaling Laws for Downstream Task Performance of Large Language Models

    Berivan Isik, Natalia Ponomareva, Hussein Hazimeh, Dimitris Paparas, Sergei Vassilvitskii, Sanmi Koyejo · PDF
  40. Shadowcast: Stealthy Data Poisoning Attacks Against Vision-Language Models

    Yuancheng Xu, Jiarui Yao, Manli Shu, Yanchao Sun, Zichu Wu, Ning Yu, Tom Goldstein, Furong Huang · PDF
  41. Sparkles: Unlocking Chats Across Multiple Images for Multimodal Instruction-Following Models

    Yupan Huang, Zaiqiao Meng, Fangyu Liu, Yixuan Su, Nigel Collier, Yutong Lu · PDF
  42. The Science of Data Filtering: Data Curation cannot be Compute Agnostic

    Sachin Goyal, Pratyush Maini, Zachary Chase Lipton, Aditi Raghunathan, J Zico Kolter · PDF
  43. TOFU: A Task of Fictitious Unlearning for LLMs

    Pratyush Maini, Zhili Feng, Avi Schwarzschild, Zachary Chase Lipton, J Zico Kolter · PDF
  44. Toward Data-driven Skill Identification for General-purpose Vision-language Models

    Anthony Tiong, Junqi Zhao, Junnan Li, Steven Hoi, Caiming Xiong, Boyang Li · PDF
  45. Towards Unbiased Evaluation of Detecting Unanswerable Questions in EHRSQL

    Yongjin Yang, Sihyeon Kim, SangMook Kim, Gyubok Lee, Se-Young Yun, Edward Choi · PDF
  46. VideoCon: Robust Video-Language Alignment via Contrast Captions

    Hritik Bansal, Yonatan Bitton, Idan Szpektor, Kai-Wei Chang, Aditya Grover · PDF
  47. Virtual Classifier: A Reversed Approach for Robust Image Evaluation

    Jizhe Zhang, Yifei Wang, Yisen Wang · PDF
  48. West-of-N: Synthetic Preference Generation for Improved Reward Modeling

    Alizée Pace, Jonathan Mallinson, Eric Malmi, Sebastian Krause, Aliaksei Severyn · PDF
  49. What's in Your "Safe" Data?: Identifying Benign Data that Breaks Safety

    Luxi He, Mengzhou Xia, Peter Henderson · PDF