ICLR 2026 · Past · Large language models · Datasets

ICLR 2026 Workshop on Navigating and Addressing Data Problems for Foundation Models

ICLR 2026 Workshop DATA-FM

Submission deadline
Feb 8, 2026, 23:59 AoE (UTC−12)
imported from OpenReview — check the website for extensions
Submission portal
OpenReview
Notes
Auto-imported from the OpenReview venue record on 2026-06-10 — please verify and enrich (topics are keyword-guessed).

Accepted papers (124)

Fetched from OpenReview (v2) on 2026-06-10.

  1. [Short] A Formal Language Benchmark for LLMs

    Bishwamittra Ghosh, Krishna P. Gummadi, Evimaria Terzi · PDF
  2. [Short] Beyond Data Size: Exploring the Impact of Dataset Diversity and Density in Self-Distillation Learning

    Alvard Barseghyan, Ani Vanyan, Hakob Tamazyan, Hrant Khachatrian · PDF
  3. [Short] Downstream Effects of Translation Scale with Language Difficulty

    Aditya V. Kulkarni, Dharmam Savani, Ammar Ahmed Pallikonda Latheef, Pritam Mukherjee, Jacob M Luber, Paul Yi · PDF
  4. [Short] DSL-Monkeys: Self-Generated In-Context Examples for Low-Resource GPU DSL Kernels

    Nathan Paek, Simon Guo, Vishnu Sarukkai, Willy Chan, William Hu, Ethan Boneh, Simran Arora, Ludwig Schmidt, Kayvon Fatahalian, Azalia Mirhoseini · PDF
  5. [Short] Exploration into gradient-based coreset methods for targeted subset selection

    Evelyn Zhu, Neha Hulkund, Sara Beery · PDF
  6. [Short] Few-Shot Cross-Table Data Mixture in Tabular In-Context Learning: Benefits, Failure Modes, and Alignment

    Jia-Wei Liao, Kuan-Yu Chen, Yu-Chen Den, Tien-Hao Chang · PDF
  7. [Short] Less is More: On Data Redundancy in VLA Training

    Kevin Yang, Tony Yang · PDF
  8. [Short] Max It or Miss It: Benchmarking LLM On Solving Extremal Problems

    Binxin Gao, Jingjun Han · PDF
  9. [Short] Motion Attribution for Video Generation

    Xindi Wu, Despoina Paschalidou, Jun Gao, Antonio Torralba, Laura Leal-Taixé, Olga Russakovsky, Sanja Fidler, Jonathan Lorraine · PDF
  10. [Short] RIFT: A RubrIc Failure Mode Taxonomy and Automated Diagnostics

    Zhengyang Qi, Charles Andrew Dickens, Derek Pham, Amanda Dsouza, Armin Parchami, Frederic Sala, Paroma Varma · PDF
  11. [Short] STRIDE: Training data attribution can be estimated in activation space

    Abir Harrasse, Rishit Dagli, Amir Abdullah, Zhijing Jin · PDF
  12. [Short] Studying Memorization Dynamics in Large Language Models Across Pre-Training

    Kaustubh Ponkshe, Raghav Singhal, Daniele Affinita, Martin Jaggi · PDF
  13. [Short] Towards Large-Scale Heterogeneous Data Organization for Scientific Foundation Models: A Nuclear Fusion Case Study

    Nathaniel Chen, Kouroche Bouchiat, Peter Steiner, Azarakhsh Jalalvand, SangKyeun Kim, Egemen Kolemen · PDF
  14. [Short] Where Does Olmo Get Its Values?

    Xiaoqing Sun, Arthur Conmy, Joshua Engels · PDF
  15. [Short] Active Learning for Scalable Data Selection in Instruction Tuning

    Lalchand Pandia · PDF
  16. A Unified Theory of Random Projection for Influence Functions

    Pingbang Hu, Yuzheng Hu, Jiaqi W. Ma, Han Zhao · PDF
  17. Actor-curator: A Principled Approach to Online Data Selection for RL Post-training

    Zhengyao Gu, Jonathan Light, Raul Astudillo, Ziyu Ye, Langzhou He, Wei Cheng, Santiago Paternain, Philip S. Yu, Yisong Yue · PDF
  18. AdaProb: Efficient Machine Unlearning via Adaptive Probability

    Zihao Zhao, Yuchen Yang, Anjalie Field, Yinzhi Cao · PDF
  19. Adaptive Structured Transformation: Mitigating Distribution Shift in Dense Retrieval Through Training-Time Preprocessing

    Xinyan Velocity Yu, Harsh Jhamtani, Soham Dan, Benjamin Van Durme, Patrick Xia · PDF
  20. Adversarial Arena: Crowdsourcing Data Generation through Interactive Competition

    Prasoon Goyal, Sattvik Sahai, Michael Johnston, Hangjie Shi, Yao Lu, Shaohua Liu, Anna Rumshisky, Rahul Gupta, Anna Gottardi, Desheng Zhang, Lavina Vaz, Leslie Ball, Lucy Hu, Luke Dai, Samyuth Sagi, Maureen Murray, Sankaranarayanan Ananthakrishnan · PDF
  21. AgenticMath: Enhancing LLM Reasoning via Agentic-based Math Data Generation

    Xianyang Liu, Yilin Liu, Shuai Wang, Hao Cheng, Andrew Estornell, Yuzhi Zhao, Jun Shu, Jiaheng Wei · PDF
  22. AI Scientist Via Synthetic Task Scaling

    Ziyang Cai, Harkirat Behl · PDF
  23. An Empirical Study on Noisy Data and LLM Pretraining Loss Divergence

    Qizhen Zhang, Ankush Garg, Jakob Nicolaus Foerster, Niladri S. Chatterji, Kshitiz Malik, Mike Lewis · PDF
  24. Anchored Decoding: Provably Reducing Copyright Risk for Any Language Model

    Jacqueline He, Jonathan Hayase, Wen-tau Yih, Sewoong Oh, Luke Zettlemoyer, Pang Wei Koh · PDF
  25. Are Easier or Harder Examples Better? Rethinking Data Selection for Reward Models and Preference Optimization

    Kevin Christian Wibisono, Aya Abdelsalam Ismail, Pedro O. Pinheiro, Yixin Wang, Kyunghyun Cho, Natasa Tagasovska, Rajesh Ranganath · PDF
  26. ATLAS: Adaptive Transfer Scaling Laws for Multilingual Pretraining, Finetuning, and Decoding the Curse of Multilinguality

    Shayne Longpre, Sneha Kudugunta, Niklas Muennighoff, I-Hung Hsu, Isaac Rayburn Caswell, Alex Pentland, Sercan O Arik, Chen-Yu Lee, Sayna Ebrahimi · PDF
  27. Auditing Preference-Based Post-Training of LLMs via Strong Membership Inference Attacks

    Lorenzo Rossi, Kaif Shaikh, Franziska Boenisch, Adam Dziedzic · PDF
  28. Benign Overfitting in Adversarial Training for Vision Transformers

    Jiaming Zhang, Meng Ding, Shaopeng Fu, Jingfeng Zhang, Di Wang · PDF
  29. Beyond Training for Cultural Awareness: The Role of Dataset Linguistic Structure in Large Language Models

    Reem I. Masoud, Chen Feng, Shunta Asano, Saied Alshahrani, Philip Colin Treleaven, Miguel R. D. Rodrigues · PDF
  30. Bridging the Sim-to-real Gap in RF Localization with Large-Scale Synthetic Pretraining

    Armen Manukyan, Rafayel Mkrtchyan, Ararat Saribekyan, Theofanis Raptis, Hrant Khachatrian · PDF
  31. Combating Data Laundering in LLM Training

    Muxing Li, Zesheng Ye, Sharon Li, Feng Liu · PDF
  32. Configuration-to-Performance Scaling Law with Neural Ansatz

    Huaqing Zhang, Kaiyue Wen, Tengyu Ma · PDF
  33. Context-Aware Criteria Generation with VLMs for Advertisement Ranking under Data Scarcity

    Kyungho Kim, Yeonje Choi, Gyurim Hwang, Sejin Chung, Hongseok Lee, Myeong Ho Song, Yeongho Kim, Sunwoo Kim, Jongha Lee, Juyeon Kim, Kijung Shin · PDF
  34. Conv-to-Bench: Evaluating Language Models Via User–Assistant Dialogues In Code Tasks

    Victor Moreli dos Santos, André Cerqueira Castro, Samuel Lopes de Souza Toledo, Bruno Moreira Lavalli Calura, Lisandra Cristina de Moura Menezes, Raul César Reis Mata, Telma Woerle de Lima Soares, Bryan Lincoln Marques de Oliveira · PDF
  35. Cram Less to Fit More: Training Data Pruning Improves Memorization of Facts

    Jiayuan Ye, Vitaly Feldman, Kunal Talwar · PDF
  36. Data Provenance for Image Auto-Regressive Generation

    Bihe Zhao, Louis Kerner, Michel Meintz, Tameem Bakr, Franziska Boenisch, Adam Dziedzic · PDF
  37. Data Selection for Fine-tuning Vision Language Models via Cross Modal Alignment Trajectories

    Dang Nguyen, Nilay Naharas, Neslihan Bulut, Mohammadhossein Bateni, Vahab Mirrokni, Baharan Mirzasoleiman · PDF
  38. DISCO: Diversifying Sample Condensation for Efficient Model Evaluation

    Alexander Rubinstein, Benjamin Raible, Martin Gubri, Seong Joon Oh · PDF
  39. Do RDB Foundation Models Even Need Data?

    Linjie Xu, Yanlin Zhang, Quan Gan, Minjie Wang, David Wipf · PDF
  40. DSGym: A Standardized and Holistic Framework for Advancing Data Science Agents

    Fan Nie, Junlin Wang, Harper Hua, Federico Bianchi, Yongchan Kwon, Zhenting Qi, Owen Queen, Shang Zhu, James Zou · PDF
  41. DUMP: Distribution-Level Curriculum Learning for RL-based LLM Post-training

    Zhenting Wang, Guofeng Cui, Yu-Jhe Li, Kun Wan, Wentian Zhao · PDF
  42. EnterpriseOps-Gym: Environments and Evaluations for Stateful Agentic Planning and Tool Use in Enterprise Settings

    Shiva Krishna Reddy Malay, Shravan Nayak, Aman Tiwari, Jishnu Sethumadhavan Nair, Sathwik Tejaswi Madhusudhan, Sagar Davasam, Srinivas Sunkara, Sai Rajeswar · PDF
  43. EPSVec: Efficient and Private Synthetic Data Generation via Dataset Vectors

    Amin Banayeeanzade, Qingchuan Yang, Deqing Fu, Spencer Hong, Erin Babinsky, Alfy Samuel, Anoop Kumar, Robin Jia, Sai Praneeth Karimireddy · PDF
  44. Escaping Model Collapse via Synthetic Data Verification: Near-term Improvements and Long-term Convergence

    Bingji Yi, Qiyuan Liu, Yuwei Cheng, Haifeng Xu · PDF
  45. ESDAE: Evaluating Synthetic Data for Agent Evaluation

    Shuaiqi Wang, Aadyaa Maddi, Zinan Lin, Giulia Fanti · PDF
  46. Evaluating Frontier Agents on End-to-End Investment Banking Workflows

    Elaine Lau, Rosemary Wei, Guram Gogia, Ronak Chaudhary, Yi Liu, Saed Qunbar, Hui Wen Goh, Scott Millslagle, Samuel Eshun Danquah, Punit Arani, Ray Epps, Markus Dücker, Abdullah Arif, Asrith Devalaraju, Varsha Sandadi, Haemi Nam, Sahil Bhaiwala, Skyler Wang, Anish Athalye, Jonas Mueller, Francisco Guzmán · PDF
  47. Evaluating Language Models in Realistic Conversational Contexts

    Ilija Subasic, Andrew Rabinovich, Zhao Chen · PDF
  48. Federated Agent Reinforcement Learning

    Canyu Chen, Kangyu Zhu, Zhaorun Chen, Zhanhui Zhou, Shizhe Diao, Yiping Lu, Tian Li, Manling Li, Dawn Song · PDF
  49. gen2seg: Generative Models Enable Generalizable Instance Segmentation

    Om Khangaonkar, Hamed Pirsiavash · PDF
  50. Geometry-Preserving Coresets for Quantized Foundation Models in Remote Sensing

    Tushar Shinde · PDF
  51. GraphPFN: A Prior-Data Fitted Graph Foundation Model

    Dmitry Eremeev, Oleg Platonov, Gleb Bazhenov, Artem Babenko, Liudmila Prokhorenkova · PDF
  52. Greedy Information Projection for LLM Data Selection

    Victor Ye Dong, Kuan-Yun Lee, Jiamei Shuai, Shengfei Liu, Yi Liu, Jian Jiao · PDF
  53. Guess the unified model: Domain and Linguistic Effects in Generated Images

    Jasin Cekinmez, Ryo Mitsuhashi, Yida Yin · PDF
  54. GUIrilla: A Scalable Framework for Automated Desktop UI Exploration

    Sofiya Garkot, Maksym Shamrai, Ivan Synytsia, Mariya Hirna · PDF
  55. Hierarchical Agenda Reasoning for Strategic Multi-Turn Dialogue Agents

    Marwa Abdulhai, Ryan Cheng, Aryansh Shrivastava, Aviral Kumar, Sergey Levine · PDF
  56. Hubble: a Model Suite to Advance the Study of LLM Memorization

    Johnny Wei, Ameya Godbole, Mohammad Aflah Khan, Ryan Yixiang Wang, Xiaoyuan Zhu, James Flemings, Nitya Kashyap, Krishna P. Gummadi, Willie Neiswanger, Robin Jia · PDF
  57. ImageNet-Think-250K: A Large-Scale Synthetic Dataset for Multimodal Reasoning for Vision Language Models

    Krishna Teja Chitty-Venkata, Murali Emani · PDF
  58. In-Run Data Shapley for Adam Optimizer

    Meng Ding, Zeqing Zhang, Di Wang, Lijie Hu · PDF
  59. Induction Signatures Are Not Enough: A Matched-Compute Study of Load-Bearing Structure in In-Context Learning

    Mohammed Sabry Mohammed, Anya Belz · PDF
  60. Inference-Time Distillation: Cost-Efficient Agents Without Fine-Tuning or Manual Prompt Engineering

    Vishnu Sarukkai, Asanshay Gupta, James Hong, Michaël Gharbi, Kayvon Fatahalian · PDF
  61. Infusion: Shaping Model Behavior by Editing Training Data via Influence Functions

    J Rosser, Robert Kirk, Edward Grefenstette, Jakob Nicolaus Foerster, Laura Ruis · PDF
  62. Is More Data Worth the Cost? Dataset Scaling Laws in a Tiny Attention-Only Decoder

    Götz-Henrik Wiegand, Lorena Raichle, Rico Staedeli, Tomas Hrycej, Bernhard Bermeitinger, Siegfried Handschuh · PDF
  63. jina-vlm: Small Multilingual Vision Language Model

    Andreas Koukounas, Georgios Mastrapas, Florian Hönicke, Sedigheh Eslami, Guillaume Roncari, Scott Martens, Han Xiao · PDF
  64. Language Self-Play For Data-Free Training

    Jakub Grudzien Kuba, Mengting Gu, Qi Ma, Yuandong Tian, Vijai Mohan, Chun-cheng Jason Chen · PDF
  65. Learning from Synthetic Data Improves Multi-hop Reasoning

    Anmol Kabra, Yilun Yin, Albert Gong, Kamilė Stankevičiūtė, Dongyoung Go, Johann Lee, Katie Z Luo, Carla P Gomes, Kilian Q Weinberger · PDF
  66. LEGALMIDM: Use-Case-Driven Legal Domain Specialization for Korean Large Language Model

    Youngjoon Jang, Chanhee Park, Hyeonseok Moon, Young-kyoung Ham, Jiwon Moon, Jinhyeon Kim, JuKyung Jung, Heuiseok Lim · PDF
  67. Less is More: Adaptive Coverage Sampling for Synthetic Training Data

    Sasan Tavakkol, Max Springer, Mohammadhossein Bateni, Vincent Cohen-Addad, Neslihan Bulut, MohammadTaghi Hajiaghayi · PDF
  68. Matched Data, Better Models: Target Aligned Data Filtering with Sparse Autoencoders

    Arnav Mohanty Das, Gantavya Bhatt, Sahil Verma, Yiping Wang, Viswa Virinchi Muppirala, Jeff Bilmes · PDF
  69. Measuring Dataset Diversity from a Geometric Perspective

    Yang Ba, Mohammad Sadeq Abolhasani, Michelle V Mancenido, Rong Pan · PDF
  70. Mix Early, Forget Less: Data Mixing During Pretraining Builds Resistance to Forgetting

    Lawrence Feng, Gaurav Rohit Ghosal, Jacob Mitchell Springer, Ziqian Zhong, Aditi Raghunathan · PDF
  71. MixAtlas: Uncertainty-aware Data Mixture for Multimodal LLM Midtraining

    Bingbing Wen, Sirajul Salekin, Feiyang Kang, Bill Howe, Lucy Lu Wang, Javier Movellan, Manjot Bilkhu · PDF
  72. MixtureVitae: Open Web-Scale Pretraining Dataset With High Quality Instruction and Reasoning Data Built from Permissive-First Text Sources

    Huu Nguyen, Victor May, Harsh Raj, Marianna Nezhurina, Yishan Wang, Yanqi Luo, Vu Minh Chien, Taishi Nakamura, Ken Tsui, Van Khue Nguyen, David Salinas, Aleksandra Krasnodębska, Christoph Schuhmann, Mats Leon Richter, Xuan-Son Vu, Jenia Jitsev · PDF
  73. MobileKernelBench: Can LLMs Write Efficient Kernels for Mobile Devices?

    Xingze Zou, Jing Wang, Yuhua Zheng, Xueyi Chen, Haolei Bai, Lingcheng Kong, Syed A.R. Abu-Bakar, Zhaode Wang, Chengfei Lv, Haoji Hu, Huan Wang · PDF
  74. Motion Capture is Not the Target Domain: Scaling Synthetic Data for Learning Motion Representations

    Firas Darwish, George Nicholson, Aiden Doherty, Hang Yuan · PDF
  75. Multi-Modal Data Mixtures via Latent Space Coupling for Vision-Language Model Training

    Wanyun Xie, Francesco Tonin, Volkan Cevher · PDF
  76. Multimodal Data Curation Through Ranked Retrieval

    Pratyush Muthukumar, Harshil Kotamreddy, Sarah Amiraslani, Tomo Kanazawa, Ramani Akkati, Shaan Jain, Andrew Mathau · PDF
  77. Non-Local Data Attribution for On-policy Reinforcement Learning

    Shixuan Liu, Yuzheng Hu, Han Zhao, Jiaqi W. Ma · PDF
  78. OASIS: Online Sample Selection for Continual Instruction Tuning

    Minjae Lee, Minhyuk Seo, Tingyu Qu, Tinne Tuytelaars, Jonghyun Choi · PDF
  79. Olmix: A Framework for Data Mixing Throughout LM Development

    Mayee F Chen, Tyler Murray, David Heineman, Matt Jordan, Hannaneh Hajishirzi, Christopher Re, Luca Soldaini, Kyle Lo · PDF
  80. On the Strengths and Weaknesses of Data for Open-Set Embodied Assistance

    Pradyumna Tambwekar, Andrew Silva, Deepak Edakkattil Gopinath, Jonathan DeCastro, Xiongyi Cui, Guy Rosman · PDF
  81. Open LLM Projects Should Allocate More Compute for Data Than Training

    Maximilian Idahl · PDF
  82. Optimal Splitting of Language Models from Mixtures to Specialized Domains

    Skyler Seto, Pierre Ablin, Anastasiia Filippova, Jiayuan Ye, Louis Béthune, Angelos Katharopoulos, David Grangier · PDF
  83. OPUS: Towards Principled and Scalable Data Selection for Large Language Model Pre-training in Every Iteration

    Shaobo Wang, Xuan Ouyang, Tianyi Xu, Yuzheng Hu, Jialin Liu, Guo Chen, Tianyu Zhang, Junhao Zheng, Kexin Yang, Xingzhang Ren, Dayiheng Liu, Linfeng Zhang · PDF
  84. OR-LLM-Bench: A Pipeline for Scalable and Verifiable Text-to-Optimization Synthesis

    Zhiqi Gao, Albert Ge, Alexander Michael Berenbeim, Nathaniel D. Bastian, Frederic Sala · PDF
  85. Overcoming the Scarcity of Verifiable Reasoning Data with Decision Pivots

    Dongkyu Cho, Amy B.Z. Zhang, Bilel Fehri, Sheng Wang, Rumi Chunara, Hengrui Cai, Rui Song · PDF
  86. PluRel: Synthetic Data unlocks Scaling Laws for Relational Foundation Models

    Vignesh Kothapalli, Rishabh Ranjan, Valter Hudovernik, Vijay Prakash Dwivedi, Johannes Hoffart, Carlos Guestrin, Jure Leskovec · PDF
  87. Positive Mining from LLM Seeds: A Semi-Supervised Graph Based Approach to Train Rare Event Classifiers

    Sasan Tavakkol, Lin Chen, Max Springer, Abigail Schantz, Blaž Bratanič, Vincent Cohen-Addad, Mohammadhossein Bateni · PDF
  88. Privacy Collapse: Benign Fine-Tuning Can Break Contextual Privacy in Language Models

    Anmol Goel, Cornelius Emde, Sangdoo Yun, Seong Joon Oh, Martin Gubri · PDF
  89. Private Linear Regression via a Down-Sensitivity to Privacy Reduction

    Ittai Rubinstein, Chris Ge, Samuel B. Hopkins · PDF
  90. Privileged Information Distillation for Language Models

    Emiliano Penaloza, Dheeraj Vattikonda, Nicolas Gontier, Alexandre Lacoste, Laurent Charlin, Massimo Caccia · PDF
  91. propella-1: Multi-Property Document Annotation for LLM Data Curation at Scale

    Maximilian Idahl, Benedikt Droste, Björn Plüster, Jan Philipp Harries · PDF
  92. Query-based Model Collaboration Enables Expert-level Clinical Text Augmentation

    Dongkyu Cho, Miao Zhang, Gregory D Lyng, Rumi Chunara · PDF
  93. RelBench v2: A Large-Scale Benchmark and Relational Data Repository

    Justin Gu, Rishabh Ranjan, Charilaos I. Kanatsoulis, Haiming Tang, Martin Jurkovič, Valter Hudovernik, Mark Znidar, Pranshu Chaturvedi, Parth Shroff, Fengyu Li, Jure Leskovec · PDF
  94. Rescaled Influence Functions: Accurate Data Attribution in High Dimension

    Ittai Rubinstein, Samuel B. Hopkins · PDF
  95. Resource-Adaptive Federated Text Generation with Differential Privacy

    Jiayi Wang, John Gounley, Heidi Hanson · PDF
  96. Rethinking Data Curation in LLM Training: Online Reweighting Offers Better Generalization than Offline Methods

    Wanru Zhao, Yihong Chen, Yuzhi Tang, Wentao Ma, Shengchao Hu, Shell Xu Hu, Alex Iacob, Abhinav Mehrotra, Nicholas D. Lane · PDF
  97. Rethinking Data Selection: The Importance of Coverage over Difficulty in Generative Fine-Tuning

    Lalchand Pandia, Kanishka Misra, Allyson Ettinger · PDF
  98. ROSER: Few-Shot Robotic Sequence Retrieval for Scalable Robot Learning

    Zillur Rahman, Eddison Pham, Alejandro Daniel Noel, Cristian Meo · PDF
  99. RubricRobustness: Evaluating the Sensitivity of Rubrics-Based Benchmarks to Simple Perturbations

    Manasi Sharma, Brad Kenstler, Bing Liu · PDF
  100. Shape of Thought: When Distribution Matters More than Correctness in Reasoning Tasks

    Abhranil Chandra, Ayush Agrawal, Arian Hosseini, Sebastian Fischmeister, Rishabh Agarwal, Navin Goyal, Aaron Courville · PDF
  101. SpreadsheetArena: Decomposing Preference in LLM Generation of Spreadsheet Workbooks

    Srivatsa R Kundurthy, Clara Na, Michael Handley, Zach Kirshner, Chen Bo Calvin Zhang, Manasi Sharma, Emma Strubell, John Ling · PDF
  102. Structured Captions Improve Prompt Adherence in Text-to-Image Models (Re-LAION-Caption 19M)

    Nicholas Merchant, Haitz Sáez de Ocáriz Borde, Andrei Cristian Popescu, Carlos Garcia Jurado Suarez · PDF
  103. SynQuE: Estimating Synthetic Dataset Quality Without Annotations

    Arthur Chen, Victor Zhong · PDF
  104. Task Scarcity and Label Leakage in Relational Transfer Learning

    Francisco Galuppo Azevedo, Clarissa Lima Loures, Denis Oliveira Correa · PDF
  105. Test-Time Meta-Adaptation with Self-Synthesis

    Zeyneb N. Kaya, Nick Rui · PDF
  106. The Capability Frontier: Benchmarks Miss 82% of Model Performance

    Bradley Fowler, Ryan Smith, Daniel Thi Graviet, William Myers, Joshua Greaves, Narmeen Fatimah Oozeer, Antía García, Philip Quirke, Fazl Barez, Amir Abdullah, Shriyash Kaustubh Upadhyay · PDF
  107. The Chicken and Egg Dilemma: Co-optimizing Data and Model Configurations for LLMs

    Zhiliang Chen, Alfred Wei Lun Leong, Shao Yong Ong, Apivich Hemachandra, Gregory Kang Ruey Lau, Chuan-Sheng Foo, Zhengyuan Liu, Nancy F. Chen, Bryan Kian Hsiang Low · PDF
  108. The Era of Real-World Human Interaction: RL from User Conversations

    Chuanyang Jin, Jing Xu, Bo Liu, Leitian Tao, Olga Golovneva, Tianmin Shu, Wenting Zhao, Xian Li, Jason E Weston · PDF
  109. The Silent Brush: Artistic Style Leakage in AI Art Generation

    Ninad Joshi, Ashutosh Ranjan, Vivek Srivastava, Shirish Karande · PDF
  110. The Viability Boundary of Differential Privacy

    Arinbjörn Kolbeinsson, Benedikt Kolbeinsson · PDF
  111. Toward Cross-Lingual Quality Classifiers for Multilingual Pretraining Data Selection

    Yassine Turki, Vinko Sabolčec, Bettina Messmer, Martin Jaggi · PDF
  112. Toward Evaluating Model Collapse in LLMs: Insights from Continual Pretraining

    Kristian Minchev, Anton Alexandrov, Martin Vechev, Nikola Konstantinov · PDF
  113. Train Smarter, Not Longer: Memorization-Guided Data Reuse for Efficient LLM Training

    Jingwei Zuo, Ilyas Chahed, Maksim Velikanov, Cong Zeng, Dhia Eddine Rhaiem, Pasquale Balsebre, Abhay Kumar, Younes Belkada, Hakim Hacid · PDF
  114. TRIM: Token-Budgeted Data Mining for Instruction Tuning

    Md Muntaqim Meherab, SALMAN, Naimur Rahman, Md. Maruf Billah, Tanvirul Islam, Dr. Fernaz Narin Nur, Md. Hasanuzzaman Dipu · PDF
  115. TutorBench: A Benchmark To Assess Tutoring Capabilities Of Large Language Models

    Rakshith Sharma Srinivasa, Zora Che, Chen Bo Calvin Zhang, Diego A. Mares Buendia, Ernesto Gabriel Hernández Montoya, Jayeon Park, Dean Lee, Guillermo A. Mangialardi, Charmaine Ng, Ed-Yeremai Hernandez-Cardona, Anisha Gunjal, Yunzhong He, Bing Liu, Chen Xing · PDF
  116. Understanding the Impact of Differentially Private Training on Memorization of Long-Tailed Data

    Jiaming Zhang, Huanyi Xie, Meng Ding, Shaopeng Fu, Jinyan Liu, Di Wang · PDF
  117. Unified Evaluation of Table Embedding Methods Across Multiple Benchmark Scenarios

    Ali Younes, Saeed Ghoorchian, Maximilian Schambach, Johannes Höhne · PDF
  118. Unmasking LAION-5B: Age, Gender, Race, and Emotion Biases in Large-Scale Image Datasets

    Iris Dominguez-Catena, Mikel Galar, Daniel Paternain · PDF
  119. Verifying the Verifiers: Failure Attribution for Benchmark Diagnostics and Training Data Curation

    Jesse Hu, Pratyush Shukla, Ke Huang, Meji Abidoye · PDF
  120. Visual Compositional Tuning

    Xindi Wu, Hee Seung Hwang, Polina Kirichenko, Esin Tureci, Olga Russakovsky · PDF
  121. VULCAN: Where Agents Learn by Living in Simulated Tool Environments

    Amir Saeidi, Chitta Baral, Ahmed Hassan Awadallah, Harkirat Behl · PDF
  122. When do Score-Based Data Valuation Methods Work, and Why?

    Kumar Kshitij Patel, Sai Praneeth Karimireddy, Raul Castro Fernandez, Manolis Zampetakis · PDF
  123. Why Diffusion Language Models Struggle with Truly Parallel (Non-Autoregressive) Decoding?

    Pengxiang Li, Dilxat Muhtar, Tianlong Chen, Lu Yin, Shiwei Liu · PDF
  124. Winning the Pruning Gamble: A Unified Approach to Joint Sample and Token Pruning for Efficient Supervised Fine-Tuning

    Shaobo Wang, Jiaming Wang, Jiajun Zhang, Cong Wang, Yue Min, Zichen Wen, Xingzhang Ren, Fei Huang, Huiqiang Jiang, Junyang Lin, Dayiheng Liu, Linfeng Zhang · PDF