ICML 2025

Tokenization Workshop

TokShop

Submission deadline
Jun 1, 2025, 11:59 UTC
imported from OpenReview — check the website for extensions
Submission portal
OpenReview
Notes
Auto-imported from the OpenReview venue record on 2026-06-10 — please verify and enrich (topics are keyword-guessed).

Accepted papers (37)

Fetched from OpenReview (v2) on 2026-06-10.

  1. Adversarial Tokenization

    Renato Geh, Zilei Shao, Guy Van den Broeck · PDF
  2. BPE Stays on SCRIPT: Structured Encoding for Robust Multilingual Pretokenization

    Sander Land, Catherine Arnett · PDF
  3. Broken Tokens? Your Language Model can Secretly Handle Non-Canonical Tokenizations

    Brian Siyuan Zheng, Alisa Liu, Orevaoghene Ahia, Jonathan Hayase, Yejin Choi, Noah A. Smith · PDF
  4. Byte Latent Transformer: Patches Scale Better Than Tokens

    Artidoro Pagnoni, Ramakanth Pasunuru, Pedro Rodriguez, John Nguyen, Benjamin Muller, Margaret Li, Chunting Zhou, Lili Yu, Jason E Weston, Luke Zettlemoyer, Gargi Ghosh, Mike Lewis, Ari Holtzman, Srini Iyer · PDF
  5. Byte-level Tokenizers Unavoidably Enable LLMs to Generate Ill-formed UTF-8

    Preston Firestone, Shubham Ugare, Gagandeep Singh, Sasa Misailovic · PDF
  6. ByteSpan: Information-Driven Subword Tokenisation

    Zebulon Goriely, Suchir Salhan, Pietro Lesci, Julius Cheng, Paula Buttery · PDF
  7. Canonical Autoregressive Generation

    Ivi Chatzi, Nina L. Corvelo Benz, Stratis Tsirtsis, Manuel Gomez Rodriguez · PDF
  8. CAT: Content-Adaptive Image Tokenization

    Junhong Shen, Kushal Tirumala, Michihiro Yasunaga, Ishan Misra, Luke Zettlemoyer, Lili Yu, Chunting Zhou · PDF
  9. Causal Estimation of Tokenisation Bias

    Pietro Lesci, Clara Meister, Thomas Hofmann, Andreas Vlachos, Tiago Pimentel · PDF
  10. Conditional Unigram Tokenization with Parallel Data

    Gianluca Vico, Jindřich Libovický · PDF
  11. Contextual morphologically-guided tokenization for pretrained Latin BERT models

    Marisa Hudspeth, Patrick J. Burns, Brendan O'Connor · PDF
  12. Continuous Autoregressive Generation with Mixture of Gaussians

    Alex Quach, Tsun-Hsuan Wang, Ramin Hasani, Mathias Lechner, Alexander Amini · PDF
  13. Continuous Chain of Thought Enables Parallel Exploration and Reasoning

    Halil Alperen Gozeten, Muhammed Emrullah Ildiz, Xuechen Zhang, Hrayr Harutyunyan, Ankit Singh Rawat, Samet Oymak · PDF
  14. Discrete JEPA: Learning Discrete Token Representations without Reconstruction

    Junyeob Baek, Hosung Lee, Christopher Hoang, Mengye Ren, Sungjin Ahn · PDF
  15. Entropy-Driven Pre-Tokenization for Byte-Pair Encoding

    Yifan Hu, Ningyue Liang, Dachuan Zhao, Jonathan Geuter, Varshini Reddy, Craig W Schmidt, Chris Tanner · PDF
  16. Evaluating Morphological Alignment of Tokenizers in 70 Languages

    Catherine Arnett, Marisa Hudspeth, Brendan O'Connor · PDF
  17. FLEXITOKENS: Flexible Tokenization for Evolving Language Models

    Abraham Toluwase Owodunni, Orevaoghene Ahia, Sachin Kumar · PDF
  18. GeneticBPE: Motif-Preserving Tokenization for Robust miRNA Modeling

    Jaskaran Singh, Prabhav Sanga, Arun Kumar Dubey · PDF
  19. HH-Codec: High Compression High-fidelity Discrete Neural Codec for Spoken Language Modeling

    Rongkun Xue, Yazhe Niu, Shuai Hu, Zixin Yin, Yongqiang Yao, Jing Yang · PDF
  20. How Much is Enough? The Diminishing Returns of Tokenization Training Data

    Varshini Reddy, Craig W Schmidt, Yuval Pinter, Chris Tanner · PDF
  21. How Tokenization Limits Phonological Knowledge Representation in Language Models and How to Improve Them

    Disen Liao, Freda Shi · PDF
  22. InCa and InDia: Inline Casing and Diacritization Preprocessing For Robust-to-Noise Tokenization and Interpretability

    Kirill Semenov, Martin Popel · PDF
  23. Is Your LLM Overcharging You? Tokenization, Transparency, and Incentives

    Ander Artola Velasco, Stratis Tsirtsis, Nastaran Okati, Manuel Gomez Rodriguez · PDF
  24. MorphTok: Morphologically Grounded Tokenization for Indic languages

    Maharaj Brahma, N J Karthika, Atul Kumar Singh, Devaraja Adiga, Smruti Bhate, Ganesh Ramakrishnan, Rohit Saluja, Maunendra Sankar Desarkar · PDF
  25. Motion-Focused Tokenization for Source-Free Video Domain Adaptation

    Tzu Ling Liu, Ian Stavness, Mrigank Rochan · PDF
  26. One-D-Piece: Image Tokenizer Meets Quality-Controllable Compression

    Keita Miwa, Kento Sasaki, Hidehisa Arai, Tsubasa Takahashi, Yu Yamaguchi · PDF
  27. Overcoming Vocabulary Constraints with Pixel-level Fallback

    Jonas F. Lotz, Hendra Setiawan, Stephan Peitz, Yova Kementchedjhieva · PDF
  28. Pitfalls, Subtleties, and Techniques in Automata-Based Subword-Level Constrained Generation

    Marco Cognetta, David Pohl, Junyoung Lee, Naoaki Okazaki · PDF
  29. QuickMerge++: Token Merging with Autoregressive Prior

    Dong Liu, Yanxuan Yu · PDF
  30. Robust Noise Attenuation via Adaptive Pooling of Transformer Outputs

    Greyson Brothers · PDF
  31. Sampling from Your Language Model One Byte at a Time

    Jonathan Hayase, Alisa Liu, Noah A. Smith, Sewoong Oh · PDF
  32. SuperBPE: Space Travel for Language Models

    Alisa Liu, Jonathan Hayase, Valentin Hofmann, Sewoong Oh, Noah A. Smith, Yejin Choi · PDF
  33. Tokenisation is NP-Complete

    Philip Whittington, Gregor Bachmann, Tiago Pimentel · PDF
  34. Tokenizing Nonverbal Communication in Salsa Dance

    Bermet Burkanova, Payam Jome Yazdian, Chuxuan Zhang, Trinity Evans, Paige Tuttösí, Angelica Lim · PDF
  35. Watermarking Autoregressive Image Generation

    Nikola Jovanović, Ismail Labiad, Tomas Soucek, Martin Vechev, Pierre Fernandez · PDF
  36. You Only Train Once: Efficient Tokenizer Selection for Arithmetic in Language Models

    Mucong Ding, Sean Michael McLeish, Kazem Meidani, Igor Melnyk, Nam H Nguyen, C. Bayan Bruss, Furong Huang · PDF
  37. zip2zip: Inference-Time Adaptive Vocabularies for Language Models via Token Compression

    Saibo Geng, Nathan Ranchin, Yunzhen Yao, Maxime Peyrard, Chris Wendler, Michael Gastpar, Robert West · PDF