ICML 2025 · Past · NLP

Tokenization Workshop

TokShop


Submission deadline: Jun 1, 2025, 11:59 UTC
imported from OpenReview — check the website for extensions
Submission portal: OpenReview
Notes: Topics were auto-suggested and may be imprecise — edits welcome.

Accepted papers (37)

Fetched from OpenReview (v2) on 2026-06-10.

Adversarial Tokenization
Renato Geh, Zilei Shao, Guy Van den Broeck · PDF
BPE Stays on SCRIPT: Structured Encoding for Robust Multilingual Pretokenization
Sander Land, Catherine Arnett · PDF
Broken Tokens? Your Language Model can Secretly Handle Non-Canonical Tokenizations
Brian Siyuan Zheng, Alisa Liu, Orevaoghene Ahia, Jonathan Hayase, Yejin Choi, Noah A. Smith · PDF
Byte Latent Transformer: Patches Scale Better Than Tokens
Artidoro Pagnoni, Ramakanth Pasunuru, Pedro Rodriguez, John Nguyen, Benjamin Muller, Margaret Li, Chunting Zhou, Lili Yu, Jason E Weston, Luke Zettlemoyer, Gargi Ghosh, Mike Lewis, Ari Holtzman, Srini Iyer · PDF
Byte-level Tokenizers Unavoidably Enable LLMs to Generate Ill-formed UTF-8
Preston Firestone, Shubham Ugare, Gagandeep Singh, Sasa Misailovic · PDF
ByteSpan: Information-Driven Subword Tokenisation
Zebulon Goriely, Suchir Salhan, Pietro Lesci, Julius Cheng, Paula Buttery · PDF
Canonical Autoregressive Generation
Ivi Chatzi, Nina L. Corvelo Benz, Stratis Tsirtsis, Manuel Gomez Rodriguez · PDF
CAT: Content-Adaptive Image Tokenization
Junhong Shen, Kushal Tirumala, Michihiro Yasunaga, Ishan Misra, Luke Zettlemoyer, Lili Yu, Chunting Zhou · PDF
Causal Estimation of Tokenisation Bias
Pietro Lesci, Clara Meister, Thomas Hofmann, Andreas Vlachos, Tiago Pimentel · PDF
Conditional Unigram Tokenization with Parallel Data
Gianluca Vico, Jindřich Libovický · PDF
Contextual morphologically-guided tokenization for pretrained Latin BERT models
Marisa Hudspeth, Patrick J. Burns, Brendan O'Connor · PDF
Continuous Autoregressive Generation with Mixture of Gaussians
Alex Quach, Tsun-Hsuan Wang, Ramin Hasani, Mathias Lechner, Alexander Amini · PDF
Continuous Chain of Thought Enables Parallel Exploration and Reasoning
Halil Alperen Gozeten, Muhammed Emrullah Ildiz, Xuechen Zhang, Hrayr Harutyunyan, Ankit Singh Rawat, Samet Oymak · PDF
Discrete JEPA: Learning Discrete Token Representations without Reconstruction
Junyeob Baek, Hosung Lee, Christopher Hoang, Mengye Ren, Sungjin Ahn · PDF
Entropy-Driven Pre-Tokenization for Byte-Pair Encoding
Yifan Hu, Ningyue Liang, Dachuan Zhao, Jonathan Geuter, Varshini Reddy, Craig W Schmidt, Chris Tanner · PDF
Evaluating Morphological Alignment of Tokenizers in 70 Languages
Catherine Arnett, Marisa Hudspeth, Brendan O'Connor · PDF
FLEXITOKENS: Flexible Tokenization for Evolving Language Models
Abraham Toluwase Owodunni, Orevaoghene Ahia, Sachin Kumar · PDF
GeneticBPE: Motif-Preserving Tokenization for Robust miRNA Modeling
Jaskaran Singh, Prabhav Sanga, Arun Kumar Dubey · PDF
HH-Codec: High Compression High-fidelity Discrete Neural Codec for Spoken Language Modeling
Rongkun Xue, Yazhe Niu, Shuai Hu, Zixin Yin, Yongqiang Yao, Jing Yang · PDF
How Much is Enough? The Diminishing Returns of Tokenization Training Data
Varshini Reddy, Craig W Schmidt, Yuval Pinter, Chris Tanner · PDF
How Tokenization Limits Phonological Knowledge Representation in Language Models and How to Improve Them
Disen Liao, Freda Shi · PDF
InCa and InDia: Inline Casing and Diacritization Preprocessing For Robust-to-Noise Tokenization and Interpretability
Kirill Semenov, Martin Popel · PDF
Is Your LLM Overcharging You? Tokenization, Transparency, and Incentives
Ander Artola Velasco, Stratis Tsirtsis, Nastaran Okati, Manuel Gomez Rodriguez · PDF
MorphTok: Morphologically Grounded Tokenization for Indic languages
Maharaj Brahma, N J Karthika, Atul Kumar Singh, Devaraja Adiga, Smruti Bhate, Ganesh Ramakrishnan, Rohit Saluja, Maunendra Sankar Desarkar · PDF
Motion-Focused Tokenization for Source-Free Video Domain Adaptation
Tzu Ling Liu, Ian Stavness, Mrigank Rochan · PDF
One-D-Piece: Image Tokenizer Meets Quality-Controllable Compression
Keita Miwa, Kento Sasaki, Hidehisa Arai, Tsubasa Takahashi, Yu Yamaguchi · PDF
Overcoming Vocabulary Constraints with Pixel-level Fallback
Jonas F. Lotz, Hendra Setiawan, Stephan Peitz, Yova Kementchedjhieva · PDF
Pitfalls, Subtleties, and Techniques in Automata-Based Subword-Level Constrained Generation
Marco Cognetta, David Pohl, Junyoung Lee, Naoaki Okazaki · PDF
QuickMerge++: Token Merging with Autoregressive Prior
Dong Liu, Yanxuan Yu · PDF
Robust Noise Attenuation via Adaptive Pooling of Transformer Outputs
Greyson Brothers · PDF
Sampling from Your Language Model One Byte at a Time
Jonathan Hayase, Alisa Liu, Noah A. Smith, Sewoong Oh · PDF
SuperBPE: Space Travel for Language Models
Alisa Liu, Jonathan Hayase, Valentin Hofmann, Sewoong Oh, Noah A. Smith, Yejin Choi · PDF
Tokenisation is NP-Complete
Philip Whittington, Gregor Bachmann, Tiago Pimentel · PDF
Tokenizing Nonverbal Communication in Salsa Dance
Bermet Burkanova, Payam Jome Yazdian, Chuxuan Zhang, Trinity Evans, Paige Tuttösí, Angelica Lim · PDF
Watermarking Autoregressive Image Generation
Nikola Jovanović, Ismail Labiad, Tomas Soucek, Martin Vechev, Pierre Fernandez · PDF
You Only Train Once: Efficient Tokenizer Selection for Arithmetic in Language Models
Mucong Ding, Sean Michael McLeish, Kazem Meidani, Igor Melnyk, Nam H Nguyen, C. Bayan Bruss, Furong Huang · PDF
zip2zip: Inference-Time Adaptive Vocabularies for Language Models via Token Compression
Saibo Geng, Nathan Ranchin, Yunzhen Yao, Maxime Peyrard, Chris Wendler, Michael Gastpar, Robert West · PDF
