ICML 2025 Past Other
Tokenization Workshop (TokShop)
- Submission deadline: Jun 1, 2025, 11:59 UTC (imported from OpenReview; check the website for extensions)
- Submission portal: OpenReview
- Notes: Auto-imported from the OpenReview venue record on 2026-06-10; please verify and enrich (topics are keyword-guessed).
Accepted papers (37)
Fetched from OpenReview (v2) on 2026-06-10.
- Adversarial Tokenization
- BPE Stays on SCRIPT: Structured Encoding for Robust Multilingual Pretokenization
- Broken Tokens? Your Language Model can Secretly Handle Non-Canonical Tokenizations
- Byte Latent Transformer: Patches Scale Better Than Tokens
- Byte-level Tokenizers Unavoidably Enable LLMs to Generate Ill-formed UTF-8
- ByteSpan: Information-Driven Subword Tokenisation
- Canonical Autoregressive Generation
- CAT: Content-Adaptive Image Tokenization
- Causal Estimation of Tokenisation Bias
- Conditional Unigram Tokenization with Parallel Data
- Contextual morphologically-guided tokenization for pretrained Latin BERT models
- Continuous Autoregressive Generation with Mixture of Gaussians
- Continuous Chain of Thought Enables Parallel Exploration and Reasoning
- Discrete JEPA: Learning Discrete Token Representations without Reconstruction
- Entropy-Driven Pre-Tokenization for Byte-Pair Encoding
- Evaluating Morphological Alignment of Tokenizers in 70 Languages
- FLEXITOKENS: Flexible Tokenization for Evolving Language Models
- GeneticBPE: Motif-Preserving Tokenization for Robust miRNA Modeling
- HH-Codec: High Compression High-fidelity Discrete Neural Codec for Spoken Language Modeling
- How Much is Enough? The Diminishing Returns of Tokenization Training Data
- How Tokenization Limits Phonological Knowledge Representation in Language Models and How to Improve Them
- InCa and InDia: Inline Casing and Diacritization Preprocessing For Robust-to-Noise Tokenization and Interpretability
- Is Your LLM Overcharging You? Tokenization, Transparency, and Incentives
- MorphTok: Morphologically Grounded Tokenization for Indic languages
- Motion-Focused Tokenization for Source-Free Video Domain Adaptation
- One-D-Piece: Image Tokenizer Meets Quality-Controllable Compression
- Overcoming Vocabulary Constraints with Pixel-level Fallback
- Pitfalls, Subtleties, and Techniques in Automata-Based Subword-Level Constrained Generation
- QuickMerge++: Token Merging with Autoregressive Prior
- Robust Noise Attenuation via Adaptive Pooling of Transformer Outputs
- Sampling from Your Language Model One Byte at a Time
- SuperBPE: Space Travel for Language Models
- Tokenisation is NP-Complete
- Tokenizing Nonverbal Communication in Salsa Dance
- Watermarking Autoregressive Image Generation
- You Only Train Once: Efficient Tokenizer Selection for Arithmetic in Language Models
- zip2zip: Inference-Time Adaptive Vocabularies for Language Models via Token Compression