ICML 2024 · Past · Large language models · Efficiency · ML systems

Workshop on Efficient Systems for Foundation Models II @ ICML2024

ES-FoMo-II 2024

Submission deadline: Jun 4, 2024, 11:59 UTC
imported from OpenReview — check the website for extensions
Submission portal: OpenReview
Notes: Topics were auto-suggested and may be imprecise — edits welcome.

Accepted papers (80)

Fetched from OpenReview (v2) on 2026-06-10.

AdaInf: Adaptive Inference for Resource-Constrained Foundation Models
Zhuoyan Xu, Khoi Duc Nguyen, Preeti Mukherjee, Somali Chaterji, Yingyu Liang, Yin Li · PDF
Adam-mini: Use Fewer Learning Rates To Gain More
Yushun Zhang, Congliang Chen, Ziniu Li, Tian Ding, Chenwei Wu, Yinyu Ye, Zhi-Quan Luo, Ruoyu Sun · PDF
AdaNF: Quantization Group Adaptive NormalFloat for Low Bit Fine-tuning of LLMs
Yeojoon Youn, Sehoon Kim, Suhong Moon, Sang Keun Choe, Ce Zhang · PDF
BAM! Just Like That: Simple and Efficient Parameter Upcycling for Mixture of Experts
Qizhen Zhang, Nikolas Gritsch, Dwaraknath Gnaneshwar, Simon Guo, David Cairuz, Bharat Venkitesh, Jakob Nicolaus Foerster, Phil Blunsom, Sebastian Ruder, Ahmet Üstün, Acyr Locatelli · PDF
Block Verification Accelerates Speculative Decoding
Ziteng Sun, Uri Mendlovic, Yaniv Leviathan, Asaf Aharoni, Ahmad Beirami, Jae Hun Ro, Ananda Theertha Suresh · PDF
Can Transformers Solve Least Squares to High Precision?
Jerry Weihong Liu, Jessica Grogan, Owen M Dugan, Simran Arora, Atri Rudra, Christopher Re · PDF
Characterizing Prompt Compression Methods for Long Context Inference
Siddharth Jha, Lutfi Eren Erdogan, Sehoon Kim, Kurt Keutzer, Amir Gholami · PDF
CLAM: Unifying Finetuning, Quantization, and Pruning by Chaining LLM Adapter Modules
Neelay Velingker, Jason Liu, Amish Sethi, William Dodds, Zhiqiu Xu, Saikat Dutta, Mayur Naik, Eric Wong · PDF
CO2: Precise Attention Score Observation for improving KV Cache Replacement in Large Language Model
Meguru Yamazaki, Shivaram Venkataraman · PDF
Compress then Serve: Serving Thousands of LoRA Adapters with Little Overhead
Rickard Brüel Gabrielsson, Jiacheng Zhu, Onkar Bhardwaj, Leshem Choshen, Kristjan Greenewald, Mikhail Yurochkin, Justin Solomon · PDF
DocParseNet: Advanced Semantic Segmentation and OCR Embeddings for Efficient Scanned Document Annotation
Ahmad Mohammadshirazi, Ali Nosratifiroozsalari, Mengxi Zhou, Dheeraj Kulshrestha, Rajiv Ramnath · PDF
Does your data spark joy? Performance gains from domain upsampling at the end of training
Cody Blakeney, Mansheej Paul, Brett W. Larsen, Sean Owen, Jonathan Frankle · PDF
Efficient LLM Pruning with Global Token-Dependency Awareness and Hardware-Adapted Inference
Oshin Dutta, Ritvik Gupta, Sumeet Agarwal · PDF
Efficient multi-prompt evaluation of LLMs
Felipe Maia Polo, Ronald Xu, Lucas Weber, Mírian Silva, Onkar Bhardwaj, Leshem Choshen, Allysson Flavio Melo de Oliveira, Yuekai Sun, Mikhail Yurochkin · PDF
Efficient Training of Language Models with Compact and Consistent Next Token Distributions
Ashutosh Sathe, Sunita Sarawagi · PDF
Enhancing Stability for Large Models Training in Constrained Bandwidth Networks
Yun Dai, Tejas Dharamsi, Pin-Lun Hsu, Tao Song, Hamed Firooz · PDF
Exploiting Activation Sparsity with Dense to Dynamic-k Mixture-of-Experts Conversion
Filip Szatkowski, Bartosz Wójcik, Mikołaj Piórczyński, Simone Scardapane · PDF
Exploring and Improving Drafts in Blockwise Parallel Decoding
Taehyeon Kim, Ananda Theertha Suresh, Kishore A Papineni, Michael Riley, Sanjiv Kumar, Adrian Benton · PDF
Exploring Monotonicity in Early-Exiting Language Models
Filipe Laitenberger, Max Belitsky, Denys Sheremet · PDF
ExpoMamba: Exploiting Frequency SSM Blocks for Efficient and Effective Image Enhancement
Eashan Adhikarla, Kai Zhang, John Nicholson, Brian D. Davison · PDF
Exponential Quantum Communication Advantage in Distributed Inference and Learning
Hagay Michaeli, Dar Gilboa, Daniel Soudry, Jarrod Ryan McClean · PDF
Fast Adaptation and Robust Quantization of Multi-Modal Foundation Models from Associative Memory: A Case Study in SpeechLM
Shang Wu, Yen-Ju Lu, Haozheng Luo, Jerry Yao-Chieh Hu, Jiayi Wang, Najim Dehak, Jesus Villalba, Han Liu · PDF
Fast and Memory-Efficient Multi-Sequence Generation via Structured Masking
Daniel Mingyi Israel, Siyan Zhao, Guy Van den Broeck, Aditya Grover · PDF
Fast yet Safe: Early-Exiting with Risk Control
Metod Jazbec, Alexander Timans, Tin Hadži Veljković, Kaspar Sakmann, Dan Zhang, Christian A. Naesseth, Eric Nalisnick · PDF
Fewer Truncations Improve Language Modeling
Hantian Ding, Zijian Wang, Giovanni Paolini, Varun Kumar, Anoop Deoras, Dan Roth, Stefano Soatto · PDF
GPTVQ: The Blessing of Dimensionality for LLM Quantization
Mart Van Baalen, Andrey Kuzmin, Markus Nagel, Peter Couperus, Artem Bolshakov, Cedric Bastoul, Eric Mahurin, Tijmen Blankevoort, Paul Whatmough · PDF
GRASS: Compute Efficient Low-Memory LLM Training with Structured Sparse Gradients
Aashiq Muhamed, Oscar Li, David Woodruff, Mona T. Diab, Virginia Smith · PDF
Hardware-Efficient Quantization for Green Custom Foundation Models
Toshiaki Koike-Akino, Chang Meng, Volkan Cevher, Giovanni De Micheli · PDF
HLSTransform: Energy-Efficient Llama 2 Inference on FPGAs Via High Level Synthesis
Darren Yan Key, Andy He, Mason Bulling, Andrew Chang, Skyler Shapiro, Everett Lee · PDF
Hydragen: High-Throughput LLM Inference with Shared Prefixes
Jordan Juravsky, Bradley Brown, Ryan Saul Ehrlich, Daniel Y Fu, Christopher Re, Azalia Mirhoseini · PDF
Implicit Optimization Bias of Next-token Prediction in Linear Models
Christos Thrampoulidis · PDF
In Defense of Structural Sparse Adapters for Concurrent LLM Serving
Junda Su, Zirui Liu, Zeju Qiu, Weiyang Liu, Zhaozhuo Xu · PDF
Janus: An Efficient and Expressive Subquadratic Architecture for Modeling Biological Sequences
Krithik Ramesh, Sameed Muneeb Siddiqui, Michael Mitzenmacher, Pardis Sabeti · PDF
Just read twice: closing the recall gap for recurrent language models
Simran Arora, Aman Timalsina, Aaryan Singhal, Sabri Eyuboglu, Xinyi Zhao, Ashish Rao, Atri Rudra, Christopher Re · PDF
LAuReL: Learned Augmented Residual Layer
Gaurav Menghani, Ravi Kumar, Sanjiv Kumar · PDF
LazyLLM: Dynamic Token Pruning for Efficient Long Context LLM Inference
Qichen Fu, Minsik Cho, Thomas Merth, Sachin Mehta, Mohammad Rastegari, Mahyar Najibi · PDF
Learned Best-Effort LLM Serving
Siddharth Jha, Coleman Richard Charles Hooper, Xiaoxuan Liu, Sehoon Kim, Kurt Keutzer · PDF
Lottery Ticket Adaptation: Mitigating Destructive Interference in LLMs
Ashwinee Panda, Berivan Isik, Xiangyu Qi, Sanmi Koyejo, Tsachy Weissman, Prateek Mittal · PDF
Low Rank Quantization-Aware Training for LLMs
Yelysei Bondarenko, Riccardo Del Chiaro, Markus Nagel · PDF
Low-rank Linearization of Large Language Models
Michael Zhang, Aaryan Singhal, Benjamin Frederick Spector, Simran Arora, Christopher Re · PDF
Mamba-PTQ: Outlier Channels in Recurrent Large Language Models
Alessandro Pierro, Steven Abreu · PDF
MInference: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention
Huiqiang Jiang, Yucheng Li, Chengruidong Zhang, Qianhui Wu, Xufang Luo, Surin Ahn, Zhenhua Han, Amir H. Abdi, Dongsheng Li, Chin-Yew Lin, Yuqing Yang, Lili Qiu · PDF
Mobile and Edge Evaluation of Large Language Models
Stefanos Laskaridis, Kleomenis Katevas, Lorenzo Minto, Hamed Haddadi · PDF
MoRe Fine-Tuning with 10x Fewer Parameters
Wenxuan Tan, Nicholas Roberts, Tzu-Heng Huang, Jitian Zhao, John Cooper, Samuel Guo, Chengyu Duan, Frederic Sala · PDF
NVDSL: Simplifying Tensor Cores with Python-Driven MLIR Metaprogramming
Guray Ozen · PDF
OpenDiLoCo: An Open-Source Framework for Globally Distributed Low-Communication Training
Sami Jaghouar, Johannes Hagemann · PDF
OpenELM: An Efficient Language Model Family with Open Training and Inference Framework
Sachin Mehta, Mohammad Hossein Sekhavat, Qingqing Cao, Maxwell Horton, Yanzi Jin, Chenfan Sun, Seyed Iman Mirzadeh, Mahyar Najibi, Dmitry Belenko, Peter Zatloukal, Mohammad Rastegari · PDF
Optimised Grouped-Query Attention Mechanism for Transformers
Yuang Chen, Cheng Zhang, Xitong Gao, Robert D. Mullins, George Anthony Constantinides, Yiren Zhao · PDF
Optimistic Verifiable Training by Controlling Hardware Nondeterminism
Megha Srivastava, Simran Arora, Dan Boneh · PDF
OutEffHop: A Principled Outlier-Efficient Attention Layer from Dense Associative Memory Models
Haozheng Luo, Jerry Yao-Chieh Hu, Pei-Hsuan Chang, Hong-Yu Chen, Weijian Li, Wei-Po Wang, Han Liu · PDF
Outliers and Calibration Sets have Diminishing Effect on Quantization of Modern LLMs
Davide Paglieri, Saurabh Dash, Tim Rocktäschel, Jack Parker-Holder · PDF
Performance Control in Early Exiting to Deploy Large Models at the Same Cost of Smaller Ones
Mehrnaz Mofakhami, Reza Bayat, Ioannis Mitliagkas, Joao Monteiro, Valentina Zantedeschi · PDF
PQV-Mobile: A Combined Pruning and Quantization Toolkit to Optimize Vision Transformers for Mobile Applications
Kshitij Bhardwaj · PDF
Prepacking: A Simple Method for Fast Prefilling and Increased Throughput in Large Language Models
Siyan Zhao, Daniel Mingyi Israel, Guy Van den Broeck, Aditya Grover · PDF
Pretrained Hybrids with MAD Skills
Nicholas Roberts, Samuel Guo, Zhiqi Gao, Satya Sai Srinath Namburi GNVV, Sonia Cromp, Chengjun Wu, Chengyu Duan, Frederic Sala · PDF
Projectable Models: One-Shot Generation of Small Specialized Transformers from Large Ones
Andrey Zhmoginov, Jihwan Lee, Mark Sandler · PDF
Prompt-prompted Adaptive Structured Pruning for Efficient LLM Generation
Harry Dong, Beidi Chen, Yuejie Chi · PDF
Quantum-PEFT: Ultra parameter-efficient fine-tuning
Toshiaki Koike-Akino, Francesco Tonin, Yongtao Wu, Leyla Naz Candogan, Volkan Cevher · PDF
Revealing the Utilized Rank of Subspaces of Learning in Neural Networks
Isha Garg, Christian Koguchi, Eshan Verma, Daniel Ulbricht · PDF
Revisiting Cascaded Ensembles for Efficient Inference
Steven Kolawole, Don Dennis, Ameet Talwalkar, Virginia Smith · PDF
Robust Federated Finetuning of Foundation Models via Alternating Minimization of LoRA
Shuangyi Chen, Yue Ju, Hardik Dalal, Zhongwen Zhu, Ashish J Khisti · PDF
Scaling Laws and Compute-Optimal Training Beyond Fixed Training Durations
Alexander Hägele, Elie Bakouch, Atli Kosson, Loubna Ben Allal, Leandro Von Werra, Martin Jaggi · PDF
Scavenging Hyena: Distilling Transformers into Long Convolution Models
Tokiniaina Raharison Ralambomihanta, Shahrad Mohammadzadeh, Sami Nur Islam, Wassim Jabbour, Laurence Liang · PDF
Seeded LoRA: Collaborative Fine-Tuning Through Seed Initialization of Adapters
Alejandro R. Salamanca, Ahmet Üstün, Nicki Skafte Detlefsen, Tim Dettmers · PDF
Simple linear attention language models balance the recall-throughput tradeoff
Simran Arora, Sabri Eyuboglu, Michael Zhang, Aman Timalsina, Silas Alberti, Dylan Zinsley, James Zou, Atri Rudra, Christopher Re · PDF
SpecDec++: Boosting Speculative Decoding via Adaptive Candidate Lengths
Kaixuan Huang, Xudong Guo, Mengdi Wang · PDF
SVFT: Parameter-Efficient Fine-Tuning with Singular Vectors
Vijay Lingam, Atula Tejaswi Neerkaje, Aditya Vavre, Aneesh Shetty, Gautham Krishna Gudur, Joydeep Ghosh, Alex Dimakis, Eunsol Choi, Aleksandar Bojchevski, Sujay Sanghavi · PDF
Task Addition and Weight Disentanglement in Closed-Vocabulary Models
Adam Hazimeh, Alessandro Favero, Pascal Frossard · PDF
The Mamba in the Llama: Distilling and Accelerating Hybrid Models
Junxiong Wang, Daniele Paliotta, Avner May, Alexander M Rush, Tri Dao · PDF
Think Big, Generate Quick: LLM-to-SLM for Fast Autoregressive Decoding
Benjamin Bergner, Andrii Skliar, Amelie Royer, Tijmen Blankevoort, Yuki M Asano, Babak Ehteshami Bejnordi · PDF
TinyAgent: Quantization-aware Model Compression and Adaptation for On-device LLM Agent Deployment
Jason Kong, Lanxiang Hu, Flavio Ponzina, Tajana Rosing · PDF
Towards Efficient Large-Scale Language-3D Representation Learning
Shentong Mo, Xiaogang Xu, Tongzhou Wang, Antonio Torralba, Shuang Li · PDF
Towards smaller language models via layer looping
Sabri Eyuboglu, Dylan Zinsley, Jon Saad-Falcon, Simran Arora, Atri Rudra, James Zou, Christopher Re · PDF
Train your cake and eat it too! Repurposing collaborative training to tailor LLMs to private data without sharing
Boris Radovič, Mohammed Aljahdali, Marco Canini, Veljko Pejović, Zuhair Khayyat · PDF
Training-Free Acceleration of ViTs with Delayed Spatial Merging
Jung Hwan Heo, Seyedarmin Azizi, Arash Fayyazi, Massoud Pedram · PDF
Understanding and Minimising Outlier Features in Neural Network Training
Bobby He, Lorenzo Noci, Daniele Paliotta, Imanol Schlag, Thomas Hofmann · PDF
Unlocking the Global Synergies in Low-Rank Adapters
Zixi Zhang, Cheng Zhang, Xitong Gao, Robert D. Mullins, George Anthony Constantinides, Yiren Zhao · PDF
Why Transformers Need Adam: A Hessian Perspective
Yushun Zhang, Congliang Chen, Tian Ding, Ziniu Li, Ruoyu Sun, Zhi-Quan Luo · PDF
xLSTM: Extended Long Short-Term Memory
Maximilian Beck, Korbinian Pöppel, Markus Spanring, Andreas Auer, Oleksandra Prudnikova, Michael K Kopp, Günter Klambauer, Johannes Brandstetter, Sepp Hochreiter · PDF
Zeroth-Order Fine-Tuning of LLMs with Extreme Sparsity
Wentao Guo, Jikai Long, Yimeng Zeng, Zirui Liu, Xinyu Yang, Yide Ran, Jacob R. Gardner, Osbert Bastani, Christopher De Sa, Xiaodong Yu, Beidi Chen, Zhaozhuo Xu · PDF
