arxiv:2509.01190

Efficient Large Language Models with Zero-Shot Adjustable Acceleration

Published on Sep 1, 2025

Authors:

Abstract

A novel training and inference method for large language models that enables dynamic hardware utilization adjustment during inference without additional fine-tuning, achieving up to 11x speedup.

AI-generated summary

Using Large Language Models (LLMs) in real-world applications presents significant challenges, particularly in balancing computational efficiency with model performance. Optimizing acceleration after fine-tuning and during inference is critical for building efficient architectures. This paper introduces Zero-Shot Adjustable Acceleration, a novel training and inference method that dynamically adjusts hardware utilization during inference without requiring additional fine-tuning. The proposed approach is applied to recent LLMs and evaluated across multiple classification and text generation tasks. Experimental results demonstrate that the method supports a wide range of zero-shot acceleration and achieves up to 11x speedup compared to the baseline.

View arXiv page View PDF Add to collection

Community

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment

Upvote

Get this paper in your agent:

hf papers read 2509.01190

Don't have the latest CLI?

curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2509.01190 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2509.01190 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2509.01190 in a Space README.md to link it from this page.

Collections including this paper 0

No Collection including this paper

Add this paper to a collection to link it from this page.