Shorter but not Worse: Frugal Reasoning via Easy Samples as Length Regularizers in Math RLVR
Abstract
Retaining and up-weighting moderately easy problems in RLVR pipelines for LLMs reduces output verbosity without explicit length penalization.
Large language models (LLMs) trained for step-by-step reasoning often become excessively verbose, raising inference cost. Standard Reinforcement Learning with Verifiable Rewards (RLVR) pipelines filter out "easy" problems for training efficiency, leaving the model to train primarily on harder problems that require longer reasoning chains. This skews the output length distribution upward, resulting in a model that conflates "thinking longer" with "thinking better". In this work, we show that retaining and modestly up-weighting moderately easy problems acts as an implicit length regularizer. Exposing the model to solvable short-chain tasks constrains its output distribution and prevents runaway verbosity. The result is emergent brevity for free: the model learns to solve harder problems without inflating the output length, despite the absence of any explicit length penalization. RLVR experiments using this approach on Qwen3-4B-Thinking-2507 (with a 16k token limit) achieve baseline pass@1 AIME25 accuracy while generating solutions that are, on average, nearly twice as short. The code is available on GitHub (https://github.com/MBZUAI-Paris/Frugal-AI), with datasets and models on Hugging Face (https://huggingface.co/collections/MBZUAI-Paris/k2-think-mini-68dcfa8b114686a4bd3dc2bc).
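To make the data-mixture idea concrete, here is a minimal Python sketch of the intuition described in the abstract: estimate each problem's pass rate under the current policy, keep moderately easy problems instead of filtering them out, and up-weight them when sampling RLVR training batches. The thresholds, the boost factor, and the helper names (`sampling_weight`, `sample_batch`) are illustrative assumptions, not the paper's reported hyperparameters.

```python
import random

def sampling_weight(pass_rate: float,
                    easy_lo: float = 0.6,
                    easy_hi: float = 0.95,
                    easy_boost: float = 2.0) -> float:
    """Weight for drawing a problem into an RLVR batch.

    pass_rate: empirical solve rate of the current policy on this problem,
    estimated from a few rollouts scored by a verifiable checker.
    All cutoffs below are assumed values for illustration only.
    """
    if pass_rate >= easy_hi:   # trivially solved: little learning signal, drop
        return 0.0
    if pass_rate >= easy_lo:   # moderately easy: retain AND up-weight; short
        return easy_boost      # solvable chains act as a length regularizer
    return 1.0                 # hard problems: standard weight

def sample_batch(problems, pass_rates, batch_size: int):
    """Draw an RLVR training batch using the up-weighted mixture."""
    weights = [sampling_weight(r) for r in pass_rates]
    return random.choices(problems, weights=weights, k=batch_size)
```

Under this sketch, no explicit length penalty ever appears in the reward; brevity is expected to emerge because the batch distribution keeps exposing the policy to problems it can solve with short chains.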
Community
TL;DR: 🤖 Faster. Smarter. Frugal. And BETTER!
Our open-source RL-trained math model reduces verbosity by ~2× without losing accuracy (actually improving on some hard reasoning benchmarks like Omni-Hard), showing that easy problems can implicitly regularize length during RL.
Code is publicly available on GitHub.
Model and data are publicly available on Hugging Face.
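As a hedged usage sketch: loading one of the released checkpoints with `transformers`. The repo id below is a placeholder assumption, since the exact checkpoint names are listed only in the Hugging Face collection linked above.

```python
# Minimal sketch, assuming a standard causal-LM checkpoint in the collection.
# NOTE: the repo id is a placeholder, not a real model id; substitute one
# from the MBZUAI-Paris collection linked in the abstract.
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "MBZUAI-Paris/<checkpoint-from-collection>"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id)

prompt = "Solve: what is the sum of the first 100 positive integers?"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```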


