Prescriptive Scaling Laws for Data Constrained Training
Abstract
A modified scaling law accounts for data repetition effects and provides compute-optimal training strategies for data-constrained scenarios.
Training compute is increasingly outpacing the availability of high-quality data, shifting the central challenge from optimal compute allocation to extracting maximum value from limited data. The widely adopted Chinchilla scaling law assumes every training token is unique, which limits its ability to guide pretraining decisions in data-constrained regimes. We model the excess loss under repetition with a simple additive overfitting penalty and find that it accurately describes model behavior. Our scaling law yields qualitatively new compute-optimal allocation advice: beyond a certain point, further repetition is counterproductive and compute is better spent on model capacity. We show that following our law's recommended configuration improves performance in data-constrained regimes. Finally, because our one-parameter form isolates overfitting in a single coefficient, it enables direct comparison across training configurations. As a case study, we show that strong weight decay (λ = 1.0) reduces this coefficient by approximately 70%, providing a scaling-law explanation for recent findings that the optimal weight decay in data-constrained regimes is an order of magnitude larger than standard practice.
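The abstract's idea of an additive overfitting penalty on top of a Chinchilla-style loss can be sketched as follows. Note this is a minimal illustration, not the paper's fitted law: the constants `E`, `A`, `B`, `ALPHA`, `BETA` are placeholder Chinchilla-style values, and the logarithmic dependence of the penalty on the number of epochs is an assumed functional form; only the single coefficient `k_overfit` reflects the paper's stated one-parameter structure.

```python
import math

# Placeholder Chinchilla-style constants (illustrative, not fitted values).
E, A, B = 1.69, 406.4, 410.7
ALPHA, BETA = 0.34, 0.28

def chinchilla_loss(n_params: float, n_tokens: float) -> float:
    """Baseline loss under the assumption that every training token is unique."""
    return E + A / n_params**ALPHA + B / n_tokens**BETA

def data_constrained_loss(n_params: float, n_tokens: float,
                          n_unique: float, k_overfit: float) -> float:
    """Baseline loss plus an additive overfitting penalty (assumed log form).

    k_overfit is the single coefficient that isolates overfitting; comparing
    its fitted value across training configurations (e.g. weight-decay
    settings) is the comparison the abstract describes.
    """
    epochs = n_tokens / n_unique  # how many times the unique data is repeated
    penalty = k_overfit * math.log(max(epochs, 1.0))  # assumed penalty shape
    return chinchilla_loss(n_params, n_tokens) + penalty
```

Under this sketch, a single pass over unique data (`n_tokens == n_unique`) incurs no penalty, repetition adds loss proportional to `k_overfit`, and a configuration that shrinks `k_overfit` by ~70% (as the abstract reports for λ = 1.0 weight decay) tolerates correspondingly more repetition before extra epochs stop paying off.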
Community
A scaling law that describes language model behavior under data repetition.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- InfoLaw: Information Scaling Laws for Large Language Models with Quality-Weighted Mixture Data and Repetition (2026)
- Test-Time Scaling Makes Overtraining Compute-Optimal (2026)
- The Finetuner's Fallacy: When to Pretrain with Your Finetuning Data (2026)
- Is More Data Worth the Cost? Dataset Scaling Laws in a Tiny Attention-Only Decoder (2026)
- Time is Not Compute: Scaling Laws for Wall-Clock Constrained Training on Consumer GPUs (2026)
- Optimal Expert-Attention Allocation in Mixture-of-Experts: A Scalable Law for Dynamic Model Design (2026)
- To Memorize or to Retrieve: Scaling Laws for RAG-Considerate Pretraining (2026)