Regress, Don’t Guess
A Regression-like Loss on Number Tokens for Language Models

ICML 2025

Jonas Zausinger

TUM.ai / TUM

Lars Pennig

TUM.ai / TUM / Helmholtz / MCML

Anamarija Kozina

TUM.ai / TUM

Sean Sdahl

TUM.ai / TUM

Julian Sikora

TUM.ai / TUM

Adrian Dendorfer

TUM.ai / TUM

Timofey Kuznetsov

TUM.ai / TUM

Mohamad Hagog

TUM.ai / LMU

Nina Wiedemann

ETH Zurich

Kacper Chlodny

TUM.ai / TUM

Vincent Limbach

TUM.ai / TUM

Anna Ketteler

TUM.ai / TUM

Thorben Prein

TUM.ai / TUM

Vishwa Mohan Singh

TUM.ai / LMU

Michael Danziger

IBM Research Europe

Jannis Born

IBM Research Europe

Paper | Code | Video (TBA) | Demo

The Number Token Loss (NTL) augments Cross Entropy to improve language models on numerical tasks.

[Figure: Number Token Loss schematic]
Number Token Loss (NTL) is the Wasserstein-1 distance between the predicted and the ground-truth distribution over the number tokens.

Abstract

While language models have exceptional capabilities at text generation, they lack a natural inductive bias for emitting numbers and thus struggle in tasks involving quantitative reasoning, especially arithmetic. One fundamental limitation is the nature of the Cross Entropy loss, which assumes a nominal scale and thus cannot convey proximity between generated number tokens. In response, we present a regression-like loss that operates purely at the token level. Our proposed Number Token Loss (NTL) comes in two flavors and minimizes either the Lp norm or the Wasserstein distance between the numerical values of the real and predicted number tokens. NTL can easily be added to any language model and extends the Cross Entropy objective during training without runtime overhead. We evaluate the proposed scheme on various mathematical datasets and find that it consistently improves performance on math-related tasks. In a direct comparison on a regression task, we find that NTL can match the performance of a regression head, despite operating at the token level. Finally, we scale NTL up to 3B-parameter models and observe improved performance, demonstrating its potential for seamless integration into LLMs. We hope this work inspires LLM developers to improve their pretraining objectives.
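As a rough illustration of how NTL extends the standard objective, the combined training loss can be sketched as follows (a minimal PyTorch sketch; the function names and the weighting factor lam are illustrative assumptions, not the official implementation from the repository):

import torch.nn.functional as F

def training_loss(logits, targets, number_token_loss, lam=0.3):
    # Standard next-token Cross Entropy over the full vocabulary.
    ce = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
    # Regression-like penalty, evaluated only at positions whose target is a number token.
    ntl = number_token_loss(logits, targets)
    # NTL simply augments Cross Entropy: no extra head, no runtime overhead.
    return ce + lam * ntl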

Why do we need the Number Token Loss (NTL)?

Cross Entropy is nominal-scale and thus assigns equal loss to all incorrect predictions. This makes sense for normal tokens but not for number tokens:

With a ground truth token 4, predicting 3 or 9 should not give equal loss 🤔😱
NTL fixes this! 🚀💪
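A tiny numerical sketch of this effect (toy digit-only vocabulary, values chosen purely for illustration):

import torch

# Two predicted distributions over the digit tokens 0-9.
# Both assign 0.1 probability to the correct digit 4, but model A puts the
# remaining mass on the nearby 3, while model B puts it on the distant 9.
probs_a = torch.tensor([0., 0., 0., 0.9, 0.1, 0., 0., 0., 0., 0.])
probs_b = torch.tensor([0., 0., 0., 0., 0.1, 0., 0., 0., 0., 0.9])
target = 4

# Cross Entropy only looks at the probability of the ground-truth token,
# so both predictions receive the identical loss -log(0.1) ≈ 2.30.
print(-torch.log(probs_a[target]), -torch.log(probs_b[target]))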

For all number tokens, NTL increases with the distance from the ground truth, just like a regression loss. Yet it doesn't need an extra head: it computes a regression-like loss directly on the token head. We propose two schemes (sketched below):
NTL-WAS – Wasserstein-1 distance between the predicted and the one-hot number distribution (see plot above).
NTL-MSE – Squared error between the expected numeric value (dot product of the predicted probabilities with the token values) and the ground truth (most intuitive, but has some undesired local minima).
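A minimal, self-contained PyTorch sketch of both schemes, assuming a toy vocabulary that consists only of the digit tokens 0-9 (the actual implementation in the repository handles full vocabularies and masks non-number positions):

import torch
import torch.nn.functional as F

# Assumed numeric values of the number tokens (here: the digits 0-9).
digit_values = torch.arange(10, dtype=torch.float32)

def ntl_was(number_probs, target_digit):
    """NTL-WAS: Wasserstein-1 distance between the predicted distribution
    over digit tokens and the one-hot distribution of the ground-truth digit."""
    target_onehot = F.one_hot(target_digit, num_classes=10).float()
    # For 1D distributions on equally spaced values, W1 equals the sum of
    # absolute differences between the two cumulative distribution functions.
    cdf_diff = torch.cumsum(number_probs, dim=-1) - torch.cumsum(target_onehot, dim=-1)
    return cdf_diff.abs().sum(dim=-1)

def ntl_mse(number_probs, target_digit):
    """NTL-MSE: squared error between the expected numeric value under the
    predicted distribution (a dot product) and the ground-truth value."""
    expected_value = (number_probs * digit_values).sum(dim=-1)
    # Note: any distribution whose expectation hits the target (e.g. equal
    # mass on 3 and 5 for target 4) reaches zero loss -- the undesired minima.
    return (expected_value - target_digit.float()) ** 2

# Ground-truth digit 4; prediction concentrated near (on 3) vs. far (on 9).
probs_near = torch.tensor([[0., 0., 0., 0.9, 0.1, 0., 0., 0., 0., 0.]])
probs_far  = torch.tensor([[0., 0., 0., 0., 0.1, 0., 0., 0., 0., 0.9]])
target = torch.tensor([4])
print(ntl_was(probs_near, target), ntl_was(probs_far, target))  # ≈ 0.9 vs. 4.5
print(ntl_mse(probs_near, target), ntl_mse(probs_far, target))  # ≈ 0.81 vs. 20.25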

Number Token Loss vs. Cross Entropy

Key Contributions & Results

Citation

@inproceedings{zausinger2025regress,
  title   = {Regress, Don't Guess – A Regression-like Loss on Number Tokens for Language Models},
  author  = {Jonas Zausinger and Lars Pennig and Anamarija Kozina and Sean Sdahl
             and Julian Sikora and Adrian Dendorfer and Timofey Kuznetsov
             and Mohamad Hagog and Nina Wiedemann and Kacper Chlodny
             and Vincent Limbach and Anna Ketteler and Thorben Prein
             and Vishwa Mohan Singh and Michael Danziger and Jannis Born},
  booktitle = {Proc. of the 42nd International Conference on Machine Learning (ICML)},
  year    = {2025},
  url     = {https://github.com/tum-ai/number-token-loss}
}