Regress, Don’t Guess
A Regression-like Loss on Number Tokens for Language Models

ICML 2025

Jonas Zausinger

TUM.ai / TUM

Lars Pennig

TUM.ai / TUM / Helmholtz / MCML

Anamarija Kozina

TUM.ai / TUM

Sean Sdahl

TUM.ai / TUM

Julian Sikora

TUM.ai / TUM

Adrian Dendorfer

TUM.ai / TUM

Timofey Kuznetsov

TUM.ai / TUM

Mohamad Hagog

TUM.ai / LMU

Nina Wiedemann

ETH Zurich

Kacper Chlodny

TUM.ai / TUM

Vincent Limbach

TUM.ai / TUM

Anna Ketteler

TUM.ai / TUM

Thorben Prein

TUM.ai / TUM

Vishwa Mohan Singh

TUM.ai / LMU

Michael Danziger

IBM Research Europe

Jannis Born

IBM Research Europe

Paper | Code | Video (TBA) | Demo

The Number Token Loss (NTL) augments Cross Entropy to improve language models on numerical tasks.

[Figure: Number Token Loss schematic]
Number Token Loss (NTL) is the Wasserstein-1 distance between the predicted and the ground-truth distribution over the number tokens.

Abstract

While language models have exceptional capabilities at text generation, they lack a natural inductive bias for emitting numbers and thus struggle in tasks involving quantitative reasoning, especially arithmetic. One fundamental limitation is the nature of the Cross Entropy loss, which assumes a nominal scale and thus cannot convey proximity between generated number tokens. In response, we present a regression-like loss that operates purely at the token level. Our proposed Number Token Loss (NTL) comes in two flavors and minimizes either the Lp norm or the Wasserstein distance between the numerical values of the real and predicted number tokens. NTL can easily be added to any language model and extends the Cross Entropy objective during training without runtime overhead. We evaluate the proposed scheme on various mathematical datasets and find that it consistently improves performance on math-related tasks. In a direct comparison on a regression task, we find that NTL can match the performance of a regression head, despite operating at the token level. Finally, we scale NTL up to 3B-parameter models and observe improved performance, demonstrating its potential for seamless integration into LLMs. We hope this work inspires LLM developers to improve their pretraining objectives.
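As a rough illustration of how NTL extends the standard objective, the combined training loss can be sketched as follows (a minimal PyTorch sketch; the function names and the weighting factor lam are illustrative assumptions, not the official implementation from the repository):

import torch.nn.functional as F

def training_loss(logits, targets, number_token_loss, lam=0.3):
    # Standard next-token Cross Entropy over the full vocabulary.
    ce = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
    # Regression-like penalty, evaluated only at positions whose target is a number token.
    ntl = number_token_loss(logits, targets)
    # NTL simply augments Cross Entropy: no extra head, no runtime overhead.
    return ce + lam * ntl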

Why do we need the Number Token Loss (NTL)?

Cross Entropy is nominal-scale and thus assigns equal loss to all incorrect predictions. This makes sense for normal tokens but not for number tokens:

With a ground truth token 4, predicting 3 or 9 should not give equal loss 🤔😱
NTL fixes this! 🚀💪
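A tiny numerical sketch of this effect (toy digit-only vocabulary, values chosen purely for illustration):

import torch

# Two predicted distributions over the digit tokens 0-9.
# Both assign 0.1 probability to the correct digit 4, but model A puts the
# remaining mass on the nearby 3, while model B puts it on the distant 9.
probs_a = torch.tensor([0., 0., 0., 0.9, 0.1, 0., 0., 0., 0., 0.])
probs_b = torch.tensor([0., 0., 0., 0., 0.1, 0., 0., 0., 0., 0.9])
target = 4

# Cross Entropy only looks at the probability of the ground-truth token,
# so both predictions receive the identical loss -log(0.1) ≈ 2.30.
print(-torch.log(probs_a[target]), -torch.log(probs_b[target]))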

For all number tokens, NTL increases with the distance from the ground truth, just like a regression loss. Yet it doesn't need an extra head: it computes a regression-like loss directly on the token head. We propose two schemes (sketched below):
NTL-WAS – Wasserstein-1 distance between the predicted and the one-hot number distribution (see plot above).
NTL-MSE – Squared error between the expected numeric value (dot product of the predicted probabilities with the token values) and the ground truth (most intuitive, but has some undesired local minima).
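A minimal, self-contained PyTorch sketch of both schemes, assuming a toy vocabulary that consists only of the digit tokens 0-9 (the actual implementation in the repository handles full vocabularies and masks non-number positions):

import torch
import torch.nn.functional as F

# Assumed numeric values of the number tokens (here: the digits 0-9).
digit_values = torch.arange(10, dtype=torch.float32)

def ntl_was(number_probs, target_digit):
    """NTL-WAS: Wasserstein-1 distance between the predicted distribution
    over digit tokens and the one-hot distribution of the ground-truth digit."""
    target_onehot = F.one_hot(target_digit, num_classes=10).float()
    # For 1D distributions on equally spaced values, W1 equals the sum of
    # absolute differences between the two cumulative distribution functions.
    cdf_diff = torch.cumsum(number_probs, dim=-1) - torch.cumsum(target_onehot, dim=-1)
    return cdf_diff.abs().sum(dim=-1)

def ntl_mse(number_probs, target_digit):
    """NTL-MSE: squared error between the expected numeric value under the
    predicted distribution (a dot product) and the ground-truth value."""
    expected_value = (number_probs * digit_values).sum(dim=-1)
    # Note: any distribution whose expectation hits the target (e.g. equal
    # mass on 3 and 5 for target 4) reaches zero loss -- the undesired minima.
    return (expected_value - target_digit.float()) ** 2

# Ground-truth digit 4; prediction concentrated near (on 3) vs. far (on 9).
probs_near = torch.tensor([[0., 0., 0., 0.9, 0.1, 0., 0., 0., 0., 0.]])
probs_far  = torch.tensor([[0., 0., 0., 0., 0.1, 0., 0., 0., 0., 0.9]])
target = torch.tensor([4])
print(ntl_was(probs_near, target), ntl_was(probs_far, target))  # ≈ 0.9 vs. 4.5
print(ntl_mse(probs_near, target), ntl_mse(probs_far, target))  # ≈ 0.81 vs. 20.25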

Number Token Loss vs. Cross Entropy

Key Contributions & Results

Citation

@inproceedings{zausinger2025regress,
  title   = {Regress, Don't Guess – A Regression-like Loss on Number Tokens for Language Models},
  author  = {Jonas Zausinger and Lars Pennig and Anamarija Kozina and Sean Sdahl
             and Julian Sikora and Adrian Dendorfer and Timofey Kuznetsov
             and Mohamad Hagog and Nina Wiedemann and Kacper Chlodny
             and Vincent Limbach and Anna Ketteler and Thorben Prein
             and Vishwa Mohan Singh and Michael Danziger and Jannis Born},
  booktitle = {Proc. of the 42nd International Conference on Machine Learning (ICML)},
  year    = {2025},
  url     = {https://github.com/tum-ai/number-token-loss}
}