Jonas Zausinger (TUM.ai / TUM)
Lars Pennig (TUM.ai / TUM / Helmholtz / MCML)
Anamarija Kozina (TUM.ai / TUM)
Sean Sdahl (TUM.ai / TUM)
Julian Sikora (TUM.ai / TUM)
Adrian Dendorfer (TUM.ai / TUM)
Timofey Kuznetsov (TUM.ai / TUM)
Mohamad Hagog (TUM.ai / LMU)
Nina Wiedemann (ETH Zurich)
Kacper Chlodny (TUM.ai / TUM)
Vincent Limbach (TUM.ai / TUM)
Anna Ketteler (TUM.ai / TUM)
Thorben Prein (TUM.ai / TUM)
Vishwa Mohan Singh (TUM.ai / LMU)
Michael Danziger (IBM Research Europe)
Jannis Born (IBM Research Europe)
While language models have exceptional capabilities at text generation, they lack a natural inductive bias for emitting numbers and thus struggle in tasks involving quantitative reasoning, especially arithmetic. One fundamental limitation is the nature of the Cross Entropy loss, which assumes a nominal scale and thus cannot convey proximity between generated number tokens. In response, we present a regression-like loss that operates purely at the token level. Our proposed Number Token Loss (NTL) comes in two flavors and minimizes either the Lp norm or the Wasserstein distance between the numerical values of the real and predicted number tokens. NTL can easily be added to any language model and extends the Cross Entropy objective during training without runtime overhead. We evaluate the proposed scheme on various mathematical datasets and find that it consistently improves performance in math-related tasks. In a direct comparison on a regression task, we find that NTL can match the performance of a regression head, despite operating at the token level. Finally, we scale NTL up to 3B-parameter models and observe improved performance, demonstrating its potential for seamless integration into LLMs. We hope that this work can inspire LLM developers to improve their pretraining objectives.
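Concretely, NTL is used as an additive term next to the usual Cross Entropy objective. Below is a minimal PyTorch sketch of what that combination could look like, assuming per-token logits; `number_token_loss`, `ntl_weight`, and the weight value are illustrative names and choices, not the repository's API.

```python
import torch.nn.functional as F

def training_loss(logits, targets, number_token_loss, ntl_weight=0.3):
    """Cross Entropy on all tokens plus a weighted NTL term.

    logits:  (batch, seq_len, vocab) raw model outputs
    targets: (batch, seq_len) ground-truth token ids
    number_token_loss: callable implementing one of the NTL variants below
    ntl_weight: illustrative weighting factor, not a value from the paper
    """
    ce = F.cross_entropy(logits.flatten(0, 1), targets.flatten())
    ntl = number_token_loss(logits, targets)
    return ce + ntl_weight * ntl
```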
Cross Entropy is nominal-scale and thus assigns equal loss to all incorrect predictions. This makes sense for normal tokens but not for number tokens: if the ground truth is 4, predicting 3 or 9 should not give equal loss 🤔😱
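A toy illustration of this point (our own example, not from the paper): Cross Entropy only depends on the probability assigned to the correct token, so two predictions that put the same mass on 4, with the rest concentrated near 3 in one case and near 9 in the other, receive exactly the same loss.

```python
import torch
import torch.nn.functional as F

# Toy vocabulary: the ten digit tokens 0..9; the ground-truth token is "4".
target = torch.tensor([4])

# Two predicted distributions with identical probability on "4",
# but the remaining mass sits near "3" (close) vs. near "9" (far).
p_close = torch.tensor([[.02, .02, .02, .60, .20, .02, .03, .03, .03, .03]])
p_far   = torch.tensor([[.02, .02, .02, .03, .20, .02, .03, .03, .03, .60]])

ce_close = F.nll_loss(p_close.log(), target)
ce_far   = F.nll_loss(p_far.log(), target)
print(ce_close.item(), ce_far.item())  # identical: CE ignores numeric proximity
```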
For all number tokens, NTL increases with the distance from the ground truth, just like a regression loss. But it does not need an extra head: it allows computing a regression-like loss directly on the token head.
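A minimal sketch of how such a loss can be read off the token head, assuming the ten digit tokens 0-9 sit at known vocabulary indices; this corresponds to the NTL-MSE variant described below, and the names (`ntl_mse`, `digit_ids`, `digit_values`) are illustrative rather than the repository's API.

```python
import torch.nn.functional as F

def ntl_mse(logits, target_values, digit_ids, digit_values):
    """Regression-like loss computed directly on the token head.

    logits:        (N, vocab) model outputs at positions whose label is a digit
    target_values: (N,) numeric value of the ground-truth digit at each position
    digit_ids:     (10,) vocabulary indices of the tokens "0".."9"
    digit_values:  (10,) their numeric values, i.e. 0.0 .. 9.0
    """
    # Distribution over the digit tokens (renormalizing over digits is one
    # reasonable choice here; the actual implementation may differ).
    probs = F.softmax(logits[:, digit_ids], dim=-1)
    # Dot-product expectation of the numeric value under that distribution.
    expected = probs @ digit_values
    # Squared error grows with the distance from the ground-truth value.
    return ((expected - target_values) ** 2).mean()
```

On the toy distributions above, `ntl_mse(p_close.log(), torch.tensor([4.]), torch.arange(10), torch.arange(10.))` evaluates to about 0.12, whereas the same call with `p_far` gives about 9.5, even though Cross Entropy treats both identically.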
We propose two schemes:
NTL-WAS – Wasserstein-1 distance between the predicted and the one-hot number distribution (see plot above and the sketch below).
NTL-MSE – squared error between the true value and the dot-product expectation of the numeric value under the predicted distribution (most intuitive, but has some undesired local minima).
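Under the same illustrative assumptions as above, the Wasserstein variant can be sketched as follows: because the digit values are sorted and unit-spaced, the Wasserstein-1 distance between the predicted distribution and the one-hot target reduces to the summed absolute difference of their CDFs.

```python
import torch
import torch.nn.functional as F

def ntl_was(logits, target_values, digit_ids, digit_values):
    """Wasserstein-1 distance between the predicted distribution over digit
    tokens and the one-hot distribution of the ground-truth digit."""
    probs = F.softmax(logits[:, digit_ids], dim=-1)                    # (N, 10)
    one_hot = (digit_values[None, :] == target_values[:, None]).float()
    # For sorted, unit-spaced support points, W1 is the L1 distance of the CDFs.
    cdf_diff = torch.cumsum(probs, dim=-1) - torch.cumsum(one_hot, dim=-1)
    return cdf_diff.abs().sum(dim=-1).mean()
```

For instance, splitting the predicted mass evenly between 3 and 5 when the target is 4 yields zero NTL-MSE (the expectation is exactly 4) but an NTL-WAS of 1, illustrating the kind of undesired local minima mentioned above.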
@inproceedings{zausinger2025regress,
  title     = {Regress, Don't Guess – A Regression-like Loss on Number Tokens for Language Models},
  author    = {Jonas Zausinger and Lars Pennig and Anamarija Kozina and Sean Sdahl and Julian Sikora and Adrian Dendorfer and Timofey Kuznetsov and Mohamad Hagog and Nina Wiedemann and Kacper Chlodny and Vincent Limbach and Anna Ketteler and Thorben Prein and Vishwa Mohan Singh and Michael Danziger and Jannis Born},
  booktitle = {Proc. of the 42nd International Conference on Machine Learning (ICML)},
  year      = {2025},
  url       = {https://github.com/tum-ai/number-token-loss}
}