Number Tokenization

Summary

Number tokenization covers how models encode scalar numeric values as tokens, embeddings, or coefficient-space representations. In this wiki, the topic matters for point-wise time-series embeddings and for auxiliary numeric values such as exogenous variables, control inputs, interventions, metadata, and numeric prompts.

What The Wiki Currently Believes

EIDOS is the time-series anchor: each univariate scalar sample is mapped into a point-wise latent token through a sine-activated gated linear unit. The sine activation supplies bounded periodic basis responses, and the gate selects the useful ones. Because the mapping is point-wise, the token sequence keeps the original temporal resolution.
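A minimal sketch of a point-wise sine-gated embedding may help make this concrete. The exact EIDOS layer is not specified on this page, so the form below (a sine response multiplied by a sigmoid gate, each with its own per-channel weight and bias) is an assumption, and the names init_params, sine_glu_token, and embed_series are hypothetical. The weights are random rather than trained.

```python
import math
import random


def init_params(d_model, seed=0):
    """Random per-channel weights and biases (hypothetical, untrained)."""
    rng = random.Random(seed)
    names = ("w_sin", "b_sin", "w_gate", "b_gate")
    return {name: [rng.gauss(0.0, 1.0) for _ in range(d_model)] for name in names}


def sine_glu_token(x, params):
    """Map one scalar sample to a d_model-dimensional token."""
    token = []
    for w_s, b_s, w_g, b_g in zip(
        params["w_sin"], params["b_sin"], params["w_gate"], params["b_gate"]
    ):
        response = math.sin(w_s * x + b_s)               # bounded, periodic response
        gate = 1.0 / (1.0 + math.exp(-(w_g * x + b_g)))  # sigmoid gate in (0, 1)
        token.append(gate * response)
    return token


def embed_series(series, params):
    """One token per sample, so the sequence length is unchanged."""
    return [sine_glu_token(x, params) for x in series]


series = [0.1, -0.4, 1.3, 0.7, 2.0]
tokens = embed_series(series, init_params(d_model=8))
```

Each sample yields exactly one token, which is what "point-wise" means here: no patching or downsampling happens before the sequence model sees the data.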

FlowState is not a number tokenizer, but it is part of the same design space. It encodes time-series histories into coefficient space and uses a functional basis decoder to sample continuous forecasts at a requested temporal resolution.
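The idea of decoding from coefficient space can be sketched as follows. This is an illustration only: the Fourier basis, the normalized time axis, and the names basis and decode_forecast are assumptions, not FlowState's actual decoder. The point is that one coefficient vector defines a continuous function that can be sampled at any requested number of steps.

```python
import math


def basis(k, t):
    """k-th function of a small Fourier basis on normalized time t in (0, 1]."""
    if k == 0:
        return 1.0
    freq = (k + 1) // 2
    angle = 2.0 * math.pi * freq * t
    return math.sin(angle) if k % 2 == 1 else math.cos(angle)


def decode_forecast(coeffs, n_steps):
    """Sample the function sum_k coeffs[k] * basis(k, t) at n_steps points."""
    times = [(i + 1) / n_steps for i in range(n_steps)]
    return [sum(c * basis(k, t) for k, c in enumerate(coeffs)) for t in times]


coeffs = [0.5, 1.0, -0.3, 0.2, 0.1]   # hypothetical coefficients from an encoder
coarse = decode_forecast(coeffs, 4)   # forecast horizon sampled at 4 steps
fine = decode_forecast(coeffs, 8)     # same horizon sampled at 8 steps
```

The coarse and fine forecasts agree wherever their time points coincide, because both are samples of the same underlying function.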

The paper "Pre-trained Large Language Models Use Fourier Features To Compute Addition" provides the mechanistic motivation for Fourier number embeddings: pretrained LLMs can use low-frequency components for magnitude approximation and high-frequency components for modular arithmetic.
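A small worked example shows why periodic features are convenient for addition. It is not the circuit the paper identifies inside pretrained models; it only illustrates the underlying identity that multiplying two unit phases adds their angles, so the phase of the product encodes the sum modulo the period. The function names are hypothetical.

```python
import cmath
import math


def phase_feature(n, period):
    """Unit complex number whose angle encodes n modulo period."""
    return cmath.exp(2j * math.pi * n / period)


def add_mod_via_phase(a, b, period):
    """Recover (a + b) mod period from the product of the two phase features."""
    product = phase_feature(a, period) * phase_feature(b, period)
    angle = cmath.phase(product) % (2.0 * math.pi)
    # The final modulo maps an angle that rounds up to a full turn back to 0.
    return round(angle * period / (2.0 * math.pi)) % period


units_digit = add_mod_via_phase(7, 8, 10)        # short period: last digit of 15
three_digits = add_mod_via_phase(437, 285, 1000) # long period: tracks the full sum here
```

A short period isolates the units digit, while a long period varies slowly with the operands and so carries magnitude information, which matches the low-frequency and high-frequency roles described above.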

FoNE turns that observation into an explicit single-token number embedding using digit-aligned sine/cosine features. It is a strong proposal for compact, smooth, periodic scalar representations, especially for addition-like structure.
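The digit-aligned construction can be sketched for non-negative integers. The sketch assumes one cosine/sine pair per digit position with periods 10, 100, 1000, and so on; fractional digits, learned projections, and the actual FoNE decoding head are omitted, and fone_encode and fone_decode are hypothetical names. Decoding proceeds from the lowest digit upward so that rounding stays robust to floating-point error.

```python
import math


def fone_encode(x, n_digits):
    """Return [cos, sin] pairs for periods 10**1 .. 10**n_digits."""
    feats = []
    for i in range(1, n_digits + 1):
        angle = 2.0 * math.pi * x / 10**i
        feats += [math.cos(angle), math.sin(angle)]
    return feats


def fone_decode(feats):
    """Recover the integer digit by digit from the phase of each pair."""
    value = 0
    for i in range(1, len(feats) // 2 + 1):
        period = 10**i
        c, s = feats[2 * (i - 1)], feats[2 * (i - 1) + 1]
        # The phase gives x mod 10**i, up to floating-point error.
        residue = (math.atan2(s, c) % (2.0 * math.pi)) * period / (2.0 * math.pi)
        # Subtract the digits already known, then read off the next digit.
        digit = round((residue - value) / 10 ** (i - 1)) % 10
        value += digit * 10 ** (i - 1)
    return value


embedding = fone_encode(4721, n_digits=4)
recovered = fone_decode(embedding)
```

The whole number occupies a single fixed-width vector (two features per digit), which is what makes the representation a one-token embedding rather than a digit sequence.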

Convergent Evolution adds the diagnostic caution: visible Fourier spikes are widespread across model families and even raw number-token frequencies, but they are not sufficient for functional modular geometry. The page should distinguish spectral convergence from geometric convergence.
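The distinction can be illustrated with a toy case, which is not the paper's diagnostic. A single cosine feature with period 10 produces a clear spectral peak at that period, yet it cannot separate residues r and 10 - r because the cosine is symmetric. Adding the matching sine keeps the same peak and separates all ten residue classes. All function names are hypothetical.

```python
import math


def power_at_period(values, period):
    """Normalized Fourier power of a sequence at one period."""
    n = len(values)
    re = sum(v * math.cos(2.0 * math.pi * k / period) for k, v in enumerate(values))
    im = sum(v * math.sin(2.0 * math.pi * k / period) for k, v in enumerate(values))
    return (re * re + im * im) / (n * n)


def distinct_residue_centroids(embed, period, n_numbers=100):
    """Count residue classes whose mean embedding is distinguishable."""
    sums, counts = {}, {}
    for n in range(n_numbers):
        r = n % period
        vec = embed(n)
        acc = sums.setdefault(r, [0.0] * len(vec))
        for j, v in enumerate(vec):
            acc[j] += v
        counts[r] = counts.get(r, 0) + 1
    centroids = {
        tuple(round(s / counts[r], 6) for s in acc) for r, acc in sums.items()
    }
    return len(centroids)


def cos_only(n):
    return [math.cos(2.0 * math.pi * n / 10)]


def cos_and_sin(n):
    angle = 2.0 * math.pi * n / 10
    return [math.cos(angle), math.sin(angle)]


spike_cos_only = power_at_period([cos_only(n)[0] for n in range(100)], 10)
spike_cos_sin = power_at_period([cos_and_sin(n)[0] for n in range(100)], 10)
classes_cos_only = distinct_residue_centroids(cos_only, 10)
classes_cos_sin = distinct_residue_centroids(cos_and_sin, 10)
```

Both embeddings show the same spectral peak on their first coordinate, but only the second gives every residue class its own location. A probe that reports only the spectrum would not tell them apart.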

BitTokens takes the opposite route: expose the IEEE 754 binary representation directly so the model can learn bit-level algorithms for comparison and arithmetic. It argues that Fourier embeddings are elegant but not general-purpose for multiplication and division.
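Exposing the binary representation can be sketched with the Python standard library. The sketch assumes the 64-bit double-precision format (1 sign bit, 11 exponent bits, 52 fraction bits); how BitTokens feeds these bits to the model is not shown, and the function names are hypothetical.

```python
import struct


def float_to_bits(x):
    """Return the 64 IEEE 754 double-precision bits of x, most significant first."""
    (as_int,) = struct.unpack(">Q", struct.pack(">d", x))
    return [(as_int >> (63 - i)) & 1 for i in range(64)]


def bits_to_float(bits):
    """Inverse of float_to_bits."""
    as_int = 0
    for bit in bits:
        as_int = (as_int << 1) | bit
    (value,) = struct.unpack(">d", struct.pack(">Q", as_int))
    return value


def split_fields(bits):
    """Split into sign (1 bit), exponent (11 bits), and fraction (52 bits)."""
    return bits[0], bits[1:12], bits[12:]


sign, exponent, fraction = split_fields(float_to_bits(1.0))
round_trip = bits_to_float(float_to_bits(-6.25))
```

The round trip is lossless for any finite double, which is the property that makes bit-level inputs attractive when exact values matter.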

TabM adds a static-tabular branch of the same design space. Its numerical feature embeddings are not text-number tokens: they are typed per-feature encoders for continuous table columns. The useful options are raw scalar inputs after preprocessing, LinearReLUEmbeddings, updated piecewise-linear embeddings with feature-specific bins, and periodic embeddings with learned frequencies and cosine/sine activations.

Continuous Feature Embedding Options

TabM and its rtdl_num_embeddings dependency are useful because they separate several choices that are often blurred together:

  • Raw scalar input keeps each numeric feature as one scalar after preprocessing. This is cheap and strong enough for many MLP-style baselines, but it gives the model no explicit local basis over a feature’s value range.
  • Linear-ReLU embeddings map each scalar feature through a per-feature linear projection plus ReLU. This is the lightest learned non-linear feature interface.
  • Piecewise-linear embeddings compute feature-specific bins, encode the scalar by its position within those bins, then learn an embedding over the piecewise-linear representation. This is attractive when local thresholds, monotone regions, or quantile structure matter.
  • Periodic embeddings / PLR learn frequencies, apply cosine/sine features, then project them with an outer linear layer. This is the tabular analogue closest to Fourier-style number embeddings, but its semantics are feature-specific rather than digit-specific.
  • Fixed piecewise-linear encoding and plain linear embeddings are available in the embedding package as related options, but the current TabM implementation only accepts LinearReLUEmbeddings, PiecewiseLinearEmbeddings, and PeriodicEmbeddings as num_embeddings.
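Two of the options above can be sketched without any library. The code below is not the rtdl_num_embeddings API: quantile_edges, piecewise_linear_encode, and periodic_features are hypothetical helpers, the encoding clamps every bin to [0, 1] as a simplification, and the fixed frequencies stand in for parameters that would be learned. The learned projection that follows either encoding is omitted.

```python
import math


def quantile_edges(values, n_bins):
    """Feature-specific bin edges taken at evenly spaced ranks of the training values."""
    ordered = sorted(values)
    last = len(ordered) - 1
    edges = [ordered[round(q * last / n_bins)] for q in range(n_bins + 1)]
    return sorted(set(edges))  # drop duplicate edges from repeated values


def piecewise_linear_encode(x, edges):
    """One component per bin: 0 below the bin, 1 above it, linear inside it."""
    encoded = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        frac = (x - lo) / (hi - lo)
        encoded.append(min(1.0, max(0.0, frac)))
    return encoded


def periodic_features(x, frequencies):
    """Cosine/sine features at the given frequencies (fixed here, learned in practice)."""
    feats = []
    for f in frequencies:
        angle = 2.0 * math.pi * f * x
        feats += [math.cos(angle), math.sin(angle)]
    return feats


train_column = list(range(0, 31))             # hypothetical training values for one feature
edges = quantile_edges(train_column, n_bins=3)
ple = piecewise_linear_encode(15.0, edges)
plr_input = periodic_features(0.25, [1.0, 2.0])
```

Because the edges are computed per column, two features with the same raw value can receive different encodings, which is the feature-specific behavior the list describes.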

Design Implications

Numeric values should not be treated as one uniform modality. A time-series observation, an exogenous variable, a numeric control input, and a causal intervention can all be scalar numbers, but they impose different requirements on geometry, decoding, smoothness, and exactness.

Smooth point-wise embeddings are attractive for noisy observations and continuous trajectories. Fourier or other periodic bases are natural for cyclic values, phase, and digit/modular structure. Bit-level encodings are attractive when the model must perform exact arithmetic or preserve a wide numeric range. Continuous basis decoders are useful when the output itself should be a resampleable function rather than a fixed horizon of tokens.

Typed continuous feature embeddings are the more natural default for auxiliary numeric values whose identity is known in advance. A blood-glucose value, a price, a temperature, a drug dose, a control setpoint, and a calendar feature may all be numeric, but their bins, scales, periodicities, and safe extrapolation behavior differ. TabM-style per-feature embeddings make that distinction explicit, while FoNE and BitTokens focus on representing free-standing numerals.
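A typed interface can be as simple as a registry that assigns each named feature its own encoder. The sketch below is illustrative only: the feature names, the choice of encoder per feature, and the normalization constants are all hypothetical.

```python
import math


def encode_cyclic(value, period):
    """Place a cyclic value on the unit circle so both ends of the cycle meet."""
    angle = 2.0 * math.pi * value / period
    return [math.cos(angle), math.sin(angle)]


def encode_log_scaled(value):
    """Compress a non-negative, heavy-tailed value."""
    return [math.log1p(value)]


def encode_standardized(value, mean, std):
    """Center and scale with statistics assumed to come from training data."""
    return [(value - mean) / std]


# Hypothetical registry: one encoder per known feature identity.
FEATURE_ENCODERS = {
    "hour_of_day": lambda v: encode_cyclic(v, 24.0),
    "drug_dose_mg": encode_log_scaled,
    "temperature_c": lambda v: encode_standardized(v, 15.0, 10.0),
}


def encode_context(context):
    """Encode each auxiliary value with the encoder registered for its name."""
    return {name: FEATURE_ENCODERS[name](value) for name, value in context.items()}


encoded = encode_context(
    {"hour_of_day": 6.0, "drug_dose_mg": 0.0, "temperature_c": 25.0}
)
```

The same scalar would be encoded differently depending on which feature it belongs to, whereas a free-standing numeral embedding has no feature identity to condition on.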

Evidence

The current evidence differs by domain. EIDOS and FlowState support numeric time-series representation design through forecasting results. FoNE and BitTokens support specialized number representations through controlled language-model arithmetic experiments. Convergent Evolution sharpens the Fourier branch by showing that spectral periodicity can be a superficial artifact unless residue classes are geometrically usable. TabM supports typed continuous-feature embeddings through static tabular prediction benchmarks. The 2024 Fourier-features paper explains why pretrained language models may already contain useful periodic number structure, but it does not prove that the same mechanism is sufficient for time-series foundation models.

Relation To Foundation TSFM Agenda

This page maps to the point-wise numeric embedding and auxiliary numeric interface slots in the Foundation Time-Series Model Research Agenda.

| Agenda slot | Verdict | Evidence | Missing pieces |
| --- | --- | --- | --- |
| Point-wise numeric embeddings | partially closes | EIDOS keeps scalar observations at native temporal resolution; FlowState adds coefficient-space forecasting. | Need high-dimensional, irregular, and action-conditioned evidence. |
| Context interface | adjacent | Typed feature embeddings clarify how numeric exogenous variables, units, doses, setpoints, and metadata may differ from free-standing numerals. Convergent Evolution adds that representation diagnostics should test usable geometry, not only periodic spectra. | Needs explicit interfaces for measurement units, channel semantics, and typed numeric context. |
| Causal structure, counterfactuals, and control | warning | Numeric control inputs and intervention intensities matter, but number tokenization alone does not make a model action-conditioned. | Needs datasets and models where numeric actions or control inputs change future trajectories. |
| Generation and editing fidelity | adjacent | Continuous bases and point-wise encodings preserve more numeric detail than text tokenization. | Needs direct generation/editing benchmarks with dense-detail preservation. |

Open Questions

  • Which scalar encoding should be used for known future exogenous variables versus controllable numeric actions or interventions?
  • Can a time-series foundation model route between point-wise smooth embeddings, Fourier bases, bit-level number tokens, and learned continuous bases?
  • Which numeric representation probes distinguish spectral periodicity from task-usable geometry for TSFM scalar values?
  • When should auxiliary numeric values use feature-specific piecewise-linear bins rather than universal Fourier or bit-level number tokens?
  • How should measurement units, missingness, uncertainty, normalization, and sign be represented in auxiliary numeric values?
  • Does exact arithmetic help forecasting and world modeling, or only symbolic numeric tasks?