Pre-trained Large Language Models Use Fourier Features To Compute Addition

Source

Core Claim

This paper argues that pretrained LLMs compute simple addition through Fourier features: low-frequency components approximate answer magnitude, while high-frequency components support modular classification such as unit-digit or parity decisions.
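To make the modular half of the claim concrete, the sketch below shows how a single periodic feature pair can carry parity or the unit digit of an integer. It is an illustration only: the function names and the choice of periods 2 and 10 are assumptions of this note, not code or parameters taken from the paper.

```python
import math

def fourier_features(n, periods):
    """Return [cos(2*pi*n/T), sin(2*pi*n/T)] for each period T in `periods`."""
    feats = []
    for T in periods:
        angle = 2 * math.pi * n / T
        feats.extend([math.cos(angle), math.sin(angle)])
    return feats

def parity_from_features(n):
    # Period-2 component: cos(pi * n) is +1 for even n and -1 for odd n.
    cos2, _ = fourier_features(n, [2])
    return 0 if cos2 > 0 else 1

def unit_digit_from_features(n):
    # Period-10 component: the phase of (cos, sin) encodes n mod 10.
    cos10, sin10 = fourier_features(n, [10])
    phase = math.atan2(sin10, cos10) % (2 * math.pi)
    return round(phase * 10 / (2 * math.pi)) % 10
```

For example, `unit_digit_from_features(37)` recovers 7 from the period-10 phase alone, without access to the magnitude of the number. A component with a long period behaves the opposite way: its phase changes slowly with n, so it separates far-apart numbers but not neighbours.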

Key Contributions

  • Studies fine-tuned GPT-2-XL on addition over numbers that fit into single GPT-2 tokens.
  • Uses layer-wise readouts to show the model progressively refines answers rather than directly retrieving memorized sums.
  • Applies Fourier analysis to the logit contributions of MLP and attention layers, finding sparse periodic components.
  • Shows causal importance by filtering out low- or high-frequency components and observing the effect on the model's answers.
  • Identifies pretrained number-token embeddings as a source of useful Fourier-like inductive bias.
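
The analysis and filtering steps listed above can be sketched on a toy signal. The block below is a minimal illustration, assuming a synthetic vector of "logits" over candidate answers 0..99 rather than real model activations; the helper names and the cutoff of 3 cycles are hypothetical choices made for this note.

```python
import cmath
import math

def dft(x):
    """Plain O(N^2) discrete Fourier transform over the number axis."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * math.pi * k * n / N) for n in range(N))
            for k in range(N)]

def idft(X):
    """Inverse DFT, returning the real part of each sample."""
    N = len(X)
    return [sum(X[k] * cmath.exp(2j * math.pi * k * n / N) for k in range(N)).real / N
            for n in range(N)]

def filter_band(x, keep):
    """Zero every DFT bin whose frequency min(k, N-k) fails the `keep` predicate."""
    N = len(x)
    X = dft(x)
    return idft([X[k] if keep(min(k, N - k)) else 0 for k in range(N)])

N = 100
# Synthetic logits (not model data): one slow cycle over the range
# plus a weaker period-10 wave (10 cycles over the range).
logits = [math.cos(2 * math.pi * n / N) + 0.5 * math.cos(2 * math.pi * n / 10)
          for n in range(N)]

# Sparse spectrum: only a few bins carry non-negligible energy.
spectrum = dft(logits)
peaks = [k for k in range(N // 2 + 1) if abs(spectrum[k]) > 1e-6]

# Low-pass keeps the slow wave; high-pass keeps the period-10 wave.
low = filter_band(logits, lambda f: f <= 3)
high = filter_band(logits, lambda f: f > 3)
```

On this toy input the spectrum has exactly two active bins, and each filter isolates one of the two waves. In a real study the same filtering would be applied to model components and judged by the change in task accuracy, which this sketch does not attempt.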

Method Notes

This is the mechanistic upstream source for FoNE. FoNE turns the descriptive observation into an explicit embedding design: if pretrained LLMs naturally organize number tokens with Fourier-like components, a number embedding can build those components directly.
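
A minimal sketch of what "building those components directly" could look like follows. The periods (powers of 10, one cos/sin pair per digit) and the decoding rule are assumptions made for illustration; this note does not specify FoNE's exact construction, so the block should not be read as its implementation.

```python
import math

def number_embedding(x, num_digits=3):
    """Concatenate [cos, sin] of 2*pi*x / 10**i for i = 1..num_digits (assumed periods)."""
    emb = []
    for i in range(1, num_digits + 1):
        angle = 2 * math.pi * x / 10 ** i
        emb.extend([math.cos(angle), math.sin(angle)])
    return emb

def decode_number(emb):
    """Recover a non-negative integer below 10**num_digits from a noiseless embedding."""
    num_digits = len(emb) // 2
    value = 0
    for i in range(1, num_digits + 1):
        cos_i, sin_i = emb[2 * (i - 1)], emb[2 * (i - 1) + 1]
        phase = math.atan2(sin_i, cos_i) % (2 * math.pi)
        # The phase of the period-10**i pair encodes x mod 10**i.
        residue = round(phase / (2 * math.pi) * 10 ** i) % 10 ** i
        digit = residue // 10 ** (i - 1)
        value += digit * 10 ** (i - 1)
    return value
```

With three digit slots the embedding has six dimensions and round-trips every integer from 0 to 999. Values at or above 10**num_digits wrap around, which is one reason such an embedding is not automatically suitable for unbounded or continuous quantities.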

The time-series analogy is limited but useful. Periodic components are natural for scalar values with modular or cyclic structure, but a sensor value, exogenous variable, control input, or intervention may require different geometry depending on whether the model needs interpolation, exact arithmetic, or causal sensitivity.

Convergent Evolution later sharpens this boundary: similar Fourier spikes can appear in embeddings or raw token frequencies while linear probes for the corresponding modular residues remain near chance.

Evidence And Results

The paper reports that a fine-tuned GPT-2-XL model reaches high addition accuracy and that its internal computation decomposes into approximation and classification components. MLP layers primarily approximate magnitude with lower-frequency features, while attention layers primarily contribute modular operations with higher-frequency features.
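
The value of splitting the computation this way can be shown with simple arithmetic: a rough magnitude estimate plus an exact unit digit is enough to pin down the answer. The function below is a hypothetical illustration of that logic, assuming the estimate is off by less than half the modulus; it is not a description of the model's actual circuit.

```python
def combine(approx, unit_digit, modulus=10):
    """Return the integer nearest to `approx` that is congruent to `unit_digit` mod `modulus`."""
    base = round(approx)
    offset = (unit_digit - base) % modulus  # value in 0..modulus-1
    if offset > modulus // 2:
        offset -= modulus  # step down instead of up when that is closer
    return base + offset
```

Take 58 + 79 = 137. If the approximation path yields about 139.2 and the classification path yields a unit digit of 7, `combine(139.2, 7)` returns 137. The reconstruction only fails when the magnitude error reaches half the modulus, which is why both a reasonably accurate low-frequency estimate and an exact high-frequency residue are needed.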

The pretraining comparison is central: models trained from scratch do not show the same Fourier-feature pattern and achieve lower accuracy, while introducing pretrained token embeddings improves performance.

Limitations

The experiments focus on addition and numbers constrained by GPT-2 tokenization. The mechanism should not be generalized to all numeric reasoning, multiplication, time-series forecasting, or auxiliary numeric value encoding without additional evidence.

Foundation TSFM Relevance

| Agenda slot | Verdict | Evidence | Missing pieces |
| --- | --- | --- | --- |
| Point-wise numeric embeddings | adjacent | Shows pretrained number-token embeddings and hidden states contain sparse Fourier components useful for addition. | Mechanism is tied to LLM token numbers and not tested as a time-series scalar embedding. |
| Representation quality | warning | Low-frequency components approximate magnitude while high-frequency components support modular classification. | Evidence is mostly bounded addition; it may not transfer to continuous sensors or control inputs. |
| Benchmarks: what level of modeling is tested? | insufficient evidence | Arithmetic interpretability explains one numeric operation family. | No forecasting, generation, context, or action-conditioned time-series benchmark. |

Open Questions

  • Do similar Fourier-like number features appear for continuous sensor values, exogenous variables, or control inputs in time-series models?
  • Which numeric operations need low-frequency magnitude approximation, high-frequency modular classification, bit-level logic, or all three?
  • Can Fourier-feature analysis diagnose failures in point-wise time-series embeddings?