Pre-trained Large Language Models Use Fourier Features To Compute Addition

Source

Raw Markdown: paper_llms-use-fourier-features-addition-2024.md
PDF: paper_llms-use-fourier-features-addition-2024.pdf
Preprint: arXiv 2406.03445

Core Claim

This paper argues that pretrained LLMs compute simple addition through Fourier features: low-frequency components approximate answer magnitude, while high-frequency components support modular classification such as unit-digit or parity decisions.
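
One way to make this concrete (the notation is assumed here for illustration, not quoted from the paper) is to write the Fourier feature of an integer $n$ at period $T$ as

$$
\phi_T(n) = \bigl(\cos(2\pi n / T),\ \sin(2\pi n / T)\bigr).
$$

Because $\phi_T(n)$ depends only on $n \bmod T$, short periods act as modular classifiers: $\phi_2(n) = ((-1)^n, 0)$ encodes parity, and $\phi_{10}(n)$ takes one of ten values indexed by the unit digit. For a long period, $T \gg n$, the small-angle approximation gives

$$
\phi_T(n) \approx \bigl(1,\ 2\pi n / T\bigr),
$$

so the feature varies almost linearly with $n$: it tracks magnitude but barely separates neighboring integers.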

Key Contributions

Studies fine-tuned GPT-2-XL on addition over numbers that fit into single GPT-2 tokens.
Uses layer-wise readouts to show the model progressively refines answers rather than directly retrieving memorized sums.
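Applies Fourier analysis to the logits contributed by MLP and attention layers, finding that they are dominated by a small number of periodic components.
Tests causal importance by filtering out low- or high-frequency components and observing the effect on the model's predictions.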
Identifies pretrained number-token embeddings as a source of useful Fourier-like inductive bias.
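
The frequency-filtering idea in the list above can be illustrated on a toy signal. The sketch below is not the paper's procedure or data: the "logits" are synthetic, and the names `filter_band`, `CUTOFF`, and `TARGET` are hypothetical. It only shows what a low-pass and a high-pass filter over the candidate-answer axis each retain.

```python
import cmath
import math

def dft(x):
    """Naive O(n^2) discrete Fourier transform of a real sequence."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(X):
    """Inverse DFT, returning only the real part of each sample."""
    n = len(X)
    return [sum(X[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)).real / n
            for t in range(n)]

def filter_band(x, keep):
    """Keep frequency k only if keep(min(k, n - k)) is true; zero the rest."""
    n = len(x)
    X = dft(x)
    return idft([X[k] if keep(min(k, n - k)) else 0 for k in range(n)])

def argmax(x):
    """Index of the largest entry."""
    return max(range(len(x)), key=lambda t: x[t])

# Synthetic "logits" over candidate answers 0..99 (made up, not model output):
# a broad bump around TARGET plus a period-10 ripple peaking where t % 10 == 7.
N, TARGET = 100, 57
logits = [math.exp(-((t - TARGET) / 15) ** 2)
          + 0.5 * math.cos(2 * math.pi * (t - TARGET) / 10) for t in range(N)]

CUTOFF = 5  # hypothetical band boundary, in cycles per N
low = filter_band(logits, lambda k: k < CUTOFF)    # retains the broad bump
high = filter_band(logits, lambda k: k >= CUTOFF)  # retains the period-10 ripple

print(argmax(logits), argmax(low), argmax(high) % 10)
```

In this toy setup the low-pass signal should still peak near the target but is almost flat around it, while the high-pass signal only identifies the answer modulo 10. Neither band alone pins down the answer sharply, which is the intuition behind removing one band at a time.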

Method Notes

This is the mechanistic upstream source for FoNE. FoNE turns the descriptive observation into an explicit embedding design: if pretrained LLMs naturally organize number tokens with Fourier-like components, a number embedding can build those components directly.
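
As a rough illustration of what "building those components directly" could look like, the sketch below embeds an integer as one (cos, sin) pair per decimal period and reads the digits back. It is an assumption-laden toy, not FoNE's actual implementation: the function names `fourier_embed` and `fourier_decode` and the choice of periods 10, 100, 1000, ... are hypothetical choices made for this example.

```python
import math

def fourier_embed(x, num_digits):
    """Hypothetical embedding: one (cos, sin) pair for each period 10, 100, ..."""
    emb = []
    for i in range(1, num_digits + 1):
        period = 10 ** i
        angle = 2 * math.pi * x / period
        emb.extend([math.cos(angle), math.sin(angle)])
    return emb

def fourier_decode(emb):
    """Recover the integer by reading each decimal digit from its own pair."""
    digits = []
    for i in range(1, len(emb) // 2 + 1):
        period = 10 ** i
        c, s = emb[2 * (i - 1)], emb[2 * (i - 1) + 1]
        angle = math.atan2(s, c) % (2 * math.pi)
        # The pair with period 10**i encodes x mod 10**i.
        residue = round(angle * period / (2 * math.pi)) % period
        # Its leading decimal digit is the i-th digit of x from the right.
        digits.append(residue // (period // 10))
    return sum(d * 10 ** k for k, d in enumerate(digits))

emb = fourier_embed(407, num_digits=3)
print(len(emb), fourier_decode(emb))
```

The embedding length grows linearly with the number of digits rather than with the size of the vocabulary of numbers. The round trip here relies on noise-free features; how robust such a readout is inside a trained model is a separate question.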

The time-series analogy is limited but useful. Periodic components are natural for scalar values with modular or cyclic structure, but a sensor value, exogenous variable, control input, or intervention may require different geometry depending on whether the model needs interpolation, exact arithmetic, or causal sensitivity.

Convergent Evolution later sharpens this boundary: similar Fourier spikes can appear in embeddings or raw token frequencies while mod-$T$ linear probes remain near chance.

Evidence And Results

The paper reports that a fine-tuned GPT-2-XL model reaches high addition accuracy and that its internal computation decomposes into approximation and classification components. MLP layers primarily approximate magnitude with lower-frequency features, while attention layers primarily contribute modular operations with higher-frequency features.
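
To see why an approximation component and a classification component are complementary, consider the toy combination below. It is an illustration only, not a description of how GPT-2-XL merges the two signals; the helper `combine` is hypothetical. Given a coarse magnitude estimate and the answer's residue modulo 10, it returns the nearest integer with that residue.

```python
def combine(approx, residue, modulus=10):
    """Return the integer congruent to `residue` (mod `modulus`) nearest to `approx`."""
    return round((approx - residue) / modulus) * modulus + residue

# True answer 137: a coarse estimate plus the unit digit 7 is enough.
print(combine(135.2, 7))
print(combine(141.9, 7))
# If the coarse estimate is off by more than half the modulus, the result is wrong.
print(combine(131.0, 7))
```

The coarse estimate only needs to land within half the modulus of the true answer, and the modular signal resolves the rest. This is one reason removing either component could be expected to hurt accuracy.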

The pretraining comparison is central: models trained from scratch do not show the same Fourier-feature pattern and achieve lower accuracy, while introducing pretrained token embeddings improves performance.

Limitations

The experiments focus on addition and numbers constrained by GPT-2 tokenization. The mechanism should not be generalized to all numeric reasoning, multiplication, time-series forecasting, or auxiliary numeric value encoding without additional evidence.

Foundation TSFM Relevance

Agenda slot	Verdict	Evidence	Missing pieces
Point-wise numeric embeddings	adjacent	Shows pretrained number-token embeddings and hidden states contain sparse Fourier components useful for addition.	Mechanism is tied to LLM token numbers and not tested as a time-series scalar embedding.
Representation quality	warning	Low-frequency components approximate magnitude while high-frequency components support modular classification.	Evidence is mostly bounded addition; it may not transfer to continuous sensors or control inputs.
Benchmarks: what level of modeling is tested?	insufficient evidence	Arithmetic interpretability explains one numeric operation family.	No forecasting, generation, context, or action-conditioned time-series benchmark.

Links Into The Wiki

Open Questions

Do similar Fourier-like number features appear for continuous sensor values, exogenous variables, or control inputs in time-series models?
Which numeric operations need low-frequency magnitude approximation, high-frequency modular classification, bit-level logic, or all three?
Can Fourier-feature analysis diagnose failures in point-wise time-series embeddings?
