Pre-trained Large Language Models Use Fourier Features To Compute Addition
Source
- Raw Markdown: paper_llms-use-fourier-features-addition-2024.md
- PDF: paper_llms-use-fourier-features-addition-2024.pdf
- Preprint: arXiv 2406.03445
Core Claim
This paper argues that pretrained LLMs compute simple addition through Fourier features: low-frequency components approximate answer magnitude, while high-frequency components support modular classification such as unit-digit or parity decisions.
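A toy sketch can make this division of labor concrete. The periods, the candidate range, and the helper `fourier_logits` below are illustrative assumptions, not values or code from the paper: logits over candidate answers are built as a sum of cosines, each of which peaks wherever the candidate matches the true answer modulo its period.

```python
import numpy as np

N = 200                      # hypothetical answer range 0..N-1
candidates = np.arange(N)
answer = 137                 # hypothetical true sum

def fourier_logits(answer, periods):
    """Sum of cosines over candidates; each term is maximal where
    candidate == answer (mod T) for its period T."""
    return sum(np.cos(2 * np.pi * (candidates - answer) / T) for T in periods)

# Assumed split: long periods carry magnitude, short periods carry residues.
low = fourier_logits(answer, periods=[200, 100, 50])
high = fourier_logits(answer, periods=[2, 5, 10])

# High-frequency terms alone cannot tell apart candidates sharing a unit digit.
high_ties = candidates[np.isclose(high, high.max())]

# Low-frequency terms alone give a broad peak: neighbors score almost as high.
low_top3 = np.sort(np.argsort(low)[-3:])
low_gap = low[answer] - low[answer - 1]

# Together, the broad peak selects the region and the residues select the digit.
combined = low + high
prediction = int(np.argmax(combined))
combined_gap = combined[answer] - np.sort(combined)[-2]
```

In this sketch the high-frequency logits tie across every candidate ending in the same digit, the low-frequency logits separate the answer from its neighbors only by a small margin, and the sum has a single well-separated maximum at the answer.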
Key Contributions
- Studies fine-tuned GPT-2-XL on addition over numbers that fit into single GPT-2 tokens.
- Uses layer-wise readouts to show the model progressively refines answers rather than directly retrieving memorized sums.
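- Applies Fourier analysis over the number-token axis to the logits contributed by MLP and attention layers, finding that they are dominated by a sparse set of periodic components.
- Shows causal importance by filtering out low- or high-frequency components and observing the effect on the model's predictions.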
- Identifies pretrained number-token embeddings as a source of useful Fourier-like inductive bias.
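The Fourier analysis and frequency filtering described above can be sketched on synthetic data. Everything here is an assumption made for illustration: `signal` is a stand-in for one row of logits over number tokens, the planted periods and the `CUTOFF` index are arbitrary, and `keep_band` is a hypothetical helper rather than the paper's filtering procedure.

```python
import numpy as np

N = 200                                  # hypothetical number-token count
n = np.arange(N)
rng = np.random.default_rng(0)

# Synthetic stand-in for logits over number tokens: three planted periods plus noise.
signal = (
    1.0 * np.cos(2 * np.pi * n / 10)
    + 0.8 * np.cos(2 * np.pi * n / 2)
    + 0.6 * np.cos(2 * np.pi * n / 100)
    + 0.05 * rng.standard_normal(N)
)

# Discrete Fourier transform along the number axis.
spectrum = np.fft.rfft(signal)
magnitude = np.abs(spectrum)

# Sparse structure: the three largest non-DC bins, converted to periods N / k.
top_bins = np.argsort(magnitude[1:])[-3:] + 1
periods = N / top_bins

def keep_band(x, keep):
    """Zero every Fourier bin of x where keep is False, then invert."""
    spec = np.fft.rfft(x)
    return np.fft.irfft(np.where(keep, spec, 0), n=len(x))

CUTOFF = 5                               # assumed boundary between low and high bins
k = np.arange(len(spectrum))
low_part = keep_band(signal, k <= CUTOFF)    # low-pass: slow, magnitude-like variation
high_part = keep_band(signal, k > CUTOFF)    # high-pass: short-period, residue-like variation
```

Because the transform is linear, `low_part + high_part` reconstructs `signal`. In an actual ablation, one of the two parts would be substituted for a layer's output and the change in accuracy measured.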
Method Notes
This is the mechanistic upstream source for FoNE. FoNE turns the descriptive observation into an explicit embedding design: if pretrained LLMs naturally organize number tokens with Fourier-like components, a number embedding can build those components directly.
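A minimal sketch of the idea of building those components directly, assuming one cosine/sine pair per period and powers of ten as periods. This is an illustration only, not the FoNE implementation; `fourier_embed`, `decode_residues`, and `PERIODS` are hypothetical names.

```python
import numpy as np

PERIODS = (10, 100, 1000)    # assumed periods, one per decimal place

def fourier_embed(x, periods=PERIODS):
    """Map number(s) x to [cos(2*pi*x/T) ..., sin(2*pi*x/T) ...] over all periods T."""
    x = np.asarray(x, dtype=float)
    angles = 2 * np.pi * x[..., None] / np.asarray(periods, dtype=float)
    return np.concatenate([np.cos(angles), np.sin(angles)], axis=-1)

def decode_residues(emb, periods=PERIODS):
    """Recover x mod T for each period from the phase of its (cos, sin) pair.
    Assumes x was an integer, so each phase maps to an integer residue."""
    k = len(periods)
    phase = np.arctan2(emb[..., k:], emb[..., :k])        # in [-pi, pi]
    T = np.asarray(periods)
    return np.rint(phase / (2 * np.pi) * T).astype(int) % T

emb = fourier_embed(137)
residues = decode_residues(emb)   # residues of 137 modulo 10, 100, and 1000
```

Each (cos, sin) pair places the number on a circle whose circumference is the period, so numbers that agree modulo that period share the pair exactly. The shortest period encodes the unit digit, while longer periods carry progressively coarser magnitude information.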
The time-series analogy is limited but useful. Periodic components are natural for scalar values with modular or cyclic structure, but a sensor value, exogenous variable, control input, or intervention may require different geometry depending on whether the model needs interpolation, exact arithmetic, or causal sensitivity.
Convergent Evolution later sharpens this boundary: similar Fourier spikes can appear in embeddings or raw token frequencies while linear probes for modular residues remain near chance.
Evidence And Results
The paper reports that a fine-tuned GPT-2-XL model reaches high addition accuracy and that its internal computation decomposes into approximation and classification components. MLP layers primarily approximate magnitude with lower-frequency features, while attention layers primarily contribute modular operations with higher-frequency features.
The pretraining comparison is central: models trained from scratch do not show the same Fourier-feature pattern and achieve lower accuracy, while introducing pretrained token embeddings improves performance.
Limitations
The experiments focus on addition and numbers constrained by GPT-2 tokenization. The mechanism should not be generalized to all numeric reasoning, multiplication, time-series forecasting, or auxiliary numeric value encoding without additional evidence.
Foundation TSFM Relevance
| Agenda slot | Verdict | Evidence | Missing pieces |
|---|---|---|---|
| Point-wise numeric embeddings | adjacent | Shows pretrained number-token embeddings and hidden states contain sparse Fourier components useful for addition. | Mechanism is tied to LLM token numbers and not tested as a time-series scalar embedding. |
| Representation quality | warning | Low-frequency components approximate magnitude while high-frequency components support modular classification. | Evidence is mostly bounded addition; it may not transfer to continuous sensors or control inputs. |
| Benchmarks: what level of modeling is tested? | insufficient evidence | Arithmetic interpretability explains one numeric operation family. | No forecasting, generation, context, or action-conditioned time-series benchmark. |
Links Into The Wiki
Open Questions
- Do similar Fourier-like number features appear for continuous sensor values, exogenous variables, or control inputs in time-series models?
- Which numeric operations need low-frequency magnitude approximation, high-frequency modular classification, bit-level logic, or all three?
- Can Fourier-feature analysis diagnose failures in point-wise time-series embeddings?