# Exploration: Fine-Tuning With Parameter Decomposition

## Provenance

- Source type: LessWrong experiment write-up, official Goodfire X/Twitter thread, Lee Sharkey VPD announcement thread, and official Goodfire `param-decomp` code repository.
- Primary write-up: Lucius Bushnaq, ["Exploration: fine-tuning with parameter decomposition"](https://www.lesswrong.com/posts/ieoWstubDQWLrMnhH/exploration-fine-tuning-with-parameter-decomposition), published on 2026-06-25.
- Official Goodfire experiment thread: <https://x.com/GoodfireAI/status/2070181051801235463>, posted 2026-06-25, with correction <https://x.com/GoodfireAI/status/2070193318177419557>.
- VPD announcement thread by Lee Sharkey: <https://x.com/leedsharkey/status/2051717264286609516>, posted 2026-05-05.
- Official code repository: <https://github.com/goodfire-ai/param-decomp>, public MIT-licensed repository from Goodfire.
- VPD paper / interactive report: <https://www.goodfire.ai/research/interpreting-lm-parameters>.
- VPD summary: <https://www.goodfire.ai/research/vpd-explainer>.
- SPD arXiv paper: <https://arxiv.org/abs/2506.20790>, submitted 2025-06-25 and revised 2025-09-04.
- Snapshot date: 2026-06-26.
- Local source artifacts: `source_lesswrong_exploration.html`, `source_lesswrong_exploration_text.txt`, `source_goodfire_vpd_paper.html`, `source_goodfire_vpd_paper_text.txt`, `source_goodfire_vpd_explainer.html`, `source_goodfire_spd_blog.html`, `source_arxiv_spd_abs.html`, `source_github_readme.md`, `source_github_pyproject.toml`, `source_github_repo.json`, `source_github_releases.json`, `x_provided_posts_2026-06-26.json`, `x_thread_goodfireai_2070181051801235463.json`, `x_thread_leedsharkey_2051717264286609516.json`, and selected `assets/` media thumbnails/images.

## Source Status

This is not a peer-reviewed paper. The primary result is an author-reported one-day hackathon exploration built on Goodfire's VPD parameter decomposition of a small language model. It is credible as a Goodfire-authored project snapshot because the LessWrong write-up, Goodfire X thread, Lee Sharkey VPD thread, Goodfire research pages, and official `goodfire-ai/param-decomp` repository all cross-reference the same method family and codebase.

Treat the claims as early, hypothesis-generating evidence for weight-space model editing, not as a settled scaling result. The target model is a 67M-parameter, four-layer language model trained on The Pile; it is far smaller than frontier language models and is not a time-series model.

## User-Provided X Snapshots

The three user-provided X statuses were extracted through the authenticated X API and preserved locally.

### Goodfire German-removal thread

Goodfire's 2026-06-25 thread says the team removed a 67M-parameter language model's ability to predict German text by fine-tuning only a scalar factor on one decomposed weight subcomponent. The thread frames this as an early Silico hackathon exploration of parameter decomposition for targeted, predictable model editing.

The thread's key points are:

- The edit uses **parameter decomposition**, a method that divides weight matrices into interpretable, sparsely activating components.
- German was selected because it appeared to be the model's strongest non-English language.
- The edit matched LoRA's German-removal effect with fewer German tokens and fewer off-target effects on English, French, Spanish, and Italian.
- The authors explicitly call out a caveat: the method indirectly uses tokens already spent when decomposing the model and interpreting subcomponents; if the decomposition is reusable, that cost can be amortized over many edits.
- Interpretability mattered operationally: the initial top-16 component selection included many components that respond to foreign-language text in general, but autointerp labels led the team to select a single German-specific component instead.
- Goodfire posted a correction that the off-target-effect bars in one plot had been displayed 0.01 nats above the true means.
- In replies, Goodfire confirmed that the parameter decomposition repository is public at `goodfire-ai/param-decomp`.

### Lee Sharkey VPD thread

Lee Sharkey's 2026-05-05 thread announces Goodfire's VPD work as a shift from decomposing **activations** to decomposing **weights**. The thread says VPD:

- identifies the parts of a model's parameters that are causally used on a given input through causal ablations;
- finds parameter components with individual computational roles such as emoticon prediction or gender identification;
- handles attention computations natively, including computations distributed across attention heads;
- supports attribution graphs over parameter components; and
- enables hand-editing a model's neural algorithm, with an emoticon-completion edit as the proof of concept.

The thread also flags compute uncertainty in a later reply. Lee Sharkey gives a low-confidence estimate that decomposing the target model cost roughly one to four times the compute used to train it, and that this cost may scale roughly linearly with target-model compute. He emphasizes that the estimate is based on vibes rather than measured scaling laws.

## Primary Experiment

The LessWrong post's TL;DR is that the team can destroy a 67M-parameter language model's ability to predict German text by fine-tuning a single number: the scalar prefactor on one German-related rank-1 parameter subcomponent.

The method starts from the VPD decomposition of the same 67M model used in Goodfire's VPD paper. VPD rewrites each weight matrix as a sum of rank-1 subcomponents plus a residual $\Delta$ component:

$$
W^l \approx \sum_c U^l_c (V^l_c)^\top + \Delta^l.
$$

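The fine-tuning experiment does not add new LoRA matrices. Instead, it treats the masks (scalar prefactors) $m_c$ on selected subcomponents as the only trainable parameters. Each $m_c$ multiplies its rank-1 term, so $m_c = 1$ leaves the subcomponent unchanged:

- $m_c > 1$ amplifies a subcomponent;
- $0 \le m_c < 1$ suppresses it, with $m_c = 0$ removing it entirely;
- $m_c < 0$ inverts it.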

This restricts adaptation to reweighting existing decomposed circuits rather than learning new ones.
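
As a minimal sketch of the reweighting described above, the following NumPy snippet rebuilds one weight matrix from rank-1 factors, a residual, and per-component masks. The shapes, the random factors, the function name `edited_weight`, and the choice of component index 2 are all illustrative assumptions; none of it is taken from the `param-decomp` codebase.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, n_components = 6, 5, 4  # toy sizes, not the real model's

# Hypothetical decomposition of one weight matrix: rank-1 factors plus a residual.
U = rng.normal(size=(n_components, d_out))  # U[c]: output-side vector of component c
V = rng.normal(size=(n_components, d_in))   # V[c]: input-side vector of component c
delta = 0.01 * rng.normal(size=(d_out, d_in))


def edited_weight(U, V, delta, masks):
    """Return sum_c masks[c] * outer(U[c], V[c]) + delta."""
    return np.einsum("c,co,ci->oi", masks, U, V) + delta


ones = np.ones(n_components)
W = edited_weight(U, V, delta, ones)  # masks of 1 reproduce the decomposed matrix

inverted = ones.copy()
inverted[2] = -1.0  # invert a single (arbitrarily chosen) component
W_inverted = edited_weight(U, V, delta, inverted)

# Moving one mask from 1 to -1 changes W by exactly -2 * outer(U[2], V[2]).
change = W_inverted - W
```

In this toy setting the whole edit is a rank-1 change to `W` whose direction is fixed by the decomposition; only its scale is free.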

## Experiment Design

The target task was to degrade German prediction while preserving English.

The initial procedure was:

1. Rank subcomponents by the difference between their average causal importance on German text and English text.
2. Select the 16 most German-specific subcomponents.
3. Fine-tune those 16 mask values with an objective that increases German cross-entropy while using a KL penalty to keep English predictions near the original model.
4. Compare against rank-1 and rank-4 LoRA adapters trained with the same broad objective.
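
Steps 1 to 3 can be sketched roughly as follows. This is an illustration under stated assumptions rather than the authors' code: the array layout of the causal-importance values, the KL direction, the `kl_weight` parameter, and both function names are hypothetical.

```python
import numpy as np


def top_german_components(ci_german, ci_english, k):
    """Indices of the k components with the largest mean causal-importance gap.

    ci_german and ci_english are assumed to have shape (n_tokens, n_components).
    """
    gap = ci_german.mean(axis=0) - ci_english.mean(axis=0)
    return np.argsort(-gap)[:k]


def log_softmax(logits):
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))


def edit_loss(german_logits, german_targets, english_logits_edited,
              english_logits_original, kl_weight=1.0):
    """Loss to minimise: -CE(German) + kl_weight * KL(original || edited) on English.

    The sign convention and KL direction are assumptions made for this sketch.
    """
    logp_de = log_softmax(german_logits)
    ce_german = -logp_de[np.arange(len(german_targets)), german_targets].mean()

    logp_orig = log_softmax(english_logits_original)
    logp_edit = log_softmax(english_logits_edited)
    kl_english = (np.exp(logp_orig) * (logp_orig - logp_edit)).sum(axis=-1).mean()

    return -ce_german + kl_weight * kl_english


# Toy causal-importance values for three components on two tokens per language.
ci_de = np.array([[0.9, 0.1, 0.5], [0.7, 0.2, 0.5]])
ci_en = np.array([[0.1, 0.1, 0.6], [0.1, 0.3, 0.4]])
top = top_german_components(ci_de, ci_en, k=1)
```

Minimising `edit_loss` with respect to the selected masks raises German cross-entropy while the KL term penalises drift on English; how the real experiment weighted the two terms is not stated here.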

The evaluation metric measures how close German cross-entropy gets to chance (about 10.83 nats, the cross-entropy of a near-uniform output distribution) while keeping the increase in English cross-entropy small.
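
The chance level follows from the fact that a uniform prediction over $V$ tokens has cross-entropy $\ln V$ nats. The short calculation below backs out the vocabulary size implied by the quoted 10.83 nats. The vocabulary size itself is not stated in this note, so the result is an inference from that figure, not a sourced number.

```python
import math

chance_nats = 10.83                    # figure quoted in the write-up
implied_vocab = math.exp(chance_nats)  # uniform cross-entropy is ln(V), so V = e^CE


def uniform_cross_entropy(vocab_size):
    """Cross-entropy in nats of a uniform prediction over vocab_size tokens."""
    return math.log(vocab_size)
```

The implied vocabulary is on the order of 50,000 tokens, which is a plausible size for a tokenizer used with The Pile.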

## Results Reported By The Source

The author reports three notable outcomes.

1. **Top-16 subcomponent tuning beats rank-1 LoRA in the low-data regime but damages other languages.** With low German-token budgets, the top-16 subcomponent edit pushes German closer to chance with less English damage than rank-1 LoRA. At larger German-token budgets, LoRA catches up or overtakes. Because the preservation term covered only English, both methods often damaged French, Spanish, and Italian.

2. **Autointerp labels changed the experiment.** Only one of the selected subcomponents had an exclusively German-specific label: `h.3.attn.v_proj:513`, labeled as German text and names. Many of the other selected components were labeled as non-English or foreign-language text in general. Dropping those broad components and tuning only the German-labeled component improved precision.

3. **The one-scalar edit is the striking result.** The inverting single-component solution reached German-at-chance behavior with about four German training tokens and less than 0.10 nats of English cross-entropy damage, while the LoRA baselines required about 32 German tokens for the same English-damage target. The single-component edit preserved French and Spanish much better than the LoRAs or the top-16 component fine-tune, but still damaged Italian, which the post traces to the component's read/write behavior involving both German and Italian function words.

Appendix experiments in the post report that rank-4 LoRA improves over rank-1 LoRA somewhat but does not change the qualitative conclusion, and that a localized rank-1 LoRA on the same layer-3 attention value matrix still causes larger off-target damage than the one-scalar component edit.

## Repository Snapshot

The official `goodfire-ai/param-decomp` README describes the repository as training tools for parameter decomposition on neural networks, with a compact implementation under `nano_param_decomp/`.

The repository README links two method-lineage references:

- **VPD paper / interactive report:** `https://www.goodfire.ai/research/interpreting-lm-parameters`, with a `vpd-paper` code release and canonical 4L-Pile run `goodfire/spd/runs/s-55ea3f9b`.
- **SPD paper:** `https://arxiv.org/abs/2506.20790`, with a `v1` code release.

The README says the repository contains two Python distributions:

- `param-decomp`, the core library imported as `param_decomp`;
- `param-decomp-lab`, in-repo experiment, app, postprocessing, and CLI tooling imported as `param_decomp_lab`.

It provides experiment entrypoints for toy models, residual MLPs, and language models: `pd-tms`, `pd-resid-mlp`, and `pd-lm`. The repository metadata snapshot identifies the project as public, MIT-licensed, and owned by the `goodfire-ai` GitHub organization.

## Relationship To VPD And SPD

The experiment depends on the VPD paper's decomposed 67M language model. The VPD report argues that parameter decomposition can find mechanistically simple, causally meaningful, parameter-faithful components in a small language model; it reports decomposition of 24 non-embedding weight matrices into rank-1 subcomponents, native attention-layer decomposition, attribution graphs, and a hand-written emoticon model edit.

SPD is the earlier arXiv method in the same lineage. Its abstract frames SPD as a more scalable and robust alternative to APD for linear parameter decomposition, bridging causal mediation analysis and network decomposition methods. The Goodfire SPD blog emphasizes the parameter-space view: activation methods see the shadows computations cast on activations, while parameter decomposition aims to inspect the weights where those computations live.

## Local Interpretation Notes

The most important knowledge-base takeaway is not merely that a toy German capability can be suppressed. The stronger hypothesis is:

```text
A reusable weight-space decomposition can turn some model edits from adding new adapter capacity into selecting and rescaling existing causal subcomponents.
```

That matters because it separates three adaptation mechanisms that often get blurred together:

- **LoRA / adapter learning:** add a new low-rank update that may interact broadly with nearby languages or capabilities.
- **Sparse full-parameter post-training:** discover or exploit a sparse set of parameters changed by an optimizer.
- **Decomposition-basis editing:** expose a learned component basis first, then edit scalar gates or masks over interpreted subcomponents.
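
To make the difference in edit dimensionality concrete, the sketch below counts trainable parameters for a LoRA update on a single matrix versus scalar gates over selected subcomponents. The matrix width `d = 512` is a hypothetical placeholder, since the model's actual layer shapes are not given in this note, and both function names are invented for illustration.

```python
def lora_trainable_params(d_out, d_in, rank):
    """A LoRA update B @ A has B of shape (d_out, rank) and A of shape (rank, d_in)."""
    return rank * (d_out + d_in)


def gate_trainable_params(n_selected_components):
    """One scalar gate per selected subcomponent."""
    return n_selected_components


d = 512  # hypothetical square matrix width, not the real model's
counts = {
    "lora_rank_1": lora_trainable_params(d, d, rank=1),
    "lora_rank_4": lora_trainable_params(d, d, rank=4),
    "gates_top_16": gate_trainable_params(16),
    "gate_single": gate_trainable_params(1),
}
```

The gate counts are independent of matrix size, whereas the LoRA counts grow with the layer width and with the number of adapted matrices.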

The experiment is a concrete counterpoint to generic "low-rank updates are interpretable" stories. The single scalar edit can be lower-dimensional than LoRA only because the decomposition already paid a large upfront cost to identify a meaningful component basis. A fair comparison therefore needs to include decomposition cost, amortization over many edits, off-target behavior, and scaling to larger models.
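
A back-of-the-envelope way to think about amortization is sketched below, with costs expressed in multiples of the compute used to train the target model. The 1x to 4x decomposition range comes from Lee Sharkey's low-confidence estimate in the VPD thread. The per-edit cost of zero and the edit counts are hypothetical placeholders, and the function name is invented for this note.

```python
def amortized_cost_per_edit(decomposition_cost, edit_cost, n_edits):
    """Per-edit cost when one decomposition is shared across n_edits edits."""
    if n_edits < 1:
        raise ValueError("n_edits must be at least 1")
    return decomposition_cost / n_edits + edit_cost


# Costs in units of target-model training compute; edit_cost=0.0 is a placeholder.
low_single = amortized_cost_per_edit(1.0, edit_cost=0.0, n_edits=1)
high_single = amortized_cost_per_edit(4.0, edit_cost=0.0, n_edits=1)
high_shared = amortized_cost_per_edit(4.0, edit_cost=0.0, n_edits=100)
```

Under these assumptions a single edit carries the full decomposition cost, so the token-efficiency advantage of the scalar edit only translates into a total-cost advantage when the same decomposition is reused many times.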

## Limitations

- The result is an exploratory one-day hackathon case study, not peer-reviewed evidence.
- The target is a single 67M-parameter language model and a single decomposition.
- The model is small and weak; German is its strongest non-English language but far weaker than English.
- The edit indirectly uses the data and compute spent during decomposition and autointerpretation. The low token count for the final scalar fine-tune is not the whole cost.
- The preservation objective initially protected English only, so broad foreign-language components created avoidable off-target damage.
- Even the final single-component edit still damaged Italian because the chosen component was not purely German.
- The result does not show that VPD-style decompositions scale efficiently to billion-parameter or frontier-scale models.
- The source is language-model evidence, not numeric time-series, event-stream, or action-conditioned world-model evidence.

## Open Questions

- How often do VPD components isolate capabilities cleanly enough that scalar rescaling beats LoRA under matched total decomposition-plus-edit cost?
- Does component-basis editing preserve rare capabilities better than LoRA or sparse full-parameter updates when the protected set is broader than English/French/Spanish/Italian/code?
- Can decomposition cost be amortized across enough downstream edits to matter in practice?
- Do decomposed subcomponents stay stable across checkpoints, seeds, model scales, or training-data variants?
- Can parameter decomposition expose safe, auditable edit handles for numeric time-series models, telemetry models, or action-conditioned world models?
