What Do Language Models Learn and When? The Implicit Curriculum Hypothesis

Source

Raw Markdown: paper_implicit-curriculum-hypothesis-2026.md
PDF: paper_implicit-curriculum-hypothesis-2026.pdf
Preprint: arXiv 2604.08510v1
Code: KaiserWhoLearns/ElementalTask
Model-trajectory dataset: elemental-tasks/model-trajectories
Gonzo ML discussion: Telegram post 5452 (local extract stored at papers/implicit-curriculum-hypothesis-2026/telegram-post-gonzo-ml-5452.md)
Gonzo-linked review: ArXivIQ review
Podcast pointer: Gonzo ML Podcasts 3797
Local official-artifact metadata: papers/implicit-curriculum-hypothesis-2026/official_artifacts_metadata.json
Local GitHub README snapshot: papers/implicit-curriculum-hypothesis-2026/github-readme-ElementalTask.md

Status And Credibility

arXiv lists the paper as cs.CL, version v1, submitted on 2026-04-09. The PDF first page says “Preprint. Under review.” and gives affiliations with Carnegie Mellon University, Johns Hopkins University, Northeastern University, and the University of Southern California.

Credibility is high enough for an important ingest because the paper is current, comes from a credible language-modeling team, includes Graham Neubig as a senior author, releases an official GitHub repository for ElementalTask, and has a public Hugging Face model-trajectory dataset. Caveats: no accepted venue page was found at ingest time; the source is still an under-review preprint; and the paper’s evidence is from language-model checkpoints and synthetic/task-suite probes, not time-series or action-conditioned world-model experiments.

Core Claim

The paper proposes the Implicit Curriculum Hypothesis: during language-model pretraining, skills emerge in a stable, compositional order that is consistent across model families and readable from internal representations.

For a task set $T$ with a design-level prerequisite relation in which task $i$ precedes task $j$, the paper defines an emergence time $t^{*}(\text{task}, \text{model})$ as the first checkpoint where the model exceeds a fixed accuracy threshold. It then tests three claims:

Compositional ordering: prerequisite tasks should emerge no later than composites built from them.
Cross-model stability: emergence order should be similar across model families, sizes, and data mixtures.
Representational alignment: tasks with similar function-vector representations should have similar learning trajectories, making unseen composite-task trajectories predictable from residual-stream geometry.
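The emergence-time definition and the compositional-ordering check can be sketched in a few lines. This is a minimal illustration, not the paper's code: the function names, the 0.5 threshold, the toy accuracy curves, and the choice to treat a task that never emerges as emerging at infinity are all assumptions made here.

```python
import math
from typing import Optional, Sequence


def emergence_time(
    tokens: Sequence[float],
    accuracy: Sequence[float],
    threshold: float = 0.5,  # hypothetical value; the source only says "fixed"
) -> Optional[float]:
    """Return the token count of the first checkpoint whose accuracy
    exceeds the absolute threshold, or None if it is never exceeded."""
    if len(tokens) != len(accuracy):
        raise ValueError("tokens and accuracy must have the same length")
    for t, acc in zip(tokens, accuracy):
        if acc > threshold:
            return t
    return None


def ordering_holds(
    parent_time: Optional[float], composite_time: Optional[float]
) -> bool:
    """Compositional ordering: the prerequisite emerges no later than the
    composite. Assumption: a task that never emerges counts as infinity."""
    parent = math.inf if parent_time is None else parent_time
    composite = math.inf if composite_time is None else composite_time
    return parent <= composite


# Toy checkpoint schedule and accuracy curves (illustrative numbers only).
tokens = [1e9, 2e9, 4e9, 8e9]
simple_acc = [0.10, 0.60, 0.90, 0.95]
composite_acc = [0.00, 0.20, 0.55, 0.80]

t_simple = emergence_time(tokens, simple_acc)
t_composite = emergence_time(tokens, composite_acc)
expected_order = ordering_holds(t_simple, t_composite)
```

With these toy curves the simple task crosses the threshold one checkpoint before the composite, so the relation counts as being in the expected order.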

The important move for this wiki is that aggregate validation loss is treated as too coarse. The paper asks what capabilities appear when, and whether internal representations can forecast capability emergence before exhaustive evaluation.

Evidence

The experiments build 91 exact-match tasks: 53 simple tasks and 38 composite tasks spanning string manipulation, morphology, translation, retrieval, coreference, logical operations, reading-comprehension-style tasks, and arithmetic. The models are open-weight checkpointed language models from Pythia, OLMo-2, OLMo-3, and LLM360, with sizes spanning 410M to 13B parameters and roughly 20 checkpoint samples up to about the first 1T training tokens.

Key reported results:

Evidence thread	Reported result	Why it matters
Cross-model emergence order	Mean Spearman $\rho = .81$ across 45 model pairs under an absolute threshold; pairwise correlations range roughly from $.64$ to $.93$	Suggests skill-acquisition order is not arbitrary or family-specific.
Composite after prerequisites	54/76 composite-parent relations emerge in the expected order; 19 weak and 3 strong inversions remain	Supports the curriculum claim but shows the ordering is not a strict proof of prerequisites.
Threshold sensitivity	Relative-threshold emergence weakens cross-model agreement	Diagnostics must define meaningful absolute competence thresholds, not only percent-of-best scores.
Function-vector prediction	Leave-one-out prediction for 26 held-out composites gives $R^{2} \approx .67$ to $.838$ across models with all tasks in the basis	Residual-stream task geometry is predictive of training trajectories.
Composition bottleneck	Restricting the trajectory-prediction basis to simple tasks increases MAE by about $.135$ on average	Composite task trajectories carry structure not fully captured by elemental tasks alone.
Released artifacts	Official GitHub repository plus HF `elemental-tasks/model-trajectories` dataset	Makes the diagnostics more inspectable than a paper-only preprint.
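To make the cross-model row concrete, the sketch below computes Spearman rank correlation between per-task emergence times for every pair of models and averages the result. It is written in plain Python so it has no dependencies; the model names and emergence times are invented for illustration, and the paper's own handling of ties or never-emerging tasks is not described in this page.

```python
import math
from itertools import combinations
from typing import Dict, List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return float("nan")  # correlation is undefined for constant input
    return cov / (sx * sy)


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rho is the Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def mean_pairwise_spearman(emergence: Dict[str, List[float]]) -> float:
    """Average rho over all unordered pairs of models."""
    pairs = list(combinations(sorted(emergence), 2))
    return sum(spearman(emergence[a], emergence[b]) for a, b in pairs) / len(pairs)


# Hypothetical emergence times (in billions of tokens) for the same four tasks.
emergence = {
    "model_a": [2.0, 4.0, 8.0, 16.0],
    "model_b": [1.0, 4.0, 6.0, 32.0],
    "model_c": [4.0, 2.0, 8.0, 16.0],
}
mean_rho = mean_pairwise_spearman(emergence)
```

Because Spearman correlation only looks at ranks, two models can agree perfectly on emergence order even when one learns every task several times faster in absolute token counts.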

One small source-consistency caveat: the abstract and the Gonzo summary say four model families and describe a 410M to 13B range, while the reported 45 pairwise correlations imply 10 model instances in the pairwise table. The source page therefore avoids relying on an exact model count beyond the stated families, range, and reported pair count.
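The step from 45 pairs to 10 model instances rests on one assumption: that the pairwise table covers every unordered pair of distinct models exactly once. Under that assumption the count follows from the binomial coefficient:

```latex
\binom{n}{2} = \frac{n(n-1)}{2} = 45
\;\Longrightarrow\; n^{2} - n - 90 = 0
\;\Longrightarrow\; (n - 10)(n + 9) = 0
\;\Longrightarrow\; n = 10
```

If the table omitted some pairs or included same-family pairs only, the implied model count would differ, which is another reason not to lean on the exact number.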

Relevance To This Wiki

This is upstream LLM training-dynamics and representation-diagnostics evidence. It is not a time-series foundation-model paper.

The transferable idea is that aggregate loss can hide ordered acquisition of capabilities. For foundation time-series models, the analogous question is not “which language skills emerge?” but which latent-state capabilities emerge when: local numeric fidelity, seasonal dynamics, cross-channel coupling, rare-regime sensitivity, context use, event-stream parsing, action-conditioned next-state dynamics, and counterfactual prediction.

The paper also gives a diagnostic pattern: build a small, interpretable capability suite; log emergence times at absolute thresholds; and test whether representation geometry predicts future capability trajectories. A TSFM version could use latent-state probes, rare-regime probes, channel-coupling probes, intervention-window probes, and rollout probes rather than relying only on average forecast error or validation loss.
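The third step of that pattern, checking whether representation geometry predicts trajectories, can be illustrated with a deliberately simple baseline. The sketch below predicts a held-out task's accuracy curve as a cosine-similarity-weighted average of other tasks' curves and scores it with MAE. This is a stand-in technique chosen for brevity, not the paper's prediction method, and every name, vector, and curve in it is hypothetical.

```python
import math
from typing import Dict, List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0


def predict_trajectory(
    held_out_vec: Sequence[float],
    basis_vecs: Dict[str, Sequence[float]],
    basis_traj: Dict[str, Sequence[float]],
) -> List[float]:
    """Similarity-weighted average of the basis tasks' accuracy curves.
    Negative similarities are clipped to zero; if every weight is zero,
    fall back to a uniform average."""
    weights = {
        name: max(cosine(held_out_vec, vec), 0.0)
        for name, vec in basis_vecs.items()
    }
    total = sum(weights.values())
    if total == 0:
        weights = {name: 1.0 for name in basis_vecs}
        total = float(len(weights))
    n_checkpoints = len(next(iter(basis_traj.values())))
    return [
        sum(weights[name] * basis_traj[name][k] for name in weights) / total
        for k in range(n_checkpoints)
    ]


def mae(pred: Sequence[float], actual: Sequence[float]) -> float:
    return sum(abs(p - a) for p, a in zip(pred, actual)) / len(pred)


# Hypothetical probe vectors and accuracy-per-checkpoint curves.
basis_vecs = {"probe_a": [1.0, 0.0], "probe_b": [0.0, 1.0]}
basis_traj = {"probe_a": [0.1, 0.5, 0.9], "probe_b": [0.0, 0.2, 0.4]}

# A held-out probe that sits halfway between the two basis probes.
predicted = predict_trajectory([1.0, 1.0], basis_vecs, basis_traj)
error = mae(predicted, [0.05, 0.30, 0.70])
```

For a TSFM study the vectors would come from whatever latent-state probe is in use, and the curves from per-checkpoint scores on the corresponding capability slice; the useful comparison is how much the error grows when the basis is restricted to simpler probes.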

Foundation TSFM Relevance

Agenda slot	Verdict	Evidence	Missing pieces
Training dynamics and diagnostics	adjacent	Shows LLM pretraining has structured skill-emergence order and that residual-stream function vectors predict future task trajectories.	Needs analogous TSFM checkpoint studies over numeric state, event streams, context, actions, and rare regimes.
Data diversity and curriculum	adjacent	Stable learning order suggests a way to monitor whether a model is developing expected capabilities on schedule.	Does not implement data reweighting or prove a curriculum policy for useful-signal-poor temporal corpora.
Representation quality	adjacent	Internal task representations predict held-out composite learning curves.	Needs layer/head/probe protocols that preserve dense numeric detail and latent state, not only language-task function vectors.
Benchmark hygiene	warning	Relative thresholds can destroy cross-model emergence-order agreement; aggregate loss hides capability transitions.	TSFM reports need absolute competence thresholds, checkpoint density, probe design, and slice-specific metrics.
Scaling and efficiency	warning	Scaling laws on validation loss say how much loss falls, not what is learned when.	Needs scaling laws for capability emergence, useful-signal density, and representation geometry.
Causal structure, counterfactuals, and control	insufficient evidence	The method could inspire action/intervention probe suites.	No actions, control inputs, interventions, counterfactual rollouts, or control utility are evaluated.
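The benchmark-hygiene warning about relative thresholds is easy to reproduce on toy data. In the sketch below, a relative threshold is taken to mean a fixed fraction of the model's own best accuracy on the task, which matches the "percent-of-best" phrasing used on this page; the 0.5 values and the two accuracy curves are assumptions for illustration.

```python
from typing import Optional, Sequence


def first_crossing(
    tokens: Sequence[float], accuracy: Sequence[float], cutoff: float
) -> Optional[float]:
    """Token count of the first checkpoint with accuracy above cutoff."""
    for t, acc in zip(tokens, accuracy):
        if acc > cutoff:
            return t
    return None


def absolute_emergence(tokens, accuracy, threshold: float = 0.5):
    """Emergence against a fixed competence level."""
    return first_crossing(tokens, accuracy, threshold)


def relative_emergence(tokens, accuracy, fraction: float = 0.5):
    """Emergence against a fraction of the model's own best score."""
    return first_crossing(tokens, accuracy, fraction * max(accuracy))


tokens = [1, 2, 3, 4]
strong_model = [0.10, 0.40, 0.70, 0.90]  # eventually competent
weak_model = [0.00, 0.12, 0.18, 0.20]    # never competent

strong_abs = absolute_emergence(tokens, strong_model)
strong_rel = relative_emergence(tokens, strong_model)
weak_abs = absolute_emergence(tokens, weak_model)
weak_rel = relative_emergence(tokens, weak_model)
```

The weak model never reaches the absolute threshold, yet the relative rule reports it as emerging earlier than the strong model, because the bar is set by its own low ceiling. That is the kind of distortion that can scramble cross-model orderings.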

Limitations

The evidence is language-model centric and uses exact-match text tasks, not numeric time series, graph time series, observability telemetry, healthcare trajectories, robotics trajectories, or action-conditioned world models.
The task dependencies are design-level dependencies, not proof that the model uses those exact internal primitives.
Composite ordering is not perfect: 22/76 composite-parent relations (19 weak, 3 strong) are inverted under the reported absolute-threshold analysis, including three strong inversions involving first_letter.
The diagnostics depend on checkpoint density, threshold choice, task construction, prompt format, and function-vector extraction settings.
The source is an arXiv preprint under review at ingest time, not an accepted peer-reviewed paper.

Links Into The Wiki

Open Questions

What is the TSFM analogue of an elemental skill: a channel operation, a temporal motif, a regime distinction, an event parser, a latent-state update, or an action-conditioned transition?
Can TSFM checkpoint trajectories reveal a stable order of capability emergence across model families and data mixtures?
Which absolute thresholds should define emergence for rare regimes, cross-channel coupling, context use, and action-conditioned rollout?
Can latent-state probe geometry predict future capability trajectories before running expensive downstream evaluations?
Would dynamic curriculum policies improve if they monitor whether expected capabilities are emerging ahead of or behind schedule?
