Dynamic Curriculum Learning For JEPA

Status: draft research idea extracted from internal discussion notes.

Collaboration

If this direction resonates with you, I would be happy to talk with like-minded people, collaborate on research, and work on use-cases together.

Ideas are not the bottleneck. Hands are. Time-series modeling should be moving at least as fast as vision, audio, and robotics.

Summary

Large temporal datasets are often data-rich but useful-signal-poor. In observability, telecom, blockchain, industrial telemetry, and many video settings, most windows show normal behavior. The rare windows are the ones that carry failures, regime changes, corner cases, motion boundaries, or other decision-relevant state.

The idea is to train JEPA-style models with a dynamic curriculum that filters or reweights training examples online according to the model’s current surprise or another current training-effect proxy. Instead of sampling uniformly from a huge corpus, the training loop continuously asks which windows are still informative for the current model.

This is especially natural for JEPA, because the model already predicts in representation space. Surprise can be measured as a latent prediction error rather than as raw reconstruction error.

The stronger research claim is that curriculum should become a foundation-model primitive for useful-signal-poor temporal corpora, not just a preprocessing trick. The sampler, target embedding distribution, anti-collapse regularizer, and rare-state evaluation all become part of the model recipe.

After When Does LeJEPA Learn a World Model?, the sharper version is that dynamic curriculum is a distribution-shaping controller for JEPA pretraining. It should spend compute on rare informative transitions while preserving enough autocorrelated, approximately isotropic coverage for the learned state to remain linearly usable.

After A Bitter Lesson for Data Filtering and No Filter: Cultural and Socioeconomic Diversity in Contrastive Vision-Language Models, the idea also needs a stricter no-filter baseline. Dynamic curriculum should be framed as compute allocation under support-preservation constraints, not as irreversible quality pruning. The sampler may downweight repeated or corrupt windows, but it should keep a uniform or regime-aware floor so rare-but-structured distributional support is not deleted.

LLMs as Noisy Channels adds a complementary scaling-law motivation: more data can become harmful when accumulated data noise overtakes useful signal. For this idea, that means the curriculum objective should distinguish redundant normal-state windows, corrupt windows, rare-but-structured windows, and truly useful surprise instead of simply maximizing data volume or raw prediction error.

Implicit Curriculum Hypothesis adds a monitoring pattern, not a direct curriculum algorithm: define small capability probes, log when each capability crosses an absolute threshold, and test whether internal representation geometry predicts later probe trajectories. A JEPA-style version would ask whether early latent-state probes can predict later rare-regime, context-use, channel-coupling, or intervention-window competence.
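
A minimal sketch of the logging half of that pattern, assuming probe scores have already been computed at each checkpoint. The probe names, scores, thresholds, and the `first_crossings` helper are hypothetical illustrations, not taken from the cited work.

```python
from typing import Dict, List, Optional


def first_crossings(
    trajectories: Dict[str, List[float]],
    thresholds: Dict[str, float],
) -> Dict[str, Optional[int]]:
    """Return, per probe, the first checkpoint index whose score reaches
    the probe's absolute threshold, or None if it never does."""
    crossings: Dict[str, Optional[int]] = {}
    for probe, scores in trajectories.items():
        limit = thresholds[probe]
        crossings[probe] = next(
            (step for step, score in enumerate(scores) if score >= limit),
            None,
        )
    return crossings


# Hypothetical probe scores logged at five checkpoints.
trajectories = {
    "rare_regime_recall": [0.10, 0.22, 0.41, 0.63, 0.70],
    "channel_coupling": [0.30, 0.35, 0.38, 0.39, 0.40],
}
thresholds = {"rare_regime_recall": 0.60, "channel_coupling": 0.50}

crossings = first_crossings(trajectories, thresholds)
# Probes that never crossed are candidates for "lagging" capabilities.
lagging = [name for name, step in crossings.items() if step is None]
```

In this toy log, `rare_regime_recall` crosses at checkpoint index 3 and `channel_coupling` never crosses, so it is the one flagged as lagging. Testing whether representation geometry predicts these crossing times is a separate step that this sketch does not cover.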

A newly identified deployment use-case is company-local distributed training. When raw enterprise data cannot leave the company’s boundary and is mostly unlabeled, the same dynamic filter can decide which local windows are worth turning into update signals before gradients, adapter deltas, or secure aggregates are exported to an external coordinator.

Problem

The practical problem has two parts.

First, there is too much data to train on naively. A month of high-dimensional time series can contain terabytes of data and trillions of points. Video has the same shape at a different scale: long spans of predictable background dynamics can dominate the training stream.

Second, most of that data is not equally useful. If a model sees mostly normal operation, it can learn a strong representation of normality while discarding rare, hard, or abnormal cases as noise. For operational systems, safety settings, and video corner cases, those rare windows are often the main reason to train the model.

So the target is not simply “more data.” The target is higher useful learning signal per unit of compute.

This creates five related but separable problems:

Dataset-level selection. Given an unlabeled corpus, identify which windows still teach the current model something useful.
Capability-emergence monitoring. Given checkpoints and a probe suite, identify which latent-state capabilities are emerging on schedule and which are lagging.
Model-level robustness. Given arbitrary temporal data, train an encoder that does not erase rare correlated events, change points, or decision-relevant state even when those cases are sparse.
Distribution shaping. Given a long-tailed stream, choose training pairs so rare states are represented without turning the whole objective into hard-example mining over noise, corrupt windows, or policy-shaped tails.
Private local selection. Given unlabeled company data that cannot be inspected centrally, choose locally which windows should contribute to exported update signals without revealing raw observations.

High-Level Idea

Use online data selection during JEPA pretraining.

The training loop should:

Run candidate windows, clips, or trajectories through the current model.
Compute one or more selection signals for each candidate: surprise, gradient magnitude, gradient direction, probe alignment, or another training-effect proxy.
Keep, upweight, downweight, or drop examples according to a moving curriculum policy.
Log capability-probe trajectories across checkpoints so the sampler can tell whether rare-state and context-use capabilities are emerging on schedule.
Update the curriculum as the model learns, so examples that were useful early can become cheap or redundant later.

The core surprise-band heuristic is:

too little surprise      -> already learned or uninformative, downweight
useful surprise band     -> informative training signal, keep or upweight
too much surprise        -> possible noise, corruption, or out-of-distribution case, cap or inspect

This makes curriculum learning dynamic rather than fixed. The same example can move between bands as the encoder and predictor improve.
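
The band heuristic can be sketched as a percentile split over the current batch of surprise scores. This is a minimal illustration, assuming scalar surprise per window; the function names and the 0.5 and 0.95 quantile cut points are hypothetical choices, not values from the experiments described on this page.

```python
from typing import List


def percentile(sorted_values: List[float], q: float) -> float:
    """Linear-interpolation percentile of an already sorted list, q in [0, 1]."""
    if not sorted_values:
        raise ValueError("need at least one value")
    position = q * (len(sorted_values) - 1)
    lower = int(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = position - lower
    return sorted_values[lower] * (1 - fraction) + sorted_values[upper] * fraction


def assign_bands(
    surprises: List[float], low_q: float = 0.5, high_q: float = 0.95
) -> List[str]:
    """Label each window 'low', 'useful', or 'extreme' relative to the
    current score distribution."""
    ordered = sorted(surprises)
    low_cut = percentile(ordered, low_q)
    high_cut = percentile(ordered, high_q)
    bands = []
    for score in surprises:
        if score < low_cut:
            bands.append("low")        # already learned or uninformative
        elif score <= high_cut:
            bands.append("useful")     # keep or upweight
        else:
            bands.append("extreme")    # cap or inspect
    return bands


# Hypothetical surprise scores; the last window is a suspicious outlier.
scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 10.0]
bands = assign_bands(scores)
```

Because the cut points are recomputed from whatever scores the current model produces, rerunning `assign_bands` after further training can move the same window into a different band.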

Historical World-Model Analogue

World Models is a historical RL/world-model analogue, not direct JEPA or time-series evidence. The authors propose using the dynamics model’s prediction loss to drive exploration and collect data where the model is unfamiliar. That makes it a useful predecessor for surprise-based temporal curricula, while preserving the caveat that high surprise can also mean corrupt, adversarial, or irrelevant states.

Dataset Policy Gradient Analogue

Synthetic Data for any Differentiable Target is the closest current source for turning data selection into a differentiable training-effect problem. DPG does not train a JEPA and does not operate on multivariate time series, but it shows a stronger version of the same idea: an example can be scored by how much upweighting it changes a downstream differentiable metric after a short training trajectory.

For this idea, DPG suggests a two-level curriculum design:

cheap online surprise score
  -> periodically calibrated by expensive metagradient value probes
  -> sampler/controller learns which windows improve rare-state metrics

The useful transfer is not the text generator itself. The useful transfer is the data-valuation interface: candidate windows, clips, trajectories, or event-stream segments could receive rewards from differentiable rare-state probes, latent-geometry diagnostics, or intervention-window state-prediction losses. A curriculum controller could then learn to approximate those rewards cheaply.
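
One crude way to picture the calibration step is to average occasional expensive value estimates per surprise bucket and turn them into sampling weights. This is a sketch under strong assumptions: the rewards stand in for metagradient probe outputs that are not computed here, the bucket names and the floor value are hypothetical, and bucket averaging is a simple stand-in rather than the DPG procedure.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def calibrate_bucket_values(
    probed: Iterable[Tuple[str, float]]
) -> Dict[str, float]:
    """Average the expensive reward over the probed windows in each bucket."""
    totals: Dict[str, float] = defaultdict(float)
    counts: Dict[str, int] = defaultdict(int)
    for bucket, reward in probed:
        totals[bucket] += reward
        counts[bucket] += 1
    return {bucket: totals[bucket] / counts[bucket] for bucket in totals}


def bucket_weights(values: Dict[str, float], floor: float = 0.05) -> Dict[str, float]:
    """Clip negative values to zero, add a floor so no bucket loses all
    support, and normalize to a distribution over buckets."""
    raw = {bucket: max(value, 0.0) + floor for bucket, value in values.items()}
    total = sum(raw.values())
    return {bucket: weight / total for bucket, weight in raw.items()}


# Hypothetical (bucket, reward) pairs from a small, expensive probe pass.
probed = [
    ("low", 0.0), ("low", 0.0),
    ("useful", 0.5), ("useful", 0.25),
    ("extreme", -0.25),
]
values = calibrate_bucket_values(probed)
weights = bucket_weights(values, floor=0.125)
```

The cheap online score decides which bucket a window falls into; the sparse probe pass only decides how much each bucket is worth. A learned controller would replace the averaging step.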

The caveat is equally important. DPG shows that optimizing data against a narrow metric can deliberately write hidden properties into a model. A dynamic curriculum must therefore report rare-state preservation, normal-behavior retention, latent-geometry health, and clean-label-poisoning-like side effects, not just faster loss reduction.

TarDiff and OATS add direct time-series analogues: influence-guided synthetic samples can target downstream clinical utility or TSFM pretraining value. They support gradient-effect data valuation as a direction, but they inherit guidance-set leakage, metric-target, and passive-forecasting caveats before they can support JEPA-style rare-state curricula.

Motion-Targeted Attribution Analogue

Motion Attribution for Video Generation provides the clearest current video analogue for gradient-alignment selection. Motive defines target temporal dynamics with query clips, masks per-location loss by optical-flow magnitude, and ranks candidate fine-tuning clips by projected motion-gradient similarity. On three video backbones, fine-tuning on its query-conditioned 1,000-clip subsets outperforms random 1,000-clip selection and fine-tuning on all 10,000 clips in the sampled candidate pool on the target dynamic-degree metric. This is not a 10%-of-pretraining-data result.

This validates a static component, not the full dynamic-curriculum claim or a 90% end-to-end compute saving. Motive computes gradients for all 10,000 candidates before selecting 1,000, uses a fixed motion-query set, and performs one-shot subset selection. The paper’s one-epoch, 50-repeat description also does not make matched optimizer-step accounting explicit. The JEPA experiment should test whether repeated score refresh improves on that static baseline, while adding a support-preserving sampling floor and untargeted-capability retention checks. Its 150 A100-hour attribution pass is also a strong reason to compare full-model gradients with layer-local, adapter, and periodic-probe approximations.

No-Filter And Diversity Evidence

The two no-filter sources sharpen what this idea is and is not.

A Bitter Lesson for Data Filtering shows a compute-scale crossover in language-model pretraining: ordinary quality filters can help under small compute, but a larger model trained longer can extract more value from the unfiltered Common Crawl pool. The useful lesson for this page is that data value depends on compute, model capacity, and training duration. A sample that looks low-quality to a fixed heuristic may still carry weak signal, rare co-occurrence structure, or long-tail coverage that a larger model can use.

No Filter: Cultural and Socioeconomic Diversity in Contrastive Vision-Language Models shows the benchmark version of the same warning: an English-only image-text filter improves ImageNet/COCO-style metrics while hurting cultural and socioeconomic coverage. The time-series analogue is a filter that improves average forecast error or easy anomaly benchmarks while erasing tail devices, rare tenants, regional seasonality, intervention windows, or pre-failure buildup.

Together they turn the curriculum contract into:

remove or downweight redundancy and corruption,
but preserve the support of useful natural diversity.

That means no-filter, loose-filter, dedup-only, and dynamic-reweighting baselines should all appear in the experiment. A dynamic curriculum is successful only if it beats uniform/no-filter at matched compute while retaining tail-regime probes and normal-behavior calibration. If it wins by deleting diversity, it is just a narrow benchmark filter.

Beyond Surprise: Gradient-Space Curriculum Signals

Surprise should be treated as the simplest baseline signal, not the only possible filter. A candidate window can also be scored by the size or direction of the update it would induce in the model.

For a candidate window $x$ with training loss $L_{\theta}(x)$, define its local update direction as:

$$g_{x} = \nabla_{\theta} L_{\theta}(x)$$

Possible gradient-space filters include:

Gradient magnitude. Upweight windows whose $\lVert g_{x} \rVert$ is large enough to change the model, while capping extreme norms that may indicate corrupt data, unstable scale, or adversarial samples.
Gradient alignment. Prefer windows whose gradient direction aligns with a probe objective, such as rare-state recall, regime probe loss, intervention-window state prediction, or latent-geometry health: $\cos(g_{x}, g_{\mathrm{probe}})$.
Retention conflict. Downweight windows whose update direction strongly conflicts with a normal-behavior retention gradient or anti-collapse geometry check: negative alignment with $g_{\mathrm{retain}}$ can mean the curriculum is buying tail sensitivity by forgetting baseline dynamics.
Gradient novelty or diversity. Prefer a batch whose gradients span useful regimes instead of repeatedly applying near-duplicate updates from the same easy or noisy bucket.
Layer-local effects. Measure gradient magnitude or direction only in the encoder, predictor, adapter, or world-model dynamics block when the full parameter-space gradient is too large or too noisy.

This is the same conceptual move as DPG, but cheaper because nothing is differentiated through training. DPG estimates downstream training effect by differentiating through an inner training step or short trajectory. Gradient-magnitude and gradient-direction filters approximate that effect with local update geometry, so they can be used as online curriculum signals or as features for a sampler/controller trained from occasional metagradient probes.

The important distinction is that high surprise and high gradient norm are not automatically good. A corrupt sensor window, ingestion glitch, or out-of-distribution artifact can produce both. Directional checks make the curriculum more target-aware: a medium-surprise window whose gradient points toward rare-state preservation may be more valuable than a high-surprise window whose gradient points against normal-behavior retention.
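
The difference between gradient magnitude and gradient alignment can be shown on a toy model. This sketch assumes a linear predictor with squared-error loss as a stand-in for the JEPA objective; in a real system the per-example gradients would come from automatic differentiation, and all names and numbers below are hypothetical.

```python
import math
from typing import List, Sequence


def squared_error_gradient(
    weights: Sequence[float], features: Sequence[float], target: float
) -> List[float]:
    """Gradient of 0.5 * (w . x - y)^2 with respect to w."""
    residual = sum(w * f for w, f in zip(weights, features)) - target
    return [residual * f for f in features]


def l2_norm(vector: Sequence[float]) -> float:
    return math.sqrt(sum(component * component for component in vector))


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; defined as 0.0 when either vector is zero."""
    denominator = l2_norm(a) * l2_norm(b)
    if denominator == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / denominator


weights = [0.0, 0.0]
# Probe example standing in for a rare-state objective.
g_probe = squared_error_gradient(weights, [2.0, 0.0], 1.0)
# Candidate A: modest gradient, same direction as the probe gradient.
g_a = squared_error_gradient(weights, [1.0, 0.0], 1.0)
# Candidate B: large gradient, orthogonal to the probe gradient.
g_b = squared_error_gradient(weights, [0.0, 1.0], 5.0)

norm_a, norm_b = l2_norm(g_a), l2_norm(g_b)
align_a, align_b = cosine(g_a, g_probe), cosine(g_b, g_probe)
```

Candidate B has the larger gradient norm but zero alignment with the probe, while candidate A has a smaller norm and full alignment. A magnitude-only filter would prefer B; an alignment filter would prefer A.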

In private distributed training, gradient-space scoring should normally happen inside the data boundary. Raw gradients, per-example gradient norms, and gradient-alignment statistics can leak information, so the coordinator should receive only bounded update signals or privacy-safe aggregates if this branch is evaluated.

Agenda Impact

This idea changes the research agenda from “train on more temporal data” to “spend training compute on the windows that preserve latent state.” In this frame, a dynamic curriculum is not only an accelerator. It is a mechanism for preventing a foundation time-series model from normalizing away the rare states that make industrial, observability, blockchain, robotics, and video corpora valuable.

The publishable target should therefore report rare-state preservation and normal-behavior retention together. A method that improves average loss by selecting only common or easy windows is a failure. A method that overweights anomalies until normal dynamics are forgotten is also a failure.

The best first system is still simple: a streaming sampler with moving surprise buckets, a uniform sampling floor, caps for extreme high-surprise windows, and periodic score refresh. The agenda-level extension is to test whether that sampler can be interpreted as distribution shaping in embedding space, especially when paired with SIGReg-style Gaussian regularization.

When Does LeJEPA Learn a World Model? makes the sampling contract sharper: the curriculum should preserve informative autocorrelation and approximate isotropy rather than merely chase high surprise. A sampler that overselects policy-shaped or highly non-Gaussian windows may train a non-collapsed but distorted latent state.

This creates the central tension:

rare events must not be averaged away
but rare-event upweighting must not destroy the latent geometry

The right target is therefore not maximum surprise. It is a controlled training distribution:

normal floor             -> retain baseline dynamics and calibration
useful surprise band     -> learn rare, hard, and regime-changing states
extreme-tail cap         -> avoid corrupt, adversarial, or unlearnable samples
geometry checks          -> preserve autocorrelation, isotropy, and state coverage

In this frame, the curriculum is part of the world-model objective. It shapes which latent transitions the model is asked to make predictable, and it can help or hurt identifiability depending on whether the selected pairs still look like informative local state transitions.
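
The controlled distribution can be written as a mixture of a uniform floor and a band-weighted term. This is a minimal sketch, assuming each window already carries a band label; the band weights and the 10% floor are hypothetical, and the geometry checks are not modeled here.

```python
from typing import Dict, List


def sampling_probabilities(
    band_of: List[str], band_weight: Dict[str, float], floor: float = 0.1
) -> List[float]:
    """Mix a uniform floor with band-weighted sampling.

    Every window keeps probability at least floor / n, so no part of the
    corpus loses support, while the remaining mass follows the band weights.
    """
    n = len(band_of)
    raw = [band_weight[band] for band in band_of]
    total = sum(raw)
    if total <= 0:
        raise ValueError("band weights must have positive total mass")
    return [floor / n + (1.0 - floor) * weight / total for weight in raw]


# Hypothetical corpus slice: mostly normal, one useful and one extreme window.
bands = ["normal"] * 8 + ["useful", "extreme"]
band_weight = {"normal": 1.0, "useful": 4.0, "extreme": 0.5}  # extreme is capped
probabilities = sampling_probabilities(bands, band_weight, floor=0.1)
```

The useful window is sampled most often, the extreme window least often, and every window stays above the floor of `0.1 / 10`.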

Identifiability-Aware Curriculum

The LeJEPA identifiability result is not a recipe to Gaussianize all temporal data. Real time-series and video corpora are often bounded, periodic, discrete, graph-structured, intervention-heavy, and long-tailed. A global Gaussian prior can be too blunt for that setting.

The useful interpretation is local. The curriculum should try to expose approximately local neighborhoods where the next-state relation is learnable and autocorrelated, while still covering rare regimes. Normal operation, degraded operation, pre-failure buildup, intervention windows, and recovery windows may each need their own local sampling buckets or charts.

This suggests a stronger design principle:

preserve rare regimes as regimes
do not collapse them into one high-surprise tail

For rare events, the curriculum should prefer structured rare windows over arbitrary high-loss windows. A structured rare window has temporal buildup, cross-channel correlation, topology, known intervention timing, repeated morphology, or a plausible physical/system mechanism. A merely high-loss window may be sensor corruption, ingestion damage, one-off exogenous noise, or a target-encoder artifact.

That distinction matters because SIGReg-style Gaussian or whitening constraints can keep embeddings non-collapsed while still hiding rare-state damage. The evaluation has to ask whether the representation keeps tail states linearly usable, calibrated, and connected to the surrounding normal dynamics.
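
A toy diagnostic illustrates how a global statistic can look healthy while a rare regime is damaged. This sketch assumes regime labels are available and uses per-dimension variance as a crude, diagonal-only proxy for embedding health; it is not a test of linear usability or calibration, and the embeddings are made up.

```python
from collections import defaultdict
from statistics import pvariance
from typing import Dict, List, Sequence


def per_dimension_variance(embeddings: Sequence[Sequence[float]]) -> List[float]:
    """Population variance of each embedding dimension."""
    return [pvariance(column) for column in zip(*embeddings)]


def regime_variances(
    embeddings: Sequence[Sequence[float]], regimes: Sequence[str]
) -> Dict[str, List[float]]:
    """Per-dimension variance computed separately inside each regime."""
    grouped: Dict[str, List[Sequence[float]]] = defaultdict(list)
    for embedding, regime in zip(embeddings, regimes):
        grouped[regime].append(embedding)
    return {regime: per_dimension_variance(rows) for regime, rows in grouped.items()}


# Hypothetical 2-D embeddings: the rare regime has collapsed to a single point.
embeddings = [
    (-1.0, -1.0), (1.0, 1.0), (-1.0, 1.0), (1.0, -1.0),  # normal
    (3.0, 0.0), (3.0, 0.0),                              # rare
]
regimes = ["normal"] * 4 + ["rare"] * 2

global_variance = per_dimension_variance(embeddings)
by_regime = regime_variances(embeddings, regimes)
```

The global variance is positive in both dimensions, so a global non-collapse check passes, yet the rare regime has zero variance in every dimension. Only the per-regime view exposes that distinct rare windows have become indistinguishable.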

Implementation Sketch

A concrete implementation can be built as a sampler around a JEPA training loop.

Candidate stream. Read windows from a large time-series corpus or clips from a video corpus. For multivariate time series, candidate units can be channel-time patches, graph telemetry windows, or full trajectories. For video, candidate units can be spatiotemporal clips.
JEPA forward pass. Encode context and target views, then predict the target embedding from the context embedding. In VL-JEPA-like settings, the target may be a text embedding conditioned on visual input and a query. In time-series JEPA, the target may be a future latent state or a text-conditioned state embedding.
Selection score. Compute a scalar or vector signal such as target-prediction loss, normalized latent residual, energy score, embedding distance, uncertainty, ensemble disagreement, gradient magnitude, gradient alignment with a rare-state probe, or gradient conflict with normal-behavior retention.
Normalization. Normalize each selection signal within a recent moving window, per dataset shard, per modality, per series, or per regime. This matters because raw loss, gradient norm, and alignment can be dominated by scale, sensor noise, frame complexity, or easy dataset artifacts.
Selection policy. Maintain a target distribution over surprise and/or gradient-space signals. Possible policies include percentile bands, temperature sampling by surprise, bounded hard-example mining, gradient-norm caps, gradient-alignment thresholds, mixture sampling across signal buckets, or a controller that keeps the batch-level signal near a scheduled target.
Safety valves. Keep a small uniform-sampling component so the model does not forget normal behavior or overfit to rare anomalies. Cap extremely high-surprise examples so corrupted data does not dominate.
Support-preservation floor. Keep a no-filter or loose-filter component, not only a normal-window floor. This preserves natural distributional diversity that heuristic filters may misclassify as low quality.
Regime buckets. Track sampling statistics by normal windows, rare-but-structured windows, intervention windows, transition windows, and suspected-corrupt windows when labels or weak heuristics are available. The same global surprise percentile can mean different things in different regimes.
Geometry checks. Monitor embedding covariance, per-regime coverage, positive-pair autocorrelation, and tail-slice probe performance. These checks are a guard against a sampler that improves average loss while producing a distorted latent state.
Target-encoder sanity checks. Compare patch-independent targets against contextual or internal-layer targets. This check is motivated by a NEPA-style next-embedding-prediction experiment rather than by a clean JEPA result; for JEPA, it should be treated as a target-construction risk to ablate. If the target encoder already mixes neighboring patches too strongly, the surprise score can become a bad proxy for marginal learning value.
Feedback loop. Periodically refresh scores because surprise, gradient magnitude, and gradient direction are model-dependent. A stale score can turn the curriculum into a static data filter.
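
The normalization step in the list above can be sketched as a per-group moving z-score. This is a minimal illustration, assuming a scalar selection score and a group key such as a series or shard identifier; the `GroupNormalizer` class, window size, and example scores are hypothetical.

```python
from collections import defaultdict, deque
from statistics import fmean, pstdev
from typing import Deque, Dict, Hashable


class GroupNormalizer:
    """Z-score each score against a moving window of its own group's history."""

    def __init__(self, window: int = 256, eps: float = 1e-8) -> None:
        self.history: Dict[Hashable, Deque[float]] = defaultdict(
            lambda: deque(maxlen=window)
        )
        self.eps = eps

    def normalize(self, group: Hashable, score: float) -> float:
        past = self.history[group]
        z = 0.0  # neutral value until the group has enough history
        if len(past) >= 2:
            z = (score - fmean(past)) / (pstdev(past) + self.eps)
        past.append(score)  # the new score only affects later calls
        return z


normalizer = GroupNormalizer(window=256)
# A noisy sensor and a quiet sensor with the same relative pattern.
for raw in (10.0, 12.0, 10.0, 12.0):
    normalizer.normalize("noisy_sensor", raw)
for raw in (0.10, 0.12, 0.10, 0.12):
    normalizer.normalize("quiet_sensor", raw)

z_noisy = normalizer.normalize("noisy_sensor", 14.0)
z_quiet = normalizer.normalize("quiet_sensor", 0.14)
```

Both final windows are three standard deviations above their own group's recent mean, so they receive nearly the same normalized score even though their raw values differ by two orders of magnitude.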

The first implementation should follow a single path: a streaming sampler with moving percentile buckets and a small uniform floor. Gradient-magnitude and gradient-alignment filters should be second-stage ablations or calibration signals after that baseline works.
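
A minimal sketch of that first baseline, assuming one scalar surprise value per candidate window. The `SurpriseBandSampler` class, its defaults, and the warm-up rule are hypothetical, and sorting the window on every call is kept for clarity rather than efficiency.

```python
import random
from collections import deque
from typing import Deque


class SurpriseBandSampler:
    """Streaming keep/drop decisions from a moving window of recent surprise."""

    def __init__(self, window: int = 1000, low_q: float = 0.5,
                 high_q: float = 0.95, floor: float = 0.1,
                 warmup: int = 20, seed: int = 0) -> None:
        self.recent: Deque[float] = deque(maxlen=window)
        self.low_q = low_q
        self.high_q = high_q
        self.floor = floor
        self.warmup = warmup
        self.rng = random.Random(seed)

    def _quantile(self, q: float) -> float:
        ordered = sorted(self.recent)
        index = min(int(q * len(ordered)), len(ordered) - 1)
        return ordered[index]

    def keep(self, surprise: float) -> bool:
        if len(self.recent) < self.warmup:
            decision = True  # not enough history to define bands yet
        else:
            low = self._quantile(self.low_q)
            high = self._quantile(self.high_q)
            in_band = low <= surprise <= high
            # Uniform floor: out-of-band windows survive with probability `floor`.
            decision = in_band or self.rng.random() < self.floor
        # The window tracks all candidates, kept or not, so the bands follow
        # the current score distribution rather than the selected subset.
        self.recent.append(surprise)
        return decision


sampler = SurpriseBandSampler(floor=0.1)
stream = [float(v % 20) for v in range(200)]  # hypothetical surprise stream
kept = [s for s in stream if sampler.keep(s)]
```

Because the window stores the most recent scores from the current model, refreshing scores shifts the band edges automatically. Setting `floor=0.0` turns the sampler into a pure band filter, which is the ablation that tests whether the floor matters.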

Private Distributed Training Application

The same filter becomes useful when a company will not export raw data. In that setting, the local training worker can run inside the company boundary, score candidate windows with the current model, apply the surprise or gradient-space curriculum locally, and export only a bounded update signal.

flowchart LR
  Data[Company-local unlabeled data]
  Scorer[Local JEPA scorer]
  Filter[Local signal filter]
  Train[Local training step]
  Guard[Privacy guard]
  Update[Gradient or adapter delta]
  Coord[External coordinator]
  Model[Updated shared model]
  Data --> Scorer --> Filter --> Train --> Guard --> Update --> Coord --> Model
  Model --> Scorer

This does not make the protocol privacy-preserving by itself. Raw gradients, low-rank deltas, adapter updates, and even filtered example statistics can leak private information. The point is narrower: dynamic filtering solves the local data-selection problem when the external coordinator cannot inspect, label, deduplicate, or curate the company corpus.

The resulting contract is:

raw data stays local
selection happens locally
only bounded update signals leave
privacy leakage is measured explicitly

For time-series and observability use-cases, this is especially important because the company-local stream may be mostly normal operation with sparse incidents, deployments, interventions, topology changes, or customer-specific regimes. Uniform local training would spend the exported gradient budget on redundant normal windows. Surprise-band or gradient-space filtering should spend that budget on windows that still improve latent-state prediction while preserving a uniform floor for normal behavior.

This links the curriculum idea to Company-Local Block-Wise Fine-Tuning. Block-wise or adapter-local training can define where the data-touching update is computed; dynamic filtering can define which local windows are allowed to produce that update.
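
The privacy-guard step in the flow above can be pictured as norm clipping plus optional noise on the update before export. This is a sketch only: the function names, clip bound, and noise scale are hypothetical, the noise is not calibrated to any privacy budget, and nothing here constitutes a privacy guarantee.

```python
import math
import random
from typing import List, Sequence


def clip_update(update: Sequence[float], max_norm: float) -> List[float]:
    """Rescale the update so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(component * component for component in update))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [component * scale for component in update]


def guard_update(
    update: Sequence[float], max_norm: float, noise_std: float, rng: random.Random
) -> List[float]:
    """Clip the update, then add independent Gaussian noise per coordinate."""
    clipped = clip_update(update, max_norm)
    return [component + rng.gauss(0.0, noise_std) for component in clipped]


rng = random.Random(0)
local_update = [3.0, 4.0]  # hypothetical adapter delta computed on filtered windows
bounded = clip_update(local_update, max_norm=1.0)
exported = guard_update(local_update, max_norm=1.0, noise_std=0.01, rng=rng)
```

Clipping bounds how much any single local step can move the shared model, which is what makes the exported signal "bounded". Whether the result leaks private information still has to be measured with the membership-inference and gradient-inversion probes described on this page.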

Hypotheses

The main hypotheses are:

Useful-signal scarcity. The bottleneck in large temporal corpora is not only data volume. It is the imbalance between repetitive normal behavior and rare decision-relevant transitions.
Surprise as value proxy. A model’s current latent prediction surprise is a useful proxy for the marginal learning value of a window.
Gradient effect as value proxy. A candidate window’s gradient magnitude, gradient direction, or alignment with a probe objective may be a better local proxy for training value than scalar surprise alone.
Target band. The best examples are not necessarily the hardest examples. There should be a useful surprise band between already-learned samples and unlearnable or corrupted samples.
Cross-modal transfer. The same curriculum principle should apply to time series and image/video trajectories because both are temporal streams with uneven information density.
Rare-state preservation. Dynamic selection should improve rare-event sensitivity, regime understanding, and corner-case behavior without turning the whole task into anomaly-only classification.
Distribution-aware refinement. The empirical heuristic may become stronger if combined with distribution constraints on the embedding space, such as SIGReg-style Gaussian regularization from LeJEPA.
Identifiability-aware sampling. Selection should track whether chosen positive pairs preserve informative autocorrelation and avoid policy-shaped non-Gaussian marginals that make the learned state distorted rather than linearly identifiable.
Local-chart hypothesis. Long-tailed temporal data may not be globally Gaussian, but regime-local neighborhoods can still behave like learnable autocorrelated state charts. The curriculum should preserve these local charts instead of forcing all tail states into one global bucket.
Tail-geometry tradeoff. Rare-state upweighting should improve tail sensitivity while keeping the learned state geometry usable for probing, retrieval, planning, or action-conditioned rollout.
Target-dependence risk. Surprise is only useful if the target representation preserves the relevant local state. If the target encoder collapses patch-level distinctions or over-correlates neighboring patches, the curriculum can confidently select the wrong windows.
Metric-target risk. A metagradient-calibrated curriculum is only as good as the differentiable target used to score examples; Synthetic Data for any Differentiable Target shows that optimized synthetic examples can steer hidden weights or narrow metrics while looking benign.
Model-level robustness. Online selection helps with an unlabeled long-tailed corpus, but a stronger solution should also make JEPA encoders robust when the corpus distribution is unknown and cannot be manually characterized.
Private-update efficiency. In company-local distributed training, local surprise filtering should improve the utility of exported gradients or adapter deltas because the remote coordinator cannot directly curate the private corpus.
Privacy-boundary separation. Dynamic filtering is a data-selection mechanism, not a privacy guarantee; it must be paired with update clipping, secure aggregation, differential privacy where appropriate, and gradient-inversion or membership-inference tests.

Experiments Already Done

The current evidence is internal and unpublished.

In the discussion notes, Alex described roughly two months of prior experiments on non-public blockchain or time-series data. The data was unlabeled and large enough that manually deciding which windows mattered was not realistic. The experiments used online filtering during training and tested several heuristics for keeping model surprise in a useful range.

The reported result was substantial training acceleration and, more importantly, improved interpretability or predictability of rare events. Some seemingly obvious heuristics reportedly did not work, so the publishable version should preserve negative ablations instead of presenting surprise filtering as a one-line trick. These experiments were not published because the data was not suitable for publication.

No video proof of concept is recorded here yet. The proposed next step is to test the same surprise-curriculum policy on a video JEPA or VL-JEPA setup.

Publishable Experiment Plan

The publishable version should use public datasets in both modalities.

For time series:

use high-dimensional multivariate time series, observability telemetry, telecom-like data, or anomaly-rich industrial datasets;
compare uniform sampling against dynamic surprise-band sampling;
include no-filter, loose-filter, dedup-only, static quality-filter, and dynamic-reweighting baselines;
evaluate forecasting or state prediction, rare-event detection, regime classification, and downstream anomaly or incident tasks;
track whether the representation preserves spikes, change points, missingness, cross-channel deviations, and rare failures.

For video:

use video datasets where most frames are predictable but useful cases depend on motion boundaries, event order, rare actions, or corner cases;
compare standard JEPA/VL-JEPA training against the dynamic sampler;
include no-filter or minimally filtered data baselines when the dataset has natural diversity;
evaluate video classification, retrieval, temporal event localization, and corner-case sensitivity;
measure whether the model spends learning capacity on informative clips rather than repeated background dynamics.

For private distributed training:

simulate several company-local tenants with different multivariate time-series, event-stream, or graph-time-series distributions;
keep raw windows hidden from the coordinator and allow only gradients, adapter deltas, quantized deltas, or secure aggregates to leave the tenant boundary;
compare local uniform sampling against local dynamic surprise-band filtering at matched exported update budget;
evaluate global utility, tenant-local utility, rare-regime retention, normal-behavior retention, and leakage resistance;
run membership-inference and gradient-inversion probes on exported update signals before claiming any privacy benefit.

The cross-modal research claim should be tested directly:

the same selection-signal-controlled curriculum improves useful-signal efficiency
for both time series and image/video trajectories.

Metrics And Ablations

Useful metrics:

pretraining speed at matched compute;
downstream score at matched number of tokens, windows, clips, or FLOPs;
rare-event recall and calibration;
performance on regime shifts and corner cases;
embedding distribution health;
positive-pair autocorrelation and per-regime isotropy;
tail-slice embedding coverage and probe performance;
normal-retention versus rare-state-preservation tradeoff curves;
robustness to corrupted high-surprise samples;
normal-behavior retention after hard-example upweighting;
distribution-support retention versus no-filter and loose-filter baselines;
duplicate/redundant-window rate after selection;
score staleness and refresh cadence;
target-encoder dependence of the surprise or selection score;
gradient magnitude distribution, gradient-direction alignment with rare-state probes, and gradient conflict with normal-behavior retention;
separation between useful rare windows and unlearnable or corrupt windows;
utility per exported gradient byte or adapter-delta parameter in private distributed training;
leakage risk of exported update signals after local filtering.
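
Score staleness, one of the metrics above, can be approximated by the rank correlation between scores cached at an earlier checkpoint and scores refreshed with the current model. This is a minimal sketch that assumes no tied scores; the helper names and the example values are hypothetical.

```python
from typing import List, Sequence


def ranks(values: Sequence[float]) -> List[int]:
    """Zero-based rank of each value; assumes there are no ties."""
    order = sorted(range(len(values)), key=lambda index: values[index])
    result = [0] * len(values)
    for rank, index in enumerate(order):
        result[index] = rank
    return result


def spearman(a: Sequence[float], b: Sequence[float]) -> float:
    """Spearman rank correlation for equal-length, tie-free sequences."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    rank_a, rank_b = ranks(a), ranks(b)
    mean = (len(a) - 1) / 2
    covariance = sum((x - mean) * (y - mean) for x, y in zip(rank_a, rank_b))
    variance = sum((x - mean) ** 2 for x in rank_a)
    return covariance / variance


# Hypothetical surprise scores for the same four windows.
cached = [0.9, 0.4, 0.7, 0.1]      # computed at an earlier checkpoint
refreshed = [0.2, 0.5, 0.3, 0.8]   # recomputed with the current model
staleness_signal = spearman(cached, refreshed)
```

A correlation near 1 suggests the cached scores still order windows the way the current model would, so the refresh cadence can be relaxed. A low or negative value, as in this toy case where the ordering has fully reversed, suggests the sampler is acting on an outdated ranking.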

Necessary ablations:

uniform sampling;
no-filter or minimally filtered sampling;
loose quality filters;
deduplication-only filtering;
static anomaly or loss-based filtering;
hard-example-only sampling;
dynamic surprise-band sampling;
dynamic sampling with and without a uniform floor;
dynamic sampling with and without embedding-distribution regularization;
dynamic sampling with and without per-regime buckets;
dynamic sampling with and without autocorrelation or geometry constraints;
gradient-magnitude sampling;
gradient-alignment sampling against rare-state, retention, or geometry probe gradients;
DPG-calibrated surprise or gradient scoring versus purely local scoring;
deliberate tail oversampling versus capped useful-tail sampling;
per-modality and shared curriculum policies;
company-local uniform training versus company-local surprise-filtered training;
raw gradients versus low-rank deltas, quantized deltas, clipped updates, and secure aggregates.

Relation To Existing Wiki Threads

This idea extends the wiki’s Latent-State Time-Series Modeling position: the model should maintain useful latent state, not merely reduce average forecast error.

It also connects to JEPA and Latent-Space Predictive Learning: the selection signal should come from representation-space prediction or gradient-space training effect, not from raw reconstruction alone.

JEPA Slow Features is a warning source. A model can avoid trivial collapse while still learning the wrong factors. Dynamic curriculum should therefore be evaluated on whether it preserves decision-relevant state, not only on whether it lowers loss.

LeJEPA and LeWorldModel are useful anchors for the regularization and world-model branches. The curriculum should not fight anti-collapse regularization; it should complement it by choosing examples that expose the state distinctions the representation must preserve.

When Does LeJEPA Learn a World Model? adds the sampling and geometry caveat. Gaussian or whitening constraints can make a learned state linearly identifiable when the latent world and positive pairs match the assumptions. In long-tailed temporal corpora, the same constraints are only safe if the sampler preserves rare regimes, local autocorrelation, and useful state coverage rather than overselecting distorted high-surprise tails.
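A rough way to check whether a sampler preserves autocorrelation and latent geometry is to compute the same diagnostics on embeddings from a uniform sample and from the curriculum-selected sample, then compare. The sketch below is an assumption-laden illustration. `lag1_autocorr` works on one latent coordinate along a contiguous sequence, and `variance_ratio` is only an axis-aligned proxy for isotropy; a fuller check would look at the eigenvalue spectrum of the embedding covariance. Both function names are hypothetical.

```python
def lag1_autocorr(series):
    """Lag-1 autocorrelation of one latent coordinate over time."""
    n = len(series)
    if n < 2:
        return 0.0
    mean = sum(series) / n
    var = sum((x - mean) ** 2 for x in series)
    if var == 0.0:
        return 0.0
    cov = sum((series[t] - mean) * (series[t + 1] - mean) for t in range(n - 1))
    return cov / var


def variance_ratio(latents):
    """Largest over smallest per-dimension variance.

    latents: list of embedding vectors (one list of floats per window).
    A value near 1.0 suggests axis-aligned isotropy; large values suggest
    that some latent directions are being stretched or starved.
    """
    n = len(latents)
    dims = len(latents[0])
    variances = []
    for d in range(dims):
        column = [row[d] for row in latents]
        mean = sum(column) / n
        variances.append(sum((x - mean) ** 2 for x in column) / n)
    smallest = min(variances)
    if smallest == 0.0:
        return float("inf")  # a collapsed dimension
    return max(variances) / smallest
```

If the selected sample shows much lower autocorrelation or a much larger variance ratio than the uniform sample, that would be one signal that the curriculum is overselecting distorted high-surprise tails.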

Distribution Priors In Self-Supervised Learning is the broader topic page for this risk: anti-collapse priors can preserve variance while still suppressing rare regimes, interventions, or long-tail system states.

A Bitter Lesson for Data Filtering and No Filter add the data-support warning. In this wiki’s terms, dynamic curriculum should not be sold as “filter harder.” It should be sold as “spend compute where the current model still learns, while preserving a measurable sampling floor for natural diversity and long-tail regimes.”

Company-Local Block-Wise Fine-Tuning is the private-adaptation neighbor. Its boundary question is where private data touches the model and what update signal may leave. Dynamic curriculum adds a complementary question: which private windows should be allowed to create an update when the coordinator cannot inspect the corpus.

Relation To Foundation TSFM Agenda

This is an idea page, so the verdicts below describe the intended contribution if the proposed experiment works. Evidence status is recorded separately in the Evidence and Missing pieces columns. In other words, this idea targets the long-tail curriculum slot directly even though the publishable evidence still has to be produced.

Agenda slot	Verdict	Evidence	Missing pieces
Data diversity, curriculum, and long tail	partially closes	If validated, surprise-band or gradient-space sampling would spend training compute on informative rare, hard, or regime-changing windows instead of uniform sampling over repetitive normal behavior. The no-filter sources add a constraint: the sampler should preserve natural distributional support instead of deleting low-quality-looking tails. Current direct evidence is internal and unpublished.	Run public time-series and video experiments under matched compute, plus private-distributed simulations with rare-state metrics, normal-retention checks, no-filter/loose-filter/dedup-only baselines, corrupt-window controls, distribution-support probes, and leakage tests.
Data valuation	adjacent	DPG gives a current metagradient example-value mechanism that could calibrate or train a curriculum sampler when a differentiable rare-state or representation-health target exists. Gradient magnitude and direction are cheaper local proxies for the same training-effect question.	Adapt DPG-style data valuation and gradient-space proxies to time-series JEPA windows without making the sampler too expensive, too noisy, or too metric-hacked.
Representation quality	adjacent	Proposes JEPA latent prediction surprise and gradient-space training effect as selection signals, tying the curriculum to state representation rather than raw reconstruction alone. LeJEPA Identifiability adds that selected positive pairs should preserve autocorrelation and usable latent geometry.	Add probes showing that selected windows preserve semantic state, dense numeric detail, local state geometry, and tail-slice linear usability.
Dynamic compute allocation	adjacent	Proposes reallocating training examples and pretraining compute toward high-value windows, including local update budget in private distributed training.	This does not yet allocate inference-time model depth, patch granularity, or serving compute.
Benchmark level	warning	Makes aggregate scores insufficient: the core claim is rare-state preservation under useful-signal-poor data.	Define public slices for rare events, regime changes, intervention windows, corner cases, corrupt windows, and normal-retention checks.

Open Questions

What is the best selection score for JEPA: latent loss, normalized residual, energy, variance, uncertainty, disagreement, downstream probe loss, gradient magnitude, or gradient direction?
Should the curriculum target a fixed surprise percentile, a moving distribution, or a scheduled difficulty curve?
How much uniform sampling is needed to avoid forgetting normal behavior?
How much no-filter or loose-filter sampling is needed to preserve natural diversity while still saving compute?
At what compute/model-size scale does dynamic filtering stop helping relative to no-filter on temporal corpora?
How should the sampler distinguish rare useful events from corrupt data?
How often must surprise scores be refreshed before the curriculum becomes stale?
How should target encoders be designed so surprise reflects marginal learning value rather than patch-dependence artifacts?
Can the same policy be shared across time series and video, or does each modality need its own normalization and target band?
Does SIGReg-style regularization make surprise scores more stable and comparable across training?
Should Gaussian or whitening constraints be enforced globally, per regime, or only as soft geometry checks?
Can rare regimes be modeled as local latent charts without losing the continuity needed for world modeling?
Which autocorrelation and isotropy diagnostics best predict whether a curriculum helps or hurts identifiability?
Can distribution-shaping theory explain which surprise bands should be kept, or is the policy necessarily empirical?
Can DPG-style metagradient data valuation calibrate surprise scores, or is it too expensive and too target-dependent for large temporal corpora?
Can gradient-magnitude or gradient-alignment filters recover most of the benefit of metagradient data valuation without backpropagating through long training trajectories?
For video and event streams, does segment-level or motion-event-level attribution outperform whole-window scoring without making storage, privacy, and compute costs prohibitive?
Which parameter subset should define gradient direction: full model, encoder, predictor, adapter, or action-conditioned dynamics block?
In private distributed training, can per-example gradient signals be used locally without leaking sensitive information through exported curriculum statistics?
Can this become a general foundation-model primitive for useful-signal-efficient temporal pretraining?
In company-local distributed training, should the local filter export only gradients or also privacy-safe aggregate curriculum statistics?
How should the protocol prevent a malicious or compromised coordinator from inferring private examples through adaptive model updates?
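The refresh-cadence question could be approached empirically by recomputing scores on a small probe subset and checking how far the cached ranking has drifted. The sketch below is a minimal illustration, assuming no tied scores and an arbitrary threshold. `min_corr=0.8` and the function names are hypothetical and are not values taken from any experiment in these notes.

```python
import math


def ranks(values):
    """Rank positions (0 = smallest). Assumes no ties, for simplicity."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    for rank, idx in enumerate(order):
        result[idx] = float(rank)
    return result


def spearman(a, b):
    """Spearman rank correlation, computed as Pearson correlation of ranks."""
    ra, rb = ranks(a), ranks(b)
    n = len(a)
    mean_a = sum(ra) / n
    mean_b = sum(rb) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(ra, rb))
    var_a = sum((x - mean_a) ** 2 for x in ra)
    var_b = sum((y - mean_b) ** 2 for y in rb)
    if var_a == 0.0 or var_b == 0.0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)


def needs_refresh(cached_scores, fresh_scores, min_corr=0.8):
    """True when the cached ranking no longer matches the current model."""
    return spearman(cached_scores, fresh_scores) < min_corr
```

Because a band sampler depends on the ordering of scores rather than their absolute values, a rank correlation is used here instead of a raw difference. Tracking how many training steps pass before `needs_refresh` first returns true would give a direct estimate of the staleness horizon.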

Alex Open Research Wiki
