Dynamic Curriculum Learning For JEPA
Status: draft research idea extracted from internal discussion notes.
Collaboration
If this direction resonates with you, I would be happy to talk with like-minded people, collaborate on research, and work on use-cases together.
Ideas are not the bottleneck. Hands are. Time-series modeling should be moving at least as fast as vision, audio, and robotics.
- Email: [email protected]
- X: @chemeris
- Telegram: @alexanderchemeris
Summary
Large temporal datasets are often data-rich but useful-signal-poor. In observability, telecom, blockchain, industrial telemetry, and many video settings, most windows show normal behavior. The rare windows are the ones that carry failures, regime changes, corner cases, motion boundaries, or other decision-relevant state.
The idea is to train JEPA-style models with a dynamic curriculum that filters or reweights training examples online according to the model’s current surprise. Instead of sampling uniformly from a huge corpus, the training loop continuously asks which windows are still informative for the current model.
This is especially natural for JEPA, because the model already predicts in representation space. Surprise can be measured as a latent prediction error rather than as raw reconstruction error.
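As a minimal sketch of what a latent surprise score could look like, the snippet below computes two candidate scalars from a predicted embedding and a target embedding. The function names `latent_surprise` and `cosine_surprise` are illustrative assumptions, not part of any JEPA codebase, and embeddings are plain Python sequences only to keep the example self-contained.

```python
import math
from typing import Sequence


def latent_surprise(predicted: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error between predicted and target embeddings (hypothetical helper)."""
    if len(predicted) != len(target):
        raise ValueError("embeddings must have the same dimension")
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)


def cosine_surprise(predicted: Sequence[float], target: Sequence[float]) -> float:
    """1 - cosine similarity: a scale-invariant alternative (hypothetical helper)."""
    if len(predicted) != len(target):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(p * t for p, t in zip(predicted, target))
    norm = math.sqrt(sum(p * p for p in predicted)) * math.sqrt(sum(t * t for t in target))
    if norm == 0.0:
        return 1.0  # assumption: treat a zero vector as maximally dissimilar
    return 1.0 - dot / norm
```

Neither score touches raw observations, so both measure error in representation space. Which one works better as a curriculum signal is an empirical question.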
The stronger research claim is that curriculum should become a foundation-model primitive for useful-signal-poor temporal corpora, not just a preprocessing trick. The sampler, target embedding distribution, anti-collapse regularizer, and rare-state evaluation all become part of the model recipe.
After “When Does LeJEPA Learn a World Model?”, the sharper version is that dynamic curriculum is a distribution-shaping controller for JEPA pretraining. It should spend compute on rare informative transitions while preserving enough autocorrelated, approximately isotropic coverage for the learned state to remain linearly usable.
A newly identified deployment use-case is company-local distributed training. When raw enterprise data cannot leave the company’s boundary and is mostly unlabeled, the same dynamic filter can decide which local windows are worth turning into update signals before gradients, adapter deltas, or secure aggregates are exported to an external coordinator.
Problem
The practical problem has two parts.
First, there is too much data to train on naively. A month of high-dimensional time series can contain terabytes of data and trillions of points. Video has the same shape at a different scale: long spans of predictable background dynamics can dominate the training stream.
Second, most of that data is not equally useful. If a model sees mostly normal operation, it can learn a strong representation of normality while discarding rare, hard, or abnormal cases as noise. For operational systems, safety settings, and video corner cases, those rare windows are often the main reason to train the model.
So the target is not simply “more data.” The target is higher useful learning signal per unit of compute.
This creates four related but separable problems:
- Dataset-level selection. Given an unlabeled corpus, identify which windows still teach the current model something useful.
- Model-level robustness. Given arbitrary temporal data, train an encoder that does not erase rare correlated events, change points, or decision-relevant state even when those cases are sparse.
- Distribution shaping. Given a long-tailed stream, choose training pairs so rare states are represented without turning the whole objective into hard-example mining over noise, corrupt windows, or policy-shaped tails.
- Private local selection. Given unlabeled company data that cannot be inspected centrally, choose locally which windows should contribute to exported update signals without revealing raw observations.
High-Level Idea
Use online data selection during JEPA pretraining.
The training loop should:
- Run candidate windows, clips, or trajectories through the current model.
- Compute a surprise score for each candidate.
- Keep, upweight, downweight, or drop examples according to a moving curriculum policy.
- Update the curriculum as the model learns, so examples that were useful early can become cheap or redundant later.
The core heuristic is:
too little surprise -> already learned or uninformative, downweight
useful surprise band -> informative training signal, keep or upweight
too much surprise -> possible noise, corruption, or out-of-distribution case, cap or inspect
This makes curriculum learning dynamic rather than fixed. The same example can move between bands as the encoder and predictor improve.
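The three-band heuristic can be sketched as a mapping from a surprise percentile to a sampling weight. The thresholds (`low=0.5`, `high=0.98`) and weights below are placeholder assumptions chosen for illustration, not values from the experiments. `percentile_rank` and `band_weight` are hypothetical names.

```python
from bisect import bisect_right
from typing import Sequence


def percentile_rank(score: float, recent_sorted: Sequence[float]) -> float:
    """Fraction of recent scores that are <= score; recent_sorted must be sorted ascending."""
    if not recent_sorted:
        raise ValueError("need at least one recent score")
    return bisect_right(recent_sorted, score) / len(recent_sorted)


def band_weight(percentile: float, low: float = 0.5, high: float = 0.98,
                floor_weight: float = 0.1, cap_weight: float = 0.0) -> float:
    """Map a surprise percentile in [0, 1] to a sampling weight (assumed thresholds)."""
    if not 0.0 <= percentile <= 1.0:
        raise ValueError("percentile must lie in [0, 1]")
    if percentile < low:
        return floor_weight   # too little surprise: downweight
    if percentile <= high:
        return 1.0            # useful surprise band: keep
    return cap_weight         # too much surprise: cap (or route to inspection)


recent = [1.0, 2.0, 3.0, 4.0]
weight = band_weight(percentile_rank(3.0, recent))  # 0.75 falls in the useful band
```

Because the percentile is computed against recent scores from the current model, the same window can receive a different weight later in training without any change to the thresholds.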
Historical World-Model Analogue
World Models is a historical RL/world-model analogue, not direct JEPA or time-series evidence. The authors propose using the dynamics model’s prediction loss to drive exploration and collect data where the model is unfamiliar. That makes it a useful predecessor for surprise-based temporal curricula, while preserving the caveat that high surprise can also mean corrupt, adversarial, or irrelevant states.
Agenda Impact
This idea changes the research agenda from “train on more temporal data” to “spend training compute on the windows that preserve latent state.” In this frame, a dynamic curriculum is not only an accelerator. It is a mechanism for preventing a foundation time-series model from normalizing away the rare states that make industrial, observability, blockchain, robotics, and video corpora valuable.
The publishable target should therefore report rare-state preservation and normal-behavior retention together. A method that improves average loss by selecting only common or easy windows is a failure. A method that overweights anomalies until normal dynamics are forgotten is also a failure.
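One way to make the two failure modes concrete is a joint slice report. The sketch below assumes per-window losses are already tagged as `rare` or `normal` by some held-out labeling, that lower loss is better, and that a 2% regression on the normal slice is tolerable. All three are assumptions for illustration, and `slice_report` is a hypothetical helper.

```python
from statistics import mean
from typing import Dict, List


def slice_report(baseline: Dict[str, List[float]],
                 candidate: Dict[str, List[float]],
                 tolerance: float = 0.02) -> Dict[str, object]:
    """Compare a candidate sampler against a baseline on rare and normal slices.

    Each dict maps 'rare' and 'normal' to per-window evaluation losses.
    """
    rare_gain = 1.0 - mean(candidate["rare"]) / mean(baseline["rare"])
    normal_regression = mean(candidate["normal"]) / mean(baseline["normal"]) - 1.0
    return {
        "rare_gain": rare_gain,                  # > 0 means the rare slice improved
        "normal_regression": normal_regression,  # > 0 means the normal slice got worse
        # Pass only if rare states improved AND normal behavior was retained.
        "passes": rare_gain > 0.0 and normal_regression <= tolerance,
    }
```

A sampler that only improves the normal slice fails on `rare_gain`, and a sampler that trades away normal dynamics fails on `normal_regression`, so neither can pass on average loss alone.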
The current best first system is still simple: a streaming sampler with moving surprise buckets, a uniform sampling floor, caps for extreme high-surprise windows, and periodic score refresh. The agenda-level extension is to test whether that sampler can be interpreted as distribution shaping in embedding space, especially when paired with SIGReg-style Gaussian regularization.
“When Does LeJEPA Learn a World Model?” makes the sampling contract sharper: the curriculum should preserve informative autocorrelation and approximate isotropy rather than merely chase high surprise. A sampler that overselects policy-shaped or highly non-Gaussian windows may train a non-collapsed but distorted latent state.
This creates the central tension:
rare events must not be averaged away
but rare-event upweighting must not destroy the latent geometry
The right target is therefore not maximum surprise. It is a controlled training distribution:
normal floor -> retain baseline dynamics and calibration
useful surprise band -> learn rare, hard, and regime-changing states
extreme-tail cap -> avoid corrupt, adversarial, or unlearnable samples
geometry checks -> preserve autocorrelation, isotropy, and state coverage
In this frame, the curriculum is part of the world-model objective. It shapes which latent transitions the model is asked to make predictable, and it can help or hurt identifiability depending on whether the selected pairs still look like informative local state transitions.
Identifiability-Aware Curriculum
The LeJEPA identifiability result is not a recipe to Gaussianize all temporal data. Real time-series and video corpora are often bounded, periodic, discrete, graph-structured, intervention-heavy, and long-tailed. A global Gaussian prior can be too blunt for that setting.
The useful interpretation is local. The curriculum should try to expose approximately local neighborhoods where the next-state relation is learnable and autocorrelated, while still covering rare regimes. Normal operation, degraded operation, pre-failure buildup, intervention windows, and recovery windows may each need their own local sampling buckets or charts.
This suggests a stronger design principle:
preserve rare regimes as regimes
do not collapse them into one high-surprise tail
For rare events, the curriculum should prefer structured rare windows over arbitrary high-loss windows. A structured rare window has temporal buildup, cross-channel correlation, topology, known intervention timing, repeated morphology, or a plausible physical/system mechanism. A merely high-loss window may be sensor corruption, ingestion damage, one-off exogenous noise, or a target-encoder artifact.
That distinction matters because SIGReg-style Gaussian or whitening constraints can keep embeddings non-collapsed while still hiding rare-state damage. The evaluation has to ask whether the representation keeps tail states linearly usable, calibrated, and connected to the surrounding normal dynamics.
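A rough per-regime diagnostic might look like the sketch below. It uses two crude proxies that are assumptions of this example rather than the diagnostics from the LeJEPA analysis: the ratio of largest to smallest per-dimension variance as a stand-in for isotropy (it ignores off-diagonal covariance), and mean cosine similarity between consecutive embeddings as a stand-in for positive-pair autocorrelation. All function names are hypothetical.

```python
import math
from statistics import pvariance
from typing import Dict, List

Embedding = List[float]


def variance_spread(embeddings: List[Embedding]) -> float:
    """Largest / smallest per-dimension variance; 1.0 means equal variance on every axis."""
    variances = [pvariance(dim) for dim in zip(*embeddings)]
    smallest = min(variances)
    if smallest == 0.0:
        return math.inf  # at least one dimension has collapsed within this regime
    return max(variances) / smallest


def pair_similarity(trajectory: List[Embedding]) -> float:
    """Mean cosine similarity between consecutive embeddings of one trajectory."""
    sims = []
    for a, b in zip(trajectory, trajectory[1:]):
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(y * y for y in b))
        if na == 0.0 or nb == 0.0:
            continue  # skip degenerate pairs instead of dividing by zero
        sims.append(sum(x * y for x, y in zip(a, b)) / (na * nb))
    return sum(sims) / len(sims) if sims else 0.0


def regime_geometry(by_regime: Dict[str, List[Embedding]]) -> Dict[str, Dict[str, float]]:
    """Report both proxies separately per regime instead of pooling all windows."""
    return {
        name: {"variance_spread": variance_spread(embs),
               "pair_similarity": pair_similarity(embs)}
        for name, embs in by_regime.items()
    }
```

Reporting per regime matters here: a pooled statistic can look healthy while a rare regime has collapsed onto a single axis.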
Implementation Sketch
A concrete implementation can be built as a sampler around a JEPA training loop.
- Candidate stream. Read windows from a large time-series corpus or clips from a video corpus. For multivariate time series, candidate units can be channel-time patches, graph telemetry windows, or full trajectories. For video, candidate units can be spatiotemporal clips.
- JEPA forward pass. Encode context and target views, then predict the target embedding from the context embedding. In VL-JEPA-like settings, the target may be a text embedding conditioned on visual input and a query. In time-series JEPA, the target may be a future latent state or a text-conditioned state embedding.
- Surprise score. Compute a scalar such as target-prediction loss, normalized latent residual, energy score, embedding distance, uncertainty, or an ensemble disagreement proxy.
- Normalization. Normalize surprise within a recent moving window, per dataset shard, per modality, per series, or per regime. This matters because raw loss can be dominated by scale, sensor noise, frame complexity, or easy dataset artifacts.
- Selection policy. Maintain a target surprise distribution. Possible policies include percentile bands, temperature sampling by surprise, bounded hard-example mining, mixture sampling across surprise buckets, or a controller that keeps the batch-level surprise near a scheduled target.
- Safety valves. Keep a small uniform-sampling component so the model does not forget normal behavior or overfit to rare anomalies. Cap extremely high-surprise examples so corrupted data does not dominate.
- Regime buckets. Track sampling statistics by normal windows, rare-but-structured windows, intervention windows, transition windows, and suspected-corrupt windows when labels or weak heuristics are available. The same global surprise percentile can mean different things in different regimes.
- Geometry checks. Monitor embedding covariance, per-regime coverage, positive-pair autocorrelation, and tail-slice probe performance. These checks are a guard against a sampler that improves average loss while producing a distorted latent state.
- Target-encoder sanity checks. Compare patch-independent targets against contextual or internal-layer targets. This check is motivated by a NEPA-style next-embedding-prediction experiment rather than by a clean JEPA result; for JEPA, it should be treated as a target-construction risk to ablate. If the target encoder already mixes neighboring patches too strongly, the surprise score can become a bad proxy for marginal learning value.
- Feedback loop. Periodically refresh scores because surprise is model-dependent. A stale score can turn the curriculum into a static data filter.
This should be implemented as one best path first: a streaming sampler with moving percentile buckets and a small uniform floor. More complex controllers should come only after the baseline works.
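A minimal sketch of that baseline, assuming scalar surprise scores arrive one at a time from the current model: the class name `SurpriseBandSampler`, the window size, the band thresholds, and the 10% uniform floor are all illustrative assumptions, not the configuration used in the internal experiments.

```python
import random
from bisect import bisect_right
from collections import deque


class SurpriseBandSampler:
    """Streaming keep/drop decisions from moving-window surprise percentiles."""

    def __init__(self, window: int = 1024, low: float = 0.5, high: float = 0.98,
                 uniform_floor: float = 0.1, warmup: int = 32, seed: int = 0):
        self.recent = deque(maxlen=window)  # moving window of recent scores
        self.low, self.high = low, high     # assumed percentile band
        self.uniform_floor = uniform_floor  # probability of keeping a window unconditionally
        self.warmup = max(1, warmup)        # keep everything until percentiles are meaningful
        self.rng = random.Random(seed)

    def percentile(self, score: float) -> float:
        # Sorting on every call is O(n log n); fine for a sketch, not for production.
        ordered = sorted(self.recent)
        return bisect_right(ordered, score) / len(ordered)

    def accept(self, score: float) -> bool:
        """Return True if the window should be trained on; the score is always recorded."""
        if len(self.recent) < self.warmup:
            keep = True
        elif self.rng.random() < self.uniform_floor:
            keep = True  # uniform floor: protects normal-behavior retention
        else:
            # Below the band: already learned. Above the band: capped as suspect.
            keep = self.low <= self.percentile(score) <= self.high
        self.recent.append(score)
        return keep
```

Because every score enters the moving window whether or not the window is kept, the percentiles track the current model rather than a frozen snapshot. Re-scoring previously dropped windows with an updated model is not shown and would need a separate refresh pass.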
Private Distributed Training Application
The same filter becomes useful when a company will not export raw data. In that setting, the local training worker can run inside the company boundary, score candidate windows with the current model, apply the surprise-band curriculum locally, and export only a bounded update signal.
flowchart LR
  Data[Company-local unlabeled data]
  Scorer[Local JEPA scorer]
  Filter[Surprise-band filter]
  Train[Local training step]
  Guard[Privacy guard]
  Update[Gradient or adapter delta]
  Coord[External coordinator]
  Model[Updated shared model]
  Data --> Scorer --> Filter --> Train --> Guard --> Update --> Coord --> Model
  Model --> Scorer
This does not make the protocol privacy-preserving by itself. Raw gradients, low-rank deltas, adapter updates, and even filtered example statistics can leak private information. The point is narrower: dynamic filtering solves the local data-selection problem when the external coordinator cannot inspect, label, deduplicate, or curate the company corpus.
The resulting contract is:
raw data stays local
selection happens locally
only bounded update signals leave
privacy leakage is measured explicitly
For time-series and observability use-cases, this is especially important because the company-local stream may be mostly normal operation with sparse incidents, deployments, interventions, topology changes, or customer-specific regimes. Uniform local training would spend the exported gradient budget on redundant normal windows. Surprise-band filtering should spend that budget on windows that still improve latent-state prediction while preserving a uniform floor for normal behavior.
This links the curriculum idea to Company-Local Block-Wise Fine-Tuning. Block-wise or adapter-local training can define where the data-touching update is computed; dynamic filtering can define which local windows are allowed to produce that update.
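The local side of that contract could be sketched as follows. The callables `score_fn`, `keep_fn`, and `grad_fn` are hypothetical stand-ins for the local JEPA scorer, the surprise-band filter, and the local training step, and updates are flat lists of floats for simplicity. Norm clipping is shown only as one example of bounding the exported signal; on its own it is not a privacy guarantee.

```python
import math
from typing import Callable, List, Optional, Sequence


def clip_update(update: Sequence[float], max_norm: float) -> List[float]:
    """Rescale the update so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(x * x for x in update))
    if norm <= max_norm:
        return list(update)
    scale = max_norm / norm
    return [x * scale for x in update]


def local_round(windows: Sequence[object],
                score_fn: Callable[[object], float],
                keep_fn: Callable[[float], bool],
                grad_fn: Callable[[object], Sequence[float]],
                max_norm: float) -> Optional[List[float]]:
    """Select windows locally and return only a clipped mean update.

    Raw windows and per-window scores never leave this function.
    Returns None when no window falls in the useful band.
    """
    selected = [w for w in windows if keep_fn(score_fn(w))]
    if not selected:
        return None
    grads = [grad_fn(w) for w in selected]
    dim = len(grads[0])
    mean_update = [sum(g[i] for g in grads) / len(grads) for i in range(dim)]
    return clip_update(mean_update, max_norm)
```

In a real deployment the returned vector would still pass through secure aggregation or differential-privacy noise and be tested with leakage probes before any privacy claim is made.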
Hypotheses
The main hypotheses are:
- Useful-signal scarcity. The bottleneck in large temporal corpora is not only data volume. It is the imbalance between repetitive normal behavior and rare decision-relevant transitions.
- Surprise as value proxy. A model’s current latent prediction surprise is a useful proxy for the marginal learning value of a window.
- Target band. The best examples are not necessarily the hardest examples. There should be a useful surprise band between already-learned samples and unlearnable or corrupted samples.
- Cross-modal transfer. The same curriculum principle should apply to time series and image/video trajectories because both are temporal streams with uneven information density.
- Rare-state preservation. Dynamic selection should improve rare-event sensitivity, regime understanding, and corner-case behavior without turning the whole task into anomaly-only classification.
- Distribution-aware refinement. The empirical heuristic may become stronger if combined with distribution constraints on the embedding space, such as SIGReg-style Gaussian regularization from LeJEPA.
- Identifiability-aware sampling. Selection should track whether chosen positive pairs preserve informative autocorrelation and avoid policy-shaped non-Gaussian marginals that make the learned state distorted rather than linearly identifiable.
- Local-chart hypothesis. Long-tailed temporal data may not be globally Gaussian, but regime-local neighborhoods can still behave like learnable autocorrelated state charts. The curriculum should preserve these local charts instead of forcing all tail states into one global bucket.
- Tail-geometry tradeoff. Rare-state upweighting should improve tail sensitivity while keeping the learned state geometry usable for probing, retrieval, planning, or action-conditioned rollout.
- Target-dependence risk. Surprise is only useful if the target representation preserves the relevant local state. If the target encoder collapses patch-level distinctions or over-correlates neighboring patches, the curriculum can confidently select the wrong windows.
- Model-level robustness. Online selection helps with an unlabeled long-tailed corpus, but a stronger solution should also make JEPA encoders robust when the corpus distribution is unknown and cannot be manually characterized.
- Private-update efficiency. In company-local distributed training, local surprise filtering should improve the utility of exported gradients or adapter deltas because the remote coordinator cannot directly curate the private corpus.
- Privacy-boundary separation. Dynamic filtering is a data-selection mechanism, not a privacy guarantee; it must be paired with update clipping, secure aggregation, differential privacy where appropriate, and gradient-inversion or membership-inference tests.
Experiments Already Done
The current evidence is internal and unpublished.
In the discussion notes, Alex described roughly two months of prior experiments on non-public blockchain or time-series data. The data was unlabeled and large enough that manually deciding which windows mattered was not realistic. The experiments used online filtering during training and tested several heuristics for keeping model surprise in a useful range.
The reported result was substantial training acceleration and, more importantly, improved interpretability or predictability of rare events. Some seemingly obvious heuristics reportedly did not work, so the publishable version should preserve negative ablations instead of presenting surprise filtering as a one-line trick. These experiments were not published because the data was not suitable for publication.
No video proof of concept is recorded here yet. The proposed next step is to test the same surprise-curriculum policy on a video JEPA or VL-JEPA setup.
Publishable Experiment Plan
The publishable version should use public datasets in both modalities.
For time series:
- use high-dimensional multivariate time series, observability telemetry, telecom-like data, or anomaly-rich industrial datasets;
- compare uniform sampling against dynamic surprise-band sampling;
- evaluate forecasting or state prediction, rare-event detection, regime classification, and downstream anomaly or incident tasks;
- track whether the representation preserves spikes, change points, missingness, cross-channel deviations, and rare failures.
For video:
- use video datasets where most frames are predictable but useful cases depend on motion boundaries, event order, rare actions, or corner cases;
- compare standard JEPA/VL-JEPA training against the dynamic sampler;
- evaluate video classification, retrieval, temporal event localization, and corner-case sensitivity;
- measure whether the model spends learning capacity on informative clips rather than repeated background dynamics.
For private distributed training:
- simulate several company-local tenants with different multivariate time-series, event-stream, or graph-time-series distributions;
- keep raw windows hidden from the coordinator and allow only gradients, adapter deltas, quantized deltas, or secure aggregates to leave the tenant boundary;
- compare local uniform sampling against local dynamic surprise-band filtering at matched exported update budget;
- evaluate global utility, tenant-local utility, rare-regime retention, normal-behavior retention, and leakage resistance;
- run membership-inference and gradient-inversion probes on exported update signals before claiming any privacy benefit.
The cross-modal research claim should be tested directly:
the same surprise-controlled curriculum improves useful-signal efficiency
for both time series and image/video trajectories.
Metrics And Ablations
Useful metrics:
- pretraining speed at matched compute;
- downstream score at matched number of tokens, windows, clips, or FLOPs;
- rare-event recall and calibration;
- performance on regime shifts and corner cases;
- embedding distribution health;
- positive-pair autocorrelation and per-regime isotropy;
- tail-slice embedding coverage and probe performance;
- normal-retention versus rare-state-preservation tradeoff curves;
- robustness to corrupted high-surprise samples;
- normal-behavior retention after hard-example upweighting;
- score staleness and refresh cadence;
- target-encoder dependence of the surprise score;
- separation between useful rare windows and unlearnable or corrupt windows;
- utility per exported gradient byte or adapter-delta parameter in private distributed training;
- leakage risk of exported update signals after local filtering.
Necessary ablations:
- uniform sampling;
- static anomaly or loss-based filtering;
- hard-example-only sampling;
- dynamic surprise-band sampling;
- dynamic sampling with and without a uniform floor;
- dynamic sampling with and without embedding-distribution regularization;
- dynamic sampling with and without per-regime buckets;
- dynamic sampling with and without autocorrelation or geometry constraints;
- deliberate tail oversampling versus capped useful-tail sampling;
- per-modality and shared curriculum policies;
- company-local uniform training versus company-local surprise-filtered training;
- raw gradients versus low-rank deltas, quantized deltas, clipped updates, and secure aggregates.
Relation To Existing Wiki Threads
This idea extends the wiki’s Latent-State Time-Series Modeling position: the model should maintain useful latent state, not merely reduce average forecast error.
It also connects to JEPA and Latent-Space Predictive Learning: the selection signal should come from representation-space prediction, not from raw reconstruction alone.
JEPA Slow Features is a warning source. A model can avoid trivial collapse while still learning the wrong factors. Dynamic curriculum should therefore be evaluated on whether it preserves decision-relevant state, not only on whether it lowers loss.
LeJEPA and LeWorldModel are useful anchors for the regularization and world-model branches. The curriculum should not fight anti-collapse regularization; it should complement it by choosing examples that expose the state distinctions the representation must preserve.
“When Does LeJEPA Learn a World Model?” adds the sampling and geometry caveat. Gaussian or whitening constraints can make a learned state linearly identifiable when the latent world and positive pairs match the assumptions. In long-tailed temporal corpora, the same constraints are only safe if the sampler preserves rare regimes, local autocorrelation, and useful state coverage rather than overselecting distorted high-surprise tails.
Distribution Priors In Self-Supervised Learning is the broader topic page for this risk: anti-collapse priors can preserve variance while still suppressing rare regimes, interventions, or long-tail system states.
Company-Local Block-Wise Fine-Tuning is the private-adaptation neighbor. Its boundary question is where private data touches the model and what update signal may leave. Dynamic curriculum adds a complementary question: which private windows should be allowed to create an update when the coordinator cannot inspect the corpus.
Relation To Foundation TSFM Agenda
This is an idea page, so the verdicts below describe the intended contribution if the proposed experiment works. Evidence status is recorded separately in the Evidence and Missing pieces columns. In other words, this idea targets the long-tail curriculum slot directly even though the publishable evidence still has to be produced.
| Agenda slot | Verdict | Evidence | Missing pieces |
|---|---|---|---|
| Data diversity, curriculum, and long tail | partially closes | If validated, surprise-band sampling would spend training compute on informative rare, hard, or regime-changing windows instead of uniform sampling over repetitive normal behavior. The same mechanism would help company-local corpora where the coordinator cannot inspect or label private data. Current evidence is internal and unpublished. | Run public time-series and video experiments under matched compute, plus private-distributed simulations with rare-state metrics, normal-retention checks, corrupt-window controls, and leakage tests. |
| Representation quality | adjacent | Proposes JEPA latent prediction surprise as the selection signal, tying the curriculum to state representation rather than raw reconstruction alone. LeJEPA Identifiability adds that selected positive pairs should preserve autocorrelation and usable latent geometry. | Add probes showing that selected windows preserve semantic state, dense numeric detail, local state geometry, and tail-slice linear usability. |
| Dynamic compute allocation | adjacent | Proposes reallocating training examples and pretraining compute toward high-value windows, including local update budget in private distributed training. | This does not yet allocate inference-time model depth, patch granularity, or serving compute. |
| Benchmark level | warning | Makes aggregate scores insufficient: the core claim is rare-state preservation under useful-signal-poor data. | Define public slices for rare events, regime changes, intervention windows, corner cases, corrupt windows, and normal-retention checks. |
Open Questions
- What is the best surprise score for JEPA: latent loss, normalized residual, energy, variance, uncertainty, disagreement, or downstream probe loss?
- Should the curriculum target a fixed surprise percentile, a moving distribution, or a scheduled difficulty curve?
- How much uniform sampling is needed to avoid forgetting normal behavior?
- How should the sampler distinguish rare useful events from corrupt data?
- How often must surprise scores be refreshed before the curriculum becomes stale?
- How should target encoders be designed so surprise reflects marginal learning value rather than patch-dependence artifacts?
- Can the same policy be shared across time series and video, or does each modality need its own normalization and target band?
- Does SIGReg-style regularization make surprise scores more stable and comparable across training?
- Should Gaussian or whitening constraints be enforced globally, per regime, or only as soft geometry checks?
- Can rare regimes be modeled as local latent charts without losing the continuity needed for world modeling?
- Which autocorrelation and isotropy diagnostics best predict whether a curriculum helps or hurts identifiability?
- Can distribution-shaping theory explain which surprise bands should be kept, or is the policy necessarily empirical?
- Can this become a general foundation-model primitive for useful-signal-efficient temporal pretraining?
- In company-local distributed training, should the local filter export only gradients or also privacy-safe aggregate curriculum statistics?
- How should the protocol prevent a malicious or compromised coordinator from inferring private examples through adaptive model updates?