Hierarchical Modeling with a Fixed FLOPs Budget

Status: draft research design note.

Motivation

Different inputs deserve different amounts of computation. This is true across modalities, and it is also true inside one modality: an empty image region, a predictable text span, a repeated metric window, a change point, and a dense burst of events should not all receive the same processing budget.

Yet most hierarchical modeling work still chooses the degree of compression manually. A paper may learn chunk boundaries, concepts, patches, or latent channel queries, but the compression ratios are usually hyperparameters, and the compression points are usually fixed levels inside the architecture. The model can learn local routing decisions, but the larger compute schedule is still mostly designed by hand.

The bitter-lesson version of the problem is simple: if the useful compression pattern depends on the data, the modality, the task, and the local information density, the model should learn where and how much to compress instead of inheriting a human-written schedule.

But that requires a target. FLOPs are a useful target because compute is already the common currency. We compare models under fixed-FLOPs or IsoFLOP-style budgets; real-time systems have bounded serving resources; and in offline systems users effectively pay for the amount of compute spent on a task.

The target here is a model that learns its hierarchy and compression schedule as part of pretraining while respecting a fixed expected FLOPs budget.

The primary target is fixed expected training FLOPs, with optional hard per-sample inference budgets. This is adaptive FLOPs under a global constraint, not a promise that every input will use identical real FLOPs or identical wall-clock latency. Individual samples, spans, channels, or modality slices may spend different compute when they carry different information density.

The deeper goal is not only efficient multimodal token compression. For time series and world models, the hierarchy should learn a state representation for prediction and control: preserve the variables, event boundaries, and latent state needed to evaluate future observations under actions, control inputs, or interventions.

Recent adjacent evidence strengthens the premise. Compute Optimal Tokenization gives a text-language scaling example where compression rate belongs inside the compute budget, and where training-optimal compression can differ from serving-cost-optimal compression. Scaling Test-Time Compute for Agentic Coding shows that long agent rollouts become more useful after they are compressed into structured summaries. Learning is Forgetting argues that LLM training itself is lossy compression toward objective-relevant information. The shared lesson is that end-to-end systems should not carry raw detail all the way through the stack by default. They should learn what to retain while making the compression policy accountable to the downstream objective and budget.

Budget Regimes

The phrase “fixed FLOPs” hides several budget granularities: fixed per sample or request, fixed per batch, fixed on average over a dataset, fixed per modality mixture, and fixed per serving endpoint.

Those granularities sit inside three related regimes:

  • Training compute budget. How many expected theoretical FLOPs the model spends on average per batch or modality mixture. This is the easiest place to use a differentiable FLOPs proxy as a training signal.
  • Inference serving budget. How many hard-routed FLOPs the model spends per sample or request. This is closer to the user-facing serving contract and may need per-endpoint quotas.
  • Latency budget. How much real time the request takes on a GPU, TPU, or other accelerator. Latency is affected by memory movement, batching, dynamic shapes, kernel fusion, and accelerator utilization, so it is not always linear in theoretical FLOPs.

So FLOPs can be the common currency without being the whole production story. Theoretical expected FLOPs are a good pretraining pressure: they make routing and compression decisions differentiable and comparable. Production deployment may need a latency-calibrated proxy that maps the same decisions to measured hardware time.

High-Level Idea

Hierarchical modeling with a fixed FLOPs budget treats token count, patch count, channel bottleneck size, concept count, and expert activation as one resource-allocation problem.

Instead of specifying:

layer 1 compresses by R1
layer 2 compresses by R2
layer 3 compresses by R3

the training setup specifies:

expected training compute budget = B

and asks the model to decide where the budget is worth spending.

This is fewer architectural heuristics, not zero heuristics: replace per-layer compression schedules with a single compute constraint, then let adaptive routing decide how the exposed compression sites are used.

The core hypothesis is:

learn adaptive compression under a global expected FLOPs budget,
then preserve fine detail only where it helps prediction, state maintenance,
dense output reconstruction, or action-conditioned rollout.

This is deliberately not the same claim as “sparse attention is U-Net-like”. Sparse attention usually changes who communicates with whom while keeping the same resolution. Hierarchical compression changes the granularity of the active representation. A model becomes U-Net-like only when it also has a return path from coarse representations back to fine resolution.

Architecture

The proposed architecture has seven separable pieces.

  1. Modality adapter. Convert raw inputs into primitive representations: bytes or subwords for text, patches or pixels for images, spatiotemporal patches for video, samples or channel-time patches for multivariate time series, and categorical tokens for event streams.

  2. Fine-level encoder. Build local representations before compression. This layer should preserve details that may later be needed for dense outputs, rare-event detection, or action-conditioned state updates.

  3. Adaptive compression modules. At candidate compression sites, a router chooses boundaries, merges, latent queries, concept tokens, channel groups, or keep/drop decisions. The mechanism can look different by domain, but it should expose a comparable expected compute cost.

  4. Global budget controller. Compression decisions compete for a shared expected-FLOPs budget. The controller should discourage each layer from solving its own local compression problem in isolation.

  5. Coarse processor. The compressed representation is processed by a Transformer, SSM, Mamba-like model, MoE, recurrent backbone, or hybrid. The important property is not the operator family; it is that expensive global or semantic mixing happens at a cheaper granularity.

  6. Return path. Dense tasks need information to flow back to the fine level. This can be implemented through upsampling, dechunking, cross-attention from fine tokens to coarse tokens, residual detail paths, skip connections, or decoder states. Without this return path, the model is a hierarchical encoder, not a U-Net-like architecture.

  7. Task heads. The same hierarchy can feed forecasting, reconstruction, dense prediction, anomaly localization, next-state prediction, or action-conditioned world-model heads. The compression policy should be evaluated against the head that actually matters.

We do not assume the model initially discovers all possible compression points. A practical first version exposes a small set of candidate compression sites, while the budget controller learns how much each site is used. This removes hand-written compression ratios before it removes the human choice of candidate levels.

Remaining Design Choices

The claim is not that the system has no heuristics. The research target is fewer architectural heuristics: replace per-layer compression schedules with one compute constraint and make the remaining choices explicit.

The remaining choices include the compression operators, candidate layers or hierarchy levels, the form of the FLOPs proxy, the gate temperature tau, the budget stabilization coefficient rho, the hard-routing rule, and the preservation probes or losses. The useful comparison is therefore against a hand-scheduled hierarchy at matched expected compute, not against an impossible zero-design-prior baseline.

Why The Return Path Matters

The return path is the difference between compression for cheap representation learning and compression for dense prediction or control.

For text, a compressed concept stream may still need byte-level recovery for exact copying, spelling, code, or robust multilingual behavior.

For image/video, a compressed scene or object stream may still need patch-level recovery for localization, OCR, motion boundaries, or dense generation.

For multivariate time series, a compressed latent state may still need sample-level and channel-level recovery for spikes, change points, missingness, rare failures, and channel-specific deviations.

For action-conditioned world models, the preservation target is not pixel-perfect or sample-perfect reconstruction. The target is to preserve the state variables and event boundaries needed to evaluate future observations under actions, control inputs, or interventions.

Cross-Modal Principle

The shared rule should be:

compress by local information density and downstream value,
not by modality name.

That does not mean all modalities should share the same adapter or the same router. It means the architecture should avoid rules like “images always compress by 4x” or “time series always compress early”. The model should learn whether a region, span, channel group, event window, or trajectory segment deserves more compute.

For multivariate time series, time compression and channel compression must be separated. A temporal patcher can preserve event boundaries while losing channel identity. A channel bottleneck can preserve global trends while erasing local deviations. A useful model may need one router for temporal structure, another for channel or topology structure, and a shared budget controller above both.
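One way to picture a shared controller above two routers is sketched below. It assumes, beyond what this note specifies, that the active grid is the outer product of kept time patches and kept channel groups, and that the two routers gate independently. All function names are hypothetical.

```python
def expected_count(keep_probs):
    """Expected number of kept units for one router (sum of keep probabilities)."""
    return sum(keep_probs)

def joint_expected_cells(time_probs, channel_probs):
    """Expected active (time patch, channel group) cells under independent gating."""
    return expected_count(time_probs) * expected_count(channel_probs)

def marginal_prices(time_probs, channel_probs):
    """Sensitivity of the joint cell count to one more kept unit on each axis."""
    t_tilde = expected_count(time_probs)
    g_tilde = expected_count(channel_probs)
    # d(T*G)/dT = G and d(T*G)/dG = T: each axis is priced by the other's usage.
    return {"per_time_patch": g_tilde, "per_channel_group": t_tilde}

time_probs = [0.9, 0.1, 0.8, 0.2]   # hypothetical temporal router outputs
channel_probs = [1.0, 0.5]          # hypothetical channel router outputs
print(joint_expected_cells(time_probs, channel_probs))
print(marginal_prices(time_probs, channel_probs))
```

Under this assumption, keeping one more time patch costs more when many channel groups are active, and the reverse. That coupling is why a single controller sees something that two per-axis budgets would miss.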

Adaptive Patching And JEPA Targets

Dynamic patching is not automatically compatible with predictive latent targets. Next-Embedding Prediction records an internal NEPA-style time-series failure mode where patch-independent target embeddings were stable, while more context-dependent patching or internal-layer prediction degraded the objective. This is not direct pure-JEPA evidence. The lesson is not to avoid adaptive patching; it is to make target construction part of the ablation plan for both NEPA-style and JEPA-style systems.

For this fixed-FLOPs hierarchy idea, that means every adaptive compression site should be tested against preservation probes. A router that saves compute by correlating neighboring patches may lower the average objective while making surprise scores, rare-event selection, or dense numeric recovery worse. The first implementation should therefore keep an explicit patch-independent baseline and compare it with dynamic patching under the same expected compute.

Relation To The Referenced Work

The proposal is meant to compose ideas from the current corpus rather than replace them.

  • H-Net is the sequence-hierarchy reference: dynamic chunking, compression, processing, and reconstruction live inside the model.
  • ConceptMoE is the concept-compression and compute-allocation reference: token spans can be merged before expensive expert computation, and comparisons can be made under activated-FLOPs constraints.
  • Compute Optimal Tokenization is the compression-aware scaling reference: token granularity changes the scaling unit, so comparisons should name the information-density unit and serving-cost tradeoff rather than only token count.
  • Synergy is the abstraction-routing reference: the model learns a bridge from byte-level processing to higher-level concept processing.
  • ReinPatch is the time-series boundary-selection reference: continuous time-series patching should be optimized for the downstream forecasting objective, not copied directly from language chunking.
  • U-Cast is the high-dimensional multivariate reference: the channel axis can require its own hierarchy, bottleneck, and upsampling path.
  • Beyond Language Modeling is the multimodal compute-allocation reference: modality mixtures can have asymmetric compute and specialization needs.
  • Scaling Test-Time Compute for Agentic Coding is the agentic-runtime compression reference: structured summaries can be a better substrate than raw trajectories for selection and reuse.
  • Learning is Forgetting is the training-dynamics compression reference: representations improve when they preserve target-relevant information and discard irrelevant input detail.
  • FADE is the controlled-forgetting reference: memory horizons can be adapted instead of fixed globally.

The synthesis claim is narrower than “fixed FLOPs are new.” Fixed-FLOPs constrained optimization and conditional computation are not new by themselves. The useful composition is dynamic hierarchical compression plus a global expected-FLOPs constraint plus cross-modal information-density routing plus world-model and time-series preservation.

In that composition, fixed expected FLOPs act as the common pressure that makes compression decisions comparable across layers, modalities, and axes.

Recurrent Depth And Energy Loops

Recurrent Transformer depth is adjacent to this idea because it spends extra computation at test time without adding a proportional number of unique weights. The useful deployment story is real: bounded-memory inference, early exit when representations converge, and loop count as a possible uncertainty or failure signal. LoopFormer makes the loop budget explicit through elastic depth and shortcut consistency, while Parcae adds stability and saturation evidence for looped scaling. Hyperloop Transformers adds the parameter-memory variant: loop-level hyper-connections can improve a looped model’s parameter-performance frontier, while the early-exit evidence remains suggestive until tested under fixed expected-FLOPs and latency budgets. But as research evidence this branch still needs a compute constraint and a strong baseline, because a wider or deeper non-recurrent model with unique layers may solve the same task better when memory or latency is not binding.

Energy-Based Transformers add another adjacent path: score candidate predictions with an energy function and spend optimization steps where the candidate is hard. Extra recurrence inside that loop is not automatically useful, because the energy-gradient inference procedure is already iterative. The first question should be whether the energy landscape, candidate generation, or stopping rule is the bottleneck.

What Prevents Collapse Into Over-Compression?

A global budget by itself can create a bad incentive: erase fine detail until the average objective looks cheap enough. Probes can catch some of this after the fact, but the training objective should also include explicit retaining forces:

L = L_predictive/world-model
  + L_preservation
  + L_budget
  + L_anti-collapse

L_predictive/world-model is the main predictive objective: next-token prediction, forecasting, latent prediction, next-state prediction, or action-conditioned rollout. L_preservation keeps information that should survive compression even when it is rare, local, or only visible at fine resolution. L_budget prices compute. L_anti-collapse prevents degenerate representations or routing policies that satisfy the budget by making many states indistinguishable.

The right retaining force depends on the domain.

  • JEPA-like world models. Latent prediction should be paired with SIGReg, variance regularization, Gaussian regularization, or a similar anti-collapse term. LeJEPA and LeWorldModel are the local references for this branch, while JEPA Slow Features warns that a representation can avoid constant collapse and still encode the wrong factors.
  • Time series. Preservation should explicitly protect rare events, change points, missingness patterns, regime transitions, cross-channel deviations, and other state variables that average forecasting loss can underweight.
  • Dense outputs. Reconstruction, dechunking, upsampling, or fine-level decoder losses make the return path accountable for details that the compressed stream alone would otherwise erase.

The design goal is not to prevent compression. It is to make over-compression pay when it destroys decision-relevant state.
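As one concrete instance of the "variance regularization" option, the sketch below implements a VICReg-style variance hinge in plain Python. The threshold gamma, the stabilizer eps, and the function name are illustrative assumptions, and this is not presented as the anti-collapse term the final system would use.

```python
import math

def variance_hinge(embeddings, gamma=1.0, eps=1e-4):
    """Mean over dimensions of max(0, gamma - std_j), std taken across the batch."""
    n = len(embeddings)          # batch size, must be at least 2
    dims = len(embeddings[0])
    total = 0.0
    for j in range(dims):
        column = [row[j] for row in embeddings]
        mean = sum(column) / n
        var = sum((x - mean) ** 2 for x in column) / (n - 1)  # unbiased estimate
        std = math.sqrt(var + eps)
        total += max(0.0, gamma - std)
    return total / dims

collapsed = [[1.0, 2.0], [1.0, 2.0], [1.0, 2.0]]   # every state maps to one point
spread = [[-2.0, 0.0], [0.0, 3.0], [2.0, -3.0]]    # each dimension varies
print(variance_hinge(collapsed))   # large penalty
print(variance_hinge(spread))      # no penalty
```

A term like this only penalizes dimensions that stop varying. It does not check that the surviving variance encodes the right factors, which is the failure the JEPA Slow Features reference warns about.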

FADE adds the continual-learning analogue: forgetting is useful when it selectively clears stale information, but harmful when one fixed decay rule erases stable knowledge. For budgeted hierarchy, this suggests that compression should also have different memory horizons: some fine detail can be dropped quickly, while rare safety signals, action effects, and stable dynamics should be retained.

Evaluation

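Average pretraining loss is not enough. Compression failures often show up at a scale the average hides: a single spike, one misspelled token, or a small object can be erased without moving the mean loss.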

Useful probes include:

  • Text: byte and character robustness, code, multilingual spans, exact copying, and long-context retrieval.
  • Images: localization, OCR, small objects, chart/table reading, dense prediction, and generation fidelity.
  • Video: motion boundaries, event order, delayed effects, and action-conditioned future frames.
  • Time series: spikes, change points, rare events, regime shifts, missingness, scale changes, and cross-channel dependencies.
  • World-model settings: next-state prediction, action consequence prediction, counterfactual prediction when causal assumptions are valid, uncertainty over plausible futures, and recovery from partial observations.

The most important evaluation question is whether the router learns to spend compute on decision-relevant information, not whether it merely reduces average token count.

For learned compression, the budget controller should report target-relevant information preserved per expected FLOP, then expose which rare events, dense numeric details, or action variables are lost. Otherwise a model can look efficient while silently moving the cost into downstream decision errors.

Relation To Foundation TSFM Agenda

This is an idea page, so the verdicts below describe the intended contribution if the proposed system or experiment works. Evidence status is recorded separately in the Evidence and Missing pieces columns.

Agenda slot | Verdict | Evidence | Missing pieces
Dynamic compute allocation | partially closes | If validated, adaptive compression and routing under a fixed expected training FLOPs budget would make compute allocation a learned part of pretraining instead of a hand-set schedule. | Train a time-series system that learns routing policies and reports matched-compute results.
Patch size, dynamic tokenization, and point-wise numeric embeddings | partially closes | If validated, the idea would make temporal compression and fine-detail return paths adaptive rather than fixed by one patch size. | Define concrete tokenization operators that preserve spikes, change points, missingness, and local causal events.
Native multivariate encoding and high-channel scaling | partially closes | If validated, the idea would make channel hierarchy and topology-aware bottlenecks separate learned decisions from temporal compression. | Demonstrate cross-channel interaction retention at hundreds or thousands of channels.
Representation quality | warning | Identifies over-compression as a likely failure mode: average objectives can erase rare or decision-relevant detail even if compute is saved. | Add preservation probes for semantic state, dense numeric detail, and action-conditioned rollout.
Continual adaptation | adjacent | FADE suggests forgetting can be selective and learned rather than a uniform decay rule. | Show adaptive memory horizons for streaming time-series or world-model state, not only final-layer classifiers.
Control and counterfactuals | adjacent | Proposes action-conditioned world-model heads as one target for deciding what detail must survive compression. | Evaluate on a control dataset with logged action channels and counterfactual rollout targets.

Open Questions

  • Can one budget controller coordinate temporal compression, channel compression, concept compression, and expert routing?
  • After the initial fixed expected training FLOPs target, which hard budget contracts matter most: per sample, per batch, dataset average, modality mixture, serving endpoint, or latency target?
  • Can candidate compression sites themselves be learned, or is exposing a small human-chosen set the right first system?
  • How should the model expose uncertainty when compression removes fine detail?
  • Should adaptive compression routers and adaptive parameter-decay policies be coupled, or evaluated separately, given that FADE changes slow-weight memory horizons rather than per-sample compute?
  • What probe suite catches rare-event erasure early enough during pretraining?
  • Can fixed-FLOPs hierarchy become a foundation-model primitive for action-conditioned time-series world models rather than another preprocessing trick?
  • When adaptive patching is used with JEPA-style targets, which target construction avoids patch-dependence shortcuts?
  • Can recurrent loop convergence provide calibrated uncertainty or early-exit signals under a fixed expected compute budget?

Technical Details

Differentiable FLOPs Proxy

Real inference FLOPs depend on hard routing decisions. During training, the model needs a differentiable proxy.

For a keep or merge router:

p_i = sigmoid(z_i / tau)
K_tilde = sum_i p_i

For a boundary router:

K_tilde = 1 + sum_i P(boundary after token i)
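The two expected-count formulas can be checked numerically with a minimal sketch. The function names are hypothetical, and the boundary sum is assumed to run over the gaps between tokens, so a sequence of N tokens has N - 1 boundary probabilities.

```python
import math

def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def expected_kept(logits, tau=1.0):
    """K_tilde = sum_i sigmoid(z_i / tau) for a keep or merge router."""
    return sum(sigmoid(z / tau) for z in logits)

def expected_chunks(boundary_probs):
    """K_tilde = 1 + sum_i P(boundary after token i) for a boundary router."""
    return 1.0 + sum(boundary_probs)

logits = [2.0, -2.0, 0.0, 4.0]          # hypothetical router logits z_i
print(expected_kept(logits, tau=1.0))   # soft count at a warm temperature
print(expected_kept(logits, tau=0.1))   # gates sharpen as tau shrinks
print(expected_chunks([0.9, 0.1, 0.5])) # four tokens, three candidate boundaries
```

Lower tau pushes each p_i toward 0 or 1, so K_tilde approaches the count a hard router would produce, except for logits that sit exactly at zero.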

For a Transformer-like layer, a simple expected-cost estimate is:

C_l ~= a_l K_tilde_l d_l^2
     + b_l K_tilde_l^2 d_l
     + c_l K_tilde_l d_l d_ff

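Here a_l, b_l, and c_l are operator-dependent constants for the projection, attention, and feed-forward terms, d_l is the model width, and d_ff is the feed-forward width. For local attention, replace the quadratic attention term with the local-window cost. For MoE, count activated experts rather than total parameters. For channel hierarchies, account separately for temporal tokens and channel groups.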
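The cost estimate can be written as a small function. The default constants are an assumption, not values from this note: they count 2 FLOPs per multiply-add, four d x d projections (a = 8), the score and value matmuls of dense attention (b = 4), and a two-matrix feed-forward block (c = 4). The `window` and `active_experts` arguments are hypothetical hooks for the local-attention and MoE variants.

```python
def layer_cost(k, d, d_ff, a=8.0, b=4.0, c=4.0, window=None, active_experts=1):
    """Expected FLOPs for one layer: a*K*d^2 + b*K*K_attended*d + c*K*d*d_ff."""
    attended = k if window is None else min(window, k)   # local-window variant
    projections = a * k * d * d
    attention = b * k * attended * d
    feed_forward = c * k * d * d_ff * active_experts     # count activated experts only
    return projections + attention + feed_forward

def stack_cost(expected_tokens_per_layer, d, d_ff):
    """Total expected cost when each layer sees its own K_tilde_l."""
    return sum(layer_cost(k, d, d_ff) for k in expected_tokens_per_layer)

full = stack_cost([1024, 1024, 1024], d=512, d_ff=2048)
compressed = stack_cost([1024, 256.0, 256.0], d=512, d_ff=2048)  # fractional K_tilde is fine
print(compressed / full)
```

Because K_tilde enters the attention term quadratically, a 4x reduction in expected tokens cuts that term by 16x while the projection and feed-forward terms fall only 4x.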

Budget Objective

One operational version of L_budget is the equality-budget objective:

g = C_tilde / B - 1
 
L_budget = lambda * g
         + (rho / 2) * g^2

C_tilde is the total expected cost, for example the sum of the per-layer estimates C_l. B is the target FLOPs budget. lambda is a dual variable updated during training. rho is a quadratic stabilization coefficient. The linear term prices budget violation; the quadratic term reduces oscillation around the target.

For the primary research target, C_tilde can be averaged over the batch or modality mixture, making the constraint fixed in expectation rather than fixed for every sample. If the serving contract is “do not exceed this budget” rather than “use this budget”, replace g with max(0, g) and apply it per sample, request, or endpoint as needed.
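A minimal sketch of the budget term and a multiplier update follows. The note does not specify how lambda is updated; plain dual ascent with step size `eta` is assumed here, and in the ceiling ("do not exceed") case the multiplier is additionally assumed to be projected onto non-negative values. Function and argument names are hypothetical.

```python
def budget_violation(c_tilde, budget):
    """g = C_tilde / B - 1: positive when over budget, negative when under."""
    return c_tilde / budget - 1.0

def budget_loss(g, lam, rho, ceiling=False):
    """L_budget = lambda * g + (rho / 2) * g^2, with g clipped at 0 for a ceiling."""
    if ceiling:
        g = max(0.0, g)
    return lam * g + 0.5 * rho * g * g

def dual_step(lam, g, eta, ceiling=False):
    """Assumed dual ascent: raise the price when over budget, lower it when under."""
    lam = lam + eta * g
    return max(0.0, lam) if ceiling else lam

g = budget_violation(c_tilde=1.2e9, budget=1.0e9)   # 20 percent over budget
print(budget_loss(g, lam=0.0, rho=1.0))             # only the quadratic term is active
print(dual_step(lam=0.0, g=g, eta=0.5))             # the price of compute rises
print(budget_loss(-0.2, lam=0.3, rho=1.0, ceiling=True))  # under a ceiling, slack is free
```

In the equality form, spending below B is also penalized, which pushes the router to use the compute it has been given. In the ceiling form, unused budget costs nothing.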

The full loss should then compose the budget term with the task and retention terms:

L = L_predictive/world-model
  + alpha * L_preservation
  + beta * L_anti-collapse
  + L_budget

The differentiable proxy usually targets training compute first. For deployment, the same controller may need a second calibrated cost model that estimates hard-routed serving FLOPs or measured latency.

Hard Routing

Training can use soft gates, straight-through estimators, Gumbel-style relaxations, or soft top-k. Inference should use hard decisions: top-k, boundary selection, latent-query selection, or quota-based routing.

The practical risk is train/inference mismatch. Late-stage training should therefore anneal toward harder decisions, or periodically evaluate the exact hard-routing compute path.
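The mismatch can be monitored with a small sketch: compare the soft expected count against the hard count while a temperature schedule anneals. The exponential schedule, its endpoints, and the function names are assumptions for illustration. Gradient tricks such as straight-through estimators need an autodiff framework and are not shown.

```python
import math

def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def hard_top_k(scores, k):
    """Quota routing: 0/1 mask keeping the k highest scores (ties go to lower index)."""
    order = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    keep = set(order[:k])
    return [1 if i in keep else 0 for i in range(len(scores))]

def hard_threshold(logits):
    """Threshold routing: keep a unit when its logit is positive."""
    return [1 if z > 0 else 0 for z in logits]

def annealed_tau(step, total_steps, tau_start=1.0, tau_end=0.05):
    """Assumed schedule: exponential interpolation from tau_start to tau_end."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return tau_start * (tau_end / tau_start) ** frac

def count_gap(logits, tau):
    """Absolute gap between the soft expected count and the hard-routed count."""
    soft = sum(sigmoid(z / tau) for z in logits)
    return abs(soft - sum(hard_threshold(logits)))

logits = [1.5, -0.5, 0.2, -2.0]   # hypothetical router logits
for step in (0, 50, 100):
    tau = annealed_tau(step, 100)
    print(step, round(tau, 4), round(count_gap(logits, tau), 4))
print(hard_top_k(logits, k=2))
```

If the gap stays large late in training, the FLOPs proxy is pricing a compute path that inference will not actually take.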

Flow-Of-Ranks Note

FlowRanks may be useful as an optional diagnostic: rank spectra can suggest where representations look redundant, or help initialize router biases. It should not be treated as the core method. Rank can miss rare but decision-relevant events, and the main compression policy should be learned from the budgeted objective plus task and preservation losses.
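As an illustration of a rank-spectrum diagnostic, the sketch below computes an entropy-based effective rank from a list of singular values. This particular measure is an assumption chosen for simplicity and is not claimed to be the one FlowRanks uses. The singular values are assumed to come from an SVD computed elsewhere.

```python
import math

def effective_rank(singular_values):
    """exp(entropy) of the normalized singular-value distribution."""
    total = sum(singular_values)
    if total <= 0:
        return 0.0
    entropy = 0.0
    for s in singular_values:
        p = s / total
        if p > 0:               # zero singular values contribute nothing
            entropy -= p * math.log(p)
    return math.exp(entropy)

def redundancy(singular_values):
    """Fraction of nominal dimensions not used, by the effective-rank measure."""
    return 1.0 - effective_rank(singular_values) / len(singular_values)

flat = [1.0, 1.0, 1.0, 1.0]      # hypothetical spectrum using every direction
peaked = [5.0, 0.0, 0.0, 0.0]    # hypothetical spectrum dominated by one direction
print(effective_rank(flat), redundancy(flat))
print(effective_rank(peaked), redundancy(peaked))
```

A rare event that occupies a low-energy direction barely moves this number, which is the reason rank stays a diagnostic rather than the compression policy.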

If this direction resonates with you, I would be happy to talk with like-minded people, collaborate on research, and work on use-cases together.

Ideas are not the bottleneck. Hands are. Time-series modeling should be moving at least as fast as vision, audio, and robotics.
