Hierarchical Modeling with a Fixed FLOPs Budget

Status: draft research design note.

Motivation

Different inputs deserve different amounts of computation. This is true across modalities, and it is also true inside one modality: an empty image region, a predictable text span, a repeated metric window, a change point, and a dense burst of events should not all receive the same processing budget.

Yet most hierarchical modeling work still chooses the degree of compression manually. A paper may learn chunk boundaries, concepts, patches, or latent channel queries, but the compression ratios are usually hyperparameters, and the compression points are usually fixed levels inside the architecture. The model can learn local routing decisions, but the larger compute schedule is still mostly designed by hand.

The bitter-lesson version of the problem is simple: if the useful compression pattern depends on the data, the modality, the task, and the local information density, the model should learn where and how much to compress instead of inheriting a human-written schedule.

But that requires a target. FLOPs are a useful target because compute is already the common currency. We compare models under fixed-FLOPs or IsoFLOP-style budgets; real-time systems have bounded serving resources; and in offline systems users effectively pay for the amount of compute spent on a task.

The target here is a model that learns its hierarchy and compression schedule as part of pretraining while respecting a fixed expected FLOPs budget.

The primary target is fixed expected training FLOPs, with optional hard per-sample inference budgets. This is adaptive FLOPs under a global constraint, not a promise that every input will use identical real FLOPs or identical wall-clock latency. Individual samples, spans, channels, or modality slices may spend different compute when they carry different information density.

The deeper goal is not only efficient multimodal token compression. For time series and world models, the hierarchy should learn a state representation for prediction and control: preserve the variables, event boundaries, and latent state needed to evaluate future observations under actions, control inputs, or interventions.

Recent adjacent evidence strengthens the premise. Compute Optimal Tokenization gives a text-language scaling example where compression rate belongs inside the compute budget, and where training-optimal compression can differ from serving-cost-optimal compression. Scaling Laws, Carefully adds the experiment-design discipline: fixed compute is a constrained optimization problem over model size, data, and any extra allocation variable, and the claimed optimum can move when the data regime, fit region, parameter counting, optimizer, tokenization, or loss precision changes. Scaling Test-Time Compute for Agentic Coding shows that long agent rollouts become more useful after they are compressed into structured summaries. Learning is Forgetting argues that LLM training itself is lossy compression toward objective-relevant information. Variable-Width Transformers adds a direct architecture baseline: even a static nonuniform layer-width schedule can improve a language model under matched parameters while reducing average layer width, fitted FLOPs, and KV-cache cost. Oryx, Hybrid Associative Memories, and HOLA add three long-context allocation examples: choose mixer mode across sequence spans, choose which tokens enter a data-dependent KV scratchpad, or spend a fixed exact-memory budget on the largest recurrent-state updates. The shared lesson is that end-to-end systems should not carry raw detail all the way through the stack by default. They should learn what to retain while making the compression policy accountable to the downstream objective, budget, and scaling-law fit assumptions.

Budget Regimes

The phrase “fixed FLOPs” hides several budget granularities: fixed per sample or request, fixed per batch, fixed on average over a dataset, fixed per modality mixture, and fixed per serving endpoint.

Those granularities sit inside three related regimes:

Training compute budget. How many expected theoretical FLOPs the model spends on average per batch or modality mixture. This is the easiest place to use a differentiable FLOPs proxy as a training signal.
Inference serving budget. How many hard-routed FLOPs the model spends per sample or request. This is closer to the user-facing serving contract and may need per-endpoint quotas.
Latency budget. How much real time the request takes on a GPU, TPU, or other accelerator. Latency is affected by memory movement, batching, dynamic shapes, kernel fusion, and accelerator utilization, so it is not always linear in theoretical FLOPs.

So FLOPs can be the common currency without being the whole production story. Theoretical expected FLOPs are a good pretraining pressure: they make routing and compression decisions differentiable and comparable. Production deployment may need a latency-calibrated proxy that maps the same decisions to measured hardware time.
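One way to picture a latency-calibrated proxy is a plain linear fit from theoretical FLOPs to measured time. The sketch below is illustrative only: the function name `fit_latency_proxy`, the linear form, and the sample numbers are assumptions, not measurements or a method taken from this note.

```python
def fit_latency_proxy(flops, latency_ms):
    """Least-squares fit of latency_ms ~ intercept + slope * flops."""
    n = len(flops)
    mean_x = sum(flops) / n
    mean_y = sum(latency_ms) / n
    sxx = sum((x - mean_x) ** 2 for x in flops)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(flops, latency_ms))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical data: theoretical GFLOPs per request vs. measured milliseconds.
gflops = [1.0, 2.0, 3.0, 4.0]
measured_ms = [5.0, 7.0, 9.0, 11.0]
intercept, slope = fit_latency_proxy(gflops, measured_ms)


def predicted_latency_ms(request_gflops):
    """Map a routing decision's theoretical cost to an estimated latency."""
    return intercept + slope * request_gflops
```

The intercept stands in for fixed overhead that does not scale with FLOPs. Because memory movement, batching, and dynamic shapes make latency nonlinear, a real calibration would probably need per-decision features rather than one scalar.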

Scaling-Law Calibration

Scaling Laws, Carefully changes the experimental contract for this idea. The slogan should not be “train under a fixed budget and report a win.” It should be:

fit and compare allocation frontiers under matched compute,
then show that learned hierarchy shifts the optimum without hiding costs.

The dense-language-model baseline is $C \approx 6 N D$, where $N$ is the parameter count and $D$ is the number of training tokens, but a budgeted hierarchy changes both sides of the accounting. The data term is no longer just token count or sample count; it may be active channel-time cells, event windows, compressed latent tokens, retained history chunks, or action-conditioned transitions. The model term is no longer just parameter count; it may include active width, expert count, recurrent loop count, return-path decoding, and routing overhead. A credible first paper should therefore define an explicit cost model such as:

C_tilde = C_adapter
        + C_fine_encoder
        + C_router
        + C_active_coarse_processor
        + C_return_path
        + C_heads

and then calibrate it against real wall-clock latency and memory traffic for the hard-routed implementation.
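The cost model above can be kept as an explicit ledger so that no component is silently dropped from the comparison. This is a minimal sketch: the class name `CostBreakdown`, the field units, and the example numbers are hypothetical.

```python
from dataclasses import dataclass, fields


@dataclass
class CostBreakdown:
    """Per-sample cost components, all in the same unit (for example GFLOPs)."""
    adapter: float
    fine_encoder: float
    router: float
    active_coarse_processor: float
    return_path: float
    heads: float

    def total(self) -> float:
        """Sum of all components, i.e. the C_tilde of the cost model."""
        return sum(getattr(self, f.name) for f in fields(self))

    def shares(self) -> dict:
        """Fraction of the total spent in each component."""
        total = self.total()
        return {f.name: getattr(self, f.name) / total for f in fields(self)}


# Hypothetical numbers, only to show how routing overhead stays visible.
example = CostBreakdown(
    adapter=0.5,
    fine_encoder=2.0,
    router=0.25,
    active_coarse_processor=4.0,
    return_path=1.0,
    heads=0.25,
)
```

Reporting `shares()` next to the total makes it harder to hide router or return-path cost behind a favorable coarse-processor number.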

Scaling Laws, Carefully also sharpens the data-regime requirement. Useful-signal-poor time-series corpora can have enormous repeated normal-state volume but much less unique decision-relevant data. If repeated normal windows are counted like fresh informative windows, the fitted optimum can prefer the wrong model size or compression rate. A fixed-FLOPs hierarchy should therefore track both raw data volume and an effective-data proxy: unique regimes, rare structured windows, event types, intervention windows, or probe-improving examples.
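As a rough illustration of tracking raw volume and an effective-data proxy side by side, the sketch below buckets windows by a coarse signature and counts distinct buckets. The signature (quantized mean and range), the `step` parameter, and the toy windows are assumptions made for illustration; a real proxy would need a domain-appropriate regime definition.

```python
from collections import Counter


def regime_signature(window, step=1.0):
    """Coarse signature of one window: quantized mean and quantized range."""
    mean = sum(window) / len(window)
    spread = max(window) - min(window)
    return (round(mean / step), round(spread / step))


def effective_data(windows, step=1.0):
    """Report raw window count next to a unique-regime proxy."""
    counts = Counter(regime_signature(w, step) for w in windows)
    return {
        "raw_windows": len(windows),
        "unique_regimes": len(counts),
        "largest_regime_share": max(counts.values()) / len(windows),
    }


# Toy corpus: eight repeated normal windows, one spike, one shifted regime.
windows = [[1.0, 1.1, 0.9]] * 8 + [[1.0, 5.0, 1.2], [9.0, 9.1, 9.2]]
stats = effective_data(windows)
```

Here ten raw windows collapse to three regimes, with one regime holding most of the volume. That gap is what a scaling fit based on raw counts would miss.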

The minimum scaling-law ablation is no longer one matched-FLOPs point. It is a small grid:

Axis	Why it matters
Model size or width	Separates compression gains from ordinary capacity scaling.
Data or effective-data amount	Tests whether the router helps in data-rich vs useful-signal-poor regimes.
Budget $B$	Checks whether the learned allocation is robust across IsoFLOP curves.
Compression or active-token target	Tests whether hierarchy is the variable moving the frontier.
Preservation-loss weight	Exposes the cost of retaining rare events and dense numeric detail.

The reported result should include fit-region and loss-noise sensitivity. If the learned router only wins under one chosen budget, one loss rounding, one data mix, or one target construction, the scaling-law evidence is too brittle.
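The grid above can be enumerated mechanically so that every axis is crossed rather than tuned one at a time. The axis names and values below are placeholders chosen for illustration, not recommended settings.

```python
from itertools import product

# Hypothetical axis values; one entry per row of the ablation table.
grid_axes = {
    "width": [256, 512],
    "effective_data_fraction": [0.25, 1.0],
    "budget_gflops": [1.0, 2.0, 4.0],
    "active_token_target": [0.25, 0.5],
    "preservation_weight": [0.0, 0.1],
}


def expand_grid(axes):
    """Return one run configuration per point of the full cross product."""
    names = list(axes)
    return [
        dict(zip(names, values))
        for values in product(*(axes[name] for name in names))
    ]


runs = expand_grid(grid_axes)
```

Even this small grid yields 48 runs, so a first study may need a fractional design. The point is only that budget, compression target, and preservation weight vary jointly.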

Concrete Experiment Target

The first strong experiment should use a useful-signal-poor multivariate time-series corpus, such as observability telemetry, industrial process measurements, or another setting where repeated normal-state windows dominate while rare regimes, change points, event windows, and action or control-input windows are sparse. The point is not to win one forecasting leaderboard. The point is to test whether a budgeted hierarchy learns to spend compute on decision-relevant state under matched expected FLOPs.

The experimental claim should be stated as a frontier claim:

$$(N^{*}, D^{*}, A^{*}) = \arg\min_{\tilde{C}(N, D, A) = B} \left[ L_{\text{predictive/world-model}} + \alpha L_{\text{preservation}} \right],$$

where $A$ is the new allocation variable exposed by the hierarchy: expected active channel-time cells, retained event windows, latent queries, active width, expert activations, recurrent loop steps, or another measurable active-representation count. A learned hierarchy is interesting only if changing $A$ moves the matched-budget frontier compared with static schedules.
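On a finite set of completed runs, the frontier claim reduces to a constrained selection: keep the runs whose measured cost sits at the budget, then minimize the weighted objective. The sketch below assumes each run is a dict with hypothetical keys `cost`, `pred_loss`, and `pres_loss`, and uses a relative tolerance because real runs rarely land exactly on $B$.

```python
def best_allocation(runs, budget, alpha, rel_tol=0.05):
    """Pick the run minimizing pred_loss + alpha * pres_loss at cost ~ budget."""
    feasible = [r for r in runs if abs(r["cost"] - budget) <= rel_tol * budget]
    if not feasible:
        return None
    return min(feasible, key=lambda r: r["pred_loss"] + alpha * r["pres_loss"])


# Hypothetical results; cost is in the same unit as the budget.
runs = [
    {"N": 1e7, "D": 1e9, "A": 0.25, "cost": 1.00, "pred_loss": 0.50, "pres_loss": 0.40},
    {"N": 1e7, "D": 1e9, "A": 0.50, "cost": 1.02, "pred_loss": 0.52, "pres_loss": 0.10},
    {"N": 2e7, "D": 5e8, "A": 0.25, "cost": 0.99, "pred_loss": 0.48, "pres_loss": 0.60},
    {"N": 2e7, "D": 1e9, "A": 0.50, "cost": 2.00, "pred_loss": 0.40, "pres_loss": 0.05},
]

winner_without_preservation = best_allocation(runs, budget=1.0, alpha=0.0)
winner_with_preservation = best_allocation(runs, budget=1.0, alpha=1.0)
```

In this toy data the selected allocation changes once the preservation term is weighted. That sensitivity is what the frontier report should expose.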

A minimal publishable protocol should report these artifacts:

Artifact	Required content
Budget frontier	Curves across at least several $B$ values, not a single matched-FLOPs point.
Effective-data accounting	Raw windows plus a proxy for unique regimes, rare structured windows, event types, action/control-input windows, or probe-improving windows.
Allocation diagnostics	How active compute is distributed over normal windows, anomalies, change points, dense numeric spans, channel groups, event streams, and action-conditioned transitions.
Preservation probes	Rare-regime retention, dense numeric reconstruction, cross-channel binding, event timing, and action-conditioned next-state dynamics.
Hardware audit	Correlation between $\tilde{C}$, hard-routed FLOPs, wall-clock latency, memory traffic, batching behavior, and dynamic-shape or kernel overhead.

The baseline suite should include a uniform model, a hand-scheduled hierarchy, a static bowtie or variable-width baseline, sparse attention or MoE where appropriate, a no-preservation-loss budgeted router, and the full budgeted router with preservation losses. An oracle or heuristic router can be useful as an upper-bound diagnostic, but it should not replace learned allocation.
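A heuristic router for the diagnostic role mentioned above can be very simple. The sketch below starts a new patch wherever the step change exceeds a threshold; the rule, the function names, and the threshold are illustrative assumptions, not a proposed method.

```python
def heuristic_boundaries(series, threshold):
    """Indices where a new patch starts because the step change is large."""
    return [
        i for i in range(1, len(series))
        if abs(series[i] - series[i - 1]) > threshold
    ]


def patches_from_boundaries(series, boundaries):
    """Split the series into contiguous patches at the given start indices."""
    edges = [0] + boundaries + [len(series)]
    return [series[a:b] for a, b in zip(edges, edges[1:])]


# Toy series: a long normal span, a short burst, then normal again.
series = [1.0, 1.0, 1.0, 1.0, 5.0, 5.0, 1.0, 1.0, 1.0]
boundaries = heuristic_boundaries(series, threshold=2.0)
patches = patches_from_boundaries(series, boundaries)
```

The normal spans become large patches and the burst gets its own short patch, which gives the learned router a hand-built reference point at matched compute.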

High-Level Idea

Hierarchical modeling with a fixed FLOPs budget treats token count, patch count, channel bottleneck size, concept count, and expert activation as one resource-allocation problem.

The simple visual thesis is:

not all windows deserve equal compute

A long region of repeated normal state should be cheap. A rare regime, change point, event boundary, missingness pattern, action window, or control-input response may deserve more active representation even when the global budget is fixed.

Instead of specifying:

layer 1 compresses by R1
layer 2 compresses by R2
layer 3 compresses by R3

the training setup specifies:

expected training compute budget = B

and asks the model to decide where the budget is worth spending.

This means fewer architectural heuristics, not zero heuristics: replace per-layer compression schedules with a single compute constraint, then let adaptive routing decide how the exposed compression sites are used.
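One possible way to turn the single compute constraint into a differentiable quantity is to price each keep decision by its probability. The sketch below assumes sigmoid keep-gates and a fixed cost per kept unit at each compression site; the names and the cost structure are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def expected_flops(site_logits, site_costs, base_cost):
    """Expected cost: fixed base cost plus keep-probability-weighted site costs.

    site_logits[i] holds one keep-logit per unit at compression site i.
    site_costs[i] is the cost of processing one kept unit at that site.
    """
    total = base_cost
    for logits, cost_per_unit in zip(site_logits, site_costs):
        total += cost_per_unit * sum(sigmoid(z) for z in logits)
    return total


# Two sites: two units at cost 2.0 each, one unit at cost 4.0.
cost = expected_flops(
    site_logits=[[0.0, 0.0], [0.0]],
    site_costs=[2.0, 4.0],
    base_cost=1.0,
)
```

Because every site contributes to one scalar, sites compete for the same budget instead of each meeting its own ratio.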

The core hypothesis is:

learn adaptive compression under a global expected FLOPs budget,
then preserve fine detail only where it helps prediction, state maintenance,
dense output reconstruction, or action-conditioned rollout.

This is deliberately not the same claim as “sparse attention is U-Net-like”. Sparse attention usually changes who communicates with whom while keeping the same resolution. Hierarchical compression changes the granularity of the active representation. A model becomes U-Net-like only when it also has a return path from coarse representations back to fine resolution.

Difference From Chain-of-Thought And Test-Time Reasoning

Chain-of-thought-style training and test-time reasoning mainly expose one variable: more reasoning tokens, tool calls, or rollout steps can be spent to improve answer quality. That is useful for offline reasoning, but it is incomplete for real-time systems.

This idea makes the budget part of the objective. The model is not only asked:

maximize quality

It is asked:

maximize quality and preserved state
subject to expected FLOPs, latency, or serving budget B

That changes the research question. A CoT-style method can answer “does extra thinking improve quality?” A fixed-FLOPs hierarchy asks “which internal representation should change so the model preserves decision-relevant state at the same budget?” For real-time observability, control, robotics, or interactive agents, the second question matters because the system cannot always buy more tokens or more wall-clock time.

So this page is a toolbox, not a single architecture. It collects possible model-internal variables that can be learned, routed, compressed, or expanded under a budget, including:

Variable that can change	Example budgeted decision
Active time windows	Spend little on repeated normal state, more on rare regimes and change points.
Patch or chunk boundaries	Make boring spans large and event boundaries fine-grained.
Channel groups	Preserve channels that carry local deviations or cross-channel state.
Active latent tokens or queries	Allocate more latent slots to dense or uncertain regions.
Mixer mode or explicit memory slots	Use attention or KV scratchpad only where exact history is worth the cost.
Width or layer participation	Use wide layers only where the state needs capacity.
Expert activations	Route hard windows to specialized experts while keeping easy windows cheap.
Recurrent loop steps	Iterate more only when the latent state has not converged.
Memory horizon or forgetting rate	Retain stable dynamics and rare safety signals longer than disposable detail.
Return-path detail	Decode or reconstruct fine state only where downstream heads need it.

The contrast with CoT is therefore not “CoT is wrong.” CoT changes the amount of explicit reasoning. This idea asks what else inside the model can vary when quality must improve under a fixed or real-time compute contract.

Architecture

The proposed architecture has seven separable pieces.

Modality adapter. Convert raw inputs into primitive representations: bytes or subwords for text, patches or pixels for images, spatiotemporal patches for video, samples or channel-time patches for multivariate time series, and categorical tokens for event streams.
Fine-level encoder. Build local representations before compression. This layer should preserve details that may later be needed for dense outputs, rare-event detection, or action-conditioned state updates.
Adaptive compression modules. At candidate compression sites, a router chooses boundaries, merges, latent queries, concept tokens, channel groups, or keep/drop decisions. The mechanism can look different by domain, but it should expose a comparable expected compute cost.
Global budget controller. Compression decisions compete for a shared expected-FLOPs budget. The controller should discourage each layer from solving its own local compression problem in isolation.
Coarse processor. The compressed representation is processed by a Transformer, SSM, Mamba-like model, MoE, recurrent backbone, or hybrid. The important property is not the operator family; it is that expensive global or semantic mixing happens at a cheaper granularity.
Return path. Dense tasks need information to flow back to the fine level. This can be implemented through upsampling, dechunking, cross-attention from fine tokens to coarse tokens, residual detail paths, skip connections, or decoder states. Without this return path, the model is a hierarchical encoder, not a U-Net-like architecture.
Task heads. The same hierarchy can feed forecasting, reconstruction, dense prediction, anomaly localization, next-state prediction, or action-conditioned world-model heads. The compression policy should be evaluated against the head that actually matters.

We do not assume the model initially discovers all possible compression points. A practical first version exposes a small set of candidate compression sites, while the budget controller learns how much each site is used. This removes hand-written compression ratios before it removes the human choice of candidate levels.
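The data path through the pieces above can be pictured with a toy one-dimensional series. This sketch wires only the adapter, fine encoder, compression, coarse processor, and return path; it omits the budget controller and task heads, takes the boundaries as given instead of learning them, and uses placeholder operators (means and a scale factor) that are assumptions for illustration.

```python
def adapter(raw, patch):
    """Modality adapter stand-in: split a series into fixed primitive patches."""
    return [raw[i:i + patch] for i in range(0, len(raw), patch)]


def fine_encoder(patches):
    """Fine-level encoder stand-in: one mean feature per primitive patch."""
    return [sum(p) / len(p) for p in patches]


def compress(fine, boundaries):
    """Compression stand-in: mean-pool fine features between boundaries."""
    spans = list(zip([0] + boundaries, boundaries + [len(fine)]))
    coarse = [sum(fine[a:b]) / (b - a) for a, b in spans]
    return coarse, spans


def coarse_processor(coarse):
    """Coarse processor stand-in: any mixing that runs on fewer tokens."""
    return [0.5 * v for v in coarse]


def return_path(fine, processed, spans):
    """Broadcast each coarse value over its span and add the fine skip path."""
    out = []
    for value, (a, b) in zip(processed, spans):
        out.extend(fine[i] + value for i in range(a, b))
    return out


raw = [1.0, 1.0, 1.0, 1.0, 5.0, 5.0, 1.0, 1.0]
fine = fine_encoder(adapter(raw, patch=2))
coarse, spans = compress(fine, boundaries=[2, 3])
dense = return_path(fine, coarse_processor(coarse), spans)
```

Four fine features become three coarse tokens, and the return path restores one output per fine position, which is the property that separates a U-Net-like design from a hierarchical encoder.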

Remaining Design Choices

The claim is not that the system has no heuristics. The research target is fewer architectural heuristics: replace per-layer compression schedules with one compute constraint and make the remaining choices explicit.

The remaining choices include the compression operators, candidate layers or hierarchy levels, the form of the FLOPs proxy, the gate temperature tau, the budget stabilization coefficient rho, the hard-routing rule, and the preservation probes or losses. The useful comparison is therefore against a hand-scheduled hierarchy at matched expected compute, not against an impossible zero-design-prior baseline.
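The note does not fix how tau and rho enter the objective. One plausible reading, shown below purely as an assumption, is a temperature-scaled sigmoid gate for training, a sign threshold as the hard-routing rule, and an augmented-Lagrangian penalty in which rho weights the quadratic term and sets the multiplier step. The budget is treated here as a target rather than a ceiling.

```python
import math


def soft_gate(logit, tau):
    """Training-time gate: lower tau pushes the gate toward a hard decision."""
    return 1.0 / (1.0 + math.exp(-logit / tau))


def hard_gate(logit):
    """Assumed hard-routing rule: keep the unit when the logit is positive."""
    return 1 if logit > 0.0 else 0


def budget_penalty(expected_cost, budget, lam, rho):
    """Augmented-Lagrangian term for the constraint expected_cost = budget."""
    gap = expected_cost - budget
    return lam * gap + 0.5 * rho * gap * gap


def update_multiplier(lam, expected_cost, budget, rho):
    """Dual ascent step: raise lam when over budget, lower it when under."""
    return lam + rho * (expected_cost - budget)


penalty = budget_penalty(expected_cost=6.0, budget=5.0, lam=0.5, rho=2.0)
next_lam = update_multiplier(lam=0.5, expected_cost=6.0, budget=5.0, rho=2.0)
```

Whatever form is chosen, the gap between `soft_gate` during training and `hard_gate` at inference is itself something the hardware audit should measure.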

Static Bowtie Baseline From Variable-Width Transformers

Variable-Width Transformers closes one concrete subproblem and raises the bar for this idea. It shows that the full hidden width of Transformer blocks can be allocated nonuniformly across depth, with early and late layers wide and middle layers narrow, while keeping parameters matched to a uniform baseline. The carry-forward residual stream acts as a simple return path: coordinates inactive in the bottleneck bypass narrow layers and re-enter later wide layers without learned projection matrices.

That is a meaningful partial closure because the fixed-FLOPs idea needs a credible hand-scheduled baseline. A learned budget controller should not only beat a uniform Transformer; it should beat a static Variable-Width-Transformers-style bowtie schedule under matched expected training FLOPs, hard inference budget, and realized latency. The paper does not close the core adaptive part: its schedule is manually swept and static, and it does not choose time patches, channel groups, event windows, modality fragments, or expert activations per input.

For time-series and world-model work, the transfer hypothesis is therefore narrower:

static variable width is the baseline;
learned data-dependent allocation must prove extra value.
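The carry-forward residual path described above can be pictured with a toy residual stream. In this sketch each layer updates only its active coordinates and passes the rest through untouched. Treating the first `active_width` coordinates as the active ones, and the constant toy update, are assumptions for illustration rather than details taken from the paper.

```python
def apply_layer(stream, active_width, update):
    """Update the active coordinates; carry inactive ones forward unchanged."""
    active = stream[:active_width]
    delta = update(active)
    updated = [a + d for a, d in zip(active, delta)]
    return updated + stream[active_width:]


def run_bowtie(stream, widths, update):
    """Apply layers in order with a wide, narrow, wide width schedule."""
    for width in widths:
        stream = apply_layer(stream, width, update)
    return stream


def toy_update(active):
    """Placeholder block: adds 1.0 to every coordinate it sees."""
    return [1.0] * len(active)


out = run_bowtie([0.0, 0.0, 0.0, 0.0], widths=[4, 2, 4], update=toy_update)
```

The last two coordinates skip the narrow middle layer and re-enter at the final wide layer with no projection, which is the sense in which the residual stream acts as a return path.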

Why The Return Path Matters

The return path is the difference between compression for cheap representation learning and compression for dense prediction or control.

For text, a compressed concept stream may still need byte-level recovery for exact copying, spelling, code, or robust multilingual behavior.

For image/video, a compressed scene or object stream may still need patch-level recovery for localization, OCR, motion boundaries, or dense generation.

For multivariate time series, a compressed latent state may still need sample-level and channel-level recovery for spikes, change points, missingness, rare failures, and channel-specific deviations.

For action-conditioned world models, the preservation target is not pixel-perfect or sample-perfect reconstruction. The target is to preserve the state variables and event boundaries needed to evaluate future observations under actions, control inputs, or interventions.

The shared rule should be:

compress by local information density and downstream value,
not by modality name.

That does not mean all modalities should share the same adapter or the same router. It means the architecture should avoid rules like “images always compress by 4x” or “time series always compress early”. The model should learn whether a region, span, channel group, event window, or trajectory segment deserves more compute.

For multivariate time series, time compression and channel compression must be separated. A temporal patcher can preserve event boundaries while losing channel identity. A channel bottleneck can preserve global trends while erasing local deviations. A useful model may need one router for temporal structure, another for channel or topology structure, and a shared budget controller above both.
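A minimal picture of two routers under one controller: if the temporal and channel routers decide independently, the expected number of active channel-time cells is the product of their expected keep counts, and a shared controller can shrink both toward one cell budget. The independence assumption, the equal-scaling rule, and the function names are all illustrative.

```python
def expected_active_cells(time_keep, channel_keep):
    """Expected kept channel-time cells under independent routers."""
    return sum(time_keep) * sum(channel_keep)


def rescale_to_budget(time_keep, channel_keep, budget_cells):
    """Shared controller stand-in: scale both routers by the same factor."""
    current = expected_active_cells(time_keep, channel_keep)
    factor = min(1.0, (budget_cells / current) ** 0.5)
    return (
        [p * factor for p in time_keep],
        [q * factor for q in channel_keep],
    )


# Keep-probabilities per time patch and per channel group.
time_keep = [1.0, 1.0, 0.5, 0.5]
channel_keep = [1.0, 0.5, 0.5]
new_time, new_channel = rescale_to_budget(time_keep, channel_keep, budget_cells=1.5)
```

Equal scaling is the least informed policy. A learned controller would shift budget toward whichever axis loses less decision-relevant state, which is what the separate probes for event timing and cross-channel binding are meant to reveal.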

Adaptive Patching And JEPA Targets

Dynamic patching is not automatically compatible with predictive latent targets. Next-Embedding Prediction records an internal NEPA-style time-series failure mode where patch-independent target embeddings were stable, while more context-dependent patching or internal-layer prediction degraded the objective. This is not direct pure-JEPA evidence. The lesson is not to avoid adaptive patching; it is to make target construction part of the ablation plan for both NEPA-style and JEPA-style systems.

For this fixed-FLOPs hierarchy idea, that means every adaptive compression site should be tested against preservation probes. A router that saves compute by correlating neighboring patches may lower the average objective while making surprise scores, rare-event selection, or dense numeric recovery worse. The first implementation should therefore keep an explicit patch-independent baseline and compare it with dynamic patching under the same expected compute.

Relation To The Referenced Work

The proposal is meant to compose ideas from the current corpus rather than replace them.

H-Net is the sequence-hierarchy reference: dynamic chunking, compression, processing, and reconstruction live inside the model.
ConceptMoE is the concept-compression and compute-allocation reference: token spans can be merged before expensive expert computation, and comparisons can be made under activated-FLOPs constraints.
Compute Optimal Tokenization is the compression-aware scaling reference: token granularity changes the scaling unit, so comparisons should name the information-density unit and serving-cost tradeoff rather than only token count.
Scaling Laws, Carefully is the fixed-compute experiment-design reference: compare allocation frontiers, disclose data-regime and fitting sensitivity, and do not treat one matched-FLOPs win as a scaling-law result.
Synergy is the abstraction-routing reference: the model learns a bridge from byte-level processing to higher-level concept processing.
ReinPatch is the time-series boundary-selection reference: continuous time-series patching should be optimized for the downstream forecasting objective, not copied directly from language chunking.
U-Cast is the high-dimensional multivariate reference: the channel axis can require its own hierarchy, bottleneck, and upsampling path.
Beyond Language Modeling is the multimodal compute-allocation reference: modality mixtures can have asymmetric compute and specialization needs.
Scaling Test-Time Compute for Agentic Coding is the agentic-runtime compression reference: structured summaries can be a better substrate than raw trajectories for selection and reuse.
Learning is Forgetting is the training-dynamics compression reference: representations improve when they preserve target-relevant information and discard irrelevant input detail.
FADE is the controlled-forgetting reference: memory horizons can be adapted instead of fixed globally.
Learn From Your Own Latents And Not From Tokens adds a baseline warning for explicit hierarchy: on RHM, data2vec-like own-latent targets can implicitly climb the hierarchy, so any fixed-FLOPs hierarchy should beat a single latent-prediction baseline under matched compute and preservation probes.
Latent Context Language Models are the soft-token context-compression reference: staged training, auxiliary reconstruction, and exact expansion show that even text compression needs explicit detail-retention and recovery mechanisms when compressed state may hide rare or lookup-critical facts.
Compress & Attend Transformers are the manually selectable context-compression budget reference: one adaptive model can expose chunk size as a serving knob, but learned data-dependent budget allocation remains open.
Variable-Width Transformers are the static width-allocation baseline: layer width itself can be a budgeted hierarchy, and the carry-forward residual path is a simple bottleneck return path, but the schedule is not learned per input.
Gated DeltaNet is the scalar-gated fast-weight reference for bounded recurrent memory before the newer erase/write decoupling.
Oryx is the sequence-axis mixer-allocation reference: attention and recurrent mixers can share key/value representations while the chosen mixer mode varies by span.
Hybrid Associative Memories are the explicit-memory-budget reference: a recurrent state handles predictable context while the KV scratchpad stores hard-to-predict tokens under a cache target.
HOLA is the fixed exact-memory-budget reference: each GDN layer keeps the top-$w$ key-value pairs by committed update magnitude and uses a separate sharpened read. Its local score is a useful candidate signal, not yet a learned hierarchy or global FLOPs controller.

The synthesis claim is narrower than “fixed FLOPs are new.” Fixed-FLOPs constrained optimization and conditional computation are not new by themselves. The useful composition is dynamic hierarchical compression plus a global expected-FLOPs constraint plus cross-modal information-density routing plus world-model and time-series preservation.

In that composition, fixed expected FLOPs act as the common pressure that makes compression decisions comparable across layers, modalities, and axes.

Recurrent Depth And Energy Loops

Recurrent Transformer depth is adjacent to this idea because it spends extra computation at test time without adding a proportional number of unique weights. The useful deployment story is real: bounded-memory inference, early exit when representations converge, and loop count as a possible uncertainty or failure signal. LoopFormer makes the loop budget explicit through elastic depth and shortcut consistency, while FPRM turns fixed-point residual convergence into a stopping rule and shows that signal-propagation fixes can remove the need for a hand-designed fast/slow recursive hierarchy on symbolic reasoning tasks. Parcae adds stability and saturation evidence for looped scaling. Hyperloop Transformers adds the parameter-memory variant: loop-level hyper-connections can improve a looped model’s parameter-performance frontier, while the early-exit evidence remains suggestive until tested under fixed expected-FLOPs and latency budgets. But as research evidence this branch still needs a compute constraint and a strong baseline, because a wider or deeper non-recurrent model with unique layers may solve the same task better when memory or latency is not binding.
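The general shape of a residual-based stopping rule is easy to sketch: iterate a shared block, stop when the state stops moving, and report the loop count. This is a generic fixed-point loop under assumed names and tolerances, not the FPRM or LoopFormer procedure.

```python
def loop_until_converged(step, state, tol=1e-3, max_loops=16):
    """Iterate `step` until the largest coordinate change falls below `tol`.

    Returns (final_state, loops_used, converged).
    """
    for loops in range(1, max_loops + 1):
        new_state = step(state)
        residual = max(abs(a - b) for a, b in zip(new_state, state))
        state = new_state
        if residual < tol:
            return state, loops, True
    return state, max_loops, False


def contraction(state):
    """Toy shared block with fixed point 2.0 in every coordinate."""
    return [0.5 * x + 1.0 for x in state]


def drift(state):
    """Toy block that never converges."""
    return [x + 1.0 for x in state]


final_state, loops_used, converged = loop_until_converged(contraction, [0.0])
_, drift_loops, drift_converged = loop_until_converged(drift, [0.0])
```

The returned loop count doubles as the cost variable for budget accounting, and hitting `max_loops` without convergence is the kind of event that could serve as an uncertainty or failure signal.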

Looped World Models adds the world-model-specific variant: recurrent depth is spent inside an action-conditioned transition, and deferred decoding shifts cost away from per-step output heads. It is useful evidence for the idea shape, but still needs fixed expected-FLOPs, latency, hidden-state-drift, and closed-loop transfer tests.

Energy-Based Transformers add another adjacent path: score candidate predictions with an energy function and spend optimization steps where the candidate is hard. Extra recurrence inside that loop is not automatically useful, because the energy-gradient inference procedure is already iterative. The first question should be whether the energy landscape, candidate generation, or stopping rule is the bottleneck.

What Prevents Collapse Into Over-Compression?

A global budget by itself can create a bad incentive: erase fine detail until the average objective looks cheap enough. Probes can catch some of this after the fact, but the training objective should also include explicit retaining forces:

L = L_predictive/world-model
  + L_preservation
  + L_budget
  + L_anti-collapse

L_predictive/world-model is the main predictive objective: next-token prediction, forecasting, latent prediction, next-state prediction, or action-conditioned rollout. L_preservation keeps information that should survive compression even when it is rare, local, or only visible at fine resolution. L_budget prices compute. L_anti-collapse prevents degenerate representations or routing policies that satisfy the budget by making many states indistinguishable.
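How the four terms combine is a design choice. The sketch below assumes scalar weights (`alpha` on preservation, plus hypothetical `beta` and `gamma`) and a one-sided, budget-normalized penalty for the budget term. None of these forms is prescribed by this note.

```python
def total_loss(pred, preservation, expected_cost, budget, collapse,
               alpha=1.0, beta=1.0, gamma=1.0):
    """Weighted sum of the four objective terms.

    The budget term only charges for spending above the budget and is
    normalized by the budget so that beta is unit-free.
    """
    budget_term = max(0.0, expected_cost - budget) / budget
    return pred + alpha * preservation + beta * budget_term + gamma * collapse


over_budget = total_loss(0.5, 0.2, expected_cost=6.0, budget=5.0, collapse=0.1,
                         alpha=0.5, beta=2.0, gamma=1.0)
under_budget = total_loss(0.5, 0.2, expected_cost=4.0, budget=5.0, collapse=0.1,
                          alpha=0.5, beta=2.0, gamma=1.0)
```

A one-sided penalty lets the model underspend freely. If the experiment needs matched compute rather than a ceiling, a two-sided or multiplier-based term would be the closer fit.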

The right retaining force depends on the domain.

JEPA-like world models. Latent prediction should be paired with SIGReg, variance regularization, Gaussian regularization, or a similar anti-collapse term. LeJEPA and LeWorldModel are the local references for this branch, while JEPA Slow Features warns that a representation can avoid constant collapse and still encode the wrong factors.
Time series. Preservation should explicitly protect rare events, change points, missingness patterns, regime transitions, cross-channel deviations, and other state variables that average forecasting loss can underweight.
Dense outputs. Reconstruction, dechunking, upsampling, or fine-level decoder losses make the return path accountable for details that the compressed stream alone would otherwise erase.
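For the JEPA-like branch, the simplest of the listed options is variance regularization. The sketch below uses a hinge on the per-dimension standard deviation of a batch of embeddings; the target of 1.0 and the small `eps` are assumptions, and this is not SIGReg or the exact term used by any of the referenced systems.

```python
import math


def variance_penalty(embeddings, target_std=1.0, eps=1e-4):
    """Mean hinge penalty on dimensions whose batch std is below target_std."""
    n = len(embeddings)
    dims = len(embeddings[0])
    penalty = 0.0
    for d in range(dims):
        column = [e[d] for e in embeddings]
        mean = sum(column) / n
        variance = sum((x - mean) ** 2 for x in column) / (n - 1)
        std = math.sqrt(variance + eps)
        penalty += max(0.0, target_std - std)
    return penalty / dims


# A collapsed batch (all rows identical) versus a well-spread batch.
collapsed = [[1.0, 2.0] for _ in range(4)]
spread = [[-2.0, -2.0], [2.0, 2.0], [-2.0, 2.0], [2.0, -2.0]]
collapsed_penalty = variance_penalty(collapsed)
spread_penalty = variance_penalty(spread)
```

A term like this only guards against constant collapse. As the JEPA Slow Features warning implies, a representation can pass it and still encode the wrong factors, so it complements the preservation probes rather than replacing them.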

The design goal is not to prevent compression. It is to make over-compression pay when it destroys decision-relevant state.

FADE adds the continual-learning analogue: forgetting is useful when it selectively clears stale information, but harmful when one fixed decay rule erases stable knowledge. For budgeted hierarchy, this suggests that compression should also have different memory horizons: some fine detail can be dropped quickly, while rare safety signals, action effects, and stable dynamics should be retained.

Evaluation

Average pretraining loss is not enough. Compression failures often show up at the wrong scale.

Useful probes include:

Text: byte and character robustness, code, multilingual spans, exact copying, and long-context retrieval.
Images: localization, OCR, small objects, chart/table reading, dense prediction, and generation fidelity.
Video: motion boundaries, event order, delayed effects, and action-conditioned future frames.
Time series: spikes, change points, rare events, regime shifts, missingness, scale changes, and cross-channel dependencies.
World-model settings: next-state prediction, action consequence prediction, counterfactual prediction when causal assumptions are valid, uncertainty over plausible futures, and recovery from partial observations.

The most important evaluation question is whether the router learns to spend compute on decision-relevant information, not whether it merely reduces average token count.

For learned compression, the budget controller should report target-relevant information preserved per expected FLOP, then expose which rare events, dense numeric details, or action variables are lost. Otherwise a model can look efficient while silently moving the cost into downstream decision errors.
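A report of this kind could be as simple as a score-per-FLOP number plus a list of probes that regressed against an uncompressed reference. Using the mean probe score as the stand-in for preserved target-relevant information is an assumption, as are the probe names, the tolerance, and the numbers.

```python
def preservation_report(probe_scores, baseline_scores, expected_gflops,
                        tolerance=0.02):
    """Summarize preserved probe quality per unit of expected compute.

    A probe counts as lost when it falls more than `tolerance` below the
    score of the uncompressed baseline on the same probe.
    """
    mean_score = sum(probe_scores.values()) / len(probe_scores)
    lost = sorted(
        name for name, score in probe_scores.items()
        if baseline_scores[name] - score > tolerance
    )
    return {
        "score_per_gflop": mean_score / expected_gflops,
        "lost_probes": lost,
    }


# Hypothetical probe accuracies for a compressed model and its baseline.
report = preservation_report(
    probe_scores={"rare_regime": 0.60, "dense_numeric": 0.90, "action_effect": 0.84},
    baseline_scores={"rare_regime": 0.80, "dense_numeric": 0.91, "action_effect": 0.85},
    expected_gflops=2.0,
)
```

The headline ratio can look healthy while `lost_probes` names the rare-regime probe, which is the failure the single number would hide.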

After Variable-Width Transformers, the baseline suite should include a static nonuniform-width Transformer in addition to a uniform Transformer, hand-scheduled patch/channel hierarchies, sparse attention, recurrent state, and MoE routing. If a learned controller cannot beat a static bowtie schedule under matched parameters, expected FLOPs, KV-cache/memory budget, and realized latency, the extra routing complexity is not justified.

The main plots should therefore be frontier plots rather than only tables: quality versus expected FLOPs, preservation versus expected FLOPs, realized latency versus expected FLOPs, and rare-state utility versus normal-window utility. A model that improves average forecast loss while degrading rare-regime or action-conditioned probes has not solved the fixed-FLOPs hierarchy problem; it has learned a cheaper lossy summary.
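Extracting the frontier for such a plot is a short computation once runs are recorded as (expected FLOPs, quality) pairs. The sketch below keeps a run only if no cheaper-or-equal run already matches or exceeds its quality. The data points are made up.

```python
def pareto_frontier(points):
    """Return the non-dominated (expected_flops, quality) points.

    Lower FLOPs and higher quality are both better. Points are scanned in
    order of increasing cost, so a point survives only if it improves on
    the best quality seen so far.
    """
    frontier = []
    best_quality = float("-inf")
    for flops, quality in sorted(points, key=lambda p: (p[0], -p[1])):
        if quality > best_quality:
            frontier.append((flops, quality))
            best_quality = quality
    return frontier


# Hypothetical runs at several budgets.
points = [(1.0, 0.70), (2.0, 0.68), (2.0, 0.75), (4.0, 0.74), (8.0, 0.80)]
frontier = pareto_frontier(points)
```

Running the same extraction with preservation or rare-state utility on the quality axis gives the companion frontiers, and comparing them shows whether the average-quality frontier was bought by giving up rare-regime performance.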

Relation To Foundation TSFM Agenda

This is an idea page, so the verdicts below describe the intended contribution if the proposed system or experiment works. Evidence status is recorded separately in the Evidence and Missing pieces columns.

Agenda slot	Verdict	Evidence	Missing pieces
Dynamic compute allocation	partially closes	If validated, adaptive compression and routing under a fixed expected training FLOPs budget would make compute allocation a learned part of pretraining instead of a hand-set schedule; Scaling Laws, Carefully makes this an IsoFLOP-frontier comparison rather than a one-budget comparison.	Train a time-series system that learns routing policies and reports matched-compute results across model size, data/effective-data amount, budget, compression target, preservation loss, and realized latency.
Patch size, dynamic tokenization, and point-wise numeric embeddings	partially closes	If validated, the idea would make temporal compression and fine-detail return paths adaptive rather than fixed by one patch size.	Define concrete tokenization operators that preserve spikes, change points, missingness, and local causal events.
Native multivariate encoding and high-channel scaling	partially closes	If validated, the idea would make channel hierarchy and topology-aware bottlenecks separate learned decisions from temporal compression.	Demonstrate cross-channel interaction retention at hundreds or thousands of channels.
Representation quality	warning	Identifies over-compression as a likely failure mode: average objectives can erase rare or decision-relevant detail even if compute is saved.	Add preservation probes for semantic state, dense numeric detail, and action-conditioned rollout.
Continual adaptation	adjacent	FADE suggests forgetting can be selective and learned rather than a uniform decay rule.	Show adaptive memory horizons for streaming time-series or world-model state, not only final-layer classifiers.
Control and counterfactuals	adjacent	Proposes action-conditioned world-model heads as one target for deciding what detail must survive compression.	Evaluate on a control dataset with logged action channels and counterfactual rollout targets.

Open Questions

Can one budget controller coordinate temporal compression, channel compression, concept compression, and expert routing?
After the initial fixed expected training FLOPs target, which hard budget contracts matter most: per sample, per batch, dataset average, modality mixture, serving endpoint, or latency target?
Can candidate compression sites themselves be learned, or is exposing a small human-chosen set the right first system?
How should the model expose uncertainty when compression removes fine detail?
Should adaptive compression routers and adaptive parameter-decay policies be coupled, or evaluated separately, given that FADE changes slow-weight memory horizons rather than per-sample compute?
What probe suite catches rare-event erasure early enough during pretraining?
Can fixed-FLOPs hierarchy become a foundation-model primitive for action-conditioned time-series world models rather than another preprocessing trick?
When adaptive patching is used with JEPA-style targets, which target construction avoids patch-dependence shortcuts?
Can recurrent loop convergence provide calibrated uncertainty or early-exit signals under a fixed expected compute budget?
Can learned data-dependent routing beat a static ><former-style bowtie width schedule once matched parameters, expected FLOPs, KV-cache size, kernel availability, and wall-clock latency are counted?
What small-run grid is sufficient to fit a stable scaling law over model size, effective data, budget, compression target, and preservation loss without overfitting the frontier?
Which first useful-signal-poor corpus should anchor the experiment: observability telemetry, industrial process data, grid-control trajectories, or another multivariate time-series setting with repeated normal windows and sparse action/control-input evidence?

Technical Details

Differentiable FLOPs Proxy

Real inference FLOPs depend on hard routing decisions. During training, the model needs a differentiable proxy.

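For a keep, merge, or boundary router with logit z_i and temperature tau, the soft gate probability is p_i. For keep and merge routers, the expected retained token count K_tilde is the sum of the gates: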

p_i = sigmoid(z_i / tau)
K_tilde = sum_i p_i

For a boundary router, the expected number of chunks is one more than the expected number of boundaries:

K_tilde = 1 + sum_i P(boundary after token i)
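The two proxies above can be sketched in plain Python as follows. This is an illustration under stated assumptions rather than a training implementation: the function names are hypothetical, and the analytic gradient is written out by hand only to show that the soft count stays differentiable in the router logits. A real system would rely on an autodiff framework.

```python
import math
from typing import List


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def expected_kept_tokens(logits: List[float], tau: float) -> float:
    """K_tilde = sum_i p_i with p_i = sigmoid(z_i / tau)."""
    return sum(sigmoid(z / tau) for z in logits)


def expected_chunks(boundary_logits: List[float], tau: float) -> float:
    """K_tilde = 1 + sum_i P(boundary after token i)."""
    return 1.0 + sum(sigmoid(z / tau) for z in boundary_logits)


def grad_kept_tokens(logits: List[float], tau: float) -> List[float]:
    """dK_tilde/dz_i = p_i * (1 - p_i) / tau, the derivative of the soft gate."""
    probs = [sigmoid(z / tau) for z in logits]
    return [p * (1.0 - p) / tau for p in probs]


# Four undecided tokens (logit 0) are each kept with probability 0.5.
print(expected_kept_tokens([0.0, 0.0, 0.0, 0.0], tau=1.0))
# Three undecided boundary sites give 1 + 1.5 expected chunks.
print(expected_chunks([0.0, 0.0, 0.0], tau=1.0))
```

Lowering tau sharpens the gates toward 0 or 1, which brings the soft count closer to the hard count but also shrinks the gradient for confidently routed tokens.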

For a Transformer-like layer, a simple expected-cost estimate is:

C_l ~= a_l K_tilde_l d_l^2
     + b_l K_tilde_l^2 d_l
     + c_l K_tilde_l d_l d_ff

For local attention, replace the quadratic attention term with the local-window cost. For MoE, count activated experts rather than total parameters. For channel hierarchies, account separately for temporal tokens and channel groups.
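A hedged sketch of the layer cost estimate follows. The coefficients a_l, b_l, and c_l are left open above; the defaults here (8, 4, 4) are an assumption corresponding to a standard dense block with four attention projections, two attention matmuls, and a two-matmul MLP, counting one multiply-add as two FLOPs. The function name layer_cost and the window argument are hypothetical.

```python
from typing import Optional


def layer_cost(
    k_tilde: float,
    d: int,
    d_ff: int,
    a: float = 8.0,   # assumed: Q, K, V, O projections, 2 FLOPs per multiply-add
    b: float = 4.0,   # assumed: QK^T and attention-times-V matmuls
    c: float = 4.0,   # assumed: two MLP matmuls
    window: Optional[int] = None,
) -> float:
    """C_l ~= a*K*d^2 + b*K^2*d + c*K*d*d_ff for an expected token count K."""
    projections = a * k_tilde * d * d
    # Local attention: each token attends to at most `window` tokens,
    # so the quadratic term K^2 becomes K * min(K, window).
    span = k_tilde if window is None else min(k_tilde, float(window))
    attention = b * k_tilde * span * d
    mlp = c * k_tilde * d * d_ff
    return projections + attention + mlp


full = layer_cost(k_tilde=10, d=4, d_ff=16)
local = layer_cost(k_tilde=10, d=4, d_ff=16, window=5)
halved = layer_cost(k_tilde=5, d=4, d_ff=16)
print(full, local, halved)
```

Because K_tilde enters the attention term quadratically, halving the expected token count cuts the estimate by more than half under full attention. That superlinear saving is what gives the controller leverage at compression sites.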

Budget Objective

One operational version of L_budget is the equality-budget objective:

g = C_tilde / B - 1
 
L_budget = lambda * g
         + (rho / 2) * g^2

B is the target FLOPs budget. lambda is a dual variable updated during training. rho is a quadratic stabilization coefficient. The linear term prices budget violation; the quadratic term reduces oscillation around the target.

For the primary research target, C_tilde can be averaged over the batch or modality mixture, making the constraint fixed in expectation rather than fixed for every sample. If the serving contract is “do not exceed this budget” rather than “use this budget”, replace g with max(0, g) and apply it per sample, request, or endpoint as needed.
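The equality and "do not exceed" variants can be sketched as below. The dual update rule is not specified above; this sketch assumes plain dual ascent, lambda <- lambda + eta * g, with lambda clamped at zero for the inequality case. The step size eta and all function names are hypothetical.

```python
def budget_violation(c_tilde: float, budget: float) -> float:
    """g = C_tilde / B - 1: positive when over budget, negative when under."""
    return c_tilde / budget - 1.0


def budget_loss(g: float, lam: float, rho: float, at_most: bool = False) -> float:
    """L_budget = lambda * g + (rho / 2) * g^2.

    With at_most=True, g is replaced by max(0, g), so running under budget
    is not penalized.
    """
    if at_most:
        g = max(0.0, g)
    return lam * g + 0.5 * rho * g * g


def dual_step(lam: float, g: float, eta: float, at_most: bool = False) -> float:
    """Assumed dual ascent: raise the price when over budget, lower it when under."""
    lam = lam + eta * g
    # For an inequality constraint the multiplier must stay non-negative.
    return max(0.0, lam) if at_most else lam


# Example: batch-averaged proxy cost is 20% over the target budget.
g = budget_violation(c_tilde=1.2e9, budget=1.0e9)
print(budget_loss(g, lam=1.0, rho=10.0))
print(dual_step(lam=1.0, g=g, eta=0.5))
```

In the equality case lambda may become negative, which rewards the model for spending more compute when it sits under the target. In the inequality case the clamp prevents that, so the constraint becomes inactive whenever the model is under budget.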

The full loss should then compose the budget term with the task and retention terms:

L = L_predictive/world-model
  + alpha * L_preservation
  + beta * L_anti-collapse
  + L_budget

The differentiable proxy usually targets training compute first. For deployment, the same controller may need a second calibrated cost model that estimates hard-routed serving FLOPs or measured latency.

Hard Routing

Training can use soft gates, straight-through estimators, Gumbel-style relaxations, or soft top-k. Inference should use hard decisions: top-k, boundary selection, latent-query selection, or quota-based routing.

The practical risk is train/inference mismatch. Late-stage training should therefore anneal toward harder decisions, or periodically evaluate the exact hard-routing compute path.
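One way to monitor the train/inference mismatch is to compare the soft expected token count with the count a hard rule would produce, while annealing the gate temperature. The sketch below assumes a geometric temperature schedule and a threshold-at-zero hard rule; both are illustrative choices, and every name in it is hypothetical.

```python
import math
from typing import List


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def hard_top_k(logits: List[float], k: int) -> List[int]:
    """Indices of the k largest logits, returned in sequence order."""
    ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    return sorted(ranked[:k])


def annealed_tau(step: int, total_steps: int,
                 tau_start: float = 1.0, tau_end: float = 0.1) -> float:
    """Assumed geometric anneal from tau_start to tau_end."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return tau_start * (tau_end / tau_start) ** frac


def soft_hard_gap(logits: List[float], tau: float) -> float:
    """|soft expected count - hard count| when hard routing keeps z_i > 0."""
    soft = sum(sigmoid(z / tau) for z in logits)
    hard = sum(1 for z in logits if z > 0)
    return abs(soft - hard)


logits = [2.0, -2.0, 0.5]
print(hard_top_k([0.1, 2.0, -1.0, 0.5], k=2))
print(soft_hard_gap(logits, tau=annealed_tau(0, 100)))
print(soft_hard_gap(logits, tau=annealed_tau(100, 100)))
```

The gap shrinks as tau decreases, so logging it during late-stage training gives a cheap signal for when the differentiable proxy has become a trustworthy stand-in for the hard-routed compute path.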

Flow-Of-Ranks Note

FlowRanks may be useful as an optional diagnostic: rank spectra can suggest where representations look redundant, or help initialize router biases. It should not be treated as the core method. Rank can miss rare but decision-relevant events, and the main compression policy should be learned from the budgeted objective plus task and preservation losses.

Collaboration And Links

If this direction resonates with you, I would be happy to talk with like-minded people, collaborate on research, and work on use-cases together.

Ideas are not the bottleneck. Hands are. Time-series modeling should be moving at least as fast as vision, audio, and robotics.

Source pages:
