Kairos: Toward Adaptive and Parameter-Efficient Time Series Foundation Models

Source

Core Claim

Kairos argues that time-series foundation models can improve zero-shot forecasting generalization through adaptive temporal abstraction rather than primarily through larger parameter counts. Its three components (dynamic patching, mixture-of-size encoding, and dynamic RoPE) let the model adapt token granularity and positional scale to heterogeneous time-series structure.

Key Contributions

  • Introduces a Mixture-of-Size Encoder that routes each coarse segment to a sparse set of patch-size experts, with null experts allowing the model to skip unnecessary granularities.
  • Adds Dynamic Rotary Position Embedding (DRoPE), which modulates RoPE frequencies from instance-level spectral features and calibrates token positions for mixed patch sizes.
  • Uses a Multi-Patch Decoder with learnable forecast tokens to predict multiple future patches in parallel, reducing the amount of autoregressive rollout needed for longer horizons.
  • Builds the Predictability-Stratified Time Series (PreSTS) pretraining corpus, over 300B time points sampled to prioritize predictable real-world sequences while adding complementary synthetic data.
  • Reports zero-shot forecasting results on GIFT-Eval and Time-Series-Library, plus frozen-representation transfer results on UCR classification tasks.
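The rollout saving claimed for the Multi-Patch Decoder comes down to simple arithmetic, sketched below. The function name `rollout_steps`, the horizon of 720, and the choice of 4 patches per step are illustrative assumptions, not values taken from the paper.

```python
import math

def rollout_steps(horizon, patch_size, patches_per_step=1):
    """Number of autoregressive decoding steps needed to cover `horizon` points
    when each step emits `patches_per_step` patches of length `patch_size`."""
    return math.ceil(horizon / (patch_size * patches_per_step))

# Illustrative numbers only (hypothetical horizon and decoder width).
print(rollout_steps(720, 64, patches_per_step=1))  # 12
print(rollout_steps(720, 64, patches_per_step=4))  # 3
```

Fewer rollout steps mean fewer opportunities for forecast errors to feed back into later predictions, which is the motivation the contribution list gives for parallel multi-patch prediction.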

Benchmarked Models

Model | Role In Paper | Notes | Official Artifact
Kairos-10M | Mini benchmarked checkpoint | The paper’s mini configuration uses 4 layers, 4 heads, 256 model width, 10M parameters, and patch sizes {32, 64, 128}. | mldi-lab/Kairos_10m
Kairos-23M | Small benchmarked checkpoint | The paper’s small configuration uses 4 layers, 8 heads, 384 model width, 23M parameters, and patch sizes {32, 64, 128}. On GIFT-Eval, it is reported ahead of several larger zero-shot TSFMs by normalized MASE. | mldi-lab/Kairos_23m
Kairos-50M | Base released checkpoint | The paper’s base configuration is reported as 53M parameters with 6 layers, 8 heads, 512 model width, and patch sizes {32, 64, 128, 256}; the requested official artifact is the 50m checkpoint release. | mldi-lab/Kairos_50m

Method Notes

Kairos is a passive forecasting model: it predicts future numeric observations from historical observations and does not introduce an explicit action, control input, treatment, or intervention channel. It handles multivariate time series with channel-independent modeling, so each variable is treated as an individual sequence rather than through native cross-channel dynamics.
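Channel-independent modeling can be illustrated with a small reshaping sketch: every channel of a multivariate batch is flattened into its own univariate series, forecast separately, and reshaped back. The helper names and the last-value forecaster are hypothetical stand-ins for illustration; they are not the Kairos model or its API.

```python
import numpy as np

def naive_last_value(series, horizon):
    """Stand-in univariate forecaster (assumption): repeat the last observed value.
    series: (n_series, context_len) -> (n_series, horizon)."""
    return np.repeat(series[:, -1:], horizon, axis=1)

def forecast_channel_independent(x, horizon, univariate_forecaster=naive_last_value):
    """x: (batch, context_len, channels) -> (batch, horizon, channels).
    Each channel is forecast on its own; no information crosses channels."""
    batch, context_len, channels = x.shape
    # Move channels next to batch, then fold them into the batch dimension.
    flat = x.transpose(0, 2, 1).reshape(batch * channels, context_len)
    flat_forecast = univariate_forecaster(flat, horizon)
    # Unfold channels and restore the (batch, time, channels) layout.
    return flat_forecast.reshape(batch, channels, horizon).transpose(0, 2, 1)

x = np.arange(2 * 5 * 3, dtype=float).reshape(2, 5, 3)
y = forecast_channel_independent(x, horizon=4)
print(y.shape)  # (2, 4, 3)
```

Because the forecaster only ever sees one channel at a time, any dependency between variables has to be ignored, which is the trade-off the paragraph above describes.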

The Mixture-of-Size Encoder first partitions a sequence into coarse segments, then routes each segment to selected patch-size experts. This creates a variable effective tokenization: stable regions can use coarser tokens, while volatile or high-information regions can use finer tokens.
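The routing idea can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the coarse segment length of 128, the top-2 selection, the single null expert, and the linear router applied to raw values are all hypothetical choices, and the learned patch embeddings and gating weights of a real mixture-of-experts layer are omitted.

```python
import numpy as np

PATCH_SIZES = (32, 64, 128)  # patch-size experts of the mini/small configurations
N_NULL = 1                   # number of null experts (assumption for illustration)

def route_segments(series, router_weights, segment_len=128, top_k=2):
    """Split `series` into coarse segments and, for each one, return the list of
    (patch_size, patches) produced by the experts the router selected.
    router_weights: (segment_len, len(PATCH_SIZES) + N_NULL), a hypothetical linear router."""
    n_segments = len(series) // segment_len
    segments = series[: n_segments * segment_len].reshape(n_segments, segment_len)
    logits = segments @ router_weights              # (n_segments, n_experts)
    top = np.argsort(-logits, axis=1)[:, :top_k]    # top-k expert indices per segment
    routed = []
    for seg, chosen in zip(segments, top):
        tokens = []
        for e in sorted(chosen):
            if e >= len(PATCH_SIZES):
                continue  # null expert: this slot contributes no tokens
            p = PATCH_SIZES[e]
            tokens.append((p, seg.reshape(segment_len // p, p)))
        routed.append(tokens)
    return routed

rng = np.random.default_rng(0)
series = rng.standard_normal(512)
router_weights = rng.standard_normal((128, len(PATCH_SIZES) + N_NULL))
for i, tokens in enumerate(route_segments(series, router_weights)):
    print(i, [(p, patches.shape) for p, patches in tokens])
```

A segment whose top-k set includes the null expert ends up with fewer active granularities, which is how the sketch mirrors the "skip unnecessary granularities" behavior described in the contributions.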

DRoPE addresses the fact that mixed patch sizes break the usual assumption that token index is a uniform proxy for elapsed time. It combines instance-specific spectral modulation of RoPE frequencies with granularity-aware position calibration, so attention can reflect both periodic structure and physical time distance.
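The position-calibration half of this idea can be sketched with standard RoPE applied to physical start times instead of token indices. The sketch below is an assumption-laden illustration: `token_start_times`, `rope_angles`, and `apply_rope` are hypothetical helper names, and the scalar `freq_scale` is only a placeholder for the instance-level spectral modulation, whose actual form is not specified in this summary.

```python
import numpy as np

def token_start_times(patch_sizes):
    """Physical start time (in time steps) of each token, given its patch size."""
    sizes = np.asarray(patch_sizes)
    return np.concatenate(([0], np.cumsum(sizes)[:-1]))

def rope_angles(positions, dim, base=10000.0, freq_scale=1.0):
    """Standard RoPE angles; `freq_scale` is a placeholder (assumption) for a
    per-instance frequency modulation."""
    inv_freq = freq_scale / base ** (np.arange(0, dim, 2) / dim)
    return np.outer(positions, inv_freq)  # (n_tokens, dim // 2)

def apply_rope(x, angles):
    """Rotate consecutive feature pairs of x (n_tokens, dim) by `angles`."""
    x1, x2 = x[:, 0::2], x[:, 1::2]
    cos, sin = np.cos(angles), np.sin(angles)
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

# Mixed patch sizes: token index 3 starts at time step 128, not at "position 3".
positions = token_start_times([32, 32, 64, 128])
print(positions)  # [  0  32  64 128]

rng = np.random.default_rng(0)
tokens = rng.standard_normal((4, 8))
rotated = apply_rope(tokens, rope_angles(positions, dim=8))
```

With calibrated positions, the attention score between two rotated tokens depends on the elapsed time between their patches rather than on how many tokens happen to lie between them.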

Evidence And Results

  • On GIFT-Eval, the paper reports that Kairos-Base achieves the best normalized MASE among the compared methods and the second-best CRPS. Kairos-Small is also reported with a better normalized MASE than larger zero-shot TSFMs such as Toto and Sundial.
  • On Time-Series-Library zero-shot forecasting, Kairos-Mini is reported to outperform recent TSFMs and most full-shot deep learning baselines in the paper’s aggregate comparison.
  • Ablations attribute the GIFT-Eval gains to the combined architecture: replacing the adaptive encoder with fixed patching, removing DRoPE, and reverting to single-patch autoregressive decoding each worsen normalized MASE.
  • Routing interventions support the segment-level adaptation claim: uniform granularity weights and shuffled routing decisions degrade performance substantially compared with the full model.
  • Matched-data comparisons suggest architecture is the primary source of the reported gains, with PreSTS adding a smaller but useful contribution.
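For readers unfamiliar with the headline metric, the sketch below shows one common way to compute MASE and to normalize it against a baseline. It assumes the benchmark convention of dividing each dataset's score by a seasonal-naive baseline and aggregating with a geometric mean; that convention and the function names are assumptions here, not details confirmed by this summary.

```python
import numpy as np

def mase(y_true, y_pred, y_train, season=1):
    """Mean absolute scaled error: forecast MAE divided by the in-sample MAE
    of a seasonal-naive forecast with period `season`."""
    scale = np.mean(np.abs(y_train[season:] - y_train[:-season]))
    return np.mean(np.abs(y_true - y_pred)) / scale

def normalized_geomean(model_scores, baseline_scores):
    """Geometric mean of per-dataset ratios model / baseline (lower is better;
    values below 1 mean the model beats the baseline on average)."""
    ratios = np.asarray(model_scores) / np.asarray(baseline_scores)
    return float(np.exp(np.mean(np.log(ratios))))

y_train = np.array([1.0, 2.0, 3.0, 4.0])
print(mase(np.array([5.0, 6.0]), np.array([5.0, 7.0]), y_train))  # 0.5
```

Because the scores are ratios, a lower normalized MASE is better, which is the sense in which the results above describe one model as "ahead of" another.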

Limitations

  • The model is focused on forecasting; anomaly detection, imputation, and broader task support are left for future versions, though the paper includes a classification-transfer appendix.
  • Channel-independent modeling means Kairos does not explicitly capture inter-variable dependencies in multivariate time series.
  • The evaluation centers on GIFT-Eval and selected TSLib datasets, so downstream users should check that their domain, sampling frequency, and forecast horizon match the benchmarks before treating the reported zero-shot results as general.
  • Because Kairos is not action-conditioned, it is not directly a world model for interventions or controllable dynamics without adding explicit control-input structure.

Foundation TSFM Relevance

Agenda slot | Verdict | Evidence | Missing pieces
Dynamic tokenization | partially closes | Uses segment-level mixture-of-size tokenization so patch granularity adapts to local information density. | Needs evidence that the router preserves spikes, missingness, and causal events beyond forecasting.
Time representation | partially closes | Dynamic RoPE modulates positional frequencies per instance to handle heterogeneous periodic structure. | Does not cover irregular event streams, calendar time, or asynchronous channels.
Data diversity and curriculum | adjacent | PreSTS is predictability-stratified and combines large real and synthetic time-series data. | Stratification targets pretraining quality, not rare-regime preservation or active curriculum.
Native multivariate encoding | insufficient evidence | Multivariate data are handled channel-independently. | Needs explicit cross-channel dynamics, channel metadata, and topology.
Control and counterfactuals | insufficient evidence | Evaluation is zero-shot passive forecasting. | Needs action, control input, intervention, and counterfactual rollout interfaces.

Open Questions

  • How much of Kairos’s advantage survives when native multivariate channel mixing is added without losing the parameter-efficiency gains from adaptive tokenization?
  • Can the segment-level router become a useful interpretability signal for regime changes, anomaly boundaries, or forecast difficulty?
  • Would explicit covariates, actions, control inputs, or interventions fit naturally into the mixture-of-size tokenization scheme, or would they require a separate event-stream representation?