---
abstract: |
  Time series forecasting plays a crucial role in data mining, driving rapid advancements across numerous industries. With the emergence of large models, time series foundation models (TSFMs) have exhibited remarkable generalization capabilities, such as zero-shot learning, through large-scale pre-training. Meanwhile, Retrieval-Augmented Generation (RAG) methods have been widely employed to enhance the performance of foundation models on unseen data, allowing models to access external knowledge. In this paper, we introduce **`\model`{=latex}**, a **R**etrieval-**A**ugmented **F**orecasting model that enhances zero-shot time series forecasting through retrieval-augmented techniques. We develop customized time series knowledge bases that are tailored to the specific forecasting tasks. TimeRAF employs an end-to-end learnable retriever to extract valuable information from the knowledge base. Additionally, we propose Channel Prompting for knowledge integration, which effectively extracts relevant information from the retrieved knowledge along the channel dimension. Extensive experiments demonstrate the effectiveness of our model, showing significant improvement across various domains and datasets.
author:
- |
  Huanyu Zhang$^{1, 2}$, Chang Xu$^{2}$, Yi-Fan Zhang$^{1}$, Zhang Zhang$^{1}$, Liang Wang$^{1}$,\
  \
  \
  **Tieniu Tan**$^{1, 3}$, **Jiang Bian**$^{2}$\
  \
  \
  $^{1}$ Institute of Automation, Chinese Academy of Sciences, $^{2}$ Microsoft Research, $^{3}$ Nanjing University\
  \
  \
  `{huanyu.zhang,yifan.zhang}@cripac.ia.ac.cn`\
  \
  \
  `{chanx,jiang.bian}@microsoft.com`\
  \
  \
  `{zzhang,wangliang,tnt}@nlpr.ia.ac.cn`
bibliography:
- main.bib
title: "TimeRAF: Retrieval-Augmented Foundation model for Zero-shot Time Series Forecasting"
---

\newcommand{\figleft}{{\em (Left)}}
\newcommand{\figcenter}{{\em (Center)}}
\newcommand{\figright}{{\em (Right)}}
\newcommand{\figtop}{{\em (Top)}}
\newcommand{\figbottom}{{\em (Bottom)}}
\newcommand{\captiona}{{\em (a)}}
\newcommand{\captionb}{{\em (b)}}
\newcommand{\captionc}{{\em (c)}}
\newcommand{\captiond}{{\em (d)}}
\def \xc #1{\textcolor{blue}{#1}}
\newcommand{\newterm}[1]{{\bf #1}}
\def\figref#1{figure~\ref{#1}}
\def\Figref#1{Figure~\ref{#1}}
\def\twofigref#1#2{figures \ref{#1} and \ref{#2}}
\def\quadfigref#1#2#3#4{figures \ref{#1}, \ref{#2}, \ref{#3} and \ref{#4}}
\def\secref#1{section~\ref{#1}}
\def\Secref#1{Section~\ref{#1}}
\def\twosecrefs#1#2{sections \ref{#1} and \ref{#2}}
\def\secrefs#1#2#3{sections \ref{#1}, \ref{#2} and \ref{#3}}
\def\eqref#1{equation~\ref{#1}}
\def\Eqref#1{Equation~\ref{#1}}
\def\plaineqref#1{\ref{#1}}
\def\chapref#1{chapter~\ref{#1}}
\def\Chapref#1{Chapter~\ref{#1}}
\def\rangechapref#1#2{chapters\ref{#1}--\ref{#2}}
\def\algref#1{algorithm~\ref{#1}}
\def\Algref#1{Algorithm~\ref{#1}}
\def\twoalgref#1#2{algorithms \ref{#1} and \ref{#2}}
\def\Twoalgref#1#2{Algorithms \ref{#1} and \ref{#2}}
\def\partref#1{part~\ref{#1}}
\def\Partref#1{Part~\ref{#1}}
\def\twopartref#1#2{parts \ref{#1} and \ref{#2}}
\def\ceil#1{\lceil #1 \rceil}
\def\floor#1{\lfloor #1 \rfloor}
\def\1{\bm{1}}
\newcommand{\train}{\mathcal{D}}
\newcommand{\valid}{\mathcal{D_{\mathrm{valid}}}}
\newcommand{\test}{\mathcal{D_{\mathrm{test}}}}
\def\eps{{\epsilon}}
\def\reta{{\textnormal{$\eta$}}}
\def\ra{{\textnormal{a}}}
\def\rb{{\textnormal{b}}}
\def\rc{{\textnormal{c}}}
\def\rd{{\textnormal{d}}}
\def\re{{\textnormal{e}}}
\def\rf{{\textnormal{f}}}
\def\rg{{\textnormal{g}}}
\def\rh{{\textnormal{h}}}
\def\ri{{\textnormal{i}}}
\def\rj{{\textnormal{j}}}
\def\rk{{\textnormal{k}}}
\def\rl{{\textnormal{l}}}
\def\rn{{\textnormal{n}}}
\def\ro{{\textnormal{o}}}
\def\rp{{\textnormal{p}}}
\def\rq{{\textnormal{q}}}
\def\rr{{\textnormal{r}}}
\def\rs{{\textnormal{s}}}
\def\rt{{\textnormal{t}}}
\def\ru{{\textnormal{u}}}
\def\rv{{\textnormal{v}}}
\def\rw{{\textnormal{w}}}
\def\rx{{\textnormal{x}}}
\def\ry{{\textnormal{y}}}
\def\rz{{\textnormal{z}}}
\def\rvepsilon{{\mathbf{\epsilon}}}
\def\rvtheta{{\mathbf{\theta}}}
\def\rva{{\mathbf{a}}}
\def\rvb{{\mathbf{b}}}
\def\rvc{{\mathbf{c}}}
\def\rvd{{\mathbf{d}}}
\def\rve{{\mathbf{e}}}
\def\rvf{{\mathbf{f}}}
\def\rvg{{\mathbf{g}}}
\def\rvh{{\mathbf{h}}}
\def\rvi{{\mathbf{i}}}
\def\rvj{{\mathbf{j}}}
\def\rvk{{\mathbf{k}}}
\def\rvl{{\mathbf{l}}}
\def\rvm{{\mathbf{m}}}
\def\rvn{{\mathbf{n}}}
\def\rvo{{\mathbf{o}}}
\def\rvp{{\mathbf{p}}}
\def\rvq{{\mathbf{q}}}
\def\rvr{{\mathbf{r}}}
\def\rvs{{\mathbf{s}}}
\def\rvt{{\mathbf{t}}}
\def\rvu{{\mathbf{u}}}
\def\rvv{{\mathbf{v}}}
\def\rvw{{\mathbf{w}}}
\def\rvx{{\mathbf{x}}}
\def\rvy{{\mathbf{y}}}
\def\rvz{{\mathbf{z}}}
\def\erva{{\textnormal{a}}}
\def\ervb{{\textnormal{b}}}
\def\ervc{{\textnormal{c}}}
\def\ervd{{\textnormal{d}}}
\def\erve{{\textnormal{e}}}
\def\ervf{{\textnormal{f}}}
\def\ervg{{\textnormal{g}}}
\def\ervh{{\textnormal{h}}}
\def\ervi{{\textnormal{i}}}
\def\ervj{{\textnormal{j}}}
\def\ervk{{\textnormal{k}}}
\def\ervl{{\textnormal{l}}}
\def\ervm{{\textnormal{m}}}
\def\ervn{{\textnormal{n}}}
\def\ervo{{\textnormal{o}}}
\def\ervp{{\textnormal{p}}}
\def\ervq{{\textnormal{q}}}
\def\ervr{{\textnormal{r}}}
\def\ervs{{\textnormal{s}}}
\def\ervt{{\textnormal{t}}}
\def\ervu{{\textnormal{u}}}
\def\ervv{{\textnormal{v}}}
\def\ervw{{\textnormal{w}}}
\def\ervx{{\textnormal{x}}}
\def\ervy{{\textnormal{y}}}
\def\ervz{{\textnormal{z}}}
\def\rmA{{\mathbf{A}}}
\def\rmB{{\mathbf{B}}}
\def\rmC{{\mathbf{C}}}
\def\rmD{{\mathbf{D}}}
\def\rmE{{\mathbf{E}}}
\def\rmF{{\mathbf{F}}}
\def\rmG{{\mathbf{G}}}
\def\rmH{{\mathbf{H}}}
\def\rmI{{\mathbf{I}}}
\def\rmJ{{\mathbf{J}}}
\def\rmK{{\mathbf{K}}}
\def\rmL{{\mathbf{L}}}
\def\rmM{{\mathbf{M}}}
\def\rmN{{\mathbf{N}}}
\def\rmO{{\mathbf{O}}}
\def\rmP{{\mathbf{P}}}
\def\rmQ{{\mathbf{Q}}}
\def\rmR{{\mathbf{R}}}
\def\rmS{{\mathbf{S}}}
\def\rmT{{\mathbf{T}}}
\def\rmU{{\mathbf{U}}}
\def\rmV{{\mathbf{V}}}
\def\rmW{{\mathbf{W}}}
\def\rmX{{\mathbf{X}}}
\def\rmY{{\mathbf{Y}}}
\def\rmZ{{\mathbf{Z}}}
\def\ermA{{\textnormal{A}}}
\def\ermB{{\textnormal{B}}}
\def\ermC{{\textnormal{C}}}
\def\ermD{{\textnormal{D}}}
\def\ermE{{\textnormal{E}}}
\def\ermF{{\textnormal{F}}}
\def\ermG{{\textnormal{G}}}
\def\ermH{{\textnormal{H}}}
\def\ermI{{\textnormal{I}}}
\def\ermJ{{\textnormal{J}}}
\def\ermK{{\textnormal{K}}}
\def\ermL{{\textnormal{L}}}
\def\ermM{{\textnormal{M}}}
\def\ermN{{\textnormal{N}}}
\def\ermO{{\textnormal{O}}}
\def\ermP{{\textnormal{P}}}
\def\ermQ{{\textnormal{Q}}}
\def\ermR{{\textnormal{R}}}
\def\ermS{{\textnormal{S}}}
\def\ermT{{\textnormal{T}}}
\def\ermU{{\textnormal{U}}}
\def\ermV{{\textnormal{V}}}
\def\ermW{{\textnormal{W}}}
\def\ermX{{\textnormal{X}}}
\def\ermY{{\textnormal{Y}}}
\def\ermZ{{\textnormal{Z}}}
\def\vzero{{\bm{0}}}
\def\vone{{\bm{1}}}
\def\vmu{{\bm{\mu}}}
\def\vtheta{{\bm{\theta}}}
\def\va{{\bm{a}}}
\def\vb{{\bm{b}}}
\def\vc{{\bm{c}}}
\def\vd{{\bm{d}}}
\def\ve{{\bm{e}}}
\def\vf{{\bm{f}}}
\def\vg{{\bm{g}}}
\def\vh{{\bm{h}}}
\def\vi{{\bm{i}}}
\def\vj{{\bm{j}}}
\def\vk{{\bm{k}}}
\def\vl{{\bm{l}}}
\def\vm{{\bm{m}}}
\def\vn{{\bm{n}}}
\def\vo{{\bm{o}}}
\def\vp{{\bm{p}}}
\def\vq{{\bm{q}}}
\def\vr{{\bm{r}}}
\def\vs{{\bm{s}}}
\def\vt{{\bm{t}}}
\def\vu{{\bm{u}}}
\def\vv{{\bm{v}}}
\def\vw{{\bm{w}}}
\def\vx{{\bm{x}}}
\def\vy{{\bm{y}}}
\def\vz{{\bm{z}}}
\def\evalpha{{\alpha}}
\def\evbeta{{\beta}}
\def\evepsilon{{\epsilon}}
\def\evlambda{{\lambda}}
\def\evomega{{\omega}}
\def\evmu{{\mu}}
\def\evpsi{{\psi}}
\def\evsigma{{\sigma}}
\def\evtheta{{\theta}}
\def\eva{{a}}
\def\evb{{b}}
\def\evc{{c}}
\def\evd{{d}}
\def\eve{{e}}
\def\evf{{f}}
\def\evg{{g}}
\def\evh{{h}}
\def\evi{{i}}
\def\evj{{j}}
\def\evk{{k}}
\def\evl{{l}}
\def\evm{{m}}
\def\evn{{n}}
\def\evo{{o}}
\def\evp{{p}}
\def\evq{{q}}
\def\evr{{r}}
\def\evs{{s}}
\def\evt{{t}}
\def\evu{{u}}
\def\evv{{v}}
\def\evw{{w}}
\def\evx{{x}}
\def\evy{{y}}
\def\evz{{z}}
\def\mA{{\bm{A}}}
\def\mB{{\bm{B}}}
\def\mC{{\bm{C}}}
\def\mD{{\bm{D}}}
\def\mE{{\bm{E}}}
\def\mF{{\bm{F}}}
\def\mG{{\bm{G}}}
\def\mH{{\bm{H}}}
\def\mI{{\bm{I}}}
\def\mJ{{\bm{J}}}
\def\mK{{\bm{K}}}
\def\mL{{\bm{L}}}
\def\mM{{\bm{M}}}
\def\mN{{\bm{N}}}
\def\mO{{\bm{O}}}
\def\mP{{\bm{P}}}
\def\mQ{{\bm{Q}}}
\def\mR{{\bm{R}}}
\def\mS{{\bm{S}}}
\def\mT{{\bm{T}}}
\def\mU{{\bm{U}}}
\def\mV{{\bm{V}}}
\def\mW{{\bm{W}}}
\def\mX{{\bm{X}}}
\def\mY{{\bm{Y}}}
\def\mZ{{\bm{Z}}}
\def\mBeta{{\bm{\beta}}}
\def\mPhi{{\bm{\Phi}}}
\def\mLambda{{\bm{\Lambda}}}
\def\mSigma{{\bm{\Sigma}}}
\newcommand{\tens}[1]{\bm{\mathsfit{#1}}}
\def\tA{{\tens{A}}}
\def\tB{{\tens{B}}}
\def\tC{{\tens{C}}}
\def\tD{{\tens{D}}}
\def\tE{{\tens{E}}}
\def\tF{{\tens{F}}}
\def\tG{{\tens{G}}}
\def\tH{{\tens{H}}}
\def\tI{{\tens{I}}}
\def\tJ{{\tens{J}}}
\def\tK{{\tens{K}}}
\def\tL{{\tens{L}}}
\def\tM{{\tens{M}}}
\def\tN{{\tens{N}}}
\def\tO{{\tens{O}}}
\def\tP{{\tens{P}}}
\def\tQ{{\tens{Q}}}
\def\tR{{\tens{R}}}
\def\tS{{\tens{S}}}
\def\tT{{\tens{T}}}
\def\tU{{\tens{U}}}
\def\tV{{\tens{V}}}
\def\tW{{\tens{W}}}
\def\tX{{\tens{X}}}
\def\tY{{\tens{Y}}}
\def\tZ{{\tens{Z}}}
\def\gA{{\mathcal{A}}}
\def\gB{{\mathcal{B}}}
\def\gC{{\mathcal{C}}}
\def\gD{{\mathcal{D}}}
\def\gE{{\mathcal{E}}}
\def\gF{{\mathcal{F}}}
\def\gG{{\mathcal{G}}}
\def\gH{{\mathcal{H}}}
\def\gI{{\mathcal{I}}}
\def\gJ{{\mathcal{J}}}
\def\gK{{\mathcal{K}}}
\def\gL{{\mathcal{L}}}
\def\gM{{\mathcal{M}}}
\def\gN{{\mathcal{N}}}
\def\gO{{\mathcal{O}}}
\def\gP{{\mathcal{P}}}
\def\gQ{{\mathcal{Q}}}
\def\gR{{\mathcal{R}}}
\def\gS{{\mathcal{S}}}
\def\gT{{\mathcal{T}}}
\def\gU{{\mathcal{U}}}
\def\gV{{\mathcal{V}}}
\def\gW{{\mathcal{W}}}
\def\gX{{\mathcal{X}}}
\def\gY{{\mathcal{Y}}}
\def\gZ{{\mathcal{Z}}}
\def\sA{{\mathbb{A}}}
\def\sB{{\mathbb{B}}}
\def\sC{{\mathbb{C}}}
\def\sD{{\mathbb{D}}}
\def\sF{{\mathbb{F}}}
\def\sG{{\mathbb{G}}}
\def\sH{{\mathbb{H}}}
\def\sI{{\mathbb{I}}}
\def\sJ{{\mathbb{J}}}
\def\sK{{\mathbb{K}}}
\def\sL{{\mathbb{L}}}
\def\sM{{\mathbb{M}}}
\def\sN{{\mathbb{N}}}
\def\sO{{\mathbb{O}}}
\def\sP{{\mathbb{P}}}
\def\sQ{{\mathbb{Q}}}
\def\sR{{\mathbb{R}}}
\def\sS{{\mathbb{S}}}
\def\sT{{\mathbb{T}}}
\def\sU{{\mathbb{U}}}
\def\sV{{\mathbb{V}}}
\def\sW{{\mathbb{W}}}
\def\sX{{\mathbb{X}}}
\def\sY{{\mathbb{Y}}}
\def\sZ{{\mathbb{Z}}}
\def\emLambda{{\Lambda}}
\def\emA{{A}}
\def\emB{{B}}
\def\emC{{C}}
\def\emD{{D}}
\def\emE{{E}}
\def\emF{{F}}
\def\emG{{G}}
\def\emH{{H}}
\def\emI{{I}}
\def\emJ{{J}}
\def\emK{{K}}
\def\emL{{L}}
\def\emM{{M}}
\def\emN{{N}}
\def\emO{{O}}
\def\emP{{P}}
\def\emQ{{Q}}
\def\emR{{R}}
\def\emS{{S}}
\def\emT{{T}}
\def\emU{{U}}
\def\emV{{V}}
\def\emW{{W}}
\def\emX{{X}}
\def\emY{{Y}}
\def\emZ{{Z}}
\def\emSigma{{\Sigma}}
\newcommand{\etens}[1]{\mathsfit{#1}}
\def\etLambda{{\etens{\Lambda}}}
\def\etA{{\etens{A}}}
\def\etB{{\etens{B}}}
\def\etC{{\etens{C}}}
\def\etD{{\etens{D}}}
\def\etE{{\etens{E}}}
\def\etF{{\etens{F}}}
\def\etG{{\etens{G}}}
\def\etH{{\etens{H}}}
\def\etI{{\etens{I}}}
\def\etJ{{\etens{J}}}
\def\etK{{\etens{K}}}
\def\etL{{\etens{L}}}
\def\etM{{\etens{M}}}
\def\etN{{\etens{N}}}
\def\etO{{\etens{O}}}
\def\etP{{\etens{P}}}
\def\etQ{{\etens{Q}}}
\def\etR{{\etens{R}}}
\def\etS{{\etens{S}}}
\def\etT{{\etens{T}}}
\def\etU{{\etens{U}}}
\def\etV{{\etens{V}}}
\def\etW{{\etens{W}}}
\def\etX{{\etens{X}}}
\def\etY{{\etens{Y}}}
\def\etZ{{\etens{Z}}}
\newcommand{\pdata}{p_{\rm{data}}}
\newcommand{\ptrain}{\hat{p}_{\rm{data}}}
\newcommand{\Ptrain}{\hat{P}_{\rm{data}}}
\newcommand{\pmodel}{p_{\rm{model}}}
\newcommand{\Pmodel}{P_{\rm{model}}}
\newcommand{\ptildemodel}{\tilde{p}_{\rm{model}}}
\newcommand{\pencode}{p_{\rm{encoder}}}
\newcommand{\pdecode}{p_{\rm{decoder}}}
\newcommand{\precons}{p_{\rm{reconstruct}}}
\newcommand{\laplace}{\mathrm{Laplace}}
\newcommand{\E}{\mathbb{E}}
\newcommand{\Ls}{\mathcal{L}}
\newcommand{\R}{\mathbb{R}}
\newcommand{\emp}{\tilde{p}}
\newcommand{\lr}{\alpha}
\newcommand{\reg}{\lambda}
\newcommand{\rect}{\mathrm{rectifier}}
\newcommand{\softmax}{\mathrm{softmax}}
\newcommand{\sigmoid}{\sigma}
\newcommand{\softplus}{\zeta}
\newcommand{\KL}{D_{\mathrm{KL}}}
\newcommand{\Var}{\mathrm{Var}}
\newcommand{\standarderror}{\mathrm{SE}}
\newcommand{\Cov}{\mathrm{Cov}}
\newcommand{\normlzero}{L^0}
\newcommand{\normlone}{L^1}
\newcommand{\normltwo}{L^2}
\newcommand{\normlp}{L^p}
\newcommand{\normmax}{L^\infty}
\newcommand{\parents}{Pa}
\DeclareMathOperator*{\argmax}{arg\,max}
\DeclareMathOperator*{\argmin}{arg\,min}
\DeclareMathOperator{\sign}{sign}
\DeclareMathOperator{\Tr}{Tr}
\let\ab\allowbreak
\newcommand{\model}[0]{TimeRAF\xspace}
\newcommand{\modelr}[0]{TimeRAF\textsubscript{R}\xspace}
\newcommand{\modelh}[0]{TimeRAF\textsubscript{H}\xspace}
\newcommand{\modeld}[0]{TimeRAF\textsubscript{D}\xspace}
\newcommand{\fix}{\marginpar{FIX}}
\newcommand{\new}{\marginpar{NEW}}
\maketitle

# Introduction {#sec:intro}

Time series (TS) forecasting has gained significant popularity in recent years due to its vital role in various domains, including finance [@yu2023finance], healthcare [@li2024frozen], weather [@wu2023weather_forecasting], and traffic [@jin2021trafficbert]. Earlier approaches typically learn from single-domain, small-scale datasets [@nie2023patchtst; @zeng2023dlinear], which inherently constrains their generalization capabilities. However, the landscape of time series analysis is evolving rapidly with the advent of large models. Time series foundation models (TSFMs), trained on large-scale, multi-domain datasets, have demonstrated zero-shot learning abilities, revolutionizing various time series domains and diverse applications [@liang2024tsfm; @woo2024moirai; @liu2024timer].

Meanwhile, Retrieval-Augmented Generation (RAG) is an increasingly prevalent technique that enhances the capabilities of foundation models in various domains, including text generation [@karpukhin2020dpr] and image generation [@chen2023reimagen]. This approach allows models to access external knowledge through various information retrieval techniques, enabling them to gather supplementary information during the generation process. Typically, the retrieved knowledge can be sourced from external datasets in the same format as the training corpus. For instance, in dialogue systems, RAG can help generate more contextually relevant responses by retrieving previous dialogues or similar interactions from a database [@huang2023lapdog]. However, similar studies have received little attention in the time series domain. A natural question arises: ***Can the integration of time series foundation models with retrieval-augmented methods also improve performance, particularly in challenging scenarios that require strong generalization abilities, such as zero-shot forecasting?***

As an intuitive example, a pre-trained model trained on a general time series dataset may struggle when forecasting for specific domains such as weather patterns in a particular region, which is illustrated in the left plot of Figure `\ref{fig:intro}`{=latex}. However, by accessing domain-specific external knowledge bases, the model could dynamically retrieve relevant information---such as time series data from similar weather conditions---without requiring extensive parameter updates. This allows the model to integrate domain-specific prior knowledge, improving its zero-shot forecasting capability. In this manner, retrieved external data provides valuable context and serves as an additional source of prior information, enabling more accurate predictions. These advancements motivate our exploration of **R**etrieval-**A**ugmented for time series **F**orecasting (RAF). However, designing an effective RAF framework for time series forecasting involves several key challenges: **(1) *What types of data can serve as knowledge bases to support time series models?* (2) *How can relevant knowledge be retrieved when encountering inputs from diverse domains?* (3) *How can retrieved knowledge be effectively integrated to improve model performance?***

<figure id="fig:intro">
<div class="center">
<img src="imgs/intro.png" style="width:87.0%" />
</div>
<figcaption><strong>Left:</strong> Time series foundation models (TSFMs), while capable of zero-shot forecasting, are limited by insufficient prior knowledge, resulting in constrained prediction accuracy. <strong>Right:</strong> By dynamically retrieving relevant information from an external knowledge base, our TimeRAF enhances prediction accuracy, leading to more precise zero-shot forecasting performance.</figcaption>
</figure>

To address these challenges, we introduce `\model`{=latex}, a novel framework designed to leverage retrieval-augmented generation techniques for time series foundation models. As shown on the right of Figure `\ref{fig:intro}`{=latex}, by retrieving and integrating external time series data, we aim to overcome the limitations of existing TSFMs and enhance zero-shot time series forecasting performance. `\model `{=latex}consists of a retriever that scores and selects relevant time series data from an external knowledge base. The knowledge base can either be a comprehensive database composed of multiple datasets across various domains or a domain-specific database comprising a single dataset relevant to the test data. Furthermore, an end-to-end learnable retrieval methodology is introduced to ensure that the retrieved data actually improves forecasting. To leverage the retrieved time series, we introduce an effective approach, named Channel Prompting, to integrate the knowledge from the retrieved data. Our extensive experiments on various datasets demonstrate that `\model `{=latex}achieves a substantial improvement over the TSFM and outperforms several existing zero-shot time series forecasting methods.

Overall, our contributions can be summarized as follows:

- We propose `\model`{=latex}, a novel framework that leverages retrieval augmentation techniques to enhance zero-shot time series forecasting. By retrieving relevant data from an external knowledge base and effectively integrating the retrieved information, `\model `{=latex}supplements the pre-trained knowledge of foundation models, enhancing their forecasting capabilities.

- We employ a learnable retriever to calculate retrieval scores for time series within the knowledge base and select the best options. To integrate retrieved knowledge, we introduce Channel Prompting to extract valuable information from the retrieved data effectively.

- Our `\model `{=latex}demonstrates significant improvement through the incorporation of RAF into TSFM and even outperforms several full-shot methods. Furthermore, we present comprehensive ablation studies and visualizations to evaluate the efficacy of our approach.

# Related Work

**Foundation Models for Zero-shot Time Series Forecasting:** Recent years have witnessed the rise of TSFMs. TimeGPT-1 [@garza2023timegpt] is the first closed-source model offering zero-shot forecasting capabilities. ForecastPFN [@dooley2024forecastpfn], pre-trained on synthetic time series data, serves as a zero-shot forecaster, but excels primarily in data- or time-limited settings. Lag-llama [@rasul2023lag] leverages the LLaMA architecture [@touvron2023llama] with lagged time series features for time series forecasting. TimesFM [@das2024timefm] is a patch-based, decoder-only foundation model designed for time series forecasting, which employs a larger output patch size to enhance decoding efficiency. The model is pre-trained on a comprehensive dataset sourced from Google Trends and Wikipedia pageviews, in combination with open data. MOIRAI [@woo2024moirai] introduces LOTSA, a large-scale collection of open time series datasets, and utilizes it to train a foundation model based on a masked encoder architecture. MOIRAI achieves competitive or superior performance as a zero-shot forecaster when compared to full-shot models. Tiny Time Mixers (TTMs) [@ekambaram2024ttm] leverage a lightweight mixer-style architecture and demonstrate remarkable zero-shot forecasting performance. Since TSFMs have shown potential in zero-shot time series forecasting, our approach aims to enhance their generalization capabilities by applying RAG techniques to leverage external knowledge.

**Retrieval Augmented Generation for Foundation Models:** Foundation models such as LLMs have achieved remarkable success, though they still face limitations in domain-specific or knowledge-intensive tasks. To address these challenges, various RAG methods have been proposed: DocPrompting [@zhou2022doccoder] curates a retrieval annotation dataset to train a retriever for augmenting input in code generation. DPR [@karpukhin2020dpr] develops a dense embedding model for indexing passages in a low-dimensional, continuous space. RePlug [@shi2023replug] refines the retriever by distilling the knowledge from the language model's probability. LAPDOG [@huang2023lapdog] introduces an end-to-end dense retriever framework specifically for personalized dialogue generation, emphasizing objective optimization. Beyond NLP tasks, RAG has also been applied to other domains: REACT [@liu2023react] freezes the original model and updates only the additional trainable weights on the retrieved knowledge, significantly enhancing the visual model's zero-shot performance. Re-Imagen [@chen2023reimagen] uses retrieved information to produce high-fidelity and faithful images, even for rare or unseen entities. Additionally, in time series analysis, ReTime [@jing2022retime] retrieves relational references to improve forecasting and imputation for incomplete target time series, while RATSF [@wang2024ratsf] develops a cross-attention module to integrate historical data for enhanced prediction accuracy. However, existing time series RAG methods are either limited to historical data or do not support foundation models. In contrast, our `\model `{=latex}enhances TSFMs by extracting information from external knowledge bases that can be tailored to specific domains.

# Method {#sec:method}

## Overview

An illustration of our `\model `{=latex}framework is provided in Figure `\ref{fig:overview}`{=latex}. First, a retriever learns to retrieve relevant data from the external knowledge base (refer to `\secref{sec:retrieval}`{=latex}). Following this, the proposed Channel Prompting approach is employed to integrate the retrieved knowledge. The entire forecaster $\mathcal{F}$ is therefore capable of harnessing external knowledge, facilitating knowledge-enhanced forecasting (refer to `\secref{sec:integration}`{=latex}). During training, the backbone of the TSFM remains frozen. Details of the training and inference processes are provided in `\secref{sec:train}`{=latex} and `\secref{sec:inference}`{=latex}. The knowledge bases utilized for training and inference are detailed in `\secref{sec:kb}`{=latex}.

<figure id="fig:overview">
<div class="center">
<img src="imgs/overview.png" style="width:88.0%" />
</div>
<figcaption><strong>Overview of TimeRAF:</strong> TimeRAF utilizes a retriever to dynamically retrieve relevant candidates from an external knowledge base and then utilizes the proposed Channel Prompting module to integrate knowledge between the retrieved data and the input. The knowledge-enhanced embeddings are subsequently fed into the backbone of the foundation model to improve forecasting results. During training, the backbone remains frozen.</figcaption>
</figure>

## Problem Formulation

Following previous work [@nie2023patchtst], we employ the channel-independent strategy. Let $\bm{X} \in \mathbb{R}^{sl \times c}$ be a multivariate time series of length $sl$ with $c$ channels. The input for a single channel is denoted as $\vx\in \mathbb{R}^{sl \times 1}$, and the forecasting task is formally defined as predicting the future values $\hat{\vy} \in \mathbb{R}^{fl \times 1}$ given the history/lookback window $\vx$. Once all channels have been processed, the final prediction $\hat{\bm{Y}} \in \mathbb{R}^{fl \times c}$ is obtained. Here, $fl$ denotes the forecast length/horizon. The ground truth is denoted as $\bm{Y} \in \mathbb{R}^{fl \times c}$. In the zero-shot forecasting setting, the model generates predictions for datasets that have not been encountered during training. Given a set of retrieved time series data $\bm{C}$ from the external knowledge base (details will be elaborated in Section `\ref{sec:retrieval}`{=latex}), we aim to leverage the valuable information within them to enhance the forecasting capability of the forecaster $\mathcal{F}$. The entire process can be formulated as $\hat{\bm{Y}}= \mathcal{F}(\bm{X}, \bm{C})$.
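
As a minimal illustration of the channel-independent strategy described above, the following sketch applies a univariate forecaster to each channel separately and stacks the results. The names `forecast_channelwise` and `naive_last_value` are hypothetical, the series is stored channel-major (the transpose of $\bm{X}$) for convenience, and the last-value forecaster is only a stand-in for $\mathcal{F}$, not the model used in this paper.

```python
from typing import Callable, List

Series = List[float]


def forecast_channelwise(
    X: List[Series], forecaster: Callable[[Series], Series]
) -> List[Series]:
    """Apply a univariate forecaster to every channel independently.

    X is stored channel-major: X[j] is the lookback window of channel j
    (length sl). The result holds one forecast of length fl per channel.
    """
    return [forecaster(x) for x in X]


def naive_last_value(x: Series, fl: int = 3) -> Series:
    # Hypothetical stand-in for the forecaster: repeat the last observation.
    return [x[-1]] * fl


# c = 2 channels, lookback length sl = 4, horizon fl = 3.
X = [[1.0, 2.0, 3.0, 4.0], [10.0, 9.0, 8.0, 7.0]]
Y_hat = forecast_channelwise(X, naive_last_value)
```

Because every channel is handled as its own univariate query, retrieval and knowledge integration in the later sections can likewise operate on single-channel windows of shape $sl \times 1$.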

## Knowledge Base {#sec:kb}

To facilitate knowledge retrieval, a knowledge base must first be established. To make knowledge integration and extraction efficient, we preprocess all sequences within the knowledge base so that they match the length of the lookback window, resulting in the following representation: $\text{Knowledge Base} = \{\vt_i | \vt_i \in \mathbb{R}^{sl \times 1}\}_{i=1}^{n_{\text{kb}}}$, where $n_{\text{kb}}$ represents the size of the knowledge base. To maintain the generalization capabilities of the foundation model, we use multi-domain datasets for training, similar to the pre-training phase of the foundation model, which will be detailed in `\secref{sec:datasets}`{=latex}.

Subsequently, we apply a sliding window with the same window size as the input of the foundation model across the training datasets. Taking the scale of each sub-dataset into account, we build a knowledge base in which every domain contributes an equal proportion, keeping it balanced. Additionally, the windows stored in the knowledge base do not overlap. During training, to prevent data leakage caused by accessing future sequences, the retriever is constrained to retrieve information solely from datasets that are distinct from the input source. During inference, `\model `{=latex}can either use the extensive multi-domain knowledge base that we have developed or opt for a domain-specific dataset as the knowledge base, depending on the specific requirements.
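
The construction above can be sketched as follows. This is an illustrative reading under stated assumptions rather than the authors' pipeline: the helper names (`non_overlapping_windows`, `build_knowledge_base`, `candidates_for`) are hypothetical, balance is approximated by sampling the same number of windows per domain, and each domain is treated as a single source for the leakage filter.

```python
import random
from typing import Dict, List, Tuple

Window = List[float]


def non_overlapping_windows(series: List[float], sl: int) -> List[Window]:
    """Cut a 1-D series into consecutive windows of length sl, stride sl."""
    return [series[i:i + sl] for i in range(0, len(series) - sl + 1, sl)]


def build_knowledge_base(
    domains: Dict[str, List[List[float]]], sl: int, per_domain: int, seed: int = 0
) -> List[Tuple[str, Window]]:
    """Collect windows per domain and keep an equal number from each one."""
    rng = random.Random(seed)
    kb: List[Tuple[str, Window]] = []
    for name, series_list in domains.items():
        windows = [w for s in series_list for w in non_overlapping_windows(s, sl)]
        rng.shuffle(windows)
        kb.extend((name, w) for w in windows[:per_domain])
    return kb


def candidates_for(kb: List[Tuple[str, Window]], query_source: str) -> List[Window]:
    """Training-time filter: only windows from sources other than the query's."""
    return [w for name, w in kb if name != query_source]


# Toy data: two hypothetical domains with one series each.
domains = {
    "energy": [[float(v) for v in range(10)]],
    "traffic": [[float(v) for v in range(100, 112)]],
}
kb = build_knowledge_base(domains, sl=4, per_domain=2)
train_candidates = candidates_for(kb, "energy")
```

At inference time the filter would simply be skipped, or the knowledge base replaced by windows from a single domain-specific dataset.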

## Knowledge Integration {#sec:integration}

Given $k$ retrieved time series data $\bm{C} = \{ \vc_1, \vc_2, \ldots, \vc_k \}$ from the external knowledge base (details will be discussed in Section `\ref{sec:retrieval}`{=latex}), we aim to leverage the valuable information within them, thereby complementing the pre-trained knowledge of TSFMs to enhance forecasting performance.

Following the preprocessing procedure of the TSFM, each sequence $\vc_i$ in $\bm{C}$ undergoes normalization followed by patching, analogous to the input. These patches are then processed through a projection layer to derive their respective embeddings. Let $\widetilde{\vx}\in \mathbb{R}^{n \times d}$ denote the input embedding and $\widetilde{\vc_i}\in \mathbb{R}^{n \times d}$ the embedding of the $i$th retrieved candidate. Here, $n$ denotes the number of patches, while $d$ indicates the dimensionality of the embedding.

Channel Prompting begins with a flatten operation on the embeddings of both the input and the retrieved candidates. The flattened input embedding is then concatenated with the flattened embedding of each retrieved candidate: $$\begin{equation}
    \vz_i = \text{Concat}(\text{Flatten}(\widetilde{\vx}), \text{Flatten}(\widetilde{\vc_i})) .
\end{equation}$$ By integrating the input embedding with the external knowledge embedding, the representation of the input is enriched with supplementary contextual information. Furthermore, after obtaining the concatenated embedding $\vz_i\in \mathbb{R}^{2nd}$, the foundation model is better positioned to comprehend the lookback window through the incorporation of domain-specific knowledge or external facts.

Subsequently, we employ an MLP to extract and combine the most relevant information from both the lookback window and the retrieved candidates. This process compresses the combined representation into a more meaningful and compact form. In particular, the concatenated embedding $\vz$ is compressed back to the original dimensions expected by the foundation model, yielding $\widetilde{\vz}\in \mathbb{R}^{ n \times d}$. In addition, the original input embedding is reintroduced through a residual connection to ensure that the information from the lookback window is fully preserved. The entire process can be formulated as follows:

$$\begin{equation}
    \widetilde{\vx}^{\ast} = \widetilde{\vx} + \widetilde{\vz} = \widetilde{\vx} + \text{MLP}(\vz) .
\end{equation}$$

In the case where $k$ candidates are retrieved, each candidate undergoes the aforementioned processing steps, resulting in $k$ embeddings denoted as $\widetilde{\vz} = \{ \widetilde{\vz_1}, \widetilde{\vz_2}, \ldots, \widetilde{\vz_k} \}$. Prior to applying the residual connection, we compute the average of these embeddings, i.e. $\widetilde{\vx}^{\ast} = \widetilde{\vx} + \text{Avg}(\text{MLP}(\vz_1), \ldots, \text{MLP}(\vz_k))$. By weighing the importance of different components of the combined embedding, the foundation model is empowered to incorporate additional contextual information from the external knowledge base. Consequently, the final knowledge-enhanced input embedding $\widetilde{\vx}^{\ast}$ is fed into the foundation model backbone, thereby enhancing prediction accuracy.
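
The Channel Prompting steps above (flatten, concatenate, compress with an MLP, average over the $k$ candidates, add the residual) can be sketched in plain Python as follows. This is a toy sketch under assumptions, not the authors' implementation: the two-layer MLP with a ReLU hidden layer, the hidden width, the random initialization, and all function names are hypothetical choices, since the text only specifies that an MLP maps $\mathbb{R}^{2nd}$ back to $\mathbb{R}^{n \times d}$.

```python
import random
from typing import List, Tuple

Matrix = List[List[float]]
Params = Tuple[Matrix, List[float], Matrix, List[float]]


def flatten(m: Matrix) -> List[float]:
    """Row-major flatten of an n x d embedding into a vector of length n*d."""
    return [v for row in m for v in row]


def linear(W: Matrix, b: List[float], v: List[float]) -> List[float]:
    """Affine map W v + b, with W stored as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) + bi for row, bi in zip(W, b)]


def mlp(z: List[float], params: Params) -> List[float]:
    """Two-layer MLP (ReLU hidden layer is an assumption): 2nd -> nd."""
    W1, b1, W2, b2 = params
    hidden = [max(0.0, a) for a in linear(W1, b1, z)]
    return linear(W2, b2, hidden)


def channel_prompting(x_emb: Matrix, cand_embs: List[Matrix], params: Params) -> Matrix:
    n, d = len(x_emb), len(x_emb[0])
    x_flat = flatten(x_emb)
    # z_i = Concat(Flatten(x), Flatten(c_i)); MLP(z_i) has length n*d.
    outs = [mlp(x_flat + flatten(c), params) for c in cand_embs]
    # Element-wise average of the k compressed embeddings.
    avg = [sum(o[j] for o in outs) / len(outs) for j in range(n * d)]
    # Residual connection, reshaped back to n x d.
    return [[x_emb[r][col] + avg[r * d + col] for col in range(d)] for r in range(n)]


def random_matrix(rows: int, cols: int, rng: random.Random) -> Matrix:
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
n, d, k, hidden_dim = 2, 3, 2, 4  # toy sizes
x_emb = random_matrix(n, d, rng)
cand_embs = [random_matrix(n, d, rng) for _ in range(k)]
params = (
    random_matrix(hidden_dim, 2 * n * d, rng), [0.0] * hidden_dim,
    random_matrix(n * d, hidden_dim, rng), [0.0] * (n * d),
)
x_star = channel_prompting(x_emb, cand_embs, params)  # shape n x d
```

One consequence of the residual form is visible in this sketch: if the output layer of the MLP is zero, $\widetilde{\vx}^{\ast}$ reduces exactly to $\widetilde{\vx}$, so the module can fall back to the unmodified input embedding when the retrieved data carries no useful signal.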

## Knowledge Retrieval {#sec:retrieval}

Inspired by DPR [@karpukhin2020dpr], we employ a dual-encoder retriever to efficiently obtain relevant information from the external knowledge base.

### Knowledge Retrieval Learning

The retriever adopts an MLP-based encoder to embed the query and the candidates separately. In `\model`{=latex}, we utilize the input directly as the query. Then, the retriever calculates the dot product similarity score between the query and each candidate using their respective embeddings. Finally, the candidates with the $k$ highest similarity scores are retrieved, denoted as $\bm{C} = \{ \vc_1, \vc_2, \ldots, \vc_k \}$.
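
A minimal sketch of this scoring and top-$k$ selection is shown below. It rests on assumptions: a single linear layer stands in for the MLP-based encoder, the query and candidate encoders use separate weight matrices (`W_q`, `W_c`), and the function names are hypothetical; whether the actual encoders share weights is not specified here.

```python
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def encode(W: Matrix, v: Vector) -> Vector:
    """Single linear layer standing in for the MLP-based encoder (assumption)."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def retrieve_top_k(
    query: Vector, kb: List[Vector], W_q: Matrix, W_c: Matrix, k: int
) -> Tuple[List[int], List[float]]:
    """Return indices and scores of the k candidates with the highest dot product."""
    q = encode(W_q, query)
    scores = [dot(q, encode(W_c, c)) for c in kb]
    top = sorted(range(len(kb)), key=lambda i: scores[i], reverse=True)[:k]
    return top, [scores[i] for i in top]


# Toy example with identity encoders so the scores are easy to check by hand.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
query = [1.0, 0.0, 0.0]
kb = [[0.0, 1.0, 0.0], [2.0, 0.0, 0.0], [1.0, 1.0, 0.0], [-1.0, 0.0, 0.0]]
top_idx, top_scores = retrieve_top_k(query, kb, identity, identity, k=2)
```

With identity encoders the four scores are 0, 2, 1 and -1, so the second and third candidates are selected.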

Intuitively, by augmenting the model with retrieved knowledge, the goal is to improve predictions based on desired metrics, such as Mean Squared Error (MSE). However, it is challenging to guarantee that retrieved candidates with higher similarity scores will consistently provide more useful knowledge for forecasting. To address this, we employ the foundation model $\mathcal{F}$ as an evaluator, leveraging its strong forecasting capability to provide feedback and guide the selection of knowledge.

Specifically, using the retrieved candidate $\vc_i$, we employ the foundation model $\mathcal{F}$ to obtain the metric value of the prediction $\hat{\vy}=\mathcal{F}(\vx,\vc_i)$. If $\mathcal{F}$ finds that integrating the knowledge from $\vc_i$ is beneficial for forecasting, we encourage the retriever to assign $\vc_i$ a higher score. In this way, the model can automatically judge the usefulness of the candidates and learn to retrieve more helpful candidates from the knowledge base. To implement this learning strategy, we first transform the metric values into a probability distribution as:

$$\begin{equation}
\label{equ:prob}
    \vp_i = \frac{\text{exp}(\frac{1}{\tau_m}\text{M}(\mathcal{F}(\vx,\vc_i), \vy))}{\sum^{k}_{j=1}\text{exp}(\frac{1}{\tau_m}\text{M}(\mathcal{F}(\vx,\vc_j), \vy))} ,
\end{equation}$$ where $\text{M}(\hat{\vy}, \vy)$ denotes the metric function to evaluate the quality of the prediction $\hat{\vy}$ given the ground truth $\vy$ and $\tau_m$ is a temperature hyperparameter to control the sensitivity of the metric. Here the metric function satisfies that a higher value of $\text{M}(\cdot, \cdot)$ indicates better performance. If a smaller value of $\text{M}(\cdot, \cdot)$ indicates better performance, we can replace $\text{M}(\cdot, \cdot)$ with $-\text{M}(\cdot, \cdot)$ in `\eqref{equ:prob}`{=latex}.

It is evident that a beneficial $\vc_i$ will correspond to a higher $\vp_i$, allowing $\vp_i$ to serve as a supervision signal to guide the learning of the retriever. In particular, we aim to align the similarity scores generated by the retriever with $\bm{P}=\{\vp_i\}_{i=1}^{k}$. Formally, suppose we have the top-$k$ retrieved candidates $\bm{C}_q$ along with their associated retrieval scores $\bm{S}_q \in \mathbb{R}^k$ with respect to the query $\vq$. We then minimize the Kullback-Leibler divergence between $\bm{P}$ and the softmax-normalized scores $\bm{S}_q$ as follows: $$\begin{equation}
\label{equ:l_r}
    \mathcal{L}_R = \KL(\bm{P}, \softmax (\bm{S}_q/ \tau_s)),
\end{equation}$$ where $\KL$ denotes the KL divergence and $\tau_s$ is a temperature hyperparameter to control the sensitivity of the similarity scores.
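
The loss in `\eqref{equ:l_r}`{=latex} can be sketched as follows. This is an illustration under stated assumptions: we read $\KL(\bm{P}, \cdot)$ as $\mathrm{KL}(\bm{P}\,\|\,\cdot)$ summed over the $k$ candidates, and the function names and toy inputs are hypothetical.

```python
import math

def softmax(scores, tau=1.0):
    """Temperature-scaled softmax over a list of scores."""
    scaled = [s / tau for s in scores]
    shift = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - shift) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def retriever_kl_loss(target_probs, retrieval_scores, tau_s=1.0, eps=1e-12):
    """KL(P || softmax(S_q / tau_s)), summed over the k candidates."""
    q = softmax(retrieval_scores, tau_s)
    return sum(
        p * math.log(p / max(qi, eps))
        for p, qi in zip(target_probs, q)
        if p > 0.0  # terms with p = 0 contribute nothing
    )

P = [0.1, 0.2, 0.7]
# Scores whose softmax equals P give (numerically) zero loss.
aligned = retriever_kl_loss(P, [math.log(p) for p in P])
# Scores that rank the candidates in the opposite order are penalized.
misaligned = retriever_kl_loss(P, [2.0, 1.0, 0.0])
```

In an actual training loop the gradient of this loss would flow only into the retriever scores, with $\bm{P}$ treated as a fixed target.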

However, during training, there is a risk that the retriever becomes stuck in a local optimum and consistently retrieves a limited set or a narrow range of candidates. Consequently, the forecaster fails to learn from the retriever and disregards the retrieved knowledge. To address this issue, we employ a straightforward augmentation strategy that incorporates randomly sampled data from the knowledge base to promote broader exploration of candidates. Specifically, we first replace each $\vc_i$ with a randomly selected candidate $\vc_i^{\mathrm{aug}}$ with probability $\rho$, yielding $\bm{C}_q^{\mathrm{aug}}$. Then the dot-product similarity between the query $\vq$ and each candidate $\vc_i^{\mathrm{aug}}$ is recomputed to give the retrieval scores $\bm{S}_q^{\mathrm{aug}} = \{\vs_i^{\mathrm{aug}}\}_{i=1}^{k}$. Finally, with $\bm{P}^{\mathrm{aug}}$ obtained by applying `\eqref{equ:prob}`{=latex} to $\bm{C}_q^{\mathrm{aug}}$, and based on `\eqref{equ:l_r}`{=latex}, we minimize the following loss to update the retriever:

$$\begin{equation}
    \mathcal{L}_R^{\mathrm{aug}} = \KL(\bm{P}^{\mathrm{aug}}, \softmax (\bm{S}_q^{\mathrm{aug}}/ \tau_s)).
\end{equation}$$
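
The augmentation step can be sketched as below. This is a hedged illustration: the helper names, the uniform sampling over the whole knowledge base, and the toy embeddings are assumptions, and the sketch does not prevent a random draw from duplicating an already retrieved candidate.

```python
import random

def dot(u, v):
    """Dot-product similarity between two embedding vectors."""
    return sum(a * b for a, b in zip(u, v))

def augment_candidates(query_emb, candidate_ids, kb_embs, rho, rng):
    """Replace each retrieved candidate by a random knowledge-base entry
    with probability rho, then recompute dot-product retrieval scores."""
    aug_ids = []
    for idx in candidate_ids:
        if rng.random() < rho:
            idx = rng.randrange(len(kb_embs))  # uniform random replacement
        aug_ids.append(idx)
    aug_scores = [dot(query_emb, kb_embs[i]) for i in aug_ids]
    return aug_ids, aug_scores

# Toy knowledge base of 10 two-dimensional embeddings (hypothetical).
kb_embs = [[float(i), 1.0] for i in range(10)]
query_emb = [0.5, -1.0]
retrieved = [9, 8, 7, 6]
aug_ids, aug_scores = augment_candidates(
    query_emb, retrieved, kb_embs, rho=0.25, rng=random.Random(0)
)
```

The augmented candidates would then be scored by the evaluator to obtain $\bm{P}^{\mathrm{aug}}$, and the augmented scores would enter the KL loss in place of $\bm{S}_q$.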

### Retriever-Forecaster Joint Training {#sec:train}

Utilizing the candidates retrieved by the retriever, we aim to enhance forecasting capability by leveraging external knowledge and further supervising the training of the forecaster. As illustrated in Figure `\ref{fig:overview}`{=latex}, the backbone of the foundation model remains frozen throughout the training process. To maintain consistency, we employ the same prediction loss utilized during the pre-training phase to update the trainable parts of the forecaster. Formally, the prediction loss can be formulated as follows: $$\begin{equation}
    \mathcal{L}_{\mathrm{Pred}} = \mathcal{L}_{\mathrm{Pretrain}}(\mathcal{F}(\vx,\bm{C}), \vy).
\end{equation}$$ Combined with the loss utilized for updating the retriever, the overall training loss is $$\begin{equation}
    \mathcal{L} = \mathcal{L}_{\mathrm{Pred}} + \lambda \cdot \mathcal{L}_R^{\mathrm{aug}} ,
\end{equation}$$ where $\lambda$ is a weight hyperparameter of $\mathcal{L}_R^{\mathrm{aug}}$.

## Inference Procedure {#sec:inference}

During the inference process, given a query, candidates with the highest $k$ retrieval scores from the knowledge base are retrieved by the retriever. Following preprocessing, the embeddings of the input and the retrieved candidates are processed through Channel Prompting to effectively integrate external knowledge. Ultimately, the knowledge-enhanced embeddings are fed into the backbone of the time series foundation model, which then generates the final prediction.
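
The retrieval step at inference time can be sketched as a top-$k$ search by dot-product similarity. The function name and the toy embeddings are illustrative assumptions; a real knowledge base with millions of points would likely use a vectorized or approximate nearest-neighbour search instead of this linear scan.

```python
import heapq

def retrieve_top_k(query_emb, kb_embs, k):
    """Return the indices and scores of the k knowledge-base entries with
    the highest dot-product similarity to the query embedding."""
    scores = [sum(q * c for q, c in zip(query_emb, emb)) for emb in kb_embs]
    top = heapq.nlargest(k, range(len(scores)), key=scores.__getitem__)
    return top, [scores[i] for i in top]

# Toy example with four candidate embeddings (hypothetical values).
kb_embs = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
top_ids, top_scores = retrieve_top_k([1.0, 0.0], kb_embs, k=2)
```

The returned candidates would then be embedded and passed, together with the input, through Channel Prompting before reaching the frozen backbone.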

# Experiment {#sec:exp_set}

## Experiments Setups {#sec:datasets}

\resizebox{\linewidth}{!}{
    \begin{tabular}{llcccccc}
    \toprule
         & Statistics &  Energy & Transport & Nature & Web & Sales & Healthcare \\
        \midrule
        Dataset & \# Datasets & 3 & 2 & 2 & 2 & 2 & 2  \\
        & \# Obs & 10,875,374 & 8,223,748 & 9,784,137 & 157,104,689 & 58,411,778 & 72,583,275 \\
        & \% & 3.43\% & 2.59\% & 3.09\% & 49.56\% & 18.43\% & 22.90\% \\
        \midrule
        Knowledge Base & \# Datasets & 3 & 2 & 2 & 2 & 2 & 2  \\
        & \# Obs & 545,792 & 585,216 & 513,024 & 512,000 & 484,864 & 512,512 \\
        & \% & 17.31\% & 18.56\% & 16.27\% & 16.24\% & 15.38\% & 16.25\% \\
    \bottomrule
    \end{tabular}
    }

**Datasets and Knowledge Base:** Our training employs a subset of about 320 million time points from LOTSA [@woo2024moirai] and UTSD [@liu2024timer], which were used for the pre-training of time series foundation models. The dataset encompasses a diverse range of domains to maintain the generalization capabilities of the foundation model. The knowledge base used for training contains approximately $3$ million data points, as introduced in `\secref{sec:kb}`{=latex}, selected from the training datasets. Each domain within the knowledge base is designed to contain a roughly equivalent number of data points to maintain balance. Detailed statistics of our training dataset and knowledge base are provided in Table `\ref{tab:dataset_domain}`{=latex}. Consistent with LOTSA, we adopt Arrow [@arrow] as the unified storage format, which is suitable for deep learning pipelines. For evaluation, we consider the popular long sequence forecasting benchmark, including six public datasets: ETTh1, ETTh2, ETTm1, ETTm2, Weather, and Electricity, which are commonly utilized in previous works [@jin2023timellm; @woo2024moirai]. Details of our training and evaluation datasets are provided in Appendix `\ref{sec:app_datasets}`{=latex}.
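
As a quick arithmetic check, the per-domain shares reported in Table `\ref{tab:dataset_domain}`{=latex} can be recomputed from the observation counts. The counts below are copied from the table; the variable and function names are only illustrative.

```python
DOMAINS = ["Energy", "Transport", "Nature", "Web", "Sales", "Healthcare"]
# "# Obs" rows of the table: training dataset and knowledge base.
TRAIN_OBS = [10_875_374, 8_223_748, 9_784_137, 157_104_689, 58_411_778, 72_583_275]
KB_OBS = [545_792, 585_216, 513_024, 512_000, 484_864, 512_512]

def shares(counts):
    """Percentage of the total contributed by each domain, to 2 decimals."""
    total = sum(counts)
    return [round(100.0 * c / total, 2) for c in counts]

train_total = sum(TRAIN_OBS)   # about 317 million points
kb_total = sum(KB_OBS)         # about 3.15 million points
train_shares = dict(zip(DOMAINS, shares(TRAIN_OBS)))
kb_shares = dict(zip(DOMAINS, shares(KB_OBS)))
```

The training data is dominated by the Web domain (close to half of all points), whereas every domain in the knowledge base accounts for roughly 15% to 19%, which is the balance the paragraph above describes.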

**Metric:** We employ mean squared error (MSE) as the standard error metric for our experiments.

**Implementation Detail:** We employ TTM-Base (TTM~B~), one of the latest state-of-the-art (SOTA) TSFMs, as our backbone. The input context length is set to 512 and the forecasting length to 96, consistent with TTM~B~. During inference, `\model `{=latex}uses the same knowledge base employed during the training phase. More implementation details are provided in Appendix `\ref{sec:app_imple}`{=latex}.

**Baselines:** We compare with 12 of the latest open-source state-of-the-art forecasting methods, categorized as follows: (a) Time Series Foundation Models: TTM [@ekambaram2024ttm], Moirai [@woo2024moirai], MOMENT [@goswami2024moment], Timer [@liu2024timer], Chronos [@ansari2024chronos], TimesFM [@das2024timefm]. (b) LLM-based Time Series Models: TimeLLM [@jin2023timellm], GPT4TS [@zhou2023one]. (c) Other architectures: iTransformer [@liuitransformer], TimesNet [@wutimesnet], PatchTST [@nie2023patchtst] and DLinear [@zeng2023dlinear]. All results are sourced from @liu2024timer or our reproduction. Due to page limitations, we report the results of the base version of TSFMs in the main text. Full results are provided in Appendix `\ref{sec:app_exp_lsf}`{=latex}.

## Results of Zero-shot Forecasting

<figure id="fig:exp_improve">
<div class="center">
<img src="imgs/exp/exp_bar_plot.png" />
</div>
<figcaption>Improvement by TimeRAF on zero-shot forecasting. <span class="math inline">5%</span> Few-shot denotes fine-tuning the TSFM with <span class="math inline">5%</span> of the downstream dataset. TimeRAF demonstrates significant improvements across various datasets, even outperforming results obtained by few-shot fine-tuning.</figcaption>
</figure>

**Improvement by `\model `{=latex}on zero-shot forecasting:** As shown in Figure `\ref{fig:exp_improve}`{=latex}, we demonstrate the improvements brought by our method in zero-shot forecasting. The yellow bar represents the scenario where $5\%$ of the training data from the dataset is used to fine-tune the foundation model backbone. Augmented by retrieved knowledge, our `\model `{=latex}presents significant improvements across all the datasets. The experimental results indicate that, through our training, the retriever has learned to search for valuable information from the knowledge base. Subsequently, through Channel Prompting, `\model `{=latex}successfully extracts useful knowledge, ultimately enhancing the prediction results. Moreover, `\model `{=latex}also outperforms few-shot fine-tuning, which further demonstrates the effectiveness of our method.

\renewcommand{\arraystretch}{1.2}
\resizebox{\linewidth}{!}{
    \begin{tabular}{lccccccccccccc}
          \toprule
          \textbf{Dataset} & \multicolumn{7}{c}{\textbf{Zero-shot}} & \multicolumn{6}{c}{\textbf{Full-shot}} \\
           \cmidrule(lr){2-8} \cmidrule(lr){9-14}
          & {\textbf{{\model}}} & {\textbf{{TTM\textsubscript{B}}}} &  {\textbf{Moirai\textsubscript{B}}} & {\textbf{MOMENT}} & {\textbf{Timer\textsubscript{1B}}} & {\textbf{TimesFM}} & \textbf{Chronos\textsubscript{S1}} & {\textbf{TimeLLM}} & {\textbf{GPT4TS}} & {\textbf{iTransformer}} & {\textbf{TimesNet}} & {\textbf{PatchTST}} & {\textbf{DLinear}} \\
          
          \midrule    
          ETTh1 & \textbf{\color{red}{0.359}} & 0.364  & 0.383 & 0.674 & 0.438 & 0.414 & 0.571 & \underline{0.362} & 0.376 & 0.386 & 0.384 & 0.414 & 0.386 \\ \midrule
        ETTh2 & \underline{\color{red}{0.276}} & 0.285  & 0.295 & 0.330 & 0.314 & 0.318 & 0.423 & \textbf{0.268} & 0.285 & 0.297 & 0.340 & 0.302 & 0.333 \\ \midrule
        ETTm1 & {0.399} & 0.415  & 0.448 & 0.670 & 0.690 & \color{red}{0.354} & 0.632 & \textbf{0.272} & \underline{0.292} & 0.334 & 0.338 & 0.329 & 0.345 \\ \midrule
        ETTm2 & \color{red}{0.177} & 0.186  & 0.225 & 0.257 & 0.213 & 0.201 & 0.272 & \textbf{0.161} & \underline{0.173} & 0.180 & 0.187 & 0.175 & 0.193 \\ \midrule
        Weather & \underline{\color{red}{0.152}} & 0.158 & 0.197 & 0.255 & 0.181 & - & - & \textbf{0.147} & 0.162 & 0.174 & 0.172 & 0.177 &  0.196 \\ \midrule
        Electricity & 0.168 & 0.170 & \color{red}{0.162}  & 0.744 & 0.192 & - & - & \textbf{0.131} & \underline{0.139} & 0.148 & 0.168 & 0.195 & 0.197 \\ 
         
    \bottomrule
    \end{tabular}%
    }

`\label{tab:lsf_full}`{=latex}

**`\model `{=latex}vs. other models:** We compare `\model `{=latex}against 12 baseline models. The experiment results are shown in Table `\ref{tab:lsf_full}`{=latex}, where 'zero-shot' refers to the forecasting results of various foundation models without any prior training on the test datasets, while 'full-shot' represents the prediction results of baseline models that have been fully trained on each dataset. Compared to the foundation models, `\model `{=latex}achieves either the best or competitive results across multiple datasets. Besides, our method is an enhancement built upon the foundation model. As the foundation model continues to evolve, `\model `{=latex}is anticipated to yield further improvements when adapted to new backbones. Additionally, we observe that `\model `{=latex}achieves strong results compared to full-shot baselines, thereby underscoring the effectiveness of retrieval-augmented forecasting.

## Ablation Studies

### Effectiveness of the Retriever

\resizebox{0.7\textwidth}{!}{
     \begin{tabular}{lccccc} 
    \toprule
       \textbf{Dataset}  & \textbf{\model} & \textbf{Random} & \textbf{Cosine} & \textbf{Token-Concat} & \textbf{Average}\\
    \midrule
       ETTh1  & \textbf{0.359} & 0.365 & 0.360 & 0.363 & 0.367\\
       ETTh2  & \textbf{0.276} & 0.287 & 0.282 & 0.278& 0.292\\
       ETTm1  & \textbf{0.399} & 0.420 & 0.401 & 0.404 & 0.421\\
       ETTm2  & \textbf{0.177} & 0.188 & 0.184 & 0.180 & 0.190\\
       Weather  & \textbf{0.152} & 0.159 & 0.153 & 0.153 & 0.166\\
       Electricity  & \textbf{0.168} & 0.173 & 0.172 & 0.174 & 0.181 \\
    \bottomrule
    \end{tabular}
    }

`\label{tab:abla}`{=latex}

As described in `\secref{sec:retrieval}`{=latex}, we employ an end-to-end approach to train the retriever, encouraging it to select the most valuable candidates from the knowledge base. To validate the effectiveness of the learnable retriever, we have designed two baselines for comparison: one that randomly selects candidates from the knowledge base and another that selects the top $k$ candidates based on cosine similarity. As shown in Table `\ref{tab:abla}`{=latex}, randomly selecting candidates fails to provide useful information to the forecaster and may even introduce noise, degrading the model's predictive performance. While the cosine similarity-based retrieval method offers some knowledge, its improvement is limited and falls short compared to our method, which automatically learns how to retrieve useful knowledge.
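
The two non-learned baselines can be sketched as follows. This is an illustration only: the text does not state whether cosine similarity is computed on raw series or on embeddings, so the vectors here are generic, and all names are hypothetical.

```python
import math
import random

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)

def cosine_top_k(query, kb, k):
    """'Cosine' baseline: indices of the k most cosine-similar entries."""
    sims = [cosine(query, c) for c in kb]
    return sorted(range(len(kb)), key=lambda i: sims[i], reverse=True)[:k]

def random_k(kb_size, k, rng):
    """'Random' baseline: k distinct indices drawn uniformly at random."""
    return rng.sample(range(kb_size), k)

kb = [[1.0, 0.0], [2.0, 0.1], [0.0, 3.0], [-1.0, 0.0]]
cos_ids = cosine_top_k([1.0, 0.0], kb, k=2)
rand_ids = random_k(len(kb), k=2, rng=random.Random(0))
```

Neither baseline receives feedback from the forecaster, which is the key difference from the learned retriever: similarity alone does not guarantee that a candidate lowers the forecasting error.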

### Channel Prompting

An effective integration method, named Channel Prompting, is used to extract the relevant knowledge from the retrieved data, as detailed in `\secref{sec:integration}`{=latex}. To validate the effectiveness of Channel Prompting, we establish two baselines for comparison: the first, called *Token-Concat*, concatenates the retrieved candidates with the input at the token level, while the second, termed *Average*, directly computes the mean of the candidate and input embeddings for integration. As shown in Table `\ref{tab:abla}`{=latex}, our `\model `{=latex}outperforms both baselines. Token-level concatenation restricts the integration to tokens located at the same position, while averaging the input and retrieved candidate embeddings proves insufficient for extracting valuable information.

## Model Analysis

### Choice of Knowledge Base

\resizebox{0.63\linewidth}{!}{
       \begin{tabular}{lcccc} 
    \toprule
       \textbf{Dataset} & \textbf{w/o RAF} & \textbf{\modelr} & \textbf{\model} & \textbf{\modeld} \\
    \midrule
       ETTh1  & 0.364 & 0.362 & {\textbf{0.359}} & \underline{0.360} \\
       ETTh2  & 0.285 & 0.280 & {\textbf{0.276}} & \underline{0.278} \\
       ETTm1  & 0.415 & 0.404 & {\textbf{0.399}} & \underline{0.400} \\
       ETTm2  & 0.186 & 0.181 & {\textbf{0.177}} & \underline{0.178} \\
       Weather  & 0.158 & 0.155 & {\textbf{0.152}} & \textbf{0.152} \\
       Electricity  & 0.170 & 0.169 & {\textbf{0.168}} & \textbf{0.168} \\
       % \midrule
       % \textbf{Improvement(\%)} & - & \textbf{1.81\%} & \textbf{3.03\%} & \textbf{2.86\%} \\
    \bottomrule
    \end{tabular}
   }

`\label{tab:kb}`{=latex}

**Source of Knowledge Base:** The previous experimental results have convincingly demonstrated that following training, `\model `{=latex}has acquired the capability to dynamically access pertinent knowledge from an external knowledge base and effectively leverage this valuable information. To further explore the implications of employing various knowledge bases during inference, we have devised the following three scenarios: (a) **`\modelr`{=latex}** randomly selects data from the pre-trained multi-domain dataset, which may result in an uneven distribution across different domains. (b) **`\modeld`{=latex}** utilizes a knowledge base closely related to the test data. Specifically, the training set from the same dataset is directly employed as the knowledge base for retrieval. (c) **`\model`{=latex}** engages a meticulously curated multi-domain dataset, which is detailed in Table `\ref{tab:dataset_domain}`{=latex}. As presented in Table `\ref{tab:kb}`{=latex}, `\model `{=latex}achieves the best performance across different datasets. As a specifically designed knowledge base, it encompasses a rich repository of information across multiple domains, enabling it to provide useful information to enhance predictions. Meanwhile, the knowledge base used in `\modeld `{=latex}is particularly relevant to the test data, providing domain-specific knowledge. As a result, the zero-shot time series forecasting performance achieved with this knowledge base ranks just below that of `\model`{=latex}. However, the randomly selected knowledge base used in `\modelr `{=latex}suffers from domain imbalance, which limits the enhancement that external knowledge can provide to the forecaster.

\begin{wrapfigure}{r}{0.35\textwidth}
    
    
    \includegraphics[width=0.35\textwidth]{imgs/exp/size.pdf}  
    
    \caption{Influence of knowledge base size. Smaller knowledge base provides less information, leading to worse performance.}
    \label{fig:size}
\end{wrapfigure}

**Size of Knowledge Base:** The size of the knowledge base also plays a vital role in the framework, determining the extent of external knowledge that can be accessed. We perform a comprehensive analysis of this aspect, presenting average results across various datasets in Figure `\ref{fig:size}`{=latex}. Full results are provided in Appendix `\ref{sec:app_size}`{=latex}. Initially, both `\model `{=latex}and `\modeld `{=latex}utilize knowledge bases of comparable scale, each consisting of approximately 3 million data points, as outlined in `\secref{sec:datasets}`{=latex}. Then, we progressively reduce the size of the knowledge base. As shown in Figure `\ref{fig:size}`{=latex}, the MSE is influenced by changes in the knowledge base size. As the size diminishes, the amount of external knowledge it can provide decreases, leading to a decline in performance. Once the knowledge base is reduced beyond a certain point, using a domain-specific knowledge base (`\modeld`{=latex}) can provide more relevant information than a multi-domain knowledge base (`\model`{=latex}), resulting in better forecasting performance.

### Influence of the Candidates Number {#sec:exp_topk}

We investigate the impact of varying the numbers of retrieved candidates on prediction performance. As illustrated in Figure `\ref{fig:exp_topk}`{=latex}, using multiple retrieved candidates (e.g., $4$ or $8$) equips the forecaster with a more comprehensive set of external information compared to relying on a single candidate, thereby further enhancing prediction performance. Nevertheless, the performance gains do not persist as the variable $k$ increases. In our analysis of the test data, we observe that when $k$ is elevated to $16$ or $32$, there is no significant improvement in the model's prediction accuracy. This phenomenon may be attributable to the introduction of excessive candidates, which can lead to redundant information, ultimately detracting from the overall effectiveness of the prediction results.

<figure id="fig:exp_topk" data-latex-placement="t">
<div class="center">
<img src="imgs/exp/topk.png" />
</div>
<figcaption>Influence of the Candidates Number <span class="math inline"><em>k</em></span>. As <span class="math inline"><em>k</em></span> increases, the performance gradually improves due to the integration of more relevant knowledge. However, when <span class="math inline"><em>k</em></span> exceeds a certain threshold, the abundance of information can introduce redundancy, negatively affecting the prediction.</figcaption>
</figure>

## Case Study on Retrieved Knowledge

<figure id="fig:case" data-latex-placement="t">

<figcaption><strong>Case Study on Retrieved Knowledge.</strong> <strong>(a) Example A:</strong> The retrieved knowledge shares similar periodicity and subtle fluctuations with the input, facilitating the forecaster’s ability to effectively capture the prior knowledge inherent in the input, thereby improving prediction performance. <strong>(b) Example B:</strong> The retrieved data provides supplementary insights, including partial future information (highlighted within the red dashed box), empowering the forecaster to generate better predictions.</figcaption>
</figure>

To conduct a detailed analysis of the information provided by the retriever and its contribution to enhancing the zero-shot forecasting capabilities of the foundation model, we present two illustrative examples in Figure `\ref{fig:case}`{=latex}. As shown in Figure `\ref{fig:case1}`{=latex}, the retrieved knowledge exhibits similar periodicity and nuanced fluctuations to the input, enhancing the forecaster's capacity to effectively capture the prior knowledge inherent in the input data, thereby improving prediction performance.

The data retrieved by the retriever is not always highly similar to the input, as illustrated in Figure `\ref{fig:case2}`{=latex}. In the absence of the retrieval-augmented forecasting method, the model generates predictions with small amplitude, relying solely on constrained historical data and underlying inertia. However, the incorporation of retrieved data provides additional insights, including partial future information (highlighted within the red dashed box), thereby improving the prediction generated by the forecaster.

# Conclusion and Future work

In this paper, we introduce `\model`{=latex}, a novel framework designed to leverage retrieval-augmented generation for zero-shot time series forecasting. We develop customized time series knowledge bases that are tailored to the specific forecasting tasks and employ an end-to-end learnable retriever to extract valuable information from the knowledge base. We also introduce Channel Prompting to extract relevant information from the retrieved data for knowledge integration. By leveraging external knowledge, `\model `{=latex}exhibits a notable enhancement in zero-shot time series forecasting.

While `\model `{=latex}achieves strong performance, this represents merely the initial step in the integration of time series methods and RAG. Due to resource constraints, the knowledge base is built from the original time series data without advanced techniques like trend-seasonal decomposition. In terms of architecture, our approach to integrating external knowledge is somewhat heuristic, and future work should design a more flexible and elegant approach. Also, the current architecture ignores the potential interdependencies among different channels, which could be addressed more effectively in future methods. Finally, incorporating additional modalities such as tabular or text data is an exciting new direction to provide supplementary knowledge.

\bibliographystyle{iclr2025_conference}
\clearpage
\newpage
\appendix

::: center
`\LARGE `{=latex}**`\model`{=latex}: Retrieval-Augmented Foundation model for zero-shot time series forecasting\
$\;$\
------------Appendix------------**
:::

`\hypersetup{hidelinks}`{=latex} `\tableofcontents`{=latex} `\noindent`{=latex}`\hrulefill`{=latex}

\newpage
\addtocontents{toc}{\protect\setcounter{tocdepth}{2}}

# `\model `{=latex}Datasets {#sec:app_datasets}

## List of Training Datasets

Our fine-tuning employs a subset of about 320 million time points from LOTSA [@woo2024moirai] and UTSD [@liu2024timer]. To enhance data integrity, missing values are filled using linear interpolation. We store each univariate, multivariate, or irregularly sampled time series together with its timestamps, domain, sampling frequency, and other meta-information in one directory using the Arrow format. One dataset may be composed of multiple related time series.
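
Linear interpolation of missing values can be sketched as below. The representation of missing points as `None` and the handling of leading or trailing gaps (filled with the nearest observed value) are assumptions of this sketch; the text only states that linear interpolation is used.

```python
def interpolate_missing(values):
    """Fill None entries by linear interpolation between the nearest
    observed neighbours. Leading and trailing gaps are filled with the
    nearest observed value (an assumption of this sketch)."""
    known = [i for i, v in enumerate(values) if v is not None]
    if not known:
        raise ValueError("series contains no observed values")
    filled = list(values)
    # Gaps before the first and after the last observation.
    for i in range(known[0]):
        filled[i] = values[known[0]]
    for i in range(known[-1] + 1, len(values)):
        filled[i] = values[known[-1]]
    # Interior gaps: interpolate between consecutive observed points.
    for left, right in zip(known, known[1:]):
        span = right - left
        for i in range(left + 1, right):
            w = (i - left) / span
            filled[i] = (1.0 - w) * values[left] + w * values[right]
    return [float(v) for v in filled]

series = [1.0, None, None, 4.0, None]
clean = interpolate_missing(series)  # approximately [1, 2, 3, 4, 4]
```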

All datasets can be classified into six distinct domains by their source: Energy, Nature, Transport, Web, Sales, and Healthcare. The datasets exhibit diverse sampling frequencies, ranging from macro intervals such as daily to more fine-grained intervals like hourly and minutely. Notably, several datasets can demonstrate exceptionally high-frequency sampling rates, such as the MotorImagery dataset, which operates at a millisecond frequency.

## List of Inference Datasets

In the field of time series forecasting, several classical datasets such as ETT [@zhou2021informer], ECL [@wu2021autoformer] and Weather [@wu2021autoformer] have become widely recognized benchmarks for evaluating model performance. We also utilize these datasets to evaluate the zero-shot forecasting performance and perform evaluation in a sliding window fashion following previous work [@nie2023patchtst; @ekambaram2024ttm]. Below, we offer a brief overview of these datasets.

- **ETT datasets:** The four ETT datasets (ETTh1, ETTh2, ETTm1, ETTm2) contain multivariate time series data collected from electrical transformers at two stations. ETTh1 and ETTh2 are collected at an hourly interval, while ETTm1 and ETTm2 are collected every 15 minutes. All four datasets have 7 channels.

- **Weather:** The weather dataset consists of 21 channels, which serve as weather indicators. It is collected at 10-minute intervals at the Max Planck Institute of Biogeochemistry weather station.

- **Electricity (ECL):** The Electricity dataset, also known as the ECL dataset, comprises the hourly electricity consumption data of 321 clients.
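
Sliding-window evaluation on these benchmarks can be sketched as follows. The stride of 1, the helper names, and the naive last-value forecaster used in the example are illustrative assumptions; the context length of 512 and horizon of 96 follow the setup described in the main text.

```python
def sliding_windows(series_len, context_len=512, horizon=96, stride=1):
    """Yield (context_start, context_end, target_end) index triples."""
    last_start = series_len - context_len - horizon
    for start in range(0, last_start + 1, stride):
        yield start, start + context_len, start + context_len + horizon

def mse(pred, target):
    """Mean squared error between two equally long sequences."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)

def evaluate(series, forecast_fn, context_len=512, horizon=96, stride=1):
    """Average the MSE of forecast_fn over all sliding windows."""
    errors = []
    for s, m, e in sliding_windows(len(series), context_len, horizon, stride):
        pred = forecast_fn(series[s:m], horizon)
        errors.append(mse(pred, series[m:e]))
    if not errors:
        raise ValueError("series is shorter than context_len + horizon")
    return sum(errors) / len(errors)

def last_value(context, horizon):
    """Naive forecaster that repeats the last observed value."""
    return [context[-1]] * horizon

# Toy run with a short series and small window sizes.
toy = [float(t) for t in range(10)]
toy_mse = evaluate(toy, last_value, context_len=4, horizon=2)
```

On the toy ramp every window has the same error, $(1^2 + 2^2)/2 = 2.5$, so the average is also 2.5. For multivariate datasets such as ETT, the same procedure would be applied per channel and the errors averaged.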

# Additional Implementation Detail {#sec:app_imple}

## Knowledge Base for Training

The knowledge base used for training contains approximately $3$ million data points, as introduced in `\secref{sec:kb}`{=latex}, selected from the training datasets. The key statistics of the datasets in the knowledge base are provided in Table `\ref{tab:dataset_domain}`{=latex}. The selection process entailed a meticulous curation to ensure a diverse representation of data across various domains. This diversity enhances the robustness of the model, enabling it to generalize better across different contexts. Each data point has been sourced from reputable datasets, ensuring high-quality input that informs the training process. In addition, all the time series in the knowledge base have the same length as the input, which is $512$.
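
Since every knowledge-base entry has length $512$, the per-domain observation counts in Table `\ref{tab:dataset_domain}`{=latex} imply a number of stored series per domain. The short computation below makes this explicit; it assumes each entry is a single series of exactly 512 points, which is consistent with every count in the table being a multiple of 512.

```python
SERIES_LEN = 512
# "# Obs" row of the knowledge base, copied from the table.
KB_OBS = {
    "Energy": 545_792,
    "Transport": 585_216,
    "Nature": 513_024,
    "Web": 512_000,
    "Sales": 484_864,
    "Healthcare": 512_512,
}

series_per_domain = {d: n // SERIES_LEN for d, n in KB_OBS.items()}
remainders = {d: n % SERIES_LEN for d, n in KB_OBS.items()}
total_series = sum(series_per_domain.values())
```

Under this assumption the knowledge base holds roughly a thousand series per domain and about six thousand in total.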

## Training Details

Our training is performed on 4 NVIDIA A100 GPUs. Following the backbone configuration of the foundation model we use (TTM~B~ [@ekambaram2024ttm]), the input length is $sl=512$ and the forecasting length is $fl=96$. Both encoders in the retriever employ a two-layer MLP, with the size of the hidden layer configured to be four times the input dimension. The tanh activation function is utilized in both layers. The MLP in Channel Prompting also uses the tanh activation function and has 4 layers. During the training phase, all parameters of the time series foundation model remain fixed, with only the retriever and the Channel Prompting module undergoing training. The learning rate for the retriever is set to $0.001$, whereas the learning rate for Channel Prompting is set to $0.00001$. The weight $\lambda$ is set to $1$. The entire model is trained for 2 epochs, and for each test dataset, we report the best result. During the inference phase, with the exception of the ablation experiments detailed in Section `\ref{sec:exp_topk}`{=latex}, we consistently retrieve eight candidates from the knowledge base, i.e., $k=8$.
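
The retriever encoder described above (two layers, hidden width four times the input dimension, tanh after both layers) can be sketched in plain Python. The output dimension, the weight initialization, and all names are assumptions made for illustration; an actual implementation would use a deep learning framework.

```python
import math
import random

def init_linear(in_dim, out_dim, rng):
    """Random weights (out_dim x in_dim) and zero bias; the
    initialization scheme is an assumption of this sketch."""
    scale = 1.0 / math.sqrt(in_dim)
    weights = [[rng.uniform(-scale, scale) for _ in range(in_dim)]
               for _ in range(out_dim)]
    return weights, [0.0] * out_dim

def linear_tanh(x, layer):
    """One affine layer followed by tanh."""
    weights, bias = layer
    return [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(weights, bias)]

def make_encoder(in_dim, out_dim, rng):
    """Two-layer MLP with a hidden width of 4 * in_dim."""
    hidden = 4 * in_dim
    return [init_linear(in_dim, hidden, rng),
            init_linear(hidden, out_dim, rng)]

def encode(x, encoder):
    """Apply both layers; tanh is used after each of them."""
    for layer in encoder:
        x = linear_tanh(x, layer)
    return x

encoder = make_encoder(in_dim=8, out_dim=8, rng=random.Random(0))
embedding = encode([0.1] * 8, encoder)
```

In the described setup, two such encoders would be trained with a learning rate of $0.001$ while the backbone stays frozen.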

\renewcommand{\arraystretch}{1.3}
\resizebox{\linewidth}{!}{
    \begin{tabular}{c|c|c|c|c}
        \toprule
        Domain & Dataset & Frequency & Time Points & Source \\
        \midrule
         Energy & BDG-2 Fox & H & 2,324,568 &  BuildingsBench~\citep{emami2023buildingsbench} \\
         & Australian Electricity Demand & 30T & 1,153,584 &  Monash~\citep{godahewa2021monash} \\ 
         &Solar Power & 4S & 7,397,222 &  Monash~\citep{godahewa2021monash} \\ 
         \midrule
         Transport & Los-Loop & 5T & 7,094,304 &  LibCity~\citep{jiang2023libcity} \\
         & Uber TLC Hourly & H & 1,129,444 &  GluonTS~\citep{alexandrov2020gluonts} \\
         \midrule
         Nature & Subseasonal Precipitation & D & 9,760,426 &  SubseasonalClimateUSA library~\citep{mouatadid2024subseasonalclimateusa} \\
         & Saugeen & D & 23,711 &  Monash~\citep{godahewa2021monash} \\
         \midrule
         Web & Kaggle Web Traffic Daily & D & 116,485,589 &  Monash~\citep{godahewa2021monash} \\
         & Wiki-Rolling & D & 40,619,100 &  GluonTS~\citep{alexandrov2020gluonts} \\
         \midrule
         Sales & M5 & D & 58,327,370 &  GluonTS~\citep{alexandrov2020gluonts} \\
         & Favorita Transactions & D & 84,408 &  Kaggle \\
         \midrule
         Healthcare & MotorImagery & 0.001S & 72,576,000 &  UCR Time Series Archive~\citep{dau2019ucr} \\
         & US Births & D & 7,275 &  Monash~\citep{godahewa2021monash} \\
         % \midrule
         % Econ/Fin & GoDaddy & M & 128,535 &  Kaggle \\
         % & Bitcoin & D & 74,824 &  Monash Time Series Forecasting Repository~\citep{godahewa2021monash} \\
         \bottomrule
    \end{tabular}
    }

## Baselines

We conduct zero-shot forecasting experiments on seven datasets from iTransformer [@liuitransformer]. We apply the same data-split strategy as Autoformer [@wu2021autoformer] and calculate the averaged MSE of all predict-96 windows in the test split. We evaluate five open-source time series foundation models, including Timer [@liu2024timer], Moirai [@woo2024moirai], TimesFM [@das2024timefm], Chronos [@ansari2024chronos], and MOMENT [@goswami2024moment]. Closed-source models such as TimeGPT [@garza2023timegpt] are not included due to their inaccessibility.

- **MOMENT**: MOMENT[^1] employs a masking modeling approach for zero-shot forecasting by concatenating the lookback series with a mask corresponding to the prediction length. The output of the model, derived from the mask, serves as the forecast. This method involves pre-training a Transformer encoder model in a univariate manner using a curated dataset known as the "Time Series Pile," which encompasses a diverse range of time series data.

- **Chronos**: Chronos[^2] is a probabilistic forecaster. *Chronos~S1~* refers to sampling a single prediction trajectory, while *Chronos~S20~* involves averaging 20 sampled trajectories. Chronos tokenizes the input time series and processes these tokens using a large language model, specifically the T5 model. It is trained on an extensive corpus of time series data, including synthetic data, to enhance generalization.

- **TimesFM**: TimesFM employs a decoder-style attention model, characterized by causal self-attention, which is pre-trained in a univariate manner on an extensive array of both real-world and synthetic datasets. We utilize the official checkpoint available on HuggingFace[^3], which accommodates a variety of input and output lengths.

- **Moirai**: The Moirai family[^4] has three different sizes, labeled as *Moirai~S~*, *Moirai~B~*, and *Moirai~L~*. Moirai pre-trains a Transformer encoder on the extensive "LOTSA" dataset (27B time points) by masking the forecast horizon of each target channel and performing mask reconstruction. By flattening all channels in a multivariate time series, Moirai supports pre-training in "any-variate" settings.

- **Timer**: Timer provides three versions with increased scopes of pre-training. *Timer~1B~* is pre-trained on UTSD[^5]; *Timer~16B~* is pre-trained on UTSD and Buildings900K [@emami2023buildingsbench]; and *Timer~28B~* is pre-trained on UTSD and LOTSA.

- **TTM**: TTM[^6] pre-trains a compact model based on the lightweight TSMixer architecture on the Monash and LibCity datasets, and incorporates innovations like adaptive patching, diverse resolution sampling, and resolution prefix tuning.

We report the implementation details for all the time series foundation model baselines in Table `\ref{tab:app_baseline}`{=latex}.

\resizebox{\linewidth}{!}{
    \begin{tabular}{lccc}
    \toprule
        Baseline & Used in Table & Results Source & Implementation Link \\
        \midrule
        Moirai & Zero-shot in Table~\ref{tab:lsf_full} and Table~\ref{tab:app_lsf_full} & \citet{liu2024timer} & \href{https://github.com/SalesforceAIResearch/uni2ts}{uni2ts} \\
         Timer & Zero-shot in Table~\ref{tab:lsf_full} and Table~\ref{tab:app_lsf_full} & \citet{liu2024timer} & \href{https://github.com/thuml/Large-Time-Series-Model}{Large-Time-Series-Model} \\
          MOMENT & Zero-shot in Table~\ref{tab:lsf_full} and Table~\ref{tab:app_lsf_full} & \citet{liu2024timer} & \href{https://github.com/moment-timeseries-foundation-model/moment}{moment} \\
           Chronos & Zero-shot in Table~\ref{tab:lsf_full} and Table~\ref{tab:app_lsf_full} & \citet{liu2024timer} & \href{https://github.com/amazon-science/chronos-forecasting}{chronos-forecasting} \\
           TimesFM & Zero-shot in Table~\ref{tab:lsf_full} and Table~\ref{tab:app_lsf_full} & \citet{liu2024timer} & \href{https://github.com/google-research/timesfm}{TimesFM} \\
           TTM & Zero-shot in Table~\ref{tab:lsf_full} and Table~\ref{tab:app_lsf_full} & Our reproduction using official implementation & \href{https://github.com/ibm-granite/granite-tsfm}{granite-tsfm} \\
           \bottomrule
    \end{tabular}
    }

`\label{tab:app_baseline}`{=latex}

\begin{table*}[ht]
  
  \caption{Quality evaluation of time series foundation models. \emph{Architecture} denotes the Transformer category. \emph{Model size} presents the parameter counts. \emph{Token type} presents the graininess of time series tokens. \emph{Context length} means the maximum/fixed input length of the model.}
  
  \label{tab:tsfm_comp}
  \vskip 0.15in
  
  \renewcommand{\multirowsetup}{}
 \resizebox{\linewidth}{!}{
  \renewcommand\arraystretch{1.2}
  \begin{tabular}{c|cccccc}
    \toprule
    Method & Timer & Moirai & MOMENT & Chronos & TTM & TimesFM   \\ 
     & \citeyearpar{liu2024timer}  & \citeyearpar{woo2024moirai}& \citeyearpar{goswami2024moment}  & \citeyearpar{ansari2024chronos} &
     \citeyearpar{ekambaram2024ttm} &  \citeyearpar{das2024timefm}   \\
    \toprule
     
    Model size & 29M, 50M,  & 14M, 91M, & 40M, 125M & 20M, 46M,  & 1M, 4M  & 17M, 70M,   \\
     & 67M & 311M & 385M & 200M, 710M & 8M & 200M   \\
    \midrule
    Supported tasks & Forecast  & Forecast & \scalebox{0.8}{Forecast Imputation} & Forecast & Forecast & Forecast  
 \\
     & Imputation & &\scalebox{0.8}{Classification}& & & \\
     & Detection & & \scalebox{0.8}{Detection}& & &   \\
    \midrule
    Pre-training Scale & 28B & 27.65B & 1.13B & 84B & 1B & 100B  \\
    \midrule
    Token type & Segment & Segment & Segment & Point & Segment & Segment  \\
    \midrule
    Context length & $\le$1440 & $\le$5000 & = 512 & $\le$512 & $\le$1536 & $\le$512   \\
    \midrule
    Variable length & True & True & False & True & True & True \\
    \midrule
    Probabilistic & False & True & False & True & False & True  \\
    \bottomrule
  \end{tabular}
    }
\end{table*}

# Additional Experiments

## Zero-shot forecasting evaluation {#sec:app_exp_lsf}

\resizebox{\linewidth}{!}{
    \begin{tabular}{lcccccccccccc}
          \toprule
          % & \textbf{Zero-shot}                 & \textbf{Full-shot} \\
          %  \cmidrule(lr){2-8} \cmidrule(lr){9-14}
          \textbf{Dataset} & {\textbf{{\model}}} & {\textbf{{TTM\textsubscript{B}}}} & 
          {\textbf{Moirai\textsubscript{S}}} &{\textbf{Moirai\textsubscript{B}}} & {\textbf{Moirai\textsubscript{L}}} & {\textbf{MOMENT}} & {\textbf{Timer\textsubscript{1B}}} & {\textbf{Timer\textsubscript{16B}}} &{\textbf{Timer\textsubscript{28B}}} &{\textbf{TimesFM}} & \textbf{Chronos\textsubscript{S1}} & {\textbf{Chronos\textsubscript{S20}}}  \\
          
          \midrule    
          ETTh1 & \textbf{0.359} & \underline{0.364} &  0.441& 0.383 & 0.394  & 0.674 & 0.438 & 0.364 & 0.393 & 0.414 & 0.571 & 0.34\\ \midrule
        ETTh2 & \textbf{{0.276}} & \underline{0.285}  & 0.295 & 0.295 & 0.293 & 0.330 & 0.314 & 0.294 & 0.308 & 0.318 & 0.423 & 0.326  \\ \midrule
        ETTm1 & \underline{0.399} & {0.415} & 0.562  & 0.448 & 0.452 & 0.670 & 0.690 & 0.766 & 0.420 & \textbf{0.354} & 0.632 & 0.451 \\ \midrule
        ETTm2 & \textbf{0.177} & \underline{0.186}  & 0.218 & 0.225 & 0.214 & 0.257 & 0.213 & 0.234 & 0.247 & 0.201 & 0.272 & 0.190  \\ \midrule
        Weather & \textbf{0.152} & \underline{0.158} & 0.195 & 0.197 & 0.221 & 0.255 & 0.181 & 0.203 & 0.243 & - & - & -  \\ \midrule
        Electricity & {0.168} & 0.170 & 0.212  & 0.162 & 0.155 & 0.744 & 0.192 & \textbf{0.139} & \underline{0.147} & - & - & -  \\ 
         
    \bottomrule
    \end{tabular}%
    }

`\label{tab:app_lsf_full}`{=latex}

Table `\ref{tab:app_lsf_full}`{=latex} reports the zero-shot forecasting results of `\model `{=latex}and other time series foundation models. `\model `{=latex}achieves the best or second-best result on five of the six datasets, the exception being Electricity, where several Timer and Moirai variants perform better. These results demonstrate the effectiveness of integrating external knowledge. This capability is particularly valuable for industries that require timely and reliable forecasts but lack extensive historical data. Overall, the findings suggest that `\model `{=latex}is a strong reference point for zero-shot time series forecasting and motivate future research on model architectures and training methodologies in this domain.

## Ablation Study on Candidate Augmentation

During training, the retriever risks becoming stuck in a local optimum and consistently retrieving the same limited set or narrow range of candidates. To address this issue, we employ a straightforward candidate augmentation strategy. Table `\ref{tab:app_aug}`{=latex} reports the experimental results of `\model `{=latex}without candidate augmentation; removing it degrades performance on all six datasets.

\resizebox{0.5\linewidth}{!}{
    \begin{tabular}{lcc}
    \toprule
        \textbf{Dataset} & \textbf{\model} & \textbf{\model} \\
        & & \textbf{w/o candidate augmentation} \\
        \midrule
        ETTh1 & \textbf{0.359} & 0.363 \\
        ETTh2 & \textbf{0.276} & 0.282 \\
        ETTm1 & \textbf{0.399} & 0.410 \\
        ETTm2 & \textbf{0.177} & 0.186 \\
        Weather & \textbf{0.152} & 0.161 \\
        Electricity & \textbf{0.168} & 0.173 \\
        \bottomrule
    \end{tabular}
    }

`\label{tab:app_aug}`{=latex}
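To make the size of the ablation gap easier to read, the following minimal sketch computes the relative error increase when candidate augmentation is removed. The numbers are copied from the table above; the variable and function names are illustrative, and the sketch assumes the table entries are errors for which lower is better.

```python
# Values copied from the candidate-augmentation ablation table.
# Assumption: entries are forecasting errors (lower is better).
full = {
    "ETTh1": 0.359, "ETTh2": 0.276, "ETTm1": 0.399,
    "ETTm2": 0.177, "Weather": 0.152, "Electricity": 0.168,
}
ablated = {
    "ETTh1": 0.363, "ETTh2": 0.282, "ETTm1": 0.410,
    "ETTm2": 0.186, "Weather": 0.161, "Electricity": 0.173,
}


def relative_increase(with_aug, without_aug):
    """Percentage error increase per dataset when augmentation is removed."""
    return {
        name: 100.0 * (without_aug[name] - with_aug[name]) / with_aug[name]
        for name in with_aug
    }


increase = relative_increase(full, ablated)
mean_increase = sum(increase.values()) / len(increase)

for name, pct in increase.items():
    print(f"{name:12s} +{pct:.2f}%")
print(f"{'mean':12s} +{mean_increase:.2f}%")
```

Under these numbers the increase is positive for every dataset and largest on Weather, which is consistent with the ablation reading above.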

## Size of Knowledge Base {#sec:app_size}

\resizebox{0.8\linewidth}{!}{
       \begin{tabular}{lcccccccccc} 
    \toprule
    \textbf{Dataset} & \multicolumn{5}{c}{\textbf{\model}} & \multicolumn{5}{c}{\textbf{\modeld}} \\
       \cmidrule(lr){2-6} \cmidrule(lr){7-11} & \textbf{100\%} & \textbf{50\%} & \textbf{30\%} & \textbf{10\%} & \textbf{1\%} & \textbf{100\%} & \textbf{50\%} & \textbf{30\%} & \textbf{10\%} & \textbf{1\%}\\
    \midrule
       ETTh1   & {\textbf{0.3592}} & 0.3598 & 0.3603 & 0.3608 & 0.3622 & {0.3599} & 0.3600 & 0.3601 & 0.3602 & 0.3611\\
       ETTh2   & {\textbf{0.2763}} & {0.2767} & 0.2773 & 0.2790 & 0.2844 & 0.2779 & 0.2785 & 0.2791 & 0.2804 & 0.2823\\
       ETTm1   & {\textbf{0.3991}} & 0.3995 & 0.4002 &0.4017 & 0.4038 &{0.3998} & 0.4002 & 0.4008 & 0.4015 & 0.4024 \\
       ETTm2   & {\textbf{0.1768}} & 0.1773 & 0.1778 & 0.1792 & 0.1815 & {0.1776} & 0.1780 & 0.1784 & 0.1791 & 0.1807\\
       Weather   & {\textbf{0.1522}} & 0.1527 & 0.1533 & 0.1542 & 0.1558 & {0.1524} & 0.1528 & 0.1533 & 0.1540 & 0.1551\\
       Electricity   & {\textbf{0.1681}} & 0.1686 & 0.1692 & 0.1701 & 0.1715 &{0.1684} &0.1688 & 0.1691 & 0.1698 & 0.1710 \\
       % \midrule
       % \textbf{Improvement(\%)} & - & \textbf{1.81\%} & \textbf{3.03\%} & \textbf{2.86\%} \\
    \bottomrule
    \end{tabular}
   }

`\label{tab:app_size}`{=latex}

We conduct an experiment to examine the impact of knowledge base size on performance. Initially, `\model `{=latex}and `\modeld `{=latex}use knowledge bases of identical scale, each comprising approximately 3 million data points, as detailed in `\secref{sec:datasets}`{=latex}. We progressively reduce the size of the knowledge base and evaluate zero-shot forecasting at each size. Table `\ref{tab:app_size}`{=latex} reports the results on all six datasets, and Figure 1 shows how the results on the ETTh1 dataset vary with the knowledge base size. As the knowledge base becomes smaller, the amount of external knowledge it can provide decreases, leading to a decline in prediction performance. Once the knowledge base is reduced beyond a certain point, a domain-specific knowledge base provides more relevant information than a multi-domain knowledge base of the same size, resulting in better forecasting performance.
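The crossover described above can be read off the knowledge base size table with a short script. This is a minimal sketch: the values are copied from the table, the names `multi_domain`, `domain_specific`, and `crossover_fraction` are illustrative, and it assumes that the `\model`{=latex} columns correspond to the multi-domain knowledge base and the `\modeld`{=latex} columns to the domain-specific one.

```python
# Knowledge base sizes (percent of the full knowledge base), largest first.
FRACTIONS = [100, 50, 30, 10, 1]

# Assumption: first column group = multi-domain knowledge base,
# second column group = domain-specific knowledge base.
multi_domain = {
    "ETTh1": [0.3592, 0.3598, 0.3603, 0.3608, 0.3622],
    "ETTh2": [0.2763, 0.2767, 0.2773, 0.2790, 0.2844],
    "ETTm1": [0.3991, 0.3995, 0.4002, 0.4017, 0.4038],
    "ETTm2": [0.1768, 0.1773, 0.1778, 0.1792, 0.1815],
    "Weather": [0.1522, 0.1527, 0.1533, 0.1542, 0.1558],
    "Electricity": [0.1681, 0.1686, 0.1692, 0.1701, 0.1715],
}
domain_specific = {
    "ETTh1": [0.3599, 0.3600, 0.3601, 0.3602, 0.3611],
    "ETTh2": [0.2779, 0.2785, 0.2791, 0.2804, 0.2823],
    "ETTm1": [0.3998, 0.4002, 0.4008, 0.4015, 0.4024],
    "ETTm2": [0.1776, 0.1780, 0.1784, 0.1791, 0.1807],
    "Weather": [0.1524, 0.1528, 0.1533, 0.1540, 0.1551],
    "Electricity": [0.1684, 0.1688, 0.1691, 0.1698, 0.1710],
}


def crossover_fraction(multi, domain):
    """Largest size at which the domain-specific error is strictly lower.

    Returns None if the domain-specific knowledge base never wins.
    """
    for fraction, m, d in zip(FRACTIONS, multi, domain):
        if d < m:
            return fraction
    return None


crossover = {
    name: crossover_fraction(multi_domain[name], domain_specific[name])
    for name in multi_domain
}

for name, fraction in crossover.items():
    print(f"{name:12s} domain-specific wins from {fraction}% downward")
```

With the table values, the domain-specific knowledge base is ahead on every dataset at the 1% size, while the multi-domain one is ahead on every dataset at 100%; the crossover point itself differs between datasets.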

## Forecasting Visualization

We provide several visualizations of zero-shot forecasting in Figure `\ref{fig:app_vis}`{=latex}. These visualizations illustrate the effectiveness of our proposed method, which leverages external knowledge. Each subplot in Figure `\ref{fig:app_vis}`{=latex} captures a distinct scenario, giving a broader view of the model's capabilities under different conditions.

<figure id="fig:app_vis" data-latex-placement="t">
<img src="imgs/exp/vis/etth1_0.png" />
<img src="imgs/exp/vis/etth1_1.png" />
<img src="imgs/exp/vis/etth2_0.png" />
<p><br />
<img src="imgs/exp/vis/etth2_1.png" /></p>
<img src="imgs/exp/vis/ettm1_0.png" />
<img src="imgs/exp/vis/ettm1_1.png" />
<p><br />
<img src="imgs/exp/vis/ettm2_0.png" /></p>
<img src="imgs/exp/vis/ettm2_1.png" />
<img src="imgs/exp/vis/weather_0.png" />
<p><br />
<img src="imgs/exp/vis/weather_1.png" /></p>
<img src="imgs/exp/vis/elc_0.png" />
<img src="imgs/exp/vis/elc_1.png" />
<figcaption>Visualization of zero-shot forecasting across different datasets.</figcaption>
</figure>

[^1]: <https://huggingface.co/AutonLab/MOMENT-1-large>

[^2]: <https://huggingface.co/amazon/chronos-t5-large>

[^3]: <https://huggingface.co/google/timesfm-1.0-200m>

[^4]: <https://huggingface.co/collections/Salesforce/moirai-10-r-models-65c8d3a94c51428c300e0742>

[^5]: <https://huggingface.co/datasets/thuml/UTSD/tree/main>

[^6]: <https://huggingface.co/ibm-granite/granite-timeseries-ttm-v1>