---
abstract: |
  Recently, diffusion probabilistic models have attracted attention in generative time series forecasting due to their remarkable capacity to generate high-fidelity samples. However, the effective utilization of their strong modeling ability in the probabilistic time series forecasting task remains an open question, partially due to the challenge of instability arising from their stochastic nature. To address this challenge, we introduce a novel **M**ulti-**G**ranularity **T**ime **S**eries **D**iffusion (`\mgtsdnospace`{=latex}) model, which achieves state-of-the-art predictive performance by leveraging the inherent granularity levels within the data as given targets at intermediate diffusion steps to guide the learning process of diffusion models. The way to construct the targets is motivated by the observation that the forward process of the diffusion model, which sequentially corrupts the data distribution to a standard normal distribution, intuitively aligns with the process of smoothing fine-grained data into a coarse-grained representation, both of which result in a gradual loss of fine distribution features. In the study, we derive a novel multi-granularity guidance diffusion loss function and propose a concise implementation method to effectively utilize coarse-grained data across various granularity levels. More importantly, our approach does not rely on additional external data, making it versatile and applicable across various domains. Extensive experiments conducted on real-world datasets demonstrate that our `\mgtsd `{=latex}model outperforms existing time series prediction methods. Our code is available at <https://github.com/Hundredl/MG-TSD>.
author:
- "**Xinyao Fan**$^{1*}$, **Yueying Wu**$^{2*}$, **Chang Xu**$^{4\\dagger}$, **Yuhao Huang**$^{3}$, **Weiqing Liu**$^{4}$, **Jiang Bian**$^{4}$"
bibliography:
- iclr2024_conference.bib
title: "MG-TSD: Multi-Granularity Time Series Diffusion Models with Guided Learning Process"
---

\newcommand{\figleft}{{\em (Left)}}
\newcommand{\figcenter}{{\em (Center)}}
\newcommand{\figright}{{\em (Right)}}
\newcommand{\figtop}{{\em (Top)}}
\newcommand{\figbottom}{{\em (Bottom)}}
\newcommand{\captiona}{{\em (a)}}
\newcommand{\captionb}{{\em (b)}}
\newcommand{\captionc}{{\em (c)}}
\newcommand{\captiond}{{\em (d)}}
\newcommand{\newterm}[1]{{\bf #1}}
\def\figref#1{figure~\ref{#1}}
\def\appref#1{Appendix~\ref{#1}}
\def\Figref#1{Figure~\ref{#1}}
\def\twofigref#1#2{figures \ref{#1} and \ref{#2}}
\def\quadfigref#1#2#3#4{figures \ref{#1}, \ref{#2}, \ref{#3} and \ref{#4}}
\def\secref#1{Section~\ref{#1}}
\def\Secref#1{Section~\ref{#1}}
\def\twosecrefs#1#2{sections \ref{#1} and \ref{#2}}
\def\secrefs#1#2#3{sections \ref{#1}, \ref{#2} and \ref{#3}}
\def\eqref#1{Equation~\ref{#1}}
\def\Eqref#1{Equation~\ref{#1}}
\def\plaineqref#1{\ref{#1}}
\def\chapref#1{chapter~\ref{#1}}
\def\Chapref#1{Chapter~\ref{#1}}
\def\rangechapref#1#2{chapters\ref{#1}--\ref{#2}}
\def\algref#1{Algorithm~\ref{#1}}
\def\twoalgref#1#2{algorithms \ref{#1} and \ref{#2}}
\def\Twoalgref#1#2{Algorithms \ref{#1} and \ref{#2}}
\def\partref#1{part~\ref{#1}}
\def\Partref#1{Part~\ref{#1}}
\def\twopartref#1#2{parts \ref{#1} and \ref{#2}}
\def\ceil#1{\lceil #1 \rceil}
\def\floor#1{\lfloor #1 \rfloor}
\def\1{\bm{1}}
\newcommand{\train}{\mathcal{D}}
\newcommand{\valid}{\mathcal{D_{\mathrm{valid}}}}
\newcommand{\test}{\mathcal{D_{\mathrm{test}}}}
\def\eps{{\epsilon}}
\def\reta{{\textnormal{$\eta$}}}
\def\ra{{\textnormal{a}}}
\def\rb{{\textnormal{b}}}
\def\rc{{\textnormal{c}}}
\def\rd{{\textnormal{d}}}
\def\re{{\textnormal{e}}}
\def\rf{{\textnormal{f}}}
\def\rg{{\textnormal{g}}}
\def\rh{{\textnormal{h}}}
\def\ri{{\textnormal{i}}}
\def\rj{{\textnormal{j}}}
\def\rk{{\textnormal{k}}}
\def\rl{{\textnormal{l}}}
\def\rn{{\textnormal{n}}}
\def\ro{{\textnormal{o}}}
\def\rp{{\textnormal{p}}}
\def\rq{{\textnormal{q}}}
\def\rr{{\textnormal{r}}}
\def\rs{{\textnormal{s}}}
\def\rt{{\textnormal{t}}}
\def\ru{{\textnormal{u}}}
\def\rv{{\textnormal{v}}}
\def\rw{{\textnormal{w}}}
\def\rx{{\textnormal{x}}}
\def\ry{{\textnormal{y}}}
\def\rz{{\textnormal{z}}}
\def\rvepsilon{{\mathbf{\epsilon}}}
\def\rvtheta{{\mathbf{\theta}}}
\def\rva{{\mathbf{a}}}
\def\rvb{{\mathbf{b}}}
\def\rvc{{\mathbf{c}}}
\def\rvd{{\mathbf{d}}}
\def\rve{{\mathbf{e}}}
\def\rvf{{\mathbf{f}}}
\def\rvg{{\mathbf{g}}}
\def\rvh{{\mathbf{h}}}
\def\rvi{{\mathbf{i}}}
\def\rvj{{\mathbf{j}}}
\def\rvk{{\mathbf{k}}}
\def\rvl{{\mathbf{l}}}
\def\rvm{{\mathbf{m}}}
\def\rvn{{\mathbf{n}}}
\def\rvo{{\mathbf{o}}}
\def\rvp{{\mathbf{p}}}
\def\rvq{{\mathbf{q}}}
\def\rvr{{\mathbf{r}}}
\def\rvs{{\mathbf{s}}}
\def\rvt{{\mathbf{t}}}
\def\rvu{{\mathbf{u}}}
\def\rvv{{\mathbf{v}}}
\def\rvw{{\mathbf{w}}}
\def\rvx{{\mathbf{x}}}
\def\rvy{{\mathbf{y}}}
\def\rvz{{\mathbf{z}}}
\def\erva{{\textnormal{a}}}
\def\ervb{{\textnormal{b}}}
\def\ervc{{\textnormal{c}}}
\def\ervd{{\textnormal{d}}}
\def\erve{{\textnormal{e}}}
\def\ervf{{\textnormal{f}}}
\def\ervg{{\textnormal{g}}}
\def\ervh{{\textnormal{h}}}
\def\ervi{{\textnormal{i}}}
\def\ervj{{\textnormal{j}}}
\def\ervk{{\textnormal{k}}}
\def\ervl{{\textnormal{l}}}
\def\ervm{{\textnormal{m}}}
\def\ervn{{\textnormal{n}}}
\def\ervo{{\textnormal{o}}}
\def\ervp{{\textnormal{p}}}
\def\ervq{{\textnormal{q}}}
\def\ervr{{\textnormal{r}}}
\def\ervs{{\textnormal{s}}}
\def\ervt{{\textnormal{t}}}
\def\ervu{{\textnormal{u}}}
\def\ervv{{\textnormal{v}}}
\def\ervw{{\textnormal{w}}}
\def\ervx{{\textnormal{x}}}
\def\ervy{{\textnormal{y}}}
\def\ervz{{\textnormal{z}}}
\def\rmA{{\mathbf{A}}}
\def\rmB{{\mathbf{B}}}
\def\rmC{{\mathbf{C}}}
\def\rmD{{\mathbf{D}}}
\def\rmE{{\mathbf{E}}}
\def\rmF{{\mathbf{F}}}
\def\rmG{{\mathbf{G}}}
\def\rmH{{\mathbf{H}}}
\def\rmI{{\mathbf{I}}}
\def\rmJ{{\mathbf{J}}}
\def\rmK{{\mathbf{K}}}
\def\rmL{{\mathbf{L}}}
\def\rmM{{\mathbf{M}}}
\def\rmN{{\mathbf{N}}}
\def\rmO{{\mathbf{O}}}
\def\rmP{{\mathbf{P}}}
\def\rmQ{{\mathbf{Q}}}
\def\rmR{{\mathbf{R}}}
\def\rmS{{\mathbf{S}}}
\def\rmT{{\mathbf{T}}}
\def\rmU{{\mathbf{U}}}
\def\rmV{{\mathbf{V}}}
\def\rmW{{\mathbf{W}}}
\def\rmX{{\mathbf{X}}}
\def\rmY{{\mathbf{Y}}}
\def\rmZ{{\mathbf{Z}}}
\def\ermA{{\textnormal{A}}}
\def\ermB{{\textnormal{B}}}
\def\ermC{{\textnormal{C}}}
\def\ermD{{\textnormal{D}}}
\def\ermE{{\textnormal{E}}}
\def\ermF{{\textnormal{F}}}
\def\ermG{{\textnormal{G}}}
\def\ermH{{\textnormal{H}}}
\def\ermI{{\textnormal{I}}}
\def\ermJ{{\textnormal{J}}}
\def\ermK{{\textnormal{K}}}
\def\ermL{{\textnormal{L}}}
\def\ermM{{\textnormal{M}}}
\def\ermN{{\textnormal{N}}}
\def\ermO{{\textnormal{O}}}
\def\ermP{{\textnormal{P}}}
\def\ermQ{{\textnormal{Q}}}
\def\ermR{{\textnormal{R}}}
\def\ermS{{\textnormal{S}}}
\def\ermT{{\textnormal{T}}}
\def\ermU{{\textnormal{U}}}
\def\ermV{{\textnormal{V}}}
\def\ermW{{\textnormal{W}}}
\def\ermX{{\textnormal{X}}}
\def\ermY{{\textnormal{Y}}}
\def\ermZ{{\textnormal{Z}}}
\def\vzero{{\bm{0}}}
\def\vone{{\bm{1}}}
\def\vmu{{\bm{\mu}}}
\def\vtheta{{\bm{\theta}}}
\def\va{{\bm{a}}}
\def\vb{{\bm{b}}}
\def\vc{{\bm{c}}}
\def\vd{{\bm{d}}}
\def\ve{{\bm{e}}}
\def\vf{{\bm{f}}}
\def\vg{{\bm{g}}}
\def\vh{{\bm{h}}}
\def\vi{{\bm{i}}}
\def\vj{{\bm{j}}}
\def\vk{{\bm{k}}}
\def\vl{{\bm{l}}}
\def\vm{{\bm{m}}}
\def\vn{{\bm{n}}}
\def\vo{{\bm{o}}}
\def\vp{{\bm{p}}}
\def\vq{{\bm{q}}}
\def\vr{{\bm{r}}}
\def\vs{{\bm{s}}}
\def\vt{{\bm{t}}}
\def\vu{{\bm{u}}}
\def\vv{{\bm{v}}}
\def\vw{{\bm{w}}}
\def\vx{{\bm{x}}}
\def\vy{{\bm{y}}}
\def\vz{{\bm{z}}}
\def\evalpha{{\alpha}}
\def\evbeta{{\beta}}
\def\evepsilon{{\epsilon}}
\def\evlambda{{\lambda}}
\def\evomega{{\omega}}
\def\evmu{{\mu}}
\def\evpsi{{\psi}}
\def\evsigma{{\sigma}}
\def\evtheta{{\theta}}
\def\eva{{a}}
\def\evb{{b}}
\def\evc{{c}}
\def\evd{{d}}
\def\eve{{e}}
\def\evf{{f}}
\def\evg{{g}}
\def\evh{{h}}
\def\evi{{i}}
\def\evj{{j}}
\def\evk{{k}}
\def\evl{{l}}
\def\evm{{m}}
\def\evn{{n}}
\def\evo{{o}}
\def\evp{{p}}
\def\evq{{q}}
\def\evr{{r}}
\def\evs{{s}}
\def\evt{{t}}
\def\evu{{u}}
\def\evv{{v}}
\def\evw{{w}}
\def\evx{{x}}
\def\evy{{y}}
\def\evz{{z}}
\def\mA{{\bm{A}}}
\def\mB{{\bm{B}}}
\def\mC{{\bm{C}}}
\def\mD{{\bm{D}}}
\def\mE{{\bm{E}}}
\def\mF{{\bm{F}}}
\def\mG{{\bm{G}}}
\def\mH{{\bm{H}}}
\def\mI{{\bm{I}}}
\def\mJ{{\bm{J}}}
\def\mK{{\bm{K}}}
\def\mL{{\bm{L}}}
\def\mM{{\bm{M}}}
\def\mN{{\bm{N}}}
\def\mO{{\bm{O}}}
\def\mP{{\bm{P}}}
\def\mQ{{\bm{Q}}}
\def\mR{{\bm{R}}}
\def\mS{{\bm{S}}}
\def\mT{{\bm{T}}}
\def\mU{{\bm{U}}}
\def\mV{{\bm{V}}}
\def\mW{{\bm{W}}}
\def\mX{{\bm{X}}}
\def\mY{{\bm{Y}}}
\def\mZ{{\bm{Z}}}
\def\mBeta{{\bm{\beta}}}
\def\mPhi{{\bm{\Phi}}}
\def\mLambda{{\bm{\Lambda}}}
\def\mSigma{{\bm{\Sigma}}}
\newcommand{\tens}[1]{\bm{\mathsfit{#1}}}
\def\tA{{\tens{A}}}
\def\tB{{\tens{B}}}
\def\tC{{\tens{C}}}
\def\tD{{\tens{D}}}
\def\tE{{\tens{E}}}
\def\tF{{\tens{F}}}
\def\tG{{\tens{G}}}
\def\tH{{\tens{H}}}
\def\tI{{\tens{I}}}
\def\tJ{{\tens{J}}}
\def\tK{{\tens{K}}}
\def\tL{{\tens{L}}}
\def\tM{{\tens{M}}}
\def\tN{{\tens{N}}}
\def\tO{{\tens{O}}}
\def\tP{{\tens{P}}}
\def\tQ{{\tens{Q}}}
\def\tR{{\tens{R}}}
\def\tS{{\tens{S}}}
\def\tT{{\tens{T}}}
\def\tU{{\tens{U}}}
\def\tV{{\tens{V}}}
\def\tW{{\tens{W}}}
\def\tX{{\tens{X}}}
\def\tY{{\tens{Y}}}
\def\tZ{{\tens{Z}}}
\def\gA{{\mathcal{A}}}
\def\gB{{\mathcal{B}}}
\def\gC{{\mathcal{C}}}
\def\gD{{\mathcal{D}}}
\def\gE{{\mathcal{E}}}
\def\gF{{\mathcal{F}}}
\def\gG{{\mathcal{G}}}
\def\gH{{\mathcal{H}}}
\def\gI{{\mathcal{I}}}
\def\gJ{{\mathcal{J}}}
\def\gK{{\mathcal{K}}}
\def\gL{{\mathcal{L}}}
\def\gM{{\mathcal{M}}}
\def\gN{{\mathcal{N}}}
\def\gO{{\mathcal{O}}}
\def\gP{{\mathcal{P}}}
\def\gQ{{\mathcal{Q}}}
\def\gR{{\mathcal{R}}}
\def\gS{{\mathcal{S}}}
\def\gT{{\mathcal{T}}}
\def\gU{{\mathcal{U}}}
\def\gV{{\mathcal{V}}}
\def\gW{{\mathcal{W}}}
\def\gX{{\mathcal{X}}}
\def\gY{{\mathcal{Y}}}
\def\gZ{{\mathcal{Z}}}
\def\sA{{\mathbb{A}}}
\def\sB{{\mathbb{B}}}
\def\sC{{\mathbb{C}}}
\def\sD{{\mathbb{D}}}
\def\sF{{\mathbb{F}}}
\def\sG{{\mathbb{G}}}
\def\sH{{\mathbb{H}}}
\def\sI{{\mathbb{I}}}
\def\sJ{{\mathbb{J}}}
\def\sK{{\mathbb{K}}}
\def\sL{{\mathbb{L}}}
\def\sM{{\mathbb{M}}}
\def\sN{{\mathbb{N}}}
\def\sO{{\mathbb{O}}}
\def\sP{{\mathbb{P}}}
\def\sQ{{\mathbb{Q}}}
\def\sR{{\mathbb{R}}}
\def\sS{{\mathbb{S}}}
\def\sT{{\mathbb{T}}}
\def\sU{{\mathbb{U}}}
\def\sV{{\mathbb{V}}}
\def\sW{{\mathbb{W}}}
\def\sX{{\mathbb{X}}}
\def\sY{{\mathbb{Y}}}
\def\sZ{{\mathbb{Z}}}
\def\emLambda{{\Lambda}}
\def\emA{{A}}
\def\emB{{B}}
\def\emC{{C}}
\def\emD{{D}}
\def\emE{{E}}
\def\emF{{F}}
\def\emG{{G}}
\def\emH{{H}}
\def\emI{{I}}
\def\emJ{{J}}
\def\emK{{K}}
\def\emL{{L}}
\def\emM{{M}}
\def\emN{{N}}
\def\emO{{O}}
\def\emP{{P}}
\def\emQ{{Q}}
\def\emR{{R}}
\def\emS{{S}}
\def\emT{{T}}
\def\emU{{U}}
\def\emV{{V}}
\def\emW{{W}}
\def\emX{{X}}
\def\emY{{Y}}
\def\emZ{{Z}}
\def\emSigma{{\Sigma}}
\newcommand{\etens}[1]{\mathsfit{#1}}
\def\etLambda{{\etens{\Lambda}}}
\def\etA{{\etens{A}}}
\def\etB{{\etens{B}}}
\def\etC{{\etens{C}}}
\def\etD{{\etens{D}}}
\def\etE{{\etens{E}}}
\def\etF{{\etens{F}}}
\def\etG{{\etens{G}}}
\def\etH{{\etens{H}}}
\def\etI{{\etens{I}}}
\def\etJ{{\etens{J}}}
\def\etK{{\etens{K}}}
\def\etL{{\etens{L}}}
\def\etM{{\etens{M}}}
\def\etN{{\etens{N}}}
\def\etO{{\etens{O}}}
\def\etP{{\etens{P}}}
\def\etQ{{\etens{Q}}}
\def\etR{{\etens{R}}}
\def\etS{{\etens{S}}}
\def\etT{{\etens{T}}}
\def\etU{{\etens{U}}}
\def\etV{{\etens{V}}}
\def\etW{{\etens{W}}}
\def\etX{{\etens{X}}}
\def\etY{{\etens{Y}}}
\def\etZ{{\etens{Z}}}
\newcommand{\pdata}{p_{\rm{data}}}
\newcommand{\ptrain}{\hat{p}_{\rm{data}}}
\newcommand{\Ptrain}{\hat{P}_{\rm{data}}}
\newcommand{\pmodel}{p_{\rm{model}}}
\newcommand{\Pmodel}{P_{\rm{model}}}
\newcommand{\ptildemodel}{\tilde{p}_{\rm{model}}}
\newcommand{\pencode}{p_{\rm{encoder}}}
\newcommand{\pdecode}{p_{\rm{decoder}}}
\newcommand{\precons}{p_{\rm{reconstruct}}}
\newcommand{\laplace}{\mathrm{Laplace}}
\newcommand{\E}{\mathbb{E}}
\newcommand{\Ls}{\mathcal{L}}
\newcommand{\R}{\mathbb{R}}
\newcommand{\emp}{\tilde{p}}
\newcommand{\lr}{\alpha}
\newcommand{\reg}{\lambda}
\newcommand{\rect}{\mathrm{rectifier}}
\newcommand{\softmax}{\mathrm{softmax}}
\newcommand{\sigmoid}{\sigma}
\newcommand{\softplus}{\zeta}
\newcommand{\KL}{D_{\mathrm{KL}}}
\newcommand{\Var}{\mathrm{Var}}
\newcommand{\standarderror}{\mathrm{SE}}
\newcommand{\Cov}{\mathrm{Cov}}
\newcommand{\normlzero}{L^0}
\newcommand{\normlone}{L^1}
\newcommand{\normltwo}{L^2}
\newcommand{\normlp}{L^p}
\newcommand{\normmax}{L^\infty}
\newcommand{\parents}{Pa}
\DeclareMathOperator*{\argmax}{arg\,max}
\DeclareMathOperator*{\argmin}{arg\,min}
\DeclareMathOperator{\sign}{sign}
\DeclareMathOperator{\Tr}{Tr}
\let\ab\allowbreak
\def \blue #1{\textcolor{blue}{#1}}
\def\Nn{N_{*}}
\def\baralp{\bar{\alpha}}
\def\dd {\mathrm{d}}
\def\epsb{{\bm{\epsilon}}}
\renewcommand\footnotemark{}
\def\UrlAlphabet{%
      \do\a\do\b\do\c\do\d\do\e\do\f\do\g\do\h\do\i\do\j%
      \do\k\do\l\do\m\do\n\do\o\do\p\do\q\do\r\do\s\do\t%
      \do\u\do\v\do\w\do\x\do\y\do\z\do\A\do\B\do\C\do\D%
      \do\E\do\F\do\G\do\H\do\I\do\J\do\K\do\L\do\M\do\N%
      \do\O\do\P\do\Q\do\R\do\S\do\T\do\U\do\V\do\W\do\X%
      \do\Y\do\Z}
\def\UrlDigits{\do\1\do\2\do\3\do\4\do\5\do\6\do\7\do\8\do\9\do\0}
\newcommand{\fix}{\marginpar{FIX}}
\newcommand{\new}{\marginpar{NEW}}
\newcommand{\crps}{$\text{CRPS}_{\text{sum}}$ }
\newcommand{\crpsnospace}{$\text{CRPS}_{\text{sum}}$}
\newcommand{\mgtsd}{\textbf{MG-TSD} }
\newcommand{\mgtsdnospace}{\textbf{MG-TSD}}
\maketitle

`\noindent `{=latex}University of British Columbia$^{1}$, Peking University$^{2}$, Nanjing University$^{3}$, Microsoft Research$^{4}$

`\noindent `{=latex}`xinyao.fan@stat.ubc.ca, wuyueying@stu.pku.edu.cn, huangyh@smail.nju.edu.cn,`\
`{chanx, weiqing.liu, jiang.bian}@microsoft.com `

# Introduction

Time series prediction is a critical task with applications in various domains such as financial forecasting [@hou2021stock; @chen2018incorporating], energy planning [@koprinska2018convolutional; @wu2021autoformer], climate modeling [@wu2023interpretable; @wu2021autoformer], and biological sciences [@luo2020hitanet; @rajpurkar2022ai]. Because time series forecasting can be naturally framed as a conditional generation task, many works leverage generative models for prediction. For instance, @salinas2019high utilizes a low-rank plus diagonal covariance Gaussian copula; @rasul2021autoregressive models the predictive distribution using normalizing flows. Recent advancements in diffusion probabilistic models [@ho2020denoising] have sparked interest in applying them to probabilistic time series prediction. For example, @rasul2020multivariate auto-regressively generates data through iterative denoising diffusion models. @tashiro2021csdi uses a conditional score-based diffusion model explicitly trained for probabilistic time series imputation and prediction. These diffusion-based methods have exhibited remarkable predictive capabilities. However, there is still considerable scope for improvement. One challenge that diffusion models face in time series forecasting tasks is the instability due to their stochastic nature when compared to deterministic models like RNNs and variants like LSTMs [@hochreiter1997lstm; @laiModelingLongShortTerm2018], GRUs [@ballakur2020Empirical; @yamak2019comparison], and Transformers that rely on self-attention mechanisms [@Vaswani2017attention; @zhouInformerEfficientTransformer2021; @zhou2022fedformer; @wu2021autoformer]. More specifically, the diffusion models yield diverse samples from the conditional distributions, including possible low-fidelity samples from the low-density regions within the data manifold [@sehwag2022generating].
In the context of time series forecasting, where fixed observations exclusively serve as objectives, such variability would result in forecasting instability and inferior prediction performance.

\begin{wrapfigure}{r}{0.45\textwidth}

\includegraphics[width=0.45\textwidth]{fig_MG-TSD.pdf}
\caption{The process of smoothing data from finest-grained to coarsest-grained naturally aligns with the diffusion process.}
\label{fig-concept}
\end{wrapfigure}

To stabilize the output of a diffusion model in time series prediction, one straightforward method is to constrain the intermediate states during the sampling process. Prior research in the realm of diffusion models has introduced the idea of classifier guidance [@nichol2021glide] and classifier-free guidance [@ho2022classifier], where the predicted posterior mean is shifted with the gradient of either an explicit or an implicit classifier. However, these methods require labels as the source of guidance while sampling, which are unavailable during out-of-sample inference. We observe that the forward process of the diffusion model, which sequentially corrupts the data distribution to a standard normal distribution, intuitively aligns with the process of smoothing fine-grained data into a coarser-grained representation, both of which result in a gradual loss of finer distribution features. This observation suggests that intrinsic features within data granularities may also serve as a source of guidance.

In this paper, we propose a novel **M**ulti-**G**ranularity **T**ime **S**eries **D**iffusion (`\mgtsdnospace`{=latex}) model that leverages multiple granularity levels within data to guide the learning process of diffusion models. The coarse-grained data at different granularity levels are utilized as targets to guide the learning of the denoising process. These targets serve as constraints for the intermediate latent states, ensuring a regularized sampling path that preserves the trends and patterns within the coarse-grained data. They introduce an inductive bias that promotes the generation of coarser features during intermediate steps and facilitates the recovery of finer features in subsequent diffusion steps. Consequently, this design reduces variability and results in high-quality predictions. Our key contributions can be summarized as follows:

1.  We introduce a novel `\mgtsd `{=latex}model with an innovatively designed multi-granularity guidance loss function that efficiently guides the diffusion learning process, resulting in reliable sampling paths and more precise forecasting results.

2.  We provide a concise implementation that leverages coarse-grained data instances at various granularity levels. Furthermore, we explore the optimal configuration for different granularity levels and propose a practical rule of thumb.

3.  Extensive experiments conducted on real-world datasets demonstrate the superiority of the proposed model, achieving the best performance compared to the state-of-the-art methods.

# Background

## Denoising diffusion probabilistic models {#sec-ddpm}

Suppose $\vx_{0}\sim q_{\mathcal{X}}(\vx_{0})$ is a multivariate vector from space $\mathcal{X}=\mathbb{R}^{D}$. Denoising diffusion probabilistic models aim to learn a model distribution $p_\theta(\vx_0)$ that approximates the data distribution $q(\vx_0)$. Briefly, they are latent variable models of the form $p_{\theta}(\vx_{0})=\int p_{\theta}(\vx_{0:N})\dd \vx_{1:N}$, where $\vx_n$ for $n = 1,\ldots,N$ is a sequence of latent variables in the same sample space as $\vx_0$. The denoising diffusion models are composed of two processes: the forward process and the reverse process. During the forward process, a small amount of Gaussian noise is added gradually to the samples over $N$ steps. It is characterized by the following Markov chain: ${q(\vx_{1:N}|\vx_0) = \prod^N_{n=1} q(\vx_n | \vx_{n-1})}, \ \text{where} \ q (\vx_{n}|\vx_{n-1}):= \mathcal{N} (\vx_{n}; \sqrt{1-\beta_n}\vx_{n-1}, \beta_n \mI)$. The step sizes are controlled by a variance schedule $\{\beta_{n}\in (0,1)\}_{n=1}^{N}$, where $n$ represents a diffusion step. A useful property of this process is that one can sample at an arbitrary diffusion step in closed form. Let $\alpha_n := 1 - \beta_n$ and $\baralp_n =\prod^n_{i=1} \alpha_i$. Then $\vx_{n}=\sqrt{\baralp_n}\vx_{0}+ \sqrt{1-\baralp_{n}}\epsb$, where $\epsb\sim \mathcal{N}(\textbf{0},\mI)$. The reverse diffusion process recreates real samples from a Gaussian noise input. It is defined as a Markov chain with learned Gaussian transitions starting with $p(\vx_N)=\mathcal{N}(\vx_N;\textbf{0},\mI)$. The reverse process is characterized as $p_\theta(\vx_{0:N}) := p(\vx_{N})\prod^1_{n=N} p_\theta(\vx_{n-1} {|} \vx_{n}),\ \text{where}\ p_\theta(\vx_{n-1} {|} \vx_{n}) := \mathcal{N}(\vx_{n-1}; \mu_\theta(\vx_{n}, n), \Sigma_\theta(\vx_n, n)\mI)$; $\mu_{\theta}:\R^{D}\times \mathbb{N} \to \R^{D}$ and $\Sigma_{\theta}: \R^{D}\times \mathbb{N} \to \R^{+}$ take the variable $\vx_{n}\in \R^{D}$ and the diffusion step $n\in \mathbb{N}$ as inputs, and share the parameters $\theta$.
The parameters in the model are optimized to minimize the negative log-likelihood $\min_{\theta} \E_{\vx_{0}\sim q(\vx_0)}[-\log p_{\theta}(\vx_{0})]$ via a variational bound. According to denoising diffusion probabilistic models (DDPM) in @ho2020denoising, the parameterization of $p_\theta(\vx_{n-1}| \vx_n)$ is chosen as: $$\begin{equation}
\label{eq-mu}
\mu_\theta(\vx_n, n) = {\frac{1}{\sqrt{\alpha_n}}} \bigg( \vx_n- \frac{1-\alpha_n}{\sqrt{1-\baralp_n}} \epsb_\theta(  \sqrt{\baralp_n}\vx_{0}+\sqrt{1-\baralp_{n}}\epsb,n)\bigg),
\end{equation}$$ where $\epsb_{\theta}$ is a network which predicts $\epsb\sim \mathcal{N}(\textbf{0},\textbf{I})$ from $\vx_{n}$. We simplify the objective function into $$\begin{equation}
\label{eq-diffusion-loss}
L_{n}^{\text{simple}}=
\E_{n,\epsb_n,\vx_0}\bigg[\|\epsb_{n}-\epsb_{\theta}(\sqrt{\baralp_n}\vx_0+\sqrt{1-\baralp_n}\epsb_n,n)\|^2\bigg].
\end{equation}$$ Once trained, we can iteratively sample from the reverse process $p_{\theta}(\vx_{n-1}|\vx_{n})$ to reconstruct $\vx_0$.
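
The closed-form forward sampling and the simplified noise-prediction loss above can be illustrated with a minimal NumPy sketch. The linear variance schedule, its endpoints, the helper names (`make_schedule`, `q_sample`, `simple_loss`), and the zero-output placeholder standing in for $\epsb_{\theta}$ are all assumptions made for illustration; they are not taken from the released code.

```python
import numpy as np

def make_schedule(num_steps, beta_start=1e-4, beta_end=0.1):
    """Linear variance schedule (an assumed choice) and its cumulative products."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)  # alpha_bar_n = prod_{i<=n} alpha_i
    return betas, alphas, alpha_bars

def q_sample(x0, n, alpha_bars, rng):
    """Draw x_n ~ q(x_n | x_0) in closed form; n is 1-indexed."""
    eps = rng.standard_normal(x0.shape)
    abar = alpha_bars[n - 1]
    x_n = np.sqrt(abar) * x0 + np.sqrt(1.0 - abar) * eps
    return x_n, eps

def simple_loss(eps, eps_pred):
    """Squared error between the true and the predicted noise for one sample."""
    return float(np.sum((eps - eps_pred) ** 2))

rng = np.random.default_rng(0)
betas, alphas, alpha_bars = make_schedule(num_steps=100)
x0 = rng.standard_normal(8)
x_n, eps = q_sample(x0, n=50, alpha_bars=alpha_bars, rng=rng)

# Placeholder "network" that always predicts zero noise, only to exercise the loss.
loss = simple_loss(eps, np.zeros_like(eps))
```

In an actual training loop the diffusion step $n$ would be drawn at random for each sample and the placeholder would be replaced by a trained network.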

## TimeGrad Model {#sec-timegrad}

We treat the time series forecasting task as a conditional generation task and utilize the diffusion models presented in `\secref{sec-ddpm}`{=latex} as the backbone generative model. TimeGrad, proposed by @rasul2020multivariate, was the first work to explore the use of diffusion models for forecasting multivariate time series. Consider a contiguous time series sampled from the complete history training data, indexed from 1 to $T$. This time series is partitioned into a context window of interval $[1,t_0)$ and a prediction interval $[t_0,T]$. TimeGrad utilizes diffusion models from @ho2020denoising to learn the conditional distribution of the future timesteps of the multivariate time series given their past. An RNN is employed to capture the temporal dependencies, and the time series sequence up to timestep $t$ is encoded in the updated hidden state $\rvh_{t}$. Mathematically, TimeGrad models $q_{\mathcal{X}}(\vx_{t_0:T}|\vx_{1:t_0-1})=\prod_{t=t_0}^{T}q_{\mathcal{X}}(\vx_t|\vx_{1:t-1})\approx \prod_{t=t_0}^{T}q_{\mathcal{X}}(\vx_{t}|\rvh_{t-1})$, where $\vx_{t}\in\R^{D}$ denotes the time series at timestep $t$ and $\rvh_{t}=\text{RNN}_{\psi}(\vx_t,\rvh_{t-1})$. Each factor is learned via a shared conditional denoising diffusion model. In contrast to @ho2020denoising, the hidden states $\rvh_{t-1}$ are taken as an additional input in the denoising network $\epsb_{\theta}(\vx_{n,t},n,\rvh_{t-1})$, and the loss function for timestep $t$ and diffusion step $n$ is given by: $$\begin{equation}
\E_{\epsb,\vx_{0,t},n}[\|\epsb-\epsb_{\theta}(\sqrt{\baralp_n}\vx_{0,t}+\sqrt{1-\baralp_n}\epsb,n,\rvh_{t-1})\|^2],
\end{equation}$$ where the first subscript in $\vx_{0,t}$ represents the index of the diffusion step, while $t$ denotes the timestep within the time series.

## Problem formulation

In the time series prediction task, let $\mX^{(1)}$ represent the original observed data. The time series data is denoted as $\mX^{(1)} = [\vx_{1}^1,\ldots,\vx_{t}^1,\ldots, \vx_{T}^1]$, where $t\in [1,T]$ indexes the timestep and $\vx_{t}^1\in \R^{D}$. Specifically, our task is to model the conditional distribution of future timesteps of the time series $[\vx_{t_0}^1,\ldots,\vx_{T}^1]$ given the fixed window of history context. Mathematically, the problem we consider can be formulated as follows: $$\begin{equation}
q_{\mathcal{X}}\bigg(\vx_{t_0:T}^1|\big\{\vx_{1:t_0-1}^1\big\}\bigg)=\prod_{t=t_0}^{T}q_{\mathcal{X}}\bigg(\vx_{t}^1|\big\{\vx_{1:t-1}^1\big\}\bigg).
\end{equation}$$

# Method

In this section, we provide an overview of the `\mgtsd `{=latex}model architecture in Section `\ref{sec-overview}`{=latex}, followed by a detailed discussion of the novel guided diffusion process module in Section `\ref{sec-guide-module}`{=latex}, including the derivation of the heuristic loss function and its implementation across various granularity levels.

## MG-TSD model architecture {#sec-overview}

The proposed methodology consists of three key modules, as depicted in Figure `\ref{fig-archi}`{=latex}.

**Multi-granularity Data Generator** is responsible for generating multi-granularity data from observations. In this module, various coarse-grained time series are obtained by smoothing out the fine-grained data using historical sliding windows with different sizes. Suppose $f$ is a pre-defined smoothing (for example, average) function, and $s^g$ is the pre-defined sliding window size for granularity level $g$. Then $\mX^{(g)}=f(\mX^{(1)},s^g)$. The sliding windows are non-overlapping and the obtained coarse-grained data for granularity $g$ are replicated $s^g$ times to align over the timeline $[1,T]$.
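
As a rough illustration of this module, the sketch below takes $f$ to be the mean, averages over non-overlapping windows of size $s^g$, and repeats each window average so that the coarse series stays aligned with $[1,T]$. The function name `coarsen` and the handling of a final, shorter window are assumptions; the released implementation may align windows differently.

```python
import numpy as np

def coarsen(x, window):
    """Average over non-overlapping windows, then repeat to the original length.

    x has shape (T, D). If T is not a multiple of `window`, the last
    (shorter) window is averaged on its own; that edge handling is an assumption.
    """
    T = x.shape[0]
    out = np.empty_like(x, dtype=float)
    for start in range(0, T, window):
        stop = min(start + window, T)
        # Every timestep in the window receives the same window average.
        out[start:stop] = x[start:stop].mean(axis=0)
    return out

x_fine = np.arange(12, dtype=float).reshape(6, 2)  # T = 6 timesteps, D = 2 dimensions
x_coarse = coarsen(x_fine, window=3)               # two windows of three timesteps each
```

With `window=1` the function returns the finest-grained series unchanged, which matches the convention that $\mX^{(1)}$ is the original data.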

**Temporal Process Module** is designed to capture the temporal dynamics of the multi-granularity time series data. We utilize RNN architecture on each granularity level $g$ separately to encode the time series sequence up to a specific timestep $t$ and the encoded hidden states are denoted as $\rvh_t^{g}$. The RNN cell type is implemented as GRU in @chung2014empirical.
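
The following is a minimal sketch of a per-granularity recurrent encoder, assuming a single-layer GRU cell without bias terms and with randomly initialized weights. The class `GRUCell`, the function `encode`, and the choice of a separate cell for each granularity level are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

class GRUCell:
    """Minimal GRU cell with random weights and no biases (illustrative only)."""

    def __init__(self, input_dim, hidden_dim, rng):
        shape_x, shape_h = (hidden_dim, input_dim), (hidden_dim, hidden_dim)
        self.Wz, self.Wr, self.Wh = (0.1 * rng.standard_normal(shape_x) for _ in range(3))
        self.Uz, self.Ur, self.Uh = (0.1 * rng.standard_normal(shape_h) for _ in range(3))

    def step(self, x, h):
        z = sigmoid(self.Wz @ x + self.Uz @ h)              # update gate
        r = sigmoid(self.Wr @ x + self.Ur @ h)              # reset gate
        h_tilde = np.tanh(self.Wh @ x + self.Uh @ (r * h))  # candidate state
        return (1.0 - z) * h + z * h_tilde

def encode(series, cell, hidden_dim):
    """Run the cell over a (T, D) series and return all hidden states, shape (T, H)."""
    h = np.zeros(hidden_dim)
    states = []
    for x_t in series:
        h = cell.step(x_t, h)
        states.append(h)
    return np.stack(states)

rng = np.random.default_rng(0)
T, D, H, G = 6, 2, 5, 2
series_by_gran = [rng.standard_normal((T, D)) for _ in range(G)]
cells = [GRUCell(D, H, rng) for _ in range(G)]  # one encoder per granularity level (assumed)
hidden_by_gran = [encode(s, c, H) for s, c in zip(series_by_gran, cells)]
```

Row $t$ of each returned array plays the role of $\rvh_t^{g}$, the encoding of granularity $g$ up to timestep $t$.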

**Guided Diffusion Process Module** is designed to generate stable time series predictions at each timestep $t$. We utilize multi-granularity data as given targets to guide the diffusion learning process. A detailed discussion of the module can be found in `\secref{sec-guide-module}`{=latex}.

<figure id="fig-archi" data-latex-placement="!htbp">
<img src="fig_model.png" style="width:98.0%" />
<figcaption>Overview of the Multi-Granularity Time Series Diffusion (MG-TSD) model, consisting of three key modules: <strong>Multi-granularity Data Generator</strong>, <strong>Temporal Process Module (<span>TPM</span>)</strong>, and <strong>Guided Diffusion Process Module</strong> for time series forecasting at a specific granularity level.</figcaption>
</figure>

## Multi-granularity Guided Diffusion {#sec-guide-module}

In this section, we delve into the details of the **Guided Diffusion Process Module**, a key component in our model. `\secref{sec-2-gran-guidance}`{=latex} presents the derivation of a heuristic guidance loss for the two-granularity case. In `\secref{sec-multi-gran-loss}`{=latex}, we generalize the loss to the multi-granularity case and provide a concise implementation to effectively utilize coarse-grained data across various granularity levels. Briefly, the optimization of the heuristic loss function can be simply achieved by training denoising diffusion models on the multi-granularity data with shared denoising network parameters and partially shared variance schedule.

### Coarse-grained guidance {#sec-2-gran-guidance}

Without loss of generality, consider two granularities: finest-grained data $\vx_{t}^{g_1}$ ($g_1=1$) from $\mX^{(g_1)}$ and coarse-grained data $\vx_{t}^{g}$ from $\mX^{(g)}$ at a fixed timestep $t$, where $1<t<T$. We omit the subscript $t$ in the derivation for notation brevity. Suppose the denoising diffusion models presented in `\secref{sec-ddpm}`{=latex} are employed to approximate the distribution $q(\vx^{g_1})$ and let the variance schedule be $\{\beta_{n}^{1}=1-\alpha_{n}^{1}\in (0,1)\}_{n=1}^{N}$. Suppose $\vx_0^{g_1}\sim q(\vx_0^{g_1})$, where the subscript $0$ denotes the index of diffusion step. The diffusion models in `\secref{sec-timegrad}`{=latex} define a forward trajectory $q(\vx^{g_1}_{0:N})$ and a $\theta$-parameterized reverse trajectory $p_{\theta}(\vx^{g_1}_{0:N})$.

While `\secref{sec-timegrad}`{=latex} focuses on predicting samples over a specific timestep, it does not account for the intrinsic structure of time series, such as trends, which are represented by coarse-grained time series. In this paper, we guide the generation of samples by ensuring that the intermediate latent space retains the underlying time series structure. This is achieved by introducing coarse-grained targets $\vx^{g}$ at intermediate diffusion step $\Nn^g \in [1,N-1]$. Specifically, we establish the objective function as the log-likelihood of observed coarse-grained data $\vx^g$ evaluated at the marginal distributions at diffusion step $\Nn^g$, which can be expressed as $\log p_{\theta}(\vx^{g})$. With an appropriate choice of diffusion step $\Nn^g$, the coarser features recovered from the denoising process could gain information from the realistic coarse-grained sample. Recall that the marginal distribution of latent variable at denoising step $\Nn^g$ determined by the $\theta$-parameterized trajectory $p_{\theta}(\vx_{\Nn^g:N})$ can be expressed as: $$\begin{equation}
\label{eq-ptheta}
    p_{\theta}(\vx_{\Nn^{g}})=\int p_{\theta}(\vx_{\Nn^g:N})\dd \vx_{(\Nn^g+1):N}=\int p(\vx_N)\prod_{n=\Nn^g+1}^{N}p_\theta(\vx_{n-1} \vert \vx_n)\dd \vx_{(\Nn^g+1):N},
\end{equation}$$ where $\vx_N\sim \mathcal{N}(\bm{0},\mI)$, $p_\theta(\vx_{n-1} \vert \vx_n) = \mathcal{N}(\vx_{n-1}; \vmu_\theta(\vx_n, n), \boldsymbol{\Sigma}_\theta(\vx_n, n))$.

To make the objective tractable, a common technique involves optimizing a variational lower bound on the likelihood in `\eqref{eq-ptheta}`{=latex}. This can be achieved by specifying a latent variable sequence of length $N-\Nn^g$, such that the joint distribution of $\vx^{g}$ and these latent variables is available. Conveniently, we employ a diffusion process on $\vx^{g}$ with a total of $N-\Nn^g$ diffusion steps, defining a sequence of noisy samples $\vx^{g}_{\Nn^g+1}$, $\ldots$, $\vx^{g}_{N}$ as realizations of the latent variable sequence. Then, the guidance objective can be expressed as: $$\begin{equation}
\label{eq-objective}
\log p_{\theta}(\vx^{g}) = \log \int p_{\theta}(\vx^{g}_{\Nn^g},\vx^{g}_{\Nn^g+1},\ldots,\vx^{g}_{N})\dd \vx^{g}_{(\Nn^g+1):N}.
\end{equation}$$ Applying the same technique as in @ho2020denoising, the guidance objective in `\eqref{eq-objective}`{=latex} can be simplified to a loss function of the same form as that of the diffusion models (see `\appref{sec_derivation_of_loss_function}`{=latex} for proof details): $$\begin{equation}
\label{eq-loss-guidance} 
\E_{\epsb,\vx^{g},n }[\|\epsb-\epsb_{\theta}(\vx^{g}_n,n)\|^2],
\end{equation}$$ where $\vx^{g}_n=\sqrt{\prod_{i=\Nn^g}^{n}\alpha_{i}^{1}}\,\vx^{g}+\sqrt{1-\prod_{i=\Nn^g}^{n}\alpha_i^{1}}\epsb$ and $\epsb\sim \mathcal{N}(\textbf{0},\mI)$. When the variance schedule is chosen as $\{\alpha^1_n\}_{n=\Nn^g}^{N}$, the loss function of the diffusion model in @ho2020denoising is equivalent to the guidance loss function presented in `\eqref{eq-loss-guidance}`{=latex}.

### Multi-granularity Guidance {#sec-multi-gran-loss}

In general, for $G$ granularity levels, the data of different granularities generated by the **Multi-granularity Data Generator** can be represented as $\mX^{(1)},\mX^{(2)},\ldots,\mX^{(G)}$. We expect these coarse-grained data to guide the learning process of the diffusion model at different steps, serving as constraints along the sampling trajectory. For coarse-grained data at granularity level $g$, where $g\in \{2, \ldots, G\}$, we define the **share ratio** as $r^g := 1 - (\Nn^g - 1) / N$. It represents the fraction of the variance schedule shared between the $g$th granularity data and the finest-grained data. For the finest-grained data, $\Nn^1=1$ and $r^1=1$. Formally, the variance schedule for granularity $g$ is defined as $$\begin{equation}
\alpha_{n}^{g}(\Nn^g) = 
\begin{cases}
1 & \text{if} \ n = 1,\ldots,\Nn^g \\
\alpha_{n}^{1} & \text{if } n = \Nn^g+1,\ldots,N,
\end{cases}
\end{equation}$$ and $\{\beta_{n}^{g}\}_{n=1}^{N}=\{1-\alpha_{n}^{g}\}_{n=1}^{N}$. Accordingly, define $a_{n}^{g}(\Nn^g)  = \prod_{k=1}^{n} \alpha_{k}^{g}$, and $b_{n}^{g}(\Nn^g) = 1-a_{n}^{g}(\Nn^g)$. We suppose $\Nn^{1}<\Nn^{2}<\ldots<\Nn^g<\ldots<\Nn^{G}$, where $\Nn^g$ is the diffusion index at which granularity level $g\in\{1,\ldots,G\}$ starts sharing the variance schedule. The starting index $\Nn^g$ is larger for coarser granularity levels, aligning with the intuition that coarser-grained data loses fine distribution features to a greater extent and is expected to resemble the samples from earlier sampling steps.
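
To make the partially shared schedule concrete, the sketch below builds $\alpha_n^{g}$, $a_n^{g}$ and $b_n^{g}$ for one coarse granularity level from a base schedule and computes its share ratio. The linear base schedule, the values $N=10$ and $\Nn^g=4$, and the helper names are assumptions chosen for illustration.

```python
import numpy as np

def granularity_schedule(alphas_fine, n_star):
    """Return (alpha^g, a^g, b^g) for a granularity whose sharing starts after n_star.

    Arrays are 0-indexed, so entry n-1 holds the value for diffusion step n.
    """
    alphas_g = alphas_fine.copy()
    alphas_g[:n_star] = 1.0     # steps 1..n_star add no noise
    a_g = np.cumprod(alphas_g)  # a_n^g = prod_{k<=n} alpha_k^g
    b_g = 1.0 - a_g
    return alphas_g, a_g, b_g

def share_ratio(n_star, num_steps):
    """Share ratio 1 - (n_star - 1) / N from the definition above."""
    return 1.0 - (n_star - 1) / num_steps

N = 10
alphas_fine = 1.0 - np.linspace(0.01, 0.1, N)  # assumed linear beta schedule
alphas_g, a_g, b_g = granularity_schedule(alphas_fine, n_star=4)
r_g = share_ratio(n_star=4, num_steps=N)       # 1 - 3/10
```

Because $a_n^{g}=1$ for $n\le \Nn^g$, the coarse-grained target is left untouched during the first $\Nn^g$ steps and is only noised by the shared tail of the schedule.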

Furthermore, we use the temporal hidden states for granularity level $g$ up to timestep $t$ from the **Temporal Process Module** as conditional inputs for the model to generate time series at corresponding granularity levels similar to @rasul2020multivariate. Then the guidance loss function $L^{(g)}(\theta)$ for $g$th-granularity $\vx_{n,t}^{g}$ at timestep $t$ and diffusion step $n$, can be expressed as: $$\begin{equation}
\label{eq-loss2}
L^{(g)}(\theta)=\E_{\epsb, \vx_{0,t}^{g}, n} \|\epsb-\epsb_{\theta} (\sqrt{a_{n}^{g}} \vx_{0,t}^{g}+\sqrt{b_{n}^{g}}\epsb,n,\rvh^{g}_{t-1})\|^2_2,
\end{equation}$$ where $\rvh_{t}^{g} = \text{RNN}_{\theta}(\vx_{t}^{g}, \rvh_{t-1}^{g})$ denotes the hidden state updated from the previous timestep.

The guidance loss function over the $G-1$ coarse granularity levels is $L^{\text{guidance}}(\theta)=\sum_{g=2}^{G}\omega^gL^{(g)}(\theta)$, where $\omega^g\in [0,1]$ is a hyper-parameter controlling the scale of guidance from granularity $g$.

**Training.** The training algorithm is in Algorithm `\ref{alg-training}`{=latex}. The final training objective is the weighted summation of loss for all granularities, including the finest granularity: $$\begin{equation}
\label{eq-loss}
L^{\text{final}}=\omega^1{L}^{(1)}(\theta) +  L^{\text{guidance}}(\theta)=\sum_{g=1}^{G}\omega^g\E_{\epsb,\vx^{g}_{0,t},n }[\|\epsb-\epsb_{\theta}(\vx^{g}_{n,t},n,\rvh_{t-1}^{g})\|^2],
\end{equation}$$ where $\vx_{n,t}^g=\sqrt{a_{n}^{g}}\vx_{0,t}^{g}+\sqrt{b_{n}^{g}}\epsb$ and $\sum_{g=1}^{G}\omega^g=1$. The denoising network parameters are shared across all granularities during training.
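
As a rough illustration of the objective above, the following sketch computes a one-sample Monte Carlo estimate of $L^{\text{final}}$ for scalar data. The function names, the toy noise predictor, and the constant $\alpha$ schedule are hypothetical; in particular, the granularity index `g` passed to `eps_model` merely stands in for the hidden state $\rvh_{t-1}^{g}$.

```python
import math
import random


def noisy_sample(x0, a_n, eps):
    # x_n = sqrt(a_n) * x_0 + sqrt(b_n) * eps, with b_n = 1 - a_n
    return math.sqrt(a_n) * x0 + math.sqrt(1.0 - a_n) * eps


def final_loss(x0_by_gran, a_by_gran, weights, eps_model, n, eps):
    """One-sample estimate of the weighted multi-granularity loss.

    x0_by_gran[g]: clean scalar target for granularity g.
    a_by_gran[g]:  list of a_n^g values (index n - 1).
    weights[g]:    omega^g, expected to sum to 1.
    eps_model:     callable (x_n, n, g) -> predicted noise (toy stand-in).
    """
    total = 0.0
    for g, x0 in enumerate(x0_by_gran):
        x_n = noisy_sample(x0, a_by_gran[g][n - 1], eps)
        total += weights[g] * (eps - eps_model(x_n, n, g)) ** 2
    return total


random.seed(0)
a_fine = [0.99 ** k for k in range(1, 11)]                # finest level
a_coarse = [1.0] * 5 + [0.99 ** k for k in range(1, 6)]   # sharing starts after step 5
n = random.randint(1, 10)       # diffusion step shared by all granularities
eps = random.gauss(0.0, 1.0)    # noise shared by all granularities

loss = final_loss(
    x0_by_gran=[1.3, 1.1],
    a_by_gran=[a_fine, a_coarse],
    weights=[0.8, 0.2],
    eps_model=lambda x_n, n, g: 0.0,  # trivial predictor that always outputs zero
    n=n,
    eps=eps,
)
```

Because the weights sum to one and the toy predictor always returns zero, the estimate here reduces to $\epsb^2$; a trained network would replace the lambda.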

\begin{algorithm}[ht]\caption{Training procedure}
\label{alg-training}
 \textbf{Input:} Context interval $[1, t_0)$; prediction interval $[t_0, T]$; number of diffusion steps $N$; a set of share ratios for the $G$ granularities (or equivalently $\{N_{*}^{g},g \in \{1,\ldots,G\}\}$); generated multi-granularity data $[\vx_{1}^{g}, \ldots, \vx_{t_0}^{g},\ldots,\vx_{T}^{g}], g \in \{1,\ldots,G\}$; initial hidden states $\rvh_{0}^{g}, g \in \{1,\ldots,G\}$\\
 \textbf{repeat}
 \begin{algorithmic}[1]
 \State Sample the multi-granularity time series $[\vx_{1}^{g},\ldots,\vx_{T}^{g}], g\in\{1,\ldots,G\}$.
 \State Obtain $\rvh_{t}^{g} = \text{RNN}^{g}(\vx_{{t}}^{g},\rvh_{t-1}^{g}), g \in \{1,\ldots,G\}$, $t \in [1,\ldots, T]$.
 \For{$t = t_0$ to $T$}
 \State Initialize $n \sim \text{Uniform}(1,\ldots, N)$ and $\epsb \sim \mathcal{N}(\mathbf{0},\mI)$
 \State Reset the variance schedule $\{\beta_{n}^{g}=1-\alpha_{n}^{g}(N_{*}^{g})\}_{n=1}^{N}$, $g \in \{1,\ldots,G\}$.
 \State Compute loss $L^\text{final}$ according to \eqref{eq-loss}
 \State Take the gradient $\nabla_\theta L^\text{final}$
 \EndFor
\end{algorithmic}
\textbf{until} converged
\end{algorithm}

**Inference.** Once the model is trained, our goal is to make predictions on the finest-grained data, up to a certain number of future prediction steps. Suppose that the last context window ends at timestep $t_0-1$. We then use Algorithm `\ref{alg-inference}`{=latex} to perform the sampling procedure and generate a sample $\vx_{t_0}^{1}$ for the next timestep. This process is repeated until the desired forecast horizon is reached. With different hidden states as conditional inputs, the model can sample time series at the respective granularity levels.

\begin{algorithm}[ht]\caption{Inference procedure for each timestep $t\in [t_0,T]$}
\label{alg-inference}
\textbf{Input:} Noise $\vx_{N,t}^{g}\sim \mathcal{N}(\mathbf{0},\mI)$ and hidden states $\rvh^{g}_{t-1}$, $g\in \{1,\ldots,G\}$

\begin{algorithmic}[1]
\For{$n = N$ to $1$}
    \If{$n > 1$}
        \State Sample $\vz \sim \mathcal{N}(\mathbf{0}, \mI)$
    \Else
        \State $\vz = \mathbf{0}$
    \EndIf
    \For{$g = 1$ to $G$}
    \State $\vx_{n-1,t}^{g} = \frac{1}{\sqrt{\alpha_{n}^{g}}} (\vx_{n,t}^{g} - \frac{\beta_n^{g}} {\sqrt{1 - a_{n}^{g}}} \epsb_{\theta}(\vx_{n,t}^{g}, n,\rvh_{t-1}^{g})) + \sqrt{\sigma_{n}^{g}} \vz$, where $\sigma_{n}^{g}=\frac{1-a_{n-1}^{g}}{1-a_{n}^{g}}\beta_{n}^{g}$.
    \EndFor
\EndFor
\end{algorithmic}
\textbf{Return}: $\vx_{0,t}^{1}$ (finest-grained data); optionally $\vx_{0,t}^{g}$, $g \in \{2,\ldots,G\}$
\end{algorithm}
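
The update inside the inner loop can be sketched for scalar data as follows. This is a minimal illustration rather than the released implementation: `reverse_step` is a hypothetical helper, the noise prediction is passed in as a number instead of coming from a network, $a_0$ is taken to be 1, and the step is only evaluated where $\beta_n > 0$ so that $1 - a_n > 0$.

```python
import math


def reverse_step(x_n, n, alphas, a_cum, eps_pred, z):
    """One ancestral sampling step x_n -> x_{n-1} for scalar data.

    alphas[n - 1] is alpha_n and a_cum[n - 1] is a_n; a_0 is taken to be 1.
    Assumes beta_n = 1 - alpha_n > 0, which guarantees 1 - a_n > 0.
    """
    alpha_n, a_n = alphas[n - 1], a_cum[n - 1]
    a_prev = a_cum[n - 2] if n > 1 else 1.0
    beta_n = 1.0 - alpha_n
    sigma_n = (1.0 - a_prev) / (1.0 - a_n) * beta_n   # variance of the added noise
    mean = (x_n - beta_n / math.sqrt(1.0 - a_n) * eps_pred) / math.sqrt(alpha_n)
    return mean + math.sqrt(sigma_n) * z


# Toy two-step schedule (values chosen only for illustration).
alphas = [0.9, 0.8]
a_cum = [0.9, 0.9 * 0.8]

x0, eps = 2.0, 0.5
x1 = math.sqrt(a_cum[0]) * x0 + math.sqrt(1.0 - a_cum[0]) * eps

# At n = 1 the algorithm sets z = 0; with the true noise as the prediction,
# the step returns the clean value up to floating-point error.
x0_rec = reverse_step(x1, 1, alphas, a_cum, eps_pred=eps, z=0.0)
```

The check at $n = 1$ shows why a good noise estimate matters: the closer $\epsb_\theta$ is to the injected noise, the closer the final step lands on the clean sample.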

**Selection of share ratio.** We propose a heuristic approach to help select the appropriate share ratio $r^{g}$, which is derived from $N_{*}^{g}$. We determine the choice of $\Nn^{g}$ as the diffusion step at which the distance between two distributions $q(\vx^{g})$ and $p_{\theta}(\vx^{g_1}_{n})$ is minimum, as shown below: $$\begin{equation}
    \Nn^g:=\arg \min_n \mathcal{D}(q(\vx^{g}),p_{\theta}(\vx^{g_1}_n)),
\end{equation}$$ where $\mathcal{D}$ is a measure for assessing the discrepancy between two distributions, such as KL divergence. In practice, we first pre-train a TimeGrad model and then compute the `\crps `{=latex}between the coarse-grained targets and the samples along the sampling path of the finest-grained data during inference. The range of steps over which the `\crps `{=latex}values remain consistently small suggests a proper range of share ratios.
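
A minimal sketch of this heuristic is given below, assuming the discrepancy at each diffusion step has already been measured and stored in a list. The helper names and the numbers in the example curve are made up purely for illustration and are not experimental results.

```python
def select_start_index(discrepancy_by_step):
    """Return the 1-based step n that minimizes the discrepancy D."""
    best = min(range(len(discrepancy_by_step)),
               key=discrepancy_by_step.__getitem__)
    return best + 1


def candidate_steps(discrepancy_by_step, tolerance):
    """Steps whose discrepancy lies within `tolerance` of the minimum."""
    lowest = min(discrepancy_by_step)
    return [n + 1 for n, d in enumerate(discrepancy_by_step)
            if d <= lowest + tolerance]


def share_ratio(n_star, N):
    # r^g = 1 - (N_*^g - 1) / N
    return 1.0 - (n_star - 1) / N


# Illustrative discrepancy curve over N = 10 steps (made-up values).
curve = [0.90, 0.70, 0.50, 0.42, 0.40, 0.41, 0.50, 0.60, 0.80, 0.90]

n_star = select_start_index(curve)           # step with the smallest discrepancy
ratio = share_ratio(n_star, len(curve))      # corresponding share ratio
plausible = candidate_steps(curve, 0.05)     # range of acceptable starting steps
```

Reporting a range of steps rather than a single minimizer matches the observation that several neighbouring share ratios can perform comparably.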

# Experiments

In this section, we conduct extensive experiments on six real-world datasets to evaluate the performance of the proposed `\mgtsd `{=latex}model and compare it with previous state-of-the-art baselines.

## Settings

**Datasets.** We consider six real-world datasets characterized by a range of temporal dynamics, namely $\texttt{Solar}$, $\texttt{Electricity}$, $\texttt{Traffic}$, $\texttt{Taxi}$, $\texttt{KDD-cup}$ and $\texttt{Wikipedia}$. The data is recorded at intervals of 30 minutes, 1 hour, or 1 day. Refer to `\appref{sec_benchmark_datasets}`{=latex} for details.

**Evaluation Metrics.** We assess our model and all baselines using $\text{CRPS}_{\text{sum}}$ (Continuous Ranked Probability Score), a widely used metric for probabilistic time series forecasting, as well as $\text{NMAE}_{\text{sum}}$ (Normalized Mean Absolute Error) and $\text{NRMSE}_{\text{sum}}$ (Normalized Root Mean Squared Error). For detailed descriptions, refer to `\appref{sec_app_metrics}`{=latex}.
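
For readers unfamiliar with CRPS, the sketch below uses the standard sample-based estimator $\mathbb{E}|X-y| - \tfrac{1}{2}\mathbb{E}|X-X'|$, where $X, X'$ are independent draws from the forecast and $y$ is the observation. The function names are hypothetical, and the second helper only conveys the idea of scoring the sum across dimensions; the exact definition used in the experiments, including any normalization and averaging over time, is the one given in the appendix.

```python
def crps_from_samples(samples, observation):
    """Sample-based CRPS estimate: E|X - y| - 0.5 * E|X - X'|."""
    m = len(samples)
    term_obs = sum(abs(s - observation) for s in samples) / m
    term_spread = sum(abs(a - b) for a in samples for b in samples) / (m * m)
    return term_obs - 0.5 * term_spread


def crps_of_sum(samples_by_draw, observation_by_dim):
    """CRPS of the series summed across dimensions (unnormalized).

    samples_by_draw: list of forecast draws, each a list over dimensions.
    observation_by_dim: observed values, one per dimension.
    """
    summed_samples = [sum(draw) for draw in samples_by_draw]
    return crps_from_samples(summed_samples, sum(observation_by_dim))


point_score = crps_from_samples([1.0, 1.0], 1.0)        # forecast equals observation
spread_score = crps_from_samples([0.0, 2.0], 1.0)       # symmetric two-point forecast
sum_score = crps_of_sum([[0.0, 0.0], [1.0, 1.0]], [0.5, 0.5])
```

Lower values are better: the score rewards forecast distributions that are both close to the observation and no wider than necessary.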

**Baselines.** We assess the predictive performance of the proposed `\mgtsd `{=latex}model in comparison with multivariate time series forecasting models, including Vec-LSTM-ind-scaling [@salinas2019high], GP-scaling [@salinas2019high], GP-Copula [@salinas2019high], Transformer-MAF [@rasul2020multivariate], LSTM-MAF [@rasul2020multivariate], TimeGrad [@rasul2021autoregressive], and TACTiS [@drouin2022tactis]. The MG-Input ensemble model serves as the baseline with multi-granularity inputs. It combines two TimeGrad models, one trained on coarse-grained data and the other on the finest-grained data, and generates the final predictions as a weighted average of their outputs.

**Implementation details.** We train our model for 30 epochs using the Adam optimizer with a fixed learning rate of $10^{-5}$. We set the mini-batch size to 128 for $\texttt{Solar}$ and 32 for the other datasets. The number of diffusion steps is set to 100. Additional hyper-parameters, such as share ratios, granularity levels, and loss weights, are detailed in `\appref{sec_app_hyper_parameter}`{=latex}. All models are trained and tested on a single NVIDIA A100 80GB GPU.

## Results

The `\crps `{=latex}values averaged over 10 independent runs are reported in Table `\ref{tab-compare}`{=latex}. The results show that our model achieves the lowest `\crps `{=latex}and outperforms the baseline models across all six datasets. The MG-Input model exhibits only marginal improvement over TimeGrad on certain datasets. This implies that while integrating multi-granularity information may result in some information gain, direct ensembling of coarse-grained outputs is inefficient in boosting performance.

## Ablation Study

**Share ratio of variance schedule.** To investigate the effect of the share ratio, we evaluate the performance of `\mgtsd `{=latex}using various share ratios across different coarse granularities. The experiment is conducted in a two-granularity setting, where one coarse granularity is utilized to guide the learning process for the finest-grained data. Table `\ref{tab-ablation}`{=latex} shows that for each coarse granularity level, the $\text{CRPS}_{\text{sum}}$ values initially decrease to their lowest values and then ascend again as the share ratio gets larger. Furthermore, we observe that for coarser granularities, the model performs better with a smaller share ratio. This suggests that the model achieves optimal performance when the share ratio is chosen at the step where the coarse-grained samples most closely resemble intermediate states. Utilizing 4-hour or 6-hour granularity as guidance greatly enhances the model performance. However, the improvement in performance diminishes as the granularity becomes coarser, such as 12 hours or 24 hours, possibly due to the greater loss of information on local fluctuations.

\resizebox{\textwidth}{!}{%
    \begin{tabular}{lllllllll}
    \toprule
    \bfseries Method & \bfseries \texttt{Solar}  &\bfseries \texttt{Electricity} &   \bfseries \texttt{Traffic}  &  \bfseries \texttt{KDD-cup} & \bfseries \texttt{Taxi} & \bfseries \texttt{Wikipedia}  \\ 
    \midrule
    Vec-LSTM ind-scaling & $0.4825_{\pm 0.0027}$ & $0.0949_{\pm 0.0175}$ & $0.0915_{\pm 0.0197}$  & $0.3560_{\pm 0.1667}$ &  $0.4794_{\pm 0.0343}$& $0.1254_{\pm 0.0174}$\\
    GP-Scaling& $0.3802_{\pm 0.0052}$ & $0.0499_{\pm 0.0031}$ & $0.0753_{\pm 0.0152}$ & $0.2983_{\pm 0.0448}$ & $0.2265_{\pm 0.0210}$ & $0.1351_{\pm 0.0612}$ \\
    GP-Copula & $0.3612_{\pm 0.0035}$ & $0.0287_{\pm 0.0005}$ & $0.0618_{\pm 0.0018}$ & $0.3157_{\pm 0.0462}$ & $0.1894_{\pm 0.0087}$ & $0.0669_{\pm 0.0009}$  \\ 
    LSTM-MAF  & $0.3427_{\pm 0.0082}$ & $0.0312_{\pm 0.0046}$ & $0.0526_{\pm 0.0021}$ & $0.2919_{\pm 0.1486}$ & $0.2295_{\pm 0.0082}$ & $0.0763_{\pm 0.0051}$ \\
    Transformer-MAF  & $0.3532_{\pm 0.0053}$ & $0.0272_{\pm 0.0017}$ & $0.0499_{\pm 0.0011}$ & $0.2951_{\pm 0.0504}$ & $0.1531_{\pm 0.0038}$ &${0.0644}_{\pm 0.0037}$ \\
    TimeGrad & $0.3335_{\pm 0.0653}$ & $0.0232_{\pm 0.0035}$ & $0.0414_{\pm 0.0112}$ & $0.2902_{\pm 0.2178}$ & $0.1255_{\pm 0.0207}$ & $0.0555_{\pm 0.0088}$\\
  

    TACTiS & $0.4209_{\pm 0.0330}$ & $0.0259_{\pm 0.0019}$ & $0.1093_{\pm 0.0076}$ & $0.5406_{\pm 0.1584}$ & $0.2070_{\pm 0.0159}$& $-$ \\  
    MG-Input & $0.3239_{\pm 0.0427}$  &  $0.0238_{\pm 0.0035}$ & $0.0658_{\pm 0.0065}$ & $0.2977_{\pm 0.1163}$ &  $0.1592_{\pm 0.0087}$ & $0.0567_{\pm 0.0091}$\\
    MG-TSD & $\bm{0.3081_{\pm 0.0099}}$ & $\bm{{0.0149}_{\pm 0.0017}}$ & $\bm{0.0323_{\pm 0.0125}}$ & $\bm{{0.1837}_{\pm 0.0865}}$ & $\bm{0.1159_{\pm 0.0132}}$ & $\bm{0.0529}_{\pm 0.0054}$\\
    \bottomrule
    \end{tabular}%
}

::: center
\resizebox{\textwidth}{!}{%
\begin{tabular}{cccccccccccccccc}
\toprule
\textbf{Ratio} & \multicolumn{3}{c}{\textbf{4 hour}} & \multicolumn{3}{c}{\textbf{6 hour}}\\
\cmidrule(r){2-4} \cmidrule(r){5-7}
& $\textbf{\crpsnospace}$ &$\textbf{NMAE}_{\textbf{sum}}$
 & $\textbf{NRMSE}_{\textbf{sum}}$ & $\textbf{\crpsnospace}$   & $\textbf{NMAE}_{\textbf{sum}}$ & $\textbf{NRMSE}_{\textbf{sum}}$ \\
\midrule
\textbf{20\%} & $0.3489_{\pm 0.0190}$  &  $0.3826_{\pm 0.0200}$    & $0.7177_{\pm 0.0445}$ & $0.3378_{\pm 0.0305}$ & $0.3703_{\pm 0.0368}$ & $0.6916_{\pm 0.0536}$ \\

\textbf{40\%} & $0.3405_{\pm 0.0415}$   & $0.3792_{\pm 0.0386}$   & $0.6870_{\pm 0.0870}$  & $0.3275_{\pm 0.0250}$ & $0.3608_{\pm 0.0267}$ & $0.6650_{\pm 0.0374}$\\

\textbf{60\%} & $0.3268_{\pm 0.0475}$  & $0.3604_{\pm 0.0463}$   & $0.6579_{\pm 0.0919}$ & $\bm{0.3166}_{\pm 0.0376}$&   $\bm{0.3491}_{\pm 0.0368}$ &   $\bm{0.6478}_{\pm 0.0696}$\\

\textbf{80\%} & $\bm{0.3172_{\pm 0.0249}}$ & $\bm{0.3510_{\pm 0.0240}}$ &  $\bm{0.6515_{\pm 0.0571}}$ & $0.3221_{\pm 0.0425}$ & $0.3555_{\pm 0.0443}$ & $0.6542_{\pm 0.0747}$\\

\textbf{100\%} & $0.3178_{\pm 0.0342}$  & $0.3480_{\pm 0.0356}$  &  $0.6591_{\pm 0.0503}$ & $0.3232_{\pm 0.0396}$ & $0.3548_{\pm 0.0417}$ & $0.6550_{\pm 0.0660}$\\

\midrule

\textbf{Ratio} & \multicolumn{3}{c}{\textbf{12 hour}} & \multicolumn{3}{c}{\textbf{24 hour}}  \\
\cmidrule(r){2-4} \cmidrule(r){5-7}
& $\textbf{\crpsnospace}$ &$\textbf{NMAE}_{\textbf{sum}}$ & $\textbf{NRMSE}_{\textbf{sum}}$ & $\textbf{\crpsnospace}$   & $\textbf{NMAE}_{\textbf{sum}}$ & $\textbf{NRMSE}_{\textbf{sum}}$ \\

\midrule

\textbf{20\%}& $0.3440_{\pm 0.0391}$  &  $0.3767_{\pm 0.0450}$   & $0.6999_{\pm 0.0772}$  & $0.3315_{\pm 0.0266}$ &  $0.3693_{\pm 0.0298}$ &  $0.6801_{\pm 0.0554}$ \\

\textbf{40\%} &  $0.3374_{\pm 0.0370}$  &  $0.3713_{\pm 0.0346}$  & $0.6837_{\pm 0.0641}$ & $\bm{0.3276_{\pm 0.0358}}$ &    $\bm{0.3612_{\pm 0.0361}}$ &       $\bm{0.6722_{\pm 0.0552}}$ \\

\textbf{60\%}  & $\bm{0.3240}_{\pm 0.0382}$ & $\bm{0.3597}_{\pm 0.0388}$ & $\bm{0.6694}_{\pm 0.0746}$  & $0.3382_{\pm 0.0343}$ &    $0.3737_{\pm 0.0365}$ &      $0.6878_{\pm 0.0655}$ \\

\textbf{80\%} &$0.3391_{\pm 0.0390}$  &  $0.3719_{\pm 0.0403}$   & $0.6953_{\pm 0.0691}$ & $0.3288_{\pm 0.0460}$&    $0.3639_{\pm 0.0476}$  &  $0.6741_{\pm 0.0929}$ \\

\textbf{100\%} & $0.3284_{\pm 0.0323}$ & $0.3538_{\pm 0.0450}$ &   $0.6609_{\pm 0.0917}$ & $0.3407_{\pm 0.0248}$  &  $0.3692_{\pm 0.0244}$ &  $0.6933_{\pm 0.0528}$ \\

\bottomrule
\end{tabular}
}
:::

In practice, the selection of share ratio can follow the heuristic rule outlined in Section `\ref{sec-multi-gran-loss}`{=latex}. Figure `\ref{fig-select-ratio}`{=latex} provides illustrative plots of the share ratio selection curve for different granularities. The blue curve in each plot represents the `\crps `{=latex}values between the coarse-grained targets and the 1-hour samples drawn from the 1-gran (finest-gran) model at each intermediate denoising step. Each point on the orange polylines represents the `\crps `{=latex}value of the 1-hour predictions by 2-gran MG-TSD models with share ratios in $\{0.2, 0.4, 0.6, 0.8, 1.0\}$, and the lowest point of each polyline characterizes the most suitable share ratio for the corresponding granularity.

The diffusion steps that can achieve relatively small `\crps `{=latex}values are colored in grey, suggesting a proper range for the share ratio at which the model can achieve satisfactory performance. From the plots, a strong correlation exists between the polyline of `\crps `{=latex}calculated during test time and the share ratio selection curve, which validates the effectiveness of the selection rule. In addition, as granularity transitions from fine to coarse (4h→6h→12h→24h), the diffusion steps at which the distribution most resembles the coarse-grained targets increase (approximately at steps 20→40→60→60). This comparison shows the similarity between the diffusion process and the smoothing process from the finest-grained to coarse-grained data, both of which involve a gradual loss of finer characteristics from the finest-grained data through a smooth and convex transformation.

<figure id="fig-select-ratio" data-latex-placement="htbp">

<figcaption>Selection of share ratio for models </figcaption>
</figure>

**The number of granularities.** We further explore the impact of the number of granularities on the `\mgtsd `{=latex}model. As presented in Table `\ref{tab-ablation-2}`{=latex}, increasing the number of granularity levels generally boosts the performance of the `\mgtsd `{=latex}model, which demonstrates that the introduction of multi-granularity information effectively guides the learning process of diffusion models. However, the marginal benefit diminishes as the number of granularities increases. The results suggest that four to five granularity levels generally suffice for achieving favorable performance.

\begin{small}
\begin{center}
\resizebox{\textwidth}{!}{%
\begin{tabular}{lllllllll}
\toprule
\textbf{Num of gran} & \multicolumn{3}{c}{\textbf{\texttt{Solar}}} & \multicolumn{3}{c}{\textbf{\texttt{Electricity}}}\\
\cmidrule(r){2-4} \cmidrule(r){5-7}
& $\textbf{\crpsnospace}$ &$\textbf{NMAE}_{\textbf{sum}}$ & $\textbf{NRMSE}_{\textbf{sum}}$ & $\textbf{\crpsnospace}$   & $\textbf{NMAE}_{\textbf{sum}}$ & $\textbf{NRMSE}_{\textbf{sum}}$ \\
\midrule
2& ${0.3172}_{\pm 0.0249}$ & ${0.3510}_{\pm 0.0240}$ & ${0.6515}_{\pm 0.0571}$ & ${{0.0174}_{\pm 0.0042}}$ & ${{0.0226}_{\pm 0.0071}}$ & 
${{0.0296}_{\pm 0.0086}}$\\

3 & ${0.3110}_{\pm 0.0329}$ & ${0.3494}_{\pm 0.0378} $ & ${0.6452}_{\pm 0.0632} $ & ${{0.0160}_{\pm 0.0020}}$ & ${{0.0198}_{\pm 0.0029}}$ & 
${{0.0262}_{\pm 0.0039}}$\\
4& $\bm{0.3081}_{\pm 0.0099}$ & $\bm{0.3445}_{\pm 0.0102}$ & $\bm{0.6245}_{\pm 0.0268}$ & $\bm{{0.0149}_{\pm 0.0017}}$ & $\bm{{0.0178}_{\pm 0.0018}}$ & $\bm{{0.0241}_{\pm 0.0030}}$\\
5 & 
$0.3093_{\pm 0.0411}$ & $0.3430_{\pm 0.0451}$  & ${0.6117}_{\pm 0.0746}$
& ${{0.0153}_{\pm 0.0027}}$ & ${{0.0181}_{\pm 0.0043}}$ & ${{0.0254}_{\pm 0.0058}}$\\

\midrule

\textbf{Num of gran} & \multicolumn{3}{c}{\textbf{\texttt{Traffic}}} & \multicolumn{3}{c}{\textbf{\texttt{KDD-cup}}}\\
\cmidrule(r){2-4} \cmidrule(r){5-7}
& $\textbf{\crpsnospace}$ &$\textbf{NMAE}_{\textbf{sum}}$ & $\textbf{NRMSE}_{\textbf{sum}}$ & $\textbf{\crpsnospace}$   & $\textbf{NMAE}_{\textbf{sum}}$ & $\textbf{NRMSE}_{\textbf{sum}}$ \\
\midrule
2&  $0.0347_{\pm 0.0020}$ & ${0.0396}_{\pm 0.0022}$ & $0.0593_{\pm 0.0043}$ & $0.2427_{\pm 0.1167}$ & $0.3171_{\pm 0.1557}$ & $0.3745_{\pm 0.1652}$ \\
3 & $0.0334_{\pm 0.0034}$ & $0.0382_{\pm 0.0035}$ & $0.0574_{\pm 0.0066}$ & $0.2414_{\pm 0.1619}$ & $0.3030_{\pm 0.1789}$ & $0.3808_{\pm 0.2168}$ \\
4  & ${0.0326}_{\pm 0.0041}$ &    ${0.0374}_{\pm 0.0048}$ &       ${0.0573}_{\pm 0.0050}$ &  $0.2198_{\pm 0.1162}$ & $0.2893_{\pm 0.1554}$ &$0.3315_{\pm 0.1882}$\\
5 & $\bm{0.0323}_{\pm 0.0125}$ & $\bm{0.0370}_{\pm 0.0140}$ & $\bm{0.0563}_{\pm 0.0230}$ &  $\bm{0.1837}_{\pm 0.0636}$ &
$\bm{0.2463}_{\pm 0.0865}$ &
$\bm{0.3001}_{\pm 0.0997}$\\

\bottomrule
\end{tabular}
}
\end{center}
\end{small}

## Case Study

To illustrate the guidance effect of coarse-grained data, we visualize the ground truth and the predicted mean for both 1-hour and 4-hour granularity time series across four dimensions in the `Solar` dataset in Figure `\ref{fig-case}`{=latex}. For comparison, the prediction results for the 1-hour data from TimeGrad are also included. The results indicate that the TimeGrad model struggles to accurately capture the peaks in the series and tends to underestimate the peaks in solar energy. In the `\mgtsd `{=latex}model, the coarse-grained samples display a more robust capacity to capture the trends, subsequently guiding the generation of more precise fine-grained data.

<figure id="fig-case" data-latex-placement="H">

<figcaption>Visualization of the ground truth (<code>Solar</code> dataset), the predicted mean for the 4-hour and 1-hour time series, and the TimeGrad predicted mean for the 1-hour time series. The 50% prediction intervals for the 1-hour data are also included. These plots show some illustrative dimensions out of 137 dimensions from the first 24-hour rolling window.</figcaption>
</figure>

# Conclusion

In this paper, we introduce a novel **M**ulti-**G**ranularity **T**ime **S**eries **D**iffusion (`\mgtsdnospace`{=latex}) model, which leverages the inherent granularity levels within the data as given targets at intermediate diffusion steps to guide the learning process of diffusion models. We derive a novel multi-granularity guidance diffusion loss function and propose a concise implementation method to effectively utilize coarse-grained data across various granularity levels. Extensive experiments conducted on real-world datasets demonstrate that `\mgtsd `{=latex}outperforms existing time series prediction methods.

\newpage
\bibliographystyle{iclr2024_conference}
\newpage
\appendix

# Appendix: Derivation of loss function {#sec_derivation_of_loss_function}

Recall that we specify a sequence of noisy samples $\vx^{g}_{\Nn^g+1},\ldots, \vx^{g}_{N}$ by applying the forward process on $\vx^{g}$. The superscript in $\Nn^g$ is suppressed for notational brevity. Suppose the coarse-grained data $\vx^{g}_{\Nn} \sim q(\vx^{g})$, where the subscript $\Nn$ indicates that the observed $\vx^{g}$ is treated as a sample from the distribution at diffusion step $\Nn$. (In the diffusion model, the subscript is typically denoted as 0, but we start with $\Nn$ to simplify the derivation.)

$$\begin{align}
-\log p_\theta(\vx^{g}_{\Nn}) 
&\leq-\log p_\theta(\vx^{g}_{\Nn}) + \KL(q(\vx_{(\Nn+1):N}^{g}\vert  
\vx^{g}_{\Nn}) \| p_\theta(\vx^{g}_{(\Nn+1):N}\vert\vx_{\Nn}^{g}) ) \nonumber\\  
&= -\log p_\theta(\vx_{\Nn}^{g}) + \E_{\vx^{g}_{(\Nn+1):N}\sim q(\vx^{g}_{(\Nn+1):N} \vert \vx^{g}_{\Nn})} \Big[ \log\frac{q(\vx^{g}_{(\Nn+1):N}\vert\vx^{g}_{\Nn})}{p_\theta(\vx^{g}_{\Nn:N}) / p_\theta(\vx_{\Nn}^{g})} \Big] \nonumber\\  
&= -\log p_\theta(\vx_{\Nn}^{g}) 
 + \E_q \Big[ \log\frac{q(\vx^{g}_{(\Nn+1):N}\vert\vx^{g}_{\Nn})}{p_\theta(\vx^{g}_{\Nn:N})} + \log p_\theta(\vx_{\Nn}^{g}) \Big] \nonumber\\  
&= \E_q \Big[ \log \frac{q(\vx^{g}_{(\Nn+1):N}\vert\vx_{\Nn}^{g})}{p_\theta(\vx_{\Nn:N}^{g})} \Big]
\end{align}$$

Then, the training objective can be performed by optimizing the usual variational lower bound shown below: $$\begin{equation}
     L_\text{VLB}  = \mathbb{E}_{q(\vx^{g}_{\Nn:N})} \Big[ \log \frac{q(\vx_{(\Nn+1):N}^{g}\vert\vx_{\Nn}^{g})}{p_\theta(\vx^{g}_{\Nn:N})} \Big] \geq - \E_{q(\vx_{\Nn}^{g})} \log p_\theta(\vx_{\Nn}^{g})
\end{equation}$$

The objective $L_\text{VLB}$ is equivalent to that of the diffusion model in @ho2020denoising when a diffusion model with $N-\Nn$ steps is applied to $\vx^{g}$. The forward process is defined as ${q(\vx^{g}_{(\Nn+1):N}|\vx^{g}_{\Nn}) = \prod^N_{n=\Nn+1} q(\vx_n^{g} | \vx_{n-1}^{g})}, \ \text{where} \ q (\vx^{g}_{n}|\vx^{g}_{n-1}):= \mathcal{N} (\sqrt{1-\beta_n^g}\vx^{g}_{n-1}, \beta_n^{g} \mI)$. The $\{\beta_{n}^g\}_{n=\Nn}^{N}$ share values with the variance schedule $\{\beta_n^1\}_{n=1}^{N}$ of the finest-grained data from index $\Nn$. The reverse process is defined by the $\theta$-parameterized trajectory. Then, following the same technique as in @ho2020denoising, $L_\text{VLB}$ reduces to the usual loss of diffusion models.

# Appendix: Experiments

## Benchmark experiments

The results of the benchmark experiments, evaluated on the metrics $\text{NRMSE}_{\text{sum}}$ and $\text{NMAE}_{\text{sum}}$, are presented in Table `\ref{tab-compare-mse}`{=latex} and Table `\ref{tab-compare-mae}`{=latex} respectively. In these experiments, we include four extra baseline models for a more comprehensive comparison: TimeDiff [@shen2023non], $\text{D}^3\text{VAE}$ [@li2022generative], PatchTST [@nie2022time], and Autoformer [@wu2021autoformer].

\resizebox{\textwidth}{!}{%
    \begin{tabular}{lllllllll}
    \toprule
    \bfseries Method & \bfseries \texttt{Solar}  &\bfseries \texttt{Electricity} &   \bfseries \texttt{Traffic}  &  \bfseries \texttt{KDD-cup} & \bfseries \texttt{Taxi} & \bfseries \texttt{Wikipedia}  \\ 
    \midrule
    Vec-LSTM ind-scaling & $0.9952_{\pm 0.0077}$ & $0.1439_{\pm 0.0228}$ & $0.1451_{\pm 0.0248}$ & $0.4461_{\pm 0.1833}$ & $0.6398_{\pm 0.0390}$ & $0.1618_{\pm 0.0162}$ \\
    GP-Scaling & $0.9004_{\pm 0.0095}$ & $0.0811_{\pm 0.0062}$ & $0.1469_{\pm 0.0181}$ & $0.3445_{\pm 0.0621}$ & $0.3598_{\pm 0.0285}$ & $0.1710_{\pm 0.1006}$ \\

    GP-Copula & $0.8279_{\pm 0.0053}$ & $0.0512_{\pm 0.0009}$ & $0.1282_{\pm 0.0033}$ & $\bm{0.2605}_{\pm 0.0227}$& $0.3125_{\pm 0.0113}$ & $0.0930_{\pm 0.0076}$\\
    
    Autoformer&$0.7046_{\pm 0.0000}$&$0.0475_{\pm 0.0000}$&$0.0951_{\pm 0.0000}$&$0.8984_{\pm 0.0000}$&$0.3498_{\pm 0.0000}$&$0.1052_{\pm 0.0000}$\\
    PatchTST&$0.7270_{\pm 0.0000}$&$0.0474_{\pm 0.0000}$&$0.1897_{\pm 0.0000}$&$0.5137_{\pm 0.0000}$&$0.3690_{\pm 0.0000}$&$0.0915_{\pm 0.0000}$\\
    $\text{D}^3\text{VAE}$&$0.7472_{\pm 0.0508}$ & $0.1640_{\pm 0.0928}$ &$0.4722_{\pm 0.1197}$ & $0.5628_{\pm 0.0419}$ & $0.7624_{\pm 0.5598}$& $2.2094_{\pm 2.1646}$\\
    TimeDiff & $1.5985_{\pm 0.0359}$ & $0.3714_{\pm 0.0073}$ & $0.5520_{\pm 0.0087}$ & $0.4955_{\pm 0.0147}$ & $0.5479_{\pm 0.0084}$ & $0.1412_{\pm 0.0099}$ \\


    TimeGrad & $0.6953_{\pm 0.0845}$ & $0.0348_{\pm0.0057}$ &$0.0653_{\pm 0.0244}$ & $0.4092_{\pm 0.1332}$ & $0.2365_{\pm 0.0386}$& $0.0870_{\pm 0.0106}$\\
    TACTiS  & $0.8532_{\pm 0.0851}$ &$0.0427_{\pm 0.0023}$ &  $0.2270_{\pm 0.0159}$ &$0.6513_{\pm 0.1767}$ & $0.3387_{\pm 0.0097}$& - \\

    MG-TSD &$\bm{0.6178_{\pm 0.0418}}$ & $\bm{0.0241_{\pm 0.0030}}$ & $\bm{0.0563_{\pm 0.0230}}$ & ${0.3001_{\pm 0.0997}}$& $\bm{0.2334_{\pm  0.0313 }}$ & $\bm{0.0810_{\pm 0.0057}}$ \\
    
    \bottomrule
    \end{tabular}%
}
\resizebox{\textwidth}{!}{%
    \begin{tabular}{lllllllll}
    \toprule
    \bfseries Method & \bfseries \texttt{Solar}  &\bfseries \texttt{Electricity} &   \bfseries \texttt{Traffic}  &  \bfseries \texttt{KDD-cup} & \bfseries \texttt{Taxi} & \bfseries \texttt{Wikipedia}  \\ 
    \midrule

    Vec-LSTM ind-scaling & $0.5091_{\pm 0.0027}$ & $0.1261_{\pm 0.0211}$ & $0.1042_{\pm 0.0228}$ & $0.4193_{\pm 0.1902}$ & $0.4974_{\pm 0.0351}$ & $0.1416_{\pm 0.0180}$ \\

    GP-Scaling & $0.4945_{\pm 0.0065}$ & $0.0648_{\pm 0.0046}$ & $0.0975_{\pm 0.0163}$ & $0.2892_{\pm 0.0550}$ & $0.2867_{\pm 0.0264}$ & $0.1452_{\pm 0.1029}$ \\

    GP-Copula & $0.4302_{\pm 0.0046}$ & $0.0312_{\pm 0.0007}$ & $0.0769_{\pm 0.0022}$ & $\bm{0.2140}_{\pm 0.0124}$ & $0.2390_{\pm 0.0098}$ & $0.0659_{\pm 0.0061}$ \\

    Autoformer&$0.6368_{\pm 0.0000}$&$0.0388_{\pm 0.0000}$&$0.0684_{\pm 0.0000}$&$0.7658_{\pm 0.0000}$&$0.2652_{\pm 0.0000}$&$0.1239_{\pm 0.0000}$\\
    PatchTST&$0.4351_{\pm 0.0000}$&$0.0350_{\pm 0.0000}$&$0.1219_{\pm 0.0000}$&$0.4497_{\pm 0.0000}$&$0.2887_{\pm 0.0000}$&$0.0625_{\pm 0.0000}$\\
    $\text{D}^3\text{VAE}$&$0.4457_{\pm 0.0377}$ & $0.1434_{\pm 0.0892}$ &$0.3992_{\pm 0.1177}$ & $0.4874_{\pm 0.0520}$ & $0.6080_{\pm 0.5061}$& $2.0151_{\pm 2.0005}$\\
    TimeDiff & $1.3343_{\pm 0.0305}$ & $0.3519_{\pm 0.0075}$ & $0.4782_{\pm 0.0058}$ & $0.3630_{\pm 0.0127}$ & $0.4521_{\pm 0.0102}$ & $0.1146_{\pm 0.0106}$ \\

    TimeGrad & $0.3694_{\pm 0.0400}$ & $0.0266_{\pm 0.0049}$ & $0.0410_{\pm 0.0089}$ & $0.3614_{\pm 0.1334}$& $0.1365_{\pm 0.0193 }$ & $0.0631_{\pm 0.008 }$\\
    TACTiS & $0.4448_{\pm 0.0313}$ &$0.0310_{\pm 0.0015}$ & $0.1352_{\pm 0.0159}$& $0.6078_{\pm 0.1718}$ & $0.2244_{\pm 0.0036}$ & -\\

    MG-TSD & $\bm{0.3347_{\pm 0.0220}}$ & $\bm{0.0178_{\pm 0.0018}}$ & $\bm{0.0370_{\pm 0.0140}}$ & ${0.2463_{\pm 0.0865}}$ & $\bm{0.1300_{\pm 0.0150}}$ & $\bm{0.0601_{\pm 0.0057}} $\\
    
    \bottomrule
    \end{tabular}%
}

## More experiment settings

### Performance for long-term forecasting 

To evaluate the performance of MG-TSD for long-term forecasting, we maintain a fixed context length of 24 and extend the prediction length to 24, 48, 96, and 144. The results on the $\texttt{Solar}$ and $\texttt{Electricity}$ datasets are displayed in Figure `\ref{fig-pred-length}`{=latex}.

The results in Figure `\ref{fig-pred-length}`{=latex} indicate that MG-TSD performs well for long-term forecasting. As the prediction length increases, the performance of our proposed method stays robust, exhibiting no sudden decline. Furthermore, our method consistently outperforms the competitive baseline, and the gap between the two approaches shows no sign of closing as the prediction horizon grows.

<figure id="fig-pred-length" data-latex-placement="htbp">

<figcaption>Performance evaluation across different prediction horizons for MG-TSD, with TimeGrad as the baseline model. The context length is fixed at 24h and the prediction length is tested at 24h, 48h, 96h, and 144h. The average CRPS, NRMSE, and NMAE metrics are computed for both MG-TSD and the baseline over 10 independent runs, with error bars indicating the corresponding standard deviations. </figcaption>
</figure>

### Time and memory usage of the MG-TSD model during training

We conducted experiments to evaluate the time and memory usage of the MG-TSD model during training across different numbers of granularities. These experiments were run on a single A6000 card with 48 GB of memory. The $\texttt{Solar}$ dataset was used, with a batch size of 128, an input size of 552, 100 diffusion steps, and 30 epochs.

<figure id="fig-memory-runtime" data-latex-placement="!ht">
<img src="fig_memory_runtime.png" />
<figcaption>Comparison of Time and Memory Consumption at Different Granularity Levels in MG-TSD Model Training</figcaption>
</figure>

As illustrated in Figure `\ref{fig-memory-runtime}`{=latex}, memory consumption increases linearly with the number of granularities. A slight increase in training time is also observed. These findings are consistent with the architecture of our model. In particular, each additional granularity introduces an extra RNN in the Temporal Process Module and additional computation within the Guided Diffusion Process Module. In theory, these resource costs should grow linearly. The increase in training time stays slight because the Multi-granularity Data Generator Module enables parallel forward processes across different granularities, which accelerates training. Moreover, an excessive number of granularities may not notably improve the final prediction results, so the number of granularities will be kept within a certain range. Therefore, memory consumption will not rise indefinitely.

### Variations in the Frequency Domain of Time Series Data: The Impact of Granularity and Denoising Steps

We sampled series from the $\texttt{Solar}$ dataset and applied a Fast Fourier Transform to extract the seasonality components of the series, of the samples at different granularities, and of the corresponding noisy samples along the forward diffusion process.

<figure id="fig-connection" data-latex-placement="htbp">

<figcaption>Variations in the frequency domain of time series data: the impact of granularity and denoising steps. </figcaption>
</figure>

As depicted in Figure `\ref{fig-connection}`{=latex}(a), as granularity becomes coarser, the magnitudes of all prominent frequency components decrease, while the high-frequency peaks (around 125 and 80) diminish more quickly than the lower-frequency peak (around 45). Figure `\ref{fig-connection}`{=latex}(b) shows the distribution of frequency components of the same series under gradually increasing forward diffusion steps, and the same pattern is observable. This empirical study indicates the connection between the forward diffusion process and the smoothing process from fine-grained data to coarse-grained data, both of which result in losing finer informative features.
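
The effect of coarsening on the spectrum can be reproduced on a synthetic signal. The sketch below is not the analysis behind Figure `\ref{fig-connection}`{=latex}: it assumes that coarse-grained data are obtained by averaging over non-overlapping windows and repeating each average to keep the original length, and it uses a toy series made of one low-frequency and one high-frequency sinusoid. All names and parameter values are chosen for illustration.

```python
import cmath
import math


def dft_magnitude(series, k):
    """Magnitude of the k-th discrete Fourier coefficient."""
    n = len(series)
    return abs(sum(x * cmath.exp(-2j * math.pi * k * t / n)
                   for t, x in enumerate(series)))


def coarsen(series, window):
    """Average over non-overlapping windows and repeat each average
    so the output keeps the original length."""
    out = []
    for start in range(0, len(series), window):
        block = series[start:start + window]
        out.extend([sum(block) / len(block)] * len(block))
    return out


# Toy series: one low-frequency and one high-frequency component.
n, low, high = 240, 5, 40
series = [math.sin(2 * math.pi * low * t / n)
          + math.sin(2 * math.pi * high * t / n) for t in range(n)]
coarse = coarsen(series, window=4)

# Fraction of each component's magnitude that survives coarsening.
ratio_low = dft_magnitude(coarse, low) / dft_magnitude(series, low)
ratio_high = dft_magnitude(coarse, high) / dft_magnitude(series, high)
```

On this toy signal the high-frequency component retains a much smaller fraction of its magnitude than the low-frequency one, which is the qualitative pattern described above.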

# Appendix: Implementation details

## Benchmark datasets {#sec_benchmark_datasets}

For our experiments, we use `Solar`, `Electricity`, `Traffic`, `Taxi`, `KDD-cup` and `Wikipedia` open-source datasets, with their properties listed in Table `\ref{tab-dataset}`{=latex}.

The datasets can be obtained through the links below.

\begin{inparaenum}
\renewcommand{\labelenumi}{(\roman{enumi})}

    \item \texttt{Solar}: \texttt{\url{https://www.nrel.gov/grid/solar-power-data.html}}
    
    \item \texttt{Electricity}: \texttt{\url{https://archive.ics.uci.edu/dataset/321/electricityloaddiagrams20112014}}
    
    \item \texttt{Traffic}: \texttt{\url{https://archive.ics.uci.edu/dataset/204/pems+sf}}
    
    \item \texttt{Taxi}: \texttt{\url{https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page}}
    
    \item \texttt{KDD-cup}: \texttt{\url{https://www.kdd.org/kdd2018/kdd-cup}}
    
    \item \texttt{Wikipedia}: \texttt{\url{https://github.com/mbohlkeschneider/gluon-ts/tree/mv_release/datasets}}
\end{inparaenum}
\resizebox{\textwidth}{!}{%
    \begin{tabular}{cccccccccc}
    \toprule
    Name  & Frequency & Number of series & Context length & Prediction length & Multi-granularity dictionary\\
    \hline
$\texttt{Solar}$ & 1 hour & 137 & 24 & 24 & [1 hour, 4 hour, 12 hour, 24 hour, 48 hour]\\
   $\texttt{Electricity}$   &  1 hour & 370 & 24 & 24 & [1 hour, 4 hour, 12 hour, 24 hour, 48 hour]\\
   $\texttt{Traffic}$  &  1 hour & 963 & 24 & 24 & [1 hour, 4 hour, 12 hour, 24 hour, 48 hour] \\
   $\texttt{Taxi}$   &  30 min & 1214 & 24 & 24 & [30 min, 2 hour, 6 hour, 12 hour, 24 hour] \\
      $\texttt{KDD-cup}$   &  1 hour & 270 & 48 & 48 & [1 hour, 4 hour, 12 hour, 24 hour, 48 hour]\\
    $\texttt{Wikipedia}$   &  1 day & 2000 & 30 & 30 & [1 day, 4 day, 7 day, 14 day] \\
         \bottomrule
    \end{tabular}
}

## Libraries used

The `\mgtsd `{=latex}code in this study is implemented using PyTorch [@paszke2019pytorch]. It utilizes the PyTorchTS library [@pytorchgithub], which enables convenient integration of PyTorch models with the GluonTS library [@alexandrov2020gluonts], on which we rely heavily for data preprocessing, model training, and evaluation in our experiments.

The code for the baseline methods is obtained from the following sources.

\begin{inparaenum}
\renewcommand{\labelenumi}{(\roman{enumi})}
    \item Vec-LSTM-ind-scaling: models the dynamics via an RNN and outputs the parameters of an independent Gaussian distribution with mean-scaling. \\
    Code: \url{https://github.com/mbohlkeschneider/gluon-ts/tree/mv_release}
    
    \item GP-scaling: a model that unrolls an LSTM with scaling on each individual time series before reconstructing the joint distribution via a low-rank Gaussian. \\    
    Code: \url{https://github.com/mbohlkeschneider/gluon-ts/tree/mv_release}
    
    \item GP-Copula: a model that unrolls an LSTM on each individual time series. The joint emission distribution is then represented by a low-rank plus diagonal covariance Gaussian copula.\\
    Code: \url{https://github.com/mbohlkeschneider/gluon-ts/tree/mv_release}

    \item LSTM-MAF: a model which utilizes LSTM for modeling the temporal conditioning and employs Masked Autoregressive Flow~\citep{papamakarios2017masked} for the distribution emission.\\
    Code: \url{https://github.com/zalandoresearch/pytorch-ts/tree/master/pts/model/tempflow}
    
    \item Transformer-MAF: a model which utilizes Transformer~\citep{Vaswani2017attention} for modeling the temporal conditioning and employs Masked Autoregressive Flow~\citep{papamakarios2017masked} for the distribution emission model.\\
    Code: \url{https://github.com/zalandoresearch/pytorch-ts/tree/master/pts/model/transformer_tempflow}
    
    \item TimeGrad: an auto-regressive model designed for multivariate probabilistic time series forecasting, assisted by an energy-based model.\\
    Code:~\url{https://github.com/zalandoresearch/pytorch-ts}
    
    \item TACTiS: a non-parametric copula model based on transformer architecture.\\
    Code: \url{https://github.com/servicenow/tactis}

    \item $\text{D}^3\text{VAE}$:
    a bidirectional variational auto-encoder (BVAE) equipped with diffusion, denoising, and disentanglement.\\
    Code: \url{https://github.com/ramber1836/d3vae}
    
    \item TimeDiff: a predictive framework trained by blending hidden contextual elements with future actual outcomes for sample conditioning.\\
    Code: There is no publicly available code; we obtained the code by emailing the author.

    \item Autoformer: redefines the Transformer with a deep decomposition architecture, including series decomposition units, auto-correlation mechanisms, and encoder-decoders.\\
    Code: \url{https://github.com/thuml/Autoformer}

    \item PatchTST: an efficient design of Transformer-based models for multivariate time series forecasting and self-supervised representation learning.\\
    Code: \url{https://github.com/yuqinie98/PatchTST}
    
\end{inparaenum}

## Hyper-parameter setting for each model {#sec_app_hyper_parameter}

\resizebox{\textwidth}{!}{%
    \begin{tabular}{ccccccc}
    \toprule
    Dataset  & Number of granularities & Granularity dictionary   & Share ratio   & Loss weight \\
    \hline
    $\texttt{Solar}$ $\texttt{Electricity}$  $\texttt{Traffic}$  $\texttt{KDD-cup}$

    
    & 2
    &\text{[1h,4h]}  \text{[1h,12h]}
     & [1,0.9] & 
     
     \text{[0.9,0.1]} \\ 
    & &  & [1,0.8] & \\  
    & & & [1,0.6] &  \\
    \cline{2-5}
    
    & 3 & \text{[1h,4h,12h]} \text{[1h,4h,24h]} & [1,0.9,0.8] & [0.8, 0.1, 0.1]\\
    & &  & [1,0.8,0.8] & [0.9, 0.05, 0.05]\\
    & &  & [1,0.8,0.6] & [0.85, 0.10, 0.05]\\
    \cline{2-5}
    
    & 4 & 
    
    \text{[1h,4h,12h,24h]} \text{[1h,4h,12h,48h]}
    &[1,0.9,0.8,0.8]&
    \text{[0.8, 0.1, 0.05, 0.05]}  \text{[0.7,0.1,0.1,0.1]}
    \\
    &  & &[1,0.9,0.8,0.6]&\\
    &  & &[1,0.8,0.6,0.6]& \\
    &  & &[1,0.8,0.6,0.4]&\\
    \cline{2-5}
    
    & 5 &
    \text{[1h,4h,8h,12h,24h]}  \text{[1h,4h,12h,24h,48h]} & [1,0.9,0.8,0.6,0.6]&
    
    \text{[0.8,0.1,0.05,0.04,0.01]}  \text{[0.8,0.05,0.05,0.05,0.05]}  \text{[0.6,0.1,0.1,0.1,0.1]} \\
    &  & & [1,0.9,0.8,0.6,0.4]& \\    
    &  & & [1,0.8,0.6,0.6,0.6]&  \\
    &  &  & [1,0.8,0.6,0.6,0.4]& \\
    &  &  & [1,0.8,0.6,0.4,0.4]& \\

    \midrule
    $\texttt{Taxi}$
    
    & 2 & [30m,2h] & 
    \text{[1,0.8]} \text{[1,0.6]}
     & \text{[0.9,0.1]}\\
    &  & [30m,6h] &  &\\
    &  & [30m,12h] &  &\\
    &  & [30m,24h] &  &\\
    
    \midrule
  
    $\texttt{Wikipedia}$
    
    & 3 & \text{[1d,4d]} \text{[1d,7d]}  \text{[1d,14d]} & 
    \text{[1,0.8]}  \text{[1,0.6]}
    
     & \text{[0.9,0.1]}\\
    &  &  & &\\
    &  &  &  &\\

\bottomrule
    \end{tabular}}

# Appendix: Metrics {#sec_app_metrics}

`\label{sec-app-metrics}`{=latex}

More details about the metrics we adopt can be found in the GluonTS documentation [@gluonts_jmlr]. We briefly summarize them below:

**`\crps`{=latex}**: Following @de2020normalizing, CRPS is a univariate strictly proper scoring rule that measures the compatibility of a cumulative distribution function $F$ with an observation $x\in \mathbb{R}$ as $$\begin{equation*}
\text{CRPS}(F,x) = \int_{\mathbb{R}} (F(y) - \bm{1}\{x \leq y\})^2 \dd y,
\end{equation*}$$ where $\bm{1}\{x\leq y\}$ is the indicator function, which is 1 if $x\leq y$ and 0 otherwise. The CRPS attains its minimum value when the predictive distribution $F$ is the same as the data distribution. `\crps `{=latex}extends CRPS to multivariate time series with a simple modification: $$\begin{equation*}
\text{CRPS}_{\text{sum}}=\mathbb{E}_t\left[\text{CRPS}\left(F^{-1}_{\text{sum}},\sum_{i}x_t^{i}\right)\right],
\end{equation*}$$ where $F^{-1}_{\text{sum}}$ is calculated by summing samples across dimensions and then sorting them to obtain quantiles. A smaller `\crps `{=latex}indicates better performance.
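To make the definition concrete, the sketch below estimates this quantity from forecast samples. It is an illustration under stated assumptions and not the evaluation code used in the experiments: it relies on the sample-based identity $\text{CRPS}(F,x)=\mathbb{E}|X-x|-\tfrac{1}{2}\mathbb{E}|X-X'|$ for the empirical distribution instead of a quantile-based approximation, it applies no normalization by the target magnitude, and the function names and array layout `(num_samples, T, D)` are hypothetical.

```python
import numpy as np


def crps_from_samples(samples, x):
    """Empirical CRPS of 1-D forecast samples against a scalar observation x.

    Uses CRPS(F, x) = E|X - x| - 0.5 * E|X - X'|, with X and X' drawn
    independently from the empirical distribution of `samples`.
    """
    samples = np.asarray(samples, dtype=float)
    term_obs = np.mean(np.abs(samples - x))
    term_spread = np.mean(np.abs(samples[:, None] - samples[None, :]))
    return term_obs - 0.5 * term_spread


def crps_sum(samples, target):
    """Average CRPS of the summed series over time.

    samples: array of shape (num_samples, T, D)
    target:  array of shape (T, D)
    """
    samples = np.asarray(samples, dtype=float)
    target = np.asarray(target, dtype=float)
    summed_samples = samples.sum(axis=-1)   # sum across dimensions -> (S, T)
    summed_target = target.sum(axis=-1)     # -> (T,)
    scores = [
        crps_from_samples(summed_samples[:, t], summed_target[t])
        for t in range(summed_target.shape[0])
    ]
    return float(np.mean(scores))
```

For a degenerate forecast in which every sample equals the same value, the spread term vanishes and the score reduces to the absolute error of the summed series, which is a convenient sanity check.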

**NMAE**: NMAE is a normalized version of the Mean Absolute Error (MAE) that takes into consideration the scale of the target values. The formula for NMAE is as follows: $$\begin{equation*}
\text{NMAE} = \frac{\text{mean}(|\hat{Y}-Y|)}{\text{mean}(|Y|)}
\end{equation*}$$ In this formula, $\hat{Y}$ represents the predicted time series, and $Y$ represents the true target time series. NMAE calculates the average absolute difference between predictions and true values, normalized by the mean absolute magnitude of the target values. A smaller NMAE implies more accurate predictions.
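As a small worked instance of this formula, the sketch below computes NMAE for a point forecast. The function name `nmae` and the example values are hypothetical; how the point forecast $\hat{Y}$ is derived from the predictive samples is left open here.

```python
import numpy as np


def nmae(y_pred, y_true):
    """Mean absolute error divided by the mean absolute target value."""
    y_pred = np.asarray(y_pred, dtype=float)
    y_true = np.asarray(y_true, dtype=float)
    return float(np.mean(np.abs(y_pred - y_true)) / np.mean(np.abs(y_true)))


# Targets (1, 2, 3) with a constant forecast of 2:
# MAE = (1 + 0 + 1) / 3 = 2/3 and mean(|Y|) = 2, so NMAE = 1/3.
print(nmae([2.0, 2.0, 2.0], [1.0, 2.0, 3.0]))
```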

**NRMSE**: NRMSE is a normalized adaptation of the Root Mean Squared Error (RMSE) that factors in the scale of the target values. The formula for NRMSE is as follows:

$$\begin{equation*}
\text{NRMSE} = \frac{\sqrt{\text{mean}((\hat{Y}-{Y})^2)}}{\text{mean}(|Y|)}
\end{equation*}$$ Here, $\hat{Y}$ represents the predicted time series, and $Y$ represents the true target time series. NRMSE measures the square root of the average squared difference between predictions and true values, normalized by the mean absolute magnitude of the target values. A smaller NRMSE indicates more accurate predictions.
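The sketch below illustrates the normalization, assuming the convention in which the RMSE itself is divided by the mean absolute target, $\text{NRMSE} = \text{RMSE}/\text{mean}(|Y|)$, so that the metric is unchanged when predictions and targets are rescaled by the same factor. The function name `nrmse` and the example values are hypothetical.

```python
import numpy as np


def nrmse(y_pred, y_true):
    """Root mean squared error divided by the mean absolute target value."""
    y_pred = np.asarray(y_pred, dtype=float)
    y_true = np.asarray(y_true, dtype=float)
    rmse = np.sqrt(np.mean((y_pred - y_true) ** 2))
    return float(rmse / np.mean(np.abs(y_true)))


y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([2.0, 2.0, 2.0])

base = nrmse(y_pred, y_true)
rescaled = nrmse(1000.0 * y_pred, 1000.0 * y_true)  # same data in other units
print(base, rescaled)
```

Under this convention the two printed values agree up to floating-point error, which is what makes the metric comparable across datasets with very different magnitudes.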

\newpage

# Appendix: More illustrative plots

<figure id="fig:fig-case-appendix" data-latex-placement="H">

<figcaption>MG-TSD and TimeGrad prediction intervals and test set ground-truth for Solar data of some illustrative dimensions of 370 dimensions from the first rolling window.</figcaption>
</figure>
