---
author:
- Limei Wang
- Kaveh Hassani
- Si Zhang
- Dongqi Fu
- Baichuan Yuan
- Weilin Cong
- Zhigang Hua
- Hao Wu
- Ning Yao
- Bo Long
bibliography:
- paper.bib
title: Learning Graph Quantized Tokenizers
---

```{=latex}
\maketitle
```
Introduction {#sec:introduction}
============

Following the success of Transformers [@NIPS2017_3f5ee243] in Natural Language Processing [@devlin-etal-2019-bert; @NEURIPS2020_1457c0d6] and Computer Vision [@dosovitskiy2021an], Graph Transformers (GTs) [@dwivedi2020generalization; @ying2021transformers; @rampavsek2022recipe; @pmlr-v202-shirzad23a; @chen2023nagphormer; @wu2022nodeformer] have emerged as strong models in geometric deep learning. In contrast to message-passing Graph Neural Networks (GNNs), which are inherently constrained by strong locality inductive biases [@battaglia2018relationalinductivebiasesdeep; @veličković2018graph; @Hou2020Measuring; @hamilton2017inductive; @kipf2017semisupervised; @DBLP:conf/TMLR/ZhengFMH24], GTs exhibit greater expressivity due to their capacity to capture long-range interactions between nodes [@pmlr-v202-ma23c; @kim2022pure; @zopf20221]. This is particularly beneficial in heterophilic settings where local alignment does not hold [@fu2024vcrgraphormer]. This dichotomy highlights a fundamental trade-off between GNNs, which focus on local neighborhood aggregation, and GTs, which employ pairwise attention to model global graph structures. A natural question arises: can we synergistically integrate the strengths of both approaches to leverage the complementary benefits of local and global representations? Specifically, is it possible to harness the locality-aware representations learned by GNNs to construct discrete tokens, thereby enabling GTs to operate efficiently while still capturing salient graph properties?

GTs require consideration of both graph structure and features, as nodes with identical features will otherwise be projected into the same representation regardless of their surrounding structures [@hoang2024survey]. There are three general approaches to address this limitation [@hoang2024survey]: (1) node feature modulation, which involves injecting structural information into the node features; (2) context node sampling, where a sampling strategy is used to construct a sequence over the neighbor nodes; and (3) modifying the architecture of a vanilla Transformer to directly incorporate structural biases. Given that Transformers are universal approximators of sequence-to-sequence functions [@Yun2020Are] and considering the rapid developments in efficient implementation of Multi-Head Attention (MHA) module [@dao2022flashattention; @liu2024ringattention], which enables longer context sizes [@reid2024gemini], we argue that a well-designed graph tokenizer can allow a vanilla Transformer to efficiently process even large-scale graphs. Recent studies on applying Large Language Models (LLMs) to Text-Attributed Graphs (TAGs) have shown surprisingly strong performance gains surpassing those of GNNs, suggesting that vanilla Transformers are indeed capable of effectively learning graph structures [@ye-etal-2024-language; @xu2025makellmsstrongnode]. Nonetheless, LLMs are not efficient at inference time. Our goal is to devise a lightweight and efficient graph tokenizater that allows vanilla Transformer encoders to learn graph structures effectively.

Tokenizers typically employ self-supervised objectives to abstract data into a sequence of discrete tokens, allowing Transformers to learn representations across various modalities as a unified stream of data. The discretization is usually achieved through vector quantization techniques [@van2017neural; @lee2022autoregressive], which offer several benefits, including: (1) significantly reduced memory requirements, (2) improved inference efficiency, (3) allowing Transformers to focus on long-range dependencies rather than local information, and (4) the capacity to learn more high-level representations due to a compact latent space [@Yuan_2021_ICCV; @yu2022vectorquantized]. These advantages are particularly important in auto-regressive generative modeling, where quantized tokens allow Transformers to generate high-quality outputs in multiple modalities [@dubey2024llama; @lee2022autoregressive; @dhariwal2020jukebox; @pmlr-v139-ramesh21a; @team2024chameleon]. Despite its importance in other domains, tokenization remains under-explored for graph-structured data. To address this limitation, we propose the **Graph Quantized Tokenizer (GQT)**, a novel approach that learns a hierarchical sequence of tokens over graphs using self-supervised objectives tailored to graph-structured data. More specifically, our contributions are as follows:

-   We propose a graph tokenizer that uses multi-task graph self-supervised learning to train a graph encoder, enabling it to fully capture local interactions and allowing the Transformer to focus on long-range dependencies.

-   Our approach adapts Residual Vector Quantization (RVQ) within the graph tokenizer to learn hierarchical discrete tokens, resulting in significantly reduced memory requirements and improved generalization capabilities.

-   We introduce a novel combination of semantic edges and random walks to facilitate access to long-range interactions, and employ hierarchical encoding and gating mechanisms to modulate the tokens and provide informative representations to the Transformer.

-   Through extensive experiments on both homophilic and heterophilic datasets, including large-scale and long-range benchmarks, we demonstrate that our tokenizer enables Transformer encoders to achieve state-of-the-art performance on 20 out of 22 benchmarks while substantially reducing the memory footprint of the embeddings.

Related Works {#sec:related_works}
=============

**Graph Transformers (GTs)** have shown promising performance on various graph learning tasks, surpassing GNNs on many benchmarks. Designing GTs can be broadly categorized into two directions [@hoang2024survey; @muller2024attending]: (1) modifying the vanilla Transformer architecture to incorporate structural inductive biases, or (2) encoding the input graph to make it compatible with the vanilla Transformer. Early examples of the first approach include Graph Attention Network (GAT) [@veličković2018graph], which uses an attention module to compute pairwise node attention and masks the attention matrix based on connectivity information. Subsequent works have replaced the scaled-dot attention module with various structure-aware sparse attention modules [@rampavsek2022recipe; @bo2023specformer; @ying2021transformers; @deng2024polynormer; @wu2023simplifying; @10.24963/ijcai.2023/244; @pmlr-v162-chen22r; @dwivedi2020generalization; @pmlr-v202-shirzad23a; @pmlr-v202-ma23c]. Graph Memory Network (GMN) [@Khasahmadi2020Memory-Based] is an example of the second approach, which passes non-linear projections of node features and structural encoding to a Transformer-like model. Structural encodings such as Laplacian eigenvectors or Random walk-based encoding [@dwivedi2022graph; @pmlr-v202-ma23c; @pmlr-v235-canturk24a], allow injecting structural information directly into the node features. Some works use GNNs to encode local structure along with node features into embeddings that are passed to vanilla Transformers to capture long-range dependencies [@rong2020self; @wu2021representing; @chen2023nagphormer; @pmlr-v162-chen22r]. Recent studies leverage LLMs, where graphs are represented through natural language, and an LLM performs graph-related tasks through in-context learning, instruction-tuning, or soft-prompting [@fatemi2024talk; @ye-etal-2024-language; @he2024harnessing]. For a detailed survey on GTs, see [@muller2024attending; @hoang2024survey].

**Graph Tokenization** provides GTs with rich node tokens that encapsulate both structural and semantic information. TokenGT [@kim2022pure] treats nodes and edges as independent tokens defined by their features, type identifiers, and structural encodings. NAGphormer [@chen2023nagphormer] represents each node with $L$ tokens, where the $l^{th}$ token is the representation of the node from the $l^{th}$ hop aggregation. GraphiT [@mialon2021graphit] defines a node token as the concatenation of its feature and representation from a graph convolutional kernel network (GCKN). VCR-Graphormer [@fu2024vcrgraphormer] expands the notion of node tokens to include sequences comprising the node feature and features of semantically and community-related neighboring nodes. SGT [@liu2023rethinking] is a non-parametric tokenizer designed for molecular tasks, which unlike motif-based tokenizers [@zhang2021motif; @jin2018junction] or GNN pre-training methods [@xia2023molebert], simplifies the tokenization process to a non-parametric graph operator without non-linearity. NodePiece [@galkin2022nodepiece] is a knowledge-graph tokenizer that represents a target node as a hash of its top-k closest anchors, their distances, and relational context. While Vector Quantization (VQ) has been explored in other modalities [@van2017neural; @lee2022autoregressive; @yu2022point; @van2024fast; @li2024geometry], its application in graph learning is limited. Notable exceptions include VQ-GNN [@ding2021vqgnn], which uses quantized representations combined with a low-rank graph convolution matrix to avoid neighbor explosion problem, VQGraph [@yang2024vqgraph], which employs VQ for distilling a GNN into an MLP, and NID [@luo2024structure], which uses VQ to learn discrete node IDs for downstream prediction tasks.

Preliminaries {#sec:preliminaries}
=============

**Messag-Passing GNNs**. Let $\mathcal{G}$ denote the space of graphs. A graph $g \in \mathcal{G}$ is defined as $\left(\mathcal{V}, \mathcal{E}, \textbf{X}, \textbf{E} \right)$ where $\mathcal{V}$ is the set of nodes and $\mathcal{E} \subseteq \mathcal{V} \times \mathcal{V}$ is the set of edges. $\textbf{X} \in \mathbb{R}^{\left|\mathcal{V}\right| \times d_x}$ represents the node features of dimension $d_x$, and $\textbf{E} \in \mathbb{R}^{\left|\mathcal{V}\right| \times \left|\mathcal{V}\right| \times d_e}$ represents the edge features of dimension $d_e$. A message-passing GNN takes $g$ as input and learns representations $h^l_v$ for $v \in \mathcal{V}$ ($h^0_v=x_v$) in each layer $l$ as follows [@gilmer2017neural]: $$h^l_v = f^l_{\theta}\left(h^{l-1}_v, g^l_{\phi} \left(\left\{\left(h^{l-1}_v, h^{l-1}_u, e_{uv} \right) | u \in \mathcal{N}_v\right\}\right) \right)$$ where $f_{\theta}$ and $g_{\phi}$ are known as combine and aggregate functions, respectively. $\mathcal{N}_v$ denotes the set of immediate neighbors of the node $v$. Once the node representations are computed, we can perform various tasks including node classification as $\text{MLP}\left(h_v\right)$, edge prediction as $\text{MLP}\left(h_u \odot h_v\right)$, or graph classification as $\text{MLP}\left(\mathcal{R}\left(\left\{h_u |u \in \mathcal{V}\right\}\right)\right)$, where $\mathcal{R}$ is a pooling (readout) function.

**Graph Transformers** use a tokenizer $T_v = \mathcal{T}_{\psi}\left(\mathcal{N}(v)\right)$ to map each node $v \in \mathcal{V}$ into a sequence of tokens $T_v$ by considering a notion of neighborhood $\mathcal{N}$. The simplest design is when $\mathcal{N}$ is zero-hop neighborhood (i.e., the node itself) and $\mathcal{T}_{\psi}$ is a node feature lookup function. The neighborhood $\mathcal{N}$ can be extended to include the node's ego network [@zhao2021gophormer] or top-k Random Walk based neighbors [@fu2024vcrgraphormer], and $\mathcal{T}_{\psi}$ can be enhanced to representations from a GNN [@chen2023nagphormer]. Node tokens along with positional encodings ($\text{PE}$) are passed to the Transformer as $h^0_v = \left[T_v || \text{PE}\left(v\right) \right]$. The representations in the $l^{th}$ layer of a Transformer encoder are computed as: $$\begin{aligned}
        h^l_v &= \text{LN}\left(\text{MHA}\left(\text{LN}\left(h^{l-1}_v\right)\right) + h^{l-1}_v\right) \\
        h^l_v &= h^l_v + \text{MLP}\left(h^l_v\right)\end{aligned}$$ where LN and MHA are LayerNorm and multi-head attention, respectively. Similar to Transformer encoders in other modalities [@devlin-etal-2019-bert; @dosovitskiy2021an], we can append a special classification token ($\left[\text{CLS}\right]$) to the input and use its representation to perform various classification tasks on the graph: $\text{MLP}\left(h_{\left[\text{CLS}\right]}\right)$.

**Vector Quantization** projects embeddings $\textbf{X} \in \mathbb{R}^{n \times d_x}$ into a more compact space of codebooks $\textbf{C} \in \mathbb{R}^{k \times d_c}$, where $k \ll n$. The codebooks can be learned by minimizing various objectives such as K-means clustering. The new representation of $x_i$ is then computed as [@van2017neural]: $$z(x_i) = c_k \quad \text{where} \quad k=\argmin_j \Vert x_i - c_j \Vert_2^2$$ Building upon this concept, Residual-VQ (RVQ) [@lee2022autoregressive] extends VQ to a sequence of codebooks, where each consecutive codebook quantizes the residual error from the previous codebook, i.e., $r_i=z_i - c_k$. This hierarchical approach constructs a multi-level quantized representation, enhancing the overall quantization quality. More details of RVQ are included in Appendix `\ref{sec:appendix_model_details}`{=latex}.

Self-Supervised Graph Tokenization {#sec:tokenizer}
==================================

Tokenizer Properties
--------------------

Our goal is to design a graph tokenizer that learns node tokens that exhibit three key characteristics:

**Modeling Local Interactions**. The tokens should encapsulate local interactions, allowing the Transformer to focus on long-range dependencies. This is analogous to Vision Transformers (ViTs), where the Transformer attends to image patches instead of pixels [@dosovitskiy2021an; @liu2021swin]. To achieve this, we leverage GNNs as the tokenizer encoder to model local interactions in the representation space [@battaglia2018relationalinductivebiasesdeep]. Our design accommodates various GNN layer choices without constraints; for simplicity, we opt for a GAT encoder [@veličković2018graph].

**Memory Efficiency**. The tokens also should be compact to facilitate efficient memory usage. To achieve this, we introduce a Residual-VQ (RVQ) [@lee2022autoregressive] layer to quantize the GNN representations into a sequence of discrete tokens. Quantization not only helps with generalization due to its regularization effect but also significantly reduces memory usage. Using an RVQ with $c$ codebooks (typically $c=\{2, \cdots, 8\}$), a graph with feature matrix $\textbf{X} \in \mathbb{R}^{N \times d_x}$ can be represented as $\textbf{X}_Q \in  \mathbb{N}^{N \times c}$ and codebook representation of $\textbf{C} \in  \mathbb{R}^{c \times K \times d_c}$, where $c$ is the number of codebooks (i.e., levels of quantization), $K$ is the codebook size, and $d_c$ is the code dimension. To illustrate the benefits of this approach, consider a graph with $10^6$ nodes and a feature dimension of 1024 ($\textbf{X} \in \mathbb{R}^{10^6 \times 1024}$). Using an RVQ with 3 codebooks and a codebook size of 256, this graph can be represented as $\textbf{X}_Q \in \mathbb{N}^{10^6 \times 3}$ plus $\textbf{C} \in  \mathbb{R}^{3 \times 256 \times 1024}$, resulting in a 270-fold reduction in memory.

**Robustness and Generalization**. The tokens should be robust and generalizable. To achieve this, we rely on graph self-supervised learning. Self-supervised representations have been shown to be more robust to class imbalance [@liu2022selfsupervised] and distribution shift [@shi2023how], while also capturing better semantic information [@assran2023self] compared to representations learned through supervised objectives. Moreover, self-supervised graph representations have demonstrated superior performance on downstream tasks compared to representations learned in a fully supervised manner, indicating better generalization capabilities [@Hu2020Strategies; @Sun2020InfoGraph; @you2020graph; @DBLP:conf/cikm/FuXLTH20; @you2021graph; @pmlr-v119-hassani20a; @velickovic_2019_iclr; @zhu2020deep]. Additionally, multi-task learning with self-supervised objectives has been shown to achieve better performance on downstream tasks [@doersch2017multi; @ghiasi2021multi]. To leverage these benefits, we propose training the GNN encoder with three self-supervised objectives. Unlike RQ-VAE [@lee2022autoregressive], which uses reconstruction as its primary objective, we employ graph-specific objectives to capture the nuances of both structure and features.

![`\small `{=latex}Overview of our proposed Graph Quantization Transformer (GQT) consisting of three main components: (1) a GNN to encode local interactions, (2) vector quantization for compact representation, and (3) generative and contrastive heads for robust representation learning. We also utilize a Transformers encoder to model long-range interactions. We augment the graph with semantic edges (dashed lines) and generate a sequence for each node based on Personalized PageRank scores. We then modulate the tokens through hierarchical encoding and structural gating, and feed them into the Transformer and aggregate the learned representations through an attention module before passing it to the classification head.](assets/tokenizer.png){#fig:overview width="5.9in"}

Training
--------

To capture different aspects of information, we employ a multi-task learning framework that leverages three distinct families of graph self-supervised objectives: student-teacher distillation [@thakoor2022largescale], masked autoencoding [@hou2022graphmae], and Infomax [@velickovic_2019_iclr]. We also introduce a commitment loss [@van2017neural] to enforce alignment between learned node representations and the codebook representations. Specifically, the GNN encoder is trained through gradient descent to minimize a loss function comprising of three terms, where $\beta$ is the loss weight: $$\mathcal{L} = \mathcal{L}_\text{dgi} + \mathcal{L}_\text{gmae2} + \beta \mathcal{L}_\text{commit}$$ The first term is the Deep Graph Infomax (DGI) [@velickovic_2019_iclr] objective, which maximizes mutual information (MI) between node representations and graph (sub-graph) representations, based on the Jensen-Shannon divergence between the joint and product of marginals as follows: $$\mathcal{L}_\text{dgi} = \mathbb{E} \left(\sum\limits_{v\in g} \log\left(\mathcal{D}\left(h_v, h_g\right)\right) + 
    \sum\limits_{u\in \tilde{g}} \log\left(1-\mathcal{D}\left(h_u, h_g\right)\right) \right)$$ where $h_u$ is the representation of node $u$. $h_g$ is the global (sub-graph/graph) representation, computed as the mean of node representations. $\tilde{g}$ is the corrupted version of the original graph, with the same structure but randomly shuffled features, providing negative examples for contrastive learning. $\mathcal{D}\left(h_u, h_g\right)=\sigma\left(h_u^T\textbf{W}h_g \right)$ is the discriminator that scores whether a node belongs to the graph, and is defined as a bilinear classifier.

The second term is the GraphMAE2 objective [@hou2023graphmae2], which combines the generative loss of GraphMAE [@hou2022graphmae] with the teacher-(noisy)student distillation loss of BGRL [@thakoor2022largescale]. This combination enables the model to avoid overfitting and learn more semantic representations. The GraphMAE2 loss is computed as follows: $$\mathcal{L}_\text{gmae2} = \sum\limits_{v \in \tilde{g}}\left(1-\frac{x_v^T.\tilde{h}_v}{\Vert x_v^T\Vert.\Vert\tilde{h}_v\Vert} \right)^\gamma + 
    \lambda \sum\limits_{v \in g}\left( 1-\frac{h_v^T.\tilde{h}_v}{\Vert h_v^T\Vert.\Vert\tilde{h}_v\Vert}\right)^\gamma$$ where $\tilde{g}$ is the masked graph, $\tilde{h}_v$ is the node representation of a masked node learned by the noisy student, $h_v$ is the corresponding node representation learned by the teacher over the original graph, and $\gamma \geq 1$ is a scaling factor. The teacher's parameters are updated using an Exponential Moving Average (EMA) of the noisy student's parameters.

The third term is the commitment loss, which encourages the representations to get close to their corresponding codebook embeddings within the RVQ layer. This loss is computed as: $$\mathcal{L}_\text{commit} = \frac{1}{|\mathcal{V}|}\sum\limits_{v\in g} ||h_v - \text{sg}\left[c_k \right] ||_2$$ where sg is the stop-gradient operator, and $c_k$ is the representation of the codebook that $h_v$ is assigned to (i.e., the centroid or prototype vector). Note that this loss only affects the node representations and does not update the codebooks.

To initialize and update the codebooks, we employ K-Means clustering and EMA with weight decay $\tau \in [0, 1]$, respectively. Specifically, the codebooks are updated as follows: $$c_k^{t} = \tau c_k^{t-1} + (1-\tau) \frac{1}{|\mathcal{V}_k|}\sum\limits_{v \in \mathcal{V}_k} h_v$$ where $\mathcal{V}_k$ is the set of nodes assigned to codebook $c_k$. This update rule allows the codebooks to adapt to the changing node representations while maintaining stability.

Graph Transformer {#sec:transformer}
=================

Graph Serialization
-------------------

Once the tokenizer is trained, each node $v \in \mathcal{V}$ is mapped to $c$ discrete tokens: $T_{v}=\left[t^v_1, \cdots,t^v_c\right] \in \mathbb{N}^{c}$ (i.e., $T_{v}=\textbf{X}_Q[v]$), encoding local interactions of that node. We then need to serialize the graph in order to input it to the Transformer.

**Semantic Edges**. To enable the Transformer to capture long-range interactions, the input should consist of a sequence of tokens from nodes that are likely to have long-range dependencies. To facilitate this, we first augment the graph with *semantic edges* denoted as $\mathcal{E}_s$, which are computed as follows: $$\mathcal{E}_s = \left\{e_{u,v} \mid \argtopk\limits_{u \in \mathcal{V}} \text{sim} \left(f\left(x_u\right), f\left(x_v\right)\right) \forall v \in \mathcal{V}\right\}$$ where $\text{sim}(\cdot, \cdot)$ denotes the similarity function, $x_u$ is the feature vector of node $u$, and $f$ is a projection function. We use cosine similarity as the similarity function and Principal Component Analysis (PCA) as the projection function. The semantic edge augmentation effectively creates sparse edges between each node and its k-nearest neighbors in the feature space, enhancing the model's ability to recognize and utilize long-range dependencies.

**Structural Serialization**. We combine the semantic edges with the original edges and use Personalized PageRank (PPR) to generate a sequence per target node. This enriches the sequence with information beyond local interactions, allowing the Transformer to access potential long-range dependencies. We construct the sequence $S_v$ for each node $v$ as follows: $$S_v = \left[T_v \Vert T_u\Vert_{u \in \argtopk \text{PPR}\left(v, \mathcal{E} \cup \mathcal{E}_s\right)}\right]$$

where $S_v=\left[t^v_1 \cdots t^v_c \mid t^{u_1}_1\cdots t^{u_1}_c\mid \cdots \mid  t^{u_k}_1 \cdots t^{u_k}_c\right]$ is a sequence of length $c\times (k+1)$, comprising discrete tokens that represent the target node $v$, followed by discrete tokens of the top-$k$ relevant nodes to node $v$. These relevant nodes are determined based on PPR scores. Note that the computation of semantic edges and PPR sequences is performed only once as a pre-processing step, thereby reducing the computational overhead during training.

Token Modulation
----------------

**Token Embeddings**. There are $c \times K$ possible discrete tokens, where $c$ is the number of codebooks and $K$ is the codebook size. We randomly initialize an embedding matrix $\textbf{X}_T \in \mathbb{R}^ {c \times K \times d_x}$, which is trained end-to-end alongside the Transformer. To further enrich the token representations, we introduce an additional token for each node that aggregating the embeddings of its assigned codebooks from the pretrained tokenizer: $$h_c^v = \sum\limits_{i=1}^{c}\textbf{C}[i, t_i^v]$$ where $\textbf{C}[i, j]$ is the embedding corresponding to index j in the ith codebook. We found that adding this explicit aggregated token from the codebook leads to better performance compared to initializing $\textbf{X}_T$ directly with $\textbf{C}$. The input representation of the sequence for node $v$ is then defined as: $$S_v=\left[\textbf{X}_T[i, t^v_i]\operatorname*{\Vert}_{i=1}^c h_c^v \ \Vert \ \textbf{X}_T[i, t^{u_1}_i]\operatorname*{\Vert}_{i=1}^c \  h_c^{u_1}\ \Vert \cdots\Vert \textbf{X}_T[i, t^{u_k}_i]\operatorname*{\Vert}_{i=1}^c \  h_c^{u_k}\right]$$ where $[\textbf{X}||\textbf{Y}]$ denotes concatenation of sequences $\textbf{X}$ and $\textbf{Y}$. This representation combines the individual token embeddings with the aggregated codebook embeddings, providing a more comprehensive and nuanced input to the Transformer.

**Structural Gating**. In order to provide the Transformer with the global structural importance scores of the nodes within the sequence with respect to the target node, we introduce a gating mechanism over the input token embeddings as follows: $$S_v = S_v \odot \text{Softmax}\left(\topk \text{PPR}\left(v, \mathcal{E} \cup \mathcal{E}_s\right)\right)$$ where we first apply a softmax function with temperature $\tau=1$ to normalize the PPR scores, and then multiply each node token's representation by its corresponding normalized score.

**Positional Encoding**. We also introduce two trainable positional encodings to the input tokens. The first positional encoding enables the Transformer to distinguish between tokens from different nodes, while the second encoding, referred to as hierarchical encoding, allows the Transformer to recognize the hierarchy level of each token within the codebooks. We randomly initialize the positional encodings $\textbf{PE} \in \mathbb{R}^{(k+1)\times d_x}$ and $\textbf{HE} \in \mathbb{R}^{c\times d_x}$ and sum them with the encoding of their corresponding token. For example, the final encoding of the token $j$ of the node $i$ within the sequence is computed as: $x=\textbf{X}_T[j, t^{u_i}_j]+\textbf{PE}[i] + \textbf{HE}[j]$. Note that we did not use any structural encoding, such as Laplacian eigenvectors, as we did not observe any significant gains from them.

Transformer Encoder
-------------------

We use $l$ layers of standard Transformer encoder with flash attention [@dao2022flashattention] to generate contextual representations per token in the sequence: $\textbf{H}^{(l)}\in \mathbb{R}^{(c+1)\times(k+1)\times d_h}$. We then aggregate the token representations for $j$-th node in the sequence by summing along the token dimension: $$\textbf{H}_{v_j}=\sum\limits_{i=1}^{c+1}\textbf{H}^{(l)}[i, j]\in \mathbb{R}^{(k+1)\times d_h}$$ To obtain a single representation for the entire sequence, We further aggregate the representation using a linear attention layer: $$h=\sum\limits_{i=1}^{k+1}\alpha_ih_i \quad \text{where} \quad \alpha_i=\frac{\exp(\textbf{W}h_i)}{\sum_j\exp(\textbf{W}h_j)}$$ We feed the resulting representation into a fully-connected classifier and train the model end-to-end using cross-entropy loss. Note that during inference, only the Transformer and classifier are utilized, as the tokenizer is pretrained and the sequences are pre-computed. Furthermore, since we only require discrete tokens and codebook embeddings, our approach enables efficient memory usage, regardless of graph size, allowing for efficient training and inference on large-scale graphs.

Experiments
===========

We evaluate GQT on both medium- and large-scale graph learning tasks, encompassing 22 homophilic, heterophilic, and long-range benchmarks. We follow the established experimental protocols from previous works to ensure fair comparisons. Details of the datasets, experimental setup, and hyperparameters are provided in Appendices `\ref{appendix:datasets}`{=latex} and `\ref{appendix:hyperparameters}`{=latex}, respectively.

Comparison with State-of-the-Art {#sec:main_results}
--------------------------------

**Long-Range Benchmarks**. We use four datasets from the Long-Range Graph Benchmark (LRGB) [@dwivedi2022long], including the Peptides-Func dataset for graph classification with Average Precision (AP) metric, the Peptides-Struct dataset for graph regression with Mean Absolute Error (MAE) metric, the COCO-SP dataset for inductive node classification with macro F1 metric, and the PCQM-Contact for link prediction with Mean Reciprocal Rank (MRR) metric. We compare our results to baselines reported in [@wang2024graph]. The results shown in Table `\ref{tab:LRGB}`{=latex} suggest that GQT is able to capture long-range dependencies and performs well on various graph prediction tasks.

```{=latex}
\resizebox{0.75\textwidth}{!}{
    \setlength{\tabcolsep}{3pt}
    \begin{tabular}{lcccc}\toprule
    \textbf{Task} &\textbf{Graph Classification} &\textbf{Graph Regression} &\textbf{Node Classification} &\textbf{Link Prediction} \\\midrule
    Dataset &Peptides-Func &Peptides-Struct &COCO-SP &PCQM-Contact \\
    \#Graphs & 15,535 & 15,535 & 123,286 & 529,434 \\
    Avg. \#Nodes & 150.94 & 150.94 & 476.88 & 30.14 \\
    Avg. \#Edges & 307.30 & 307.30 & 2,693.67 & 61.09 \\
    Metric &AP ↑ &MAE ↓ &F1 ↑ &MRR ↑ \\\midrule
    GCN &0.5930$\pm$0.0023 &0.3496$\pm$0.0013 &0.0841$\pm$0.0010 &0.3234$\pm$0.0006 \\
    Exphormer &0.6258$\pm$0.0092 &0.2512$\pm$0.0025 &0.3430$\pm$0.0108 &\textbf{0.3587$\pm$0.0025} \\
    GPS &0.6535$\pm$0.0041 &0.2500$\pm$0.0005 &0.3412$\pm$0.0044 &0.3337$\pm$0.0006 \\
    Graph-Mamba &0.6739$\pm$0.0087 &0.2478$\pm$0.0016 &0.3960$\pm$0.0175 &0.3395$\pm$0.0013 \\
    GQT (Ours) &\textbf{0.6903$\pm$0.0085} &\textbf{0.2452$\pm$0.0018} &\textbf{0.4007$\pm$0.0125} &0.3427$\pm$0.0012 \\
    \bottomrule
    \end{tabular}}
```
`\label{tab:LRGB}`{=latex}

```{=latex}
\setlength{\tabcolsep}{3pt}
```
`\label{tab:results_small_homophilic}`{=latex} `\resizebox{\textwidth}{!}{
\begin{tabular}{llcccccccc}
\toprule
{\begin{turn}{90}Dataset\end{turn}}
& & \textbf{CoraFull} &\textbf{CiteSeer}&\textbf{PubMed} &\textbf{Computer} &\textbf{Photo} &\textbf{CS} &\textbf{Physics}  &\textbf{WikiCS}\\
\cmidrule{2-10}
& \#Nodes & 19,793 & 3,327 & 19,717 & 13,752 & 7,650 & 18,333 & 34,493 & 11,701 \\
& \#Edges & 126,842 & 4,522 & 88,651 & 491,722 & 238,163 & 163,788 & 495,924 & 216,123 \\
& \#Features & 8,710 & 3,703 & 500 & 767 & 745 & 6,805 & 8,415 & 300 \\
& \#Classes & 70 & 6 & 3 & 10 & 8 & 15 & 5 & 10 \\
\midrule
{\begin{turn}{90}\textbf{GNN}\end{turn}} 
&GCN & 61.76$\pm$0.14 & 76.50$\pm$1.36 & 86.54$\pm$0.12 &89.65$\pm$0.52 &92.70$\pm$0.20 &92.92$\pm$0.12 &96.18$\pm$0.07 & 77.47$\pm$0.85\\
&GAT & 64.47$\pm$0.18 &76.55$\pm$1.23&86.32$\pm$0.16 &90.78$\pm$0.13 &93.87$\pm$0.11 &93.61$\pm$0.14 &96.17$\pm$0.08  & 76.91$\pm$0.82\\
&APPNP & 65.16$\pm$0.28 &76.53$\pm$1.16&88.43$\pm$0.15 &90.18$\pm$0.17 &94.32$\pm$0.14 &94.49$\pm$0.07 &96.54$\pm$0.07  & 78.87$\pm$0.11\\
&GPRGNN & 67.12$\pm$0.31 &77.13$\pm$1.67&89.34$\pm$0.25 &89.32$\pm$0.29 &94.49$\pm$0.14 &95.13$\pm$0.09 &96.85$\pm$0.08  &78.12$\pm$0.23\\
&GraphSAINT & 67.85$\pm$0.21 &$-$&88.96$\pm$0.16 &90.22$\pm$0.15 &91.72$\pm$0.13 &94.41$\pm$0.09 &96.43$\pm$0.05  &$-$\\
&GraphSAGE & $-$ & 75.58$\pm$1.33 & 87.48$\pm$0.38 &91.20$\pm$0.29 &94.59$\pm$0.14 &93.91$\pm$0.13 &96.49$\pm$0.06 & 74.77$\pm$0.95\\
&PPRGo & 63.54$\pm$0.25 &$-$&87.38$\pm$0.11 &88.69$\pm$0.21 &93.61$\pm$0.12 &92.52$\pm$0.15 &95.51$\pm$0.08  &78.12$\pm$0.23\\
&GRAND+ & 71.37$\pm$0.11 &$-$&88.64$\pm$0.09 &88.74$\pm$0.11 &94.75$\pm$0.12 &93.92$\pm$0.08 &96.47$\pm$0.04  &$-$\\
\midrule
{\begin{turn}{90}\shortstack[c]{\textbf{GT}}\end{turn}}
& GT & 61.05$\pm$0.38 &$-$&88.79$\pm$0.12 &91.18$\pm$0.17 &94.74$\pm$0.13 &94.64$\pm$0.13 &97.05$\pm$0.05  &$-$\\
&Graphormer & OOM &$-$&OOM &OOM &92.74$\pm$0.14 &94.64$\pm$0.13 &OOM  &$-$\\
&SAN & 59.01$\pm$0.34 &$-$&88.22$\pm$0.15 &89.93$\pm$0.16 &94.86$\pm$0.10 &94.51$\pm$0.15 &OOM  &$-$\\
&GraphGPS & 55.76$\pm$0.23 &76.99$\pm$1.12&88.94$\pm$0.16 &OOM &95.06$\pm$0.13 &93.93$\pm$0.15 &OOM  &78.66$\pm$0.49\\
 & GOAT       & $-$ & 76.89$\pm$1.19 & 86.87$\pm$0.24 & 90.96$\pm$0.90 & 92.96$\pm$1.48 & 94.21$\pm$0.38 & 96.24$\pm$0.24 & 77.00$\pm$0.77 \\
 & NodeFormer &$-$ & 76.33$\pm$0.59 & 89.32$\pm$0.25  & 86.98$\pm$0.62 & 93.46$\pm$0.35 & 95.64$\pm$0.22 & 96.45$\pm$0.28 &74.73$\pm$0.94 \\
 & DIFFormer  & $-$ & 76.72$\pm$0.68 & 89.51$\pm$0.67  & 91.99$\pm$0.76& 95.10$\pm$0.47  & 94.78$\pm$0.20 & 96.60$\pm$0.18 & 73.46$\pm$0.56 \\
 & NAGphormer & 71.51$\pm$0.13 &77.42$\pm$1.41&89.70$\pm$0.19 &91.22$\pm$0.14 &95.49$\pm$0.11 &95.75$\pm$0.09 &97.34$\pm$0.03  &77.16$\pm$0.72\\
 & Exphormer & 69.09$\pm$0.72 &76.83$\pm$1.24&89.52$\pm$0.54 &91.59$\pm$0.31 &95.27$\pm$0.42 &95.77$\pm$0.15 &97.16$\pm$0.13  &78.54$\pm$0.49\\
 & VCR-Graphormer & 71.67$\pm$0.10 &$-$&89.77$\pm$0.15 &91.75$\pm$0.15 &95.53$\pm$0.14 &95.37$\pm$0.04 &97.34$\pm$0.04  &$-$\\
\cmidrule{2-10}
&{GQT (ours)} & \textbf{71.81$\pm$0.21} &\textbf{77.84$\pm$0.94} &\textbf{90.14$\pm$0.16} &\textbf{93.37$\pm$0.44} &\textbf{95.73$\pm$0.18} &\textbf{96.11$\pm$0.09} &\textbf{97.53$\pm$0.06}  &\textbf{80.14$\pm$0.57}\\
\bottomrule
\end{tabular}}`{=latex}

**Homophilic Node Classification.** We use eight medium-scale homophilic datasets including: CoraFull [@bojchevski2017deep], CiteSeer, PubMed [@yang2016revisiting], Amazon Computers, Amazon Photos, Co-author CS, Co-author Physics [@shchur2018pitfalls], and WikiCS [@mernyei2020wiki]. We compare our results with eight GNNs including: GCN [@kipf2017semisupervised], GAT, APPNP [@gasteiger2018predict], GPRGNN [@chien2020adaptive], GraphSAINT [@zeng2019graphsaint], GraphSAGE [@hamilton2017inductive], PPRGo [@bojchevski2020scaling], and GTAND+ [@feng2022grand+]. We also compare against ten GTs including GT [@dwivedi2020generalization], Graphormer [@ying2021transformers], SAN [@kreuzer2021rethinking], GraphGPS [@rampavsek2022recipe], GOAT [@kong2023goat], NodeFormer [@wu2022nodeformer], DiffFormer [@wu2023difformer], NAGphormer [@chen2023nagphormer], Exphormer [@pmlr-v202-shirzad23a], and VCR-Graphormer [@fu2024vcrgraphormer]. The baseline performance is reported from [@wu2023simplifying; @luo2024structure]. GQT outperforms the baseline GNN and GT models on 7 out of 8 benchmarks (Table `\ref{tab:results_small_homophilic}`{=latex}). Notably, this achievement comes with a significant memory reduction. For example, on the Physics dataset with 34,493 nodes, we only use $256 \times 6$ tokens, i.e., a 23-fold memory reduction.

**Heterophilic Node Classification.** We also evaluate GQT on six medium-scale heterophilic datasets: Squirrel, Chameleon [@rozemberczki2021multi], Questions, Roman-Empire, Amazon-Ratings, and Minesweeper [@platonov2023critical]. We compare the performance with seven GNNs: GCN, GraphSAGE, GAT, GPRGNN, H2GCN [@zhu2020beyond], CPGNN [@zhu2021graph], and GloGNN [@li2022finding], and six GTs: GraphGPS, GOAT, NodeFormer, SGFormer, NAGphormer, and Exphormer. The baseline performance is reported from [@wu2023simplifying; @luo2024classic; @platonov2023critical; @behrouz2024graph]. As shown in Table `\ref{tab:results_heterophilic}`{=latex}, GQT outperforms the baselines on five out of six datasets. We observe that introducing semantic edges and structural gating specifically benefits the heterophilic setting (Appendix `\ref{appendix:ablation}`{=latex}), as they enable the Transformer to capture long-range dependencies that are not easily accessible through the original graph structure.

**Large-scale Node Classification** We also use four large-scale datasets: ogbn-proteins, ogbn-arxiv, ogbn-products [@hu2020open], and pokec (heterogeneous) [@leskovec2016snap]. We compare the performance against six GNN: LINKX [@lim2021large], SIGN [@frasca2020sign], GCN, GAT, GraphSAGE, and GPRGNN; and six GTs: GraphGPS, GOAT, NodeFormer, NAGphormer, Exphormer, and SGFormer [@wu2023simplifying]. We report the baseline performance from [@wu2023simplifying; @luo2024structure]. The results (Table `\ref{tab:results_large}`{=latex}) show that GQT outperforms the baseline models on all large-scale benchmarks. This achievement comes with a significant reduction in required memory. For instance, on the ogbn-products dataset with 2,449,029 nodes and 100-dimensional node features, GQT requires only 3 codebooks of size 4096, resulting in a 30-fold memory reduction.

```{=latex}
\small
```
```{=latex}
\setlength{\tabcolsep}{4pt}
```
```{=latex}
\resizebox{0.9\textwidth}{!}{
\begin{tabular}{llcccccc}
\toprule
{\begin{turn}{90}Dataset\end{turn}}
 & & \textbf{Squirrel} & \textbf{Chameleon} &\textbf{Amazon-Ratings} &\textbf{Roman-Empire} &\textbf{Minesweeper} &\textbf{Questions} \\
 \cmidrule{2-8}
 & \#Nodes & 5,201  &2,277  &22,662 & 24,492 & 10,000 & 48,921 \\
 & \#Edges & 216,933  &36,101  &32,927 & 93,050 & 39,402 & 153,540 \\
 & \#Features & 2,089 & 2,325 & 300 & 300 & 7 & 301\\
 & \#Classes & 5 & 5 & 18 & 5 & 2 & 2 \\
 & Measure & Accuracy$\uparrow$ & Accuracy$\uparrow$ & Accuracy$\uparrow$ &Accuracy$\uparrow$ &ROC-AUC$\uparrow$&ROC-AUC$\uparrow$  \\
 \midrule 
{\begin{turn}{90}GNN\end{turn}} 
& GCN & 38.67$\pm$1.84& 41.31$\pm$3.05&48.70$\pm$0.63&73.69$\pm$0.74&  89.75$\pm$0.52& 76.09$\pm$1.27 \\
& GraphSAGE & 36.09$\pm$1.99& 37.77$\pm$4.14&53.63$\pm$0.39&85.74$\pm$0.67 &93.51$\pm$0.57 &76.44$\pm$0.62\\
& GAT & 35.62$\pm$2.06&39.21$\pm$3.08  &52.70$\pm$0.62&88.75$\pm$0.41 &93.91$\pm$0.35 & 76.79$\pm$0.71\\
& H2GCN &  35.10$\pm$1.15& 26.75$\pm$3.64& 36.47$\pm$0.23&60.11$\pm$0.52 &89.71$\pm$0.31 &63.59$\pm$1.46 \\ 
& CPGNN &  30.04$\pm$2.03& 33.00$\pm$3.15& 39.79$\pm$0.77&63.96$\pm$0.62 &52.03$\pm$5.46 &65.96$\pm$1.95 \\ 
& GPRGNN&  38.95$\pm$1.99& 39.93$\pm$3.30& 44.88$\pm$0.34&64.85$\pm$0.27 &86.24$\pm$0.61 &55.48$\pm$0.91 \\ 
% & FSGNN&  35.92$\pm$1.32& 40.61$\pm$2.97  & 52.74$\pm$0.83&79.92$\pm$0.56 &90.08$\pm$0.70 &78.86$\pm$0.92 \\ 
& GloGNN&  35.11$\pm$1.24& 25.90$\pm$3.58& 36.89$\pm$0.14&59.63$\pm$0.69 &51.08$\pm$1.23 &65.74$\pm$1.19  \\ 
 \midrule 
{\begin{turn}{90}GT\end{turn}} 
& GraphGPS& 39.67$\pm$2.84&40.79$\pm$4.03   & 53.10$\pm$0.42&82.00$\pm$0.61 &90.63$\pm$0.67 &71.73$\pm$1.47 \\ 
& NodeFormer&  38.52$\pm$1.57& 34.73$\pm$4.14&  43.86$\pm$0.35&64.49$\pm$0.73& 86.71$\pm$0.88& 74.27$\pm$1.46\\ 
& SGFormer&  41.80$\pm$2.27& \textbf{44.93$\pm$3.91}&  48.01$\pm$0.49&79.10$\pm$0.32& 90.89$\pm$0.58& 72.15$\pm$1.31 \\
& NAGphormer &35.80$\pm$1.33 &$-$ & 51.26$\pm$0.72 &74.34$\pm$0.77 &84.19$\pm$0.66 &$-$ \\
& Exphormer &36.04$\pm$1.45 &$-$ & 53.51$\pm$0.46 & 89.03$\pm$0.37 &90.74$\pm$0.53 &$-$ \\ 
\cmidrule{2-8}
& GQT(ours) &\textbf{42.72$\pm$1.69} &44.23$\pm$3.05  &\textbf{54.32$\pm$0.41}& \textbf{90.98$\pm$0.24} &\textbf{97.36$\pm$0.35} & \textbf{78.94$\pm$0.86} \\
\bottomrule
\end{tabular}}
```
```{=latex}
\setlength{\tabcolsep}{4.5pt}
```
```{=latex}
\resizebox{0.65\textwidth}{!}{
\begin{tabular}{llcccc}
\toprule
{\begin{turn}{90}Dataset\end{turn}}
 & & \textbf{ogbn-proteins}&\textbf{ogbn-arxiv} &\textbf{ogbn-products} &\textbf{pokec}\\
 \cmidrule{2-6}
 & \#Nodes & 132,534  &169,343  &2,449,029 & 1,632,803 \\
 & \#Edges & 39,561,252  &1,166,243  &61,859,140 & 30,622,564 \\
 & \#Features & 128 & 8 & 100 & 65 \\
 & \#Classes & 40 & 2 & 47 & 2 \\
 & Measure &  ROC-AUC$\uparrow$&Accuracy $\uparrow$ & Accuracy $\uparrow$ & Accuracy $\uparrow$ \\ 
\midrule
{\begin{turn}{90}\textbf{GNN}\end{turn}}
 & GCN & 72.51$\pm$0.35 & 71.74$\pm$0.29 & 75.64$\pm$0.21  &75.45$\pm$0.17 \\
 & GAT & 72.02$\pm$0.44 & 71.95$\pm$0.36 & 79.45$\pm$0.59 & 72.23$\pm$0.18 \\
 & GPRGNN & 75.68$\pm$0.49 & 71.10$\pm$0.12 & 79.76$\pm$0.59 & 72.23$\pm$0.18  \\
 & LINKX & 71.37$\pm$0.58 & 66.18$\pm$0.33 & 71.59$\pm$0.71 & 82.04$\pm$0.07\\
 & GraphSAGE & 77.68$\pm$0.20&71.49$\pm$0.27 &78.29$\pm$0.16 & 75.63$\pm$0.38\\
 & SIGN &$-$ &71.95$\pm$0.11 &80.52$\pm$0.16 &$-$ \\
\midrule
{\begin{turn}{90}\textbf{GT}\end{turn}}
  & GraphGPS & 76.83$\pm$0.26 & 70.97$\pm$0.41 & OOM & OOM  \\
  & GOAT   & 74.18$\pm$0.37 & 72.41$\pm$0.40 & 82.00$\pm$0.43 & 66.37$\pm$0.94 \\
  & NodeFormer & 77.45$\pm$1.15 & 59.90$\pm$0.42 & 72.93$\pm$0.13 &71.00$\pm$1.30 \\
  & SGFormer   & 79.53$\pm$0.38 & 72.63$\pm$0.13 & 74.16$\pm$0.31 &73.76$\pm$0.24 \\
  & NAGphormer & 73.61$\pm$0.33  & 70.13$\pm$0.55 & 73.55$\pm$0.21 &76.59$\pm$0.25\\
  & Exphormer  & 74.58$\pm$0.26  & 72.44$\pm$0.28 & OOM & OOM   \\
 \cmidrule{2-6}
  & GQT(ours) &\textbf{82.13$\pm$0.34} &\textbf{73.14$\pm$0.16} &\textbf{82.46$\pm$0.17} &\textbf{83.76$\pm$0.24} \\
\bottomrule
\end{tabular}}
```
Ablation Study
--------------

**Effect of Tokenization**. We examine the performance of the tokenizer by training a linear model on the representations of the learned tokens without modulation, augmentation, or Transformer (1). As shown in Table `\ref{tab:ablation}`{=latex}, within the linear evaluation protocol, the tokenizer shows strong performance, surpassing that of GTs such as GraphGPS and NAGphormer, as well as GNNs like GAT and SIGN (Table `\ref{tab:results_large}`{=latex}). This implies that the tokenizer is capable of learning effective token representations. To further investigate the importance of the tokenizer, we exclude it and train the Transformer directly on the original node features (2). As expected, this results in significant degradation in performance, highlighting the crucial role of the tokenizer. Additionally, to study the effects of vector quantization, GraphMAE2, and DGI objectives, we train the model by excluding each component (3-5). The results suggest that the SSL objectives contribute more significantly to the performance compared to vector quantization. This is because the primary purpose of vector quantization is to compress information into discrete tokens, reducing memory requirements. Between GraphMAE2 and DGI, the former yields the highest gain. This is due to its composition of two objectives: masked reconstruction and teacher-(noisy)student distillation. Both of these objectives have been shown to outperform InfoMax objectives on downstream tasks [@hou2022graphmae; @thakoor2022largescale].

**Effect of Modulation**. We also investigate the impact of codebook embeddings, positional encoding, and structural gating on the model's performance (6-8). As shown in Table `\ref{tab:ablation}`{=latex}, introducing aggregated codebook embeddings leads to improved downstream performance because it provides the Transformer with richer representations of each token. Positional encoding, as observed in other domains, contributes moderately to downstream performance. We also note that introducing structural gating yields moderate improvements in homophilic settings, whereas the gains are significant in heterophilic benchmarks (`\ref{appendix:ablation}`{=latex}). This disparity can be attributed to the ability of structural gating to provide the Transformer with importance scores computed over the global graph structure, which is particularly beneficial in heterophilic scenarios.

**Effect of Augmentation**. We study the effect of semantic edges on downstream performance (9). The results suggest that augmenting the graph structure with semantic edges yields significant gains. This is because introducing semantic edges allows the Transformer to access semantic information that may not be captured by the original graph structure. Furthermore, when combined with random walks, this also enables the Transformer to attend to long-range dependencies which is particularly important in heterophilic benchmarks, where semantic relationships between nodes are more nuanced.

```{=latex}
\setlength{\tabcolsep}{3pt}
```
```{=latex}
\resizebox{0.5\textwidth}{!}{
    \begin{tabular}{lcccc}\toprule
    Attack & GR-BCD & PR-BCD\\\midrule
    & PubMed & ogbn-arxiv & PubMed & ogbn-arxiv \\\midrule
    RQ-VAE &   20.40\% & 14.80\% &23.30\% &17.20\% \\
    GQT (ours) & 15.80\% &10.40\% &18.10\% &11.30\% \\
    \bottomrule
    \end{tabular}}
```
`\label{tab:robust}`{=latex}

**Robustness Analysis**. To measure robustness, we use Greedy Randomized Block Coordinate Descent (GRBCD) and Projected Randomized Block Coordinate Descent (PR-BCD) [@geisler2021robustness] adversarial attacks to measure the accuracy degradation. We compare the GQT with an RQ-VAE [@lee2022autoregressive]. The results (Table `\ref{tab:robust}`{=latex}) show that our tokenizer is more robust to attacks. This is because GQT is trained with multi-task self-supervised objectives while RQ-VAE is trained with reconstruction objective. More details are provided in Appendix `\ref{appendix:additional_results}`{=latex}.

```{=latex}
\setlength{\tabcolsep}{3pt}
```
```{=latex}
\resizebox{\textwidth}{!}{
\begin{tabular}{lccc ccc cc c c}\toprule
 & \textbf{Graph Tokenizer} & \textbf{Token Modulation} & \textbf{Augmentation} &\textbf{Model} & \textbf{Performance}\\
 \cmidrule{2-9}\cmidrule{11-11}
 & RVQ  & GMAE2 & DGI & Codebook   & Positional & Structural & Semantic & PPR      & & Accuracy$\uparrow$ \\
 &     &        &     & Embeddings & Encoding   & Gating     & Edges    & Sequence & & \\
\midrule
(1) & \cmark &\cmark &\cmark & \cmark & & & & & Linear & 71.91$\pm$0.13 \\
(2) &  &  & & &\cmark & & &\cmark & Transformer & 70.68$\pm$0.17 \\
\midrule
(3) & &\cmark &\cmark &\cmark & \cmark & \cmark&\cmark & \cmark & Transformer & 72.84$\pm$0.23 \\
(4) & \cmark & &\cmark &\cmark & \cmark & \cmark&\cmark & \cmark & Transformer & 71.83$\pm$0.19 \\
(5) & \cmark & \cmark & &\cmark & \cmark & \cmark &\cmark & \cmark & Transformer & 72.71$\pm$0.24 \\
\midrule
(6) & \cmark &\cmark &\cmark &  &\cmark & \cmark& \cmark &\cmark & Transformer & 71.34$\pm$0.16 \\
(7) & \cmark  &\cmark &\cmark &\cmark &  & \cmark &\cmark & \cmark & Transformer &  72.69$\pm$0.21 \\
(8) & \cmark & \cmark & \cmark &\cmark & \cmark & &\cmark & \cmark & Transformer &  73.08$\pm$0.14 \\
\midrule
(9) & \cmark &\cmark &\cmark &\cmark &\cmark &\cmark & &\cmark & Transformer & 72.59$\pm$0.25 \\
\midrule
(10) & \cmark & \cmark & \cmark &\cmark & \cmark & \cmark &\cmark & \cmark & Transformer & \textbf{73.14$\pm$0.16} \\
\bottomrule
\end{tabular}}
```
```{=latex}
\newpage
```
Conclusion
==========

We introduced the **G**raph **Q**uantized **T**okenizer (GQT) to provide standard Transformer encoders to acces discrete graph tokens that encapsulate local interactions and allow Transformers to attend to long-range dependencies within the graph structure. This allows us to seamlessly take advantage of the rapid advances in scaling Transformers. We achieved state-of-the-art performance on 20 out of 22 datasets, including large-scale and long-range homophilic and heterophilic datasets. As future directions, we plan to explore the potential of GQT in generative graph learning. Additionally, we aim to couple GQT with LLMs to provide a shared feature space across various graph datasets, paving the way for true Graph Foundational Models (GFMs) [@liu2023towards; @mao2024position].

```{=latex}
\clearpage
```
```{=latex}
\newpage
```
```{=latex}
\bibliographystyle{assets/plainnat}
```
```{=latex}
\clearpage
```
```{=latex}
\newpage
```
```{=latex}
\beginappendix
```
Preliminaries {#preliminaries}
=============

**Graph Attention Networks (GAT)**. The representation of node $i$ in layer $l$ is computed as: $$h_i^l=\sigma\left(\sum_{j \in \mathcal{N}_i} \alpha_{ij} \textbf{W} h_j^{(l-1)}\right),  
    \quad
    \alpha_{ij}=\frac{\exp\left(\sigma\left(\textbf{W}_2 \left[\textbf{W}_1h_i^{(l-1)}\Vert\textbf{W}_1h_j^{(l-1)}\right] \right) \right)}
    {\sum\limits_{k \in \mathcal{N}_i} \exp\left(\sigma\left(\textbf{W}_2 \left[\textbf{W}_1h_i^{(l-1)}\Vert\textbf{W}_1h_k^{(l-1)}\right] \right)\right)}$$ where $\sigma$ is a non-linearity, and $\alpha_{ij}$ is the normalized attention score between two connected nodes $i$ and $j$.

**Personalized PageRank (PPR)**. A PPR vector for a node $u$ captures the relative importance of other nodes with respect to node $u$ by exploring the graph structure through iterative random walks: $$\label{eq: ppr}
    r = \alpha \mathbf{P} r + (1-\alpha) q$$ where $\mathbf{P}=\mathbf{D}^{-\frac{1}{2}}\mathbf{A}\mathbf{D}^{-\frac{1}{2}} \in \mathbb{R}^{n \times n}$, $q$ is a stochastic personalized vector, $r$ is the stationary distribution of random walks, and $\alpha$ is a damping factor.

Model Details {#sec:appendix_model_details}
=============

```{=latex}
\begin{algorithm}[H]\caption{Graph Tokenizer}
\label{alg:tokenizer}
  \begin{algorithmic}[1]
    \STATE {\bfseries Input:} Graph $g = \left(\mathcal{V}, \mathcal{E}, \textbf{X} \right)$, Graph Encoder $\text{GNN}_\theta$, Residual Quantizer $\text{RQ}_\Phi$, BGRL Loss $\text{RQ}_\Phi$
    \STATE $\mathbf{H_v}=\text{GNN}_\theta(g)$\hfill\COMMENT{Node Representations}
    \STATE $\mathbf{C}, \mathbf{Z}, \mathbf{T}, \mathcal{L}_{commit} = \text{RQ}_\Phi(\mathbf{H_v})$\hfill\COMMENT{codebooks, quantized representation, discrete tokens}
    \STATE $\mathcal{L}_{dgi}=\text{DGI}(\mathbf{Z})$\hfill\COMMENT{Compute DGI Loss}
    \STATE $\mathcal{L}_{bgrl}=\text{BGRL}(\mathbf{Z})$\hfill\COMMENT{Compute BGRL Loss}
    \STATE $\mathcal{L}_{mae}=\text{MAE}(\mathbf{Z})$\hfill\COMMENT{Compute MAE Loss}
    \STATE $\mathcal{L} = \mathcal{L}_{dgi} + \mathcal{L}_{bgrl} + \mathcal{L}_{mae} + \beta \times \mathcal{L}_{commit}$\hfill\COMMENT{Compute Multi-Task Loss}
    \RETURN $\mathbf{Z}$, $\mathbf{T}$, $\mathcal{L}$
\end{algorithmic}
\end{algorithm}
```
```{=latex}
\begin{algorithm}[H]\caption{Residual Vector Quantization}
\label{alg:RVQ}
  \begin{algorithmic}[1]
    \STATE {\bfseries Input:} Data $\mathbf{X}$, Number of codebooks $N$, Size of codebook $K$, Dimension of codebooks $d$
    \STATE $\mathcal{L}_\text{commit} = 0$
    \STATE $\mathbf{C} =$ Random($N$, $K$, $d$)   
    \STATE $\mathbf{Z} =$ Zeros($|\mathbf{X}|$, $d$)
    \STATE $\mathbf{T} =$ Zeros($|\mathbf{X}|$, $N$)
    \FOR{$i$ in $|\mathbf{X}|$}
        \STATE $r = \mathbf{X}[i]$
        \FOR{$j=1$ to $N$}
            \STATE $k=\argmin_k \Vert r - \mathbf{C}[j, k] \Vert_2^2$
            \STATE $r = \Vert r-\mathbf{C}[j, k] \Vert_2^2$
            \STATE $\mathbf{T}[i][j]=k$
            \STATE $\mathbf{Z}[i] = \mathbf{Z}[i] + \mathbf{C}[j][k]$
            \STATE $\mathcal{L}_\text{commit} = \mathcal{L}_\text{commit} + \Vert r- \text{sg}[\mathbf{C}[j, k]] \Vert_2^2$
        \ENDFOR
    \ENDFOR
    \RETURN $\mathbf{Z}$, $\mathbf{T}$, $\mathcal{L}/|\mathbf{X}|$
\end{algorithmic}
\end{algorithm}
```
Datasets {#appendix:datasets}
========

We provide a detailed description of the datasets used in this study. All datasets are publicly available.

-   **CoraFull** [@bojchevski2017deep], **CiteSeer**, and **Pubmed** [@namata2012query] are citation datasets, where nodes represent documents and edges represent citation links. Labels indicate the paper category.

-   **Computer** and **Photo** [@shchur2018pitfalls] are from the Amazon co-purchase graph [@mcauley2015image], where nodes represent goods and edges indicate that two goods are frequently bought together. Node features are bag-of-words encoded product reviews, and class labels are given by the product category.

-   **CS** and **Physics** [@shchur2018pitfalls] are co-authorship graphs based on the Microsoft Academic Graph from the KDD Cup 2016 challenges. Here, nodes are authors connected by an edge if they co-authored a paper; node features represent paper keywords for each author's papers, and class labels indicate the most active fields of study for each author.

-   **WikiCS** [@mernyei2020wiki] is derived from Wikipedia, where nodes represent Computer Science articles, and edges are based on hyperlinks. The nodes are classified into 10 classes representing different branches of the field.

-   **Squirrel** and **Chameleon** [@rozemberczki2021multi; @pei2020geom] are Wikipedia page-page networks, where nodes represent articles from the English Wikipedia, and edges reflect mutual links between them. The nodes were classified into five classes based on their average monthly traffic.

-   **Amazon-Ratings** [@platonov2023critical] is based on Amazon product co-purchasing data. Nodes represent products (books, music CDs, DVDs, VHS video tapes), and edges connect products that are frequently bought together. The task is to predict the average rating given to a product by reviewers.

-   **Roman-Empire** [@platonov2023critical] is based on the Roman Empire article from the English Wikipedia. Each node in the graph corresponds to one word (not necessarily unique) in the text, so the number of nodes equals the length of the article. Two words are connected if they follow each other in the text or are linked in the sentence's dependency tree. A node's class represents its syntactic role.

-   **Minesweeper** [@platonov2023critical] is inspired by the Minesweeper game. The graph consists of regular 100x100 grid, where each node (cell) is connected to eight neighboring nodes (except for nodes at the edge of the grid, which have fewer neighbors). 20% of the nodes are randomly selected as mines. The task is to predict which nodes are mines. Node features are one-hot-encoded numbers of neighboring mines, however, for 50% of the nodes, these features are unknown, indicated by a separate binary feature.

-   **Questions** [@platonov2023critical] is based on data from the question-answering website Yandex Q, where nodes represent users, and edges connect two nodes if one user answered another user's question during a one-year time interval. The task is to predict which users remained active on the website, forming a binary classification task.

-   **ogbn-proteins** [@hu2020open] is a protein-protein association network, where nodes represent proteins, and edges indicate biologically meaningful associations between proteins, such as physical interactions, co-expression, or homology. The task is to predict the presence of protein functions in a multi-label binary classification setup.

-   **ogbn-arxiv** [@hu2020open] is a citation network between all Computer Science (CS) arXiv papers indexed by MAG [@wang2020microsoft]. Each node presents an arXiv paper, and directed edges indicate that one paper cites another. The task is to predict the 40 subject areas of arXiv CS papers, such as cs.AI, cs.LG, and cs.OS.

-   **ogbn-products** [@hu2020open] is an Amazon product co-purchasing network[^1] of 2 million products. Edges indicate that products are purchased together. The task is to predict the product category.

-   **pokec** [@leskovec2016snap; @lim2021large] is a social network, where nodes represent users, and edges represent friendships. The task is to predict the gender of users.

-   **Peptides-Func** is a peptide dataset retrieved from SATPdb [@singh2016satpdb] with over 15K peptides. Each node corresponds to a heavy atom, and edges are chemical bonds. The task is to predict 10 peptide functions, forming a multi-label graph classification task.

-   **Peptides-Struct** consists the same graphs as Peptides-Struct, but with different task. Here the task is to predict aggregated 3D properties (i.e., mass, valence) of the peptides at the graph level.

-   **COCO-SP** is a node classification dataset based on the MC COCO image dataset [@lin2014microsoft]. Each node corresponds to a region of the image belonging to a particular class. These superpixels nodes are extracted with the SLIC algorithm [@achanta2012slic], and two nodes are connected with an edge if the node regions share a common boundary. The task is to predict the semantic segmentation label for each superpixel node out of 81 classes.

-   **PCQM-Contact** is a molecule dataset with over 529K molecules [@nakata2017pubchemqc]. Atoms are nodes, and chemical bonds are edges. The task is to predict pairs of nodes that will be contacting with each other in the 3D space.

For CoraFull, Pubmed, PubMed, Computer, Photo, CS, and Physics, we follow previous work and use 60%/20%/20% train/valid/test split. For WiKiCS, we follow the official split in @mernyei2020wiki. For Squirrel, Chameleon, Amazon-Ratings, Roman-Empire, Minesweeper, and Questions, we follow the splits in @platonov2023critical. For ogbn-proteins, ogbn-arxiv, and ogbn-products, we follow the splits in @hu2020open. For pokec, we follow the split used in @lim2021large. For Peptides-Func, Peptides-Struct, COCO-SP, and PCQM-Contact, we follow the split provided in @dwivedi2022long.

Experimental Setup {#appendix:hyperparameters}
==================

**Software & Hardware.** GQT is implemented using PyTorch[^2], PyG[^3], DGL[^4], and the vector-quantize-pytorch package[^5]. Most datasets can be accessed through PyG and DGL. All experiments are conducted on a single Nvidia A100 GPU.

**Hyperparameters & Experimental Details.** As illustrated in Figure `\ref{fig:overview}`{=latex}, our method consists of two parts: the tokenizer and the Transformer encoder. We provide the hyperparameters and experimental details for each part below.

During the training of the graph tokenizer, we use full-graph training for small- and medium-scale datasets, and apply sampling for large-scale graphs. We consider different sampling methods, including random partitioning, which randomly samples nodes within a graph and returns their induced subgraph; neighbor sampling [@hamilton2017inductive], GraphSAINT [@zeng2019graphsaint], and local clustering @hou2023graphmae2. For the GNN encoder and decoder, we use GCN or GAT as our backbone and tune the number of layers from {1, 2, 3, 4, 5, 6, 7, 8, 9, 10} and hidden dimensions from {128, 256, 512, 1024}. For the quantizer, we use residual-VQ (RVQ) [@lee2022autoregressive] and tune the number of codebooks from {1, 2, 3, 6, 9} and the codebook size from {128, 256, 512, 1024, 2048, 4096}. We set the code dimension to be equal to the hidden dimension of the GNN encoder.

During the training of the Transformer, we use KNN to add semantic edges and tune the number of semantic neighbors from {0, 5, 10, 15, 20}. Then, we use PPR to generate a sequence of nodes for each target node. We tune the number of PPR neighbors from {0, 5, 10, 20, 30, 50}. For the Transformer model, we use the TransformerEncoder module in PyTorch as our backbone, and tune the number of layers from{1, 2, 3, 4, 5, 6}, the number of heads from {4, 8}, and the feedforward dimension from {512, 1024, 2048}. Note that for some small- and medium-scale datasets, we do not need PPR-based sequences, instead, we can directly serialize all nodes within the graph as in @rampavsek2022recipe.

```{=latex}
\setlength{\tabcolsep}{3pt}
```
```{=latex}
\resizebox{\textwidth}{!}{
    \begin{tabular}{lccccccccc}\toprule
        &\textbf{GNN Encoder} &\textbf{Quantizer} &\textbf{Transformer} \\\midrule
        &\# layers &\# Hidden dim &\# Codebooks &Codebook size &KNN &PPR &\# Layers &\# Heads &\# FFN dim \\\midrule
        CoraFull &2 &256 &3 &128 &0 &15 &2 &4 &512 \\
        CiteSeer &2 &256 &3 &128 &5 &15 &2 &4 &512 \\
        PubMed &2 &256 &3 &256 &0 &15 &2 &4 &512 \\
        Computer &2 &256 &3 &128 &5 &30 &2 &4 &512 \\
        Photo &3 &512 &3 &128 &5 &30 &2 &4 &1024 \\
        CS &2 &512 &3 &128 &5 &20 &2 &4 &1024 \\
        Physics &2 &256 &3 &256 &5 &30 &2 &4 &512 \\
        WikiCS &2 &256 &3 &128 &5 &30 &2 &4 &512 \\
        Squirrel &3 &256 &3 &128 &5 &30 &2 &4 &512 \\
        Chameleon &3 &256 &3 &128 &5 &30 &2 &4 &512 \\
        Amazon-Ratings &4 &512 &3 &128 &5 &20 &2 &4 &1024 \\
        Roman-Empire &6 &256 &3 &256 &10 &15 &3 &4 &512 \\
        Minesweeper &6 &128 &3 &128 &10 &15 &2 &4 &512 \\
        Questions &3 &256 &3 &512 &10 &15 &2 &4 &512 \\
        ogbn-proteins &6 &256 &3 &512 &0 &50 &3 &4 &512 \\
        ogbn-arxiv &4 &512 &3 &512 &5 &30 &2 &4 &1024 \\
        ogbn-products &4 &1024 &3 &4096 &5 &30 &2 &8 &2048 \\
        pokec &6 &256 &3 &512 &0 &50 &3 &4 &512 \\
        Peptides-Func &4 &128 &3 &128 &0 &0 &2 &4 &512 \\
        Peptides-Struct &4 &128 &3 &128 &0 &0 &2 &4 &512 \\
        COCO-SP &4 &128 &3 &128 &0 &0 &2 &4 &512 \\
        PCQM-Contact &4 &128 &3 &128 &0 &0 &2 &4 &512 \\
        \bottomrule
    \end{tabular}}
```
`\label{tab:selected_hyperparameters}`{=latex}

Additional Results {#appendix:additional_results}
==================

Further Ablation Study {#appendix:ablation}
----------------------

We also provide an ablation study on one of the heterophilic datasets. The results shown in Table `\ref{tab:appendix_ablation}`{=latex} suggest that introducing semantic edges and structural gating mechanisms specifically benefits the heterophilic setting.

```{=latex}
\resizebox{\textwidth}{!}{
\begin{tabular}{lccc ccc cc c c}\toprule
 & \textbf{Graph Tokenizer} & \textbf{Token Modulation} & \textbf{Augmentation} &\textbf{Model} & \textbf{Performance}\\
 \cmidrule{2-9}\cmidrule{11-11}
 & RVQ  & GMAE2 & DGI & Codebook   & Positional & Structural & Semantic & PPR      & & ROC-AUC$\uparrow$ \\
 &     &        &     & Embeddings & Encoding   & Gating     & Edges    & Sequence & & \\
\midrule
(1)  & \cmark &\cmark &\cmark & \cmark & & & & & Linear & 90.24$\pm$0.49 \\
(2)  &  &  & & &\cmark & & &\cmark & Transformer & 90.52$\pm$0.39  \\

\midrule

(3)  & &\cmark &\cmark &\cmark & \cmark & \cmark&\cmark & \cmark & Transformer & 95.27$\pm$0.46  \\
(4)  & \cmark & &\cmark &\cmark & \cmark & \cmark&\cmark & \cmark & Transformer & 92.91$\pm$0.55  \\
(5)  & \cmark & \cmark & &\cmark & \cmark & \cmark &\cmark & \cmark & Transformer & 93.82$\pm$0.46  \\
\midrule

(6)  & \cmark &\cmark &\cmark &  &\cmark & \cmark& \cmark &\cmark & Transformer & 93.24$\pm$0.36 \\
(7)  & \cmark  &\cmark &\cmark &\cmark &  & \cmark &\cmark & \cmark & Transformer & 94.82$\pm$0.41 \\
(8)  & \cmark & \cmark & \cmark &\cmark & \cmark & &\cmark & \cmark & Transformer & 93.97$\pm$0.58 \\
\midrule

(9)  & \cmark &\cmark &\cmark &\cmark &\cmark &\cmark & &\cmark & Transformer &92.83$\pm$0.35 \\
\midrule
(10) & \cmark & \cmark & \cmark &\cmark & \cmark & \cmark &\cmark & \cmark & Transformer & 95.28$\pm$0.44\\
\bottomrule

\end{tabular}}
```
```{=latex}
\newpage
```
Generalization Analysis {#appendix:generalization}
-----------------------

```{=latex}
\begin{wraptable}{r}{0.4\textwidth}
    
    \setlength{\tabcolsep}{3pt}
    
    \caption{Comparison between mean GQT and RQ-VAE performance over five runs.}
    \resizebox{0.4\textwidth}{!}{
    \begin{tabular}{lcc}\toprule
    ~ & ogbn-arxiv & Minesweeper \\\midrule
    RQ-VAE &66.05$\pm$0.48 &89.69$\pm$0.35 \\
    GQT (ours) & 73.14$\pm$0.16 & 95.28$\pm$0.44 \\
    \bottomrule
    \end{tabular}
    }
    \label{tab:generalization}
\end{wraptable}
```
To measure improved generalization, we follow the common practice of treating downstream predictive performance as a proxy for generalization. As shown in Table `\ref{tab:ablation}`{=latex} and Table `\ref{tab:appendix_ablation}`{=latex}, every component of the tokenizer, including both SSL objectives and the quantization layer, contributes to the downstream predictive performance, thereby improving the model's generalizability. Furthermore, to evaluate the contribution of multi-task SSL objectives to downstream performance, we compare our results with those of a tokenizer trained using the RQ-VAE [@lee2022autoregressive] design, which employs a reconstruction objective. The results presented below indicate that using multi-task SSL objectives significantly improves downstream predictive performance, which is strongly correlated with the method's generalization.

Efficiency Analysis
-------------------

```{=latex}
\begin{wraptable}{r}{0.6\textwidth}
    
    \setlength{\tabcolsep}{3pt}
    
    \caption{Memory and run time during inference.}
    \resizebox{0.6\textwidth}{!}{
    \begin{tabular}{lcccc}\toprule
    Attack & GPU Memory & Full Inference Time\\\midrule
    & ogbn-arxiv & Minesweeper & ogbn-arxiv & Minesweeper \\\midrule
    GAT &   2715MB & 2108M &5s & 1s \\
    GQT (ours) & 1324MB &1037MB & 4s & 1s \\
    \bottomrule
    \end{tabular}}
    \label{tab:efficiency}
\end{wraptable}
```
As mentioned in Section `\ref{sec:main_results}`{=latex}, using discrete tokens instead of node features results in significant memory reduction. For instance, on the ogbn-products dataset with 2,449,029 nodes and 100-dimensional node features, GQT requires only 3 codebooks of size 4096, resulting in a remarkable 30-fold reduction in memory usage. This memory reduction occurs after training the tokenizer. Since the encoder of the tokenizer is a GNN that processes the graph with original node features, its memory footprint is comparable to that of any arbitrary GNN. However, because the Transformer encoder only consumes discrete tokens, which are significantly fewer than the total number of nodes, we achieve a substantial reduction in memory footprint. As an additional experiment, we compare the inference time and memory usage between our Transformer encoder and a Graph Attention Network (GAT) when performing inference on all graph nodes. The results shown in Table `\ref{tab:efficiency}`{=latex} show that while our Transformer is on par with a sparse implementation of GAT in terms of inference time, it requires half the GPU memory.

[^1]: <http://manikvarma.org/downloads/XC/XMLRepository.html>

[^2]: <https://pytorch.org/>

[^3]: <https://pyg.org/>

[^4]: <https://www.dgl.ai/>

[^5]: <https://github.com/lucidrains/vector-quantize-pytorch>
