Evolution Strategies at the Hyperscale

Source

Authors

Bidipta Sarkar, Mattie Fellows, Juan Agustin Duque, Alistair Letcher, Antonio León Villares, Anya Sims, Dylan Cope, Jarek Liesen, Lukas Seier, Theo Wolf, Uljad Berdica, Alexander David Goldie, Aaron Courville, Karin Sevegnani, Shimon Whiteson, and Jakob Nicolaus Foerster.

Core Claim

This paper introduces EGGROLL, a low-rank perturbation method that makes evolution strategies hardware-efficient enough for billion-parameter models and very large populations.

Key Contributions

  • Replaces full-rank Gaussian perturbation matrices with low-rank factors, reducing auxiliary perturbation storage from O(mn) to O(r(m+n)) per matrix layer.
  • Uses counter-based deterministic RNG and batched low-rank adapter-style inference to avoid materializing perturbations.
  • Reports up to a hundredfold speedup for billion-parameter models at large population sizes, reaching up to 91% of pure batch-inference throughput and near-linear cluster scaling.
  • Shows EGGROLL can pretrain nonlinear recurrent language models in integer datatypes, compete with GRPO for reasoning post-training, and preserve ES behavior in tabula rasa RL.
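
The storage and RNG points above can be illustrated with a minimal sketch. It assumes NumPy's Philox bit generator as a stand-in counter-based RNG; these notes do not describe the paper's own RNG scheme. The function name sample_factors and the (key, step, member) indexing are hypothetical, chosen only for illustration. The idea is that factors are regenerated from indices whenever they are needed, so no perturbation has to be stored.

```python
import numpy as np

def sample_factors(key, step, member, m, n, r):
    """Regenerate the low-rank factors for one population member on demand.

    Hypothetical helper: the indexing scheme is an assumption, not the paper's.
    """
    # Philox is counter-based: the (key, counter) pair fully determines the stream.
    # The low counter words advance as numbers are drawn, so the member and step
    # indices go in the two high words to keep the streams disjoint.
    counter = np.array([0, 0, member, step], dtype=np.uint64)
    rng = np.random.Generator(np.random.Philox(key=key, counter=counter))
    A = rng.standard_normal((m, r))  # left factor, m x r
    B = rng.standard_normal((n, r))  # right factor, n x r
    return A, B

m, n, r = 512, 256, 1

# Same indices give the same factors; a different member gives different ones.
A1, B1 = sample_factors(key=1234, step=7, member=3, m=m, n=n, r=r)
A2, B2 = sample_factors(key=1234, step=7, member=3, m=m, n=n, r=r)
A3, B3 = sample_factors(key=1234, step=7, member=4, m=m, n=n, r=r)
same_stream = np.array_equal(A1, A2) and np.array_equal(B1, B2)
different_member = not np.array_equal(A1, A3)

# Entries per perturbation: full-rank m*n versus low-rank r*(m+n).
full_rank_entries = m * n
low_rank_entries = r * (m + n)
print(same_stream, different_member, full_rank_entries, low_rank_entries)
```

Because each member's factors are a pure function of its indices, a worker only needs the shared key and the member's fitness to reconstruct that member's contribution to the update.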

Method Notes

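For an m×n weight matrix, EGGROLL samples factors A (m×r) and B (n×r), forms the rank-r perturbation AB^T / sqrt(r), and weights each population member's perturbation by that member's scalar fitness when forming the update. Each individual perturbation has rank at most r, but the aggregate update is a sum over the population, so it can be full-rank once population size times rank reaches the smaller matrix dimension, min(m, n).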
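
A minimal NumPy sketch of this rank argument follows. The update rule used here, a plain fitness-weighted average of the perturbations, is an assumption: the notes say only that perturbations are weighted by scalar fitness, so step size, noise scale, and any fitness shaping are omitted. The names aggregate_update and sample_population are hypothetical, and the fitness values are random placeholders.

```python
import numpy as np

def aggregate_update(A, B, fitness):
    """Fitness-weighted average of rank-r perturbations A_i B_i^T / sqrt(r).

    A: (N, m, r), B: (N, n, r), fitness: (N,). Returns an (m, n) matrix.
    The averaging over N is an assumed normalisation.
    """
    N, _, r = A.shape
    # Contract over the population index i and the rank index r in one call.
    weighted_sum = np.einsum("i,imr,inr->mn", fitness, A, B)
    return weighted_sum / (N * np.sqrt(r))

def sample_population(rng, N, m, n, r):
    A = rng.standard_normal((N, m, r))
    B = rng.standard_normal((N, n, r))
    fitness = rng.standard_normal(N)  # placeholder scores, not real rewards
    return A, B, fitness

rng = np.random.default_rng(0)
m, n, r = 16, 12, 1

# Small population: N * r = 4 is below min(m, n) = 12, so rank is at most 4.
A, B, f = sample_population(rng, N=4, m=m, n=n, r=r)
small_update = aggregate_update(A, B, f)
rank_small = np.linalg.matrix_rank(small_update)

# Larger population: N * r = 64 exceeds min(m, n), so the update can be full-rank.
A, B, f = sample_population(rng, N=64, m=m, n=n, r=r)
large_update = aggregate_update(A, B, f)
rank_large = np.linalg.matrix_rank(large_update)

# Cross-check against an explicit loop over materialized perturbations.
explicit = sum(f[i] * (A[i] @ B[i].T) / np.sqrt(r) for i in range(64)) / 64
print(rank_small, rank_large, np.allclose(large_update, explicit))
```

With rank-1 perturbations, the small population yields an update of rank at most 4, while the larger one can span all min(m, n) = 12 directions.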

Evidence And Results

The paper’s most distinctive evidence is systems-level: EGGROLL improves arithmetic intensity by sharing the base matrix multiply and applying perturbations in a LoRA-like batched form. It also reports int8 recurrent LM pretraining from scratch and reasoning experiments on Countdown and GSM8K with RWKV-family models. The theory section argues that low-rank EGGROLL updates converge toward full-rank Gaussian ES updates at an O(1/r) approximation rate.
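
The shared-matmul idea can be sketched as follows, assuming each population member i evaluates (W + sigma * A_i B_i^T / sqrt(r)) x_i on its own input x_i. The scale sigma and the function name batched_perturbed_forward are hypothetical, and this NumPy version only shows the algebra, not the paper's kernels. The base product X W^T is computed once for the whole population, and each member adds a cheap rank-r correction A_i (B_i^T x_i), so no m×n perturbation is ever formed.

```python
import numpy as np

def batched_perturbed_forward(W, A, B, X, sigma):
    """Forward pass for N perturbed copies of W without materializing them.

    W: (m, n), A: (N, m, r), B: (N, n, r), X: (N, n). Returns (N, m).
    sigma is an assumed perturbation scale.
    """
    r = A.shape[-1]
    base = X @ W.T                           # one shared matmul, shape (N, m)
    low = np.einsum("inr,in->ir", B, X)      # B_i^T x_i, shape (N, r)
    delta = np.einsum("imr,ir->im", A, low)  # A_i (B_i^T x_i), shape (N, m)
    return base + (sigma / np.sqrt(r)) * delta

rng = np.random.default_rng(0)
N, m, n, r, sigma = 8, 6, 5, 2, 0.1
W = rng.standard_normal((m, n))
A = rng.standard_normal((N, m, r))
B = rng.standard_normal((N, n, r))
X = rng.standard_normal((N, n))

y_batched = batched_perturbed_forward(W, A, B, X, sigma)

# Reference: materialize every perturbed matrix explicitly.
y_reference = np.stack([
    (W + sigma * A[i] @ B[i].T / np.sqrt(r)) @ X[i] for i in range(N)
])
print(np.allclose(y_batched, y_reference))
```

Per member, the correction costs on the order of r(m+n) multiply-adds against mn for the base product, which is why the shared matmul dominates when r is small.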

Alex Context

Alex marked this source as normal and none. His note highlights this as the next step after ES-at-scale work: low-rank factorization turns the simple ES idea into a more practical hyperscale implementation for non-differentiable systems and fully integer language models.

Foundation TSFM Relevance

| Agenda slot | Verdict | Evidence | Missing pieces |
| --- | --- | --- | --- |
| Control-oriented fine-tuning | adjacent | The raw appendix fine-tunes an S5 order-flow foundation model with EGGROLL for limit-order-book execution, optimizing realized PnL in a simulator. | Evidence is one financial execution setting; no broad TSFM control benchmark or public reusable protocol. |
| Dynamic compute and non-differentiable objectives | adjacent | EGGROLL evaluates large low-rank perturbation populations with inference-like throughput and can optimize reward signals that need not be differentiable. | This is a training optimizer, not a model architecture or context interface for time-series foundation models. |
| Numeric event streams | warning | The HFT section treats limit-order messages as categorical-plus-numeric sequential data where magnitude carries meaning. | Tokenization and action semantics are domain-specific and not validated for general multivariate time series. |

Open Questions

  • Does rank-1 or very low-rank perturbation remain sufficient for dense Transformer LLMs, beyond the recurrent architectures and selected reasoning tasks tested so far?
  • How should EGGROLL be compared against LoRA-style gradient updates when both use low-rank structure but optimize through different signals?
  • Can integer-only or non-differentiable language-model components become a practical advantage rather than a niche demonstration?