LLM-Guided Safe Reinforcement Learning for Energy System Topology Reconfiguration

Source

Publication And Credibility

  • Paper date: published on arXiv 2026-03-14.
  • Venue/status: arXiv preprint.
  • Credibility: Exploratory preprint. It is worth tracking because it tests LLM-guided safe RL on 36-bus and 118-bus Grid2Op benchmarks, but the LLM component and the safety claims need independent reproduction before the results are treated as state of the art.

Core Claim

The paper combines Safety-SAC with a knowledge-based Safety-LLM module that refines unsafe or suboptimal transitions and inserts safer refinements into the RL replay buffer.
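The refinement loop described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: `Transition`, `ReplayBuffer`, `maybe_refine`, `refine_fn` (a stand-in for the Safety-LLM call), `simulate_fn` (a stand-in for re-running the environment), and both thresholds are hypothetical names. Keeping the original transition alongside the refined one is also an assumption.

```python
import random
from collections import deque
from dataclasses import dataclass
from typing import Callable, Deque, List, Tuple

State = Tuple[float, ...]


@dataclass(frozen=True)
class Transition:
    state: State
    action: int
    reward: float
    cost: float            # scalar safety cost of the step (assumed)
    next_state: State
    refined: bool = False  # provenance flag for replay-buffer accounting


class ReplayBuffer:
    """Fixed-capacity FIFO buffer that tracks how many entries were refined."""

    def __init__(self, capacity: int) -> None:
        self._items: Deque[Transition] = deque(maxlen=capacity)

    def add(self, t: Transition) -> None:
        self._items.append(t)

    def sample(self, batch_size: int) -> List[Transition]:
        k = min(batch_size, len(self._items))
        return random.sample(list(self._items), k)

    def refined_fraction(self) -> float:
        if not self._items:
            return 0.0
        return sum(t.refined for t in self._items) / len(self._items)

    def __len__(self) -> int:
        return len(self._items)


def maybe_refine(
    buffer: ReplayBuffer,
    t: Transition,
    refine_fn: Callable[[State, int], int],
    simulate_fn: Callable[[State, int], Tuple[float, float, State]],
    cost_threshold: float = 0.0,
    reward_threshold: float = 0.0,
) -> bool:
    """Store t; if it looks unsafe or suboptimal, try a simulator-checked refinement.

    Returns True only when a refined transition was added to the buffer.
    """
    buffer.add(t)  # assumption: the original transition is always kept
    if t.cost <= cost_threshold and t.reward >= reward_threshold:
        return False

    # Hypothetical LLM stand-in proposes a replacement action.
    proposed_action = refine_fn(t.state, t.action)
    # The simulator, not the LLM, supplies the outcome of the proposal.
    reward, cost, next_state = simulate_fn(t.state, proposed_action)

    safer = cost < t.cost
    same_cost_better_reward = cost == t.cost and reward > t.reward
    if not (safer or same_cost_better_reward):
        return False

    buffer.add(Transition(t.state, proposed_action, reward, cost, next_state, refined=True))
    return True
```

Tracking `refined_fraction()` is one simple way to report how much of the training data came from the refinement module, which matters when comparing against baselines that see only agent-generated transitions.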

L2RPN / Grid2Op Notes

The reported experiments use IEEE 36-bus and 118-bus Grid2Op benchmarks and compare against SAC, ACE, and safety-enhanced variants on reward, survival time, overloads, voltage violations, and safety costs.
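Metrics of this kind can be derived from per-step line loadings and bus voltages. The sketch below assumes loading ratios in the style of Grid2Op's `rho` observation attribute (flow divided by thermal limit, so values above 1.0 are overloads) and voltage magnitudes in per unit. The 0.95 to 1.05 p.u. band, the weights, and the names `StepSafety`, `step_safety`, and `summarize_episode` are illustrative assumptions, not the paper's definitions.

```python
from dataclasses import dataclass
from typing import Dict, Sequence


@dataclass(frozen=True)
class StepSafety:
    overloads: int            # number of lines with loading ratio > 1.0
    voltage_violations: int   # number of buses outside the assumed band
    cost: float               # weighted sum of violation magnitudes


def step_safety(
    rho: Sequence[float],
    v_pu: Sequence[float],
    v_min: float = 0.95,      # assumed lower voltage bound (p.u.)
    v_max: float = 1.05,      # assumed upper voltage bound (p.u.)
    w_thermal: float = 1.0,   # assumed weight on thermal excess
    w_voltage: float = 1.0,   # assumed weight on voltage excess
) -> StepSafety:
    """Count violations and compute a scalar safety cost for one time step."""
    overloads = sum(1 for r in rho if r > 1.0)
    voltage_violations = sum(1 for v in v_pu if v < v_min or v > v_max)
    thermal_excess = sum(max(0.0, r - 1.0) for r in rho)
    voltage_excess = sum(max(0.0, v_min - v, v - v_max) for v in v_pu)
    cost = w_thermal * thermal_excess + w_voltage * voltage_excess
    return StepSafety(overloads, voltage_violations, cost)


def summarize_episode(steps: Sequence[StepSafety]) -> Dict[str, float]:
    """Aggregate per-step records into episode-level metrics."""
    return {
        "survival_steps": len(steps),
        "overloads": sum(s.overloads for s in steps),
        "voltage_violations": sum(s.voltage_violations for s in steps),
        "safety_cost": sum(s.cost for s in steps),
    }
```

Because the weights directly change how methods rank on the aggregate cost, reporting the raw violation counts next to any weighted total makes comparisons easier to audit.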

Action-Time-Series / World-Model Notes

This is not a world model. It is an LLM-guided transition-refinement and exploration-shaping layer for safe RL. Its value for the wiki is as a cautionary example from a currently active research branch: language reasoning may help propose actions, but the actual dynamics still come from the simulator/environment.

Foundation TSFM Relevance

| Agenda slot | Verdict | Evidence | Missing pieces |
|---|---|---|---|
| Causal structure, counterfactuals, and control | adjacent | Uses control actions and safety-cost signals in Grid2Op. | No learned transition model or candidate-action rollout interface. |
| Safety and rare events | partially closes | Reformulates voltage and thermal violations into safety costs. | Need audits of LLM hallucination, prompt stability, and reproducibility. |
| Context interface | adjacent | Natural-language domain knowledge is used as guidance. | Needs structured action/state schemas before TSFM transfer. |
| Benchmark hygiene | warning | LLM-guided transition refinement depends on invocation frequency, reward threshold, LLM choice, training-only compute, simulator validation, and metric weighting. | Safety ranking differs between 36-bus and 118-bus results; prompt/model/version ablations and replay-buffer accounting are needed. |

Limitations / Gotchas

  • Do not treat this as a dynamics model. The LLM refines selected unsafe or suboptimal transitions, and the simulator/environment then validates them.
  • The LLM component is training-time exploration and replay-buffer shaping, not a runtime safety certificate.