LLM-Guided Safe Reinforcement Learning for Energy System Topology Reconfiguration
Source
- Raw Markdown: llm-guided-safe-rl-grid-topology-2026
- Rendered / retrieved PDF: paper_llm-guided-safe-rl-grid-topology-2026.pdf
- External source: https://arxiv.org/abs/2603.14018
Publication And Credibility
- Paper date: published on arXiv 2026-03-14.
- Venue/status: arXiv preprint.
- Credibility: Recent exploratory preprint. It is worth tracking because it tests LLM-guided safe RL on 36-bus and 118-bus Grid2Op benchmarks, but the LLM component and the safety claims need independent reproduction before the method is treated as state of the art.
Core Claim
The paper combines Safety-SAC with a knowledge-based Safety-LLM module that refines unsafe or suboptimal transitions and inserts the safer refined transitions into the RL replay buffer.
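To make the replay-buffer mechanism concrete, here is a minimal sketch of how flagged transitions might be refined and stored. It is not the paper's implementation. The names `Transition`, `RefiningReplayBuffer`, `refine`, `reward_threshold`, `cost_threshold`, and `refine_every` are all hypothetical, and `stub_refiner` is a placeholder for "ask the LLM for a safer action, then re-simulate it in the environment". The sketch also assumes the original transition is kept alongside its refinement, which the source does not confirm.

```python
import random
from collections import deque
from dataclasses import dataclass
from typing import Callable, Deque, List, Optional, Tuple


@dataclass(frozen=True)
class Transition:
    state: Tuple[float, ...]
    action: int
    reward: float
    cost: float  # safety cost; 0.0 means no violation
    next_state: Tuple[float, ...]
    done: bool
    source: str = "policy"  # "policy" or "llm_refined", kept for accounting


class RefiningReplayBuffer:
    """Replay buffer that also stores a refined copy of flagged transitions."""

    def __init__(
        self,
        capacity: int,
        refine: Callable[[Transition], Optional[Transition]],
        reward_threshold: float,
        cost_threshold: float = 0.0,
        refine_every: int = 1,
    ) -> None:
        self.buffer: Deque[Transition] = deque(maxlen=capacity)
        self.refine = refine
        self.reward_threshold = reward_threshold
        self.cost_threshold = cost_threshold
        self.refine_every = refine_every
        self._flagged = 0

    def add(self, t: Transition) -> None:
        # Assumption: the original transition is always stored.
        self.buffer.append(t)
        unsafe = t.cost > self.cost_threshold
        suboptimal = t.reward < self.reward_threshold
        if not (unsafe or suboptimal):
            return
        # Only call the (expensive) refiner on every `refine_every`-th flag.
        self._flagged += 1
        if self._flagged % self.refine_every != 0:
            return
        refined = self.refine(t)
        if refined is not None:  # None means the refinement was rejected
            self.buffer.append(refined)

    def refined_fraction(self) -> float:
        """Share of stored transitions that came from the refiner."""
        if not self.buffer:
            return 0.0
        n = sum(1 for t in self.buffer if t.source == "llm_refined")
        return n / len(self.buffer)

    def sample(self, batch_size: int, rng: random.Random) -> List[Transition]:
        k = min(batch_size, len(self.buffer))
        return rng.sample(list(self.buffer), k)


def stub_refiner(t: Transition) -> Optional[Transition]:
    # Placeholder only: a real refiner would query an LLM for a safer action
    # and obtain reward, cost, and next_state from the simulator.
    return Transition(t.state, 0, t.reward + 1.0, 0.0, t.state, False, "llm_refined")


buf = RefiningReplayBuffer(capacity=1000, refine=stub_refiner, reward_threshold=0.0)
buf.add(Transition((0.0,), 1, 1.0, 0.0, (0.1,), False))   # safe: stored as-is
buf.add(Transition((0.0,), 2, -1.0, 0.5, (0.1,), False))  # flagged: refined copy added
```

Tagging each stored transition with its `source` keeps the share of LLM-derived samples auditable, and `refine_every` and `reward_threshold` correspond to the invocation-frequency and reward-threshold knobs that any reproduction would need to report.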
L2RPN / Grid2Op Notes
The reported experiments use IEEE 36-bus and 118-bus Grid2Op benchmarks and compare against SAC, ACE, and safety-enhanced variants on reward, survival time, overloads, voltage violations, and safety costs.
Action-Time-Series / World-Model Notes
This is not a world model. It is an LLM-guided transition-refinement and exploration-shaping layer for safe RL. Its value for the wiki is as a cautionary example of a current research branch: language reasoning may help with action proposal, but the actual dynamics still come from the simulator/environment.
Foundation TSFM Relevance
| Agenda slot | Verdict | Evidence | Missing pieces |
|---|---|---|---|
| Causal structure, counterfactuals, and control | adjacent | Uses control actions and safety-cost signals in Grid2Op. | No learned transition model or candidate-action rollout interface. |
| Safety and rare events | partially closes | Reformulates voltage and thermal violations into safety costs. | Needs audits of LLM hallucination, prompt stability, and reproducibility. |
| Context interface | adjacent | Natural-language domain knowledge is used as guidance. | Needs structured action/state schemas before TSFM transfer. |
| Benchmark hygiene | warning | LLM-guided transition refinement depends on invocation frequency, reward threshold, LLM choice, training-only compute, simulator validation, and metric weighting. | Safety ranking differs between 36-bus and 118-bus results; prompt/model/version ablations and replay-buffer accounting are needed. |
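As an illustration of what "reformulating voltage and thermal violations into safety costs" can look like, the sketch below uses a simple hinge penalty. The paper's exact cost definition is not reproduced here: the function name `safety_cost`, the weights, the 0.95 to 1.05 p.u. voltage band, and the additive form are all assumptions. In Grid2Op the per-line loading ratio is available as `obs.rho`; per-unit bus voltages are assumed to be computed by the caller.

```python
from typing import Sequence


def safety_cost(
    rho: Sequence[float],   # line loading ratios (flow / thermal limit)
    v_pu: Sequence[float],  # bus voltages in per-unit (assumed precomputed)
    rho_limit: float = 1.0,
    v_min: float = 0.95,    # assumed lower voltage bound
    v_max: float = 1.05,    # assumed upper voltage bound
    w_thermal: float = 1.0,
    w_voltage: float = 1.0,
) -> float:
    """Hinge-style safety cost: zero when every limit is respected."""
    # Thermal term: total loading in excess of the limit, summed over lines.
    thermal = sum(max(0.0, r - rho_limit) for r in rho)
    # Voltage term: distance outside the [v_min, v_max] band, summed over buses.
    voltage = sum(max(0.0, v_min - v, v - v_max) for v in v_pu)
    return w_thermal * thermal + w_voltage * voltage


# No violation: both terms are zero.
cost_ok = safety_cost(rho=[0.5, 0.9], v_pu=[1.0, 1.02])
# One overloaded line and one undervoltage bus both contribute.
cost_bad = safety_cost(rho=[1.2, 0.8], v_pu=[0.90, 1.0])
```

The weights `w_thermal` and `w_voltage` make the metric-weighting concern explicit: changing them can change which method ranks as safest, so they belong in any reported comparison.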
Limitations / Gotchas
- Do not treat this as a dynamics model. The LLM refines selected unsafe or suboptimal transitions, then the simulator/environment supplies validation.
- The LLM component is training-time exploration and replay-buffer shaping, not a runtime safety certificate.