---
title: "AI-Based Autonomous Line Flow Control via Topology Adjustment for Maximizing Time-Series ATCs"
source_url: "https://arxiv.org/abs/1911.04263"
source: arXiv PDF-only e-print
---

AI-Based Autonomous Line Flow Control via Topology Adjustment for Maximizing Time-Series ATCs

Tu Lan, Jiajun Duan, Bei Zhang, Di Shi, Zhiwei Wang, Ruisheng Diao, Xiaohu Zhang
GEIRI North America, AI & System Analytics
San Jose, CA, USA
tu.lan@geirina.net; di.shi@geirina.net


    Abstract—This paper presents a novel AI-based approach            deregulated power market or utilities with limited choices
for maximizing time-series available transfer capabilities            (e.g., RTE France with nuclear power supplying vast majority
(ATCs) via autonomous topology control considering various            of its demands). This idea was first proposed in the early
practical constraints and uncertainties. Several AI techniques        1980s when several research efforts were conducted for
including supervised learning and deep reinforcement learning
(DRL) are adopted and improved to train effective AI agents for
                                                                      achieving multiple control purposes such as cost minimization,
achieving the desired performance. First, imitation learning (IL)     voltage, and line flow regulation [3]-[4]. Transmission line
is used to provide a good initial policy for the AI agent. Then,      switching or bus splitting/rejoining is essentially a
the agent is trained by DRL algorithms with a novel guided            multivariate discrete programming problem that is difficult to
exploration technique, which significantly improves the training      solve, given the complexity and uncertainties of bulk power
efficiency. Finally, an Early Warning (EW) mechanism is               systems. Various approaches have been reported to tackle this
designed to help the agent find good topology control strategies      problem. In [5], a mixed-integer linear programming (MIP)
for long testing periods, which helps the agent to determine          model is proposed with DC power flow approximation of the
action timing using power system domain knowledge, thereby            power network, where the generalized optimization solver,
increasing the system error tolerance and robustness.
Effectiveness of the proposed approach is demonstrated in the
                                                                      CPLEX, is adopted to solve the MIP. In [6], the transmission
“2019 Learn to Run a Power Network (L2RPN)” Global                    switching (TS) optimization process with DCOPF is
Competition, where the developed AI agents can continuously           decoupled from a master unit commitment procedure, where
and safely control a power grid to maximize ATCs without              the optimal TS schedule is formulated as a MIP problem that
operator’s intervention for up to 1-month’s operation data and        is again solved using CPLEX. Reference [7] presents a fast
eventually won the first place in both development and final          heuristic method to speed up the convergence using the
phases of the competition. The winning agent has been open-           aforementioned modeling and solution practice. Similar
sourced on GitHub.                                                    approaches with variations are also reported in [8] and [9],
    Keywords—Artificial intelligence, autonomous topology             which use a point estimation method for modeling system
control, available transfer capability, imitation learning, deep      uncertainties with AC power flow feasibility checking and
reinforcement learning, dueling DQN.                                  correction modules.
                                                                          However, several limitations are observed in existing
                       I.     INTRODUCTION                            methods, including: (a) Linear approximation in DC power
    Available transfer capability (ATC) is the remaining transfer     flow without considering all security constraints is typically
margin of a transmission network for further energy transactions.     utilized, which affects the solution accuracy for a real-world
Maximizing ATCs is of critical importance to bulk power systems       power grid. Using full AC power flow with all security
from both security and economic perspectives.                         constraints for optimization becomes non-convex due to the
Due to environmental and economic concerns,                           highly nonlinear nature of power grids, which cannot be
transmission expansion via building new lines for enlarging           effectively solved using state-of-the-art techniques without
transfer capabilities is no longer an easy option for many            relaxing/sacrificing certain security constraints or solution
utilities across the world. Additionally, the increasing              accuracy. (b) The combination set of lines and bus-bars to be
penetration of renewable energy, demand response, electric            switched simultaneously grows exponentially; in addition,
vehicles, and power-electronics equipment has caused more             sensitivity-based methods are susceptible to changing system
stochastic and dynamic behavior that threatens safe operation         operating conditions. Thus, it may take a long time to solve
of the modern power grid [1]-[2]. Thus, it becomes essential          such an optimization process for a large power grid,
to develop fast and effective control strategies for maximizing       preventing the solution from being deployed in the real-time
ATCs considering uncertainties while satisfying various               environment.
security constraints.                                                     To fill these technology gaps, this research presents a
    Compared with re-dispatching generators, shedding                 novel method that adopts AI-based algorithms (IL and DRL)
electricity demands, and installing FACTS devices, active             with several innovative techniques (including guided
network topology control via transmission line switching or           exploration and early warning) for training effective agents in
bus splitting for increasing ATCs and mitigating congestions          providing fast and autonomous topology control strategies for
provides a low-cost and effective solution, especially for a          maximizing time-series ATCs. The developed techniques
                                                                      were used to participate in the 2019 L2RPN, a global power
This work was supported by SGCC Science and Technology Program.       system AI competition hosted by RTE France and ChaLearn
[10], considering full AC power flow and practical                   5 key elements: a state space 𝒮, an action space 𝒜, a transition
constraints, which eventually outperformed all competitors’          matrix 𝒫, a reward function ℛ, and a discount factor 𝛾. In this
algorithms. The remainder of the paper is organized as               work, an AC power flow simulator is used to represent the
follows: section II presents the problem formulation and             environment [13]. The agent state ($s_t^a \in \mathcal{S}$) is a partial
introduces the principle of reinforcement learning for solving       observation from the environment state ($s_t^e \in \mathcal{S}$). State $s_t^a$
the Markov Decision Process (MDP). Section III provides the          contains 538 features, including active power outputs and
detailed architecture design, key steps, AI algorithms with          voltage setpoints of generators, loads, line status, line flows,
several innovative techniques, and implementation of the             thermal limits, timestamps, etc. The action space 𝒜 is formed
proposed methodology for autonomous topology control.                by including line switching, node splitting/rejoining, and a
Case studies are presented in section IV to demonstrate the          combination set of both. An immediate reward 𝑟𝑡 at each time
effectiveness of the proposed method. Finally, conclusions are       step is defined in Eq. (2) to assess the remaining available
drawn in section V with future work discussed.                       transfer capabilities:
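$$ r_t = \begin{cases} -1, & \text{if game over} \\ \dfrac{1}{N} \sum_{i=1}^{N} \max\left(0,\ 1 - \left(\dfrac{\mathit{lineflow}_i}{\mathit{thermallimit}_i}\right)^{2}\right), & \text{otherwise} \end{cases} \qquad (2) $$

                II.   PROBLEM FORMULATION
A. Objectives, control measures, and practical constraints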

                                                                     In MDP, a cumulative future return 𝑅𝑡 is defined which
     The problem to solve in this research is discussed in the
2019 L2RPN challenge with full details [10]. The main                contains the immediate reward and the discounted future
objective is to maximize the ATCs of a given power grid over         rewards, defined in Eq. (3) [12]:
all time steps of various scenarios. Each scenario is defined
$$ R_t = r_t + \gamma r_{t+1} + \dots + \gamma^{T} r_{t+T} = \sum_{k=0}^{T} \gamma^{k} r_{t+k} \qquad (3) $$
as operating the grid for a consecutive time period, e.g., four      where T is the length of the MDP chain, and 𝛾 ∈ [0, 1] is the
weeks with a fixed time interval of 5 minutes, considering           discount factor.
daily load variations, pre-determined generation schedules
and real-time adjustment, voltage setpoints of generator             C. Solving MDP via reinforcement learning
terminal buses, network maintenance schedules and                        With recent success in various control problems with high
contingencies. The control decisions only include network            nonlinearity and stochasticity, reinforcement learning is
topology adjustment, namely, one node splitting/rejoining            adopted, which exhibits great potential in maximizing long-
operation, one line switching, and the combination of these          term rewards for achieving a specific goal [1]-[2]. Various RL
two. System generation and loads are not allowed to be               algorithms exist with pros and cons. One typical example is
controlled for enhancing ATCs. Several hard constraints are          Q-learning, which utilizes a Q-table to map each state and
considered for all the scenarios of interest: (a) system             action pair using an action-value, 𝑄(𝑠, 𝑎), which evaluates
demands should be met at any time without load shedding;             action a taken at state s by considering the future cumulative
(b) no more than one power plant can be tripped; (c) no              return 𝑅𝑡 . According to the Bellman Equation [12], the
electrical islands can be formed as a result of topology             cumulative return can be represented as an expected return,
control; (d) AC power flow should converge at all times. Any         shown in Eq. (4):
hard-constraint violation causes “game over”. For
soft constraints, violations lead to certain consequences
$$ Q(s, a) = \mathbb{E}[R_t \mid S_t = s, A_t = a] = \mathbb{E}[r_t + \gamma Q(S_{t+1}, A_{t+1}) \mid S_t = s, A_t = a] \qquad (4) $$
instead of immediate “game over”. Overloaded lines over              To obtain the optimal action-value 𝑄∗ (𝑠, 𝑎), Q-learning looks
150% of their ratings are tripped immediately, which can be          one step ahead after taking action a at state 𝑠𝑡 , and greedily
recovered after 50 minutes (10 time steps); while for                considers the action 𝑎𝑡+1 at state 𝑠𝑡+1 for maximizing the
overloaded lines below 150% of their ratings, control                expected target value 𝑟𝑡 + 𝛾𝑄 ∗ (𝑠𝑡+1 , 𝑎𝑡+1 ) . Using the
measures can be used to mitigate the overloading issue with          Bellman equation, the algorithm can perform online updates
a time limit of 10 minutes (2 time steps). If still overloaded,      to control the Q-value towards the Q-target.
the line will be tripped, and cannot be recovered until after 50
minutes. In addition, a practical constraint is considered that
is to allow a “cooldown time” (15 minutes) before a switched
$$ Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_t + \gamma \max_{a_{t+1} \in \mathcal{A}} Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \right] \qquad (5) $$
line or node can be reused for action. Both soft and hard
                                                                     where 𝛼 represents the learning rate. Using a Q-table, both the
constraints make the problem more practical and close to
                                                                     state and action need to be discrete, thus making it difficult to
real-world grid operation. To examine the performance of
                                                                     handle complex problems. To overcome this issue, the deep Q
agents, metrics in Eq. (1) are used, which measure the time-
                                                                     network (DQN) method was developed which uses neural
series ATCs for a power grid.
                                                                     networks as a function approximator to estimate the Q-values,
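$$ \mathit{step\_score} = \sum_{i=1}^{\mathit{n\_lines}} \max\left(0,\ 1 - \left(\frac{\mathit{lineflow}_i}{\mathit{thermallimit}_i}\right)^{2}\right) $$
$$ \mathit{chronic\_score} = \begin{cases} 0, & \text{if game over} \\ \sum_{j=1}^{\mathit{n\_steps}} \mathit{step\_score}_j, & \text{otherwise} \end{cases} \qquad (1) $$
$$ \mathit{total\_score} = \sum_{k=1}^{\mathit{n\_chronics}} \mathit{chronic\_score}_k $$
                                                                     𝑄(𝑠, 𝑎), so it can support continuous states in the RL process
                                                                     without discretization of states or building the Q-table.
                                                                     Weights 𝜃 of the neural network represent the mapping from
                                                                     states to Q-values, and therefore, a loss function 𝐿𝑖 (𝜃) is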
                                                                     needed to update the weights and their corresponding Q-
The detailed mathematical formulation can be found in [11]           values, using Eq. (6) [14]:
and therefore is not repeated here due to space limitation.
$$ L_i(\theta_i) = \mathbb{E}_{s, a \sim \rho(\cdot)} \left[ \left( y_i - Q(s, a; \theta_i) \right)^{2} \right] \qquad (6) $$
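The scoring metric in Eq. (1), the reward in Eq. (2), and the return in Eq. (3) share a few simple building blocks. The following is a minimal sketch of them in plain Python, not the competition's scoring code. All function names are hypothetical, and line flows and thermal limits are assumed to be given as equal-length lists in the same units.

```python
from typing import Sequence


def line_margin(flow: float, limit: float) -> float:
    """Per-line term max(0, 1 - (flow / limit)^2) used in Eq. (1) and Eq. (2)."""
    return max(0.0, 1.0 - (flow / limit) ** 2)


def step_score(flows: Sequence[float], limits: Sequence[float]) -> float:
    """Eq. (1): sum of the per-line margins at one time step."""
    return sum(line_margin(f, l) for f, l in zip(flows, limits))


def reward(flows: Sequence[float], limits: Sequence[float], game_over: bool) -> float:
    """Eq. (2): -1 on game over, otherwise the mean per-line margin."""
    if game_over:
        return -1.0
    return step_score(flows, limits) / len(flows)


def chronic_score(step_scores: Sequence[float], game_over: bool) -> float:
    """Eq. (1): a scenario scores 0 on game over, else the sum of its step scores."""
    return 0.0 if game_over else sum(step_scores)


def discounted_return(rewards: Sequence[float], gamma: float) -> float:
    """Eq. (3): R_t = sum_k gamma^k * r_{t+k}, accumulated from the last step backwards."""
    ret = 0.0
    for r in reversed(rewards):
        ret = r + gamma * ret
    return ret


# Illustrative values only: three lines loaded at 50%, 100%, and 120%.
flows, limits = [50.0, 100.0, 120.0], [100.0, 100.0, 100.0]
print(step_score(flows, limits))                   # 0.75
print(reward(flows, limits, game_over=False))      # 0.25
print(discounted_return([1.0, 1.0, 1.0], 0.5))     # 1.75
```

Note that a line loaded at or above its thermal limit contributes nothing to the score, so the metric rewards headroom rather than penalizing overloads directly; overloads are penalized through the soft and hard constraints instead.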
B. Problem formulated as MDP
                                                                     where $y_i = \mathbb{E}_{s' \sim \mathcal{E}} [ r + \gamma \max_{a'} Q(s', a'; \theta_{i-1}) \mid s, a ]$, and $\rho$ is
   Maximizing time-series ATCs via topology control or
                                                                     the probability distribution of the state and action pair (s, a).
adjustment can be modeled as an MDP [12], which consists of
                                                                     By differentiating the loss function using Eq. (7) and
performing stochastic gradient descent, weights of the agent                      and outputs. The dueling structure decouples the single stream
can be updated [14].                                                              into a state value stream and an advantage stream. The dueling
                                                                                  DQN also uses three important techniques in DQN, including:
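$$ \nabla_{\theta_i} L_i(\theta_i) = \mathbb{E}_{s, a \sim \rho(\cdot);\, s' \sim \mathcal{E}} \left[ \left( r + \gamma \max_{a'} Q(s', a'; \theta_{i-1}) - Q(s, a; \theta_i) \right) \nabla_{\theta_i} Q(s, a; \theta_i) \right] \qquad (7) $$
                                                                                  (1) an experience replay buffer that allows the agent to be
                                                                                  trained off-policy and decouples the strong correlations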
   Given its advantages, DQN is selected as the fundamental                       between the consecutive training data; (2) importance
                                                                                  sampling is used to increase the algorithm learning efficiency
DRL algorithm in this work to train AI agents for providing
                                                                                  and final policy quality [17], by measuring importance of the
topology control actions. However, overestimation is a well-
                                                                                  data using absolute TD-error and giving important data higher
known and long-standing problem for all Q-learning based                          priority to be sampled from memory buffer during the training
algorithms. To address this issue, Double DQN (DDQN) that                         process; and (3) adoption of a DDQN structure, which fixes
decouples the action selection and action evaluation using two                    the q-targets periodically, and then stabilizes the agent
separate neural networks is proposed in [15]. It demonstrates                     updates. The algorithm for training dueling DQN agents is
good performance in overcoming the overestimation problem                         given in Algorithm I.
and can obtain better results on ATARI 2600 games than other
Q-learning based methods. In addition, a new model
architecture, Dueling DQN is proposed in [16], which
decouples a single-stream DDQN into a state-value stream
and an action-advantage stream, and therefore, the Q-value
can be represented as Eq. (8) [16].
$$ Q(s, a; \theta, \alpha, \beta) = V(s; \theta, \beta) + \left( A(s, a; \theta, \alpha) - \frac{1}{|\mathcal{A}|} \sum_{a'} A(s, a'; \theta, \alpha) \right) \qquad (8) $$
                                                                                  Fig. 2. Architecture of the Dueling DQN.

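The aggregation in Eq. (8) can be illustrated without a neural network. The sketch below assumes the two streams have already produced a scalar state value and a list of per-action advantages; the function name and the numbers are hypothetical and only show how the mean advantage is subtracted before the streams are combined.

```python
from typing import List


def dueling_q_values(state_value: float, advantages: List[float]) -> List[float]:
    """Eq. (8): Q(s, a) = V(s) + (A(s, a) - mean over a' of A(s, a'))."""
    mean_advantage = sum(advantages) / len(advantages)
    return [state_value + (adv - mean_advantage) for adv in advantages]


# Illustrative values only.
q = dueling_q_values(state_value=2.0, advantages=[3.0, 1.0, 2.0])
print(q)  # [3.0, 1.0, 2.0]
```

Because the mean advantage is removed, the average of the resulting Q-values equals the state value, which is what makes the split between the two streams identifiable.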
The stand-alone state value stream is updated at each step of the
training process. The frequently updated state-values and the
biased advantage values allow better approximation of the Q-
values, which is the key in value-based methods. It allows a
more accurate and stable update for the agent. Thus, dueling
DQN is selected as the baseline model in this work to achieve
good control performance.
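To make the difference between these update targets concrete, the sketch below shows the tabular update of Eq. (5), the DQN target used in Eq. (6), and the Double DQN target of [15], in which one network selects the next action and the other evaluates it. It is an illustration in plain Python under stated assumptions, not the authors' training code: Q-values are passed in as lists, all function names are hypothetical, and the `done` flag for terminal transitions is an added assumption.

```python
from typing import List


def q_learning_update(q_table: List[List[float]], s: int, a: int, r: float,
                      s_next: int, alpha: float, gamma: float) -> None:
    """Eq. (5): move Q(s, a) towards r + gamma * max_a' Q(s', a') in place."""
    td_target = r + gamma * max(q_table[s_next])
    q_table[s][a] += alpha * (td_target - q_table[s][a])


def dqn_target(r: float, gamma: float, q_target_next: List[float],
               done: bool = False) -> float:
    """Target y in Eq. (6): the same network both selects and evaluates a'."""
    return r if done else r + gamma * max(q_target_next)


def double_dqn_target(r: float, gamma: float, q_online_next: List[float],
                      q_target_next: List[float], done: bool = False) -> float:
    """Double DQN: the online network selects a', the target network evaluates it."""
    if done:
        return r
    best = max(range(len(q_online_next)), key=lambda i: q_online_next[i])
    return r + gamma * q_target_next[best]


# Illustrative values only.
table = [[0.0, 0.0], [1.0, 2.0]]
q_learning_update(table, s=0, a=1, r=1.0, s_next=1, alpha=0.5, gamma=0.9)
print(table[0][1])                                               # 1.4

q_online, q_target = [1.0, 3.0, 2.0], [5.0, 0.5, 4.0]
print(dqn_target(1.0, 0.9, q_target))                            # 5.5
print(double_dqn_target(1.0, 0.9, q_online, q_target))           # 1.45
```

In the example, the plain DQN target picks up the largest (possibly overestimated) target-network value, while the Double DQN target uses the value of the action preferred by the online network, which is the decoupling that mitigates overestimation.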
              III.    THE PROPOSED METHODOLOGIES
A. Architecture design
    The architecture of training DRL agents for maximizing
ATCs is shown in Fig. 1, where several novel methods are
developed. First, imitation learning is used to generate a good
initial policy for the dueling DQN agent so that exploration
and training time can be greatly reduced; additionally, the
agent is less likely to fall into a local optimum. Second, a
guided exploration method is used to train the agent instead of
the traditional Epsilon-greedy exploration. Third, importance
sampling is used to increase the mini-batch update efficiency                     C. Imitation Learning
[17]. Moreover, an Early Warning (EW) system is designed to                           Imitation learning is essentially a supervised learning
increase the system robustness. Details regarding these                           method that is used to pre-train DRL agents by providing
techniques are discussed in the following subsections.                            good initial policies in the form of neural network weights. A
                                                                                  power grid simulator is used to generate massive data sets,
                                                                                  which are then further processed before being used to train
                                                                                  the DQN agent. This process allows the RL agent to obtain
                                                                                  good Q(s, a) distributions regarding different input states.
                                                                                  The loss function used to train the agent is defined as
                                                                                  weighted Mean-Squared-Error (MSE), in Eq. (9):
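$$ J_\theta = \alpha \times \frac{1}{N} \sum_{i=1}^{N} \left( Q(s, a_i) - \hat{Q}(s, a_i) \right)^{2} + \beta \times \frac{1}{|\mathcal{A}| - N} \sum_{i=N+1}^{|\mathcal{A}|} \left( Q(s, a_i) - \hat{Q}(s, a_i) \right)^{2} \qquad (9) $$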

Fig. 1. Overview of the system architecture.                                      where 𝛼, 𝛽 ∈ [0, 1] , 𝛼 + 𝛽 = 1 , |𝒜| is the size of action
                                                                                  space, and vector 𝐐(𝑠, 𝑎) = [𝑄(𝑠, 𝑎𝑖 ), 𝑖 = 1, … , |𝒜|] is
B. Dueling DQN Agent
                                                                                  sorted in descending order. The loss function 𝐽𝜃 gives a
    The architecture of dueling DQN is given in Fig. 2. The                       higher weight to actions resulting in high scores, which
original structure is adopted with a batch normalization layer                    makes the agent more sensitive to score peaks during the
added to the input layer, and the number of neurons in the                        training process, and therefore helps the agent better extract
hidden layer is modified according to the dimensions of inputs
                                                                                  good actions.
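The weighted loss in Eq. (9) can be sketched for a single state as follows. This is an illustration under assumptions, not the authors' implementation: the labels are taken to be the simulated Q-values that define the descending order, the other vector is taken to be the network prediction, and the values chosen for N, α, and β are hypothetical, since the paper does not report them.

```python
from typing import List


def weighted_mse(q_label: List[float], q_pred: List[float],
                 n_top: int, alpha: float, beta: float) -> float:
    """Eq. (9): MSE over the n_top best-labelled actions weighted by alpha,
    plus MSE over the remaining actions weighted by beta."""
    if not 0 < n_top < len(q_label):
        raise ValueError("n_top must split the action set into two non-empty parts")
    # Sort action indices by label Q-value, highest first.
    order = sorted(range(len(q_label)), key=lambda i: q_label[i], reverse=True)
    top, rest = order[:n_top], order[n_top:]
    top_mse = sum((q_label[i] - q_pred[i]) ** 2 for i in top) / len(top)
    rest_mse = sum((q_label[i] - q_pred[i]) ** 2 for i in rest) / len(rest)
    return alpha * top_mse + beta * rest_mse


# Illustrative values only: four actions, the top two weighted more heavily.
loss = weighted_mse(q_label=[1.0, 4.0, 2.0, 3.0], q_pred=[1.5, 3.0, 2.0, 3.0],
                    n_top=2, alpha=0.8, beta=0.2)
print(round(loss, 6))  # 0.425
```

With α larger than β, an error on a high-scoring action costs more than the same error on a low-scoring one, which matches the stated aim of making the agent sensitive to score peaks.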
D. Guided exploration training method                                      solutions. The framework is developed in Linux, with an
    Imitation learning provides a good initial policy for                  interface designed and provided for Reinforcement Learning.
snapshots, and then DRL is used to train the agent for long-               The RL agents are trained and tuned using Python scripts
term planning capability and to obtain a globally-concerned                through massive interactions with Pypownet. In addition, a
policy. For DRL training in this problem, the traditional                  visualization module is provided for the users to visualize the
Epsilon-greedy exploration method is inefficient. First, the               system operating status and evaluate control actions in real-
action space is very large and the MDP chain is long. Second,              time. Several power system models have been provided in this
the agent easily falls into a local optimum. Thus, a guided                framework with datasets representing realistic time-series
exploration method is developed, where actions with the                    operating conditions. The dataset for the IEEE 14-bus model
𝑁𝑔 highest Q-values are selected at every timestep, and their              contains 1,000 scenarios with data for 28 consecutive days.
performance is simulated and evaluated on the fly.                         Each scenario has 8,065 time steps, each representing a 5-
Then, the action with the highest reward is chosen for                     minute interval. All models and associated datasets can be
implementation and such experience will be stored in the                   directly downloaded from [10].
memory. The guided exploration helps the agent to further                      With the developed environment and framework, the
extract out the good actions. With the help of the action                  IEEE 14-bus system with the supporting dataset is used to test
simulation function, the training process is more stable, and              performance of the proposed DRL agents in autonomous
better experience is stored and used to update the agent. Thus,            network topology control over long time-series scenarios. In
guided-exploration significantly increases the training                    this system, there are a total of 156 different node splitting
efficiency.                                                                actions and 20 line switching actions. Thus, an action space of
                                                                           3,120 is formed by considering null action and all
E. Early warning                                                           combinations of one node splitting and one line switching
    Power systems are highly sensitive to various operating                without those that can create islands. The DRL agents are
conditions, especially with major topology changes. One bad                trained using Python 3.6 scripts on a Linux server with 48
action may have a long-term adverse effect since the system                CPU cores and 128 GB of memory.
topology control is successive in a long period of time. The               B. Effectiveness of imitation learning for generating good
trained DRL agent is not guaranteed to provide a good action                   initial policies
every time at various complex system states. Thus, an adaptive
mechanism, named Early Warning, is developed in this work                      In the first test, a brute-force method is used to train the
which can help the agent determine when to apply action and                agent using randomly initialized neural network weights and
simulate more actions with high 𝑄(𝑠, 𝑎) values to increase the             the full action space with a dimension of 3,120. As expected,
error-tolerance and enhance system robustness, with Fig. 3                 due to the large action space and the long time-sequences, the
illustrating its operation logic.                                          proposed dueling DQN method did not work well. To solve
                                                                           this problem, the action space is reduced to the following
                                                                           subset: (1) 155 node splitting/rejoining actions, (2) 19 line
                                                                           switching actions, (3) the 76 most effective combined actions,
                                                                           each with one bus action and one line switching action, and
                                                                           (4) one do-nothing action. In this way, the action space 𝒜 is
                                                                           reduced to 251 actions. Then, the imitation
Fig. 3. Early Warning (EW) system workflow.                                learning method introduced in Section III. C is used to obtain
                                                                           good initial policies. Forty scenarios, each with 1,000
    Initially, at every timestep, it simulates the result of taking
                                                                           timesteps (instead of 8,065), are used for imitation learning,
no action to the environment, using a warning flag (WF)
                                                                           yielding a total number of 40,000 sample pairs, (state, Q(s,
defined in Eq. (10).
$$ WF = \begin{cases} \text{True}, & \text{if } \exists\, i \in \{1, 2, \dots, 20\}: \dfrac{\mathit{lineflow}_i}{\mathit{thermallimit}_i} > \lambda \\ \text{False}, & \text{otherwise} \end{cases} \qquad (10) $$
                                                                           a)), which are then separated into a training set (90%) and a
                                                                           validation set (10%). Fig. 4 shows a sample prediction and
                                                                           label using IL. After training 100 epochs with a batch size of
If the loading level of a line is higher than a pre-determined             1, the weighted MSE decreased to around 0.05, indicating
threshold 𝜆 , a WF is raised. As a result, the 𝑁𝑔 top-scored               neural networks can generally catch the peaks and trends, and
actions are provided by the agent for further simulation.                  provide relatively effective actions.
Consequently, the best action with the highest score will be
taken. Both guided exploration and the early warning
mechanism improve the performance and robustness of the
proposed RL algorithm.
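The decision logic shared by guided exploration and the Early Warning mechanism can be sketched as follows. This is a simplified illustration, not the released agent: `simulate_score` stands in for the simulator's action-evaluation function, the threshold value is hypothetical because the paper does not report λ, and taking the do-nothing action when no flag is raised is an assumption inferred from the description above.

```python
from typing import Callable, List, Sequence


def warning_flag(flows: Sequence[float], limits: Sequence[float],
                 threshold: float) -> bool:
    """Eq. (10): raise the flag if any line's loading ratio exceeds the threshold."""
    return any(f / l > threshold for f, l in zip(flows, limits))


def top_actions(q_values: Sequence[float], n_g: int) -> List[int]:
    """Indices of the n_g actions with the highest Q-values, best first."""
    ranked = sorted(range(len(q_values)), key=lambda a: q_values[a], reverse=True)
    return ranked[:n_g]


def early_warning_action(q_values: Sequence[float],
                         flows_if_idle: Sequence[float],
                         limits: Sequence[float],
                         threshold: float,
                         simulate_score: Callable[[int], float],
                         n_g: int,
                         do_nothing: int = 0) -> int:
    """Keep the topology unless doing nothing is predicted to load a line above
    the threshold; otherwise simulate the n_g top-ranked actions and return the
    one with the best simulated score."""
    if not warning_flag(flows_if_idle, limits, threshold):
        return do_nothing
    return max(top_actions(q_values, n_g), key=simulate_score)


# Illustrative values only; a real agent would query the grid simulator here.
simulated = {1: 0.2, 3: 0.6}
action = early_warning_action(q_values=[0.1, 0.9, 0.5, 0.7],
                              flows_if_idle=[50.0, 96.0], limits=[100.0, 100.0],
                              threshold=0.95, simulate_score=simulated.get, n_g=2)
print(action)  # 3
```

In the example, the network ranks action 1 first, but the simulated score favors action 3, so the simulation step overrides the raw Q-value ranking. During training, the same top-N_g selection without the flag check corresponds to guided exploration.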
             IV.    CASE STUDIES AND DISCUSSION
A. Environment and framework
                                                                           Fig. 4. Sample DQN prediction after imitation learning (loss function:
    A power grid simulator, Python Power Network                           weighted MSE, optimizer: Adam, learning rate: 1e-3).
(Pypownet) [13], is adopted to represent the environment for
training RL agents, which is built upon the MATPOWER                       C. Improved training performance with guided exploration
open-source tool for power grid simulations. It is able to                     To shorten the MDP chain and decrease the training
emulate a large-scale power grid under various operating                   difficulty, the 28-day scenarios are divided into single days,
conditions and supports both AC and DC power flow                          each with 288 timesteps. For comparison, the training processes
of dueling DQN agents with Epsilon-greedy exploration and                     Future work will focus on further improving the
the proposed guided exploration are depicted in Fig. 5(a) and             performance of RL agents, which will be tested on larger
Fig. 5(b), respectively.                                                  power system models. The developed methodologies will
    With Epsilon-greedy exploration, the agent can hardly                 also be merged into an AI-based platform developed by the
control the entire 288 timesteps continuously before Episode              team, Grid Mind [1]-[2], for autonomous grid operation and
7,000, without game over, although the agent’s performance                control.
keeps improving towards higher reward values (defined in                    TABLE I. PERFORMANCE COMPARISON OF DIFFERENT AGENTS ON 200
Eq. (2)). The proposed training process using guided                                    UNSEEN SCENARIOS WITH 288 TIME STEPS
exploration with 𝑁𝑔 =10 is shown in Fig. 5(b). The agent can
control more steps successfully in the earlier phases of the
training process compared to Epsilon-greedy exploration.
More importantly, it takes a much shorter time to train an
agent with a better policy.


                                                                                                       REFERENCES
                                                                          [1]  J. Duan, D. Shi, R. Diao, et al., “Deep-Reinforcement-Learning-Based
Fig. 5(a). Dueling DQN agent training process using epsilon-greedy             Autonomous Voltage Control for Power Grid Operations,” IEEE trans.
exploration.                                                                   Power Syst., Early Access, 2019.
                                                                          [2] R. Diao, Z. Wang, D. Shi, et al., “Autonomous Voltage Control for
                                                                               Grid Operation Using Deep Reinforcement Learning,” IEEE PES
                                                                               General Meeting, Atlanta, GA, USA, 2019.
                                                                          [3] H. Glavitsch, “Switching as means of control in the power system,”
                                                                               International Journal of Electrical Power & Energy Systems, vol. 7,
                                                                               no. 2, pp. 92-100, 1985.
                                                                          [4] A. A. Mazi, B. F. Wollenberg, M. H. Hesse, “Corrective control of
                                                                               power system flows by line and bus-bar switching,” IEEE trans. Power
                                                                               Syst., vol. 1, no. 3, pp. 258-264, 1986.
Fig. 5(b). Dueling DQN agent training process using guided exploration.
                                                                          [5] E. B. Fisher, R. P. O'Neill, M. C. Ferris, “Optimal transmission
D. Testing and performance comparison of different agents                      switching,” IEEE trans. Power Syst., vol. 23, no. 3, pp. 1346-1355,
                                                                               2008.
    With the proposed methodology, several case studies are               [6] A. Khodaei, and M. Shahidehpour, “Transmission switching in
conducted with their performance compared in Table I. It is                    security-constrained unit commitment,” IEEE trans. Power Syst., vol.
observed that the agent trained only with IL failed for most                   25, no. 4, pp. 1937-1945, 2010.
scenarios. With guided exploration, the agent’s performance               [7] J. D. Fuller, R. Ramasra, and A. Cha, “Fast heuristics for transmission-
                                                                               line switching,” IEEE Trans. Power Syst., vol. 27, no. 3, pp. 1377-
is greatly improved, where only 7 out of 200 scenarios failed.                 1386, 2012.
Using EW (with threshold 𝜆 ranging from 0.85 to 0.975), the               [8] P. Dehghanian, Y. Wang, G. Gurrala, et al., “Flexible implementation
agent can almost handle all the scenarios well with very few                   of power system corrective topology control,” Electric Power Syst.
cases failed; and the scores are much improved. Similarly,                     Research, vol. 128, pp. 79-89, 2015.
200 long scenarios with 5,184 time steps are tested using                 [9] M. Alhazmi, P. Dehghanian, S. Wang, et al., “Power grid optimal
DRL agents, where the best score achieved is 82,687.17,                        topology control considering correlations of system uncertainties,”
                                                                               IEEE Tran. Ind Appl., Early Access, 2019.
using an EW threshold of 0.93. Only 12 scenarios out of 200
                                                                          [10] RTE France, ChaLearn, L2RPN Challenge. [Online]. Available:
experienced bad control performance. Finally, a well-trained                   https://l2rpn.chalearn.org/
agent was submitted to the L2RPN competition with EW                      [11] D. Shi, T. Lan, J. Duan, et al., “Learning to Run a Power Network
𝜆 =0.885, which was automatically tested using 10 unseen                       through AI,” slides presentated at the 2019 PSERC Summer Workshop.
scenarios by the host of the competition, outperformed the                     [Online]. Available: https://geirina.net/assets/pdf/2019-PSERC_L2RP
other participants, and eventually won the competition. The                    N%20Presentation.pdf
average decision time for each time step using the proposed               [12] R. S. Sutton, A. G. Barto, Introduction to reinforcement learning. MIT
                                                                               press Cambridge, vol. 2, no. 4, 1998.
agent is roughly 50 ms. The corresponding code and DRL
                                                                          [13] M. Lerousseau, A power network simulator with a Reinforcement
models are open-sourced, which can be found in [18].                           Learning-focused usage. [Online]. Available: https://github.com/Marvi
                                                                               nLer/pypownet
                          V.     CONCLUSION                               [14] V. Mnih, K. Kavukcuoglu, D. Silver, et al., “Playing atari with deep
    This paper presents a novel AI-based method to maximize                    reinforcement learning,” arXiv preprint arXiv:1312.5602, 2013.
time-series ATCs considering various practical constraints.               [15] H. Van Hasselt, A. Guez, and D. Silver, “Deep reinforcement learning
                                                                               with double q-learning,” in 30th AAAI Conference on Artificial
Several innovative techniques are developed including                          Intelligence, 2016.
dueling DQN, imitation learning for generating good initial               [16] Z. Wang, T. Schaul, M. Hessel, et al., “Dueling network architectures
policies, reduction of action space via simulation and domain                  for deep reinforcement learning,” arXiv preprint arXiv:1511.06581,
knowledge, guided exploration and EW for improving DRL                         2015.
agent’s stability and robustness. Massive experiments                     [17] T. Schaul, J. Quan, I. Antonoglou, et al., “Prioritized experience
demonstrate a well-trained AI agent can learn and master the                   replay,” arXiv preprint arXiv:1511.05952, 2015.
optimal topology control problem for a power grid                         [18] GEIRINA, CodaLab L2RPN: Learning to Run a Power Network.
                                                                               [Online]. Available: https://github.com/shidi1985/L2RPN.
considering various uncertainties and practical constraints.
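
    As an illustration of the action-selection schemes compared in the
case studies, the following is a minimal sketch and not the released
implementation in [18]. It assumes that guided exploration ranks all
actions by Q-value, evaluates the 𝑁𝑔 highest-ranked ones in a simulator,
and executes the one with the best simulated reward, and that the EW
mechanism keeps the current topology unless the maximum line loading
ratio exceeds the threshold 𝜆. The function names, the `simulate`
callback, the `max_loading` input, and the index of the do-nothing
action are all hypothetical.

```python
import random
from typing import Callable, Sequence


def epsilon_greedy_action(q_values: Sequence[float], epsilon: float,
                          rng: random.Random) -> int:
    """Baseline: pick a random action with probability epsilon, else argmax Q."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


def guided_action(q_values: Sequence[float],
                  simulate: Callable[[int], float],
                  n_g: int = 10) -> int:
    """Assumed guided exploration: simulate the n_g highest-Q actions and
    return the one whose simulated reward is largest."""
    ranked = sorted(range(len(q_values)),
                    key=lambda a: q_values[a], reverse=True)
    candidates = ranked[:n_g]
    return max(candidates, key=simulate)


def early_warning_action(q_values: Sequence[float],
                         simulate: Callable[[int], float],
                         max_loading: float,
                         lam: float = 0.885,
                         n_g: int = 10,
                         do_nothing: int = 0) -> int:
    """Assumed EW gate: leave the topology unchanged while the maximum line
    loading ratio stays at or below lam; otherwise fall back to guided search."""
    if max_loading <= lam:
        return do_nothing
    return guided_action(q_values, simulate, n_g)


# Toy data (hypothetical): action 1 has the highest Q-value but a poor
# simulated reward, so guided selection prefers action 2.
q = [0.1, 0.9, 0.8, 0.2]
simulated_reward = [0.0, -1.0, 5.0, 9.0]
simulate = simulated_reward.__getitem__

greedy = epsilon_greedy_action(q, epsilon=0.0, rng=random.Random(0))
guided = guided_action(q, simulate, n_g=2)
calm = early_warning_action(q, simulate, max_loading=0.80, n_g=2)
stressed = early_warning_action(q, simulate, max_loading=0.95, n_g=2)
print(greedy, guided, calm, stressed)
```

    In this toy setting the greedy choice follows the Q-values alone,
whereas the guided choice is checked against the simulator before it is
executed. Under these assumptions, this is one way to read the reported
increase in successfully controlled steps early in training. The EW gate
additionally avoids unnecessary topology changes when no line is close
to its limit.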