---
title: "Active Power Correction Strategies Based on Deep Reinforcement Learning--Part II: A Distributed Solution for Adaptability"
source_url: "https://doi.org/10.17775/CSEEJPES.2020.07070"
source: "CSEE Journal PDF"
---

# Active Power Correction Strategies Based on Deep Reinforcement Learning--Part II: A Distributed Solution for Adaptability

1134                                                                         CSEE JOURNAL OF POWER AND ENERGY SYSTEMS, VOL. 8, NO. 4, JULY 2022


Siyuan Chen, Student Member, IEEE, Jiajun Duan, Member, IEEE, Yuyang Bai, Jun Zhang, Senior Member, IEEE, Di Shi, Senior Member, IEEE, Zhiwei Wang, Senior Member, IEEE, Xuzhu Dong, Senior Member, IEEE, and Yuanzhang Sun, Senior Member, IEEE


Abstract—This article is the second part of Active Power Correction Strategies Based on Deep Reinforcement Learning. In Part II, we consider the renewable energy scenarios plugged into the large-scale power grid and provide an adaptive algorithmic implementation to maintain power grid stability. Based on the robustness method in Part I, a distributed deep reinforcement learning method is proposed to overcome the influence of the increasing renewable energy penetration. A multi-agent system is implemented in multiple control areas of the power system, which conducts a fully cooperative stochastic game. Based on the Monte Carlo tree search mentioned in Part I, we select practical actions in each sub-control area to search the Nash equilibrium of the game. Based on the QMIX method, a structure of offline centralized training and online distributed execution is proposed to employ better practical actions in the active power correction control. Our proposed method is evaluated in the modified global competition scenario cases of “2020 Learning to Run a Power Network - Neurips Track 2”.

Index Terms—Active power correction strategies, distributed deep reinforcement learning, Nash equilibrium, renewable energies, stochastic game.

## NOMENCLATURE

### A. Functions

- $C_{\mathrm{loss}}(\cdot)$: Function of power loss cost.
- $C_{\mathrm{redis}}(\cdot)$: Function of generation redispatch cost.
- $C_{\mathrm{bkt}}(\cdot)$: Function of blackout cost.
- $R(\cdot)$: Function of cumulative reward.
- $r_{i,t}(\cdot)$: Immediate reward function of agent $i$ at time $t$.
- $g_t(\cdot)$: Performance function of agent $i$ at time $t$.
- $V_i(\cdot)$: Function of state value.
- $p(\cdot)$: Function of state transition probability.
- $Q_i(\cdot)$: Q function of agent $i$.
- $Q_{\mathrm{tot}}(\cdot)$: Joint Q function of multi-agents.
- $L(\cdot)$: Loss function of Q network.

### B. Variables

- $T_e$: Duration of the blackout.
- $r_l$: Resistance of line $l$.
- $y_l$: Active power of line $l$.
- $P_{G_i,t}$: Active power output of generator $i$ at time slot $t$.
- $P_{j,t}$: Energy loss in the presence of blackout.
- $H_l$: The binary variable of BBS action.
- $\pi_{i,k}$: Strategy of agent $i$.
- $\theta, \hat{\theta}$: Parameters of Q network.
- $\tau_i$: Action-observation history of agent $i$.
- $a_k$: Joint action of multi-agents.
- $\pi_k$: Joint strategy of multi-agents.
- $\pi_k^*$: Optimal joint strategy of multi-agents.
- $\boldsymbol{\tau}$: Joint action-observation history of multi-agents.
- $\boldsymbol{\theta}$: Network parameters of multi-agents.

### C. Sets

- $N_l$: Set of lines.
- $N_{\mathrm{Gen}}$: Set of generators.
- $N_{\mathrm{Load}}$: Set of loads.
- $A_{\mathrm{ag}}$: Set of multi-agents.
- $S$: State space of multi-agents.
- $O_i$: Partial observation space of agent $i$.
- $A_i$: Action space of agent $i$.
- $T$: Operation duration of power system.

### D. Constants

- $N_l^{\max}$: Maximum number of allowable lines switching.
- $\rho_l$: Usage of line $l$ capacity.
- $\varphi$: Penalty coefficient of heavy load.
- $\psi$: Penalty coefficient of overload.
- $\kappa$: Penalty term of blackout in reward function.
- $\zeta$: Coefficient of soft update.
- $\varepsilon$: Greedy parameter.

Manuscript received December 30, 2020; revised February 20, 2021; accepted April 14, 2021. Date of online publication September 10, 2021; date of current version November 11, 2021. This work was supported by the National Key R&D Program of China under Grant 2018AAA0101502.

S. Y. Chen, Y. Y. Bai, J. Zhang (corresponding author, e-mail: jun.zhang.ee@whu.edu.cn; ORCID: 0000-0001-6908-2671), X. Z. Dong, and Y. Z. Sun are with the School of Electrical Engineering and Automation, Wuhan University, Wuhan 430072, China.

J. J. Duan, D. Shi, and Z. W. Wang are with GEIRI North America, San Jose, CA 95134, USA.

DOI: 10.17775/CSEEJPES.2020.07070

2096-0042 © 2020 CSEE

## I. INTRODUCTION

Large-scale plug-in distributed renewable energy resources (RESs) are gradually becoming a significant feature of a power grid. The randomness and volatility of


generation from RESs may overload the capacity of transmission lines when the power system has on-peak demand or scheduled maintenance. Considering that the uncertainty of the power system increases with the penetration rate of RESs, active power correction control (APCC) is a primary method to maintain the stability of the power system’s active power. For APCC strategies, a large amount of research focuses on generation redispatch, load shedding, or demand response. Several studies have exploited a less costly method with great potential, namely grid topology reconfiguration. In [1], simulations were conducted to analyze more than 1.5 million kinds of faults based on the actual cases of the American power grid, and results demonstrated that topology change, e.g., bus bar switching, transmission switching, etc., can effectively eliminate or alleviate the transmission line overload in most fault scenarios.

The expanding scale of the power grid and the widespread application of power electronic devices have continuously increased the degree of dimensionality and non-linearity of the power grids. Power grids are increasingly difficult to model accurately using traditional mathematical and physical mechanisms; thus, they provide an opportunity for applications involving artificial intelligence technology in resolving dispatching and control problems of the power system. Deep reinforcement learning (DRL) combines reinforcement learning (RL) and deep learning technologies, which learn directly from the agents’ interactions with an environment. In [2], a Q-learning-based method is proposed for optimal reactive voltage control, which solves the convergence problem of traditional reactive power optimization for non-linear integer programming models. In [3] and [4], a DRL method is applied to the scenarios of voltage instability, in which low-voltage load shedding strategies and shunt capacitor switching schemes are constructed. The effectiveness of the proposed method is verified through several unknown fault scenarios. In [5], a voltage control strategy based on Deep Q-Network (DQN) and Deep Deterministic Policy Gradient (DDPG) algorithms is applied for automatic voltage control (AVC). It ensures the voltage of each bus stays within the standard range, which is based on the information collected by the SCADA or PMUs. At present, DRL has gained extensive attention and remarkable achievements in the fields of games [6], medical treatment [7], autonomous driving [8], unmanned aerial vehicles [9], smart grid [10], [11], and other fields. However, it is difficult for DRL to perform well in high-dimensional action spaces, i.e., those which generally involve more than $10^4$ kinds of actions. “The curse of dimensionality,” caused by the complexity of a power system, has become one of the biggest obstacles to practical applications of DRL in this field.

As an extension technology of DRL, distributed deep reinforcement learning (DDRL) has been applied in multi-agent systems (MAS), and this has effectively solved the problem of high-dimensional action space faced by DRL [12]. DDRL deploys multiple agents on buses of the large-scale power grid and divides the power grid into multiple sub-control areas. Due to the interaction and collaboration between multiple agents, the dimension of each agent’s action space can be reduced to an acceptable level. In [13], a multi-agent reinforcement learning framework based on semantic information, developed by Didi Chuxing, is presented, which can support both DQN and advantage actor-critic (A2C) algorithms to solve the problem of large-scale vehicle scheduling and management. In [14], by combining independent Q-learning (IQL) with DQN, a DDRL framework is proposed in which each agent has an independent Q network. However, since the IQL algorithm does not conduct interactions among multiple agents, the environment for each agent is unknown and non-stationary. This violates the principle of the Markov decision process (MDP), so the convergence of the algorithm cannot be proven. In [15], considering the dynamic instability of the environment caused by the IQL algorithm, a value-decomposition network (VDN) algorithm is proposed to obtain the joint action-value function by summing the Q-value of each agent. Numerical results demonstrate that the VDN algorithm can improve the convergence of the DDRL. In [16], based on VDN, a QMIX algorithm is proposed by constructing a mixing network to integrate local value functions. Global information is added to improve the network’s ability to fit Q values in the training process. However, DDRL has not been studied and applied in the field of power systems.

In this paper, we propose a DDRL framework for joint control strategies of the APCC problem in large-scale power systems. Specifically, an APCC model with topology reconfiguration actions is established to formulate the fundamental problem. The fully cooperative stochastic game is then utilized to model the interactions between active power controllers (APC). A model-free method is adopted to search for the Nash equilibrium (NE) of the game. Considering that the control issue of the APCC problem is discrete, a QMIX method is adopted, which is modified from [16]. Based on the characteristic of the QMIX method, a structure that comprises offline centralized training and online distributed execution is proposed to satisfy the practical application requirements of large-scale power systems. Our method is verified in an open-source platform with relevant scenarios and cases.

The rest of this paper is organized as follows: The formulation of the APCC problem and the stochastic game is described in Section II. The method based on DDRL used to search for the NE is illustrated in Section III. Case studies in Section IV verify the performance of the proposed method. The conclusions and suggestions for future work are given in Section V.

## II. PROBLEM FORMULATION

### A. APCC Model

Active power correction control (APCC) of power systems is a non-linear mixed-integer programming problem usually solved by sensitivity and optimization programming methods. Generally, the APCC is achieved by generation redispatch and load shedding with limited effects on power flow control. Bus bar switching (BBS), which can switch the elements from one bus bar to another, is an effective means of correcting power flow quickly [17]. Considering that the influence of RESs requires rapid response to prevent branch power flow from going off-limit, this paper adopts the


optimization programming method of BBS. The mathematical model of the correction control is as follows:

$$
\min_{y_l,\,P_{G_i,t}} \; \sum_{t=1}^{t_{\mathrm{end}}} \bigl(C_{\mathrm{loss}}(t) + C_{\mathrm{redis}}(t)\bigr) + \sum_{t=t_{\mathrm{end}}}^{T_e} C_{\mathrm{bkt}}(t) \tag{1}
$$

$$
C_{\mathrm{loss}}(t) = \sum_{l \in N_l} r_l \, y_l^2(t) \tag{2}
$$

$$
C_{\mathrm{redis}}(t) = \alpha \sum_{i \in N_{\mathrm{Gen}}} \left| P_{G_i,t-1} - P_{G_i,t} \right| \tag{3}
$$

$$
C_{\mathrm{bkt}}(t) = \sum_{j \in N_{\mathrm{Load}}} \beta P_{j,t} \tag{4}
$$

where $C_{\mathrm{loss}}(t)$, $C_{\mathrm{redis}}(t)$, and $C_{\mathrm{bkt}}(t)$ are power loss cost, generation redispatch cost, and blackout cost, respectively; $t_{\mathrm{end}}$ is the time when the power system blacks out; $T_e$ is the duration of the blackout; $N_l$, $N_{\mathrm{Gen}}$, and $N_{\mathrm{Load}}$ are the sets of lines, generators, and loads, respectively; $r_l$ and $y_l$ are the resistance and active power of line $l$, respectively; $P_{G_i,t}$ is the active power output of generator $i$ at time slot $t$; $P_{j,t}$ is the energy loss in the presence of blackout.

In this paper, we focus on applying the topology actions of BBS, which is less costly than generation redispatch. A binary variable $H_l$ is used to denote BBS action, which is 1 when line $l$ is switched. Considering the stability of the power system, the number of BBS actions at any one time should be restricted, which is

$$
\sum_{l \in N_l} (1 - H_l) \le N_l^{\max} \tag{5}
$$

where $N_l^{\max}$ is the maximum number of lines allowable for switching.

### B. Stochastic Game and Nash Equilibrium

Limited by the action space dimension and the observation space size of a large-scale power system, a single agent cannot handle all the functionalities and management schemes in practical applications. A multi-agent system is essential to deal with “the curse of dimensionality” and to implement a cooperative control and management scheme in large-scale power grids. In the framework of a DDRL, each agent follows the basic learning paradigm of reinforcement learning, which needs to consider both its exploration and the impact of other agents’ strategies on the environment. The interaction among agents can be captured in the form of a cooperative stochastic game. The main components of the game include:

- Agent: APCs in the set $A_{\mathrm{ag}}$.
- State: the states $s_t$ of the power system include active power outputs of generators, usage of lines capacity, electrical quantities, etc. $s_t \in S$, where $S$ is the state space.
- Observation: partial observation $o_{i,t}$ based on the functionality of agent $i$. $o_{i,t} \in O_i$, where $O_i$ is the partial observation space of agent $i$.
- Action: BBS actions $a_{i,t}$ of each agent or do nothing. $a_{i,t} \in A_i$, where $A_i$ is the action space of agent $i$.
- Reward: immediate reward $r_{i,t}$.

At the beginning of the time slot $t \in T$, where $T$ is the operation duration, RESs are connected randomly to the power grid, which may cause overloads of transmission lines. If the usage ratios of lines are controlled within a specific range, the power losses are reduced, and the power system is expected to survive longer. Hence, based on (1)–(4), the performance at time $t$ can be formulated as follows [18]:

$$
g_t = \sum_{l \in N_l} \bigl( \max(0,\, 1 - \rho_l^2) - \varphi \cdot \max(0,\, \rho_l - 0.9) - \psi \cdot \max(0,\, \rho_l - 1) \bigr) \tag{6}
$$

where $\rho_l$ is the capacity usage of line $l$, and $\varphi$ and $\psi$ are the penalty coefficients of heavy load and overload, respectively.

Each APC obtains a cumulative reward when the power system maintains its normal operation, and gets a much larger negative reward if a power system blackout occurs. Thus, the APCs perform a cooperative stochastic game to achieve a Nash equilibrium based on their observations. The immediate reward and cumulative reward of APC $i$ at time $k$ can be formulated as

$$
r_{i,t} =
\begin{cases}
\kappa & \text{if blackout} \\
\sum_{i=0}^{t} g_t & \text{otherwise}
\end{cases} \tag{7}
$$

$$
R(s_k) = \sum_{t=k}^{T} \gamma_i^{\,t-k} \, r_{i,t} \tag{8}
$$

where $\kappa$ is a negative constant; $\gamma_i$ is the discount factor of APC $i$; and $s_k$ is the state at time $k$.

The APCs aim to maintain the normal operation of the power system, i.e., to maximize the predicted reward. The cumulative reward of a state cannot be calculated until the game is over. Thus, a value function is introduced to evaluate the potential future reward of states, which is:

$$
\begin{aligned}
V_i(s_k) &= \mathbb{E}\bigl[ R(s_k) \mid S_t = s_k \bigr] \\
&= \mathbb{E}\bigl[ r_{i,k} + \gamma V_i(s_{k+1}) \mid S_t = s_k,\, A_t = a_k \bigr] \\
&= \sum_{s_{k+1},\, r_{i,k}} p(s_{k+1}, r_{i,k} \mid s_k, a_k) \bigl[ r_{i,k} + \gamma V_i(s_{k+1}) \bigr]
\end{aligned} \tag{9}
$$

where $a_k = [a_{1,k}, \cdots, a_{i,k}, \cdots, a_{N,k}]^T$ is the joint action, and $p(s_{k+1}, r_{i,k} \mid s_k, a_k)$ is the state transition probability, that is, $p : s_k \times a_k \times s_{k+1} \to [0, 1]$. Equation (9) demonstrates that the stochastic game has Markov properties. $\pi_{i,k} : s_k \to a_{i,k}$ denotes the strategy of APC $i$, and the joint strategy of the APCs can be described as $\pi_k = [\pi_{1,k}, \cdots, \pi_{i,k}, \cdots, \pi_{N,k}]^T$. The value function is related to the APCs’ strategies, which can be denoted by $V_i(s_k, \pi_k)$.

The solution of the stochastic game is the NE, which is a state of the game where no agent can benefit by unilaterally changing strategies. Assuming that the NE solution of the APCC problem is denoted by $\pi_k^* = [\pi_{1,k}^*, \cdots, \pi_{i,k}^*, \cdots, \pi_{N,k}^*]^T$, the optimality of the NE solution can be described as:

$$
V_i(s_k, \pi_k^*) \ge V_i(s_k, \pi_{k,-i}^*), \quad \forall i \in N \tag{10}
$$

where $\pi_{k,-i}^* = [\pi_{1,k}^*, \cdots, \pi_{i,k}, \cdots, \pi_{N,k}^*]^T$ are the optimal strategies of APCs excluding the APC $i$.

In the stochastic game of the APCC, we solve the optimization problem of the Q function to obtain the NE solution based


on the Bellman equation, which is:                                      where θ and θ̂ are the parameters of the main Q network and
                           X                                            the target Q network, respectively, and ζ is the coefficient of
       max Qi (sk , ak ) =       p(sk+1 , ri,k |sk , ak )
         πk                                                             the soft update.
                              sk+1 ∈S

                             [ri,k + γVi (sk+1 )]               (11)    B. QMIX Method
  Therefore, we need to search for the maximum Q value to                  One main issue with DDRL is how to effectively learn
obtain the NE solution πk∗ . To address this issue, some model-         the function and fit a proper approximation function when
based methods in previous literature attempted to solve for the         the parameters of the joint action-value function increase
actions of APCs; however, the performance of these methods              exponentially with the number of agents. Considering the
depends upon the accuracy of the models. Hence, a model-free            complexity of the environment and the uncertainty of inter-
method, i.e., DQN, is adopted in this paper to search for the           agent communication, the problem of DDRL is aimed at a
NE solution.                                                            decentralized, partially observable Markov decision process
                                                                        (Dec-POMDP) [20].
                 III. P ROPOSED DDRL F RAMEWORK                            QMIX is a monotonic value function factorization algorithm
A. Deep Recurrent Q Netowrk                                             for DDRL, which maximizes the joint action-value function
                                                                        in a Dec-POMDP. This approach utilizes a hybrid network
   In the APCC stochastic game, the exploration of each APC
                                                                        to combine the local value functions of agents and use
is a partial observation Markov decision process (POMDP). In
                                                                        global states information in the training process, to improve
this paper, the Deep Recurrent Q-Network (DRQN) algorithm
                                                                        the performance of the DQN algorithm. QMIX adopts the
is proposed to solve the problem with partial observation.
                                                                        “centralized training and distributed execution” framework,
Based on the basic structure of the DQN, DRQN replaces the
                                                                        which uses global states in training and partial observations in
first post-convolutional fully-connected layer with a recurrent
                                                                        execution.
long short-term memory (LSTM) network [19]. In the training
                                                                           Assuming that, at time slot t, agent i has partial observation
process, the convolutional layer and LSTM layer are updated
                                                                        oi,t , action ai,t , and the global states of the system is st . Then,
together. The Q value obtained by partial observation ot can
                                                                        τ = [τ1 , · · · , τi , · · · , τn ] is the joint action-observation history,
be much closer to the real Q value, which is Q(ot , at ; θ) →
                                                                        τi is the action-observation history of a single agent, a =
Q(st , at ; θ).
                                                                        [a1 , · · · , ai , · · · , an ] is the joint action of multi-agents. πi (τi )
   At time-step t, after obtaining a state st from the en-
                                                                        and Qi (τi , ai ; θi ) are the strategy and Q functions of agent
vironment, the agent estimates the Q-value through a fully
                                                                        i, respectively. Qtot (Q1 , · · · , Qn ) is the joint Q function of
connected neural network and chooses actions corresponding
                                                                        multi-agents. It can be seen that Qi (τi , ai ; θi ) is related to τi ,
to the maximum Q-value. DRQN agents get a reward rt and
                                                                        not to global state information st .
the state st+1 at the next time step. Then, the experience
                                                                           The structure of the QMIX network is shown in Fig. 1. The
e(st , at , rt , st+1 ) is stored in a dataset named “replay buffer,”
                                                                        mixing network uses positive weights W1 and W2 to satisfy
which helps the DRQN eliminate the relationship between
                                                                        the monotonicity constraint, which is:
independent identically distributed training datasets.
   Considering that the Q-learning algorithm may overestimate                                 ∂Qtot
                                                                                                    ≥ 0, ∀i ∈ {1, 2, · · · , n}                          (14)
action values under certain conditions, a double Q network                                    ∂Qi
is applied to fit the Q function better. In the agent training             As the Q functions of agents have the same monotonicity in
process, the main network Q is primarily used to find the               the QMIX network, the maximization of the joint Q function
action at+1 with the maximum Q value at the next time step,
and then a target network Q̂, which has the same structure
as the Q-network, is adopted to estimate the Q-value of the
action $a_{t+1}$. Since the target Q-value of the action $a_{t+1}$ may
not be the maximum in $\hat{Q}$, this procedure can effectively avoid
overestimating suboptimal actions. The weights of the main
Q-network are copied to the target network at a regular time
step.
   After drawing from the replay buffer, the main Q-network
is trained by minimizing a sequence of loss functions:

$$L(\theta) = \mathbb{E}_{(s_t, a_t) \sim p}\left[\left(y_i - Q(s_t, a_t; \theta, \alpha, \beta)\right)^2\right] \tag{12}$$

where $y_i$ is the target Q-value for iteration $i$ computed by the
target network $\hat{Q}$, and $p$ is the probability distribution of the
state-action pair $(s_t, a_t)$. The DRQN network weights can be
updated by stochastic gradient descent using the gradient of the
loss function $L(\theta)$. To keep the algorithm from falling into
a local optimum, a soft update method is adopted:

$$\hat{\theta} \leftarrow \zeta\theta + (1 - \zeta)\hat{\theta} \tag{13}$$

Fig. 1.   The structure of the QMIX network. [Diagram, left: a mixing network with linear layers $W_1$ and $W_2$ takes the agent values $Q_1(\tau_1, a_{1,t}), \ldots, Q_n(\tau_n, a_{n,t})$ and the state $s_t$ and outputs $Q_{tot}(\tau, a)$. Right: each agent network passes $(o_{i,t}, a_{i,t-1})$ through an MLP, a GRU with hidden state $h_{t-1} \to h_t$, and a second MLP to produce $Q_i(\tau_i, \cdot)$, from which the policy $\pi$ with parameter $\varepsilon$ yields $Q_i(\tau_i, a_{i,t})$.]
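As a rough illustration of (12) and (13), the following sketch computes the squared loss over a sampled batch and applies the soft target update to a flat list of weights. The function names, the list-based weight representation, and the value ζ = 0.01 are assumptions made for this example and are not settings taken from the paper.

```python
def td_loss(targets, q_values):
    """Mean squared error between targets y_i and Q(s_t, a_t; theta), as in (12)."""
    if len(targets) != len(q_values):
        raise ValueError("targets and q_values must have the same length")
    return sum((y - q) ** 2 for y, q in zip(targets, q_values)) / len(targets)


def soft_update(theta, theta_hat, zeta):
    """Soft update (13): theta_hat <- zeta * theta + (1 - zeta) * theta_hat."""
    return [zeta * w + (1.0 - zeta) * w_hat for w, w_hat in zip(theta, theta_hat)]


# Hypothetical toy values: two main-network weights and two target weights.
theta = [1.0, 0.0]
theta_hat = [0.0, 1.0]
theta_hat = soft_update(theta, theta_hat, zeta=0.01)  # assumed zeta

# Hypothetical batch of two targets and two predicted Q-values.
loss = td_loss([1.0, 2.0], [0.5, 2.5])
print(theta_hat, loss)
```

With a small ζ the target weights move only slightly toward the main network at each step, which is the smoothing effect the soft update is meant to provide.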


is equivalent to the maximization of each local Q function.
Therefore, we can obtain the optimal joint strategy by:

$$\arg\max_{\theta} Q_{tot}(\tau, a; \theta) = \begin{pmatrix} \arg\max_{\theta_1} Q_1(\tau_1, a_1; \theta_1) \\ \vdots \\ \arg\max_{\theta_n} Q_n(\tau_n, a_n; \theta_n) \end{pmatrix} \tag{15}$$

   As shown in Fig. 1, each agent adopts DRQN to fit its Q
function $Q_i(\tau_i, a_i; \theta_i)$. Considering that the inputs of the APCC
problem are numerical values, we replaced the two convolutional layers
with two fully connected layers in DRQN. Besides, we chose the
Gated Recurrent Unit (GRU) rather than LSTM
in the recurrent part since GRU is less computationally
expensive. DRQN uses the observation $o_{i,t}$ and the action
$a_{i,t-1}$ as input to calculate the Q value, and the action is selected through an
ε-greedy algorithm.
   Finally, the loss function of QMIX is presented as:

$$L(\theta) = \sum_{i=1}^{n} \mathbb{E}_{(s,a) \sim p}\left[\left(y_i^{tot} - Q_{tot}(\tau, a, s; \theta)\right)^2\right] \tag{16}$$

where $y_i^{tot} = r_i + \gamma_i \max_{a'} \hat{Q}(\tau', a', s'; \hat{\theta})$ and $\theta = [\theta_1, \cdots, \theta_n]$.

C. Centralized Training Algorithm of DDRL
   Based on the QMIX method introduced in Section III-B,
we propose a training structure of DDRL to obtain the NE of
the APCC stochastic game. Before starting the training, the
networks are initialized with random weights, and a replay
buffer is established. In each episode, APCs obtain their partial
observations from the system state at the beginning. Before
APCs take their actions, we perform “do-nothing” actions for
each agent in the simulation system, which can predict the
next observation provided by the Grid2Op platform [21]. This
procedure checks whether the system is in danger. We
define the danger status as the case in which the line capacity usage
of any line is above 0.95, or there is a system failure. If the
system is not in danger, APCs still take “do-nothing” actions.
When the danger flag is true in the simulation system, APCs
explore their actions in their action spaces.
   Considering that the action space of each APC is
large, we should guide the exploration of each APC to
improve the algorithm performance. Before exploring, each
APC explores all actions if there are any disconnected lines in
its control region. Otherwise, it only explores the $N_k$ actions with the
highest Q-values. In short, these $N_k$ actions
are called the “top $N_k$ actions.” Among the “top $N_k$ actions,” the
action with the best simulation reward is chosen. Another set
of $N_{kb}$ actions, the next highest ranked after the “top $N_k$ actions,”
is explored if there are no valid or effective actions.
Furthermore, we choose an action with the best reward in
historical actions when the APC cannot explore a feasible
action.

Algorithm 1: Distributed Active Power Correction Control (DAPCC)
  Initialize the DRQN network for each APC with random weights $\theta_i$, initialize the mixing network with random weights $W_1$ and $W_2$, initialize the Replay Buffer $\Delta$
  for episode = 1 to I do
    Reset the environment
    for t = 1 to T do
      Obtain the partial observations $o_{1,t}, \cdots, o_{i,t}, \cdots, o_{N,t}$ of each agent from state $s_t$, check the danger flag (True for overload or system failure, False for normal state)
      if danger flag is True then
        for i = 1 to N do
          Choose the feasible action with the best simulation reward in $N_k$ in the condition of the gameover flag in simulation is True (True for overload or system failure, False for normal state)
          if feasible action is None then
            Choose the feasible action with the best simulation reward in $N_{kb}$ in the condition of the gameover flag in simulation is True
            if feasible action is None then
              Choose an action with the best reward in historical actions
            end if
          end if
        end for
        Validate the actions in the simulation system and combine each agent’s action as the output action $a^t = (a^t_1, \cdots, a^t_i, \cdots, a^t_N)$
      else
        Select the do-nothing action $(a^0_1, \cdots, a^0_i, \cdots, a^0_N)$ as the output action $a^t$
      end if
      Execute action $a^t$ in the environment and obtain the next state $s_{t+1}$, reward $r_t$ and $d_t$, store the transition $(s_t, a^t, s_{t+1}, r_t, d_t)_\tau$ in $\Delta$
      Sample a batch of transitions from $\Delta$
      Calculate the Q-values in the target QMIX network: $y_i^{tot} = r_i$ if $d_t$ is True, and $y_i^{tot} = r_i + \gamma_i \max_{a'} \hat{Q}(\tau', a', s'; \hat{\theta})$ otherwise
      Update the QMIX network by the loss $L(\theta) = \sum_{i=1}^{n} \mathbb{E}_{(s,a) \sim p}\left[\left(y_i^{tot} - Q_{tot}(\tau, a, s; \theta)\right)^2\right]$
      Hard copy the main network weights $\theta$ to the target network weights $\hat{\theta}$ regularly
      Update state $s_t = s_{t+1}$
    end for
  end for

   After action selection is completed, the joint action composed
of APC selections is executed in the environment.
After obtaining the next state and reward of APCs, we
store $(s_t, a^t, s_{t+1}, r_t, d_t)_\tau$ to the replay buffer with specific
rules. When the memory size of the replay buffer reaches a
threshold, batch samples are used to calculate the Q-values
in the target QMIX network. The parameters of the QMIX
network are updated based on the loss computed in (17).
The training procedure operates iteratively until a specified


number of episodes is reached. The algorithm is denoted as
DDRL Training Method for APCC, which is illustrated in
Algorithm 1.

D. Distributed Execution Structure for APCC
   Based on the training algorithm proposed in Section III-C,
we summarize an online execution procedure to obtain the NE
of the APCC game. In APCC, the primary purpose of agents
is to maintain the normal operation of the power grid under
different operation conditions. The system characteristics of
high dimensionality and non-linearity make it challenging to
predict the impact of each control action. For example, when
the disturbance is minor, only a few agents taking action may
outperform the strategy of all the agents taking action. Hence,
an action optimization mechanism is proposed to obtain the
optimal joint actions of APCs, whose flowchart is shown in
Fig. 2.

Fig. 2.   The flowchart of distributed execution. The flowchart proceeds as follows:
  1) Start: initialize a scenario and load the trained agents of the APCs.
  2) Each APC explores an action with the best reward in its control region.
  3) Get the joint action set of the APCs by the method of permutation and combination.
  4) Simulate each joint action of the action set in the scenario.
  5) If all joint actions are ineffective, randomly select a joint action; otherwise, select the joint action with the best reward.
  6) Check whether the transmission power of the tie lines exceeds the thermal limits. If yes, remove the joint action from the action set and select again; if no, execute the joint action in the scenario.
  7) End.

   We use permutations and combinations to obtain the joint
action set of APCs. This mechanism can improve the adaptability
of multi-agent control. The action exploration of each
APC is the same as the method in Algorithm 1.
When executing in a scenario, we need to simulate each joint
action of the action set to determine whether it is effective. If
all actions in the action set are ineffective, a randomly selected
joint action is adopted. Otherwise, we select the joint action
with the best reward. After action selection, the transmission
power of tie lines is checked to determine whether the joint
action is executed. If the joint action leads to an overload of
any tie line, we remove it and search for another feasible
joint action. It can be seen that we primarily consider the
transmission power of tie lines to ensure the stability of power
flow interfaces.

                    IV. CASE STUDY
   In this section, the proposed DAPC method is evaluated
on an open-source platform, “Grid2Op”. To provide a typical
application scenario for the DAPCC application, the 118-bus
grid provided by the “2020 Learning to Run a Power Network
- Neurips Track 2” global competition cases [22] is used to
evaluate the algorithm performance. In the 118-bus grid, we
divided the centralized control region into three distributed
control regions, as shown in Fig. 3. Three agents based on
the QMIX method are implemented to control the elements of
the three regions, respectively. We chose a multi-mix dataset, i.e., a
set of cases with varying shares of renewables, as shown
in Fig. 4.
   Each mixed dataset includes 576 scenarios covering each
month for 48 years. Each scenario contains data for 28 continuous
days with a 5-minute resolution. The penetration rate
of RESs changes in different scenarios, and the maintenance
of lines is considered. We need to train our agents to adapt
to renewable energy production in the grid with an increasing
rate of less controllable renewable energies over the years.
After training, the trained agents are tested on hidden new
mixed datasets that are not present in the training set, to
assess the adaptability of the agent. The 24 test scenarios are
randomly extracted from the data every month and cover the
characteristics of typical scenarios through one year.
   As we implement three agents in our case study, the network
structure of the QMIX method consists of three identical
DRQNs and a mixing network. The DRQNs are composed of
two fully connected layers and a GRU layer with 512 neurons.
The mixing network is composed of a deep neural network
(DNN) with 256 neurons. The rectified linear unit (ReLU) is
used as an activation function, and the batch size is set to 64.
RMS is adopted for the QMIX network, and the learning rate
is set as 0.0005. The maximum number of steps in an episode
is set to 4000. The reaction time and recovery time of the
APCC problem are set to 3 time steps, i.e., 15 minutes. The
simulations are conducted on a Linux server with 3 GPUs.

A. Centralized Learning Capability
   In this part, we adopted the online centralized training algorithm
proposed in Section III-C to train the three APC agents.
Through the Monte Carlo tree search (MCTS) described in
Part I, the numbers of selected effective actions of Region 1,
Region 2, and Region 3 are 46, 540, and 565, respectively.
It can be seen from the results of MCTS that the network
complexity of Region 2 and Region 3 is higher than that of
Region 1. The effective action matrices of the three regions are
used as action spaces for the three APCs, respectively. Specifically,
we add the “do-nothing” action to the action space of each APC,
which is explained in Section III-C. In each episode, the


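Fig. 3.   The topology of the 118-bus power grid. [Network diagram of the l2rpn_neurips_2020_track2_x1 grid divided into Region 1, Region 2, and Region 3; the legend distinguishes powerlines, substations, loads, generators, and bus assignment (no bus, bus 1, bus 2).]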

[Pie charts of the generation mix in each case; the shares read from the charts are listed below.]

| Case   | Thermal | Hydro | Nuclear | Solar | Wind  |
|--------|---------|-------|---------|-------|-------|
| Case 1 | 41.5%   | 23.4% | 16.1%   | 10.0% | 9.0%  |
| Case 2 | 37.9%   | 21.4% | 14.7%   | 13.7% | 12.3% |
| Case 3 | 34.9%   | 19.7% | 13.5%   | 16.8% | 15.1% |
| Case 4 | 32.3%   | 18.2% | 12.5%   | 19.4% | 17.5% |
| Case 5 | 30.1%   | 17.0% | 11.6%   | 21.7% | 19.6% |

Fig. 4.   Energy profiles of each case.
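The RES penetration of each case can be checked directly from the shares in Fig. 4. The sketch below assumes that RES penetration means the combined solar and wind share; this definition is an assumption, although it reproduces the 41.3% quoted for case 5. The dictionary layout and the function name are illustrative only.

```python
# Generation shares (%) per case, as read from the pie charts in Fig. 4.
ENERGY_MIX = {
    "Case 1": {"thermal": 41.5, "hydro": 23.4, "nuclear": 16.1, "solar": 10.0, "wind": 9.0},
    "Case 2": {"thermal": 37.9, "hydro": 21.4, "nuclear": 14.7, "solar": 13.7, "wind": 12.3},
    "Case 3": {"thermal": 34.9, "hydro": 19.7, "nuclear": 13.5, "solar": 16.8, "wind": 15.1},
    "Case 4": {"thermal": 32.3, "hydro": 18.2, "nuclear": 12.5, "solar": 19.4, "wind": 17.5},
    "Case 5": {"thermal": 30.1, "hydro": 17.0, "nuclear": 11.6, "solar": 21.7, "wind": 19.6},
}


def res_penetration(mix):
    """Combined solar and wind share in percent (assumed definition of RES penetration)."""
    return round(mix["solar"] + mix["wind"], 1)


for case, mix in ENERGY_MIX.items():
    print(f"{case}: RES penetration = {res_penetration(mix):.1f}%")
```

Under this reading, the penetration rises monotonically from case 1 to case 5, while the thermal, hydro, and nuclear shares shrink in roughly constant proportion to one another.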


scenarios of various RES penetration rates in every case are
randomly selected in the training process.
   After 1000 episodes of training, we take the moving average
value of cumulative rewards for every 200 episodes; the cumulative
reward curve of the DAPCC is shown in Fig. 5. In the
early phase of training, scenarios with a variable penetration
rate of RESs are randomly selected. As a result, the trend of
cumulative reward varies dramatically. From 200 episodes to
600 episodes, the curve of cumulative reward becomes stable
and increasing. It can be seen that the cumulative reward
finally begins to converge at around 800 episodes, which
indicates that the NE of the APCC game may have been
obtained.

B. Distributed Executing Performance
   In this part, we implement the trained APC agents in
Section IV-A to evaluate whether they can solve the problem
of APCC. The information on 24 test scenarios is provided in
Table I, which contains all kinds of cases.
   In the evaluation process, we test our trained APCs in
a sequence of the penetration rate of RESs and limit the
maximum alive steps of a scenario to 4000. In Fig. 6, the


legend “max-3-actions” (M3A) denotes that the number of
APC actions allowed in a single time slot is limited to 3.
Meanwhile, the legend “max-1-actions” (M1A) denotes that
the number of actions is limited to 1. Note that
the “do-nothing” action serves as the baseline against which
the other schemes are compared.

Fig. 5.   Moving average curve of the cumulative reward of the proposed DAPCC. [Plot: cumulative reward (axis from 120000 to 200000) versus episode (0 to 1000).]

                              TABLE I
                 THE INFORMATION OF 24 TEST SCENARIOS

| Mix ID | Number of scenarios | Active load consumption (MW) in each scenario |
|--------|---------------------|-----------------------------------------------|
| Case 1 | 3                   | 2163, 2217, 2147                              |
| Case 2 | 4                   | 2081, 2099, 2217, 2163                        |
| Case 3 | 6                   | 2104, 2033, 2228, 2157, 2145, 2183            |
| Case 4 | 6                   | 2143, 2161, 2142, 2130, 2188, 2143            |
| Case 5 | 5                   | 2093, 1999, 2175, 2156, 2110                  |

   In the first 7 scenarios, the mixed data of environments are
case 1 and case 2. It can be seen that the control action
of DAPCC is effective, which can keep the power system
alive longer. Considering the scenarios in case 1 and case
2 contain only 19% RESs, the performance of the M3A
scheme is similar to that of the M1A scheme. In the following
12 scenarios of cases 3 and 4, the M3A scheme performs more
effectively and stably than the M1A scheme, which indicates
that the distributed executing scheme we proposed is beneficial
to the APCC problem. The last 5 scenarios, which contain
41.3% RESs in case 5, are difficult for the agent with only
BBS control. However, the APCs with the M3A scheme
still perform better than the other action schemes. Notice that the average
decision time for each time-step of case 1, case 2, case 3,
and case 4 is around 40 milliseconds. This indicates that the
proposed method is practical for the APCC problem.

C. Adaptability of Proposed Control Method
   To fully evaluate the performance of DAPCC, we randomly
selected two scenarios, one from case 2 and one from case 4, to test
our trained APCs. The comparison of the M3A action scheme
and M1A action scheme is retained in this part. The alive
steps of the M3A scheme and M1A scheme are almost the
same in case 2, while they are different in case 4.
   We first investigated the maximum line capacity usage of
each control region to reflect the effectiveness of the control
strategy directly. As shown in Fig. 7, the system blacks out
with a “do-nothing” action at around 500 steps. This illustrates
that the plugged-in RESs lead to heavy overload in the power
system if no control measure is taken. In Fig. 7(a), although the
time-steps during which the system is “alive” using the M1A
scheme and M3A scheme both reach 3000 steps, the maximum

                              TABLE II
                    THE PARAMETERS OF TIE LINES

| Line ID | Line origin bus | Line extremity bus | Transmission capacity (MW) |
|---------|-----------------|--------------------|----------------------------|
| 108     | 14              | 32                 | 208.10                     |
| 109     | 18              | 33                 | 226.90                     |
| 117     | 29              | 37                 | 456.00                     |
| 94      | 22              | 23                 | 674.20                     |
| 7       | 69              | 73                 | 480.00                     |
| 8       | 69              | 74                 | 436.50                     |
| 9       | 68              | 74                 | 681.80                     |
| 12      | 68              | 76                 | 682.50                     |
| 185     | 67              | 80                 | 997.10                     |
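The tie-line check described in Section III-D can be sketched with the capacities in Table II. In the snippet below, the function names, the dictionary layout, and the example flows are hypothetical; only the line IDs and capacities come from Table II. Treating the listed transmission capacity as the thermal limit, and comparing against the absolute value of the flow, are assumptions of this sketch.

```python
# Transmission capacity (MW) of each tie line, keyed by line ID (Table II).
TIE_LINE_CAPACITY_MW = {
    108: 208.10, 109: 226.90, 117: 456.00,
    94: 674.20, 7: 480.00, 8: 436.50,
    9: 681.80, 12: 682.50, 185: 997.10,
}


def overloaded_tie_lines(flows_mw):
    """Return the sorted IDs of tie lines whose absolute flow exceeds the capacity."""
    return sorted(
        line_id
        for line_id, flow in flows_mw.items()
        if abs(flow) > TIE_LINE_CAPACITY_MW[line_id]
    )


def joint_action_allowed(flows_mw):
    """A joint action is kept only if it overloads no tie line."""
    return not overloaded_tie_lines(flows_mw)


# Hypothetical simulated tie-line flows (MW) for two candidate joint actions.
flows_ok = {108: 150.0, 109: -200.0}
flows_bad = {108: 150.0, 109: 240.0, 185: 1000.0}

print(joint_action_allowed(flows_ok), overloaded_tie_lines(flows_ok))
print(joint_action_allowed(flows_bad), overloaded_tie_lines(flows_bad))
```

A rejected joint action would then be removed from the candidate set and the next candidate examined, as the execution procedure describes.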

Fig. 6.   Performance of DAPCC in 24 test scenarios. [Bar chart: alive steps (0 to 4000) for scenario numbers 1 to 24, grouped into Case 1 to Case 5, for the max_3_actions, max_1_actions, and donothing schemes.]


Fig. 7.   The comparison of system operating status under different schemes. (a) System operating status under different schemes in the scenario of case 2. (b) System operating status under different schemes in the case 4 scenario. [Both panels plot the max line capacity usage ratio (0.6 to 2.0) against alive steps (0 to 3000) for the max_3_actions, max_1_actions, and do nothing schemes.]

Fig. 8. The line capacity usage of tie lines in the scenarios. (a) The scenario of mix x1.5. (b) The scenario of mix x2.5. Each column contains panels for line 109, line 8, and line 12, plotting the line capacity usage ratio against alive steps for the MAX 3SUB, MAX 1SUB, and Do Nothing schemes together with the Baseline curve.


line capacity usage ratio of the M1A scheme increases sharply at around 2800 steps, which indicates that the robustness of the M3A scheme is better than that of the M1A scheme. From Fig. 7(b), it can be seen that as the penetration rate of RESs increases, the “do-nothing” scheme and the M1A scheme cannot eliminate or alleviate the overload as well as the M3A scheme does. This demonstrates the adaptability of our proposed control method.

After investigating the control performance of each APC, we focused on the line capacity usage of tie lines to evaluate the coordination between the APCs. The parameters of the tie lines are shown in Table II. We recorded the line capacity usage in the test scenarios of case 2 and case 4. As shown in Fig. 3, buses 18, 33, 68, 69, 74, and 76, which connect more


loads and lines than others, are regarded as heavy load buses. Hence, we checked the transmission power of line 109, line 8, and line 12, as shown in Fig. 8. It can be seen that the line capacity usage of these three tie lines is always kept below the baseline. This illustrates that the DAPCC method is suitable for multi-region coordinated control.

V. CONCLUSIONS

To improve the adaptability of the DRL algorithm applied to the APCC problem, a DAPCC method is proposed that follows the “centralized training and distributed execution” framework for a large-scale power system. Based on stochastic game theory, we adopted the QMIX method to obtain the Nash equilibrium of the APCs. Considering the cooperation of the APCs, an action-optimization mechanism is proposed to obtain the global optimality of joint actions. The case studies demonstrate the convergence of centralized learning and the adaptability of distributed execution in large-scale APCC with plugged-in RESs. The application of DRL to the control of complex power systems is still in its infancy. Future work on this topic shall involve multi-faceted application scenarios, and these applications are expected to solve power system control problems with ultra-high dimensionality and non-linearity.

REFERENCES

[1] X. P. Li, P. Balasubramanian, M. Sahraei-Ardakani, M. Abdi-Khorsand, K. W. Hedman, and R. Podmore, “Real-time contingency analysis with corrective transmission switching,” IEEE Transactions on Power Systems, vol. 32, no. 4, pp. 2604–2617, Jul. 2017.
[2] H. R. Diao, M. Yang, F. Chen, and G. Z. Sun, “Reactive power and voltage optimization control approach of the regional power grid based on reinforcement learning theory,” Transactions of China Electrotechnical Society, vol. 30, no. 12, pp. 408–414, Jun. 2015.
[3] J. Y. Zhang, C. Liu, J. Si, J. Song, and Y. S. Su, “Deep reinforcement learning for short-term voltage control by dynamic load shedding in China southern power grid,” in Proceedings of 2018 International Joint Conference on Neural Networks, 2018, pp. 1–8.
[4] Q. L. Yang, G. Wang, A. Sadeghi, G. B. Giannakis, and J. Sun, “Two-timescale voltage control in distribution grids using deep reinforcement learning,” IEEE Transactions on Smart Grid, vol. 11, no. 3, pp. 2313–2323, May 2020.
[5] J. J. Duan, D. Shi, R. S. Diao, H. F. Li, Z. W. Wang, B. Zhang, D. S. Bian, and Z. H. Yi, “Deep-reinforcement-learning-based autonomous voltage control for power grid operations,” IEEE Transactions on Power Systems, vol. 35, no. 1, pp. 814–817, Jan. 2020.
[6] O. Vinyals, T. Ewalds, S. Bartunov, P. Georgiev, A. S. Vezhnevets, M. Yeo, A. Makhzani, H. Küttler, J. Agapiou, J. Schrittwieser, J. Quan, S. Gaffney, S. Petersen, K. Simonyan, T. Schaul, H. van Hasselt, D. Silver, T. Lillicrap, K. Calderone, P. Keet, A. Brunasso, D. Lawrence, A. Ekermo, J. Repp, and R. Tsing, “StarCraft II: A new challenge for reinforcement learning,” arXiv:1708.04782, Aug. 2017. [Online]. Available: https://arxiv.org/pdf/1708.04782.pdf.
[7] B. K. Petersen, J. C. Yang, W. S. Grathwohl, C. Cockrell, C. Santiago, G. An, and D. M. Faissol, “Deep reinforcement learning and simulation as a path toward precision medicine,” Journal of Computational Biology, vol. 26, no. 6, pp. 597–604, Jun. 2019.
[8] J. Y. Chen, B. D. Yuan, and M. Tomizuka, “Model-free deep reinforcement learning for urban autonomous driving,” in Proceedings of 2019 IEEE Intelligent Transportation Systems Conference, 2019, pp. 2765–2771.
[9] C. H. Liu, Z. Y. Chen, J. Tang, J. Xu, and C. Z. Piao, “Energy-efficient UAV control for effective and fair communication coverage: A deep reinforcement learning approach,” IEEE Journal on Selected Areas in Communications, vol. 36, no. 9, pp. 2059–2070, Sept. 2018.
[10] M. Khodayar, G. Liu, J. Wang, and M. E. Khodayar, “Deep learning in power systems research: A review,” in CSEE Journal of Power and Energy Systems, vol. 7, no. 2, pp. 209–220, Mar. 2021, doi: 10.17775/CSEEJPES.2020.02700.
[11] Z. Zhang, D. Zhang, and R. C. Qiu, “Deep reinforcement learning for power system applications: An overview,” in CSEE Journal of Power and Energy Systems, vol. 6, no. 1, pp. 213–225, Mar. 2020, doi: 10.17775/CSEEJPES.2019.00920.
[12] I. Adamski, R. Adamski, T. Grel, A. Jȩdrych, K. Kaczmarek, and H. Michalewski, “Distributed deep reinforcement learning: Learn how to play Atari games in 21 minutes,” in Proceedings of the 33rd International Conference on High Performance Computing, 2018, pp. 370–388.
[13] K. X. Lin, R. Y. Zhao, Z. Xu, and J. Y. Zhou, “Efficient large-scale fleet management via multi-agent deep reinforcement learning,” in Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2018, pp. 1774–1783.
[14] M. Tan, “Multi-agent reinforcement learning: Independent vs. cooperative agents,” in Proceedings of the 10th International Conference on Machine Learning, 1993, pp. 330–337.
[15] P. Sunehag, G. Lever, A. Gruslys, W. M. Czarnecki, V. Zambaldi, M. Jaderberg, M. Lanctot, N. Sonnerat, J. Z. Leibo, K. Tuyls, and T. Graepel, “Value-decomposition networks for cooperative multi-agent learning based on team reward,” in Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems, 2018, pp. 2085–2087.
[16] T. Rashid, M. Samvelyan, C. S. de Witt, G. Farquhar, J. Foerster, and S. Whiteson, “QMIX: Monotonic value function factorisation for deep multi-agent reinforcement learning,” in International Conference on Machine Learning. PMLR, 2018, pp. 4295–4304.
[17] A. Marot, B. Donnot, C. Romero, B. Donon, M. Lerousseau, L. Veyrin-Forrer, and I. Guyon, “Learning to run a power network challenge for training topology controllers,” Electric Power Systems Research, vol. 189, pp. 106635, Dec. 2020.
[18] Universidad Nacional de Colombia. L2RPN-NEURIPS-2020. [Online]. Available: https://github.com/unaioperator/l2rpn-neurips-2020.
[19] M. Hausknecht and P. Stone, “Deep recurrent Q-learning for partially observable MDPs,” arXiv:1507.06527, Jan. 2017. [Online]. Available: https://arxiv.org/pdf/1507.06527.pdf.
[20] C. Amato, G. Chowdhary, A. Geramifard, N. K. Üre, and M. J. Kochenderfer, “Decentralized control of partially observable Markov decision processes,” in Proceedings of the 52nd IEEE Conference On Decision and Control, 2013, pp. 2398–2405.
[21] RTE-France. Grid2Op. [Online]. Available: https://github.com/rte-france/Grid2Op.
[22] RTE-France. L2RPN NEURIPS 2020 - robustness track. [Online]. Available: https://competitions.codalab.org/competitions/25426.

Siyuan Chen (S’19) received the M.S. degree from the School of Electrical Engineering and Automation, Wuhan University, China, in 2018. He is currently pursuing a Ph.D. degree in the School of Electrical Engineering and Automation, Wuhan University. His research interests include power system operation and control, artificial intelligence, and machine learning.

Jiajun Duan (S’13–M’18) received the B.S. degree in Power System and its Automation from Sichuan University, Chengdu, China, and the M.S. degree in Electrical Engineering from Lehigh University, Bethlehem, PA, in 2013 and 2015, respectively, and the Ph.D. degree in Electrical Engineering from Lehigh University in 2018. Currently, he is a Research Scientist at GEIRI North America, San Jose, CA, USA. His research interests include artificial intelligence, power systems, power electronics, control systems, and machine learning.


Yuyang Bai received the B.S. degree from the School of Electrical Engineering and Automation, Wuhan University, China, in 2019. He is currently pursuing an M.S. degree in the School of Electrical Engineering and Automation, Wuhan University. His research interests include artificial intelligence, power system operation and control, and machine learning.

Jun Zhang (M’09–SM’14) received the B.E. and M.E. degrees in Electrical Engineering from the Huazhong University of Science and Technology, Wuhan, China, in 2003 and 2005, respectively, and the Ph.D. degree in Electrical Engineering from Arizona State University, Phoenix, AZ, USA, in 2008. He is currently a Professor in the School of Electrical Engineering and Automation, Wuhan University. He has authored/coauthored more than 80 peer-reviewed publications. His research expertise is in the areas of complex systems, artificial intelligence, knowledge automation, and their applications in smart grid.

Di Shi (M’12–SM’17) received the B.S. degree in Electrical Engineering from Xi’an Jiaotong University, Xi’an, China, in 2007, and the M.S. and Ph.D. degrees in Electrical Engineering from Arizona State University, Tempe, AZ, USA, in 2009 and 2012, respectively. He currently leads the AI & System Analytics Group at GEIRI North America, San Jose, CA, USA. His research interests include WAMS, energy storage systems, and renewable integration. He is an Editor of IEEE Transactions on Smart Grid.

Zhiwei Wang (M’16–SM’18) received the B.S. and M.S. degrees in Electrical Engineering from Southeast University, Nanjing, China, in 1988 and 1991, respectively. He is President of GEIRI North America, San Jose, CA, USA. Prior to this assignment, he served as President of State Grid US Representative Office, New York City, from 2013 to 2015, and President of State Grid Wuxi Electric Power Supply Company from 2012 to 2013. His research interests include power system operation and control, relay protection, power system planning, and WAMS.

Xuzhu Dong (M’02–SM’10) received the Ph.D. degree in High Voltage Engineering from Tsinghua University, Beijing, China, in 1998, and the Ph.D. degree in Electrical Engineering from Virginia Tech University, USA, in 2002. He is currently a Professor in the School of Electrical Engineering and Automation, Wuhan University. His research interests include distribution automation, energy storage, renewable energy, and micro-grid.

Yuanzhang Sun (M’99–SM’01) received the Ph.D. degree in Electrical Engineering from Tsinghua University, Beijing, China, in 1988. He is currently a Professor in the School of Electrical Engineering and Automation, Wuhan University, and a Chair Professor in the Department of Electrical Engineering and Vice Director of the State Key Laboratory of Power System Control and Simulation, Tsinghua University. His main research interests are power system dynamics and control, wind power, voltage stability and control, and reliability.