---
title: "Neural Networks for Power Flow: Graph Neural Solver"
source_url: "https://hal.science/hal-02372741"
source: "PSCC 2020 PDF; HAL metadata: https://hal.science/hal-02372741"
---

# Neural Networks for Power Flow: Graph Neural Solver

Neural Networks for Power Flow:
                                Graph Neural Solver
                   Balthazar Donon, Rémy Clément,                              Isabelle Guyon, Marc Schoenauer
                   Benjamin Donnot, Antoine Marot                           TAU group of Lab. de Rec. en Informatique
                   Réseau de Transport d’Électricité                        UPSud/INRIA Université Paris-Saclay
                             Paris, France                                                 Paris, France
                   {balthazar.donon, remy.clement,                            iguyon@lri.fr, marc.schoenauer@inria.fr
            benjamin.donnot, antoine.marot}@rte-france.com



   Abstract—Recent trends in power systems and those envisioned      be a promising avenue to get closer to this goal. Pioneering
for the next few decades push Transmission System Operators          work by Nguyen [4] demonstrated the feasibility of such
to develop probabilistic approaches to risk estimation. However,     an approach. Donnot et al. continued in this direction, and
current methods to solve AC power flows are too slow to
fully attain this objective. Thus we propose a novel artificial      managed to encode atypical interventions on a given grid [5],
neural network architecture that achieves a more suitable balance    [6]. Duchesne et al. [7] also use Machine Learning coupled
between computational speed and accuracy in this context.            with control variate to speed up a Monte Carlo approach.
Improving on our previous work on Graph Neural Solver for               We propose a novel method based on graph neural networks
Power System [1], our architecture is based on Graph Neural          to solve the AC power flow problem. This method does
Networks and allows for fast and parallel computations. It learns
to perform a power flow computation by directly minimizing the       not aim at imitating another method, but consists in training
violation of Kirchhoff’s law at each bus during training. Unlike     a neural network by direct minimization of the violation
previous approaches, our graph neural solver learns by itself and    of physical laws. Our solver approach lies at the interface
does not try to imitate the output of a Newton-Raphson solver.       between artificial intelligence and optimization, and offers
It is robust to variations of injections, power grid topology, and   interesting computational performance.
line characteristics.
We experimentally demonstrate the viability of our approach on          The artificial neural networks domain [8], [9] is very pro-
standard IEEE power grids (case9, case14, case30 and case118)        ficient and has achieved several major breakthroughs in the
both in terms of accuracy and computational time.                    past decade, thanks to a proactive community of researchers
  Index Terms—Power Flow, Solver, Artificial Neural Networks,        and to the exponentially-increasing computational capability of
Graph Neural Networks, Graph Neural Solver                           modern computers. Application of artificial neural networks
                                                                     to physics problems has also known a surge of interest in
               I. BACKGROUND & M OTIVATIONS                          recent years. Trajectories of interacting particles have been
   Transmission system operators such as RTE (Réseau de             modelled [10], fluid mechanics computations problems have
Transport d’Électricité) are in charge of managing the power       been accelerated [11], and applications to quantum mechanics
system infrastructure. They perform security analyses that           offer promising results [12], [13], [14].
rely on the “N-1” criterion: for every single outage that can
happen, they make sure that the grid remains stable. Recent
developments in the energy ecosystem in Europe, as well as
those envisioned for the next few decades (increasing amount
of uncertainty caused by intermittent sources of energy or by
a less predictible end consumer’s behavior, more open energy
markets, increasing automation of the grid, etc.) push TSOs
                                                                     Fig. 1. Artificial Neural Network example - Artificial Neural Networks
to rethink their approach to security analysis.                      consist in function approximations that typically look like the equation above.
   The GARPUR1 consortium reports advocate for a prob-               The function σ is a non linearity (typically ReLU or tanh), and parameters
abilistic approach to power system risk assessment, going            of the model Wi and bi are fitted by a stochastic gradient descent trying to
                                                                     minimize the distance between the prediction and the ground truth.
beyond the simple “N-1” criterion [2], [3]. Current methods for
AC power flow computations (such as Newton-Raphson) are                 We advocate for a meticulous approach to the way AI tools
too slow to enable this, and a method that would be faster (at       are being developed for critical and industrial systems such as
the cost of a reduced accuracy) could help achieve a reasonable      power grids. More specifically, encoding the known invariants
statistical evaluation of the risk. Speeding up computations of      and symmetries of a system directly inside the structure of
power flow computation using artificial intelligence seems to        artificial intelligence models make them resilient by design.
                                                                     More specifically, and to give a power system example, a
  1 www.sintef.no/projectweb/garpur                                  “black box” model that would be trained on a situation where



21st Power Systems Computation Conference                                                          Porto, Portugal — June 29 – July 3, 2020
                                                               PSCC 2020
every line is connected, will most likely not be accurate                  The overall architecture is presented in Figure 2. The structure
anymore if one of those lines were to be disconnected. On                  of the Graph Neural Network that we used is presented at the
the contrary, a model that directly encodes invariants and                 bottom part of Figure 2 and is organized as follows:
symmetries will better withstand those perturbations. Our work               • First, the input message xi of each bus i is converted
is intrinsically in line with Battaglia et al. [15], as we aim at              into a message m0i using a common encoder (which is a
building a strongly generalizable neural network architecture                  neural network similar to the one shown in Figure 1).
that interwines generic learning blocks in an instance agnostic              • Secondly, those messages are iteratively updated by com-
manner.                                                                        mon neural networks Lk that take as an input - for each
   Graph Neural Networks are a class of Artificial Neural                      bus i - the sum of the messages of its neighbors l ∈ N (i).
Network that are designed to respect the permutation invariants                There are K “correction” updates, and each of the K
and edge symmetries of graph structures. They use neural                       neural networks Lk are different.
network functions such as the one described in Figure 1                                                             K
                                                                             • Finally, the resulting messages mi are all decoded into
as small learning blocks that will be used at several places                   the prediction ŷi by applying another common neural
of the network (see Figure 2). They were first introduced                      network D. All the K + 2 neural network blocks (E,
by Scarselli et al. in [16] and further developed in [17],                     L0 , ..., LK , D) are then all jointly trained by a gradient
[18], and their theoretical properties have been investigated                  descent learning process.
in [19], [20]. They have been successfully applied to graph
classification [21], [22], [23], vertex classification [24], [25]          We experimentally demonstrated that our Graph Neural Solver
and relational reasoning [26]. [27] and [10] are examples of               architecture was able to learn on randomly generated power
hybrid approaches that blend deep learning and knowledge                   grids of 30 buses, and infer (with a better accuracy than the DC
about the structure of the problem. This novel domain of                   approximation) on random power grids of sizes ranging from
artificial intelligence aims at dealing with graph-structured              10 to 110 buses, thus achieving a form of zero-shot learning
data. The input data x relies on a graph structure presented at            [28]. This architecture is by design robust to power grid
the top left of the figure, and our goal is to perform a prediction        topology perturbations and variations of injections, and has
ŷ that is born by the same graph structure. The key idea of               the property to be fully differentiable as it relies exclusively
Graph Neural Networks is to iteratively propagate messages                 on neural networks.
through the edges of the graph structure. These propagations                  However, there were some limits to this first investigation
can be seen as “correction” updates.                                       that were left for future research. First of all, the strong hypoth-
                                                                           esis that all power lines share the same physical characteristics
                                                                           was made, to simplify the problem. Secondly, the datasets
                                                                           that were used both for training and testing had already been
                                                                           processed by RTE’s proprietary power flow solver Hades2 to
                                                                           ensure a global active equilibrium. Another issue was the need
                                                                           to perform expensive calls to a Newton-Raphson solver to
                                                                           generate a dataset that was later used to train the Graph Neural
                                                                           Solver. Finally, the computational speed was not significantly
                                                                           better that the one of a classical approach (such as Newton-
                                                                           Raphson).
                                                                              One contribution of this paper is to build a neural network
                                                                           architecture that solves the AC power flow equations and that
                                                                           answers all the previously mentioned issues. Moreover, our
                                                                           previous work relied on minimizing the distance between our
                                                                           neural network’s predictions and the results of a classical
                                                                           Newton-Raphson method. The main novelty of our present
                                                                           work is that we no longer try to imitate the output of another
                                                                           solver and we directly penalize the violation of physical laws
                                                                           during the training process. Our approach is thus completely
                                                                           independent from any other algorithm, and stands as a new
Fig. 2. Graph Neural Network - The idea of this novel class of neural is   way of performing a power flow. Moreover, we moved closer
to iteratively propagate messages between direct neighbors.                to the classical equations of the power systems literature, thus
   Our previous work on the Graph Neural Solver [1] laid the               enabling the incorporation of line characteristics variations in
ground for a strongly generalizable fast neural solver. It relies          our model. We ensure a global active equilibrium directly as
heavily on graph neural networks, and consists in three main               a part of our architecture. Finally, a better implementation of
parts: first an embedding of the input (injections at each line            the architecture using the deep learning framework Tensorflow
side), then a message propagation, and finally a decoding part             [29], and the use of GPUs allowed to drastically decrease
that converts the resulting messages into flows through lines.             the computational time (during inference), and thus make our



21st Power Systems Computation Conference                                                           Porto, Portugal — June 29 – July 3, 2020
                                                                     PSCC 2020
Graph Neural Solver a relevant new method for power systems             During training, our model is never told what the actual
operation.                                                              solution is. We let it perform a prediction, we compute
   This paper is organized as follows. First we introduce               its error, and tell it how it should alter its neural network
the notations, explain the structure of our proposed neural             blocks to decrease it. This way, it crystallizes a behavior that
network architecture, and how our learning algorithm will be            minimizes the physical laws violation. At inference time, it
trained. Then we experimentally demonstrate the viability of            has no access whatsoever to the global objective function, but
our approach on a set of IEEE power grids of different sizes.           only mechanically applies the behavior that it has previously
Finally we develop on the encountered limits, on how this               learned.
work should be further improved, and on how the properties                 Our architecture consists in an initialization followed by
of this new solver (computational speed and differentiability)          loops of global active compensations, local power imbalance
can be exploited.                                                       computations and neural network updates. It intricates both
                II. P ROPOSED M ETHODOLOGY                              learnable neural network parts and non-learnable physics-
                                                                        based parts. The learnable parts (which are neural networks)
   In this section, we develop on the overall architecture and          are trained through a learning process detailed below. The
equations of our proposed Graph Neural Solver. We also give             overall idea of the architecture is to iteratively and locally min-
details about some power grid modelling choices, and about              imize the violation of physical laws, while ensuring a global
the concept of message propagation.                                     active equilibrium. A full description of how the different steps
A. Graph Neural Solver structure                                        are laid out is shown in Algorithm 1.
    Our goal is to predict v and θ at each bus of a power grid.            The amount of “correction” updates K, the dimension of
We set ourselves in a steady-state setup. We take as input (see         the messages d and the discount factor γ are hyperparameters
Appendix and subsection II-C for more details about those               of the model.
quantities):                                                               1) Initialization: All voltages are set to 1 p.u. (with the
    • a base apparent power: baseM V A;                                 exception of the buses that are connected to a generator, whose
    • a list of buses: (i, Pd,i , Qd,i , Gs,i , Bs,i );                 voltage is set to the generator’s voltage set point), phase angles
    • a list of lines: (f rom bus, to bus, r, x, bc , τ, θshif t );     to 0 rad, and messages (see subsection II-D) to a d-dimensional
    • a list of generators: (gen bus, P g,i , P g,i , P̊g,i , vg,i ).   zero message.
    We choose the following convention for the notations: quan-            Details in the green equations (1) – (3) of Figure 3.
tities such as v, θ, etc. have a superscript k that corresponds to         2) Global active compensation: This step ensures a global
the “correction” update k, and a subscript i that corresponds           active equilibrium. It finds the value of the global parameter
to the bus i of the power grid (there are N buses on the grid).         λ that ensures this equilibrium. As detailed later in subsection
The absence of a subscript i denotes the vector (or matrix in           II-C, λ is a global parameter that controls the active power
the case of the messages m defined in subsection II-D) that             output of every generator on the grid. First, we need to
concatenates the corresponding quantity for all buses.                  compute the global consumption, including the losses via
    Our goal being to develop a power flow solver based on              Joule’s effect, and then solve a simple linear system. We
artificial neural network, we choose to use the same data               also enforce a local reactive equilibrium at each generator by
structure and power grid modelling as MATPOWER [30].                    modifying their reactive power outputs.
More details about notations and power grid modelling are                  Details in the orange equations (4) – (11) of Figure 3.
available in the Appendix. All notations and equations are                 3) Local power imbalance computation: The local power
compliant with the MATPOWER modelling of a power grid.                  imbalance at bus i and at update iteration k is defined
    Our proposed architecture is an iterative process that per-                                                                      √ by
                                                                        ∆Sik = ∆Pik + j∆Qki (where j is such that j 2 = −1).
forms “correction” updates on the variables v, θ and m of each          This term is computed at each K “correction” updates, and
bus of a power grid. These “correction” updates are analogous           its modulus is then incorporated in the Total Loss (see line 9
to the iterations performed by a Newton-Raphson solver.                 of Algorithm 1).
    Previous attempts at using artificial intelligence to tackle
physical problems tried to mimic the behavior of another                   Details in the blue equations (12) – (17) of Figure 3.
solver. In this work, we go a step further and break free                  4) Neural Network Update: This step consists in updating
from the need to imitate a classical solver. This is the main           (or “correcting”) vik , θik and mki based on their current values,
contribution of this paper: instead of minimizing the distance          the local power imbalance, and the sum of messages of their
between our prediction and the output of a solver, we directly          direct neighbors. These updates are performed by learnable
minimize the violation of physical laws (i.e. Kirchhoff’s laws)         neural network blocks that are specific to their corresponding
at training time. This is achieved at the cost of incorporating         “correction” update layer (thus the superscript k), and common
directly into the architecture the physics equations that model         to every bus in the power grid. However, the voltage of the
the behavior of power grids.                                            buses that are connected to a generator remain fixed at their
    Our algorithm is a novel hybrid approach that stands directly       generator’s voltage set point.
at the interface between artificial intelligence and optimization.         Details in the red equations (18) – (20) of Figure 3.



21st Power Systems Computation Conference                                                        Porto, Portugal — June 29 – July 3, 2020
                                                                  PSCC 2020
B. Overall architecture of the Graph Neural Solver                    (in order to let gradients flow through it during the training
   Algorithm 1 diagram details all the different operations           process). This is achieved by the following rule:
performed inside the proposed Graph Neural Solver. The term                                              
input refers to the following pieces of information: a base                        P g,i + 2 P̊g,i − P g,i λ,           if λ < 12
apparent power, a list of buses, a list of lines and a list of          Pg,i (λ) =                                 
                                                                                   2P̊g,i − P g,i + 2 P g,i − P̊g,i λ, otherwise
generators.
1) At first, the algorithm performs an Initialization, that sets                                                                (21)
all θs to 0 rad, all messages to [0, . . . , 0], and all voltages v
to 1 p.u. (except for buses connected to generators: they are            All productions of the power grid depend on a single
set to their corresponding voltage set points).                       common parameter λ. It ensures that all productions are at
2) The Global Active Compensation estimates the value of              their respective minimum values P g,i for λ = 0, at their
λ (defined in subsection II-C) to ensure a global active              initial set points P̊g,i for λ = 21 and at their maximum values
equilibrium of the power grid. It also forces buses that are          P g,i for λ = 1. As detailed hereafter, the proposed Graph
connected to generators to be at a reactive power equilibrium.        Neural Solver architecture will update the global parameter λ
3) The Local Power Imbalance is computed by estimating the            to constantly ensure a global active equilibrium. The equa-
mismatch in both active and reactive power at each bus.               P to solve to find
                                                                      tion                  P the λ that will ensure equilibrium is
4) Then variables v, θ and m (defined in subsection II-D) are            i∈Gen Pg,i (λ) =      i∈Load Pl,i +PJoule , where PJoule is the
updated by a Neural Network Update.                                   total power loss caused by Joule’s effect. Since this equation
The equations of each of these steps are displayed on Figure          is quite simple, there is a closed form solution which is shown
3, in a similar color code. All these operations are fully            in Equation 7. The impact of λ on the active production of a
differentiable, which allows gradients to flow from the Total         generator is illustrated in Figure 4.
Loss to each Neural Network Update during the training                   Constant power generators such as wind turbines and solar
process. Only the Neural Network Updates are learnt during            panels can be modelled as negative consumptions during the
the training process, while all other parameters remain fixed.        global active compensation step so that they are not affected
Unlike our previous approach detailed in Figure 2, we no              by λ. However, those should still be considered as generators
longer use encoders and decoders, and directly act on the             during the neural network updates, otherwise their v would
variables v, θ and m. Indeed, if the model is not learning,           not be controlled.
only the (viK , θiK )i=1,...,n are returned.
   The outputs of the model are the vs and θs of the K-th
correction update.                                                    D. Message propagation

C. Power grid modelling                                                 In addition to well-known variables such as voltage modulus
   The reactive power limits of the generators (Qg,min and            v and voltage angle θ, our algorithm also relies on the
Qg,max ) are not considered here, to be consistent with the           propagation of messages m. These vectors have no direct
default behavior of MATPOWER.                                         physical meaning. [1] experimentally demonstrated that the
   When solving an AC Power Flow system, one has to make              propagation of such variables was a suitable way to perform
sure that the overall active production is equal to the overall       the power flow computation. These messages are defined in
active consumption plus all losses caused by Joule’s effect           Rd , where d is an hyperparameter2 of the architecture.
on transmission lines. To do so, solvers slightly modify the            This can be seen as a way of encoding in an abstract
active productions. There are many ways of doing that : one           continuous space the state of a bus. They capture information
can choose to modify the active power of a single generator           about the current state, the history of updates, and information
(“slack bus”), or to use all generators proportionally. Another       about neighbors as well as their respective history. They can
common approach is to assign a coefficient to each generator          be viewed as an implicit memory and neighbor-awareness
and have all of them contribute proportionally to maintain a          variable.
global active equilibrium. However, generators are limited by
their respective maximum and minimum active power. When
such a threshold is encountered by one generator, all the others      E. Training algorithm
have to compensate a little more. This requires to check a              The violation of Kirchhoff’s law is computed at the end of
number of conditions that scales linearly with the number of          each “correction” update layer, for each bus of every sample of
generators on the power grid, which would be quite hard to            a minibatch of T samples (let t be the index of a sample). The
encode into a differentiable neural network.                          Total Loss3 that will be minimized during the training process
   All those solutions are simplified models of the real-life
compensation process that involves economical, geographical,
                                                                         2 Hyperparameters are parameters of a neural network that are not learnt
organisational and technological considerations.
                                                                      during the training process.
   The compensation rule we choose needs to be both delo-                3 The term “loss” refers to the error of a model in the machine learning
calized (i.e. have all generators contribute), and differentiable     literature




21st Power Systems Computation Conference                                                         Porto, Portugal — June 29 – July 3, 2020
                                                                PSCC 2020
                                         (
                                             vg,i , if i has a generator
                   vi0 =                                                                                                                                             (1)
                                             1,     otherwise
                    θi0 =                0                                                                                                                           (2)
                  m0i =                  [0, . . . , 0]   T
                                                                                                                                                                     (3)
                                           X                      1
              pkJoule =                             |vik vjk yij     (sin(θi − θj − δij − θshif t,ij ) + sin(θj − θi − δij + θshif t,ij ))                           (4)
                                                                 τij
                                         i:“from”
                                          j:“to”
                                                          2
                                                 vik
                                             
                                                                                      2
                                         +                     yij sin(δij ) + vjk         yij sin(δij )|                                                            (5)
                                                 τij
                                         N
                                         X                            2
              pkglobal =                         pd,i + gs,i vik           + pkJoule                                                                                 (6)
                                         i=1
                                          pk         PN
                                             global −  i=1 pg,i                                              PN
                                         
                                          PN         PN                    ,              if pkglobal <     i=1 p̊g,i
                                             2    i=1 p̊g,i −    i=1 pg,i
                   λk =                                                                                                                                              (7)
                                             pk
                                                        PN             PN
                                              global −2     i=1 p̊g,i +        pg,i
                                                                  PN i=1               , otherwise
                                         
                                                  PN
                                                2     (
                                                      i=1 pg,i −     i=1 p̊g,i    )
                  pkg,i =                pkg,i (λk )  cf eq (21)                                                                                                     (8)
                                                         2 
                   k
                  qg,i =                  qd,i − bs,i vik                                                                                                            (9)
                                                  "                                                    k 2                       #
                                             X
                                                       k k      1                                      vi                     bij
                                         −          −vi vl yil     cos(θi − θj − δij − θshif t,il ) +        (yil cos(δil ) −     )                                (10)
                                                               τij                                     τil                     2
                                             l∈N (i)
                                             i:“from”
                                           X            1                                                     bil
                                                                                                                   
                                                k k                                          k 2
                                                                                              
                                         −    −vi vl yil cos(θi − θl − δil + θshif t,il ) + vi (yil sin(δil ) − )                                                  (11)
                                                        τil                                                     2
                                             l∈N (i)
                                              i:“to”
                                                                 2 
                 ∆pki =                   pkg,i − pd,i − gs,i vik                                                                                                  (12)
                                                   "                                               k 2              #
                                             X
                                                      k k      1                                   vi
                                         +           vi vl yil sin(θi − θl − δil − θshif t,il ) +        yil sin(δil )                                             (13)
                                                              τil                                  τil
                                             l∈N (i)
                                             i:“from”
                                                 X             1                                     2
                                                                                                                    
                                         +          vik vlk yil sin(θi − θl − δil + θshif t,il ) + vik yil sin(δil )                                               (14)
                                                               τil
                                             l∈N (i)
                                              i:“to”
                                                                 2 
                 ∆qik =                    k
                                          qg,i − qd,i + bs,i vik                                                                                                   (15)
                                                  "                                                  k 2                      #
                                             X                   1                                   v i                   b il
                                         +          −vik vlk yil cos(θi − θl − δil − θshif t,il ) +        (yil cos(δil ) − )                                      (16)
                                                                τil                                  τil                    2
                                             l∈N (i)
                                             i:“from”
                                                 X              1                                     2               bil
                                                                                                                            
                                         +          −vik vlk yil cos(θi − θl − δil + θshif t,il ) + vik (yil cos(δil ) − )                                         (17)
                                                                τil                                                      2
                                             l∈N (i)
                                              i:“to”
                                                                                                                    
                                                                                              X
                θik+1 =                  θik + Lkθ vik , θik , ∆pki , ∆qik , mki ,                   φ(mkl , lineil )                                            (18)
                                                                                            l∈N (i)
                                         (
                                             vg,i ,                                                           if i has a generator
                vik+1 =                                                                                                                                           (19)
                                           vik + Lkv vik , θik , ∆pki , ∆qik , mki , l∈N (i) φ(mkl , lineil ) , otherwise
                                                                                    P
                                                                                                        
                                                                                    X
               mk+1
                i   =                    mki + Lkm vik , θik , ∆pki , ∆qik , mki ,      φ(mkl , lineil )                                                         (20)
                                                                                              l∈N (i)

Fig. 3. Graph Neural Solver equations - The color code is the same as the one used in Algorithm 1, so that each small block has its corresponding equations
in the same color. In the last three equations, and for the sake of conciseness we introduce the variable lineil = (ril , xil , bc,il , τil , θshif t,il ) which contains
all physical characteristics of line il. The function φ is another neural network that is trained jointly with the other neural network blocks.


21st Power Systems Computation Conference                                                                                  Porto, Portugal — June 29 – July 3, 2020
                                                                                      PSCC 2020
Algorithm 1 Graph Neural Solver architecture
 1: (vi0 , θi0 , m0i )i=1,...,n ← Initialization(input)
 2: λ0 ← Global Active Compensation(input, (vi0 , θi0 , m0i )i=1,...,n )
 3: (∆p0i , ∆qi0 )i=1,...,n ← Local Power Imbalance(input, (vi0 , θi0 , m0i )i=1,...,n , λ0 )
 4: Total Loss ← 0
 5: for k=1,. . . , K do
 6:       (vik , θ,ki , mki )i=1,...,n ←Neural Network Update(input, (vik−1 , θik−1 , mk−1
                                                                                         i  , ∆pk−1
                                                                                                 i     , ∆qik−1 )i=1,...,n )
            k                                               k k    k
 7:       λ ← Global Active Compensation(input, (vi , θi , mi )i=1,...,n )
 8:       (∆pki , ∆qik )i=1,...,n ← Local Power   PnImbalance(input, (vik−1 , θik−1 , mk−1
                                                                                       i   )i=1,...,n , λk−1 )
                                             K−k           k 2       k 2
 9:       Total Loss ← Total Loss +γ                i=1 (∆pi ) + (∆qi )
10: end for
11: if model is training then
12:       Perform gradient descent step: minimize Total Loss by changing neural networks
13: else
14:       return (viK , θiK )i=1,...,n


                                                                             Imbalance (cf. the blue functions of Algorithm 1). Then, the
                                                                             gradients of this Total Loss with regards to the weights of each
                                                                             Neural Network Update Lkv , Lkθ and Lkm (cf. the red functions
                                                                             of Algorithm 1) is estimated using the back-propagation rule
                                                                             [31]. The neural network φ shown in equations (18) – (20)
                                                                             is trained jointly with the others in the same process. These
                                                                             gradients are then used to modify the weights of the Neural
                                                                             Network Updates (gradient descent step). This process is
                                                                             repeated on randomly sampled power grids, until the Total
                                                                             Loss no longer decreases.
                                                                             F. Compatibility with varying input size
Fig. 4. Active production of generator i - The active power of every
generator of a power grid instance depends on a global parameter λ.
                                                                               As detailed in our first paper about Graph Neural Solver,
                                                                             any instance of our model is mathematically compatible with
is a weighted average of those Kirchhoff’s law violations and                any size and shape of power grid. Unlike fully connected
is defined by:                                                               neural networks (similar to the one in Figure 1), a graph neural
                                                                             network is able to deal with changes in the shape of its input
                             T   K        N
                     1 X X γ K−k X    k 2
                                                                             data.
   T otal Loss =                   |∆Si,t |                           (22)
                     T t=1   N i=1                                                                III. E XPERIMENTS
                               k=1

           1 X K
             T X         γ   K−k N
                                 Xh            2              2 i             This section details the experimental protocol used to val-
                                         k
       =                               ∆Pi,t        + ∆Qki,t          (23)   idate the proposed approach, the way data is generated from
           T t=1             N   i=1
                   k=1
                                                                             standard power grids, and discusses the obtained results, as
   The way this loss is integrated into the overall architecture             well as the current limits of the proposed architecture.
is shown on line 9 of Algorithm 1.                                              To validate the approach, we need a dataset of different
   The discount factor γ in Equation (23) puts a stronger                    power grids, with both injections and grid information. To
emphasis on the accuracy of the later “correction” update                    generate this dataset, we take several standard IEEE power
layers, while still allowing gradients to flow to the first few              grids, and add some noise on both injections and power line
“correction” update layers. We experimentally observed that                  parameters, which allows to generate a variety of situations.
penalizing the physical laws violation at every “correction”                 For every power line, r, x and b are uniformly sampled
update with this discount factor, instead of only the last update,           between 90% and 110% of their initial values. The ratio τ
helped converging to a better solution. γ is a hyperparameter                of each line (which are seen as transformers, see Appendix)
of the model that is chosen empirically.                                     are uniformly sampled between 0.8 and 1.2, and the shifting
   During the training process, a set of T different power grids             phase θshif t is uniformly sampled between 0.2 and −0.2 rad.
(each defined by injections, topology and line characteristics)              The maximum and minimum values of active power of each
is fed to the model (the input variable of Algorithm 1). The                 generator remain unchanged, while their voltage set points vg
model performs a prediction on each of these T power grids                   are uniformly sampled between 0.95 p.u. and 1.05 p.u., and
for v and θ of each bus. The Total Loss (i.e. physical laws                  their initial active powers P̊g are uniformly sampled between
violations) is estimated using the output of each Local Power                25% and 75% of their allowed range. The active demands Pd



21st Power Systems Computation Conference                                                            Porto, Portugal — June 29 – July 3, 2020
                                                                       PSCC 2020
are first uniformly sampled between 50% and 150% of their                    computational time, while achieving a better accuracy than
initial values, and are then all multiplied by a common factor               the DC approximation (keeping in mind that the scale is
to make sure that the demand is equal to the consumption. The                logarithmic). Orders of magnitude for the error of the DC
reactive demands Qd are uniformly sampled between 50% and                    approximation can seem surprisingly large. However, the noise
150% of their initial values. This choice of noise amplitude                 we apply on the standard power grids makes most of the
was motivated by the fact that it created situations outside                 generated data points “unrealistic”, thus pushing the snapshots
of the validity domain of the DC-approximation (as will be                   far from the domain of validity of the DC approximation.
experimentally shown by large errors for the latter method).                    In Figure 7 are displayed 2d histograms (logarithmic color
   The proposed Graph Neural Solver is compared to two                       scale): on the left part, the predicted active line flow (MW)
baselines: the DC approximation, and the Newton-Raphson                      of our Graph Neural Solver is plotted against the active line
methods of PYPOWER (which is a port of MATPOWER in                           flow (MW) output by Newton-Raphson for all 4 power grid
python). All of these 3 methods solve different versions of                  neighborhoods. On the right part, the output of the DC-
the same problem (the DC approximation ignores the reactive                  approximation is plotted against the output of a Newton-
part of the problem and the Graph Neural Solver uses an                      Raphson. One can observe that our proposed architecture has a
atypical global active equilibrium mechanism). We do not have                higher correlation with the output of a Newton-Raphson solver
access to a “ground truth”, so it is impossible to compare                   than the DC approximation on the test set (0.995 for GNS vs.
all three models to the same thing. However, we believe that                 0.876 for DC on case188).
the Newton-Raphson method can be considered as the most                         Figure 8 offers an example of how our proposed architecture
precise and trustworthy, so we have decided to compute for                   updates the v and θ of a power grid, and how it affects the
the two remaining models the percentage of error in terms                    violation of Kirchhoff’s law. It starts with a simple initial-
of active flow (which are easily deduced from v and θ) with                  ization, which cause large physical laws violation (many red
regards to the Newton-Raphson solver.4                                       buses). After 10 “correction” updates, the model has changed
                                                                             the vs and θs so as to decrease the physical laws violation
A. Standard experiment                                                       (many green buses). Figure 8 presents how a previously
   In this first experiment, our models are tested on the                    trained Graph Neural Solver modifies its predictions across
same neighborhood they were trained on. We trained 10                        its “correction” updates, and how it succeeds in iteratively
distinct instances of the proposed Graph Neural Solver (dur-                 reducing the violation of Kirchhoff’s law. The instance of
ing 10,000 learning iterations) on the neighborhoods of the                  Graph Neural Solver has 10 “correction” updates (K = 10).
following standard IEEE power grids: case9, case14, case30                   (1) On the left panel are displayed the voltages v in p.u..
and case118. The same set of hyperparameters is used for each                One can see that it is uniformly initialized at 1 p.u. except
of these instances: the amount of correction updates K is set                for buses that are connected to generators (those are set to
to 30, the latent dimension d (which is both the size of the                 their generator’s voltage set points). The voltage then evolves
messages, and the size of latent space of neural networks) is                across “correction” updates until it reaches its final prediction
set to 10, each learning block has 2 layers and leaky ReLU                   at “correction” update 10. (2) On the middle left panel are
activation function. We choose γ = 0.9. These values were                    displayed the phase angles θ in rad. It is at first initialized
chosen empirically after a random hyperparameter search. The                 at 0 rad, and then evolves across “correction” updates until
optimizer used is Adam with standard Tensorflow parameters.                  it reaches its final prediction. (3) On the two right panels
Each instance was trained using a Nvidia Titan XP GPU. A                     are displayed the Kirchhoff’s law violation |∆S| in MVA.
learning curve from these experiments is displayed in Figure                 The boxplot displays the minimum, 25th percentile, median,
5. The test set contains only power grids that converge with a               75th percentile and maximum value of the very same |∆S|
Newton-Raphson solver.                                                       displayed on the graph. One should keep in mind that the
   The results of these experiments are shown on Figure 6. On                quality of a prediction depends on the maximum value of |∆S|
the x-axis is diplayed the mean computational time required                  on the power grid. This maximum value starts quite high at
to perform one load flow, and on the y-axis is shown the                     “correction” update 0 (i.e. before any correction is applied,
% of error (20th percentile, median and 80th percentile) in                  there are many red buses), and decreases with the amount of
terms of active flow with regards to the output of a Newton-                 “correction” update (all buses are green at the end).
Raphson solver. The percentage of error can explode for small                B. Generalization from one grid to another
flows, so we only kept the 50% of largest values in absolute
value. Circles correspond to case9, stars to case14, triangles                  As demonstrated in previous work [1], a key property of
to case30 and squares to case118. Since the % of error is                    the proposed architecture is the ability to learn on one power
a comparison with the Newton-Raphson, the error of this                      grid and perform decently on another, regardless of their sizes
solver are indeed at 0%. One can observe from this plot that                 or shapes. Even though this aspect is not the key point of the
our proposed method offers a significant gain in terms of                    present paper, we display on Figure 9 some statistics about
                                                                             the percentage of error with regards to a Newton-Raphson
  4 One should keep in mind that our proposed architecture does not try to   solver for models that were trained in the neighborhood of one
imitate a Newton-Raphson solver.                                             power grid and then tested on the neighborhood of another.



21st Power Systems Computation Conference                                                            Porto, Portugal — June 29 – July 3, 2020
                                                                       PSCC 2020
                                                                                 The results on the diagonal (red points) are indeed the same
                                                                                 as those that were shown in Figure 6. From this figure we
                                                                                 can see that model instances trained on large power grids tend
                                                                                 to have a good accuracy on smaller power grids, while the
                                                                                 opposite is not necessarily true.
                                                                                 C. Discussion
                                                                                    A major limitation of the current work is the lack of
Fig. 5. Total Loss - Evolution of the Total Loss (see equations 22 and 23)       empirical proof of the scalability of the approach on real-life
during the training of one instance of our Graph Neural Solver. It was trained   power grids, and is an open area for future research.
on a single Nvidia Titan XP GPU, which took approximately 80 minutes.
                                                                                    Some preliminary work has been done as preparation of the
                                                                                 present paper to generate power grids with completely random
                                                                                 injections, topologies and line characteristics so as to observe
                                                                                 a much broader domain during the training of the model.
                                                                                 However, it appears that many of those generated power grids
                                                                                 were unrealistic and lead to non-converging situations. We
                                                                                 have thus chosen to stay within the neighborhood of well
                                                                                 known power grids.
                                                                                    The goal of the present paper is to provide empirical
                                                                                 validation of the viability of the approach, and to try to develop
                                                                                 some intuitions about how it works. The theoretical aspects
                                                                                 and properties should be further developed in future work,
                                                                                 so as to build more confidence in the behavior of this novel
                                                                                 method as well as its proper use.
                                                                                    An important challenge that will be faced in the near future
                                                                                 is the implementation of reactive power limits for generators.
                                                                                 The most straightforward option would be to hard-code an
Fig. 6. Computational time vs. Method accuracy - Our proposed method             iterative behavior that checks if limits are crossed, in which
provides a substantial gain in terms of computational time, while keeping the    case a new computation would be run. Another option would
error reasonably low.
                                                                                 be to incorporate those reactive power limits as input of our
                                                                                 neural network architecture, and incorporating the crossing of
                                                                                 such a limit directly into the loss.
                                                                                                        IV. C ONCLUSION
                                                                                    The main contribution of this paper is to propose a novel
                                                                                 method that lies at the direct interface between artificial intelli-
                                                                                 gence and optimization to solve the power flow problem, while
                                                                                 achieving a substantial gain in terms of computational time.
                                                                                 A traditional algorithm such as Newton-Raphson has access
                                                                                 to the global objective function and proceeds by iteratively
                                                                                 minimizing it by directly acting on the variables. Our method
                                                                                 is quite different: the model (which is basically a long differ-
                                                                                 entiable function) tries to predict v and θ everywhere, based on
                                                                                 the inputs (injections and power grid characteristics). During
                                                                                 training, the model tries to minimize the global objective
                                                                                 function, not by directly acting on the variables v and θ like
                                                                                 traditional optimization, but by altering itself (i.e the weights
                                                                                 of its neural network blocks). The model learns a behavior that
                                                                                 minimizes at test time (inference time) the desired objective,
                                                                                 without solving an optimization problem.
                                                                                    We experimentally demonstrated that our proposed architec-
                                                                                 ture is able to perform predictions faster than methods such
                                                                                 as the DC approximation or Newton-Raphson (up to 2 orders
                                                                                 of magnitude). Predictions in terms of active flow are very
                                                                                 similar to the output of a Newton Raphson method (correlation
Fig. 7. Correlation with Newton-Raphson output - Our proposed method
is generally more linearly correlated with the Newton-Raphson.                   of 0.995 on case118).



21st Power Systems Computation Conference                                                                 Porto, Portugal — June 29 – July 3, 2020
                                                                           PSCC 2020
Fig. 8. Graph Neural Solver on the IEEE case30 - A previously trained instance of Graph Neural Solver minimizes the violation of Kirchhoff’s law by
altering its predictions for v and θ.

                                                                              the door for gradient-descent optimization and control of the
                                                                              power grid. This could also become a part of a larger decision
                                                                              making process.
                                                                                 Another possible avenue would be to train a decentralized
                                                                              controller to perform some kind of optimization (such as
                                                                              minimizing power losses caused by Joule’s effect), jointly with
                                                                              the learning of the Graph Neural Solver. There would be two
                                                                              different types of gradients flowing through the architecture:
                                                                              one that aims at minimizing the violation of Kirchhoff’s law
                                                                              and that affect the neural network updates of v and θ, and
                                                                              another one that would minimize another objective function,
                                                                              and that would affect the decentralized controllers.

                                                                                                     ACKNOWLEDGMENT
                                                                                 We would like to thank Patrick Panciatici, Louis Wehenkel
                                                                              and Patrick Pérez for insightful discussions. We would like
                                                                              to thank Vincent Barbesant, Laure Crochepierre, Stéphane
                                                                              Fliscounakis, Camille Pache and Wojciech Sitarz for their
                                                                              attentive reviews and remarks.
Fig. 9. From one case to another - In this plot, we display the % of error                                R EFERENCES
(20th percentile, median, 80th percentile) in active flow with regards to a
Newton-Raphson solver for Graph Neural Solver models that were trained on     [1] Balthazar Donon, Benjamin Donnot, Isabelle Guyon, and Antoine
the neighborhood of one power grid, and then tested on another.                   Marot. Graph neural solver for power systems. In International Joint
                                                                                  Conference on Neural Networks, 2019.
                                                                              [2] D2.2 - guidelines for implementing the new reliability assessment and
   An interesting property of our Graph Neural Solver is                          optimization methodology. GARPUR consortium, 2016.
the fact that it will always predict something. Some early                    [3] E. Karangelos and L. Wehenkel. Probabilistic reliability management
                                                                                  approach and criteria for power system real-time operation. In Power
experiments showed that in the case where a classical solver                      Systems Computation Conference (PSCC), 2016.
simply does not converge, the graph neural solver shows the                   [4] T.T. Nguyen. Neural network load-flow. IEE Proceedings - Generation,
voltage magnitudes collapse everywhere on the grid. This                          Transmission and Distribution, 1995.
                                                                              [5] Benjamin Donnot, Isabelle Guyon, Marc Schoenauer, Antoine Marot,
empirical observation should be investigated in the future, but                   and Patrick Panciatici. Fast Power system security analysis with Guided
falls out of the scope of the present paper.                                      Dropout. In European Symposium on Artificial Neural Networks, 2018.
                                                                              [6] B. Donnot, B. Donon, I. Guyon, L. Zhengying, A. Marot, P. Panciatici,
   Our Graph Neural Solver is a fully differentiable function                     and M. Schoenauer. Leap nets for power grid perturbations. In European
that predicts power flows. This differentiability could open                      Symposium on Artificial Neural Networks, 2019.




21st Power Systems Computation Conference                                                                Porto, Portugal — June 29 – July 3, 2020
                                                                        PSCC 2020
 [7] L. Duchesne, E. Karangelos, and L. Wehenkel. Using machine learning               Zheng. TensorFlow: Large-scale machine learning on heterogeneous
     to enable probabilistic reliability assessment in operation planning. In          systems, 2015. Software available from tensorflow.org.
     Power Systems Computation Conference (PSCC), 2018.                           [30] R. D. Zimmerman, C. E. Murillo-Sanchez, and R. J. Thomas. Matpower:
 [8] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning.                Steady-state operations, planning, and analysis tools for power systems
     MIT Press, 2016. http://www.deeplearningbook.org.                                 research and education. IEEE Transactions on Power Systems, 2011.
 [9] J. Schmidhuber. Deep learning in neural networks: An overview. Neural        [31] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David
     Networks, 2015.                                                                   Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Gen-
[10] Thomas N. Kipf, Ethan Fetaya, Kuan-Chieh Wang, Max Welling, and                   erative adversarial nets. In Advances in Neural Information Processing
     Richard S. Zemel. Neural relational inference for interacting systems. In         Systems 27. 2014.
     Proceedings of the 35th International Conference on Machine Learning,
     ICML, 2018.                                                                                                 A PPENDIX
[11] Jonathan Tompson, Kristofer Schlachter, Pablo Sprechmann, and Ken
     Perlin. Accelerating eulerian fluid simulation with convolutional net-          In this appendix we give more details about the MATLAB
     works. ArXiv Computing Research Repository, 2016.                            notations and modelling that we use.
[12] Iris Cong, Soonwon Choi, and Mikhail D. Lukin. Quantum convolutional            1) Base apparent power: Each power grid instance has a
     neural networks. 2019.
[13] Kyle Mills, Michael Spanner, and Isaac Tamblyn. Deep learning and            parameter baseM V A in MVA which will serve to normalize
     the schrödinger equation. Physical Review A, 2017.                          the power. We will denote by P , Q, S the active, reactive and
[14] Kristof Schütt, Pieter-Jan Kindermans, Huziel Enoc Sauceda Felix, Ste-      complex power in respectively MW, MVAr and MVA, and by
     fan Chmiela, Alexandre Tkatchenko, and Klaus-Robert Müller. Schnet:
     A continuous-filter convolutional neural network for modeling quantum        p, q and s their equivalents normalized by baseM V A. The
     interactions. In Advances in Neural Information Processing Systems 30.       same convention is used for the shunt active and reactive power
     2017.                                                                        Gs and Bs (thus denoted by gs and bs in their normalized
[15] Peter W. Battaglia and al. Relational inductive biases, deep learning,
     and graph networks. ArXiv Computing Research Repository, 2018.               versions), and should not be mistaken with the total charging
[16] Franco Scarselli, Marco Gori, Ah Chung Tsoi, Markus Hagenbuchner,            susceptance of a line bc .
     and Gabriele Monfardini. The graph neural network model. IEEE                   2) Bus: Loads and shunts are encoded at the bus. A bus
     Transactions on Neural Networks and Learning Systems, 2009.
[17] Yujia Li, Daniel Tarlow, Marc Brockschmidt, and Richard S. Zemel.            is thus defined by: its index i, its active load Pd,i (MW), its
     Gated graph sequence neural networks. ArXiv Computing Research               reactive load Qd,i (MVAr), its active shunt load Gs,i (MW at
     Repository, 2015.                                                            1.0 p.u.) and its reactive shunt load −Bs,i (MVAr at 1.0 p.u.).
[18] Justin Gilmer, Samuel S. Schoenholz, Patrick F. Riley, Oriol Vinyals,
     and George E. Dahl. Neural message passing for quantum chemistry.               3) Line: MATPOWER adopts a common framework for
     ArXiv Computing Research Repository, 2017.                                   transmission lines and transformers. A line is thus defined by:
[19] Roei Herzig, Moshiko Raboh, Gal Chechik, Jonathan Berant, and Amir           its “from” bus index, its “to” bus index, its resistance r (p.u.),
     Globerson. Mapping images to scene graphs with permutation-invariant
     structured prediction. ArXiv Computing Research Repository, 2018.            its reactance x (p.u.), its total charging susceptance bc (p.u.),
[20] Manzil Zaheer, Satwik Kottur, Siamak Ravanbakhsh, Barnabás Póczos,         its transformation ratio τ (p.u.) and its shift angle θshif t (rad).
     Ruslan Salakhutdinov, and Alexander J. Smola. Deep sets. ArXiv               Transformers are on the “from” side.
     Computing Research Repository, 2017.
[21] Joan Bruna, Wojciech Zaremba, Arthur Szlam, and Yann Lecun. Spectral
                                                                                     We also use the convention to write each of these quantities
     networks and locally connected networks on graphs. In International          with the subscript ij to denote the line between buses i and j.
     Conference on Learning Representations, 2014.                                Moreover we define the following quantities: yij = √ 21 2
[22] Michaël Defferrard, Xavier Bresson, and Pierre Vandergheynst. Convo-                                                                           rij +xij
     lutional neural networks on graphs with fast localized spectral filtering.   and δij = atan2(rij , xij ).
     In Advances in Neural Information Processing Systems 29. 2016.                  4) Generator: A generator is defined by: its bus index i,
[23] David K. Duvenaud, Dougal Maclaurin, Jorge Aguilera-Iparraguirre,
     Rafael Gómez-Bombarelli, Timothy Hirzel, Alán Aspuru-Guzik, and            its maximum provided active power P g,i (MW), its minimum
     Ryan P. Adams. Convolutional networks on graphs for learning                 provided active power P g,i (MW), its active power set point
     molecular fingerprints. ArXiv Computing Research Repository, 2015.
                                                                                  P̊g,i (MW) and its voltage set point Vg,i (p.u.).
[24] William L. Hamilton, Rex Ying, and Jure Leskovec. Inductive represen-
     tation learning on large graphs. ArXiv Computing Research Repository,
     2017.
[25] Thomas N. Kipf and Max Welling. Semi-supervised classification with
     graph convolutional networks. ArXiv Computing Research Repository,
     2016.
[26] Adam Santoro, David Raposo, David G. T. Barrett, Mateusz Malinowski,
     Razvan Pascanu, Peter Battaglia, and Timothy P. Lillicrap. A simple
     neural network module for relational reasoning. ArXiv Computing
     Research Repository, 2017.
[27] M. M. Bronstein, J. Bruna, Y. LeCun, A. Szlam, and P. Vandergheynst.
     Geometric deep learning: Going beyond euclidean data. IEEE Signal
     Processing Magazine, 2017.
[28] S. J. Pan and Q. Yang. A survey on transfer learning. IEEE Transactions
     on Knowledge and Data Engineering, 2010.
[29] Martı́n Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng
     Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu
     Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey
     Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser,
     Manjunath Kudlur, Josh Levenberg, Dandelion Mané, Rajat Monga,
     Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon
     Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vin-
     cent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete
     Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang




21st Power Systems Computation Conference                                                                     Porto, Portugal — June 29 – July 3, 2020
                                                                            PSCC 2020
