
IEICE TRANSACTIONS on Fundamentals

Open Access
Operational Resilience of Network Considering Common-Cause Failures

Tetsushi YUGE, Yasumasa SAGAWA, Natsumi TAKAHASHI


Summary :

This paper discusses the resilience of networks based on graph theory and stochastic processes. The target network is an electric power network in which edges may fail simultaneously and whose performance is measured by the ratio of connected nodes. For restoration, under the constraint that resources are limited, the failed edges are repaired one by one, and when several edges have failed, priority is given to the edge whose repair yields the largest increase in system performance. Two types of resilience are discussed: one is resilience in the recovery stage, following the conventional definition of resilience, and the other is steady-state operational resilience, which considers long-term operation in which the network state changes stochastically. The second represents the comprehensive resilience capacity of a system and is derived analytically by Markov analysis. In the analysis, we assume that large-scale disruptions occur through the simultaneous failure of edges caused by common-cause failures. The Marshall-Olkin type shock model and the α factor method are incorporated to model the common-cause failures. Two resilience measures, “operational resilience” and “operational resilience in recovery stage”, are then proposed. We also propose approximation methods to obtain these two operational resilience measures for complex networks.

Publication
IEICE TRANSACTIONS on Fundamentals Vol.E107-A No.6 pp.855-863
Publication Date
2024/06/01
Publicized
2023/09/11
Online ISSN
1745-1337
DOI
10.1587/transfun.2023EAP1011
Type of Manuscript
PAPER
Category
Reliability, Maintainability and Safety Analysis

1.  Introduction

Modern society depends on various critical infrastructures such as energy, water, communication and transportation systems. An increasing number of high-impact and unpredictable events are affecting these systems, with severe consequences for our society. The 2011 disaster at the Fukushima Daiichi nuclear power plant is a representative example. The traditional approach to reliability and risk analysis relies on the identification of hazards and the development of subsequent scenarios, and cannot cope with unknown threats or unpredictable events. In contrast, the concept of resilience focuses on the recovery action after a severe disruption, whether predictable or unpredictable, in addition to conventional reliability analysis.

Resilience was first defined by the ecologist Holling as the measure of persistence that results from the ability of a system to absorb change [1]. For engineering systems, Bruneau et al. proposed a seminal definition of resilience as the ability of a system to reduce the chances of a shock, to absorb a shock if it occurs and to recover quickly after a shock [2]. More specifically, a resilient system is one that shows (i) reduced failure probabilities, (ii) reduced consequences from failures, and (iii) reduced time to recovery.

A broad measure of resilience that captures these key features can be expressed by the concept illustrated in Fig. 1. The measure \(P(t)\) is defined for the performance of the infrastructure, such as the number of normally operating components within the infrastructure system. Specifically, performance can range from 0% to 100%, where 100% means no degradation in service and 0% means no service available.

Fig. 1  Conceptual definition of resilience.

The performance response process in Fig. 1 can be divided into three stages [3]. The first stage is the disaster prevention stage, covering the range \((0, t_0)\) in Fig. 1. The resistant capacity of the system to prevent possible hazards and maintain 100% system performance is an important aspect of this stage. Conventional reliability techniques such as robustness and redundancy can be applied to improve the resilience in this stage [4]. The second stage, \((t_0, t_1)\), is the damage propagation stage. If a shock, such as an earthquake, occurs at time \(t_0\), it could cause damage to the system such that the performance decreases to 50% at time \(t_1\). The ability to absorb the impact of the initial damage and to minimize the consequences is required in this stage. The maximum degradation level, 100%-50% in Fig. 1, is used to measure the absorptive capacity of the system. The third stage, \((t_1, t_2)\), is the recovery process. During this period, information on system damage is collected and recovery resources are allocated to restore performance. Resourcefulness and rapidity of the recovery action are the main properties. Recovery time and recovery cost represent the restoration capacity and help characterize resilience in this stage. The “R4 framework” comprehensively evaluates the four key properties: robustness, redundancy, resourcefulness and rapidity [2].

The three stages constitute a typical system response cycle to disruptions. To enhance system resilience, improvement strategies should be applied to all three stages [3]. Although reliability engineering has been used to enhance resilience in the first and second stages, resilience in the third stage has received less attention. To improve resilience in the third stage, several strategies have been proposed, such as establishing efficient communication channels and coordinating rapid recovery response [5], and improving decision support platforms to quickly and accurately identify feasible recovery strategies [6].

For the measure of resilience in the third stage, Caputo et al. [7] defined the resilience subjected to restoration as follows,

\[\begin{equation*} R={1 \over t_2-t_0}\int_{t_0}^{t_2} P(t) dt. \tag{1} \end{equation*}\]

It is defined as the mean performance of the system between the shock occurrence and the completion of restoration. Here \(P(t)\) in the restoration period depends on resourcefulness, that is, the capacity to identify problems, establish priorities and mobilize resources (material and human). Under the condition that the available resources are limited, the mean repair time \(t_2-t_0\) is mainly decided by the initial degradation level. In this case, the restoration strategy, especially the prioritization of repairs, plays a significant role in increasing the resilience in Eq. (1). The black line in Fig. 1 shows the recovery process following a standard restoration strategy, and the blue (red) line is that of a superior (inferior) strategy. The total recovery times for the three strategies are the same because of limited repair resources, but the performance levels during restoration differ.
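As a rough illustration of Eq. (1) for a stepwise recovery curve, the following Python sketch computes the time-averaged performance for two hypothetical repair orders with the same total recovery time. The function name `resilience`, the breakpoints and the performance levels are illustrative assumptions, not values taken from this paper.

```python
def resilience(times, levels):
    """Eq. (1) for a piecewise-constant P(t).

    times  -- breakpoints t_0 < ... < t_k (shock at times[0], restoration done at times[-1])
    levels -- performance held on each interval [times[i], times[i+1]), between 0 and 1
    """
    if len(times) != len(levels) + 1:
        raise ValueError("need exactly one more breakpoint than levels")
    # Area under the step curve divided by the length of the recovery period.
    area = sum(p * (b - a) for p, a, b in zip(levels, times, times[1:]))
    return area / (times[-1] - times[0])


# Hypothetical data: identical repair times, different repair orders.
times = [0.0, 1.0, 2.0, 3.0, 4.0]
superior = resilience(times, [0.5, 0.8, 0.9, 0.95])
inferior = resilience(times, [0.5, 0.55, 0.65, 0.95])
print(superior, inferior)
```

With equal total recovery time, the order that restores more performance early gives the larger value, which is the effect the blue and red curves are meant to convey.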

Many studies have been conducted to quantify or evaluate the resilience of engineering systems based on the definitions in [2], [7]. Many of them evaluate the resilience of the system by simulation under the occurrence of a specific disaster. As a representative example, Guzs et al. used a concrete example of a power grid that is not connected to other networks to obtain resilience under several scenarios of disaster occurrence and restoration [8]. Ziwei et al. evaluated the resilience of a train operation management system when the repair time follows a log-normal distribution [9]. Ganin et al. calculated the number of operating nodes using a graph theory approach and calculated resilience [10]. As an example of multiple hazards, Cimellaro defined resilience as the area under the system’s performance curve for a specified period of time [11]. In addition, Ouyang et al. calculated the average resilience of a power grid network for one year considering failures of substations in the network due to multiple types of disasters [3]. They assumed that the arrival of each disaster follows a Poisson distribution, and that five types of damage occur according to the pre-determined probabilities of the occurrence of the disaster. In addition, the time to completion of repair, which greatly affects the average resilience, was determined by random numbers following a normal distribution. Other research on resilience involves mathematical modeling of the recovery process, that is, determining the performance curve under restoration as a function [12], [13]. Past research on resilience analysis has focused on simulation-based analysis, even though some studies include a stochastic process for the deterioration. No analytical solution for resilience has been proposed. The reasons for this are the difficulty of modeling the occurrence of large-scale disasters, the difficulty of defining the relationship between equipment failure and system performance, and the difficulty of modeling the restoration process, with its various factors, as a stochastic process. Another problem with resilience studies is that the resilience in each of the three stages is evaluated separately, and there is no comprehensive measure to evaluate the total system resilience.

This paper discusses the resilience of infrastructure networks. The target network is an electric power network in which only edges may fail and whose performance is measured by the ratio of connected nodes, that is, nodes (vertices) that have a path from the source node.

The modeling of restoration is a key factor in analyzing resilience. We assume that the resources for restoration are limited and that failed edges are repaired one by one. In addition, as a restoration strategy, a priority repair policy is adopted to determine the order of repair when several edges have failed: priority is given to the edge whose repair yields the largest increase in system performance.

Two types of resilience are discussed in this paper: one is the resilience in the recovery stage and the other is the resilience that integrates the resilience capacities of the three stages. The first is based on Eq. (1) and focuses on resilience in the recovery process after a single disruption. This resilience is easily formulated under the assumptions that the deterioration of network performance is given and that restoration is conducted for all failed edges one by one. It is presented in Sect. 3. The second is obtained by considering long-term operation in which the network state changes stochastically. In this case, the resilience measure should take into account the resilience capabilities of the first and second stages, not only the third stage. The main feature of this paper is to propose a stochastic process model that represents the outbreak of large-scale disruption and can be solved analytically. The common-cause failure (CCF) is incorporated to represent the occurrence of disruption stochastically. CCF is one of the primary factors of large-scale disruption and is widely studied in the field of reliability and risk engineering [14]-[16]. It invalidates the redundancy of a system through the occurrence of a single root cause event. The Marshall-Olkin type shock model is one of the mathematical models used to evaluate CCF [17], [18]; in this model, a CCF is assumed to occur as the result of an external shock. Two measures, “operational resilience” and “operational resilience in recovery stage”, are introduced in Sect. 4 to evaluate the resilience under long-term operation. Operational resilience in recovery stage is the resilience in the recovery periods within the entire operational period. It is a natural extension of the resilience in Eq. (1) to long-term operation. Operational resilience, on the other hand, is a new comprehensive resilience measure. It evaluates the resilience of the network over the entire operational period, including its ability to maintain 100% network performance. A Markov process is used to obtain analytical solutions for both operational resilience measures under the assumption that the occurrence rate of CCFs and the repair rate are constant. We also propose two approximation methods for complex networks and verify their applicability and accuracy.

The rest of this paper is organized as follows. In Sect. 2, the target networks are clarified and the performance measure and the repair strategy are defined. Section 3 discusses the resilience in the recovery process when a network suffers a large-scale disruption. The resilience in this section focuses only on restoration from simultaneous failures that are deterministically inflicted. In this case, we show that resilience can be formulated mathematically. The effect of the priority repair strategy is also verified in this section. Section 4 discusses the resilience when large-scale disruptions occur probabilistically. The operational resilience and the operational resilience in recovery stage are defined. The two operational resilience measures are derived by both Monte Carlo simulation and Markov analysis. The exact solution and two approximation methods for the Markov analysis are proposed. The accuracy of the approximations is verified by numerical examples. We summarize our work in Sect. 5.

2.  Model Description

2.1  Network

We define \(G=(V,E)\) as a given network, where \(V=\{v_1,v_2,\ldots,v_n\}\) is a set of nodes, \(E=\{e_1,e_2,\ldots,e_m \}\) is a set of edges and \(n\) (\(m\)) is the number of nodes (edges). Node \(v_1\) is a source node. Each edge is directed or bidirectional. To simplify the explanation, we focus on edge failures, though the concept may be extended to include node failures. The state of the network is then described by an \(m\)-dimensional vector whose elements are the binary states of the edges (except in Sect. 4.4, where the definition of state is changed for the approximation).

2.2  Performance Measure

The measure \(P(t)\) is the ratio of operational nodes among the total \(n\) nodes, where a node is operational if at least one path from node \(v_1\) to it exists. If the states of all edges are fixed, the performance measure \(P(t)\) is derived by using the reachability matrix of the network. Let \(G'(t)\) be the subgraph obtained by removing the non-functional edges and nodes from \(G\) at time \(t\), and \(A(t)\) be the adjacency matrix of \(G'(t)\), whose entry \(a_{ij}(t)\) is binary: if node \(i\) is adjacent or directly connected to node \(j\) at time \(t\), then \(a_{ij}(t)=1\); otherwise \(a_{ij}(t)=0\). The reachability matrix \(A_r (t)\) is given as follows by Boolean operations,

\[\begin{equation*} A_r(t)=(I+A(t))^{n-1}, \tag{2} \end{equation*}\]

where, \(I\) is the \(n\times n\) identity matrix. \(A_r (t)\) is a binary matrix and the first row of the matrix represents the reachability from node \(v_1\) to other nodes. Then \(P(t)\) is given as follows,

\[\begin{equation*} P(t)={\pi_0 A_r(t) u^{\mathrm{T}} \over n}, \tag{3} \end{equation*}\]

where \(\pi_0=(1,0,\ldots,0), u=(1,1,\ldots,1)\). Obtaining \(P(t)\) by Eq. (3) is inefficient because it is necessary to calculate the \((n-1)\)th power of a matrix built from the adjacency matrix. Therefore, the following simple recursive algorithm is useful.

Algorithm 1 (number of connected nodes)

Input: Adjacency matrix, \(A(t)\)

Output: number of connected nodes, \(s(t)\)

  1. Let \(S(t)\) be an empty array and add the source node.
  2. Choose one node in \(S(t)\), say node \(i\).
  3. Add node \(j\) to \(S(t)\) if the \((i,j)\) element of \(A(t)\) equals 1 and \(j\notin S(t)\).
  4. Repeat 2 and 3 for all nodes in \(S(t)\).
  5. \(s(t)=|S(t)|\).
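
The steps of Algorithm 1 can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: it uses a work queue instead of recursion, indexes the source node as 0, and cross-checks the result against the Boolean power of Eqs. (2) and (3). The function names and the 4-node example matrix are assumptions made for this sketch.

```python
from collections import deque


def connected_nodes(adj, source=0):
    """Algorithm 1: number of nodes reachable from the source.

    adj is the n x n 0/1 adjacency matrix of the surviving graph G'(t).
    """
    reached = {source}          # the set S(t)
    queue = deque([source])     # nodes of S(t) whose rows are still to be scanned
    while queue:
        i = queue.popleft()
        for j, a in enumerate(adj[i]):
            if a == 1 and j not in reached:
                reached.add(j)
                queue.append(j)
    return len(reached)


def performance(adj, source=0):
    """P(t): ratio of connected nodes."""
    return connected_nodes(adj, source) / len(adj)


def performance_by_reachability(adj, source=0):
    """Eqs. (2)-(3): Boolean power (I + A)^(n-1), then sum the source row."""
    n = len(adj)
    base = [[1 if (i == j or adj[i][j]) else 0 for j in range(n)] for i in range(n)]
    reach = [row[:] for row in base]
    for _ in range(n - 2):      # base is already the first power
        reach = [[1 if any(reach[i][k] and base[k][j] for k in range(n)) else 0
                  for j in range(n)] for i in range(n)]
    return sum(reach[source]) / n


# Hypothetical example: path 0-1-2 with bidirectional edges, node 3 isolated.
example_adj = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
]
print(connected_nodes(example_adj), performance(example_adj))
```

The search touches each matrix row at most once, whereas the Boolean power repeats a full matrix product \(n-2\) times, which is the inefficiency noted for Eq. (3).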

2.3  Restoration of Edge by Priority Repair Policy

Restoration begins as soon as edges fail, i.e., \(t_0=t_1\) in Fig. 1. Under the constraint that resources are limited, the failed edges are repaired one by one. The restoration time of an edge is exponentially distributed with parameter \(\mu\). The performance curve in the recovery stage is therefore a step function, as shown in Fig. 1. When several edges have failed, the order of repair gives priority to the edge whose repair yields the largest system performance after its completion. Let \(Q(t)\) be the set of failed edges at time \(t\). The optimal repair edge \(e^*\) is selected as follows,

\[\begin{equation*} e^*=\arg\max_{e \in Q(t)}P(t'), \tag{4} \end{equation*}\]

where \(t'\) is the time at which the restoration of edge \(e\) is completed.
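
The selection rule of Eq. (4) can be sketched in Python as below. This is an illustrative reading of the policy under several assumptions: edges are undirected pairs `(u, v)`, the source is node 0, and ties are broken by taking the first edge in sorted order so that the output is reproducible (the paper's examples pick randomly among tied edges). All function names and the 5-node example are hypothetical.

```python
def reachable_count(n, edges, source=0):
    """Number of nodes reachable from the source over undirected edges (u, v)."""
    neighbours = {v: set() for v in range(n)}
    for u, v in edges:
        neighbours[u].add(v)
        neighbours[v].add(u)
    seen, stack = {source}, [source]
    while stack:
        for w in neighbours[stack.pop()]:
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return len(seen)


def best_repair_edges(n, working, failed, source=0):
    """Eq. (4): failed edges whose repair gives the largest number of connected nodes."""
    count_after = {e: reachable_count(n, set(working) | {e}, source) for e in failed}
    best = max(count_after.values())
    return [e for e in sorted(failed) if count_after[e] == best], best


def repair_order(n, working, failed, source=0):
    """Repair failed edges one by one, re-applying Eq. (4) after every repair."""
    working, failed = set(working), set(failed)
    order = []
    while failed:
        candidates, _ = best_repair_edges(n, working, failed, source)
        edge = candidates[0]    # assumption: deterministic tie-break instead of random
        order.append(edge)
        failed.remove(edge)
        working.add(edge)
    return order


# Hypothetical network: nodes 1, 2, 3 form a cluster cut off from source 0.
working = {(1, 2), (1, 3)}
failed = {(0, 1), (0, 4)}
print(best_repair_edges(5, working, failed))   # edge (0, 1) reconnects four nodes
print(repair_order(5, working, failed))
```

Re-evaluating the rule after each repair matters because the gain of an edge depends on which other edges are already working.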

3.  Resilience for a Specific Disruption

This section discusses the resilience when a specific set of \(d\) edges has failed in a large-scale disruption. Note that the occurrence of the disruption is deterministic and we discuss only the recovery process after a given simultaneous failure in this section.

Let state \(i\) be the network state in which the \(i\)-th repair (\(i=1,2,\ldots,d\)) of an edge is in progress, and let \(p_i(t)\) be the probability of that state at time \(t\), where \(t\) is the elapsed time from the disruption. The initial state is state 1, which is one of the \(\binom{m}{d}\) states in which the network has \(d\) failed edges. The states visited from \(t=0\) are numbered with consecutive integers, so that \(i\) also counts the repairs from the beginning. The counting process \(i(t)\) follows a birth process with parameter \(\mu\) in our model and the probability is given as,

\[\begin{equation*} p_i(t)={(\mu t)^{i-1} \over (i-1) ! } e^{-\mu t} \quad {\rm for}\quad i=1,\ldots,d \tag{5} \end{equation*}\]

Note the performance \(P(t)\) at time \(t\) is piecewise constant and depends only on the network state. Let \(P_i\) be the performance of state \(i\). The expected repair time of all \(d\) edges is \(d/\mu\). The mean resilience in recovery stage, \(\bar R\), is given as follows,

\[\begin{equation*} \bar R={\mu \over d} \sum_{i=1}^d P_i \int_0^\infty p_i(t) dt={1 \over d}\sum_{i=1}^d P_i. \tag{6} \end{equation*}\]

The last equality in Eq. (6) follows from \(\int_0^\infty p_i(t)\,dt=1/\mu\), a property of the Erlang distribution. Equation (6) shows that the resilience does not depend on the parameter \(\mu\) and is given by the mean performance during the process.

Equation (6) suggests that the resilience does not depend on the distribution of the restoration time if the capacity of restoration is limited and failed edges are repaired one by one. It depends only on the performance levels decided by the restoration order. In general, if \(M(t)\) is the distribution of the restoration time with mean \(\tau\), the mean duration of each state \(i\) (\(i=1,2,\ldots, d\)) is \(\tau\), and the mean resilience after a disruption with \(d\) failed edges is

\[\begin{equation*} \bar R = {1 \over \tau d} \sum_{i=1}^d \tau P_i = {1 \over d}\sum_{i=1}^d P_i. \tag{7} \end{equation*}\]
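
The distribution-free form of Eqs. (6) and (7) can be checked numerically with a short Python sketch. It is only an illustration under stated assumptions: the eight performance levels (2, 6, 7, 8, 9, 9, 9, 9 connected nodes out of 9) are illustrative inputs, repair durations are drawn independently from the same distribution, and the simulation averages Eq. (1) over sampled recoveries. The function names, sample size and seed are hypothetical choices.

```python
import random


def mean_recovery_resilience(levels):
    """Eqs. (6)/(7): the average of the per-state performance levels P_1..P_d."""
    return sum(levels) / len(levels)


def simulated_resilience(levels, draw_duration, samples=5000, seed=1):
    """Average of Eq. (1) over sampled recoveries with i.i.d. repair durations."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(samples):
        durations = [draw_duration(rng) for _ in levels]
        area = sum(p * t for p, t in zip(levels, durations))
        total += area / sum(durations)
    return total / samples


# Illustrative levels: connected nodes out of 9 while each of 8 repairs is in progress.
levels = [k / 9 for k in (2, 6, 7, 8, 9, 9, 9, 9)]
exact = mean_recovery_resilience(levels)

# Exponential repairs with two different rates, and a non-exponential alternative.
exp_fast = simulated_resilience(levels, lambda rng: rng.expovariate(1.0))
exp_slow = simulated_resilience(levels, lambda rng: rng.expovariate(0.1))
uniform = simulated_resilience(levels, lambda rng: rng.uniform(1.0, 3.0))
print(exact, exp_fast, exp_slow, uniform)
```

All three simulated values should land close to the closed-form average, which is consistent with the claim that only the order of repair, and hence the sequence of levels, matters.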

Example 1

The power network in Fig. 2 is analyzed as a sample network. This network has 9 nodes and 14 edges. The edges are bidirectional. Node \(v_1\) is connected to a power plant and the line between the plant and \(v_1\) is assumed to be reliable. Even if one edge goes down by accident, all nodes have at least one path from node \(v_1\). Such a redundant structure is called \(N\)-1 security in power systems [4].

Fig. 2  Example of power network [19].

Suppose that 8 edges \(e_1, e_2, e_4, e_5, e_6, e_7, e_{11}, e_{14}\) fail simultaneously at \(t=0\). Table 1 shows the repair edge and the performance for each state following the priority repair.

Table 1  Repair edge and performance measure.

State 1 is the initial state with performance \(2/9\). First, \(e_7\) is selected by the priority repair policy, because the performance will recover to 6/9 after the repair is completed. State 1 is thus also the state in which the restoration of \(e_7\) is in progress. After the repair of \(e_7\) is completed, none of the other failed edges has priority in this example. In this case, an edge is randomly selected and repaired. The resilience for the restoration order in Table 1 is \((2+6+7+8+9+9+9+9)/(9 \times 8)=0.8194\) by Eq. (6). If the order of the first and second repairs is swapped, i.e., \(e_1\) is first and \(e_7\) is second, the resilience is \((2+3+7+8+9+9+9)/(9 \times 8)=0.6528\). The mean resilience is decided only by the order of repair.

Next, we confirm the effectiveness of priority restoration. We performed a Monte Carlo simulation to obtain the resilience of the network in Fig. 2, where both the number of simultaneous failures and the combination of failed edges are selected randomly. Figure 3 shows the resilience values for 1,000 samples when priority restoration is conducted. Figure 4 shows the resilience without priority restoration for comparison. In this case, the edge to be repaired is randomly selected one by one. In both figures, each “x” in blue or red shows the resilience of one sample. For example, if the number of simultaneous failures equals one, i.e., any single edge fails, the mean resilience during the restoration is 1 because of the \(N-1\) security. In this case, there is no difference between the priority and the random repair policies. However, when more than one edge fails, resilience depends on both the combination of failed edges and the repair policy. Figure 5 shows the mean values of resilience for the two repair strategies. The effect of priority restoration increases as the number of simultaneous failures increases.

Fig. 3  Resilience of simultaneous failure (priority restoration).

Fig. 4  Resilience of simultaneous failure (random restoration).

Fig. 5  Mean resilience for priority and random restoration.

4.  Operational Resilience

This section discusses network resilience as a stochastic process (a finite-state Markov chain). First, the occurrence of a disruption is defined as the result of the simultaneous failure of edges. Then resilience measures under long-term operation are defined. Monte Carlo simulation results following the definitions are demonstrated. Finally, the Markov chain solution and its approximations are presented.

4.1  Failure of Edge

All nodes are fully operational, but each edge may fail independently or simultaneously. Marshall-Olkin type shock model and \(\alpha\) factor method are applied to model CCF, as follows.

  1. An edge may fail independently or simultaneously with other edges as the result of a CCF following an external shock.
  2. There are \(2^m-1\) kinds of shocks. The shocks are independent of each other and the magnitude of a shock is divided into \(m\) levels. If a level \(r\) (\(r=1,2,\ldots,m\)) shock occurs, \(r\) randomly selected edges fail simultaneously. Note that the external shocks arrive independently of the state of the edges, i.e., an edge may be selected by a shock even if it has already failed.
  3. The distribution of the shock arrival time is exponential. The total shock occurrence rate for a given network is \(\lambda\). The CCF occurrence rate of level \(r\) shocks, \(\lambda_r\), is \(\alpha_r \lambda\). Here, \(\alpha_r\) is called an \(\alpha\) parameter of CCF. The CCF occurrence rate of a specific combination of \(r\) edges, \(\lambda'_r\), is \(\lambda_r / \binom{m}{r}\).
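
The rate bookkeeping in the three items above can be sketched in Python. The sketch assumes that the \(\alpha_r\) sum to one (so that the level rates add up to the total rate \(\lambda\)) and borrows a geometric rate pattern, \(\lambda_1=0.01\) and \(\lambda_r=\lambda_{r-1}/1.4\) with \(m=14\), purely as sample input. The function names and the sampling helper are hypothetical.

```python
import random
from math import comb


def level_rates(total_rate, alphas):
    """lambda_r = alpha_r * lambda for r = 1..m (alphas[0] is alpha_1)."""
    if abs(sum(alphas) - 1.0) > 1e-9:
        raise ValueError("assumption of this sketch: the alpha parameters sum to 1")
    return [a * total_rate for a in alphas]


def per_combination_rates(rates):
    """lambda'_r = lambda_r / C(m, r): rate of a shock on one specific r-edge subset."""
    m = len(rates)
    return [rates[r - 1] / comb(m, r) for r in range(1, m + 1)]


def sample_shock(rates, rng):
    """Draw (waiting time, set of hit edge indices) for the next shock."""
    m = len(rates)
    wait = rng.expovariate(sum(rates))                       # exponential arrivals
    r = rng.choices(range(1, m + 1), weights=rates)[0]       # shock level
    return wait, set(rng.sample(range(m), r))                # r edges chosen at random


m = 14
rates = [0.01 / 1.4 ** (r - 1) for r in range(1, m + 1)]
total = sum(rates)
alphas = [x / total for x in rates]
rebuilt = level_rates(total, alphas)
combo = per_combination_rates(rates)
wait, hit = sample_shock(rates, random.Random(0))
print(total, combo[0], len(hit))
```

Edges that have already failed are not excluded when a shock is sampled, in line with item 2.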

4.2  Operational Resilience

Operational availability is defined as Eq. (8) in reliability engineering.

\[\begin{equation*} A_O={{\rm MUT} \over {\rm MUT}+{\rm MDT}}, \tag{8} \end{equation*}\]

where MUT and MDT are the mean up time and the mean down time of the system, respectively. For infrastructure systems, up time means no degradation in service, i.e., \(P(t)=100\%\); any other time is down time. Figure 6(a) is an example of the time transition of \(P(t)\) and shows the concept of operational availability. If we consider a process in which shock occurrences and restorations repeat, MUT is the expected duration for which \(P(t)=100\%\) and MDT is the expected duration for which \(P(t)<100\%\). MDT corresponds to the mean repair time; for a system without a redundant structure it depends only on the capacity of the repair resources, and otherwise it depends on both the resources and the restoration strategy. \(A_O\) is the fraction of up time over a long lifetime, reflecting the robustness and redundancy of the system.

Fig. 6  Conceptual diagrams of resilience measures.

We propose a new resilience measure named operational resilience in this section. Operational resilience is the resilience experienced under actual conditions of operation and restoration. Let \(EP_{DT}\) be the expectation of \(P(t)\) during down time (Fig. 6(b)). Operational resilience is the average performance over a long period of time (Fig. 6(c)) and is given as follows:

\[\begin{eqnarray*} R_O&=&{{\rm MUT} +{\rm MDT}\cdot EP_{DT} \over {\rm MUT}+{\rm MDT}} \tag{9} \\ &=&1-{{\rm MDT}\cdot ED_{DT} \over {\rm MUT}+{\rm MDT}},\nonumber \end{eqnarray*}\]

where \(ED_{DT}=1-EP_{DT}\) is the expectation of degradation during down periods. In the Monte Carlo simulation, \(EP_{DT}\) can be obtained by measuring the elapsed time of each step of the simulation and calculating the performance of the corresponding state, i.e., \(P(t)\) in Eq. (3); that is, by obtaining the sum of the areas under the actual performance curve during the restoration periods. The operational resilience is a measure that represents the comprehensive resilience capacity of a system. It enables us to evaluate a system considering the resilience not only in the restoration stage (third stage in Sect. 1), but also in the disaster prevention and damage propagation stages (first and second stages). In order to increase the resilience, we have to increase MUT and decrease \(ED_{DT}\). Increasing MUT is realized by improving the robustness, redundancy and absorptive capacity of the system. These are the measures in the first and second stages. Decreasing \(ED_{DT}\) is the main subject in the third stage.

Note that the resilience in Eq. (1) is defined as the mean performance between the occurrence of disruption and the completion of restoration, assuming that the initial degradation is given and no further system degradation occurs during the restoration. \(EP_{DT}\) is a natural extension of the resilience in Eq. (1) considering the actual experience of restoration for a long time period. Here we redefine \(EP_{DT}\) as operational resilience in recovery stage and denote it as \(R_{OR}\). \(R_{OR}\) is given as follows;

\[\begin{equation*} R_{OR}={R_O-A_O \over 1-A_O}. \tag{10} \end{equation*}\]

Equation (10) is directly given by Eq. (9), i.e., \(R_O=A_O+(1-A_O) R_{OR}\).
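
To make the relation between Eqs. (8), (9) and (10) concrete, the following Python sketch evaluates the three measures from a piecewise-constant performance trajectory, as a simulation would record it. It assumes that the long-run fraction of time with \(P(t)=1\) can stand in for \({\rm MUT}/({\rm MUT}+{\rm MDT})\); the function name and the sample trajectory are hypothetical.

```python
def operational_measures(trajectory):
    """Return (A_O, R_O, R_OR) from a list of (duration, performance) segments."""
    total = sum(d for d, _ in trajectory)
    up_time = sum(d for d, p in trajectory if p >= 1.0)
    area = sum(d * p for d, p in trajectory)
    a_o = up_time / total                    # Eq. (8), as a fraction of time
    r_o = area / total                       # Eq. (9), long-run mean performance
    # Eq. (10); undefined when the system is never down.
    r_or = (r_o - a_o) / (1.0 - a_o) if a_o < 1.0 else float("nan")
    return a_o, r_o, r_or


# Hypothetical trajectory: up for 50, degraded for 10 in two steps, up for 40.
trajectory = [(50.0, 1.0), (5.0, 0.5), (5.0, 0.75), (40.0, 1.0)]
a_o, r_o, r_or = operational_measures(trajectory)
print(a_o, r_o, r_or)
```

In this example the mean performance over the 10 time units of down time is 0.625, which is the value recovered through Eq. (10).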

Example 2

For the network in Fig. 2, we set \(\lambda_1\)=0.01 and \(\lambda_r\)=\(\lambda_{r-1}/1.4\) for \(r=2,\cdots,14\), and \(\mu\)=0.1. We ran a discrete event-driven simulation until \(t\)=1,000,000. Figure 7 shows the time transition for the two restoration strategies until \(t\)=800. Table 2 shows the results. The calculations were conducted on a PC with an Intel Core i7 3.5 GHz, using the C programming language. Although the computational cost of priority restoration is large, all measures increase under the priority strategy because of the redundant structures in this system.

Fig. 7  Example of discrete event-driven simulation.

Table 2  Operational resilience and operational availability.

Next, we consider a network in which edges \(e_3, e_4, e_7, e_8, e_{10}\) and \(e_{12}\) are removed from the original network. The redundancy of the original network is removed, so this network fails if at least one edge fails. Table 3 shows the results when the shock occurrence rates and the restoration rate are the same as in the previous experiment. The operational resilience is improved by the priority strategy, whereas the difference in operational availability between the two strategies is small.

Table 3  Operational resilience and operational availability without redundant structure.

4.3  Operational Resilience by Markov Analysis

The state transition of the network system is described by the continuous-time Markov chain with \(2^m\) states. The performance \(P(t)\) depends on the state at \(t\). Let \(p_i\) be the steady state probability, and \(P_i\) be the performance of state \(i\) \((i=1,\ldots, 2^m )\). Note there is a one-to-one relation between a state and the performance. The performance \(P_i\) for given state \(i\) is derived by Eq. (3) or Algorithm 1 given in Sect. 2.2. Let \(\bf A\) be the state transition matrix and \({\bf p}=(p_1,p_2, \ldots, p_{2^m})\) be the state probability vector. The elements of \(\bf A\) are obtained considering the CCF and priority repair. For instance, the transition from state \(i\) to state \(j\) by CCF is

\[\begin{align*} a_{ij}=\sum_{d=f_j-f_i}^{f_j} \left( \begin{array}{c} f_i\\ d-(f_j-f_i) \end{array} \right)\lambda'_d \tag{11} \end{align*}\]

if the transition from state \(i\) to \(j\) is possible, otherwise 0. Here, \(f_i\) and \(f_j\) are the numbers of failed edges in state \(i\) and state \(j\), respectively. Note that state \(i\) (\(j\)) indicates a specific state, i.e., one of the combinations in which \(f_i\) (\(f_j\)) edges have failed. Therefore, a transition from state \(i\) to state \(j\) by CCF can only be caused by CCFs of at most \(f_j\) simultaneous failures that include every edge in which state \(i\) and state \(j\) differ. This is the reason that Eq. (11) does not depend on \(m-f_i\), the number of functioning edges in state \(i\). For restoration from state \(i\) to state \(j\),

\[\begin{equation*} a_{ij}=\mu / |e_i^*| \tag{12} \end{equation*}\]

for each selected state \(j\) that satisfies \(f_j=f_i-1\), otherwise 0, where \(e_i^*\) is the set of edges selected by Eq. (4) from the failed edges of state \(i\).

The steady state probability is obtained by the following simultaneous equations.

\[\begin{eqnarray*} &&{\bf p}={\bf pA} \tag{13} \\ &&\sum_i p_i=1. \tag{14} \end{eqnarray*}\]

Then the operational resilience \(R_O\) for a long operating period, that corresponds to Eq. (9) in simulation study, is obtained by the expectation of \(P_i\) as follows,

\[\begin{equation*} R_O=\sum_{i=1}^{2^m} P_i p_i. \tag{15} \end{equation*}\]

Note that the time-wise integral expression of resilience in Eq. (1) is transformed to the expectation of network performance in Eq. (15). The operational resilience in recovery stage \(R_{OR}\) is given by Eq. (10), where \(A_O\) is given by

\[\begin{equation*} A_O=\sum_{i; P_i=1} p_i. \tag{16} \end{equation*}\]
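
For very small networks, the chain of this section can be built and solved directly. The Python sketch below is an illustration under several assumptions, not the authors' code: states are bitmasks of failed edges, edges are undirected, the source is node 0, and the steady state is found by power iteration on a uniformized transition matrix (one possible choice of \(\bf A\) satisfying \({\bf p}={\bf pA}\)). The CCF rates are accumulated by enumerating the shocks whose hit set turns state \(i\) into state \(j\), which is the counting behind Eq. (11). All names and the single-edge check are hypothetical.

```python
from itertools import combinations
from math import comb


def reachable_count(n, edges, source=0):
    """Nodes reachable from the source over undirected edges (u, v)."""
    neighbours = {v: set() for v in range(n)}
    for u, v in edges:
        neighbours[u].add(v)
        neighbours[v].add(u)
    seen, stack = {source}, [source]
    while stack:
        for w in neighbours[stack.pop()]:
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return len(seen)


def steady_state_measures(n, edge_list, level_rates, mu, iterations=5000):
    """Return (R_O, A_O, p); bit k of a state is 1 when edge k has failed."""
    m = len(edge_list)
    size = 1 << m
    per_combo = [level_rates[r - 1] / comb(m, r) for r in range(1, m + 1)]  # lambda'_r
    counts = [reachable_count(n, [e for k, e in enumerate(edge_list)
                                  if not (s >> k) & 1]) for s in range(size)]
    rate = [[0.0] * size for _ in range(size)]
    for i in range(size):
        # CCF: a shock on a specific r-subset of edges moves the chain to i | subset.
        for r in range(1, m + 1):
            for hit in combinations(range(m), r):
                j = i
                for k in hit:
                    j |= 1 << k
                if j != i:
                    rate[i][j] += per_combo[r - 1]
        # Priority repair, Eq. (12): rate mu shared by the edges selected by Eq. (4).
        failed = [k for k in range(m) if (i >> k) & 1]
        if failed:
            best = max(counts[i & ~(1 << k)] for k in failed)
            chosen = [k for k in failed if counts[i & ~(1 << k)] == best]
            for k in chosen:
                rate[i][i & ~(1 << k)] += mu / len(chosen)
    # Uniformization: I + Q / big is a stochastic matrix with the same steady state.
    out = [sum(row) for row in rate]
    big = 1.1 * max(out)
    p = [1.0 / size] * size
    for _ in range(iterations):
        nxt = [p[i] * (1.0 - out[i] / big) for i in range(size)]
        for i in range(size):
            for j in range(size):
                if rate[i][j] > 0.0:
                    nxt[j] += p[i] * rate[i][j] / big
        p = nxt
    r_o = sum(pi * c / n for pi, c in zip(p, counts))       # Eq. (15)
    a_o = sum(pi for pi, c in zip(p, counts) if c == n)     # Eq. (16)
    return r_o, a_o, p


# Hypothetical check: one edge between the source and a single other node.
r_o, a_o, p = steady_state_measures(2, [(0, 1)], [0.1], 0.5)
print(r_o, a_o)
```

For the single-edge case the chain has two states, so the output can be compared with the closed form \(p_{\rm up}=\mu/(\lambda+\mu)\). The dense `rate` matrix grows as \(2^m \times 2^m\), so this direct approach is only practical for a handful of edges.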

4.4  Approximation

The steady state probability \(p_i\) in Eq. (15) is derived by multidimensional Markov analysis with the number of edges \(m\) as the dimension. The number of states increases exponentially with the network size, and the calculation becomes correspondingly difficult. To address this problem, we propose an approximation method in which the dimension of the Markov analysis is reduced to one by taking the number of failed edges as the state. In this case, the number of states equals \(m+1\). Let \(p'_i\) and \(P'_i\) be the steady state probability and the performance measure of the state with \(i\) failed edges. Then the approximated operational resilience \(R'_O\) for a long operating period is

\[\begin{equation*} R'_O=\sum_{i=0}^{m} P'_i p'_i. \tag{17} \end{equation*}\]

The approximated operational resilience in recovery stage, \(R'_{OR}\), is given in the same way as in Eq. (10).

The probability \(p'_i\) in Eq. (17) is easily obtained by a one-dimensional Markov chain but \(P'_i\) is not. In state \(i\), there are \(M_i=\binom{m}{i}\) substates. \(P'_i\) is the expected performance measure of these substates, and is determined by the performance measure and the probability of each substate. The substate probabilities differ because of the priority repair. For the probability of a substate, we propose two approximations as follows.

Approximation 1:

This approximation gives \(P'_i\) as the mean of all substates with \(i\) failed edges, i.e.,

\[\begin{equation*} P'_i={1 \over M_i}\sum_{k=1}^{M_i} P_{i,k}, \tag{18} \end{equation*}\]

where \(P_{i,k}\) is the performance of substate \(k\) in state \(i\). Eq. (18) follows from assuming that all the substates are equally probable, which corresponds to random repair. Therefore, this approximation gives a lower bound of the operational resilience. It is motivated by the observation that the more complex the network, the smaller the effect of the priority repair strategy. The approximated operational resilience and the approximated operational resilience in recovery stage with Approximation 1 are denoted by \(R'_{O1}\) and \(R'_{OR1}\), respectively.

Approximation 2:

This approximation assumes that the probability of a substate is proportional to the number of connected nodes. \(P'_i\) is given as follows,

\[\begin{equation*} P'_i=\sum_{k=1}^{M_i} {s_{i,k} \over s_i}P_{i,k}, \tag{19} \end{equation*}\]

where \(s_{i,k}\) is the number of connected nodes of substate \(k\) in state \(i\) and \(s_i=\sum_k s_{i,k}\). It is based on the idea that a substate with high connectivity becomes more probable because of the priority repair strategy. The approximated operational resilience and the approximated operational resilience in recovery stage with Approximation 2 are denoted by \(R'_{O2}\) and \(R'_{OR2}\), respectively.

With these approximations, the operational resilience and the operational resilience in recovery stage can be calculated using a one-dimensional Markov chain.
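The two approximations of \(P'_i\) in Eqs. (18) and (19) can be sketched by enumerating all substates with \(i\) failed edges. The code below is an illustration under several assumptions of ours: edges are treated as undirected, the source is node 0, \(s_{i,k}\) counts the nodes reachable from the source including the source itself, and \(P_{i,k}=s_{i,k}/n\). The function names and the triangle network are hypothetical.

```python
from itertools import combinations

def connected_nodes(n, edges, failed, source=0):
    """Count nodes reachable from the source over working edges (source included)."""
    adj = {v: [] for v in range(n)}
    for idx, (u, v) in enumerate(edges):
        if idx not in failed:          # skip failed edges
            adj[u].append(v)
            adj[v].append(u)           # assumption: undirected edges
    seen, stack = {source}, [source]
    while stack:                       # depth-first search from the source
        u = stack.pop()
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return len(seen)

def approx_performance(n, edges, i, source=0):
    """Return P'_i by Eq. (18) and by Eq. (19) for the state with i failed edges."""
    substates = combinations(range(len(edges)), i)   # all M_i = C(m, i) substates
    s = [connected_nodes(n, edges, set(f), source) for f in substates]
    P = [s_k / n for s_k in s]         # assumption: P_{i,k} = ratio of connected nodes
    mean = sum(P) / len(P)                                       # Eq. (18)
    weighted = sum(s_k * P_k for s_k, P_k in zip(s, P)) / sum(s) # Eq. (19)
    return mean, weighted

# Hypothetical triangle network: 3 nodes, 3 edges, source at node 0.
triangle = [(0, 1), (1, 2), (0, 2)]
print(approx_performance(3, triangle, 2))
```

For the triangle with two failed edges, the substates have 2, 1 and 2 connected nodes, so Eq. (18) gives \(5/9\) and Eq. (19) gives \(3/5\). The connectivity-weighted value is the larger of the two, in line with Approximation 1 serving as the lower bound.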

Example 3

The Gauss-Seidel method is used to solve for the steady state probabilities, \(p_i\) in Eqs. (13), (14) and \(p'_i\) in Eq. (17). Table 4 (Table 5) shows \(R_O\) (\(R_{OR}\)) and its approximations \(R'_{O1}\), \(R'_{O2}\) (\(R'_{OR1}\), \(R'_{OR2}\)) for several networks. Networks with \(m=n-1\) are minimal edge graphs, and networks with \(m=n(n-1)/2\) are complete graphs with bidirectional edges. The values are averages over 100 randomly generated networks [20], except for complete graphs. The parameters are \(\lambda_1=0.1\), \(\lambda_{r+1}=\lambda_r /5\) \((1\leq r<m)\), \(\mu=0.5\). \(R_O\) and \(R_{OR}\) are given by Eq. (15) and Eq. (10), respectively. However, the values marked with an asterisk are results of Monte Carlo simulation, because of the limitation of our computational resources (memory storage requirements): the coefficient matrix \(A\) in Eq. (13) is dense due to CCFs, and its size is \(2^{20} \times 2^{20}\) when \(m=20\). The Diff columns show the relative differences \((\%)\), i.e., Diff 1 =\((R'_{O1}-R_O)/R_O\times 100\), Diff 2 =\((R'_{O2}-R_O)/R_O\times 100\), Diff 3 =\((R'_{OR1}-R_{OR})/R_{OR}\times 100\) and Diff 4 =\((R'_{OR2}-R_{OR})/R_{OR}\times 100\).
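The following sketch shows how a Gauss-Seidel iteration can solve the steady state probabilities of a one-dimensional chain on the number of failed edges. The transition structure used here is a simplifying assumption of ours, not the paper's model: a shock that fails \(r\) further edges is assumed to occur at total rate \(\lambda_r\) (capped at state \(m\)), and one edge is repaired at a time at rate \(\mu\). The function names `build_generator` and `gauss_seidel_steady_state` are hypothetical; only the parameter values \(\lambda_1=0.1\), \(\lambda_{r+1}=\lambda_r/5\) and \(\mu=0.5\) come from the example.

```python
def build_generator(m, lam1=0.1, ratio=5.0, mu=0.5):
    """Generator matrix Q of a hypothetical chain on the states 0..m failed edges."""
    lam = [lam1 / ratio ** (r - 1) for r in range(1, m + 1)]   # lam[r-1] = lambda_r
    Q = [[0.0] * (m + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        for r in range(1, m + 1):
            j = min(i + r, m)            # assumption: excess failures are capped at m
            if j > i:
                Q[i][j] += lam[r - 1]
        if i > 0:
            Q[i][i - 1] += mu            # edges are repaired one by one
        Q[i][i] = -sum(Q[i])             # diagonal makes each row sum to zero
    return Q

def gauss_seidel_steady_state(Q, tol=1e-12, max_sweeps=10_000):
    """Solve pi Q = 0 with sum(pi) = 1 by Gauss-Seidel sweeps with normalization."""
    n = len(Q)
    pi = [1.0 / n] * n
    for _ in range(max_sweeps):
        old = pi[:]
        for j in range(n):
            # Balance equation of state j, using already updated entries of pi.
            inflow = sum(pi[i] * Q[i][j] for i in range(n) if i != j)
            pi[j] = inflow / -Q[j][j]
        total = sum(pi)
        pi = [x / total for x in pi]
        if max(abs(a - b) for a, b in zip(pi, old)) < tol:
            return pi
    raise RuntimeError("Gauss-Seidel did not converge")

p_prime = gauss_seidel_steady_state(build_generator(2))
print(p_prime)
```

The resulting probabilities can then be combined with \(P'_i\) as in Eq. (17). Each sweep reuses the entries of \(p'\) updated earlier in the same sweep, which is what distinguishes Gauss-Seidel from a Jacobi iteration.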

Table 4  Operational resilience for networks.

In Table 4, the errors are relatively small in every case, and Approximation 2 has smaller errors than Approximation 1. However, the difference between the two is rather small, especially for the complete graphs. We can also confirm that Approximation 1 gives the lower bound. In Table 5, the errors are larger than in Table 4, and Approximation 1 is superior to Approximation 2 for dense graphs (complete graphs).

Table 5  Operational resilience in recovery stage for networks.

To verify the influence of CCFs, Table 6 shows the results without considering CCFs. Although the total failure rate \(\lambda\) of each network is the same as in the previous example, failures are restricted to independent failures; namely, \(\alpha_1=1, \alpha_r=0\) for \(r\geq 2\), \(\lambda_1= \lambda\), \(\lambda_r=0\) for \(r\geq 2\) in every case. Both the operational resilience and the operational resilience in recovery stage are higher than in Tables 4 and 5. Thus, an analysis that ignores CCFs overestimates resilience.

Table 6  Operational resilience and operational resilience in recovery stage without CCF.

5.  Conclusion

This study discussed the resilience of network systems, where an electric power network was assumed as the target network and priority restoration was conducted. Two types of resilience were discussed: (I) resilience in the recovery stage according to the conventional definition of resilience, and (II) steady state operational resilience considering long-term operation in which the network state changes stochastically. For the first type, we showed that the resilience is easily formulated under the assumptions that (1) the deterioration of network performance is given, (2) the restoration is conducted for all failed edges one by one in a designated order, and (3) the repair times are i.i.d. with known mean. In this case, the mean resilience is determined only by the order of repair. The second type gives a comprehensive capacity of resilience for a system and was derived by Markov analysis. This is the most notable achievement of this paper. In the analysis, common-cause failures were incorporated to represent the occurrence of large-scale disruption. Two resilience measures, (II-a) operational resilience and (II-b) operational resilience in recovery stage, were newly proposed to estimate the resilience of a network over a long operational period. Measure II-b applies the conventional resilience to long-term operation. For the analysis of type II resilience, two approximations were proposed for complex networks. The occurrence of common-cause failures decreases the network resilience drastically. The effectiveness of the restoration strategy was verified. The analysis in this paper was based on the conditions that a network has only one source, the network performance is measured by the connectivity from the source, the restoration capacity is limited, and both the distributions of shock occurrence and restoration are exponential. (The last condition is used only for the mathematical framework based on Markov analysis.) Relaxing these conditions is an interesting topic for future study. Furthermore, a more effective approximation method for more complex networks is required to advance research in resilience engineering.

Acknowledgments

This research was partially supported by JSPS (Japan Society for the Promotion of Science) KAKENHI Grant Number 20K05021.

References

[1] C.S. Holling, “Resilience and stability of ecological systems,” Annual Review of Ecology and Systematics, vol.4, no.1, pp.1-23, 1973.

[2] M. Bruneau, S.E. Chang, R.T. Eguchi, G.C. Lee, T.D. O’Rourke, A.M. Reinhorn, M. Shinozuka, K. Tierney, W.A. Wallace, and D. Winterfeldt, “A framework to quantitatively assess and enhance the seismic resilience of communities,” Earthquake Spectra, vol.19, no.4, pp.733-752, 2003.

[3] M. Ouyang, L. Dueñas-Osorio and X. Min, “A three-stage resilience analysis framework for urban infrastructure systems,” Structural Safety, vol.36-37, pp.23-31, 2012.

[4] G. Hug-Glanzmann and G. Andersson, “N-1 security in optimal power flow control applied to limited areas,” IET Generation, Transmission & Distribution, vol.3, no.2, pp.206-215, 2009.

[5] S.A. Nezam-Sarmadi, S. Nourizadeh, S. Azizi, R. Rahmat-Samii, and A.M. Ranjbar, “A power system build-up restoration method based on wide area measurement systems,” Euro. Trans. Electr. Power, vol.21, no.1, pp.712-720, 2011.

[6] A.A. Mota, L.T.M. Mota, and A. Morelato, “Visualization of power system restoration plans using CPM/PERT graphs,” IEEE Trans. Power Syst., vol.22, no.3, pp.1322-1329, 2007.

[7] A.C. Caputo, P.M. Pelagagge, and P. Salini, “A methodology to estimate resilience of manufacturing plants,” Proc. 9th IFAC Conference on Manufacturing Modelling, Management and Control (MIM 2019), pp.808-813, 2019.

[8] D. Guzs, A. Utans, and A. Sauhats, “Evaluation of the resilience of the Baltic power system when operating in island mode,” Proc. 31st European Safety and Reliability Conference, WEiJ:056, Angers, France, 2021.

[9] G. Ziwei and Y. Fei, “Research on resilience evaluation method of train operation control system based on random failure,” Transport Reviews, vol.40, no.4, pp.457-478, 2020.

[10] A.A. Ganin, E. Massaro, A. Gutfraind, N. Steen, J.M. Keisler, A. Kott, R. Mangoubi, and I. Linkov, “Operational resilience: Concepts, design and analysis,” Sci. Rep., vol.6, 19540, 2016.

[11] G.P. Cimellaro, A.M. Reinhorn, and M. Bruneau, “Framework for analytical quantification of disaster resilience,” Engineering Structures, vol.32, no.11, pp.3639-3649, 2010.

[12] D.A. Reed, M.D. Powell, and J.M. Westerman, “Energy supply system performance for hurricane Katrina,” Journal of Energy Engineering, vol.136, no.4, pp.95-102, 2010.

[13] B. Cassottana, L. Shen, and L.C. Tang, “Modeling the recovery process: A key dimension of resilience,” Reliability Engineering and System Safety, vol.190, pp.1-10, 2019.

[14] U.S. Nuclear Regulatory Commission, Guidelines on Modeling Common-Cause Failures in Probabilistic Risk Assessment, NUREG/CR5485, 1998.

[15] U.S. Nuclear Regulatory Commission, PRA Procedures Guide, NUREG/CR2300, 1983.

[16] U.S. Nuclear Regulatory Commission, Parameter Estimations, NUREG/CR6268, 2015.

[17] A.W. Marshall and I. Olkin, “A multivariate exponential distribution,” J. Amer. Statist. Assoc., vol.62, no.317, pp.30-44, 1967.

[18] W.E. Vesely, “Estimating common-cause failure probabilities in reliability and risk analyses: Marshall-Olkin specializations,” Nuclear Systems Reliability Engineering and Risk Assessment, J.B. Fussell and G.R. Burdick, eds., Society of Industrial and Applied Mathematics, 1977.

[19] H. Mori, “Current status of reliability assessment in power systems,” Journal of Reliability Engineering Association of Japan, vol.30, no.4, pp.317-327, 2008 (in Japanese).

[20] P. Erdös and A. Rényi, “On random graphs I,” Publicationes Mathematicae, vol.6, pp.290-297, 1959.

Authors

Tetsushi YUGE
  National Defense Academy

is currently a professor in the Department of Electrical and Electronics Engineering at the National Defense Academy, Yokosuka, Japan. He received his B.S. (1989) in Mathematics, M.S. (1991) in Information Engineering and Ph.D. (1996) in Reliability from Hokkaido University. His current research interests include reliability analysis and safety analysis. He is a member of IEEE and the Reliability Engineering Association of Japan.

Yasumasa SAGAWA
  National Defense Academy

received his B.E. and M.E. in Engineering from the National Defense Academy, Yokosuka, Japan, in 2016 and 2022, respectively. He has been working as an official for the Japan Air Self-Defense Force.

Natsumi TAKAHASHI
  National Defense Academy

is currently a lecturer in the Department of Electrical and Electronics Engineering at the National Defense Academy, Yokosuka, Japan. She received her B.A. in Economics from Meiji Gakuin University in 2011, and her M.S. and Ph.D. in System Design from Tokyo Metropolitan University in 2013 and 2016, respectively. Her research interests are optimization based on reliability engineering and operations research. She is a member of the IEEE Reliability Society and the Reliability Engineering Association of Japan.

Keyword