Reliability, Dependability, Security, MTBF, MTTR, Availability 🤔😓🤷‍♂️

© Copyright Rod Hughes Consulting Pty Ltd



Semantics come into play in all this and sometimes there are just too many words - this is no different, but I will also add some numbers 👍🍾😁

RELIABILITY of overall correct operation of a protection system has two sides:

  • DEPENDABILITY that it WILL work correctly specifically when it has to
  • SECURITY that it WILL NOT work when it should not work – sometimes referred to as Stability

Reliability and Dependability are sometimes confused with each other because they "sound" the same in regard to correct operation.

Dependability and Security have opposite implications when it comes to Duplication – Duplication being in the sense of Redundancy.

Context

We generally refer to the PROTECTION SYSTEM as being the combination of CTs, d.c. power, wires, links, relays, trip coils. We use one physical VT but with two secondary windings or at least separately fused circuits from one winding.
For a fault on the primary power system, the Protection System should cause a trip of the primary circuit breaker.


Consider two INDEPENDENT Protection Systems, each capable of “doing the job” on its own – no common mode of failure and no dependency on each other.
These are INDEPENDENT and DUPLICATED protection systems (which I call X and Y, or you may say #1 and #2 or Main 1 and Main 2).
"Duplicate" applies even though perhaps the relays in each system are not necessarily IDENTICAL in function, performance or physical manufacture – in fact we want them to be sufficiently different in those three aspects as to minimize chance of common mode failure, but still clear the main power system fault within the critical clearing times.
REDUNDANCY in the sense of FUNCTIONAL DUPLICATION (e.g. clear faults within 100 milliseconds) does not specifically require PHYSICALLY IDENTICAL equipment.
Hence a distance relay as the basis of X and a Line Differential as basis of Y are in fact both Redundant and Duplicate.

However having two protections as Functional Duplication to see the same conditions does raise the corollary question: same or different operating principles? and also same or different vendors?
Refer Duplicate systems - same or different vendors?

We may even put these Independent Duplicate Protection Systems in different cabinets, different cable trays and different cable trenches.
If mounted in the same "thing" we may segregate top/bottom or left/right, possibly even to the extent of firewalls, to eliminate as much chance of common mode physical failure as possible.

The X and Y trip coils are an interesting problem as they operate on the same mechanism.
Many decades ago in Australia, both X and Y relays operated for a critical power system fault but the CB did not budge. 😤
When each system was tested on its own during commissioning, each individually caused operation of the circuit breaker, so it all seemed good.
But as the trip coils were installed in opposite polarity, when simultaneously energized nothing happened!
Testing procedures (around the world) have since been modified to test for simultaneous relay operation.


Putting two independent systems in parallel means the Duplicated systems are connected "at the ends" so that the same primary system condition is presented to both systems and either system could cause the same outcome on the primary system.
If one system fails, the other system is not affected and will still do the job correctly, so combining the Duplicated systems increases Dependability.

However, as you have two systems either of which could fail in a manner that leads to an unwarranted operation of that system, the combined Duplicated systems have twice as much possibility of an unwarranted operation from the failure of one "thing" somewhere in the two systems.  Hence Security is decreased.

Simply put, when you put two X and Y systems as Duplicated systems in parallel, Dependability increases and Security decreases.
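The trade-off above can be put into rough numbers. The following is only a sketch, assuming the X and Y systems are fully independent and that either one alone can trip the breaker. The two per-system probabilities are hypothetical values chosen for illustration, not figures from any real installation.

```python
def one_out_of_two(p_miss: float, p_false: float) -> tuple[float, float]:
    """Return (P(no trip on a real fault), P(unwarranted trip)) for two
    identical, independent systems in parallel where either can trip."""
    # A real fault goes uncleared only if BOTH systems fail to operate.
    combined_miss = p_miss * p_miss
    # An unwarranted operation of EITHER system trips the breaker.
    combined_false = 1.0 - (1.0 - p_false) ** 2
    return combined_miss, combined_false


# Hypothetical per-system probabilities, for illustration only.
P_MISS = 0.01    # one system fails to operate when it should
P_FALSE = 0.001  # one system operates when it should not

miss, false_trip = one_out_of_two(P_MISS, P_FALSE)
print(f"Single system: miss={P_MISS:.6f}  false trip={P_FALSE:.6f}")
print(f"Duplicated   : miss={miss:.6f}  false trip={false_trip:.6f}")
```

With these assumed inputs the chance of a missed trip falls by a factor of 100, while the chance of an unwarranted trip almost exactly doubles, which is the "twice as much possibility" described above.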

Regarding the question of how does Redundancy (Duplication) affect Reliability .. it depends on which side of the Reliability “balance beam scales” that you are concerned about with Dependability on one side and Security on the other.

We don’t want unwarranted operation (low Security/Stability) as that means power system outages/blackout and stopped production etc and that means lost revenue to the company.

However it is generally considered that whilst Security may reduce, it is better to increase Dependability by duplication as we do not want a primary system fault to go undetected or have a slower than required clearance time as that could lead to further personnel injury/death or more severe plant/equipment damage that would cause more widespread outage and longer time to repair.

i.e. It is better to have an unwarranted trip (and a few explanations in response to your boss’ tirade of “unacceptable”) which you can quickly restore to service, than to not trip when required with far greater consequences and far longer time to restore to service (with you getting sacked for incompetent engineering, the company going bankrupt or having to give your explanations to the Coroner)!

Assessing Dependability

We need some numbers associated with this word “Dependability” to assess what it really means .. and perhaps help make some important decisions about replacing old equipment.

Mean Time Between Failures (MTBF) is often considered the key criterion for Dependability.
For an individual “thing” that is true.
However MTBF is statistical .. after all, it is an average!

Whilst MTBF is an indication of how likely the "thing" will keep doing its job for as long as possible, "things" inevitably fail for one reason or another, or perhaps just may need to be taken out of service for maintenance activity to keep the risk of failure as low as possible.

A single protection system has many “things” as components of the system, any of which can be out of service or fail and stop doing their job for any manner of reasons.
The “things” include the CT/VT, CB coil, d.c. supply, relay(s) AND all the wires and links connecting them.
The protection system is essentially a set of components connected in series to “do the job” of tripping the circuit breaker.

Accepting that things WILL fail, the REAL question is how quickly can it be restored to normal service?

  • A failed "intelligent" relay may alarm leading to replacement and recommissioning in “half a day”.
  • A cable trench fire may take more than six months to re-lay cable, retest and recommission the system.
  • A link left in the wrong position may not be detected till the next routine maintenance a year or more later.

Hence the other side of the coin of MTBF is Mean Time To Repair (MTTR) .. or as I prefer Mean Time To RESTORE.

MTBF of a “thing” is primarily associated with the VENDOR’s design and manufacturing, plus some aspects of how the ASSET OWNER has arranged the installation and maintenance of the “thing”.

However, MTTR is influenced by factors purely under the ASSET OWNER’s direct control such as:

  • Fault detection systems required to be built into the systems/devices
  • Fault alarm routing to appropriate personnel
  • Personnel response and travel time to the faulty equipment
  • Time to identify the particular faulty equipment
  • Time to isolate the faulty part(s) of the system for safe personnel intervention with as far as practical minimal disruption to plant operations
  • Time to repair or replace the faulty equipment
  • Time to test for correct operation of the repaired/new equipment
  • Time to undertake recommissioning of the repaired/new device and system into normal service
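One way to see how these factors add up is to treat the time to restore as the sum of the stages in the list above. This is only a sketch: the stage names paraphrase the list, and every duration is a hypothetical placeholder rather than measured data.

```python
# Stages of restoring a failed protection system to service.
STAGES = ("detect", "alarm routing", "travel", "identify",
          "isolate", "repair/replace", "test", "recommission")


def time_to_restore(hours_by_stage: dict[str, float]) -> float:
    """Sum the stage durations in hours; missing stages count as zero."""
    return sum(hours_by_stage.get(stage, 0.0) for stage in STAGES)


# Hypothetical case 1: the device alarms its own failure immediately.
self_alarmed = {
    "detect": 0.0, "alarm routing": 0.1, "travel": 2.0, "identify": 0.5,
    "isolate": 0.5, "repair/replace": 1.5, "test": 1.0, "recommission": 0.4,
}

# Hypothetical case 2: same repair work, but the failure is silent and is
# only found at routine testing, assumed here to be half a year later.
found_at_routine_test = dict(self_alarmed, detect=0.5 * 365 * 24)

for name, case in (("self-alarmed", self_alarmed),
                   ("found at routine test", found_at_routine_test)):
    hours = time_to_restore(case)
    print(f"{name}: {hours:.1f} h ({hours / 24:.2f} days)")
```

Under these assumptions the hands-on work is identical in both cases. The detection stage alone decides whether restoration takes hours or months.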


We can now derive a better number relative to Dependability than just MTBF alone.

This number is AVAILABILITY.

Availability (A) is the percentage of overall time that the individual thing, or indeed a system, can be expected to be operating correctly when it should

A = MTBF / (MTBF + MTTR)

e.g. electromechanical relays are known to have been in service for 50 years or more, so we could consider an MTBF of say 50 years.
Assuming no power system faults, any failure of the relay may not be detected until the next scheduled routine testing, and it may then take another couple of days to replace and recommission to normal service – let’s say a nice round 1 year overall.
In the meantime that protection system is NOT OPERATING at all.
So its Availability is 50/(50+1) = 98.039%

Now consider a modern Intelligent Electronic Device (IED), and let’s say it has a 25 year MTBF because inherently there are electronic components (capacitors etc.) with a limited life.
This sounds bad.
However, being intelligent, it has a lot of self-diagnostics (which can possibly also detect external problems) and can raise an alarm via the communications port to the personnel, who can get to site, replace the IED and restore it to service in say 6 hours (0.25 days) from failure.
Its Availability using days is (25 x 365) / [(25 x 365) + 0.25] = 9125/9125.25 = 99.997% ... FAR better than the electromechanical relay with MTBF twice as long.

This clearly shows why MTBF is an insufficient criterion on its own for choosing equipment
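The two worked examples can be checked with a few lines of code. This is a minimal sketch of A = MTBF / (MTBF + MTTR) using the MTBF and MTTR values from the examples above. The function name is just an illustrative choice.

```python
def availability(mtbf: float, mttr: float) -> float:
    """A = MTBF / (MTBF + MTTR); both arguments in the same time unit."""
    return mtbf / (mtbf + mttr)


# Electromechanical relay: MTBF 50 years, about 1 year to detect and restore.
a_em = availability(50.0, 1.0)

# IED: MTBF 25 years expressed in days, restored in 0.25 days.
a_ied = availability(25 * 365, 0.25)

print(f"Electromechanical: {a_em:.3%}")   # 98.039%
print(f"IED              : {a_ied:.3%}")  # 99.997%
```

The only thing to watch is that MTBF and MTTR are in the same unit (years in the first case, days in the second).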

Availability of a system is calculated depending on whether the things are in “series”, or arranged in parallel

  • Aseries = A1 x A2 x A3 … x An
    e.g. Three functions in series each with 99% Availability yields a system Availability of 97.03%
  • Aparallel = 1 - [(1 - As1) x (1 - As2)]
    e.g. Two systems each with 97.03% Availability in parallel yields a combined Availability of 99.91%
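The series and parallel rules are easy to sketch in code. The function names below are illustrative choices, and the parallel rule is written for any number of systems rather than only two, which assumes the systems fail independently of each other.

```python
from math import prod


def series_availability(parts):
    """All parts must work: multiply the individual availabilities."""
    return prod(parts)


def parallel_availability(systems):
    """At least one system must work: 1 minus the product of the
    individual unavailabilities (assumes independent failures)."""
    return 1.0 - prod(1.0 - a for a in systems)


# Three functions in series, each 99% available.
a_series = series_availability([0.99, 0.99, 0.99])

# Two such series systems in parallel.
a_parallel = parallel_availability([a_series, a_series])

print(f"Series  : {a_series:.2%}")    # 97.03%
print(f"Parallel: {a_parallel:.2%}")  # 99.91%
```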

A single protection system is a bunch of "things" in series – the relays, CT, the wires, the links, d.c. supply, cb coils.

Duplicated protection is two systems in parallel.

Some Availability Example Numbers

Single Electromechanical Relay

Consider having one protection system using an electromechanical relay as the main protection device.
Each "thing" is a "function" in that system and has its own MTBF and MTTR.
There are 11 things/functions (i.e. points of potential failure) even in this simple system: relay, panel wire, panel link, control room wire, control room link, yard wire, marshalling link, CB wire, CB link, trip to coil, coil.

  • Relays fail to work (auxiliary supply failure, relay fails, CTs left shorted, output contact failure ...)
  • Wires get eaten through by rodents, or just fall off if not tightened properly
  • Links get left in wrong position or fall into wrong position if not tightened properly
  • Coils open circuit

I won’t get into arguments of the MTBF and MTTR figures, but suffice to say I derived a set of example figures for this simple system of just the relay and wiring through to the CB coil:

The electromechanical relay, with MTBF = 50y and MTTR = 0.25d, has its own Availability of 99.998%.
That sounds great!
But in association with the other "functions" in the system, the overall X system Availability is Ax = 90.621%, corresponding to an MTBF of 1.47y and an MTTR of 0.152y out of service.
Apart from deliberate maintenance activity, that is about 55 days for each failure when the protection will not work when it is potentially required to work.

Duplicate Electromechanical Relay System

Having duplicate X and Y electromechanical relay systems, we can consider that all the things are the same and therefore have the same Availability, with the result that Ax = Ay.
This would achieve a combined parallel Availability of Ap = 99.120% (so-called “2 nines”).
This reflects that there is a chance that both systems have failed simultaneously and neither system works.
This will only be the case for however long it takes to repair one of the systems, and hence we have the same 55 days on each such occasion where there is no protection operating.
However, the MTBF of such a situation of no protection operating has increased from once every 1.47 years to having no operating protection only once in over 17 years.

One IED System and One Electromechanical Relay System

However, if one of the two systems was changed to an IED (which arguably has a lower MTBF because of the life of electronic components), we get the benefit that the relay can alarm its own failure to appropriate personnel who can respond to repair/replace the faulty relay and restore it to service MUCH faster.  For this example, take the IED as having MTBF 30y and MTTR 0.25d.
All the wiring etc. would be the same as for the electromechanical relay.
Hence the IED based system Ay = 99.858% (statistically potentially not operating for 0.9 days per failure)

So the combined Availability of one electromechanical system and one IED system in parallel is Ap = 99.987% - better than 3 nines, with statistically MTTR = 22 hours where there is no protection at all and MTBF = nearly 19 years between such occurrences.

Duplicate IED Relays

If both were IED systems, Ap = 99.9998% and the overall system still has just 22 hours where there is no protection operating at all, but the duration between such circumstances being likely to exist has increased to over 1200 years.
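The three combined cases can be reproduced approximately from the system figures quoted above (Ax = 90.621% with MTTR 0.152y, Ay = 99.858% with MTTR 22 hours). The sketch below assumes the "MTBF of no protection" is back-calculated from the combined Availability and the quoted MTTR using A = MTBF / (MTBF + MTTR). That is one plausible reading of how the figures relate, not a confirmed method, and the function names are illustrative.

```python
HOURS_PER_YEAR = 365 * 24

A_EM = 0.90621                    # electromechanical system availability
A_IED = 0.99858                   # IED based system availability
MTTR_EM_Y = 0.152                 # years (about 55 days)
MTTR_IED_Y = 22 / HOURS_PER_YEAR  # 22 hours expressed in years


def parallel(a1: float, a2: float) -> float:
    """Availability of two independent systems in parallel."""
    return 1.0 - (1.0 - a1) * (1.0 - a2)


def implied_mtbf(a: float, mttr: float) -> float:
    """Rearranged A = MTBF / (MTBF + MTTR); an assumed back-calculation."""
    return a * mttr / (1.0 - a)


# (label, first system, second system, MTTR quoted for the combined case)
cases = [
    ("EM + EM", A_EM, A_EM, MTTR_EM_Y),
    ("EM + IED", A_EM, A_IED, MTTR_IED_Y),
    ("IED + IED", A_IED, A_IED, MTTR_IED_Y),
]

results = {}
for label, a1, a2, mttr in cases:
    ap = parallel(a1, a2)
    results[label] = (ap, implied_mtbf(ap, mttr))
    print(f"{label:10s} Ap = {ap:.4%}  implied MTBF = {results[label][1]:.1f} y")
```

Under these assumptions the implied MTBF values come out a little over 17 years, nearly 19 years and over 1200 years respectively, in line with the figures quoted in the text.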


Relay Replacement Justifications

Perhaps you have experience where your proposal to replace all the "old" protection meets with a response like ...
"it rarely operates anyway and the annual maintenance checks for the last 50 years say it is still working ...."


Your response to that could be something based on such MTBF and MTTR numbers (financial people understand numbers well, but not words so much 😜) as:
"The risk is increasing daily that statistically it will not work some time in the next <<MTBFold>> ... possibly even the day after maintenance, ... and then statistically we will have no power for <<MTTRold>>.
If we replace them the risk is EXTENDED to <<MTBFnew>> and even then we would only be without power for <<MTTRnew>>"


 ... and if you really wanted to go "all in",
"if we adopt IEC 61850 engineering, we can implement far greater self-monitoring systems to alarm and be able to restore service faster ... even "instantaneous resilience" of the LAN to have almost no risk of loss of overall protection functionality even if something fails, so no loss of power whilst we fix the failure"

Refer Architectures - Star, Ring, Duplicated, Redundancy, bumpless PRP, bumpless HSR???  - section <<Unlocking the Power, Redundancy and Resilience of PRP with dual RSTP rings (MSTP)>>

Is one system more secure or dependable than another?

First define "system"
- Transformer differential or Generator differential is a single "box" relay.
- Restricted Earth Fault differential is a "single box" or possibly incorporated into other boxes as direct measurement or numerically derived.
- Busbar differential could be a single box or could be multiple box peripheral units with a central unit with proprietary communications between them (or IEC 61850 MUs with Sampled Values being sent to the IED with the PDIF function).
- Line differential would be 2 or more boxes at each substation

But in any case the protection "system" incorporates CT, VT, d.c. supply, wires, links, trip coils ...
Oh and if we really wanted to incorporate all failure modes .. the settings themselves are part of the ability or inability of a protection function to operate when it should and not operate when it shouldn't as per my discussion on same/different brands of devices!
Duplicate systems - same or different vendors?

For example, is a differential FUNCTION secure and dependable? In as much as ANY protection FUNCTION is both secure and dependable, YES.

Just because a differential FUNCTION has a defined zone of operation does not make it more secure and/or more dependable than an unrestricted OC/EF.  

We want ALL forms of protection FUNCTIONS to not operate when they shouldn't AND operate when they should.

  • A Restricted (physical) Zone FUNCTION should operate for an in-zone fault above setting, and not operate for any load condition or any out of zone fault ... even with a CT saturated (which is similar to the CT being left short circuited).
  • An Unrestricted (physical) Zone FUNCTION should operate for any fault above setting, and not operate for low level faults below settings and hopefully settings are such to avoid operation for load condition.

Consider if a CT is left short circuited with fault current flowing or even normal load current flowing sufficiently above setting threshold:


Protection FUNCTION                                   | Load Current | Fault Current
Differential FUNCTION                                 | Operate      | Not Operate
Overcurrent FUNCTION                                  | Not Operate  | Not Operate
Earth Fault FUNCTION, Neutral leg CT                  | Not Operate  | Not Operate
Earth Fault FUNCTION, Core Balanced CT                | Not Operate  | Not Operate
Earth Fault FUNCTION, Holmgren Connection CT          | Not Operate  | Not Operate
Distance FUNCTION, no signalling scheme               | Not Operate  | Not Operate
Distance FUNCTION, with remote end signalling scheme  | Not Operate  | Forced operation

Does this inherently mean one is preferable to another? Or more Dependable than another?
Not in consideration of all the other factors that go into maximising the chances that the protection will operate when it should ... AND ... minimising the chances that it will operate when it shouldn't.

We use a combination of measures to achieve best practical DEPENDABILITY including things like

  • multiple protection FUNCTIONS to detect all manner of possible faults (no one function can detect all faults),
  • d.c. supply supervision,
  • VT supervision
  • CT supervision
  • trip circuit supervision,
  • fail safe alarms on device failure,
  • primary injection testing after maintenance (many embarrassing cases of relay CT circuits left short circuited by links or other test procedures),
  • ... and even X and Y SYSTEM DUPLICATION.

The Primary System - Does the Same Apply?

So far I have focused on just the Secondary System in this consideration of Reliability, Dependability, Security, and Availability.

It is then fair to ask ... what about the Primary System?  What is its MTBF?

Obviously it is not zero, otherwise we would never need protection systems!

As an example, consider a Busbar - is it "reliable"?

The semantics are important.  A "bus" as a length of aluminium tube or a length of copper bar is inherently reliable and will keep passing current when it is supposed to and not pass current when it isn't.

In the substation however that piece of metal has various connections of circuit breaker connection palms, line droppers etc that make up the bus bar arrangement. It also has mounting/supports and insulation/isolation.

What is the MTBF of all that?
Long!

Hopefully connection bolts won't rust away or become loose. 
I have consulted on an incident where the bus bar bolt INSIDE the GIS was not tightened properly during manufacture causing a single phase open circuit melt down six months after being commissioned. 
Another incident I had to deal with was an AIS CT palm INSIDE the CT being loose.

Hopefully flexible dropper connections won't fray (seen it).

Hopefully stanchions won't tilt on bad footings (seen it).

Hopefully overhead earth wire shackles won't rust through allowing the earth wire to drop onto the busbars (seen it).

So whilst the MEAN Time Between Failures of a bus bar arrangement is LONG, it is not infinite.  Nevertheless, here it seems that the busbar is considered to be extremely reliable without being duplicated.

We can then look at the bus arrangement in the sense of the overall substation's purpose and ability to bring power in and distribute it out 24 hours a day "forever"!

Now we have to take into consideration that the circuit breaker may fail or be out of service for maintenance etc., which is a kind of "failure" to pass the power through the substation.
Hence we have more sophisticated bus arrangements to allow for that, such as split bus with bus tie, double bus, 1.5 circuit breaker, ring bus, mesh bus ....
Effectively we have duplicated the busbar (together with additional switchgear and instrument transformers) to achieve overall reliability of the substation to fulfil its purpose.

Or perhaps it's a question of whether the generator or loads need continuous grid connection ... I sigh profusely when I see all these renewable projects connect via a single CB whereas, for the cost of an extra two, they can have a 3-CB mesh (typically two 1.5 CB diameter layout partially populated allowing for easy future expansion) which means they can carry out annual CB maintenance without loss of connection ...


Bottom line - inevitably THINGS FAIL to do their job either randomly or sometimes because of maintenance activity needing them not to do their normal job!

We have to do OUR job reliably by anticipating such failures and providing mechanisms - quite often duplication - to minimise the disruption to the overall objective.




Extra Notes:

Disclaimer
No Liability:
Rod Hughes Consulting Pty Ltd accepts no direct nor consequential liability in any manner whatsoever to any party whosoever who may rely on or reference the information contained in these pages.  Information contained in these pages is provided as general reference only without any specific relevance to any particular intended or actual reference to or use of this information. Any person or organisation making reference to or use of this information is at their sole responsibility under their own skill and judgement.

No Waiver, No Licence:
This page is protected by Copyright ©
Beyond referring to the web link of the material and whilst the information herein is accessible "via the web", Rod Hughes Consulting Pty Ltd grants no waiver of Copyright nor grants any licence to any extent to any party in relation to this information for use, copy, storing or redistribution of this material in any form in whole or in part without written consent of Rod Hughes Consulting Pty Ltd.