Safety integrity levels: an overview of system-design risk analysis

Author: John O’Sullivan, engineering director, Douglas Control and Automation

The term ‘safety integrity level’ (SIL) is used as a convenient shorthand to describe the safety rating of various hardware components and systems, e.g. “This PLC CPU is rated SIL3.” The SIL was designed as a shorthand to represent the results of complex analysis, but it is still only a part of an overall lifecycle approach to functional safety.

Technically, the safety integrity level is the level by which the risk is reduced by the introduction of a safety instrumented system (SIS). There are four levels, with SIL1 being the least reduction in risk and SIL4 being the greatest.

The SIS is separate to and independent of the basic process control system (BPCS) and, like the BPCS, it consists of sensor(s), logic solver(s) and final element(s). The SIS reduces the risk by intervening during a failure of the BPCS to ensure that the system remains safe. While the SIS hardware and software components may resemble the BPCS components and may come from the same manufacturer, they are required to be more reliable.

The specification, design and operation (safety life cycle, or SLC) are defined in the standard IEC 61508, ‘Functional Safety of Electrical/Electronic/Programmable Electronic Safety-Related Systems’. This standard has spawned a number of industry- and sector-specific standards that delve into more detail for specific industries, although we will focus on IEC 61508 in this article. IEC 61508 defines the SLC in three sections:

Phases 1 to 5: Analysis;
Phases 6 to 13: Realisation; and
Phases 14 to 16: Operation.

The following standards elaborate on the approach to SIL assignment outlined in IEC 61508:

IEC 61511 ‘Functional Safety – Safety Instrumented Systems for the Process Industry Sector’
IEC 61513 ‘Nuclear Power Plants – Instrumentation and Control Important to Safety’
EN 50128 ‘Railway Applications – Communication, Signalling and Processing Systems – Software for Railway Control and Protection Systems’
EN 50129 ‘Railway Applications – Communication, Signalling and Processing Systems – Safety-Related Electronic Systems for Signalling’.

Safety integrity levels - hazard and risk analysis

During the analysis phases of a project, hazard identification and risk analysis are carried out by an interdisciplinary team. The team should include all the system stakeholders, such as designers, process owners and safety, automation, mechanical and electrical specialists. Where possible, hazards are designed out of the system. Where this is not possible, e.g. because a volatile raw material is essential to the process, the risks associated with the hazard are identified. Hazards are treated as potential sources of harm; once a hazard is identified, its risk is assessed as the product of the ‘frequency of the occurrence’ and the ‘severity of the harm’. Methods of analysis include:

HAZOP: Hazard and Operability Study
FME(C)A: Failure Mode Effect (and Criticality) Analysis
FMEDA: Failure Mode Effect and Diagnostic Analysis
ETA: Event Tree Analysis
FTA: Fault Tree Analysis

Normally, a risk matrix uses the likelihood of the occurrence and the consequence of the event to categorise the risks. Risks that cannot be designed out and are not tolerable will require safety functions to reduce the risk to a tolerable level. This results in the ‘residual risk’, which must be less than the pre-defined ‘tolerable risk’. The greater the reduction required to reach the residual risk, the higher the SIL. See the risk graph parameters below, where the consequences, frequency/exposure, possibility of avoidance and probability of occurrence are used to determine the required SIL.

Risk Parameters:
C1: Minor injury or damage
C2: Serious injury or one death, temporary serious damage
C3: Several deaths, long-term damage
C4: Many dead, catastrophic effects

Frequency/Exposure Time:
F1: Rare to quite often
F2: Frequent to continuous

Possibility of Avoidance:
P1: Avoidance possible
P2: Unavoidable, scarcely possible

Probability of Occurrence:
W1: Very low, rarely
W2: Low
W3: High, frequent

Safety Integrity Levels Required:
-: Tolerable risk, no safety requirements
a: No special safety requirements
b: A single E/E/PE is not sufficient
1: SIL 1
2: SIL 2
3: SIL 3
4: SIL 4

Depending on the SIL level to be achieved based on the risk reduction required, a device must achieve a low enough probability of failure and a high enough safe-failure fraction.
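To make the link between tolerable risk and required risk reduction concrete, the following Python sketch computes the risk-reduction factor a safety function would need. It is only an illustration: the function name and the example frequencies are hypothetical, and it assumes the severity of the harm is unchanged by the safety function, so that the ratio of risks reduces to a ratio of event frequencies.

```python
def required_rrf(unmitigated_per_year: float, tolerable_per_year: float) -> float:
    """Return the risk-reduction factor needed to reach the tolerable frequency."""
    if unmitigated_per_year <= 0 or tolerable_per_year <= 0:
        raise ValueError("frequencies must be positive")
    # A factor of 1 means the hazard is already tolerable without a safety function.
    return max(1.0, unmitigated_per_year / tolerable_per_year)


# Hypothetical example: a hazardous event expected once in 10 years,
# with a tolerable target of once in 5,000 years.
rrf = required_rrf(unmitigated_per_year=0.1, tolerable_per_year=0.0002)
max_pfd = 1.0 / rrf  # highest probability of failure on demand that still meets the target
print(f"required RRF = {rrf:.0f}, maximum PFD = {max_pfd:.4f}")
```

With these assumed figures the required factor is about 500, which means the safety function may fail on at most roughly one demand in 500.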

Probability of failure

Probability of failure comes in two flavours: probability of failure on demand (PFD) for safety functions that are only activated when required, and probability of failure per hour (PFH) for safety functions that are operating continuously. The lower the probability of failure, the higher the risk-reduction factor. The higher the risk-reduction factor, the higher the SIL achieved. (See the tables below for the figures related to PFD and PFH.)

SIL	PFD	PFD (power)	RRF
1	0.1-0.01	10⁻¹ - 10⁻²	10-100
2	0.01-0.001	10⁻² - 10⁻³	100-1000
3	0.001-0.0001	10⁻³ - 10⁻⁴	1000-10,000
4	0.0001-0.00001	10⁻⁴ - 10⁻⁵	10,000-100,000

Table 1: Probability of failure on demand

SIL	PFH	PFH (power)	RRF
1	0.00001-0.000001	10⁻⁵ - 10⁻⁶	100,000-1,000,000
2	0.000001-0.0000001	10⁻⁶ - 10⁻⁷	1,000,000-10,000,000
3	0.0000001-0.00000001	10⁻⁷ - 10⁻⁸	10,000,000-100,000,000
4	0.00000001-0.000000001	10⁻⁸ - 10⁻⁹	100,000,000-1,000,000,000

Table 2: Probability of failure per hour
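The two tables can also be read as a simple lookup. The sketch below encodes the bands from Table 1 and Table 2 in Python; the names `PFD_BANDS`, `PFH_BANDS` and `sil_for` are hypothetical, and the treatment of values that sit exactly on a band boundary is an assumption, since the tables themselves do not say which side is inclusive.

```python
# Bands transcribed from Table 1 (PFD) and Table 2 (PFH) as (lower, upper) bounds.
PFD_BANDS = {1: (1e-2, 1e-1), 2: (1e-3, 1e-2), 3: (1e-4, 1e-3), 4: (1e-5, 1e-4)}
PFH_BANDS = {1: (1e-6, 1e-5), 2: (1e-7, 1e-6), 3: (1e-8, 1e-7), 4: (1e-9, 1e-8)}


def sil_for(value, bands):
    """Return the SIL whose band contains value, or None if it lies outside every band."""
    for sil, (low, high) in bands.items():
        # Assumption: the lower bound is inclusive and the upper bound exclusive,
        # so a value exactly on a shared boundary is assigned the lower SIL.
        if low <= value < high:
            return sil
    return None


print(sil_for(0.002, PFD_BANDS))  # a demand-mode function with PFD = 0.002
print(sil_for(5e-8, PFH_BANDS))   # a continuous-mode function with PFH = 5e-8 per hour
print(sil_for(0.5, PFD_BANDS))    # too unreliable for any SIL band
```

Under these assumptions the three calls return 2, 3 and `None` respectively.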

Safe failure fraction

While the PFD and PFH tell us how likely a failure is to occur, the safe failure fraction (SFF) tells us what fraction of failures are either safe or, if dangerous, detected. A high SFF is achieved by increasing the diagnostics and reporting of the safety function. The Greek letter λ is used to denote the rate of failure per hour.

λsafe = failure rate leading to safe state
λdangerous = failure rate leading to dangerous state
λtotal = λdangerous + λsafe

This results in four types of failure rate, depending on whether the failure is detected or undetected: safe detected (λsd), safe undetected (λsu), dangerous detected (λdd) and dangerous undetected (λdu). Thus, SFF = 1 - λdu / λtotal. So, for the SFF to be as high as possible, failures have to be safe or detected. If all the failures were safe and/or detected, the SFF would be 1, or 100%.

Before the SFF can be used to determine the SIL, other factors have to be considered. The first is the hardware fault tolerance (HFT) of the device. Achieved through redundancy, an HFT of N means that N+1 faults are required before the safety function is lost. Secondly, devices are treated differently for SFF depending on their type. Type A devices are considered to be well defined and to have sufficient failure data from experience in the field. Type B devices are considered to have insufficient data and field experience. See the tables below for the figures related to SFF.
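As a rough illustration of how the SFF and the HFT combine, the Python sketch below computes the SFF from four failure rates (taken here to be safe detected, safe undetected, dangerous detected and dangerous undetected) and then looks up the highest permitted SIL in the Type B table (Table 4). The function names and the example failure rates are hypothetical, and treating each band's lower bound as inclusive is an assumption.

```python
def safe_failure_fraction(l_sd, l_su, l_dd, l_du):
    """SFF = 1 - λdu / λtotal, with all rates in failures per hour."""
    total = l_sd + l_su + l_dd + l_du
    if total <= 0:
        raise ValueError("total failure rate must be positive")
    return 1.0 - l_du / total


# Table 4 (Type B) as rows of (SFF lower bound, highest SIL for HFT 0, 1, 2).
# None stands for "Not allowed". Assumption: lower bounds are inclusive.
TYPE_B = [(0.99, (3, 4, 4)), (0.90, (2, 3, 4)), (0.60, (1, 2, 3)), (0.0, (None, 1, 2))]


def max_sil_type_b(sff, hft):
    """Return the highest SIL Table 4 allows for a Type B subsystem."""
    if hft not in (0, 1, 2):
        raise ValueError("Table 4 only covers HFT 0, 1 and 2")
    for lower, sils in TYPE_B:
        if sff >= lower:
            return sils[hft]
    raise ValueError("SFF must be between 0 and 1")


# Hypothetical failure rates per hour for a single device.
sff = safe_failure_fraction(l_sd=4e-7, l_su=1e-7, l_dd=4.5e-7, l_du=5e-8)
print(f"SFF = {sff:.1%}, max SIL at HFT 0 = {max_sil_type_b(sff, 0)}")
```

With these assumed rates, 5% of all failures are dangerous and undetected, giving an SFF of 95%. A single Type B device (HFT 0) is then limited to SIL 2, while a redundant pair (HFT 1) could reach SIL 3.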

SFF	HFT 0	HFT 1	HFT 2
<60%	SIL1	SIL2	SIL3
60% to 90%	SIL2	SIL3	SIL4
90% to 99%	SIL3	SIL4	SIL4
>99%	SIL3	SIL4	SIL4

Table 3: SFF for Type A subsystem (HFT = hardware fault tolerance)

SFF	HFT 0	HFT 1	HFT 2
<60%	Not allowed	SIL1	SIL2
60% to 90%	SIL1	SIL2	SIL3
90% to 99%	SIL2	SIL3	SIL4
>99%	SIL3	SIL4	SIL4

Table 4: SFF for Type B subsystem (HFT = hardware fault tolerance)

In summary, the tools are available to identify and analyse the risks associated with a system design and then implement the appropriate safety instrumented system to mitigate those risks and save lives and assets.

John O’Sullivan BE, Dip Phys Sci, CEng MIEI is the engineering director of Douglas Control and Automation. He has 20 years’ experience in the automation industry, focusing on the pharmaceutical, biotechnology and medical device sectors. He has developed design and test specifications for the regulated environment and project manages automation and safety projects for life science customers. O’Sullivan has consulted on the validation of certified failsafe, high availability systems.