Current issues, opinions, and mini-tutorials on system safety.
Readings in System Safety and Upcoming Events are also provided on this page.
May 13, 2012
Terry Hardy
In the book Being Wrong: Adventures in the Margin of Error, author Kathryn Schulz discusses how hard it is to admit we make mistakes, even though we make mistakes all the time. Schulz states that, generally, instead of admitting to and analyzing our errors, we figure out ways to lay blame for them on someone else. Alternatively, we come up with elaborate explanations describing how we might have been wrong, but only because of some circumstance (we were “almost” right). And we are quick to forget our mistakes, while remembering vividly when we were right. Schulz points out that this failure to acknowledge and learn from our errors is a cultural failing. She states that this “error blindness” explains a lot of our inability to learn from failure.
Of course, not all mistakes are bad, as we may gain insight from small failures. But some mistakes, such as those that result in a loss of lives in an accident, are clearly unacceptable. This is why in system safety we must always be aware that, no matter how good a job we do in designing our systems, or how good we think our analyses are, we will never be good enough, and often we will not get our engineering and our safety analyses right. Our analyses will be flawed, our designs will incorporate unrealistic assumptions, and our products will fail. Software will be designed using invalid or misinterpreted requirements. Our components will fail before we expect them to. The operational environments will be unexpected. And, because building complex technologies is a human endeavor, people will make mistakes at every step of the development process through intentional and unintentional actions. Therefore, engineers and managers should try to design in safety whenever possible. And even when organizations take the approach of designing in safety through inherently safer designs, those organizations should anticipate failure and prepare emergency systems to respond to those bad days.
Most importantly, we must implement and embrace systems and approaches to learn from failure and problems. A failure to learn from our mistakes may lead to accidents (or to a repeat of a previous accident). For example, on December 19, 2005, Chalk’s Ocean Airways flight 101, an amphibious airplane operated by Flying Boat, Inc. (doing business as Chalk’s Ocean Airways), crashed into the shipping channel adjacent to the Port of Miami, Florida, shortly after takeoff. All 20 people aboard, including 18 passengers and 2 crew members, were killed in the crash. The NTSB found that the airplane’s right wing had separated during flight. The probable root causes were 1) the failure of the Chalk’s Ocean Airways maintenance program to identify and properly repair fatigue cracks in the right wing and 2) the failure of the Federal Aviation Administration (FAA) to detect and correct deficiencies in the company’s maintenance program. The NTSB report noted that flight logs had documented numerous fuel leak discrepancies on the airplane that crashed. These fuel leaks were always resolved, but would then mysteriously reappear, according to the report. The NTSB noted that the fuel leaks were near the area where the right wing separated, and were probably indicators of structural damage that ultimately led to the loss of the airplane. The NTSB also stated that pilots had reported skin cracks. The accident report stated that had an in-depth investigation of the cracks and the fuel leaks been conducted, the operator probably would have discovered significant structural damage inside of the wing. The report stated, “Correction of that structural damage not only would have corrected the leaks but also would have prevented the accident.”
In the words of Kathryn Schulz, “If you really want to be right you have to start by acknowledging your fallibility, deliberately seeking out your mistakes, and figuring out what caused you to make them.” In other words, we must accept that it is likely that we could be in error, and be open to what may have led to such errors. This means that management must encourage chronic unease, where there is a feeling throughout the organization that the next accident is just around the corner. Organizations must develop a reporting culture where problems are recorded and used in decision making. And organizations should implement effective closed-loop anomaly reporting and corrective action systems. In short, we have to have a healthy respect for the dangers and risks of the systems we design, and understand that we too can be wrong.
Schulz, K., Being Wrong: Adventures in the Margin of Error, Ecco, 2010.
U.S. National Transportation Safety Board, In-flight Separation of Right Wing, Flying Boat, Inc. (doing business as Chalk's Ocean Airways) Flight 101, Grumman Turbo Mallard (G-73T), N2969, Port of Miami, Florida, December 19, 2005. Aircraft Accident Report NTSB/AAR-07/04, May 30, 2007.
April 3, 2012
Terry Hardy
At the IET System Safety Conference in Birmingham, England, in September 2011, Nick Holmes-Mackie presented an excellent paper that included some common criticisms of system safety and system safety practitioners. I thought it might be worth examining some of these criticisms to see if they are valid and to determine how we might address these concerns. Note that the phrasing of the criticisms and the responses and discussions that follow are mine, not Mr. Holmes-Mackie’s, and I encourage readers to also read his paper for a different perspective.
System safety is an expensive waste of effort and does not deliver value.
When system safety methods are used correctly they can provide tremendous benefits to an organization. Through a systematic process we learn about the system in ways that other approaches and reviews do not explore, and most importantly learn about ways in which that system has the potential to harm people, property, and the environment. However, poor system safety analyses can lead us down the wrong path, with the result being that we use precious resources on low risk activities while missing or ignoring higher risks. System safety is generally not cheap to implement for complex systems. Therefore, in some organizations, this criticism could be valid if system safety is inappropriately or poorly applied. One reason for a poor system safety effort may be that management does not understand what system safety does and therefore directs the efforts poorly. For example, if the system safety personnel are isolated from the design effort or system safety efforts are started too late in development to make a difference, then resources could be spent on system safety without apparent value.
System safety engineers often concentrate on the wrong things and repeat work done elsewhere.
The most effective system safety practitioners are those who look at the system from a different perspective than the engineering or project management groups. These system safety engineers look for the ways that things can go wrong, not just at parts that can fail, and they focus much of their efforts on interactions between subsystems. However, if management does not understand the purpose of system safety, then system safety efforts may become a “check the box” activity. If the system safety personnel are brought on late in the project, then they may have no choice but to perform these assurance activities because the design is too far along to change, adding credence to the criticism. System safety personnel do not help this perception if they only use old checklists and old analyses to perform the safety analysis on new systems. In addition, safety personnel may only focus on their area of expertise or pay attention to the problem others are already addressing while ignoring other critical areas (such as interfaces), and therefore may give the appearance of repeating the engineering function.
System safety engineers only say “no”.
Safety personnel in general have a reputation for being the ones to inform project management and engineering personnel of problems. System safety engineers in particular are known for telling management and engineering why they can’t do something in a design or operation. Some of this reputation comes naturally from the job system safety personnel are asked to do. Good system safety efforts look at systems from the viewpoint that the operation is unsafe unless proven otherwise. This means that system safety personnel should always be looking for things that can go wrong. Unfortunately, this often means that decision makers only hear “You can’t do that because…” and the system safety engineer becomes a naysayer in the eyes of management. If system safety personnel do not help look for ways the project can be successful and present alternatives, then this could be a valid criticism.
From the above discussion it appears that these criticisms could have validity, but they must be considered in the context of the organization. It could become a self-fulfilling prophecy that system safety does not add value if management does not apply the resources or does not properly direct the effort. However, system safety practitioners also have a responsibility to counter these criticisms, and to evaluate if they might be contributing to these perceptions. System safety practitioners should be prepared to teach the benefits of what they do and why system safety activities are so important. It is probably not an overstatement to say that few senior managers truly understand system safety approaches. Therefore, it is important to look for ways to show the value in what we do, both informally through our everyday interactions and formally through training. Both in daily discussions and formal training, we should be prepared to tell stories of things that have gone wrong. We should be armed with descriptions of accidents and their causes, how those accidents could easily have been in our organization, and how a systematic safety effort could have helped prevent those accidents. It is through our knowledge and sharing stories of past accidents and their causes that decision makers and coworkers become aware of the importance of system safety efforts.
In addition to educating our managers and peers, some specific suggestions for system safety practitioners include the following, some of which are covered in the Holmes-Mackie paper:
System safety efforts and system safety personnel are often on the receiving end of criticisms about our discipline, and for various reasons some of those criticisms may be valid. We must both be aware of the criticisms and be prepared to counter them through education and our own personal work habits.
Holmes-Mackie, N.W., “Safety Assurance; Where Does the Effort Go? Where Should It Go?”, 6th IET International System Safety Conference, 20-22 September, 2011.
What Went Wrong?
Case Histories of Process Plant Disasters and How They Could Have Been Avoided
By Trevor Kletz
5th edition
Butterworth-Heinemann
640 pages
Trevor Kletz brings his experience of nearly four decades in the chemical process industry to his new edition of What Went Wrong?. The subtitle really says it all: this book provides hundreds of case histories of the way things can go awry in chemical processing. The book allows the reader to learn from what went wrong in previous disasters and incidents by providing illustrations and examples. Chapters are included on subsystems such as storage tanks, stacks, piping, tank cars, and so on. Hazards related to leaks, static electricity, materials, and explosions, to name but a few, are covered. Change management is a major part of the book: Kletz discusses changes in organization, hardware, and procedures and shows how poor management of these changes has contributed to previous accidents. Poor communication is also included as a hazard cause, and human error and software are briefly addressed. Much can be learned from What Went Wrong? whether one is a member of the chemical process community or not. My only complaint is that few of the examples have been referenced, and it would have been useful to be able to review accident and incident reports related to the case studies discussed. However, this book is very useful to help learn from previous incidents, and practitioners should use this book to help them learn in order to avoid future catastrophes, especially those in the chemical industry.
Australian System Safety Conference 2012
Brisbane, QLD, Australia, May 23-25, 2012
ASSE Safety 2012
Denver, CO, USA, June 3-6, 2012
30th International System Safety Conference
Atlanta, GA, USA, August 6-10, 2012
The 7th International IET System Safety Conference, incorporating the Cyber Security Conference 2012
15 - 18 October 2012, Radisson Blu Hotel, Edinburgh, UK
http://conferences.theiet.org/system-safety/index.cfm