Effective system safety and emergency management efforts require learning from both failure and success. Lessons learned will be presented here, often illustrated through an accident or incident. Note that in discussing these events, the intent is not to oversimplify the conditions that led to them or to place blame on individuals or organizations. Rarely is there only one identifiable cause of an accident. Accidents and incidents are usually the result of complex factors that include hardware, software, human interactions, procedures, and organizational influences. Readers are encouraged to review the full investigation reports referenced here to understand the often complex conditions that led to each accident discussed.
Natural Gas Pipeline Rupture in California
On September 9, 2010, a 30-inch-diameter natural gas pipeline owned and operated by Pacific Gas and Electric Company (PG&E) ruptured in a residential area in San Bruno, California. The released natural gas, estimated at 47.6 million standard cubic feet, ignited. Eight people were killed and many were injured in the resulting fire, and 38 homes were destroyed. The U.S. National Transportation Safety Board (NTSB) found in its investigation that the rupture originated in a partially welded seam of a short pipe section, the result of a weld defect introduced during installation in 1956. The weld defect reduced the strength of the pipe section, making it susceptible to ductile crack growth and fatigue crack growth under internal gas pressure. The NTSB ruled that the probable cause was (1) inadequate quality assurance and quality control in 1956, which allowed the installation of a substandard and poorly welded pipe section with a visible seam weld flaw, and (2) an inadequate pipeline integrity management program, which failed to detect the defective pipe section and repair or remove it.

The NTSB also stated that PG&E's flawed emergency response procedures and its delay in isolating the rupture to stop the flow of gas contributed to the severity of the accident. The NTSB found that PG&E lacked detailed and comprehensive procedures, including troubleshooting protocols and checklists, for responding to a large-scale emergency. PG&E also did not have a defined command structure with a single point of leadership. Under PG&E's emergency response plan, Supervisory Control and Data Acquisition (SCADA) center personnel were responsible for pipeline monitoring and operations, while PG&E dispatch center personnel were responsible for sending first responders. This meant that personnel at two different locations had to coordinate with each other to direct operations during the emergency.
In addition, although emergency responders arrived immediately, it took PG&E 95 minutes to stop the flow of gas; the NTSB stated that this delay was “excessive.” The NTSB report noted that SCADA staff had difficulties determining the exact location of the rupture, which may have contributed to the delay. As stated by NTSB, “The PG&E SCADA system lacked several tools that could have assisted the staff in recognizing and pinpointing the location of the rupture, such as real-time leak or line break detection models, and closely spaced flow and pressure transmitters.” The NTSB also stated that the use of automatic shutoff valves and remote control valves along the ruptured line would have significantly reduced the amount of time taken to stop the flow of gas and to isolate the rupture.
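To make concrete the kind of tooling the NTSB found missing, the sketch below shows a minimal rate-of-pressure-drop alarm of the sort a line-break detection model might use: with closely spaced transmitters, the transmitters that alarm bracket the rupture and help pinpoint its location. The function name, transmitter identifiers, and threshold are invented for illustration and are not taken from any real SCADA product.

```python
# Illustrative sketch only: a simple line-break alarm that flags any
# pressure transmitter whose reading falls faster than a threshold.
# All names, identifiers, and thresholds here are hypothetical.

def line_break_alarms(readings, drop_threshold_kpa_per_min=50.0):
    """readings: {transmitter_id: [(minute, pressure_kpa), ...]}, time-ordered.

    Returns the ids of transmitters whose pressure dropped faster than
    the threshold between any two consecutive readings.
    """
    alarms = []
    for tid, series in readings.items():
        for (t0, p0), (t1, p1) in zip(series, series[1:]):
            drop_rate = (p0 - p1) / (t1 - t0)  # kPa lost per minute
            if drop_rate > drop_threshold_kpa_per_min:
                alarms.append(tid)
                break  # one alarm per transmitter is enough
    return alarms

# A rupture near the second transmitter shows up as a rapid pressure
# drop there, while an upstream transmitter decays only slowly.
readings = {
    "XMTR-A": [(0, 2600.0), (1, 2590.0), (2, 2580.0)],
    "XMTR-B": [(0, 2600.0), (1, 2300.0), (2, 1900.0)],
}
print(line_break_alarms(readings))
```

Real detection models are far more sophisticated (accounting for transients, compressor operations, and sensor noise), but even this simple rate check illustrates why closely spaced, frequently sampled transmitters matter: the denser the instrumentation, the more precisely the alarming set localizes the break.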
Lessons Learned: Organizations must not focus only on preventing failures; they must also be prepared to respond to incidents when they occur. Organizations must have a comprehensive emergency management program as part of a broader safety effort. This program should ensure that procedures are in place instructing employees on what to do in an emergency and that a command structure exists to assure that response efforts are coordinated. Plans must be in place for emergency response, and those plans must be practiced to assure that the response will be timely when needed.
U.S. National Transportation Safety Board, “Pacific Gas and Electric Company Natural Gas Transmission Pipeline Rupture and Fire, San Bruno, California, September 9, 2010,” Pipeline Accident Report NTSB/PAR-11/01, August 30, 2011.
Rail Grinder Derailment in California
On November 9, 2006, a Harsco Track Technologies grinder derailed near Baxter, California. A grinder is a vehicle used to remove irregularities from the rail as part of regular maintenance on railroad track. The grinder was not in service at the time of the accident but was making a 300-mile trip from Sparks, Nevada, to Tehachapi, California, to be used in a new location. This trip took the grinder over Donner Pass in the Sierra Nevada Mountains. As the grinder descended from the Donner Pass summit, its speed increased. The operator attempted to apply the air brakes to slow the vehicle, but the brakes did not respond. The grinder derailed as it entered a curve in the tracks. Two employees were killed in the derailment. The National Transportation Safety Board found in its investigation that the air brake system had been incapable of providing adequate braking prior to the accident, and that the condition of the brakes should have been evident from inspections or brake tests. The NTSB stated that the probable cause was ineffective maintenance, inspection, and testing of the brake system. Contributing to the accident was inadequate safety oversight of rail grinders by the Federal Railroad Administration (FRA). According to the NTSB, “FRA officials acknowledge that before the accident, the FRA provided little oversight of rail grinding equipment and that the nature of the FRA’s regulatory and enforcement authority was poorly understood.”
Lessons Learned: It is not enough to verify systems prior to implementing them. Mitigation measures must be regularly maintained and checked to assure that they will work when needed. Organizations should maintain records of the maintenance and testing, and audits should be conducted of those records.
U.S. National Transportation Safety Board, “Railroad Accident Brief,” NTSB/RAB-09/03, November 9, 2009.
Loss of Mars Spacecraft
The Phobos 1 spacecraft was launched on July 7, 1988, on a mission to conduct surface and atmospheric studies of Mars. The vehicle operated normally until routine attempts to communicate with the spacecraft failed on September 2, 1988, and the mission was lost. Examination of the failure showed that a ground control operator had omitted a single letter in a series of digital commands sent to the spacecraft. The computer misinterpreted the malformed command and started a ground checkout test sequence, deactivating the attitude control thrusters. As a result, the spacecraft lost its lock on the Sun. Because the solar panels ended up pointed away from the Sun, the on-board batteries were eventually drained and all power was lost. A lack of specifications for the human and software interface contributed to the failure. Additionally, error-checking functions had been turned off during the data transfer operation, so no checks were performed that could have caught the improper entry prior to critical operations.
Lessons Learned: Before initiating hazardous operations, computer systems should perform checks to ensure that they are in a safe state and functioning properly. Examples include checking safety-critical circuits, components, inhibits, interlocks, exception limits, safing logic, memory integrity, program loads, input values, and output values. The initial states of variables and interfaces should be understood. Software should also be able to discriminate between valid and invalid input for hazardous operations, rejecting invalid information and invalid commands based on the checks performed.
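The two lines of defense described above, safe-state checks before hazardous operations and rejection of commands that fail validation, can be sketched as follows. This is a minimal illustration, not flight software: the command names, check functions, and state fields are all invented for the example. Note how a one-letter omission of the kind that doomed Phobos 1 is rejected outright rather than misinterpreted.

```python
# Hypothetical sketch of pre-operation safing checks and command
# validation. All names and thresholds here are illustrative.

VALID_COMMANDS = {"OPEN_VALVE", "CLOSE_VALVE", "FIRE_THRUSTER", "SAFE_MODE"}

# Each check inspects the system state and returns True if that
# aspect of the system is safe for hazardous operations.
SAFE_STATE_CHECKS = {
    "interlock_engaged": lambda s: s["interlock"] is True,
    "pressure_in_limits": lambda s: 0.0 <= s["pressure_kpa"] <= 500.0,
    "memory_integrity": lambda s: s["memory_crc_ok"],
}

def system_is_safe(state):
    """Run every safe-state check; all must pass before proceeding."""
    return all(check(state) for check in SAFE_STATE_CHECKS.values())

def validate_command(cmd):
    """Accept only commands on the explicit whitelist; anything
    malformed (e.g., a one-letter typo) is invalid, never guessed at."""
    return cmd in VALID_COMMANDS

def execute_hazardous(cmd, state):
    if not validate_command(cmd):
        return "REJECTED: unknown command"
    if not system_is_safe(state):
        return "REJECTED: unsafe state"
    return f"EXECUTED: {cmd}"

state = {"interlock": True, "pressure_kpa": 101.3, "memory_crc_ok": True}
print(execute_hazardous("FIRE_THRUSTER", state))
print(execute_hazardous("FIRE_THRUSTR", state))   # one letter omitted
```

The design choice worth noting is the whitelist: an unrecognized command is rejected rather than mapped onto the nearest match, and the safe-state gate runs on every hazardous command, so disabling one check (as Phobos 1's ground system did during data transfer) cannot be silently compensated for elsewhere.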
Harland, D.M., and Lorenz, R.D., Space Systems Failures: Disasters and Rescues of Satellites, Rockets and Space Probes, Praxis Publishing, 2005.