Effective system safety and emergency management efforts require learning from both failure and success. Lessons learned are presented here, often illustrated through an accident or incident. In discussing these events, the intent is not to oversimplify the conditions that led to them or to place blame on individuals and organizations. Rarely is there only one identifiable cause leading to an accident. Accidents and incidents are usually the result of complex factors that include hardware, software, human interactions, procedures, and organizational influences. Readers are encouraged to review the full investigation reports referenced here to understand the often complex conditions that led to each accident discussed.
Rail Grinder Derailment in California
On November 9, 2006, a Harsco Track Technologies grinder derailed near Baxter, California. A grinder is a vehicle used to remove irregularities from the rail as part of regular maintenance on railroad track. The grinder was not in service at the time of the accident, but was making a 300-mile trip from Sparks, Nevada, to Tehachapi, California, to be used in a new location. This trip took the grinder over the Donner Pass in the Sierra Nevada Mountains. As the grinder descended from the Donner Pass summit, its speed increased. The operator attempted to apply the air brakes to slow the vehicle, but the brakes did not respond. The grinder derailed as it entered a curve in the tracks. Two employees were killed in the derailment. The National Transportation Safety Board found in its investigation that the air brake system was incapable of providing adequate braking prior to the accident. The NTSB stated that the condition of the brakes should have been evident from inspections or brake tests. The NTSB stated that the probable cause was ineffective maintenance, inspections, and testing of the brake system. Contributing to the accident was inadequate safety oversight of rail grinders by the Federal Railroad Administration (FRA). According to the NTSB, “FRA officials acknowledge that before the accident, the FRA provided little oversight of rail grinding equipment and that the nature of the FRA’s regulatory and enforcement authority was poorly understood.”
Lessons Learned: It is not enough to verify systems prior to implementing them. Mitigation measures must be regularly maintained and checked to assure that they will work when needed. Organizations should maintain records of the maintenance and testing, and audits should be conducted of those records.
U.S. National Transportation Safety Board, “Railroad Accident Brief,” NTSB/RAB-09/03, November 9, 2009.
Loss of Mars Spacecraft
The Phobos 1 spacecraft was launched on July 7, 1988, on a mission to conduct surface and atmospheric studies of Mars. The vehicle operated normally until routine attempts to communicate with the spacecraft failed on September 2, 1988, and the mission was lost. Examination of the failure showed that a ground control operator had omitted a single letter in a series of digital commands sent to the spacecraft. The computer misinterpreted this command and started a ground checkout test sequence, deactivating the attitude control thrusters. As a result, the spacecraft lost its lock on the Sun. Because the solar panels ended up pointed away from the Sun, the on-board batteries were eventually drained, and all power was lost. A lack of specifications for the human and software interface contributed to the failure. Additionally, error-checking functions had been turned off during the data transfer operation, so no checks were performed that could have caught the improper entry prior to critical operations.
Lessons Learned: Before initiating hazardous operations, computer systems should perform checks to ensure that they are in a safe state and functioning properly. Examples include checking safety-critical circuits, components, inhibits, interlocks, exception limits, safing logic, memory integrity, program loads, input values, and output values. The initial states of variables and interfaces should be understood. Software should also be able to discriminate between valid and invalid input for hazardous operations, and reject invalid information. Invalid commands should also be rejected based on checks performed.
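The checks described above can be sketched in code. The following is a minimal, hypothetical illustration of validating commands and verifying a safe system state before a hazardous operation; all command names, status fields, and limits are invented for illustration and are not drawn from the Phobos 1 system.

```python
# Hypothetical sketch of pre-operation safety checks. All names and
# limits here are illustrative assumptions, not from any real system.

VALID_COMMANDS = {"FIRE_THRUSTER", "SAFE_MODE", "BEGIN_CHECKOUT"}

def system_in_safe_state(status):
    """Verify inhibits, interlocks, and exception limits before proceeding."""
    return (
        status.get("interlock_closed") is True
        and status.get("inhibit_armed") is True
        # Exception limit on a monitored value (illustrative range):
        and 0.0 <= status.get("tank_pressure_kpa", -1.0) <= 800.0
    )

def validate_command(command):
    """Reject anything that is not a known, well-formed command."""
    return command in VALID_COMMANDS

def execute(command, status):
    """Run validity and safe-state checks; reject rather than guess."""
    if not validate_command(command):
        return "REJECTED: invalid command"
    if not system_in_safe_state(status):
        return "REJECTED: system not in safe state"
    return "EXECUTING: " + command
```

Note that a command with a single omitted letter, as in the Phobos 1 loss, would fail the validity check rather than being misinterpreted as a different operation.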
Harland, D.M., and Lorenz, R.D., Space Systems Failures: Disasters and Rescues of Satellites, Rockets and Space Probes, Praxis Publishing, 2005.
Aircraft Near Miss in Calgary
On March 2, 2010, an aircraft crossed the hold line on one taxiway as another aircraft passed overhead at the Calgary International Airport in Alberta, Canada. The Transportation Safety Board of Canada (TSB) determined in its investigation that multiple factors led to this runway incursion event. The airport controller lost track of the aircraft because of long delays between the time the aircraft arrived at the taxiway and the issuance of take-off clearance. The tower was at reduced staffing levels, and the tower coordinator position was vacant, so no one was present to check on the position of the aircraft. Calgary International Airport was equipped with Airport Surface Detection Equipment (ASDE), which included a real-time display in the tower of aircraft and other vehicle traffic operating on airport maneuvering areas. The ASDE display did not show identification tags of the departing aircraft, which led the controller to lose the position of certain aircraft. The ASDE included a software feature known as the Runway Incursion Monitoring and Collision Avoidance System (RIMCAS) to monitor movements and identify conflicts. However, the RIMCAS feature was not enabled, so the controller was not alerted to the potential for incursion. The RIMCAS had been disabled because the controllers experienced a large number of nuisance alarms associated with the system. Disabling the RIMCAS removed an opportunity for the controller to be alerted to the incursion, according to the TSB.
Lessons Learned: If too many alarms are received from a safety system, operators may become desensitized or distracted and may fail to recognize a genuine hazard. In some cases, operators may disable key safety systems altogether to silence the nuisance alarms, removing the protection those systems provide. The potential for nuisance alarms must therefore be considered when implementing hazard controls.
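One common way to reduce nuisance alarms without disabling a monitoring system is to require the alarm condition to persist before alerting the operator, so that momentary false detections are suppressed. The sketch below is a hypothetical illustration of that idea; the persistence threshold and sampling model are assumptions for illustration, not features of the actual RIMCAS.

```python
# Hypothetical sketch: suppress nuisance alarms by requiring the alarm
# condition to persist for several consecutive samples. The threshold
# is illustrative; a real system would tune it against missed-alarm risk.

def filter_alarms(samples, persistence=3):
    """Given a sequence of booleans (True = conflict detected this sample),
    return the sample indices at which an alarm is raised. An alarm fires
    only when the condition has held for `persistence` consecutive samples,
    so isolated single-sample detections are suppressed."""
    alarms = []
    run = 0
    for i, in_conflict in enumerate(samples):
        run = run + 1 if in_conflict else 0
        if run == persistence:
            alarms.append(i)
    return alarms
```

The trade-off is explicit: a longer persistence requirement filters more false detections but delays alerting on real conflicts, so the threshold must be chosen with the hazard's timeline in mind.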
Transportation Safety Board of Canada, “Runway Incursion, NAV CANADA Calgary Tower, Calgary International Airport, Alberta, 02 March 2010,” Report Number A10W0040, October 21, 2010.