Effective system safety and emergency management efforts require learning from both failure and success. Lessons learned are presented here, often illustrated through an accident or incident. In discussing these events, the intent is not to oversimplify the conditions that led to them or to place blame on individuals and organizations. Rarely is there a single identifiable cause of an accident; accidents and incidents are usually the result of complex factors that include hardware, software, human interactions, procedures, and organizational influences. Readers are encouraged to review the full investigation reports referenced here to understand the often complex conditions that led to each accident discussed.
Near Miss in New Zealand
On May 27, 2010, the oil-field support vessel Marsol Pride was conducting underwater operations in the Tui oil and gas field off the west coast of New Zealand when its carbon dioxide fire-smothering system inadvertently activated, resulting in an uncontrolled release of carbon dioxide gas into the engine room. As reported by the Transport Accident Investigation Commission (TAIC) of New Zealand, this was a serious event because carbon dioxide gas displaces air by design and therefore makes the space unfit to support human life. Such a release can also immobilize a ship’s propulsion system; in this case the two main propulsion engines shut down from air starvation. Fortunately, an automatic alarm in the engine room activated as designed to warn personnel to leave the area before the carbon dioxide was released. The duty engineer left the engine room to investigate the problem when he heard the alarm and therefore was not harmed. In addition, the vessel was not damaged when the propulsion system shut down. The TAIC found that two failures combined to cause the inadvertent release: the pilot cylinder had been leaking, possibly because foreign debris was trapped between the valve and the valve seat, and the booster valve was leaking, likely because of poor maintenance. The TAIC stated that the requirement to inspect these valves every five years was inadequate, and it noted that critical systems must be periodically tested; no requirement existed to test the system. As stated in the report, “Any component in a fixed CO2 gas fire fighting installation the failure of which can cause serious harm or immobilise [sic] a vessel should be inspected and tested often enough to detect any deterioration in performance so that remedial action can be taken to avert a failure.”
Lessons Learned: It is not enough to verify systems before placing them in service. Mitigation measures must be regularly maintained and checked to ensure that they will work when needed. Organizations should keep records of that maintenance and testing, and those records should be audited.
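To make the last point concrete, below is a minimal sketch of what an automated check of such records might look like, assuming each record simply stores a component name, the date of its last test, and a maximum allowed interval. The record fields, intervals, and the audit_overdue_tests helper are illustrative assumptions, not drawn from the TAIC report or any maritime standard.

```python
# Minimal, hypothetical sketch of auditing maintenance/test records for
# safety-critical components. Record fields and intervals are assumptions,
# not drawn from the TAIC report or any maritime standard.
from datetime import date

# Each record: component name, date of last inspection/test, and the
# maximum allowed interval (in days) before the next one is due.
test_records = [
    {"component": "CO2 pilot cylinder valve", "last_test": date(2005, 3, 1), "interval_days": 365},
    {"component": "CO2 booster valve", "last_test": date(2009, 11, 15), "interval_days": 365},
]


def audit_overdue_tests(records, today):
    """Return the records whose last test is older than the allowed interval."""
    return [r for r in records if (today - r["last_test"]).days > r["interval_days"]]


if __name__ == "__main__":
    audit_date = date(2010, 5, 27)
    for record in audit_overdue_tests(test_records, audit_date):
        overdue = (audit_date - record["last_test"]).days - record["interval_days"]
        print(f"{record['component']}: overdue by {overdue} days")
```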
Transport Accident Investigation Commission (New Zealand), “Marsol Pride, uncontrolled release of fire-extinguishing gas into engine room, Tui oil and gas field, 27 May 2010,” Marine Inquiry Report 10-203, August 2011.
Aircraft Accident in the Dominican Republic
On February 6, 1996, Birgenair flight 301, a chartered Boeing 757-225, crashed into the sea shortly after takeoff from Gregorio Luperón International Airport in Puerto Plata, Dominican Republic. All 189 persons on board died in the crash. During the takeoff roll the captain noted that his Air Speed Indicator (ASI) was not working properly, but he decided to continue the flight because the co-pilot’s ASI was functional. While the aircraft was climbing, the captain’s ASI read 350 knots, causing the autopilot and autothrottle to react by increasing the pitch-up attitude and reducing power to slow the aircraft (the actual speed at the time was about 220 knots). The crew then began to receive contradictory warnings from the flight control system: rudder and excessive-airspeed advisories followed by a stick shaker warning. The crew finally realized that the ASI readings were unreliable and that the autopilot was slowing the aircraft toward a stall. They disconnected the autopilot and applied full thrust, but these actions were not enough to prevent the airplane’s impact with the water. The incorrect airspeed readings are believed to have resulted from an obstructed Pitot tube. The aircraft had been left outdoors for at least 20 days before the flight, and it is believed that in that time an insect known as the black and yellow mud dauber wasp nested inside the captain’s Pitot tube, preventing it from functioning properly. Without valid Pitot data, the flight computers could not calculate an accurate airspeed. The accident report stated that the probable cause was the crew’s failure to recognize the activation of the stick shaker as a warning of imminent entry into a stall, and their failure to execute the procedures for recovery from the onset of loss of control. The nonfunctioning Pitot tube was identified as a contributing factor.
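The sequence above illustrates how automation acting on a single faulty sensor can steer an aircraft toward a hazardous state even while other, healthy sensors disagree. The sketch below is purely illustrative and is not based on Boeing 757 avionics; it shows one common defensive pattern, cross-checking redundant airspeed sources and refusing to let automation act on readings that disagree. The function name, the sources, and the disagreement threshold are all assumptions.

```python
# Illustrative only: cross-checking redundant airspeed sources before
# letting automation act on them. Not representative of actual Boeing 757
# avionics; names and thresholds are hypothetical.

DISAGREEMENT_LIMIT_KTS = 20.0  # assumed maximum tolerated spread between sources


def validated_airspeed(captain_asi_kts, copilot_asi_kts, standby_asi_kts):
    """Return an airspeed the automation may act on, or None if the
    redundant sources disagree and the reading should be flagged as
    unreliable (handing control back to the crew)."""
    readings = [captain_asi_kts, copilot_asi_kts, standby_asi_kts]
    if max(readings) - min(readings) > DISAGREEMENT_LIMIT_KTS:
        return None  # sources disagree: do not let the autopilot chase any one of them
    return sorted(readings)[1]  # otherwise use the middle value


if __name__ == "__main__":
    # One blocked Pitot tube produces a false 350-knot reading while the
    # aircraft is actually flying at about 220 knots.
    print(validated_airspeed(350.0, 222.0, 218.0))  # -> None (flag as unreliable)
    print(validated_airspeed(221.0, 222.0, 218.0))  # -> 221.0
```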
Lessons Learned: The natural environment can be the source of problems that must be considered, including humidity, wind, radiation, temperature, lightning, and even flora and fauna. For example, during inspection of the Space Shuttle Discovery prior to the launch of STS-70, technicians discovered holes in the external tank insulation created by woodpeckers. Although most natural environments may be known, the ability of systems to withstand them may be overestimated.
U.S. National Transportation Safety Board, Safety Recommendation A-96-141, November 15, 1996.
Loss of Propulsion on the USS Yorktown
In September 1997 the USS Yorktown, a Ticonderoga-class cruiser of the United States Navy, experienced a system failure that left it without propulsion for more than two hours. The ship was “dead in the water,” unable to move until the problem could be resolved. At the time of the incident the ship was testing new software, the Smart Ship system, intended to automate tasks that had previously been performed manually. The U.S. Navy stated that bad data had been entered through the user interface: the input contained a zero, which was an invalid entry for that data field in the Remote Data Base Manager program, and the system was not designed to catch the bad input. As a result, a divide-by-zero error occurred that crashed the entire shipboard network. After the crash the ship lost control of the propulsion system and was unable to move. Schedule and cost pressures also played a role: the system had been rushed into service without prototyping or full system testing, and it had been designed without a backup should such a failure occur.
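The reported failure mode, a zero entered into a field that the software later used in a division, is a classic input-validation problem. The sketch below is a generic illustration and is not based on the actual Smart Ship or Remote Data Base Manager code; the field names and validation rule are hypothetical. It shows how rejecting an invalid value at the interface boundary keeps a single bad entry from propagating as an unhandled divide-by-zero.

```python
# Generic illustration of validating operator input at the boundary rather
# than letting a zero propagate into a later division. Not based on the
# actual Smart Ship / Remote Data Base Manager code; field names and the
# validation rule are hypothetical.


class InvalidFieldError(ValueError):
    """Raised when an operator entry fails validation."""


def parse_positive_float(field_name, raw_value):
    """Convert an operator entry to a float, rejecting zero, negative, and
    non-numeric values instead of passing them downstream."""
    try:
        value = float(raw_value)
    except ValueError:
        raise InvalidFieldError(f"{field_name}: '{raw_value}' is not a number")
    if value <= 0.0:
        raise InvalidFieldError(f"{field_name}: value must be greater than zero")
    return value


def endurance_hours(fuel_remaining_gal, consumption_gal_per_hr):
    # Because consumption is validated above, this division cannot raise
    # ZeroDivisionError.
    return fuel_remaining_gal / consumption_gal_per_hr


if __name__ == "__main__":
    try:
        rate = parse_positive_float("fuel consumption (gal/hr)", "0")
    except InvalidFieldError as err:
        print(f"Rejected entry: {err}")  # the bad input stops here, not in the network
    else:
        print(endurance_hours(12000.0, rate))
```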
Lessons Learned: Using unproven or inadequately tested computing technologies can introduce unforeseen risks. Engineers and managers are prone to overestimate technology readiness, and organizations may be inclined to overstate it; as a result, decision makers may misunderstand the potential for an accident or incident. Automating a task that was formerly performed manually is of particular concern, because the nature of the hazard can change dramatically with the increased complexity and the change in the operator’s role: instead of performing hands-on activities, operators may be tasked with monitoring and communicating information.
Slabodkin, G., “Software glitches leave Navy Smart Ship dead in the water,” Government Computer News, July 13, 1998.
“Sunk by Windows NT,” Wired, July 24, 1998.