No matter how much tedious testing engineers do, or how many sleepless nights they spend coding until dawn, the result can still be a bug that causes a complete failure. Software errors cost the United States economy billions of dollars annually in lost productivity, rework and actual damage.
The common reasons behind any software failure are:
- Flawed architecture definition and low-level design.
- A forced schedule or milestone dates set without a substantial amount of data and analysis.
- Failing to account and adjust for requirement growth.
- Integrating excessive personnel to achieve unrealistic schedule compression.
- Intuition-based or emotional stakeholder negotiation.
- Miscommunication, egos and negative attitudes.
The following are major software failures that led to embarrassment, massive financial loss, and even death. If you have any suggestions or honorable mentions, bring them on.
12. British Passport System (1999)
The United Kingdom Passport Agency introduced a new computer system, which failed to issue passports on time to over half a million citizens. Later, the agency paid millions in compensation, staff overtime and umbrellas for people queuing in the rain for passports.
Cost: $14 million
Reason: The agency rolled out its software without proper testing and without training its employees on the new system. In addition, a new law (introduced at the same time) required all children under 16 travelling abroad to have a passport. This resulted in a huge spike in passport demand, overloading the immature software system.
11. Mariner 1
The Mariner 1 spacecraft (1962), headed for Venus, diverted from its intended path and was destroyed 293 seconds after liftoff. The mission was completed by Mariner 2, which launched 5 weeks later.
Cost: $19 million
Reason: This was a combination of two failures: an antenna hardware failure and an onboard guidance system software failure. The guidance antenna performed below its specification, so the spacecraft had to rely on its onboard guidance system, which had a bug in it.
A programmer incorrectly transcribed a formula into computer code, missing a single overbar, which denoted the nth smoothed value of the time derivative of a radius R. Without this smoothing function, the software treated normal variations in velocity as if they were serious, causing erratic corrections that sent the spacecraft off course.
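To see why a missing smoothing step matters, here is a minimal sketch. It is not the Mariner guidance formula; the moving average, the sample values and the function names are all assumptions chosen for illustration. It compares a raw rate of change with a smoothed one when the measured radius carries small jitter.

```python
def finite_differences(samples, dt):
    """Raw rate of change between consecutive samples."""
    return [(b - a) / dt for a, b in zip(samples, samples[1:])]


def moving_average(values, window):
    """Average each value with up to window - 1 preceding values."""
    smoothed = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


# Hypothetical data: radius grows 100 units per second, with +/-5 units of jitter.
dt = 1.0
radius = [100.0 * t + (5.0 if t % 2 else -5.0) for t in range(20)]

raw_rate = finite_differences(radius, dt)
smoothed_rate = moving_average(raw_rate, window=4)

# Spread of each signal once the averaging window is full.
raw_spread = max(raw_rate) - min(raw_rate)
smooth_spread = max(smoothed_rate[3:]) - min(smoothed_rate[3:])
print(raw_spread, smooth_spread)
```

In this toy data the raw rate swings between 90 and 110 while the smoothed rate stays at 100. A controller fed the raw values would keep reacting to noise.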
10. Mydoom
Mydoom is a computer worm, first seen on January 26, 2004. The next day, SCO Group offered a $250,000 reward for information about the worm's creator. According to MessageLabs, at that time, one in every 12 emails carried this virus.
Cost: $38 billion
Reason: The virus can create a backdoor in the operating system, letting unauthorized users access your personal data. It can also spoof emails, making it very difficult to track the source. Like other email worms, it searches the infected machine for email contacts and mails itself to them, and it also sends requests to search engines.
9. Hartford Coliseum Collapse
The Hartford Civic Center Coliseum collapsed on January 18, 1978, just hours after nearly 5,000 spectators had left the building. The steel-latticed roof collapsed under the weight of snow.
Cost: $70 million + $20 million damage to the local economy.
Reason: There were many conflicting accounts of the failure, including design flaws, construction errors and programming errors. The CAD programmer modeled the coliseum incorrectly, assuming that the roof supports would only face pure compression.
Also, the computer model assumed all of the top chords were laterally braced, but in fact only the interior frame met this criterion. Dead loads were underestimated by more than 20%. When one of the supports unexpectedly buckled under the snow, it set off a chain reaction that brought down the other roof sections.
8. Mars Climate Orbiter
Mars Climate Orbiter was a robotic space probe launched by NASA in 1998 to study the Martian atmosphere, climate and surface changes. 286 days after launch, communication with the spacecraft was lost as it went into orbital insertion. The navigation mishap pushed the spacecraft too close to the Martian atmosphere, where it presumably burned and broke into pieces.
Cost: $125 million
Reason: The Mars probe was lost because ground-based computer software (by Lockheed Martin) generated its output in non-SI units of pound-force seconds instead of the metric units of newton-seconds that the navigation software expected.
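The size of such a unit mismatch is easy to work out. The sketch below is only an illustration: the impulse value is hypothetical and the function name is an assumption, but the conversion factor between pound-force seconds and newton-seconds is the standard one.

```python
# One pound-force second expressed in newton-seconds.
LBF_S_TO_N_S = 4.4482216152605


def to_newton_seconds(impulse_lbf_s: float) -> float:
    """Convert an impulse from pound-force seconds to newton-seconds."""
    return impulse_lbf_s * LBF_S_TO_N_S


# Hypothetical thruster impulse reported by the ground software, in lbf*s.
reported = 10.0

# The receiving software reads the bare number as if it were already in N*s.
misread_as_n_s = reported
actual_n_s = to_newton_seconds(reported)

# How much larger the real impulse is than the one the navigation software assumed.
underestimate_factor = actual_n_s / misread_as_n_s
print(f"{underestimate_factor:.3f}")
```

Every thruster firing is underestimated by a factor of about 4.45, and those small errors add up over a months-long cruise. Carrying the unit alongside the number, rather than passing bare floats between systems, is one common way to catch this class of bug.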
7. IRS: Lack of Fraud Detection System
In 2006, the Internal Revenue Service (IRS) worked without an automated refund fraud detection system to flag potentially fraudulent returns claiming refunds, which cost America millions.
Cost: $300 million damage + $21 million to fix
Reason: Computer Sciences Corp. was supposed to deliver the Electronic Fraud Detection System (EFDS) in January 2005. However, in October 2004, the IRS became worried that its $21 million system would not be ready on time, and decided to use its existing system for the 2005 filing season.
According to IRS Commissioner Mark Everson, the management efforts of both the IRS and its contractor to improve the automated refund fraud detection system were insufficient and unacceptable.
6. Cluster Spacecraft
Cluster (a constellation of 4 ESA spacecraft) was launched on the maiden flight of the Ariane 5 rocket in 1996. The rocket failed to reach orbit and the mission ended in failure.
Cost: $370 million
Reason: The whole system shut down when a computer program tried to convert the rocket's sideways velocity from a 64-bit format to a 16-bit format. The number was too big for 16 bits, and the conversion was not protected against overflow. When the guidance software shut down, control passed to an identical redundant unit, which failed in the same way because it was running the same software. As a result, the flight veered off its intended path 37 seconds after launch. Finally, the rocket was destroyed by its automated flight termination system.
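The conversion problem can be sketched in a few lines. This is not the flight code (which, by published accounts, raised an unhandled error rather than wrapping around); the velocity reading and function names below are hypothetical. It simply shows what happens when a value larger than 32767 is squeezed into a signed 16-bit integer, and how a range check surfaces the problem.

```python
import struct

INT16_MIN, INT16_MAX = -32768, 32767


def to_int16_wrapping(value: float) -> int:
    """Keep only the low 16 bits, as an unchecked conversion might."""
    low_bits = int(value) & 0xFFFF
    return struct.unpack("<h", struct.pack("<H", low_bits))[0]


def to_int16_checked(value: float) -> int:
    """Refuse values that do not fit instead of corrupting them."""
    truncated = int(value)
    if not INT16_MIN <= truncated <= INT16_MAX:
        raise OverflowError(f"{value} does not fit in a signed 16-bit integer")
    return truncated


# Hypothetical horizontal-velocity reading, chosen only to exceed 32767.
reading = 40000.0
wrapped = to_int16_wrapping(reading)  # silently becomes a negative number

try:
    to_int16_checked(reading)
    overflow_detected = False
except OverflowError:
    overflow_detected = True

print(wrapped, overflow_detected)
```

Detecting the overflow is only half of the fix. The caller still has to decide what to do with it (saturate, fall back, or ignore a value that is no longer needed) instead of shutting the whole unit down.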
5. Pentium’s Long Division
In 1994, Intel's Pentium microprocessor was found to carry a bug in its floating point unit. For certain high-precision divisions, the processor would return incorrect decimal values. There were around 5 million defective chips in circulation, and Intel eventually decided to replace the chip for anyone who complained. Later, Intel turned some of its faulty processors into key chains.
Cost: $475 million + Brand reputation
Reason: The divider in the Pentium floating point unit had a flawed division table, missing five entries out of about a thousand. However, an error was only likely to occur once in nine billion random floating point divides. For instance, dividing 4195835.0/3145727.0 yielded 1.333739068902037589 instead of 1.333820449136241002, an error of 0.006%.
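The figures quoted above can be checked with a few lines of arithmetic. The sketch below takes the two quotients from the text as given and computes the relative error between them; the variable names are just for illustration.

```python
# Quotients quoted in the text for 4195835.0 / 3145727.0.
correct_quotient = 1.333820449136241002
flawed_quotient = 1.333739068902037589

# A modern processor reproduces the correct value.
computed = 4195835.0 / 3145727.0

relative_error = abs(correct_quotient - flawed_quotient) / correct_quotient
error_percent = relative_error * 100

print(f"computed today: {computed:.15f}")
print(f"relative error: {error_percent:.4f}%")
```

The flawed result is right to about four significant digits, which is why the bug went unnoticed in everyday use but mattered for scientific and financial calculations.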
4. Wall Street Crash 1987
On 19th October 1987 (also referred to as Black Monday), the Dow Jones Industrial Average (DJIA) fell 508 points, losing 22.61% of its total value, and the S&P 500 dropped 20.4%. It was the greatest one-day loss Wall Street had ever seen.
Cost: $500 billion in one day
Reason: Major causes include program trading and overvaluation. In program trading, computers rapidly execute stock trades based on external inputs, such as the price of related securities. Program trading was used to implement portfolio insurance strategies and to attempt arbitrage.
In early 1987, there was a rash of SEC investigations into insider trading. By October, investors decided to move out. As people began the mass exodus, computer trading programs generated a flood of sell orders to DOT (Designated Order Turnaround), overwhelming the system, crashing the market and leaving all investors effectively blind.
3. Y2K
Y2K (the Year 2000 problem) was a problem in the coding of computerized systems that was projected to create havoc in computer networks and software during the transition from 31 December 1999 to 1 January 2000.
Cost: $500 billion
Reason: To save computer storage, most legacy software used two-digit numbers to store the year in dates, for example "97" for 1997. This would cause date-related programs to operate incorrectly after 1 January 2000.
In addition, some programs did not take into account that the year 2000 was a leap year. Even before the dawn of 2000, it was feared that some software might fail on 9 September 1999 (9/9/99), because early developers often used a series of 9s to indicate the end of a program.
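Both date problems are easy to reproduce. The following sketch uses hypothetical function names and is not taken from any particular legacy system; it shows how two-digit year arithmetic goes wrong across the century boundary, and how a leap-year check that omits the divisible-by-400 rule misclassifies 2000.

```python
def years_between_two_digit(start_yy: int, end_yy: int) -> int:
    """Elapsed years using only the last two digits, as legacy code often did."""
    return end_yy - start_yy


def is_leap_year(year: int) -> bool:
    """Full Gregorian rule: every 4th year, except centuries not divisible by 400."""
    return year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)


def is_leap_year_incomplete(year: int) -> bool:
    """Common mistake: treats every century year as a non-leap year."""
    return year % 4 == 0 and year % 100 != 0


# An account opened in 1997 ("97") and checked in 2000 ("00").
elapsed = years_between_two_digit(97, 0)

print(elapsed)                        # negative instead of 3
print(is_leap_year(2000))             # True
print(is_leap_year_incomplete(2000))  # False
```

A negative age or interval is exactly the kind of value that interest calculations, expiry checks and sorting routines were never written to handle.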
2. Cancer Treatment and Deadly Radiation Therapy
The Therac-25 medical radiation therapy device was involved in several accidents in 1985-87 in which patients were exposed to massive overdoses of radiation. A few patients received up to 100 times the intended dose. The same kind of radiation dosage error happened in Panama City in 2000.
Cost: 10+ people dead, 20 critically injured.
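Reason: In the Panama City case, the therapy planning software calculated radiation dosage based on the order in which data was entered, sometimes delivering a double dose of radiation. The Therac-25 overdoses were traced to race conditions in its control software.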
1. Patriot Missile Failure
In February 1991 (during the first Gulf War), an American Patriot missile system in Dhahran, Saudi Arabia, failed to track and intercept an incoming Iraqi Scud missile. The Scud struck an American Army barracks.
Cost: 28 soldiers dead + 100 injured
Reason: An inaccurate time calculation caused by a computer arithmetic error led to the system failure. Technically, this was a small chopping error: the system's internal clock, which counted tenths of a second, was multiplied by 1/10 to produce the time in seconds. The value 1/10 has a non-terminating binary expansion, and the calculation was performed in a 24-bit fixed point register, so the value was chopped to fit. The tiny error accumulated the longer the system ran. After about 100 hours of operation it amounted to roughly a third of a second, during which a Scud travels more than half a kilometer.
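The accumulated drift can be reproduced with exact fractions. The sketch below rests on stated assumptions: it keeps 23 bits after the radix point (the widely cited per-tick error of about 0.000000095 corresponds to that many fractional bits), assumes 100 hours of uptime, and uses an assumed Scud speed of about 1,676 m/s. The function and variable names are illustrative.

```python
from fractions import Fraction


def chop(value: Fraction, frac_bits: int) -> Fraction:
    """Truncate a non-negative value to frac_bits binary digits after the radix point."""
    scale = 2 ** frac_bits
    return Fraction(value.numerator * scale // value.denominator, scale)


FRAC_BITS = 23           # assumption: fractional bits kept in the register
HOURS_UP = 100           # assumption: continuous uptime before the incident
SCUD_SPEED_M_S = 1676    # assumption: approximate Scud speed

one_tenth = Fraction(1, 10)
stored = chop(one_tenth, FRAC_BITS)
error_per_tick = one_tenth - stored

# The clock ticks ten times per second.
ticks = HOURS_UP * 3600 * 10
drift_seconds = float(error_per_tick * ticks)
tracking_error_m = drift_seconds * SCUD_SPEED_M_S

print(f"error per tick: {float(error_per_tick):.3e} s")
print(f"drift after {HOURS_UP} h: {drift_seconds:.4f} s")
print(f"distance covered in that time: {tracking_error_m:.0f} m")
```

An error below one ten-millionth of a second per tick looks harmless, but multiplied by 3.6 million ticks it is large enough to move the predicted target position outside the tracking window.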
The Disaster Continues
Microsoft customers accused of pirating: Someone from the Windows team accidentally installed bug-filled pre-production software on all Windows servers. For the next 19 hours, all genuine XP users were told they were running pirated software.
Criminals on Parole: In 2011, around 450 violent criminals were released from California prisons due to a small mistake in computer program code.
The World War III (almost): The nuclear early warning system of the Soviet Union reported the launch of American missiles on 26 September, 1983. The Soviet systems mistakenly picked up sunlight reflections off cloud-tops and interpreted them as missile launches.
Later, the missile attack warnings were identified as a false alarm by an officer of the Soviet Air Defense Forces. This decision prevented a nuclear war and the potential deaths of millions of people.
The blackout: In 2003, darkness spread across 8 U.S. states, affecting 50 million people. The problem was a race condition, in which two separate threads of a single operation accessed the same piece of shared data at the same time without proper coordination.
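As a generic illustration of how race conditions are avoided (this is not the code of the system involved in the blackout, and the class and function names are hypothetical), the sketch below has several threads update one shared counter. The lock makes each read-modify-write step exclusive, so no update is lost.

```python
import threading


class SharedCounter:
    """A counter whose updates are serialized by a lock."""

    def __init__(self) -> None:
        self.value = 0
        self._lock = threading.Lock()

    def increment(self) -> None:
        # Without the lock, two threads could read the same old value
        # and one of the two updates would be lost.
        with self._lock:
            self.value += 1


def run_workers(counter: SharedCounter, workers: int, increments_each: int) -> int:
    """Start the given number of threads, each incrementing the counter."""
    def work() -> None:
        for _ in range(increments_each):
            counter.increment()

    threads = [threading.Thread(target=work) for _ in range(workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return counter.value


total = run_workers(SharedCounter(), workers=4, increments_each=10_000)
print(total)
```

Race conditions are hard to find in testing because they depend on timing. The unprotected version of this code may produce the right answer thousands of times before it fails.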
Apple Map Fails: With the iOS 6 release in September 2012, Apple decided to drop the superior Google Maps platform. Unfortunately, it turned out to be one of the most epic fails of the mobile computing industry. TPMIdeaLab found that the software was missing entries for entire towns and had incorrectly placed locations, satellite imagery obscured by clouds and more.
LAX Flights Grounded: In 2007, tons of incorrect data were sent out on the U.S. Customs and Border Protection network. This led to LAX shutting down for 8 hours, and more than 17,000 passengers were stranded until the issue was resolved. The culprit was a single piece of faulty embedded software.