This discussion highlights several key themes regarding software failures, their causes, and their implications, drawing heavily on historical incidents like the Therac-25 and the Boeing 737 MAX.
The Nature and Severity of Software Failures
The conversation frequently touches upon what constitutes a "bug" and the catastrophic real-world consequences that can arise from software defects. There's a recurring debate about whether certain failures are strictly software bugs or stem from broader design flaws, operational issues, or a combination thereof. The sheer deadliness of some incidents is a central point of discussion.
- An early comment sets the stage: "The most deadly bug in history. If you know any other deadly bug, please share! I love these stories!" (napolux)
- The Boeing 737 MAX is cited as a major example: "The MCAS related bugs @ Boeing led to 300+ deaths, so it's probably a contender." (NitpickLawyer)
- Another user points out the ambiguity: "MCAS functioned exactly as designed, zero bugs. It's just that the design was very bad." (phire)
- The sheer scale of the impact is underscored: "MCAS is arguably a bug. That killed 346 people." (echelon)
Systemic Failures and Interconnectedness of Issues
A prominent theme is that severe software failures are rarely due to a single isolated error. Instead, they often result from a complex interplay of factors including flawed system design, inadequate testing, poor communication, and organizational culture. This leads to the conclusion that identifying a singular "root cause" is often an oversimplification.
- "It's almost never just software. It's almost never just one cause." (
thyristan
) - "Just to point it out even clearer - there's almost never a root cause." (
actionfromafar
) - The Therac-25 case exemplifies this: "A commission attributed the primary cause to generally poor software design and development practices, rather than singling out specific coding errors." (
vemv
)
The Role of Process, Culture, and Talent
The discussion delves into the critical interplay between development process, organizational culture, and individual developer talent in achieving software quality. While strong individual talent is valued, many argue that robust processes and a culture of quality are essential for ensuring reliability, especially at scale. The failure to learn from past mistakes or implement proper feedback loops is also highlighted.
- "software quality doesn't appear because you have good developers. It's the end result of a process, and that process informs both your software development practices, but also your testing. Your management. Even your sales and servicing." (
benrutter
) - Countering this, one user states: "This is true but there also needs to be good developers as well. It can't just be great process and
low quality developer practices." ( AdamN
) - Regarding learning: "the real failure in the story of the Therac-25 from my understanding, is that it took far too long for incidents to be reported, investigated and fixed." (
benrutter
) - A strong opinion on talent: "My takeaway from observing different teams over years is the talent by a huge margin is the most important component. Throw a team of A performers together and it really doesn't matter what process you make them jump through." (
varjag
) - However, the counterpoint is that "You can't teach people to care" (
mrguyorama
), emphasizing the cultural aspect.
The Importance of Failsafes and Independent Verification
A recurring point, particularly in safety-critical systems, is the necessity of independent, often hardware-based, failsafes that can function even if the primary software fails. This is contrasted with relying solely on software checks or logic. The concept of "defense in depth" and ensuring that no single failure leads to a catastrophic outcome is stressed.
- A fundamental principle from aerospace: "The guiding principle is that no single failure should cause an accident." (WalterBright)
- The Therac-25's lack of physical interlocks is seen as crucial: "The Therac-25 was just one of the many incidents we covered in my Software Ethics course for my Computer Science degree." The missing element was a hardware safeguard: "It had a physical mechanism that was helping it to not kill someone." (layman51)
- "The #1 virtue of electromechanical failsafes is that their conception, design, implementation, and failure modes tend to be orthogonal to those of the software." (bell-cot)
- "Assume the software does the worst possible thing. Then make sure that there's an independent system that will prevent that worst case." (WalterBright) A minimal sketch of this principle follows below.
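To make the "independent failsafe" idea concrete, here is a minimal sketch in Python. It is not drawn from any commenter's code or from the Therac-25 itself; the names (`BeamController`, `HardwareInterlock`) and the dose limit are invented for illustration, and in a real machine the interlock would be separate hardware rather than another function in the same process.

```python
# A deliberately simple illustration of "assume the software does the worst
# possible thing": the interlock enforces a hard limit outside the controller's
# logic, so a controller bug alone cannot exceed it.

MAX_SAFE_DOSE = 2.0  # hypothetical hard physical limit, in gray

class BeamController:
    """Primary control software: may contain bugs, races, or bad state."""
    def requested_dose(self, prescription: float) -> float:
        # Imagine arbitrarily complex (and possibly wrong) logic here.
        return prescription

class HardwareInterlock:
    """Independent failsafe: knows nothing about the controller's internals."""
    def permit(self, dose: float) -> bool:
        return 0.0 <= dose <= MAX_SAFE_DOSE

def deliver(dose: float) -> None:
    print(f"delivering {dose} Gy")

def fire_beam(prescription: float) -> None:
    dose = BeamController().requested_dose(prescription)
    # The independent check decides whether delivery is allowed at all,
    # regardless of what the control software concluded.
    if not HardwareInterlock().permit(dose):
        raise RuntimeError(f"interlock tripped: dose {dose} exceeds hard limit")
    deliver(dose)

fire_beam(1.5)    # passes the interlock
# fire_beam(250)  # would trip the interlock no matter what the controller says
```

The point, echoing the orthogonality comment above, is that the interlock's failure modes should have nothing in common with the controller's.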
The Impact of Business Decisions and Commercial Pressures
Several comments suggest that business objectives, cost-cutting measures, and commercial pressures can significantly influence development decisions, often at the expense of safety and quality. This includes a reluctance to inform users about new systems or train them, potentially to avoid delays or added costs.
- The decision not to implement warnings for conflicting data was driven by a desire to avoid informing pilots, as the sketch after this list illustrates: "They deliberately designed it to only look at one of the Pitot tubes, because if they had designed it to look at both, then they would have had to implement a warning message for conflicting data. And if they had implemented a warning message, they would have had to tell the pilots about the new system, and train them how to deal with it." (phire)
- Regarding the 737 MAX certification: "The competition (Airbus NEO family) did not need this kind of new training for existing pilots, so airlines being required to do this for new Boeing but not Airbus planes would've been a huge commercial disadvantage." (reorder9695)
- "It is business who requests features ASAP to cut costs and then there are customers who don’t want to pay for 'ideal software' but rather have every software for free." (ozim)
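The design choice phire describes, reading a single channel instead of cross-checking two, can be sketched as follows. This is purely illustrative Python, not Boeing's architecture or code; the function names, tolerance, and warning type are all invented.

```python
# Hypothetical contrast between a single-channel read and a redundant
# cross-check that surfaces disagreement instead of silently trusting one value.

DISAGREE_TOLERANCE_DEG = 5.0  # invented threshold, for illustration only

class SensorDisagreeWarning(Exception):
    """Raised when redundant channels differ beyond tolerance."""

def reading_single_channel(left_deg: float) -> float:
    # Trusts one input; a faulty sensor goes unnoticed by design.
    return left_deg

def reading_cross_checked(left_deg: float, right_deg: float) -> float:
    # Uses both channels; a disagreement becomes visible to the crew,
    # which is exactly the warning (and training burden) the comment
    # says the design set out to avoid.
    if abs(left_deg - right_deg) > DISAGREE_TOLERANCE_DEG:
        raise SensorDisagreeWarning(
            f"channels disagree: left={left_deg:.1f}, right={right_deg:.1f}"
        )
    return (left_deg + right_deg) / 2.0

print(reading_cross_checked(4.0, 5.5))   # channels agree: 4.75
# reading_cross_checked(4.0, 74.0)       # would raise SensorDisagreeWarning
```

As the quotes above suggest, the cost is not in the comparison itself but in everything a disagreement warning triggers downstream: documentation, certification, and pilot training.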
The Disconnect Between Software Engineering and Other Engineering Disciplines
The discussion touches on the perception that software engineering, unlike more established disciplines like civil or mechanical engineering, often lacks rigorous standards, legal oversight, and professional accountability. This enables a "YOLO" culture where best practices are ignored, and the critical nature of software is underestimated.
- "Software developers are the absolute most offensive use of the word "engineer", because 99.9% of the stuff this field makes is a competition to take the most unique approach to a solution, then getting it bandaged together with gum and paperclips." (
kulahan
) - "There should be tons and tons of standards which are enforced legally, but this is not often the case. Imagine if there were no real legal guardrails in, say, bridge building!" (
kulahan
) - "Software has developed a YOLO culture. People are used to having almost no structure, and they flit between organizations so rapidly..." (
ChrisMarshallNY
)
The Rise of AI and the Potential for New Catastrophes
The conversation frequently draws parallels between historical software failures and the current rapid adoption of AI and LLM-driven development. There's a significant concern that similar issues – lack of understanding, poor testing, "vibe-coding," and the blind trust in new technologies – could lead to new, perhaps even more dangerous, types of failures.
- "Don’t worry we are poised to re learn all these lessons once again with our fancy new agentic generative ai systems." (
ipython
) - "Replace the "hire poor developers" with "use LLM driven development", and you have the rough outline for a perfect Software Engineering horror movie." (
ZaoLahma
) - "The idea of 'vibe-coding' safety critical software is beyond terrifying. ... neophyte code monkeys introducing massive black boxes full of poorly understood voodoo to the process." (
voxadam
) - "The mechanical interlock essentially functioned as a limit outside of the control system. So you should build an ai system the same way- enforcing restrictions on the security agency from outside the control of the ai itself. Of course that doesn’t happen and devs naively trust that the ai can make its own security decisions." (
ipython
)
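To illustrate ipython's point about keeping restrictions outside the AI's control, here is a minimal Python sketch. It is not taken from any real agent framework; `ALLOWED_ACTIONS` and `execute_agent_action` are invented names, and a production system would enforce the policy in a separate process or service the model cannot reach.

```python
# Sketch of an "interlock outside the control system" for an AI agent: every
# model-proposed action passes through an operator-defined policy gate that the
# model cannot edit, no matter what text it produces.

from typing import Callable, Dict

# The allow-list is fixed by the operator, outside the model's control.
ALLOWED_ACTIONS: Dict[str, Callable[[str], str]] = {
    "read_file": lambda path: f"(contents of {path})",
}

def execute_agent_action(action: str, argument: str) -> str:
    """Run a model-proposed action only if the external policy permits it."""
    handler = ALLOWED_ACTIONS.get(action)
    if handler is None:
        # Prompt injection or a confused model cannot talk its way past this.
        return f"denied: '{action}' is not on the operator's allow-list"
    return handler(argument)

print(execute_agent_action("read_file", "notes.txt"))
print(execute_agent_action("delete_database", "prod"))  # always denied
```

The design choice mirrors the electromechanical failsafes discussed earlier: the gate's behavior does not depend on the component it is guarding.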
Medical Software and the Challenges of Regulation and User Feedback
The discussion highlights specific challenges within medical software development, including the impact of bureaucratic regulation, the difficulty of getting accurate field feedback, and the unique nature of medical data and device interactions.
- "The FDA mostly looks at documentation done right and less at product done right." (
vjvjvjghv
) - "A lot of doctors don’t report problems back because they are used to bad interfaces. And then the feedback gets filtered through several layers of sales reps and product management." (
vjvjvjghv
) - "It's a different world that makin apps thats for sure." (
sim7c00
) regarding safety-critical systems.
Trust in Software and the Role of Developers' Attitudes
There's a sentiment that both developers and the general public often exhibit misplaced trust in software, assuming it is inherently more reliable than it is. This underestimation of software's fragility, combined with a lack of care or understanding among some developers, contributes to recurring failures. The very definition of "engineer" and the professionalization of software development are also brought into question.
- "the prevailing idea is that certain medical findings are considered proof beyond reasonable doubt of violent abuse, even without witnesses or confessions... These beliefs rest on decades of medical literature regarded by many as low quality because of methodological flaws, especially circular reasoning..." (
rossant
) – relating to flawed data feeding into AI. - "most developers place way too much trust in software." (
jwr
) - "If software "engineers" want to be taken seriously, then they should also have the obligation to report unsafe/broken software and refuse to ship unsafe/broken software." (
V__
) - "There’s a lot of people who think this is a "joke" bullshit class that just wasted their time. You can't teach people to care." (
mrguyorama
)