Therac-25, a study in the potential risks of software bugsDec 062016
December 6th, 2016
I'm Jason Firth.
It's unfortunately common to find that people don't appreciate the risks involved with software, as if the fact that the controls are managed by bits and bytes changes the lethal consequences of failure.
A counterpoint to this is the Therac-25, a radiation therapy machine produced by Atomic Energy of Canada Limited -- AECL, for short.
The system had a number of modes, and while switching modes, the operator could continue entering information into the system. If the operator switched modes too quickly, then key steps would not take place, and the system would not be physically prepared to safely administer a dose of radiation to a patient.
Previous models had hardware interlocks which would prevent radiation from being administered if the system was not physically in place. This newer model relied solely on software interlocks to prevent unsafe conditions.
There were at least 6 accidents involving the Therac-25. Some of these accidents permenantly crippled the patients or resulted in the need for surgical intervention, and several resulted in deaths by radiation poisioning or radiation burns. One patient had their brain and brainstem burned by radiation, resulting in their death soon after.
There were a number of contributing factors in this tragedy: Poor development practices, lack of code review, lack of testing, and of course the bugs themselves. However; rather than focus on the specifics of what caused the tragedy, what I want to show is that what we do is not just computers -- it's where rubber meets road, and where what happens in our computers meets the reality. People who would never dream of opening a relay cabinet and starting to rewire things would think nothing of opening a PLC programming terminal and starting to 'play'.
Secondly, part of the problem is people who didn't realise that they were controlling a real physical device. There are things to remember when dealing with physical devices: For example, that no matter how quick your control system, valves can only open and close so fast, motors can only turn so fast, and your amazing control system is only as good as the devices it controls. Because the programmer forgot that these are real devices, they forgot to take that into account, and people died as a result. This holistic knowledge is why journeyman instrument techncians and certified engineering technologists in the field of instrumentation engineering technology are so valuable. They don't just train on how to use the PLC, they train on how the measurements work, how the signalling works, how the controllers work (whether they are digital or analog in nature), how final control elements work, and how processes work.
When it comes to control systems, just because you're playing with pretty graphics on the screen doesn't mean you aren't dealing with something very real, and something that can be very lethal if it's not treated with respect.
Another point that's near and dear to my heart comes in one of the details of the failures: When there was a problem, the HMI would display "MALFUNCTION" followed by number. A major problem with this is that no operator documentation existed saying what each malfunction number meant. I've said for a long time in response to people who say "The operator should know their equipment", that we as control professionals ought to make the information available for them to know their equipment. If we don't, we can't expect them to know what's going on under the surface. If the programmer had properly documented his code, and properly documented the user interface, then there may have been a chance operators would have understood the problem earlier, preventing lethal consequences.
Thanks for reading!