On April 11, 1986, a radiation technician at the East Texas Cancer Center in Tyler, Texas, set up to treat a skin cancer patient with the center’s two-year-old Therac-25 medical linear accelerator. The machine had been used to successfully treat more than 500 center patients over two years. Similar machines, the Therac-6 and Therac-20, had been in use throughout the 1980s.
As Nancy Leveson and Clark Turner summarize in a 1993 issue of IEEE Computer, the radiation tech turned the beam on but the machine quickly shut itself down, making a loud noise and reporting “Malfunction 54.” The patient told the technician and, later, the hospital’s nuclear physicist, that it had felt like fire on the side of his face, that he had seen a flash of light, and that he had heard a sound reminiscent of “frying eggs.”
The patient died three weeks later, from a massive radiation exposure. The incident might have remained completely mystifying, except it was not the first time this had happened. It wasn’t even the first time it had happened in Tyler.
One Report Is Just an Anecdote – Two Make a Warning
Just three weeks earlier, a patient undergoing his ninth radiation treatment had actually jumped up from the treatment bed and pounded on the door to get out of the treatment room. He, too, had received a massive overdose. He died from it five months later.
After the first incident, the machine had been taken off-line for examination. Two engineers from AECL, the manufacturer, came out to Tyler. They found no problems, and assured the treatment center that overdoses were “impossible.” AECL suggested that perhaps there had been an electrical shock. An independent engineering firm was brought in and found no electrical problems. The machine was put back in service on April 7th, four days before the second incident.
Actually, there had been two prior incidents involving the Therac-25, one in Canada and one in Georgia. A lawsuit had been filed in the Georgia case, on November 13, 1985, half a year before the Tyler cases.
As you may have guessed by now, and as you probably already know if you’ve ever taken a software safety course, the problem was a rarely exercised bug in the software.
Around the time of these radiation burns, leading-edge automobile manufacturers were introducing electronic cruise control and electronic throttle control. And around that time, the world began hearing of “sudden unintended acceleration.”
More and More, Cars Are Rolling Computers
People understandably believe that a car’s gas pedal is mechanically connected to the car’s throttle. In modern cars, there is no such connection. A foot on the gas pedal signals a computer in the car that the driver would like to go faster; a little electric motor then opens the throttle in response to the driver’s command. This enables the car to do clever things with the throttle on your behalf to improve gas mileage, to allow for automatic adjustment to altitude, and so on. It also means the computer runs the car’s throttle, just as a computer controlled the Therac-25. The driver offers “input” but it’s the computer that moves the physical parts.
Sudden acceleration is not new. The National Highway Traffic Safety Administration (NHTSA) has now looked into it twice, once engaging NASA engineers for help, both times failing to find any problems in the cars they examined. The manufacturers, along with the highway safety agency, have politely suggested that it is generally due to “pedal misapplication,” an argument that many have been inclined to believe. It hasn’t helped that instances of sudden acceleration seem to happen to older drivers with a greater frequency than you might expect if it were really a problem with the car.
That changed, in part, after a horrific 911 call on August 28, 2009. A veteran California Highway Patrol officer, Mark Saylor, was driving his family in a Lexus, Toyota’s luxury brand, when it accelerated out of control and the brakes failed to slow it. A passenger was on the phone with 911 when the Lexus ran off an embankment at an estimated 120 MPH. No one suggested that Saylor did not know the gas from the brake.
Another investigation suggested that the problem was floor mats. Toyota recalled the cars for a floor-mat fix.
Now, the company has been fined $1.2 billion because its executives withheld information pointing to some other problem with the car.
What other problem? Has the press told us?
The Watchdog Doesn’t Bark and the Fail-safes Fail
There are several good candidates for possible causes, but somehow this part is not being well reported. The first good candidate came up in a jury trial in Oklahoma late last year, in which a jury found Toyota liable in the fatal crash of a 2005 Camry driven by Jean Bookout, who suffered extensive injuries. The jury called for $3 million in compensatory damages.
After the verdict but before the jury could decide on punitive damages, Toyota decided to settle this case and a large set of outstanding sudden-acceleration cases. The reason Toyota became anxious to settle may have been the expert testimony of Michael Barr, an embedded software systems expert who had been allowed to examine the source code of Toyota’s engine-control system and who found terrible problems with the system’s design. The problems, said Barr, made it theoretically possible that a software failure would both trap the throttle in a wide open position and disable the very systems that should notice the problem. Perhaps worse for Toyota, Barr found many problems throughout the software, any one of which would indicate shoddy engineering practice.
First, Toyota had not used error-correcting memory, a special type of hardware that is able to automatically detect and correct situations where some rare physical event flips a one to a zero or vice versa. It had told NASA’s investigators that it had used error-correcting memory, which may have led NASA’s safety experts to pay less attention to certain situations than they otherwise would have.
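To make the idea of error correction concrete, here is a minimal sketch of a Hamming(7,4) code, a textbook scheme that stores three check bits alongside four data bits so that any single flipped bit can be located and repaired. It illustrates the principle only. Real error-correcting memory works on wider words and is implemented in hardware, and nothing here is taken from Toyota's system; the function names are assumptions made for this example.

```python
def encode(data_bits):
    """Encode 4 data bits into a 7-bit Hamming codeword (positions 1..7)."""
    d1, d2, d3, d4 = data_bits
    p1 = d1 ^ d2 ^ d4  # check bit covering positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # check bit covering positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # check bit covering positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def decode(codeword):
    """Return (data_bits, error_position); position 0 means no error was seen."""
    word = list(codeword)
    # XOR-ing the 1-based positions of all set bits gives 0 for a valid
    # codeword, or the position of the flipped bit after a single-bit error.
    syndrome = 0
    for position, bit in enumerate(word, start=1):
        if bit:
            syndrome ^= position
    if syndrome:
        word[syndrome - 1] ^= 1  # repair the flipped bit
    return [word[2], word[4], word[5], word[6]], syndrome


stored = encode([1, 0, 1, 1])
stored[5] ^= 1  # simulate a stray bit flip at position 6
recovered, flipped_at = decode(stored)
```

In this run the decoder recovers the original four bits and reports the position it repaired. Without the check bits, the flipped value would simply be read back as if it were correct.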
But this is just one problem from a laundry list that Barr turned up. Toyota did have many layers of fail-safes in place. The problem is that these fail-safes did not cover all situations, and left the possibility of a single-point failure. Many of these fail-safes are somewhat technical, but they are not rocket science.
As one example, many critical data items in embedded systems are mirrored in multiple locations. If the system detects different values for the same item from the different locations in which an item is stored, the system can respond, perhaps even to the extent of resetting itself. But although many of the variables in Toyota’s system were mirrored, one that was not was the target throttle angle, which basically tells the car how fast to go. There was no backup.
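A minimal sketch of mirroring, under assumptions: the value is 16 bits wide and the second copy is stored bit-inverted, a common convention that is not described in the testimony. The class and variable names are hypothetical and are not drawn from Toyota's code.

```python
class MirrorMismatch(Exception):
    """Raised when the two stored copies of a value disagree."""


class MirroredValue:
    """Keeps a 16-bit value twice: once as written, once bit-inverted."""

    MASK = 0xFFFF

    def __init__(self, value: int) -> None:
        self.write(value)

    def write(self, value: int) -> None:
        self.primary = value & self.MASK
        self.mirror = ~value & self.MASK  # inverted copy, stored separately

    def read(self) -> int:
        # The copies agree only if every bit differs between them.
        if (self.primary ^ self.mirror) != self.MASK:
            raise MirrorMismatch("stored copies disagree")
        return self.primary


def corrupted_read_is_caught() -> bool:
    target_angle = MirroredValue(120)  # hypothetical throttle-angle value
    target_angle.primary ^= 1 << 9     # simulate one corrupted bit
    try:
        target_angle.read()
    except MirrorMismatch:
        return True   # the system could now fall back or reset
    return False
```

A variable stored only once offers no such check: a corrupted value looks exactly like a legitimate one.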
Another problem: To make sure that a system is running properly, it can incorporate a watchdog timer, a subsystem that expects to be given an “all ok” signal from the various tasks the system runs. When a task doesn’t check in on schedule with the watchdog timer, again, the system can respond by correcting or resetting itself. Toyota had a watchdog timer, but it checked only certain routines, meaning that the task which controlled the throttle could become non-responsive without the watchdog being alerted.
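The difference between full and partial watchdog coverage can be sketched in a few lines. This is an illustration under assumptions, not a reconstruction of Toyota's design; the task names "ignition" and "throttle" are hypothetical, and a real watchdog is a hardware timer rather than a Python object.

```python
class Watchdog:
    """Asks for a reset unless every monitored task checked in this period."""

    def __init__(self, task_names):
        self.expected = set(task_names)
        self.checked_in = set()

    def check_in(self, task_name: str) -> None:
        self.checked_in.add(task_name)

    def end_of_period(self) -> bool:
        """Return True if a reset is needed, then start a new period."""
        missing = self.expected - self.checked_in
        self.checked_in = set()
        return bool(missing)


def hung_throttle_detected(monitored_tasks) -> bool:
    dog = Watchdog(monitored_tasks)
    dog.check_in("ignition")  # a healthy task reports in
    # The "throttle" task has hung and never calls check_in.
    return dog.end_of_period()


full_coverage = hung_throttle_detected(["ignition", "throttle"])
partial_coverage = hung_throttle_detected(["ignition"])
```

With both tasks monitored, the hung throttle task triggers a reset. When only the other task is monitored, the same hang goes unnoticed.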
The list goes on. Recursion, a type of software call that can exhaust a system’s critical memory, was included, even though it is generally avoided in safety-critical software. According to Barr, NASA was told that there was twice as much safety margin in the exhaustible resource (the “stack”) as was actually present; Toyota had mismeasured. Barr’s investigation found buffer overflows, invalid pointer dereferences, race conditions, unsafe casting, potential stack overflow, and nested scheduler unlocks in the source code used by Toyota’s 2005 Camry L4, any of which could have caused the sort of memory corruption that would leave the throttle stuck.
Remarkably, the diagnostic codes that might have clued an engineer in to failures were handled by the exact same task whose failure could freeze the throttle. Barr called this task, referred to as task X throughout the trial because Toyota considers its actual name secret, a “kitchen sink” task.
Another Theory Addresses More Evidence
Barr’s theory may well explain what happened in various Toyota cars, but it leaves some mystery about why sudden acceleration is reported in many models of cars from a wide variety of manufacturers. It also leaves unexplained why sudden acceleration seems to strike older drivers disproportionately often, frequently at the very moment they shift from park to drive or reverse.
The somewhat disheartening answer may well be that there are other, additional sources of sudden-acceleration incidents – multiple pathways leading to the same deadly result.
A retired electronics engineer, Dr. Ron Belt, has a testable theory that he has offered in several papers, including one released in April 2012. Belt thinks it is possible that some incidents of sudden acceleration are due to negative voltage spikes on battery supply lines. Belt theorizes that if a spike occurs at the moment the voltage is being sampled in order to correct the voltage applied to the throttle motor, this could lead to an increase in the input to the throttle motor, which could result in sudden acceleration.
The voltage sampling does not happen frequently, and the spikes may be associated with transitions into gear. Belt believes his theory may account for some of the things Barr’s theory does not explain – for example, the statistical association of sudden acceleration with older drivers, who may have a different driving pattern than the long commutes many younger people have, and the presence of sudden-acceleration complaints across a wide variety of automotive manufacturers.
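One way to picture the mechanism Belt describes is a controller that scales its throttle-motor command by the ratio of nominal to sampled battery voltage. This is only a sketch: the compensation formula, the 12-volt nominal figure, and the sample values are all assumptions chosen for illustration, not Belt's model and not any manufacturer's code.

```python
NOMINAL_VOLTS = 12.0  # assumed nominal battery voltage


def compensated_command(requested: float, sampled_volts: float) -> float:
    """Scale a throttle-motor command (0..1) to offset a low supply voltage."""
    if sampled_volts <= 0:
        raise ValueError("sampled voltage must be positive")
    scaled = requested * NOMINAL_VOLTS / sampled_volts
    return min(scaled, 1.0)  # the command cannot exceed full drive


# A sample taken at a normal moment leaves the request unchanged.
normal = compensated_command(0.2, sampled_volts=12.0)

# A sample that happens to land on a brief negative spike (assumed 6 V)
# doubles the command, even though the battery recovers an instant later.
during_spike = compensated_command(0.2, sampled_volts=6.0)
```

Because the sampling is infrequent, a correction computed from one bad sample could, in this picture, persist until the next sample is taken.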
The Media “Watchdogs” Don’t Bark Either
Why haven’t Barr’s report and Belt’s theory been widely disseminated by the media? Probably because the in-depth discussion of this has been in outlets like EDN Network (“an electronics community for engineers, by engineers…”) and EETimes (“connecting the global electronics community”), not those like The New York Times or The Washington Post.
EDN’s headline is clear: “Toyota’s Killer Firmware.” Here’s one from EETimes: “Single Bit Flip that Killed.” Compare those with the final sentence of an Associated Press story by reporter Sean Murphy: “Toyota has denied the allegation, and neither the National Highway Traffic Safety Administration nor NASA found evidence of electronic problems.” Or this one, from a December story in Bloomberg Businessweek, discussing Toyota’s decision to settle 200 consolidated claims: “Toyota apparently didn’t want to risk that California juries would assume there had to be something wrong with Toyotas if owners claimed their cars suddenly hurled themselves into trees, walls, or other cars.”
The technical press gets it – the business and mainstream media don’t.
What are the reasons for this? Have the media not bothered to read the testimony? Is this the laziness of “he said, she said” reporting, where the fact that NASA failed to find something is presented as evidence that it does not exist? The NASA engineers were very clear on page 20 of their report that they were not able to vindicate the Toyota engine software; they simply said that in the limited time they were provided, they could not find a problem. In the executive summary of their work, the caveats do not appear.
Or is it simply human nature: Are people disinclined to exert the effort to understand the potential dangers of the new information ecology on which society has become dependent?
Increasingly, people’s lives are placed in the hands of visible technology and also software – invisible technology. The coastal nuclear power plant? No problem, the statisticians say a tsunami is unlikely. The chemical plant in India? Perfectly safe, as long as it’s in a country far away from the neighborhoods where Union Carbide executives live. Software is an even more difficult problem than physical technology – corporations generally keep their software secret, since it provides them with a proprietary advantage. The resulting secrecy can make it difficult to find the source of failures.
Does this matter? How many more years before cars drive automatically? Who will believe the first driver to say the car wouldn’t respond to their steering? Who will believe the second?