In Part I, Dr Sean Brady of Brady Heywood examined complex systems; the sand pile model; and Boeing, Airbus and the A320neo.

Developing the 737 MAX 8 

The original Boeing 737 was launched in January 1967. By 1988, it was flown by more than 137 operators worldwide and described as the ‘unsung prodigy’ of the Boeing family.

It would also form the basis of the 737 MAX 8: Boeing planned to take the existing plane, replace the engines with more fuel-efficient ones, and bring them forward on the wings. 

A critical decision made early in the MAX’s development was that Boeing wanted pilots already trained on the existing 737 to be able to fly the MAX with no additional simulator training.

Training is a significant cost for airlines: simulators cost about $15m each, pilots have to be taken out of service, and the training itself costs hundreds of dollars per hour. (Training, wages, and maintenance costs amount to 20% of the overall costs of running an airline – more than is spent on fuel.)

If Boeing could put a new plane on the market without requiring pilots to undertake additional simulator training, it would give the company a massive advantage. But to achieve this, it had to ensure it could modify the existing 737 without adding any new functionality that would change the handling or operation of the aircraft.

To understand the environment in which the development of this aircraft occurred, we need to look at Boeing’s history and the dramatic changes it went through from 1997 onwards.

The company was founded in 1916. By 1944 it had a workforce of 50,000 people, and by the 1960s, this had jumped to 142,400. It was all about producing high-quality, safe planes and had a saying, “we hire engineers and other people”.

At meetings, designers were encouraged to fight loudly for what they wanted on the planes to make them safer. But in 1997, it merged with McDonnell Douglas – a company much more cut-throat when it came to cost-cutting.

As the McDonnell Douglas executives spread throughout the organisation, their approach to building planes began to dominate how Boeing operated. The infiltration was described as ‘hunter killer assassins’ being let loose on a room full of engineers.

The organisation started to change – in many ways, a microcosm of the Jack Welch-inspired culture of the times. Cost-cutting and a return on shareholder investment seemed more important than producing quality aircraft. Engineering views took a back seat.

The ‘other people’ were now firmly in charge. Against this backdrop, the design of the 737 MAX 8 got under way, with a focus on ‘more for less’. A countdown clock was set up in the conference room where programme meetings took place to remind people there was no time to waste. Overshadowing every decision was the drive to ensure no new functionality was added to the aircraft, which would have required additional pilot training. 


A big problem with Boeing’s design approach for the MAX emerged during wind tunnel tests on a scale model of the aircraft. The model pitched up during tight, high-speed turns due to the new engines being placed further forward on the wings. This behaviour was a genuine concern: if the plane pitched up too far, its angle of attack would become too steep, and it would stall, which could lead to a crash.

The 737 chief pilot, Ray Craig, examined the problem and discovered it only happened in a part of the flight envelope that commercial pilots rarely enter. But pilots could enter this zone if they were dealing with high turbulence or responding to some upset.

And if they did, the nose of the plane could pitch up, and they could stall. This issue had to be addressed, and several mechanical solutions were proposed. They explored putting tiny vanes on the wing, but didn’t think that would work.

The only real mechanical solution was redesigning the tail and removing the pitch-up risk. But this was a costly solution that could delay the plane’s release. 

So they agreed upon a software – not mechanical – solution. This software system went by the cumbersome name ‘Manoeuvring Characteristics Augmentation System’, or MCAS for short.

The software would detect when the plane was pitching up too far (while in this edge-of-the-envelope zone), and it would rotate the horizontal stabiliser at the back of the aircraft and push the plane’s nose back down. This would manage the stall risk.

Not only was this solution cheap, but it would ensure the plane handled like the previous 737s – simulator training would not be required. To detect when the plane was pitching upwards, the software would rely on measurements from two sensors: an accelerometer measuring the plane’s acceleration, and one angle of attack vane, mounted on the front of the aircraft, measuring the plane’s angle of attack [5].

Critically, the software would rely on input from two sensors – not one – to control the plane in this edge-of-the-envelope zone. The project’s chief pilot wanted a hardware solution, but was overruled because the software was cheaper. Adopting this solution, however, raised a concerning problem: Boeing engineers were worried about what they should call this software and who they should tell about it.

There was a real danger that the regulator, the FAA, might view this software as ‘new functionality’. And if it did, it was something that pilots would need to be trained on in the simulator. This was the last thing Boeing wanted: it had publicly announced that existing 737 pilots could migrate to the MAX by undertaking a short training session on an iPad. 

The extension of MCAS 

The design and development of the aircraft continued, with the first test flight of the 737 MAX taking place in January 2016. But there was more bad news only a few months into these tests.

The pitch-up problem, which the scale model showed only happened near the edge of the envelope, was now also evident at slower speeds. This meant that an edge-of-the-envelope concern could now occur during routine operations.

To make matters worse, this stall risk at slower speeds could occur during take-off and landing, the most vulnerable part of the flight, and when the pilots are at their busiest.

There was now a genuine concern within Boeing that the FAA would not certify the plane. To solve this problem, Boeing extended the software solution to cover these low-speed stall risks.

If the aircraft was at risk of pitching up at these slow speeds, MCAS would detect it and activate the horizontal stabiliser on the tail so the plane would pitch down again. Previously, MCAS could move this stabiliser by 0.6 of a degree; at slower speeds, it could now move it by 2.5 degrees.

But at these slower speeds, the software could no longer use the accelerometer as an input. MCAS now relied on only one sensor: an angle of attack vane. Boeing, which had previously put so much focus on engineering and safety, was now relying on a system with no redundancy if anything was to happen to this sensor.
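The danger of this single-sensor arrangement can be illustrated with a short sketch. This is purely illustrative Python, not Boeing’s actual implementation: the function names, the stall threshold and the trim increment are hypothetical (the 2.5-degree figure comes from the account above; the rest are invented for the example).

```python
# Illustrative sketch of a control loop that trusts a single sensor.
# All names and thresholds are hypothetical examples, not the real MCAS.

STALL_AOA_THRESHOLD = 12.0   # hypothetical angle of attack limit (degrees)
TRIM_INCREMENT = 2.5         # nose-down stabiliser movement per activation

def mcas_like_step(aoa_vane_reading: float, stabiliser_angle: float) -> float:
    """Return the new stabiliser angle, trusting one vane reading."""
    if aoa_vane_reading > STALL_AOA_THRESHOLD:
        # Pushes the nose down based on a single sensor: a misaligned or
        # faulty vane triggers this exactly as a genuine stall risk would.
        return stabiliser_angle + TRIM_INCREMENT
    return stabiliser_angle

# A vane stuck at a falsely steep reading keeps commanding nose-down trim
# on every pass through the loop, regardless of the aircraft's true state.
angle = 0.0
for _ in range(3):
    angle = mcas_like_step(aoa_vane_reading=21.0, stabiliser_angle=angle)
print(angle)  # 7.5 — repeated nose-down trim driven by one bad sensor
```

With no second sensor to cross-check against, the sketch has no way to distinguish a real stall risk from a faulty reading – which is precisely the redundancy problem described above.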

And Boeing had another problem. It had picked an unproven supplier to deliver the simulator. While Boeing wanted no simulator training for existing pilots, it still needed a simulator for new pilots who had never flown a 737. But simulator development was falling behind. This was not only a worry for training new pilots; it also meant that if the FAA declared that training was required for the MAX, even for pilots who had flown the 737 before, there was nowhere for this training to take place. That would prevent the planes from flying.

Convincing the FAA there was no new functionality, specifically around the role that MCAS would play, was now crucial. MCAS was first loaded onto the 737 MAX 8 computer on August 15, 2016 – it was now ready for production.

In November of that year, Boeing engineers sent their system safety assessment of MCAS to the FAA. The latest version, revision E, had all the details of how the system would operate at lower speeds and move 2.5 degrees. But revision E was not the version submitted to the FAA. Instead, revision C was submitted, covering only MCAS’s more limited role.

Not only did the FAA approve MCAS, but it also approved no reference being made to it in the manual. Pilots could now take the plane up, without additional simulator training, with a system on board that was not discussed in the manual, which could override their control. Most MAX pilots didn’t even know MCAS existed. 

The first crash 

The grain of sand that would initiate the Lion Air 610 crash was a misaligned angle of attack sensor – it erroneously told MCAS that the plane’s angle of attack was too steep. MCAS engaged, activated the horizontal stabiliser, and pushed the nose of the aircraft down.

While Captain Suneja could activate the trim switch and pull the nose back up, each time he did, the software would continue receiving data from the misaligned sensor, reactivate, and push the nose back down again. Even when first officer Harvino pulled back on the control column to raise the nose, this had no effect – MCAS was designed to override it.

As Suneja and Harvino continued to battle the aircraft’s behaviour, neither was aware of the software, what it was doing, or what was required to deactivate it. They were entirely at its mercy. In November 2018, one month after the Lion Air crash – with planes still flying – Boeing met with pilots and trainers.

The pilots and trainers were shocked when they heard about MCAS and the fact that they had not been told it was on the plane. Boeing also explained what was required to disable MCAS in the event of a malfunctioning angle of attack sensor. This sequence would turn out to be very difficult to execute in the real world.

Also that month, the Indonesian investigators released their report on the Lion Air crash, primarily blaming the pilots and maintenance staff. The MAX continued to fly, Boeing’s stock price rose over the following months, and the FAA gave Boeing 10 months to fix the software, even though the FAA’s own analysis concluded that the MAX posed a serious risk.

In March 2019, Boeing’s CEO, Dennis Muilenburg, got the highest paycheck of his career: $31m, including a $13m bonus for performance. 

The second crash 

And then, on March 10, 2019, a mere five months after the Lion Air crash, Boeing received news of a second incident. Ethiopian Airlines flight 302, a 737 MAX 8, had taken off from Addis Ababa Bole airport.

At the time, Ethiopian Airlines was considered one of the best-run airlines in Africa. But shortly after take-off, MCAS activated because an angle of attack sensor had developed an electrical issue.

It was sending incorrect data to the software. The pilots fought against MCAS, trying to execute the sequence Boeing had prescribed for disabling the software. But this was a complex and ill-explained sequence, and proved very difficult to execute in flight.

The aircraft crashed, tearing itself apart and killing all 157 onboard. By now, there had been two crashes and 346 people had been killed, yet Boeing still publicly argued the MAX 8 was okay. And the FAA agreed. China moved first and grounded the plane. It was followed by the European Union, India, Australia, Singapore, and Canada. The US grounded it on March 13, 2019. 


A traditional approach to understanding Boeing’s 737 MAX 8 failures would result in us attempting to draw a line between cause and effect, beginning with the issues with the angle of attack sensors. (After all, but for these faulty sensors, the crashes would have been avoided.)

But taking a complex systems approach, especially through the lens of the sand pile model, provides a more useful way of viewing these types of incidents. Rather than trying to string all the contributing factors together in a line, the sand pile model asks us to consider how each of them interacted with one another, and layered upon one another, to build a hill.

It asks us to examine: the change in culture at Boeing from engineering excellence to cost cutting; the need to get a new aircraft out quickly and cheaply in order to compete with Airbus; the decision on no simulator training for existing 737 pilots; the use of MCAS, and then the extension of that use; the software’s reliance on a single sensor; the fear the FAA wouldn’t certify the aircraft if MCAS was deemed new functionality; the decision not to tell the trainers and pilots about the software, nor provide details of it in the manual.

It asks us to treat the issues with the angle of attack sensors as the initiating event, with the failure being the result of the shape of the hill we have built, not the grain of sand we have dropped.

It asks us to re-examine our more traditional views on cause and effect, and instead to look more closely at the sand piles we build in our own projects and organisations. It requires us to ask ourselves if the systems we have built are tending towards a critical state, just waiting for that single, innocuous grain of sand to bring them tumbling down. 

Author: Dr Sean Brady, managing director, Brady Heywood.


1) The account of Airbus and Boeing’s history, and the development of the 737 and 737 MAX 8, including the crashes, is taken from Robison, P. (2021). Flying Blind: The 737 MAX Tragedy and the Fall of Boeing. New York: Doubleday. This paper only discusses some aspects of the 737 MAX 8 story. 

2) The description of the sand pile model is taken from Buchanan, M. (2001). Ubiquity: Why Catastrophes Happen. New York: Three Rivers Press. 

3) From Miller, J.H. and Page, S.E. (2009). Complex Adaptive Systems: An Introduction to Computational Models of Social Life. Princeton University Press. 

4) Buchanan, M. (2001). Ubiquity: Why Catastrophes Happen. New York: Three Rivers Press. 

5) While there are two angle of attack vanes on the 737 MAX 8, the software would use only one.