On the morning of October 29, 2018, Lion Air Flight 610 – a Boeing 737 MAX 8 aircraft – is preparing for take-off at Soekarno-Hatta International Airport, Jakarta [1].

In command is Bhavye Suneja, who is 31 years old, with more than 6,000 hours of flight time, most of which were in previous versions of the 737. His co-pilot, Harvino, is 10 years older, with more than 5,000 flight hours.

At 6.20am, they take off. But only minutes into the flight, Suneja’s control column starts shaking. This indicates the plane is nearing a stall, a situation where the angle of the plane’s wings relative to the oncoming air – the plane’s so-called angle of attack – is too steep, resulting in a loss of aerodynamic lift.

At the same time, two alerts go off in the cockpit, warning of unreliable altitude and airspeed readings. Harvino asks the captain if he wants to turn around, but Suneja says no. He asks Harvino to get clearance for a holding point to buy them some time. Harvino gets on the radio: “Flight Control Problem”. Then the nose of the plane suddenly dips forward.

Suneja has no idea why it’s happened. He presses the trim switch on his control column, which changes the angle of the small wing on the rear of the aircraft – the horizontal stabiliser. The nose of the plane comes back up. But then it suddenly dips forward again. It’s like the aircraft has a mind of its own.

Beside Suneja, Harvino is working through Boeing’s Quick Reference Handbook, looking for an emergency checklist to work out what’s wrong. But the handbook is no help – it says nothing about the nose repeatedly pitching downwards.

Over the next eight minutes, Suneja continues to fight with the controls. The plane repeatedly pitches forward, filling the pilots’ view with the blue expanse of Jakarta Bay. And each time, Suneja flicks the trim switch, and the nose comes back up. Then it pitches downwards again.

It does this 21 times, and although they are cleared for an altitude of 27,000 feet, they are still less than 6,000 feet in the air and dangerously close to the water. Suneja asks the co-pilot to take the controls. But as Harvino takes over, the plane pitches downward again. Harvino presses the trim switch, but not for as long as Suneja had.

The plane pitches further forward. Then further forward again. Harvino tells the captain they’re pointing downwards. Suneja, distracted, says, “It’s okay”. They plummet at 10,000 feet per minute and Harvino pulls desperately on the control column, but it has no effect.

Alarms blare in the cockpit: “Sink Rate, Sink Rate”. Blue water fills their view. Harvino starts to pray. Suneja is silent. The alarms continue: “Terrain, Terrain”. They hit the water at an almost vertical angle, travelling at 500 mph. All 189 people on board are killed.

Over the months that followed, two narratives played out. The public narrative was driven by Boeing’s CEO and top engineers, as well as by the airline regulator in the United States, the Federal Aviation Administration (FAA).

As far as they were concerned, there was no fundamental issue with the 737 MAX 8; the crash, they claimed, was Lion Air’s fault – the result of a poorly managed Indonesian airline. In time, the technical cause of the failure would be identified as an issue with one of the plane’s angle of attack (AoA) sensors.

As its name suggests, this sensor measures the plane’s angle of attack – the angle of the wing relative to the oncoming airflow. It was found to be reading an erroneously high angle, which incorrectly suggested that the aircraft was nearing a stall. But we know that aircraft are designed and built with multiple layers of redundancy, so how could an incorrect reading from a single sensor [5] crash a plane?

Complex systems 

When we think about cause and effect, it would be very easy to conclude that this issue with the angle of attack sensor ‘caused’ the crash. In other words, the crash could have been avoided if the sensor had been working correctly. But when we examine systems as complex as the development of the 737 MAX 8, we need to think about failure differently – we need to take a complex systems approach.

Many of us think in Newtonian terms, meaning that when we examine systems, we tend to believe there is a direct link between cause and effect – that everything that happens has a definite, identifiable cause.

Furthermore, we expect symmetry: the seriousness of the effect is related to the seriousness of the cause – significant failures happen because of significant causes, and vice versa. Complex systems, however, behave differently. They are made up of agents or components that interact with one another and produce feedback.

While we often think of systems as ‘the sum of their parts’, complex systems are better thought of as ‘the sum of their parts and interactions’. An intuitive analogy is a sports team. The overall performance of a sports team is so much more than the sum of the abilities of the individual players.

A good team is one where the interactions between those players produce a performance that transcends the abilities of the individuals. Further, attempting to understand the overall performance of the team by studying each player in isolation will not provide much insight into the behaviour of the team as a whole.

To repeat the point above, a team is the sum of its players and their interactions. And it is these interactions that result in complex systems having a disproportionate relationship between cause and effect.

Relatively small causes can produce very large effects. For example, the assassination of Archduke Franz Ferdinand in 1914 in Sarajevo sparked the First World War and led to millions of deaths. How could a single assassination lead to a world war? This paper illustrates a different way to think about system failures, and to help us on this journey we begin in the most unlikely of places: a model of a sand pile. 

The sand pile model 

The sand pile model, which started as a thought experiment, was developed by the physicist Per Bak [2]. He and his colleagues created what physicists often describe as a ‘toy model’ – a model that allows you to think about complex phenomena in a simple way.

Considerable research has been done on this model, and we will only examine a few key concepts here. The model is as simple as it is profound. Bak asks us to imagine the following situation: we have a tabletop and drop grains of sand onto it at random locations – one grain at a time.

As more and more grains fall, they build up into small hills. The formation of these hills is random because the grains of sand fall at random locations. As the hills grow taller, they become steeper.

Eventually, one becomes so steep that an avalanche results when the next grain of sand lands on it. This avalanche could be localised, or it could trigger further avalanches as it strikes neighbouring hills.
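To make these mechanics concrete, below is a minimal sketch of the model in Python. It follows the grid-based formulation physicists typically use for Bak’s sand pile (the Bak–Tang–Wiesenfeld rules): grains land on random cells of a ‘tabletop’ grid, and any cell holding four or more grains topples, passing one grain to each of its neighbours; grains pushed over the edge fall off the table. The grid size, number of grains dropped and reporting threshold below are illustrative choices, not values taken from Bak’s work.

```python
import random

SIZE = 50          # side length of the 'tabletop' grid (illustrative choice)
THRESHOLD = 4      # a cell topples once it holds this many grains

def drop_grain(grid):
    """Drop one grain at a random cell; return the size of any avalanche it triggers."""
    x, y = random.randrange(SIZE), random.randrange(SIZE)
    grid[x][y] += 1

    avalanche = 0
    unstable = [(x, y)]
    while unstable:
        i, j = unstable.pop()
        if grid[i][j] < THRESHOLD:
            continue                       # this cell is stable; nothing to do
        grid[i][j] -= THRESHOLD            # the cell topples...
        avalanche += 1
        for ni, nj in ((i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1)):
            if 0 <= ni < SIZE and 0 <= nj < SIZE:
                grid[ni][nj] += 1          # ...passing one grain to each neighbour,
                unstable.append((ni, nj))  # which may now topple in turn
            # grains pushed over the edge simply fall off the table
    return avalanche

grid = [[0] * SIZE for _ in range(SIZE)]
for step in range(100_000):
    size = drop_grain(grid)
    if size > 500:                         # occasionally a single grain sets off a huge cascade
        print(f"grain {step}: avalanche of {size} topplings")
```

The avalanche size here is simply the number of topple events that one dropped grain sets in motion before the pile settles again.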

Now, consider what is causing these avalanches. On the one hand, we could say that the cause of the avalanche is the single grain of sand that fell and struck the hill. (The avalanche only occurred because this specific grain fell at this precise location.) But blaming the single sand grain alone doesn’t fully explain why the avalanche occurred.

First, most sand grains that fall on the table do not result in avalanches. Second, a single grain of sand can start a small or large avalanche – the initiating event for each is the same, but the magnitude of the effect is independent of the initiating sand grain.

In other words, the single grain of sand doesn’t help us explain why some avalanches are larger than others. Instead, we can attribute the cause of the avalanche to the hill itself. If the hill weren’t shaped as it was, the grain of sand would not have initiated the avalanche.

Shifting our thinking about the cause of the avalanche from the grain of sand to the shape of the hill has several profound implications for understanding failure in complex systems. If we accept that the shape of the hill – and not the sand grain – dictates the risk of an avalanche, then understanding how the hill came to be that shape is critical.

We, therefore, need to know its history – how it was produced as each grain fell on it. In complex systems, we cannot simply take a snapshot in time; we must consider the accumulation of steps that brought the system to this point.

Further, the model introduces the concept of the ‘critical state’. As more and more sand falls on the table and the hills grow taller, the system becomes increasingly prone to an avalanche.

Bak described this process of reaching a critical state as ‘self-organised criticality’ because nobody is organising the sand pile or increasing the risk of an avalanche; instead, the pile does this naturally through the interactions between the individual sand grains.

Miller & Page (2007) state that “The key driving force behind self-organised criticality is that micro-level agent behaviour tends to cause the system to self-organise and converge to critical points at which small events can have significant global impacts” [3].
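To see this convergence in action, we can continue the sketch above (reusing drop_grain and SIZE; the warm-up length and size bands below are arbitrary choices): let the pile fill up until it has self-organised toward its critical state, then tally the avalanche triggered by each subsequent, identically dropped grain. A run of this kind typically shows exactly the point Miller & Page describe: most grains trigger little or nothing, while a rare few sweep across much of the table.

```python
# Continues the sketch above: reuses SIZE and drop_grain(), with a fresh grid.
grid = [[0] * SIZE for _ in range(SIZE)]

# Warm-up: let the pile self-organise toward its critical state.
for _ in range(50_000):
    drop_grain(grid)

# Record the avalanche triggered by each of the next 50,000 identical drops.
bands = {"no avalanche": 0, "1-9 topplings": 0,
         "10-99 topplings": 0, "100+ topplings": 0}
for _ in range(50_000):
    size = drop_grain(grid)
    if size == 0:
        bands["no avalanche"] += 1
    elif size < 10:
        bands["1-9 topplings"] += 1
    elif size < 100:
        bands["10-99 topplings"] += 1
    else:
        bands["100+ topplings"] += 1

for label, count in bands.items():
    print(f"{label:>15}: {count}")
```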

The sand pile model, therefore, is a helpful way to understand why simple causes can produce significant failures in complex systems. For example, we can use it to re-examine the cause of the First World War.

Serbian nationalist Gavrilo Princip assassinated Archduke Franz Ferdinand in Sarajevo in 1914, setting in motion a series of events that led to a world war. Looking at this through the lens of the sand pile model, as Mark Buchanan does in his book Ubiquity, we can think of the assassination as a grain of sand and Europe as a hill in the sand pile [4].

This hill was in a critical state due to interlocking treaties between multiple countries – a state ripe for a single grain of sand, Gavrilo Princip, to fall and start an avalanche. Once this grain landed, the interactions between the European parties cascaded and resulted in the war, just like a cascade of grains in the sand pile.

If it hadn't been for Princip and the assassination, there probably would have been another initiating event. It was the critical state that mattered, not the specific grain of sand. With this in mind, let’s go back and examine the story of the Boeing 737 MAX. 

Boeing, Airbus and the A320neo 

The story of the MAX begins not with Boeing but with its rival, Airbus, a European consortium that received its first order in the US in 1978 and which, in 1984, launched the Airbus A320 in direct competition with Boeing’s existing 737.

In 2010 it repeated the move with the A320neo, a plane designed to take more market share from the 737. This aircraft was essentially the existing A320 fitted with new, more fuel-efficient engines, the ‘neo’ standing for ‘new engine option’.

By the Paris Air Show in June 2011, Airbus had secured more than a thousand orders. Boeing had to respond – more than a third of its profits came from the 737 – but it didn’t have an aircraft that could compete with the fuel efficiency of these new planes.

Then in July of that year, it got word of a potential deal between American Airlines and Airbus. It looked like the airline was about to order the neo. Boeing stepped in to try and secure its own deal, convincing American Airlines to split the order: the airline would buy 260 A320neos, with the remaining 200 planes being a more fuel-efficient aircraft from Boeing. This plane, which was entirely hypothetical at this point, would come to be named the 737 MAX 8. 

(In Part II, Dr Brady will examine the development of the 737 MAX 8; the MCAS software issue; two crashes; and a conclusion.)

Author: Dr Sean Brady, managing director, Brady Heywood. Email: sbrady@bradyheywood.com.au

References

1) The account of Airbus and Boeing’s history and of the development of the 737 and 737 MAX 8, including the crashes, is taken from Robison, P. (2021). Flying Blind: The 737 MAX Tragedy and the Fall of Boeing. New York: Doubleday. This paper discusses only some aspects of the 737 MAX 8 story.

2) The description of the sand pile model is taken from Buchanan, M. (2001). Ubiquity: Why Catastrophes Happen. New York: Three Rivers Press.

3) Miller, J.H. and Page, S.E. (2007). Complex Adaptive Systems: An Introduction to Computational Models of Social Life. Princeton, NJ: Princeton University Press.

4) Buchanan, M. (2001). Ubiquity: Why Catastrophes Happen. New York: Three Rivers Press.

5) While there are two angle of attack vanes on the 737 MAX 8, the software would use only one.