The world increasingly depends on the services provided by major corporations, which in turn depend heavily on their IT infrastructure. British Airways’ recent IT meltdown is just the latest in a string of serious technical failures experienced by big companies and government agencies, with such failures affecting hundreds of thousands or millions of people and incurring potentially huge costs.
For BA, this has resulted in dissatisfied customers, a tarnished reputation and calls for the jobs of those in charge. The anticipated compensation bill may top £100m. If organisations are so dependent on systems’ uninterrupted functioning, how could this happen?
It’s hard to overstate the importance to BA’s operations of this networked systems infrastructure. The company’s systems provide services to customers and staff that range from flight maps and operational data for pilots, to online check-in, booking and baggage handling for customers.
BA’s issues started at Boadicea House, one of its two data centres located at Heathrow Airport. As with most data centres, Boadicea House boasts a so-called uninterruptible power supply, or UPS, which typically provides power through multiple redundant systems such as conventional mains power, backup generators and backup batteries. When power was completely lost at Boadicea House, a management strategy for handling power loss was in place that involved the gradual, phased return of power to the data centre’s servers. But crucially this strategy appears not to have been executed as planned, resulting in the “uncontrolled” return of power. Reports indicate that this led to a power surge that further exacerbated the problem by physically damaging servers and incapacitating backup systems.
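To make the idea of a phased return of power concrete, here is a minimal sketch in Python. It is not BA's actual procedure, which has not been published: the function `plan_phases`, the rack names and the kilowatt figures are all hypothetical. It only illustrates the general principle of switching equipment back on in batches so that no single step adds more than a chosen amount of load.

```python
from typing import Dict, List


def plan_phases(loads_kw: Dict[str, float], max_step_kw: float) -> List[List[str]]:
    """Group racks into power-on phases, each adding at most max_step_kw.

    Hypothetical illustration: racks are taken in the order given and a new
    phase is started whenever the next rack would exceed the per-phase limit.
    """
    phases: List[List[str]] = []
    current: List[str] = []
    current_kw = 0.0
    for rack, kw in loads_kw.items():
        if kw > max_step_kw:
            raise ValueError(f"{rack} alone exceeds the per-phase limit")
        if current and current_kw + kw > max_step_kw:
            # Close the current phase before this rack would overload it.
            phases.append(current)
            current, current_kw = [], 0.0
        current.append(rack)
        current_kw += kw
    if current:
        phases.append(current)
    return phases


# Made-up loads in kilowatts, with a made-up limit of 100 kW per phase.
example_loads = {"rack_a": 40.0, "rack_b": 50.0, "rack_c": 30.0, "rack_d": 60.0}
print(plan_phases(example_loads, max_step_kw=100.0))
# → [['rack_a', 'rack_b'], ['rack_c', 'rack_d']]
```

An "uncontrolled" return of power corresponds to skipping this kind of plan and energising everything in a single step.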
(All too) human error
While the sequence of events that led to this outcome is still unclear, the cause is reportedly related to human error on the part of an engineer or contractor at the data centre. Data centres are designed for massive redundancy and are highly secure facilities. Indeed, as with many mission-critical systems, they are designed partly based on the assumption that they will fail, and so huge emphasis is placed on minimising the impact of these failures by developing efficient and effective recovery strategies. While such plans and precautions are never perfect, historically a good first place to look in the event of failures is the role of human error. Statistics indicate that somewhere between 30% and 60% of failures are down to human error. This increases to 80-90% in some fields that demand high-integrity systems.
Another recent example of human error leading to huge repercussions was the massive Amazon Web Services (AWS) outage. Amazon has stated that this was due to an erroneous command entered by a qualified, authorised engineer. This error resulted in the temporary loss of huge swathes of internet-connected services hosted by Amazon, from home security systems to business communication apps, email services and company websites.
One problem begets another
Like many modern IT services, these systems are built to provide the best possible standard of service using carefully designed dependencies. A side effect of their sheer scale is that they may be vulnerable to small failures propagating and cascading, so that potentially minor problems lead to major ones.
This appears also to be the case for BA. Its systems were designed to survive reduced performance or even a temporary outage of the data centre at Boadicea House by using the second data centre to take up the slack. But the procedure for bringing power back up at Boadicea House appears not to have gone as planned, resulting in further damage.
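How one small failure can spread through a web of dependencies can be sketched in a few lines of Python. The service names and the dependency map below are invented for illustration and do not describe BA's real architecture. The sketch also makes a deliberately pessimistic assumption: a service fails as soon as anything it depends on fails, with no redundancy to absorb the loss.

```python
from collections import deque
from typing import Dict, List, Set


def affected_by(failed: str, depends_on: Dict[str, List[str]]) -> Set[str]:
    """Return every service that directly or indirectly relies on `failed`."""
    # Invert the map: for each component, which services rely on it directly?
    dependants: Dict[str, List[str]] = {}
    for service, deps in depends_on.items():
        for dep in deps:
            dependants.setdefault(dep, []).append(service)

    affected: Set[str] = set()
    queue = deque([failed])
    while queue:
        node = queue.popleft()
        for service in dependants.get(node, []):
            if service != failed and service not in affected:
                affected.add(service)
                queue.append(service)
    return affected


# Hypothetical dependency map: each service lists what it needs in order to run.
example = {
    "check_in": ["booking"],
    "booking": ["database"],
    "baggage": ["database"],
    "flight_maps": [],
}
print(sorted(affected_by("database", example)))
# → ['baggage', 'booking', 'check_in']
```

In this toy example the loss of a single component takes out three of the four services, which is the kind of cascade that redundancy and careful recovery procedures are meant to contain.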
Much research into mission-critical systems design is directed at what is referred to as “human factors engineering”. This includes designing systems and processes in such a way that they focus on accommodating their users, for example by guiding their interaction with the system to avoid potential errors.
In many cases, this includes aspects that have in the past been somewhat overlooked – for example, the careful design of user interfaces (UI) and the user’s experience of using the software (UX). Engineers may, for instance, focus on minimising the likelihood of so-called “fat-finger errors” by designing UIs so that controls are hard to confuse or accidentally trigger.
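The same thinking applies to command-line tools used by engineers. The Python sketch below shows one possible guard against a fat-finger error. The function `check_shutdown`, its 10% limit and the confirmation phrase are assumptions made for illustration, not a description of any real operator's tooling. The guard refuses commands that would touch too large a share of the servers at once, and requires the operator to retype the number of servers affected.

```python
from typing import List


class UnsafeCommand(Exception):
    """Raised when a command fails a safety check and must not be run."""


def check_shutdown(
    targets: List[str],
    fleet_size: int,
    typed_confirmation: str,
    max_fraction: float = 0.1,  # hypothetical limit: at most 10% of the fleet
) -> None:
    """Raise UnsafeCommand unless the shutdown request passes every check."""
    if not targets:
        raise UnsafeCommand("no servers selected")
    if len(targets) > max_fraction * fleet_size:
        raise UnsafeCommand(
            f"refusing to shut down {len(targets)} of {fleet_size} servers at once"
        )
    # The operator must retype the size of the change, not just press Enter.
    expected = f"shutdown {len(targets)} servers"
    if typed_confirmation.strip() != expected:
        raise UnsafeCommand(f"confirmation must read exactly: {expected!r}")


# Passes silently: two servers out of 100, with the matching confirmation.
check_shutdown(["s1", "s2"], fleet_size=100, typed_confirmation="shutdown 2 servers")
```

A mistyped selection that matched 50 servers instead of two would be stopped twice: once by the size limit, and again because the confirmation phrase would no longer match what the operator expected to type.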
Ultimately, the increasing complexity of IT infrastructure and its importance to us means the stakes are far higher – and this dramatically increases the technological and engineering challenges, too.