Posted on 19 December 2018
As you’ll be aware, on the morning of Thursday 13 December a major outage left the campus network and all services unavailable from 8.30am until approximately 10.30am.
We are very aware that even outside term time, this level of downtime has a massive impact on all members of the University. Research, teaching and administrative work all continue outside term and this failure continued well into the working day. This falls far short of the standards we set ourselves and so we’d like to explain what happened and how we’ll work to reduce the risk of it happening again.
Within our data centres, equipment is set up so that one part of it can fail or be updated with no impact on service. On Tuesday 11 December, as part of a scheduled programme of work, we updated a “cluster” of equipment consisting of two components within one data centre, and this work went exactly to plan: half of the cluster was updated followed by the second with no impact on live services or network availability.
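The pattern described above, updating one half of a redundant pair while the other half keeps carrying live traffic, can be sketched as follows. This is a hypothetical illustration rather than our actual tooling; the node names, update step, and health check are all placeholders.

```python
# A hypothetical sketch of a rolling update across a redundant pair:
# update one node at a time and verify it is healthy before moving on,
# so the peer node keeps carrying live traffic throughout.

def rolling_update(cluster, update, healthy):
    """Update each node in turn; stop immediately if one fails its check."""
    updated = []
    for node in cluster:
        update(node)                  # apply the new software to this node
        if not healthy(node):         # the peer is still serving traffic,
            raise RuntimeError(       # so abort before touching it
                f"{node} failed its post-update health check")
        updated.append(node)
    return updated

# Placeholder update and health-check hooks, for illustration only.
result = rolling_update(
    ["node-a", "node-b"],
    update=lambda node: None,
    healthy=lambda node: True,
)
print(result)                         # prints ['node-a', 'node-b']
```

The key property is that the procedure never has both nodes out of service at once; on Thursday it was the update step itself, not this sequencing, that brought the whole cluster down.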
On Thursday 13 December, we performed the same work in the other data centre and… it didn’t go so well. Even though the equipment was running the same software as the other cluster, the first attempted update crashed the whole cluster and stopped all networking within the data centre. We cannot be sure exactly why this happened, as our priority was restoring service, but our best guess is that the system, which had been running continuously for a number of years, had developed a latent fault which only revealed itself during the attempted restart.
This sudden loss of connectivity broke most of our server estate, including services that are crucial to the operation of the wired and wireless network around campus. A technical explanation as to why this happened can be made available to Departmental Computing Officers on request.
Technical teams in IT Services worked to restore the network first, bringing wired and wireless connectivity back by 10.30am. Most services were restored by 11am, but the restoration of all systems took another couple of hours.
The incident has revealed a number of issues with the redundancy of our systems. We’ve discovered (the hard way) dependencies between things that we had thought were independent, and those links made the incident far more severe. We are developing a programme of work and testing to correct these issues, with implementation during the first part of 2019.
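One way to hunt for the kind of hidden coupling described above is to record which services depend on which, and then check that map for cycles: a cycle means two “independent” services each rely on the other, so neither can recover alone. The sketch below is a generic illustration of that check; the service names and dependencies are invented for the example and do not describe our actual estate.

```python
# Hypothetical audit of a service dependency map. A cycle in the map
# means a group of services cannot restart without each other -- the
# failure mode this incident exposed. Names here are illustrative.

def find_cycle(deps):
    """Depth-first search returning one dependency cycle, or None."""
    visiting, visited = set(), set()
    path = []

    def dfs(node):
        visiting.add(node)
        path.append(node)
        for dep in deps.get(node, set()):
            if dep in visiting:                 # back-edge: cycle found
                return path[path.index(dep):] + [dep]
            if dep not in visited:
                cycle = dfs(dep)
                if cycle:
                    return cycle
        visiting.discard(node)
        visited.add(node)
        path.pop()
        return None

    for node in list(deps):
        if node not in visited:
            cycle = dfs(node)
            if cycle:
                return cycle
    return None

# An invented example of mutual dependence between core services:
deps = {
    "dhcp": {"dns"},       # DHCP servers resolve names via DNS
    "dns": {"storage"},    # DNS zone data lives on shared storage
    "storage": {"dhcp"},   # storage nodes get addresses from DHCP
}
print(find_cycle(deps))    # prints ['dhcp', 'dns', 'storage', 'dhcp']
```

In practice the hard part is gathering an accurate dependency map in the first place; the check itself is cheap once the map exists.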
Finally, I must apologise again for the disruption this caused. If anyone would like to discuss any of the issues raised in person, I’m more than happy to come and talk through what happened in whatever level of detail is required.
Please contact firstname.lastname@example.org, Assistant Director (Infrastructure), IT Services, should you have any questions or queries.