As the global IT outage continues to unfold and we await CrowdStrike's findings on how an update caused such a large-scale IT issue, I've been thinking about how organisations might better protect themselves from a scenario such as this.
The truth is that a widely distributed Blue Screen of Death taking large numbers of endpoint computers offline could well have been missing as a scenario from many Business Continuity and Disaster Recovery plans. Most organisations focus on keeping core infrastructure online and quick to recover, while the endpoint has become a commodity that can simply be remotely managed or re-deployed.
When an issue such as this strikes, scale is the problem: reduced IT budgets, smaller IT teams resulting from automation and remote management, and the outsourcing of IT all make for a difficult scenario when onsite, hands-on fixes are required.
So… while we wait for further updates from CrowdStrike, and while enterprise organisations continue to grapple with bringing themselves back to full capability, what can we consider as part of a more robust update strategy? This scenario could happen with any key vendor.
- Whilst it’s easy to say that IT departments should be testing before deploying, and that this is the ideal scenario, the pressure on IT departments to balance time and effort against keeping the organisation secure is real. There is a degree of trust when it comes to vendor updates, but it is worth considering how the organisation can use different update policies to reduce the risk of a system-wide issue. To do this, it’s worth having:
  - Different update servers that deliver on a live-release and delayed-release model, so specified endpoints get the initial release of an update and the majority of endpoints receive it at a defined point afterwards (a minimal sketch of this ring approach follows after this list).
  - Specified lower-risk test machines that receive the update first so its impact can be reviewed.
  - An update process that only applies updates at reboot, avoiding the risk of devices crashing mid-session and then being impossible to access remotely.
- Avoid updates at critical points in the organisation’s schedule to reduce the risk of impact.
- Look into tools that monitor end-user experience. These systems detected the rise in Blue Screen of Death crashes as they occurred around the globe, allowing IT teams to identify the issue proactively, see the fault and implement a fix ahead of time simply by recognising that a large number of end-user devices were crashing with a common signature (a simple sketch of this kind of spike detection also follows after this list).
- Bring mass endpoint outages into the scope of business continuity planning to ensure device recovery is fast and can be achieved remotely.
- Review existing DR plans for critical network and server infrastructure, ensuring recovery time and recovery point objectives are clearly defined and that testing is carried out.
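
On the staggered-release point above, the idea can be sketched in a few lines of Python. This is a minimal, hypothetical illustration assuming a pilot ring covering roughly 5% of the fleet and a 48-hour soak period before the broad ring; the ring names, percentages and delay are my own assumptions, not guidance from any vendor or update product.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
import hashlib

# Hypothetical sketch: assign endpoints to update "rings" so that only a small
# pilot group receives a vendor release immediately, while the broad fleet
# receives it after a soak period. Ring names, shares and delays are
# illustrative assumptions.

@dataclass
class RingPolicy:
    name: str
    share: float          # fraction of the fleet in this ring
    delay: timedelta      # how long after the vendor release the ring updates

RINGS = [
    RingPolicy("pilot", 0.05, timedelta(hours=0)),   # low-risk test machines
    RingPolicy("broad", 0.95, timedelta(hours=48)),  # everyone else, delayed
]

def assign_ring(hostname: str) -> RingPolicy:
    """Deterministically map a hostname into a ring using a stable hash."""
    bucket = int(hashlib.sha256(hostname.encode()).hexdigest(), 16) % 100
    return RINGS[0] if bucket < RINGS[0].share * 100 else RINGS[1]

def scheduled_time(hostname: str, vendor_release: datetime) -> datetime:
    """Earliest time this endpoint should be offered the new release."""
    return vendor_release + assign_ring(hostname).delay

if __name__ == "__main__":
    release = datetime(2024, 7, 19, 4, 0)
    for host in ["lab-win11-01", "finance-pc-204", "warehouse-kiosk-17"]:
        ring = assign_ring(host)
        print(f"{host}: ring={ring.name}, update after {scheduled_time(host, release)}")
```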
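
Similarly, the end-user experience monitoring point can be pictured as a simple spike detector over crash telemetry. The CrashReport record, the 30-minute window and the 5%-of-fleet threshold below are assumptions made purely for illustration; commercial digital experience monitoring tools surface this kind of signal through their own dashboards.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime, timedelta

# Hypothetical sketch: given crash reports already collected by an endpoint
# monitoring agent, flag any bug-check code that affects an unusually large
# share of the fleet within a short window. Thresholds are illustrative.

@dataclass
class CrashReport:
    hostname: str
    bugcheck_code: str     # e.g. "0x50 PAGE_FAULT_IN_NONPAGED_AREA"
    timestamp: datetime

def detect_crash_spike(reports: list[CrashReport],
                       fleet_size: int,
                       window: timedelta = timedelta(minutes=30),
                       threshold: float = 0.05) -> list[str]:
    """Return bug-check codes seen on more than `threshold` of the fleet
    within the most recent `window` of reports."""
    if not reports:
        return []
    cutoff = max(r.timestamp for r in reports) - window
    affected = Counter()
    seen = set()
    for r in reports:
        if r.timestamp < cutoff:
            continue
        key = (r.bugcheck_code, r.hostname)
        if key not in seen:            # count each device once per code
            seen.add(key)
            affected[r.bugcheck_code] += 1
    return [code for code, count in affected.items()
            if count / fleet_size > threshold]
```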
This could happen to any software vendor, and whilst CrowdStrike will hopefully provide a statement explaining how a single update caused the biggest global IT outage on record, we should all take responsibility for our own Business Continuity plans and update processes to balance risk against operational efficiency.
Lessons will be learnt across the industry from this incident, and organisations will be better prepared to avoid an outage of this scale in the future if the warning is heeded.
Further updates to follow as the situation progresses.