Immediate Takeaways From the Power Grid Disaster in Texas
As temperatures plunged during the February 2021 Texas winter storm, many organizations lost power, water, and access to their facilities and networks before they could even ask “who turned off the lights?” Many employees, already working from home due to COVID-19, also had to deal with water crises, including loss of water service, contaminated water and burst pipes resulting from the loss of power.
In this environment, personnel responding to the crisis struggled to remember the responsibilities carefully outlined in their organization’s disaster recovery and business continuity plans, which were often inaccessible. These organizations learned the hard way that if you’re facing a disaster and trying to access the disaster recovery and business continuity plans from your system for the first time, you’re in trouble.
This is one of several takeaways as organizations regroup from the immediate aftermath of the Texas weather event known as Uri – a true “Black Swan” event. With the unique combination of disaster scenarios and communication barriers, those responsible for managing their organization’s disaster recovery efforts are reflecting on the effectiveness of those plans and testing strategies. In the months ahead, organizations should transparently identify their vulnerabilities as risks are reevaluated and disaster and business continuity plans are updated. Here are some items to consider as you go through your evaluation process:
Internal and external outreach. When systems are down, information and communication are crucial, as even the best-laid plans for reaching both internal and external stakeholders can turn chaotic. Organizations that had protocols in place to reach people via multiple channels, including text messaging, email, website updates and personal contacts, had the best chance of communicating the right information to employees, customers, vendors and other stakeholders in a timely manner.
Training and preparation. When power was lost and operations were interrupted, departments within organizations did not fully understand their respective business continuity plans and roles during the event, even where these were well documented in their plans. Preparation for executing emergency response and business continuity plans must begin well in advance. Practice testing, often conducted as a tabletop exercise, should be at the forefront, along with updating documentation and training and revising plans based on what was learned from pre-event testing. Key personnel responsible for executing the plans should keep printed copies of the plans at their residence or another easily accessible, but secure, location.
Train early and often to be prepared for potential disaster events. Tabletop exercises should involve a wide variety of scenarios drawing on different aspects of the organization’s disaster recovery and incident response plans. Once the exercise is completed, a post-mortem should be held to see what went well and what didn’t, with adjustments made to plans as necessary. Long before an event occurs, ensure there is a shared understanding of appropriate procedures at every level of the organization. Make sure everyone knows who the point people are, and build redundancies into the plan so that it does not depend on just one individual.
Cloud and colocation facilities. Most organizations that use cloud providers, such as Microsoft or Amazon, did not experience significant disruptions during Uri unless they had problems with internet connectivity. Similarly, organizations that moved their servers to Tier 3 or Tier 4 data centers likely experienced fewer IT disruptions as long as they were able to maintain communications between facilities.
Organizations managing their own data centers, on the other hand, faced significant challenges. Once there was no power, they had to switch to backup power from a fuel generator or batteries. If a fuel generator was used, they had to hope the fuel source (natural gas or diesel) was available. Battery backups typically lasted a few hours and then required the systems to be powered down. Given the prolonged outages in many locations and the challenges these organizations faced, we can expect organizations still using traditional onsite data centers to begin evaluating a move to the cloud and/or commercial data centers.
Geographic redundancy. Under normal circumstances, having a production data center in Dallas with a redundant backup site in San Antonio may be appropriate. But in this statewide emergency, both data centers may have been unavailable because they were tied to the same failing power grid, even though they were separated geographically by a few hundred miles. Going forward, some organizations will be assessing whether to have production and backup data centers on separate power grids (Eastern Grid, Western Grid and Texas Grid).
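As a starting point for that assessment, the check can be sketched in a few lines of Python. This is a minimal illustration only: the site inventory and the `Site` and `shared_grid_pairs` names are hypothetical, and a real review would draw grid assignments from the organization's own facility records.

```python
from dataclasses import dataclass


# Hypothetical inventory record; names and grid labels are illustrative.
@dataclass(frozen=True)
class Site:
    name: str
    role: str   # "production" or "backup"
    grid: str   # "Texas", "Eastern" or "Western"


def shared_grid_pairs(sites):
    """Return (production, backup) name pairs that sit on the same power grid."""
    production = [s for s in sites if s.role == "production"]
    backups = [s for s in sites if s.role == "backup"]
    return [
        (p.name, b.name)
        for p in production
        for b in backups
        if p.grid == b.grid
    ]


sites = [
    Site("Dallas", "production", "Texas"),
    Site("San Antonio", "backup", "Texas"),
    Site("Phoenix", "backup", "Western"),
]

# Any pair listed here would lose both sites in a single grid failure.
print(shared_grid_pairs(sites))
```

In this example only the Dallas and San Antonio pair is flagged, because the hypothetical Phoenix backup sits on a different grid.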
Communications redundancy. Organizations that remained dependent on one communications carrier in a given location will now be re-evaluating that strategy. To continue operations through a catastrophic event, organizations should assess whether multiple internet and telecommunications lines to every facility are more appropriate. Power outages in various locations also affected communications service providers, so at any given time an office or data center may have lost its internet or wide area network connection. Being able to switch to other carriers when necessary is crucial to maintaining connectivity. Just as important, IT and other staff should understand the key steps to be followed and the expected timing for rollover of redundant systems.
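The rollover logic staff need to understand is usually a simple priority order. The Python sketch below shows the idea under stated assumptions: the carrier names are hypothetical, and the health check is simulated with a set of "down" links, whereas a real check would depend on the organization's own network equipment and monitoring.

```python
def pick_active_link(links, is_up):
    """Return the first link, in priority order, that passes the health check.

    links: list of link names, highest priority first.
    is_up: callable that takes a link name and returns True if it is healthy.
    """
    for link in links:
        if is_up(link):
            return link
    return None  # every link is down; escalate per the continuity plan


# Hypothetical carriers and a simulated outage, for illustration only.
links = ["carrier_a_fiber", "carrier_b_cable", "cellular_backup"]
down = {"carrier_a_fiber"}

active = pick_active_link(links, lambda link: link not in down)
print(active)
```

Documenting the priority order and the "all links down" escalation step in the plan itself helps staff know what to expect during a rollover.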
Impact of your vendor’s providers. A fourth-party provider is an organization with which you do not have a direct contract but which provides critical services to a vendor (third party) you are engaged with. Just as you rely on third parties, those vendors may rely on the products or services of other organizations. Fourth-party providers should be managed and assessed through your organization’s vendor management program and may be identifiable through System and Organization Controls (SOC) reports from your key service providers. When a disaster or unexpected event affects a fourth-party provider, your organization may experience significant negative impact beyond your immediate control. Therefore, organizations should include the fourth-party providers they rely on in their risk assessments.
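One way to begin that inventory is to record who each vendor depends on and then list the providers that sit one step removed. The Python sketch below is illustrative only: the vendor names and the `fourth_parties` function are hypothetical, and in practice the dependency map would be built from contracts, SOC reports and vendor questionnaires.

```python
# Hypothetical dependency map: each key is a party, each value lists the
# providers it relies on. All names are made up for illustration.
providers = {
    "our_org": ["payroll_vendor", "hosting_vendor"],
    "payroll_vendor": ["cloud_host_x"],
    "hosting_vendor": ["power_utility_y", "cloud_host_x"],
}


def fourth_parties(org, providers):
    """Map each fourth party to the third-party vendors that depend on it."""
    third = providers.get(org, [])
    found = {}
    for vendor in third:
        for sub in providers.get(vendor, []):
            # Skip parties the organization already contracts with directly.
            if sub not in third and sub != org:
                found.setdefault(sub, []).append(vendor)
    return found


exposure = fourth_parties("our_org", providers)
for party, vendors in exposure.items():
    print(party, "<-", ", ".join(vendors))
```

A fourth party that appears behind several vendors, such as the hypothetical `cloud_host_x` here, represents concentrated exposure and may deserve extra attention in the risk assessment.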
Stay tuned for more information about steps organizations should consider to improve their disaster response planning. Weaver is available to help you with business interruption and continuity issues, risk assessment and IT advisory services. Contact us for more information. We are here to support you.
© 2021