“Hey, it’s down” – 7 steps for dealing with SaaS outages
by Jonathan Day
It’s been an interesting few months in the life of a modern IT professional. On May 17th, Salesforce had a multi-day service impact, the largest in their history. On June 2nd, Google Cloud Platform had a large network disruption mainly on the East Coast, taking down/slowing YouTube, G Suite, and services like Google Compute Engine for hours. On June 18th, Google Calendar was down for a few hours. And finally, Cloudflare had a global outage on July 2nd that took many popular public sites down.
There are some wonderful, established incident management processes, but most are tailored to larger and more complex organizations. The Google SRE book is one of my favorites: it has some amazing concepts but is still mostly meant for larger teams. I’m going to try to simplify some of the best ideas and pull from my own experience as an IT Director at two startups over the last 5+ years to walk you through how to handle a major outage that affects your team.
Many of us no longer have full responsibility for our IT estate. We don’t have a cluster we can reboot, a DR site to cut over to, or a config file we can push to fix our applications. In this SaaS-powered world, Amazon, Microsoft, Google, and others can run their cloud applications with more security and reliability than all but the largest, most disciplined, and best-funded internal IT teams. They have the engineering talent, decades of experience, and the largest data centers in the world.
As IT managers, what are we supposed to do when the unthinkable happens? Don’t feel helpless, don’t just throw up your hands – instead, take it head-on and do the following:
As you’re receiving reports of an outage, confirm them. Do a quick check with your change management to ensure a global firewall change or software update couldn’t be causing the issue internally. Open a private Slack channel with your team and stakeholders for triage. Look at the vendor’s status page. If the status page has an RSS feed, run /feed subscribe [feed address] in Slack to get updates pushed into that channel. Test the services yourself, try different networks and devices, and ask others to do the same. Look on Twitter, ask friends at other companies, and try to narrow down the impact of the outage. You want to do this quickly, but you also don’t want to declare an incident just because one user’s Gmail isn’t working on their phone!
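If you would rather read the status feed from a script than from Slack, here is a minimal sketch using only the Python standard library. It assumes the vendor publishes a plain RSS 2.0 feed; the sample feed, the vendor name, and the function name latest_incidents are all made up for illustration, and real feeds differ from vendor to vendor.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample of a status-page RSS feed; real feeds vary by vendor.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example Vendor Status</title>
    <item>
      <title>Investigating: elevated error rates</title>
      <pubDate>Fri, 17 May 2019 18:13:00 GMT</pubDate>
    </item>
    <item>
      <title>Resolved: login delays</title>
      <pubDate>Mon, 13 May 2019 09:00:00 GMT</pubDate>
    </item>
  </channel>
</rss>"""


def latest_incidents(feed_xml, limit=5):
    """Return (title, pubDate) pairs for the first `limit` items of an RSS 2.0 feed."""
    root = ET.fromstring(feed_xml)
    items = root.findall("./channel/item")
    return [
        (item.findtext("title", default=""), item.findtext("pubDate", default=""))
        for item in items[:limit]
    ]


# Print the entries in feed order (status feeds usually list the newest first).
for title, published in latest_incidents(SAMPLE_FEED):
    print(f"{published}  {title}")
```

In practice you would download the feed text from the vendor's published feed address first and pass it to the same function.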
Dust off your established incident response plan. You have one, right? You want to communicate to your users quickly, getting ahead of the incident and providing as much context as you can. Use the appropriate method. Email down? Use Slack. Slack down? Use email. A good first update might look like this (with the text sent to users followed by a note in parentheses explaining each part):
At 11:13am PDT the IT Team started receiving multiple reports of Salesforce CRM not loading or displaying a 500 error on desktop and mobile. Connected apps like /salesforce in Slack are also not working. (When it started, acknowledging the affected service and a bit more about the outage)
By 11:21am PDT further investigation has confirmed that the issue is widespread across the NYC, Miami, Austin, and San Jose offices as well as affecting many remote employees. (Your own internal confirmation, and then identifying who is affected, including anyone who is not)
We will be immediately contacting the vendor and escalating with their engineering team to report and track these errors. (Your first next step)
Users are encouraged to use Email, Slack, Zoom and phone calls in the interim to continue communication. Please log CRM activity in a Google Doc until Salesforce is back up. (A user-friendly workaround to continue working)
The next update will be provided here in #general at 12:30pm PDT or earlier if we have more information. At your discretion, you may submit a ticket to show you are impacted, although this is not required. (Time to next update and how that update will be provided. Instructions on submitting a ticket as a way to “vent” if needed)
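If you want every update to carry the same parts in the same order, the sample above can be turned into a small template. This is only a sketch: the class name IncidentUpdate, its field names, and the exact wording of the rendered sentences are my own illustration, not part of any tool.

```python
from dataclasses import dataclass


@dataclass
class IncidentUpdate:
    """Fields mirror the parts of the sample first update; names are illustrative."""
    started_at: str    # when reports began
    symptoms: str      # affected service and what users see
    confirmed_at: str  # when IT confirmed the scope
    scope: str         # who is affected
    next_step: str     # what IT is doing next
    workaround: str    # how users can keep working
    next_update: str   # when and where the next update will appear

    def render(self) -> str:
        """Return the update as one paragraph per part, in a fixed order."""
        parts = [
            f"At {self.started_at} the IT Team started receiving reports of {self.symptoms}.",
            f"By {self.confirmed_at} further investigation has confirmed that the issue affects {self.scope}.",
            f"Next step: {self.next_step}.",
            f"Workaround: {self.workaround}.",
            f"The next update will be provided {self.next_update} or earlier if we have more information.",
        ]
        return "\n\n".join(parts)


update = IncidentUpdate(
    started_at="11:13am PDT",
    symptoms="Salesforce CRM not loading or displaying a 500 error",
    confirmed_at="11:21am PDT",
    scope="the NYC, Miami, Austin, and San Jose offices and many remote employees",
    next_step="we are contacting the vendor and escalating with their engineering team",
    workaround="log CRM activity in a Google Doc until Salesforce is back up",
    next_update="here in #general at 12:30pm PDT",
)
print(update.render())
```

Because every field is required, the template refuses to build an update that is missing a workaround or a time for the next update, which are the two parts that are easiest to forget under pressure.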
Immediately engage with the best support resource or channel you have at the affected vendor/service. I find that submitting a ticket via email or web support helps me provide the most accurate details including screenshots, error logs, and other important information. Then, I call either the general support number or an account contact like my CSM. During larger outages the support agents or your dedicated contact may be getting slammed, but it still never hurts to get someone live on the phone for escalation. Ask for information about the outage in writing even if the status page has been updated – this is invaluable to provide to management and to request SLA credits once the incident is over.
Continue testing the application to see if the issue has been resolved. There are numerous Chrome extensions that can auto-refresh tabs to check on a web app. I used Homebrew to install the watch package on my Mac, so I can run watch dig <domain> to repeat DNS lookups while monitoring a DNS problem, or watch curl <url> to keep hitting a URL endpoint while monitoring an API problem. Keep tabs on your user feedback; in my experience, end users usually see the resolution more quickly than I do!
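If watch is not available, the same kind of polling can be sketched in Python with the standard library. The function names fetch_status and wait_until_healthy, the 30-second interval, and the choice to treat only HTTP 200 as "healthy" are assumptions made for this example; adjust them to what "working" means for your service.

```python
import time
import urllib.error
import urllib.request


def fetch_status(url, timeout=10):
    """Return the HTTP status code for `url`, or None if the request fails outright."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # the server answered, but with an error such as 500
    except (urllib.error.URLError, OSError):
        return None  # DNS failure, refused connection, timeout, and similar


def wait_until_healthy(url, fetch=fetch_status, interval=30, max_checks=120, sleep=time.sleep):
    """Poll `url` until it returns HTTP 200.

    Returns the number of checks it took, or None if `max_checks` ran out.
    `fetch` and `sleep` are injectable so the loop can be tested offline.
    """
    for attempt in range(1, max_checks + 1):
        status = fetch(url)
        print(f"check {attempt}: {url} -> {status}")
        if status == 200:
            return attempt
        if attempt < max_checks:
            sleep(interval)
    return None


# Example call (placeholder URL, not a real endpoint):
# wait_until_healthy("https://status-check.example.com/health")
```

Keep the interval polite: a vendor in the middle of an outage does not need extra traffic from your monitoring loop.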
Continue communicating. Use your smartphone, calendar app, or Slack to set a reminder for when to communicate again with your users. Use the above communication format and send an update even if there is nothing new to report. Saying nothing is much worse than saying: As of 12:31pm PDT the vendor continues to troubleshoot the issue. Salesforce is still unreachable but Email, Zoom, and Slack are available for communication. Another update will be provided here in #general at 1:30pm PDT or earlier if we have more information.
Eventually you’ll be able to send out the best message of all – that the incident is over! Before communicating out to your organization, ensure that the service is actually back up. Check with your team, some trusted internal users, and anyone else you can ask. Prematurely closing an incident is a great way to lose trust. Once you are 100% certain, your communication could look like this: By 1:15pm PDT the IT team has been able to confirm reports that Salesforce is back up and running normally. You may now log in again at https://salesforce.com or using the tile in Okta. If you are still experiencing any issues please submit a ticket, and thank you for your patience!
But wait, it isn’t quite over. Once the flurry of tickets has calmed down and your team can take a breath, schedule time to review the incident. I like to conduct what Google calls a “blameless post mortem”, where everyone involved can be honest with themselves and others without fear of judgement. Ask:
Could we have noticed the incident earlier? Did our monitoring tools work as expected?
Were the right people notified in time? Did our alerting and on-call tools work as desired?
Did we properly confirm the scope and breadth of the problem? Could we have done better? Did we triage effectively as a team?
Were our communications timely, empathetic, and detailed enough?
Did our workaround actually help? Did users understand what the plan was, or do we need to write or update documentation?
How long were we impacted? Can we request SLA credits? Do we need to rethink our relationship with this vendor?
Could our Incident Response/Business Continuity plans be updated with any of the lessons we learned above?
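For the question about how long you were impacted, it helps to put a number on it before talking to the vendor about credits. The sketch below uses the start and resolution times from the sample messages (11:13am to 1:15pm); the calendar date, the 30-day month, and the 99.9% target are placeholders I picked for illustration, so check your actual contract for the real measurement period and threshold.

```python
from datetime import datetime


def downtime_minutes(start, end):
    """Length of the outage in minutes."""
    return (end - start).total_seconds() / 60


def monthly_availability(downtime_min, days_in_month=30):
    """Availability for the month as a percentage, given total downtime in minutes."""
    total_minutes = days_in_month * 24 * 60
    return 100 * (total_minutes - downtime_min) / total_minutes


# Times taken from the sample messages; the date itself is a placeholder.
start = datetime(2019, 7, 1, 11, 13)
end = datetime(2019, 7, 1, 13, 15)

SLA_TARGET = 99.9  # hypothetical contractual target, in percent

minutes = downtime_minutes(start, end)
availability = monthly_availability(minutes)
breached = availability < SLA_TARGET

print(f"Downtime: {minutes:.0f} minutes")
print(f"Availability this month: {availability:.2f}%")
print(f"Below the {SLA_TARGET}% target: {breached}")
```

Under these assumptions a two-hour outage is enough to drop a 30-day month below 99.9%, which is why the written confirmation from the vendor is worth asking for.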
Outages are frustrating and stressful, even more so when caused by a third party. A true IT leader will take ownership, communicate the severity, find a workaround, track for closure, communicate the resolution, and then reflect on what could be done better next time. End users are generally quite understanding with larger outages, and handling any incident like this is a fantastic way to gain trust and respect within your organization.
P.S. David Mytton, formerly CEO @ Server Density, has a great article about the causes of outages on his blog.
Questions? Feedback? Send to erin (at) askspoke (dot) com.