Learnings from eight major outages of 2024 and best practices to stay prepared
While we cannot eliminate internet outages, lag, or security breaches, reflecting on the lessons learned from these events helps us cope, innovate, and implement measures to reduce how often they occur. In 2024, website and application outages had a significantly greater impact on the world than in previous years, leaving the IT community with valuable insights to consider. This blog highlights several major incidents, the lessons we have gathered from them, and ways to enhance the resilience of your IT systems to ensure preparedness for 2025.
Outages happen; what can we learn from them?
In 2024, there were 225 major internet disruptions, according to content delivery network Cloudflare. Among them, we have chosen to discuss, in brief, eight major incidents that drew heightened attention in the tech world, sparking discussions about how they happened and what steps companies can take to mitigate them.
But first, the inconvenient truth upfront: outages happen, and their magnitude keeps increasing as the internet grows in coverage and complexity. IT teams must, in turn, get proactive and monitor their IT infrastructure in the face of apparent and sometimes imminent failure, so they can either avert it or reduce the fallout.
In other words, while problems are inevitable, how you choose to prepare and what you do in the heat of trouble will determine how resilient your applications and websites are.
1: Microsoft Teams outage
Timeline: The outage began on January 26th and lasted for over seven hours, with issues continuing sporadically into the following day.
Impact: A service disruption on multiple continents prevented users from logging in to Microsoft Teams, and apps froze.
Cause: While many assumed the fault lay in the app itself, Microsoft attributed it to an external BGP issue that affected global routes.
Learning: Cloud-based communication apps have to stay guarded against infrastructural vulnerabilities and must be comprehensively monitored so that issues are detected early and operations teams can fix them quickly.
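Because the root cause here sat outside the application, in routing, a check run only from inside your own network could miss it entirely. Below is a minimal sketch of an external reachability probe aggregated across several vantage points; the endpoint URL and region list are illustrative assumptions, not details from the incident.

```python
import time
import requests  # third-party HTTP client: pip install requests

# Hypothetical endpoint and probe locations; substitute your own.
TARGET_URL = "https://collab.example.com/health"
PROBE_REGIONS = ["us-east", "eu-west", "ap-south"]

def probe(region: str, url: str, timeout: float = 5.0) -> dict:
    """Issue one synthetic HTTP check and record status and latency."""
    start = time.monotonic()
    try:
        response = requests.get(url, timeout=timeout)
        return {"region": region, "ok": response.status_code == 200,
                "status": response.status_code,
                "latency_s": time.monotonic() - start}
    except requests.RequestException as exc:
        return {"region": region, "ok": False, "status": None,
                "latency_s": time.monotonic() - start, "error": str(exc)}

def run_checks() -> None:
    results = [probe(region, TARGET_URL) for region in PROBE_REGIONS]
    failing = [r for r in results if not r["ok"]]
    # Alert only when several regions fail at once, which points to a
    # routing or infrastructure problem rather than one flaky probe.
    if len(failing) >= 2:
        print(f"ALERT: {len(failing)}/{len(results)} regions failing: {failing}")
    else:
        print("All regions healthy:", results)

if __name__ == "__main__":
    run_checks()
```

In practice each probe would run from an agent actually located in the corresponding region; the loop above only illustrates the aggregation and alerting logic.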
2: Meta outage
Timeline: The outage occurred March 5 and lasted for about two hours.
Impact: Login issues affected Meta platforms, including Facebook, Instagram, Threads, and Quest VR.
Cause: A backend configuration change in the system prevented user access to accounts and platform features.
Learning: Backend authentication systems and platforms should be monitored and checked constantly. App owners can consider moving to decentralized architectures, too.
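Since the failure mode was users being unable to authenticate, a scripted login transaction run on a schedule can surface such problems before user reports do. The sketch below assumes a hypothetical token endpoint and a dedicated synthetic test account; it is not Meta's API.

```python
import sys
import requests  # pip install requests

# Hypothetical authentication endpoint and synthetic test credentials.
AUTH_URL = "https://auth.example.com/oauth/token"
TEST_USER = "synthetic-monitor@example.com"
TEST_PASSWORD = "read-from-a-secret-store-not-source"

def check_login() -> bool:
    """Run one end-to-end login and confirm a usable token comes back."""
    try:
        resp = requests.post(
            AUTH_URL,
            data={"grant_type": "password",
                  "username": TEST_USER,
                  "password": TEST_PASSWORD},
            timeout=10,
        )
    except requests.RequestException as exc:
        print(f"Login probe failed to connect: {exc}")
        return False
    body = {}
    try:
        body = resp.json()
    except ValueError:
        pass  # non-JSON response counts as a failure below
    if resp.status_code != 200 or "access_token" not in body:
        print(f"Login probe failed: HTTP {resp.status_code}")
        return False
    return True

if __name__ == "__main__":
    # Exit non-zero so a scheduler or monitoring agent can raise an alert.
    sys.exit(0 if check_login() else 1)
```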
3: Atlassian Confluence outage
Timeline: On March 26th, a brief disruption occurred that was resolved quickly.
Impact: A service interruption in Atlassian Confluence broke interlinking with Jira.
Cause: According to Atlassian, "The issue was caused by a code change depending on a missing configuration."
Learning: Infrastructure resilience backed by holistic IT infrastructure monitoring is crucial. Secondary communication channels must also be maintained, and users must be updated about restoration activities.
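Given Atlassian's stated cause, a code change depending on a missing configuration, one cheap guard is to verify that every configuration key a new release expects actually exists before the deploy proceeds. A minimal sketch, assuming a hypothetical list of required keys and a JSON config file:

```python
import json
import sys

# Hypothetical keys a new release depends on; in practice this list would
# be derived from the release's declared configuration requirements.
REQUIRED_KEYS = ["jira.base_url", "jira.link_token", "search.index_name"]

def load_config(path: str) -> dict:
    with open(path) as f:
        return json.load(f)

def missing_keys(config: dict, required: list[str]) -> list[str]:
    """Return the dotted keys that are absent from a nested config dict."""
    missing = []
    for dotted in required:
        node = config
        for part in dotted.split("."):
            if not isinstance(node, dict) or part not in node:
                missing.append(dotted)
                break
            node = node[part]
    return missing

if __name__ == "__main__":
    config = load_config(sys.argv[1] if len(sys.argv) > 1 else "config.json")
    absent = missing_keys(config, REQUIRED_KEYS)
    if absent:
        print(f"Blocking deploy: missing configuration keys: {absent}")
        sys.exit(1)
    print("All required configuration keys present.")
```

A check like this belongs in the CI/CD pipeline so the deploy fails fast instead of failing in production.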
4: Google outage
Timeline: On May 1st, Gmail experienced an outage for about an hour.
Impact: Gmail users could not send unsigned messages to Yahoo users.
Cause: Poor inter-provider connectivity between Google and Yahoo disrupted mail delivery between the two services.
Learning: When ecosystems depend on complex inter-service connectivity, even minor issues can disrupt global access. To interconnect services, use standardized, secure, resilient, and reliable protocols and communication methods. Keep redundancies in place to route traffic to minimize impact.
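One way to make the "keep redundancies in place" point concrete: when a downstream provider or route misbehaves, fall back to an alternate instead of failing the request outright. The sketch below is a generic try-primary-then-fallback pattern with hypothetical relay endpoints; it is not specific to Google's or Yahoo's infrastructure.

```python
import requests  # pip install requests

# Hypothetical delivery endpoints, ordered by preference.
RELAYS = [
    "https://relay-primary.example.com/send",
    "https://relay-secondary.example.com/send",
]

def deliver(message: dict) -> str:
    """Try each relay in order and return the one that accepted the message."""
    errors = []
    for relay in RELAYS:
        try:
            resp = requests.post(relay, json=message, timeout=10)
            if resp.status_code == 200:
                return relay
            errors.append(f"{relay}: HTTP {resp.status_code}")
        except requests.RequestException as exc:
            errors.append(f"{relay}: {exc}")
    raise RuntimeError(f"All relays failed: {errors}")

if __name__ == "__main__":
    used = deliver({"to": "user@example.com", "subject": "ping", "body": "test"})
    print(f"Delivered via {used}")
```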
5: CrowdStrike outage
Timeline: On July 19th, there was more than an hour of critical downtime, followed by a long path to recovery.
Impact: The feared blue screen of death (BSOD) error affected more than 8 million Windows systems, causing estimated damages exceeding $10 billion. The fallout was widespread, including canceled flights and disrupted medical services.
Cause: A faulty configuration update to CrowdStrike's Falcon sensor (its managed detection and response agent) crashed Windows systems. Remediation often required physical access to affected machines, with as many as 15 reboots needed to restore normalcy.
Learning: Adopt a robust endpoint management strategy, reassess your technology resilience, and update risk management strategies. Perform regular backups and have alternate systems in place to avoid service blackouts. Read our detailed blog about the CrowdStrike outage here.
6: Cloudflare outage
Timeline: On September 16th, users mostly in the US, Canada, and India were affected by an outage that lasted two hours.
Impact: Disruptions in apps such as Zoom and HubSpot caused service timeouts.
Cause: A connectivity failure in the backend affected apps that relied on Cloudflare's CDN and network services.
Learning: Move to robust data-based outage detection systems, so you can determine the root cause rapidly and perform post-incident analysis to prevent repeat failures.
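"Data-based outage detection" can start as simply as comparing the current error rate against a recent baseline instead of waiting for user complaints. A minimal sketch over a sliding window of request outcomes; the threshold and window size are illustrative assumptions:

```python
from collections import deque

class ErrorRateDetector:
    """Flag a likely outage when the error rate over a sliding window spikes."""

    def __init__(self, window_size: int = 500, threshold: float = 0.05):
        self.window = deque(maxlen=window_size)  # recent request outcomes
        self.threshold = threshold               # e.g., alert above 5% errors

    def record(self, success: bool) -> None:
        self.window.append(success)

    def error_rate(self) -> float:
        if not self.window:
            return 0.0
        return 1.0 - (sum(self.window) / len(self.window))

    def in_trouble(self) -> bool:
        # Require a reasonably full window so a handful of early failures
        # does not trigger a false alarm.
        return len(self.window) >= 100 and self.error_rate() > self.threshold

if __name__ == "__main__":
    detector = ErrorRateDetector()
    # Simulate healthy traffic followed by a burst of failures.
    for _ in range(400):
        detector.record(True)
    for _ in range(50):
        detector.record(False)
    print(f"error rate={detector.error_rate():.2%}, alert={detector.in_trouble()}")
```

Production systems would layer smarter baselining and anomaly detection on top, but the signal being watched is the same.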
7: Microsoft Outlook Online outage
Timeline: On November 25, Microsoft Outlook Online was down for nearly 24 hours.
Impact: Microsoft services like Outlook, Exchange Online, Teams, and SharePoint were affected. HTTP 503 errors were thrown.
Cause: A backend system change triggered a surge of retry requests, causing global service disruptions such as email failures and calendar malfunctions.
Learning: Any complex cloud infrastructure needs a thorough change management strategy, with provisions to mitigate, back up, monitor, and proactively manage risks.
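Since the trigger here was a surge of retry requests, one small but important piece of that risk management is client-side retry behavior: exponential backoff with jitter keeps a transient backend problem from being amplified into a retry storm. A generic sketch, not Microsoft's implementation:

```python
import random
import time

def call_with_backoff(operation, max_attempts: int = 5,
                      base_delay: float = 0.5, max_delay: float = 30.0):
    """Retry a flaky operation with exponential backoff and full jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception as exc:
            if attempt == max_attempts:
                raise  # give up and let the caller handle it
            # Cap the exponential delay, then randomize it ("full jitter")
            # so that many clients do not retry in lockstep.
            delay = random.uniform(0, min(max_delay, base_delay * 2 ** attempt))
            print(f"Attempt {attempt} failed ({exc}); retrying in {delay:.1f}s")
            time.sleep(delay)

if __name__ == "__main__":
    # Simulate an operation that fails twice and then succeeds.
    flaky_results = iter([RuntimeError("503"), RuntimeError("503"), "ok"])

    def flaky_operation():
        result = next(flaky_results)
        if isinstance(result, Exception):
            raise result
        return result

    print(call_with_backoff(flaky_operation))
```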
8: OpenAI outage
Timeline: On December 11th, an outage occurred for about four hours.
Impact: OpenAI's APIs, its video creation application Sora, ChatGPT, and other platform services were affected.
Cause: A configuration change in OpenAI's Kubernetes infrastructure triggered a cascade of system failures, starting with the control plane.
Learning: Understand the complexities of cloud-native environments and architectures and develop fresh playbooks to handle issues. Deploy code or configuration changes in a staggered manner after ensuring failover methods, test your code rigorously, and employ a complete Kubernetes monitoring strategy.
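One way to make "deploy configuration changes in a staggered manner" concrete on Kubernetes is to pause a rollout after the first pods update, verify health, and only then let it continue or roll back. The sketch below shells out to standard kubectl rollout subcommands and assumes cluster access; the deployment name, namespace, container name, and health gate are illustrative assumptions.

```python
import subprocess
import sys
import time

DEPLOYMENT = "deployment/api-server"   # hypothetical workload name
NAMESPACE = "production"               # hypothetical namespace

def kubectl(*args: str) -> subprocess.CompletedProcess:
    return subprocess.run(["kubectl", "-n", NAMESPACE, *args],
                          capture_output=True, text=True)

def canary_healthy() -> bool:
    """Placeholder health gate: plug in your own error-rate or probe check."""
    time.sleep(120)  # let the updated pods take traffic for a while
    return True      # e.g., query your observability platform here

def staged_rollout(image: str) -> None:
    # Start the rollout, then pause it early so only part of the fleet
    # picks up the new image.
    kubectl("set", "image", DEPLOYMENT, f"api-server={image}")
    kubectl("rollout", "pause", DEPLOYMENT)

    if canary_healthy():
        # Let the rollout finish and wait for it to complete.
        kubectl("rollout", "resume", DEPLOYMENT)
        kubectl("rollout", "status", DEPLOYMENT, "--timeout=10m")
        print("Rollout completed.")
    else:
        # Revert to the previous revision if the updated pods look unhealthy.
        kubectl("rollout", "undo", DEPLOYMENT)
        print("Canary unhealthy; rolled back.", file=sys.stderr)

if __name__ == "__main__":
    staged_rollout("registry.example.com/api-server:v2025.01")
```

Teams that need stricter control typically use a dedicated canary Deployment or a progressive delivery controller, but the pause, verify, resume-or-undo flow above captures the idea.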
Best practices
- Outages happen, and they're happening more often. Assume the worst is inevitable and build concrete, validated steps to minimize the impact, such as failover automation, built-in redundancy, and configuration rollbacks (see the sketch after this list).
- When troubleshooting to restore access, look beyond your initial assumptions and surface-level guesses, and drill down to the root cause through structured investigations backed by proven problem-solving methodologies.
- Build resilience like a muscle: involve all stakeholders across your SRE and DevOps teams in a regimen that steadily strengthens your internet resilience.
- Draft and implement failure response playbooks, and continuously train your key resources by conducting drills and mock activities to think holistically and stay prepared when incidents strike.
- Employ a well-oiled status communication strategy and ensure a detailed, independently hosted status page is available in multiple languages for all your users to reassure them and keep them updated when your assets are down.
- Conduct periodic reviews, inventory your IT tools, eliminate outdated or redundant subscriptions, and streamline your workflows to make your organization more resilient by reducing the chances of things going wrong.
- Maintain security monitoring practices that strengthen your websites and safeguard your end-user data, especially during incidents.
- Document your responses throughout the recovery journey, and educate your personnel by encouraging open and continued conversation about outages, so that teams can trigger discussions and cross-pollinate ideas.
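As referenced in the first practice above, failover automation and configuration rollback can be wired to the same health signal. A minimal sketch, with hypothetical primary/standby endpoints and a placeholder config-rollback function; real implementations would usually lean on a load balancer or service mesh rather than application code:

```python
import requests  # pip install requests

# Hypothetical endpoints and last-known-good configuration version.
PRIMARY = "https://primary.example.com/health"
STANDBY = "https://standby.example.com/health"
LAST_KNOWN_GOOD_CONFIG = "config-v41"

def healthy(url: str) -> bool:
    try:
        return requests.get(url, timeout=5).status_code == 200
    except requests.RequestException:
        return False

def apply_config(version: str) -> None:
    # Placeholder: in practice this calls your configuration management
    # or deployment tooling to revert the most recent change.
    print(f"Applying configuration {version}")

def failover_check() -> str:
    """Decide where traffic should go and whether to roll back configuration."""
    if healthy(PRIMARY):
        return "primary"
    if healthy(STANDBY):
        # Primary is down: shift traffic and revert the latest configuration
        # change in case it caused the failure.
        apply_config(LAST_KNOWN_GOOD_CONFIG)
        return "standby"
    raise RuntimeError("Both primary and standby are unhealthy; page on-call.")

if __name__ == "__main__":
    print("Routing traffic to:", failover_check())
```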
Prioritize observability-backed IT resilience in 2025 and beyond
Murphy’s Law, "anything that can go wrong will go wrong," should not deter us from the path of continuous improvement, where we do everything we can to minimize things going wrong. We could perhaps choose another adage with an insightful warning: "an unwatched pot always boils over." When we fail to pay close attention to a situation or problem, it can easily escalate and become much bigger than it would have been had we been monitoring it actively.
Studying the global tech outages of 2024 helps us realize the vulnerability of modern digital infrastructure and the pressing need for proactive, comprehensive IT observability. Only AI-powered IT observability gives you the ability to understand, identify, and give early warnings to thwart issues before they escalate into full-blown outages. By embracing these best practices, SREs and DevOps teams can build a resilient digital presence that ensures business continuity, safeguards reputation, and enables sustained success. Ensure your defense plan includes a proactive observability platform to guard against these disruptions and strengthen your organization's resilience.