12 Cloud Outages With Catastrophic Effects
No infrastructure is always on and immune to every threat. Even the top providers leave a margin in their Service Level Agreements (SLAs), promising at most 99.999% uptime. The cloud, advertised as the universal cure for the problems of legacy on-premises setups, has also turned out to be vulnerable.
The most obvious and impactful manifestations of this vulnerability are cloud outages. Let’s look at twelve devastating cloud outages to learn about their causes, their aftermath, and valuable insights for your organization.
#1 PlayStation Network: Over 3-Week Rebuild of Security Infrastructure (April 20, 2011)
What happened? Hackers exploited a vulnerability in Sony’s servers through a sophisticated cyberattack, gaining access to the personal information of millions of PlayStation Network (PSN) users. As a result, Sony had to rebuild its security infrastructure from scratch, shutting down the PSN cloud for over 3 weeks.
Impact: PSN was unavailable to 77 million users for 24 days, until May 14, and they couldn’t access their purchased digital games throughout this entire period. For Sony, it was not only a $171 million downtime cost. It also meant serious legal trouble, including inquiries from the U.S. Congress and investigations by data protection agencies in several countries.
Cause: Coordinated cyberattack exploiting a vulnerability.
Key takeaway: No infrastructure is perfect, and there’s always a vulnerability waiting to be exploited. Your service provider had better actually maintain usable data backups, so it doesn’t have to rebuild everything from scratch.
#2 BlackBerry: High-Profile Email and Messaging Outage (October 10, 2011)
What happened? First, the core network switch failed. Then, the backup switch failed to activate. This resulted in a massive backlog of data traveling through the BlackBerry private cloud. The backlog overwhelmed the rest of the global BlackBerry infrastructure, resulting in a “self-inflicted” Distributed Denial of Service (DDoS) attack.
Impact: The features that made BlackBerry stand out at the time, its business email service and the BBM app, stopped working for 3 days. The outage particularly affected high-profile users, including world leaders and CEOs.
Cause: Hardware (core network switch) failure.
Key takeaway: Under the (cloud) hood, there is always hardware. The fact that you don’t have to maintain it doesn’t mean it’s maintenance-free. While contemporary clouds offer greater redundancy, a hardware failure is always a possibility.
#3 Amazon Web Services: A Typo That Switched Off Half of the Internet (February 28, 2017)
What happened? An authorized AWS engineer mistyped a command intended to remove a small number of S3 servers. Instead, it removed a much larger set of servers supporting two critical S3 subsystems. At the scale of the S3 infrastructure, the restart needed to fix the downtime took hours, not minutes.
Impact: Major services relying on AWS, including Quora, Trello, Giphy, and Slack, experienced total outages or broken functionality for about 5 hours. Plenty of smart devices (e.g. smart pet feeders) felt the impact, too. According to estimates, S&P 500 companies lost a total of $150 million due to the downtime, and financial services companies are estimated to have lost an additional $160 million.
Cause: Human error (mistyped command).
Key takeaway: You don’t need a hacker or a natural disaster to bring the cloud down. It seems a single typo is enough to switch off half of the internet. So, it’s always a good idea to have a backup cloud to migrate your data or app to in the event of a disaster.
#4 Google Cloud: Automation Went Wrong (June 2, 2019)
What happened? Instead of shutting down a few local jobs, Google’s automation software concluded that it needed to disable network control jobs, which are responsible for data routing, in many cloud regions. With more and more jobs shutting down, Google’s network capacity shrank drastically and was crushed by the ongoing traffic.
Impact: The outage affected millions of users of Google apps (e.g. Gmail, Calendar). Third-party apps running on Google Cloud Platform (GCP), like Snapchat, Vimeo, or Shopify, were affected, too. The SaaS apps either went dark or operated extremely slowly for about 5 hours.
Cause: Automation software error.
Key takeaway: Automated solutions are nice because they operate in bulk, save lots of time, and… are reliable. Usually.
#5 OVHcloud: The Strasbourg Datacenter Fire (March 10, 2021)
What happened? Just before 1 a.m., a fire started in an Uninterruptible Power Supply (UPS) unit. It spread lightning-fast due to an innovative “free-air” cooling system designed for energy efficiency; apparently, the design acted like a chimney, funneling oxygen to the fire. The datacenter buildings didn’t have an automated fire suppression system, which also didn’t help.
Impact: The fire completely destroyed about 30,000 servers and one of the datacenter buildings. 3.6 million websites went offline, including French government portals and banks. What’s more, some websites, services, and games (the survival game Rust being the most famous case) with no offsite backup lost their data completely.
Cause: A failure in a UPS unit.
Key takeaway: If you’re not yet familiar with the 3-2-1 backup rule and backup software that supports copy replication, be sure to read our article and learn how to reliably protect data and avoid devastating, irreversible data loss.
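In a nutshell, the 3-2-1 rule means keeping at least 3 copies of your data, on 2 different storage media, with 1 copy offsite. Here’s a minimal Python sketch of the idea; the paths, the bucket name, and the use of boto3 are illustrative assumptions, not a reference to any particular product:

```python
import shutil
from pathlib import Path

import boto3  # pip install boto3; any S3-compatible offsite target works

# Placeholder locations -- adjust to your environment.
PRIMARY = Path("/data/backups/app-2024-05-02.tar.gz")  # copy 1: the production backup
SECOND_MEDIUM = Path("/mnt/nas/backups")               # copy 2: a different local medium
OFFSITE_BUCKET = "example-offsite-backups"             # copy 3: offsite object storage

def three_two_one(archive: Path) -> None:
    """Keep 3 copies of the data, on 2 different media, with 1 copy offsite."""
    # Copy 2: replicate to an independent local medium (e.g. a NAS mount).
    shutil.copy2(archive, SECOND_MEDIUM / archive.name)

    # Copy 3: ship one copy offsite, here to an S3-compatible bucket.
    s3 = boto3.client("s3")  # credentials are read from the environment
    s3.upload_file(str(archive), OFFSITE_BUCKET, archive.name)

three_two_one(PRIMARY)
```

Had OVHcloud customers kept even one copy following this pattern, the Strasbourg fire would have been an outage, not a data loss event.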
#6 Fastly: A Catastrophic Trigger of a Dormant Bug (June 8, 2021)
What happened? On May 12, Fastly deployed a software update with an undiscovered bug. On the morning of June 8, a Fastly customer pushed a configuration update that met the conditions to “wake up” the dormant bug. The bug then crashed the edge servers responsible for delivering website and web app content to end users. Thanks to Fastly’s high-performance network, the faulty configuration spread globally in an instant.
Impact: As a global Content Delivery Network (CDN) provider, Fastly serves thousands of websites to end users. Due to the outage, millions of users of Amazon, Spotify, Stack Overflow, CNN, BBC, and other services worldwide were greeted with the same message: Error 503 Service Unavailable. Though the outage lasted just 1 hour, it carried a huge economic cost for affected e-commerce giants like Amazon.
Cause: Undiscovered buggy update triggered by a set of specific conditions.
Key takeaway: While the market seems crowded with competitors, the outage revealed that vendor diversity can sometimes be an illusion: thousands of seemingly independent websites turned out to depend on the very same CDN layer.
#7 Meta: Communication Impact for Half of the World (October 4, 2021)
What happened? During a routine infrastructure audit, a single command disconnected all of Meta’s datacenters from the Internet. This in turn broke internal tools, disrupting the daily work of Meta staff. Interestingly, the outage also made remote access to the datacenters problematic for Meta’s engineers, who eventually had to visit the datacenter buildings in person to reset the configuration.
Impact: 3.5 billion users around the world couldn’t use their primary means of communication: Facebook, Instagram, WhatsApp, and Messenger. What’s more, in some areas (e.g. Brazil, India) corporate users heavily relied on Meta’s messaging solutions. The outage was also a loss for Meta itself: its stock dropped nearly 5% in a single day, and an estimated $60 million was lost in missed advertising opportunities.
Cause: A routine maintenance command combined with a bug in the audit tool.
Key takeaway: Monolithic design can be dangerous in complex cloud infrastructure. Emergency tools should not run on the same network as publicly available apps.
#8 Atlassian: 2-Week Manual Jira Restore Effort (April 4, 2022)
What happened? Engineers wanted to deactivate a legacy app in Jira by running a script fed with that app’s instance IDs. Due to miscommunication between two teams, the team running the script received the IDs of organizations that used the app instead of the app instance IDs. Running in “permanent delete” mode, the script wiped all data for the affected customers.
Impact: The outage affected 775 organizations, completely disrupting their operations for 2 weeks. That is how long it took the Atlassian team to manually restore the deleted data from full native copies through careful extraction.
Cause: Script with wrong IDs deleting organizations’ data instead of legacy app instances.
Key takeaway: Relying solely on a cloud provider’s native backups can have severe consequences. That’s why choosing a tool that supports backup replication and truly granular restore is crucial.
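The incident is also a classic argument for putting guardrails on destructive scripts. Below is a hypothetical Python sketch of two such guardrails: validating that IDs have the expected shape, and defaulting to a dry run so permanent deletion requires an explicit opt-in. The ID formats and function names are invented for illustration and don’t represent Atlassian’s actual tooling:

```python
import re

# Hypothetical ID format for app instances (invented for this example).
APP_INSTANCE_ID = re.compile(r"^app-[0-9a-f]{12}$")

def delete_app_instances(ids: list[str], *, dry_run: bool = True) -> None:
    """Delete app instances, refusing any ID that doesn't match the expected shape."""
    # Guard 1: fail fast if the batch contains anything but app instance IDs.
    bad = [i for i in ids if not APP_INSTANCE_ID.match(i)]
    if bad:
        raise ValueError(f"Refusing to run: unexpected ID format(s): {bad}")

    # Guard 2: default to a dry run; real deletion requires an explicit opt-in.
    for i in ids:
        if dry_run:
            print(f"[dry-run] would delete app instance {i}")
        else:
            print(f"deleting app instance {i}")  # the real API call would go here

# A mixed batch like the one in the incident fails fast instead of wiping data:
try:
    delete_app_instances(["app-0123456789ab", "org-550e8400"])
except ValueError as err:
    print(err)
```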
#9 Google, Oracle, and Microsoft: The Record-Breaking Heatwave (July 19, 2022)
What happened? On July 19, the temperature in London hit 40°C/104°F for the first time ever. The datacenter cooling units designed for the UK’s historically mild climate reached their “design limit” and failed. To prevent server overheating and catching fire, providers had to perform emergency power-downs.
Impact: The effects were felt globally, in particular by users of Google services (e.g. Gmail), by government and financial institutions relying on Oracle’s networking and storage, and by organizations running on Azure. The case proved that no region is inherently safe for maintaining datacenters.
Cause: A record-breaking heatwave in the United Kingdom that “killed” datacenter cooling systems.
Key takeaway: The cloud is not an ethereal thing in the sky. It runs on real servers in real datacenters, and it’s susceptible to real-world weather and natural disasters.
#10 Google Cloud: “UniSuper” Annihilation (May 2, 2024)
What happened? While configuring a Google Cloud virtual setup for UniSuper, Google engineers accidentally left a single parameter blank. The blank parameter was then automatically defaulted to a fixed 1-year term. After one year had passed, the system, without any warning, initiated an automated purge of absolutely all UniSuper data, including backups stored in a completely different GCP region.
Impact: Due to the deletion of UniSuper’s cloud subscription, 647,000 members of the Australian pension fund lost access to their accounts, and the company lost its entire operational backbone. Fortunately, UniSuper discovered it had an extra backup with an independent service provider outside of Google Cloud. Even so, it took a total of 2 weeks to rebuild everything.
Cause: Automated accidental deletion across regions.
Key takeaway: Keeping production and backup data with the same provider, even one with robust geo-redundancy, might not be enough. To protect your organization against all odds, set up backup replication across multiple storage providers (e.g. Google, an on-premises filesystem, etc.), as in the sketch below.
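Here’s a minimal Python sketch that fans one backup archive out to two independent targets: a Google Cloud Storage bucket and an on-premises filesystem. The bucket name and paths are placeholders, and the google-cloud-storage client is just one possible choice:

```python
import shutil
from pathlib import Path

from google.cloud import storage  # pip install google-cloud-storage

def replicate(archive: Path) -> None:
    """Fan one backup out to independent targets, not just two regions of one cloud."""
    # Target 1: a Google Cloud Storage bucket (placeholder name).
    client = storage.Client()
    client.bucket("example-backups").blob(archive.name).upload_from_filename(str(archive))

    # Target 2: an on-premises filesystem, fully outside any cloud vendor.
    shutil.copy2(archive, Path("/srv/onprem-backups") / archive.name)

replicate(Path("/data/backups/prod-2024-05-02.tar.gz"))
```

The point of the second target is that no automated purge on the provider’s side, however thorough, can reach it.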
#11 Microsoft Azure: Mistiming with Severe Consequences (January 8, 2025)
What happened? During a feature-enablement session, engineers were supposed to restart networking services one after another. Instead, they restarted the services in parallel, which broke the quorum (agreement) between servers. The result was broken routing and communication: virtual machines running services (e.g. Azure App Service, SQL Managed Instances, Azure OpenAI) for Azure corporate customers failed and were “orphaned” from the network.
Impact: While the outage affected a single region, it was felt by organizations around the world, and it lasted about 2.5 days in total. The long recovery was due to the “inconsistent state” of the affected resources, VMs in particular: even after connectivity was restored, they had to be manually re-synced or restored from backups.
Cause: Engineering error (incorrect service restart procedure).
Key takeaway: A mere mistiming in the maintenance of a complex cloud infrastructure may have far-reaching consequences for business continuity around the world.
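For illustration, here’s a minimal Python sketch of the safer pattern: a rolling restart that gates each step on a health check instead of restarting everything in parallel. The restart and health helpers are hypothetical stand-ins for real orchestration tooling:

```python
import time

# Hypothetical helpers -- stand-ins for your orchestration tooling.
def restart(node: str) -> None:
    print(f"restarting {node}")

def healthy(node: str) -> bool:
    return True  # in reality: poll the node's health endpoint

def rolling_restart(nodes: list[str], timeout_s: int = 300) -> None:
    """Restart nodes one at a time, proceeding only once the previous one is back."""
    for node in nodes:
        restart(node)
        deadline = time.time() + timeout_s
        while not healthy(node):  # gate on health before touching the next node
            if time.time() > deadline:
                raise RuntimeError(f"{node} failed to come back; aborting the rollout")
            time.sleep(5)

rolling_restart(["net-ctrl-1", "net-ctrl-2", "net-ctrl-3"])
```

Because only one node is ever down at a time, quorum among the remaining nodes is preserved throughout the rollout.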
#12 OpenAI: The “Sora” Capacity Strain (September 30, 2025)
What happened? On September 30, OpenAI launched the improved second version of Sora, its AI-based video generation service. Due to increased hardware requirements and viral popularity among users, Sora’s cloud infrastructure became overloaded. As a result, after a user entered a prompt, video generation could take 12 to 24 hours, and sometimes the video wouldn’t be generated at all.
Impact: The disruption affected users for 3 weeks. After that time, OpenAI decided to discontinue the Sora consumer app, officially due to the high cost of GPU processing and energy constraints. Creative studios that had invested in the Sora cloud (APIs, training, pipelines) had to switch to competing products.
Cause: Strained cloud infrastructure, with demand surpassing capacity.
Key takeaway: Cloud capacity is not infinite. With enough demand, or a coordinated DDoS attack, even massive, distributed infrastructures may fail.
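On the client side, the standard way to survive an overloaded service is exponential backoff with jitter. Here’s a minimal Python sketch; TimeoutError stands in for whatever throttling or overload error your service actually raises:

```python
import random
import time

def call_with_backoff(request, max_attempts: int = 5):
    """Retry an overloaded service with exponential backoff and jitter."""
    for attempt in range(max_attempts):
        try:
            return request()
        except TimeoutError:  # stand-in for a 429/503-style "overloaded" error
            if attempt == max_attempts - 1:
                raise
            # Wait 1s, 2s, 4s, ... plus jitter so clients don't retry in lockstep.
            time.sleep(2 ** attempt + random.random())

# Example: a flaky call that succeeds on the third attempt.
attempts = {"n": 0}
def flaky():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise TimeoutError("service overloaded")
    return "ok"

print(call_with_backoff(flaky))  # sleeps ~1s, then ~2s, then prints "ok"
```

The jitter matters: if thousands of clients retry on the same schedule, the retries themselves become a second wave of overload.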
💡 Want to read more cloud outage stories and learn how to protect your organization from disastrous impacts? Be sure to check out our article on building true data sovereignty.
Lessons Learned from the Cloud Outages
What can we learn from all these stories?
- Similar to self-hosted infrastructure, the cloud is not immune to threats like cyberattacks, human error, automation errors, hardware failures, natural disasters, etc.
- The cloud has its limits and uses mechanisms (e.g. throttling) to avoid reaching them.
- You don’t back up production data just by keeping it in the cloud, and using a different location or region of the same cloud provider is not enough to ensure successful data recovery either. The only proven approach is to implement professional backup software that allows you to keep data in a secure way (to protect against ransomware), replicate copies to independent locations, and swiftly restore your assets, including to a competing cloud vendor or an on-premises ecosystem if need be. A quick way to build confidence in those copies is to verify them routinely, as in the sketch below.
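For instance, a scheduled job can checksum each replica against the primary copy, so silent corruption or ransomware tampering is caught before you actually need the backup. A minimal Python sketch, with placeholder paths:

```python
import hashlib
from pathlib import Path

def sha256(path: Path) -> str:
    """Hash a file in chunks so large archives don't need to fit in memory."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def verify_replica(primary: Path, replica: Path) -> None:
    """A backup you never verify is a backup you merely hope you have."""
    if sha256(primary) != sha256(replica):
        raise RuntimeError(f"replica {replica} does not match {primary}")

# Placeholder paths -- point these at a real primary/replica pair.
verify_replica(Path("/data/backups/prod.tar.gz"), Path("/srv/onprem-backups/prod.tar.gz"))
```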



