So Why Did Everything Break?
If you were online yesterday and your favorite sites suddenly stopped loading, you were not alone. Huge services, from social platforms to AI tools and even McDonald's order screens, glitched out. The common link behind a lot of that chaos was Cloudflare, one of the biggest infrastructure companies on the internet.
Cloudflare sits in the background for millions of websites. It speeds up traffic, protects sites from attacks, and keeps things running smoothly. When Cloudflare has a bad day, the whole internet feels it.
The company has now published a detailed breakdown of what happened. The short version is surprisingly simple and a bit painful. This was not a hacker, not a massive DDoS attack, and not some mysterious nation-state operation. It was Cloudflare accidentally taking itself down.
Cloudflare CEO Matthew Prince opened his post with a very clear statement in bold text. The issue was not caused directly or indirectly by a cyber attack or any kind of malicious activity. No one broke in. The system tripped over its own feet.
He also delivered a straightforward apology. Prince said Cloudflare was sorry for the impact on customers and the internet in general and called any outage of their systems unacceptable. For a company that powers such a huge part of global traffic, even a few hours of broken routing is a big deal, and he admitted that the team knows they let users down.
How One File Crashed A Giant Network
So how does a company with data centers all over the world end up taking itself offline because of one mistake?
Cloudflare initially thought they might be under attack. The symptoms looked like what you would expect from a huge denial of service event. HTTP 5xx errors started spiking around 11:20 UTC, which meant servers were up but could not handle normal requests correctly.
After digging into the logs, they found the real cause. It started with a change to one of their database systems, specifically to its permissions. That change caused the database to write duplicate entries into a special feature file. This file is used by Cloudflare Bot Management, the system that decides whether incoming traffic looks like a human or a bot.
Because of that change, the feature file effectively doubled in size. On its own, that might not sound scary. The problem is that this larger-than-expected file was then pushed out across Cloudflare's entire global network.
Here is where a hidden limitation came back to bite them. The bot management software had a hard-coded limit on how big that feature file was allowed to be. The new oversized file blew straight past that limit.
Once that happened, the software that was supposed to read the file simply failed. When that system broke, it took down machines responsible for handling and routing huge amounts of internet traffic. Errors piled up, and users around the world started seeing sites time out or throw error messages.
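To make that failure mode concrete, here is a minimal Rust sketch of how a hard-coded limit plus a fail-hard error path can turn an oversized config file into a crashed process. This is not Cloudflare's actual code; the names, the limit value and the file format are invented for illustration.

```rust
// Minimal sketch of the failure mode described above, NOT Cloudflare's
// actual code. Struct names, the limit value and the file format are
// made up for illustration.

const MAX_FEATURES: usize = 200; // hypothetical hard-coded ceiling

struct BotFeatures {
    names: Vec<String>,
}

fn load_features(raw: &str) -> Result<BotFeatures, String> {
    let names: Vec<String> = raw
        .lines()
        .filter(|l| !l.trim().is_empty())
        .map(|l| l.trim().to_string())
        .collect();

    // The hidden limitation: a file that grows past the hard-coded cap
    // is treated as an unrecoverable error instead of being truncated
    // or rejected gracefully.
    if names.len() > MAX_FEATURES {
        return Err(format!(
            "feature file has {} entries, limit is {}",
            names.len(),
            MAX_FEATURES
        ));
    }
    Ok(BotFeatures { names })
}

fn main() {
    // Imagine the file doubling in size after the database change:
    // duplicate entries push it past the cap.
    let oversized: String = (0..2 * MAX_FEATURES)
        .map(|i| format!("feature_{i}\n"))
        .collect();

    // Treating the error as fatal is what turns a bad config file into
    // a crashed process that can no longer serve traffic.
    let _features = load_features(&oversized).unwrap();
}
```

The point of the sketch is that the limit itself is not the disaster; it is that exceeding the limit is handled as a fatal error on machines sitting in the critical path for live traffic.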
Cloudflare says they isolated the issue, stopped the rollout of the problem file and got core traffic mostly back to normal by around 14:30 UTC. By 17:06 UTC, they considered all of their systems fully operational again. That still meant several hours where large parts of the online world were flaky or offline.
According to the company, this was their worst outage since 2019. Considering how much more of the modern internet now relies on Cloudflare compared to a few years ago, the blast radius felt even bigger this time.
What Cloudflare Is Changing And Why It Matters
After an incident like this, everyone wants to know what will be done to avoid a repeat. In his post, Prince outlines a few key fixes that Cloudflare is already rolling out.
- More global kill switches for features, so they can immediately shut down a misbehaving system across their network without waiting for it to break more things (a rough sketch of this pattern follows the list).
- Better controls around how error reports and core dumps are handled so that debugging data cannot overwhelm servers and make a bad situation worse.
- Tighter checks on configuration changes before they hit production systems so one permission tweak cannot cascade into a network wide failure.
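As a rough illustration of the kill switch idea mentioned in the first item, and again not Cloudflare's implementation, the pattern is to gate a subsystem behind a flag that operators can flip network-wide, so a misbehaving module degrades gracefully instead of failing every request it touches. The flag name and scoring logic below are invented.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

// Hypothetical kill switch: a flag an operator can flip globally
// (for example via a replicated control plane) to disable a subsystem.
static BOT_MANAGEMENT_ENABLED: AtomicBool = AtomicBool::new(true);

fn score_request(path: &str) -> Option<u8> {
    // If the kill switch is off, skip bot scoring entirely instead of
    // letting a broken module fail the whole request.
    if !BOT_MANAGEMENT_ENABLED.load(Ordering::Relaxed) {
        return None;
    }
    // Placeholder for real scoring logic.
    Some(if path.contains("login") { 30 } else { 80 })
}

fn main() {
    println!("{:?}", score_request("/login")); // Some(30)

    // Operator flips the switch after spotting the bad feature file.
    BOT_MANAGEMENT_ENABLED.store(false, Ordering::Relaxed);
    println!("{:?}", score_request("/login")); // None: fail open, keep serving
}
```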
The bigger takeaway is about how fragile the modern internet can be when so much depends on a few infrastructure companies. When a provider like Cloudflare sneezes, thousands of services catch a cold instantly. Games, streaming platforms, AI tools, online stores and even physical devices in restaurants or shops can all be affected because they quietly rely on the same backend pipes.
Prince ends his post by calling the outage unacceptable and saying that past failures have always pushed Cloudflare to build more resilient systems. That is the ongoing reality of running internet scale services. You are never really finished hardening the system. You just move from one lesson to the next.
If you enjoy peeking behind the curtain of how the internet really works, the full Cloudflare write-up is worth a read. It breaks down every stage of the incident in forensic detail and explains how their routing, security and bot filtering layers fit together.
For everyone else, the main story is simple. A small internal mistake triggered a chain reaction. A hidden limit in critical software caused a widespread failure. The team owned the error in public, apologized and is now trying to patch the holes. The next time half your favorite sites suddenly stop loading, there is a decent chance somewhere deep in the stack a similar story is unfolding.
Original article and image: https://www.pcgamer.com/hardware/cloudflare-apologises-for-the-pain-we-caused-the-internet-and-admits-a-file-size-error-brought-down-large-parts-of-the-web-yesterday-not-a-malicious-cyberattack/
