A blog post published Tuesday night by Cloudflare co-founder and CEO Matthew Prince details what caused the company’s “worst outage since 2019,” pinning the problem on the Bot Management system that is supposed to control which automated crawlers are allowed to scan particular websites using its CDN.
Cloudflare said last year that about 20 percent of the web runs through its network, which is supposed to share the load to keep websites online in the face of traffic spikes and DDoS attacks. But today’s crash disconnected many of those sites, knocking out everything from X to ChatGPT to the well-known outage tracker Downdetector for several hours, in a failure resembling recent outages caused by problems with Microsoft Azure and Amazon Web Services.
Cloudflare’s bot controls are supposed to help deal with problems like crawlers scraping information to train generative AI. The company also recently announced a system that uses generative AI to build the “AI Labyrinth, a new mitigation approach that uses AI-generated content to slow down, confuse, and waste the resources of AI Crawlers and other bots that don’t respect ‘no crawl’ directives.”
However, it says today’s problems were caused by changes to a database’s permissions system, not the generative AI tech, not DNS, and not what Cloudflare initially suspected: a cyberattack or malicious activity like a “hyper-scale DDoS attack.”
According to Prince, the machine learning model behind Bot Management, which generates bot scores for the requests traveling over Cloudflare’s network, relies on a frequently updated configuration file to help identify automated requests. However, “A change in our underlying ClickHouse query behaviour that generates this file caused it to have a large number of duplicate ‘feature’ rows.”
There’s more detail in the post about what happened next, but in short, the query change caused the ClickHouse database to return duplicate rows. The configuration file rapidly grew past a preset memory limit, taking down “the core proxy system that handles traffic processing for our customers, for any traffic that depended on the bots module.”
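Cloudflare hasn’t published the proxy code, but the failure pattern is a familiar one: a generated file roughly doubles in size, and a consumer with a hard, preset limit refuses to handle it. Here is a minimal Python sketch of that pattern; the function name, file format, and 200-row cap are all hypothetical, not Cloudflare’s actual implementation.

```python
# Hypothetical sketch of a config loader with a hard, preset limit.
# The 200-row cap, file format, and names are illustrative, not Cloudflare's code.

MAX_FEATURES = 200  # preallocated capacity the loader assumes will never be hit

def load_bot_features(path: str) -> list[str]:
    with open(path) as f:
        rows = [line.strip() for line in f if line.strip()]
    # A duplicated query result roughly doubles the row count, so a file that
    # normally fits comfortably suddenly blows past the limit...
    if len(rows) > MAX_FEATURES:
        # ...and instead of degrading gracefully, the loader fails hard; in the
        # real incident, that failure propagated into the core proxy.
        raise RuntimeError(f"feature file has {len(rows)} rows, limit is {MAX_FEATURES}")
    return rows
```

In Cloudflare’s case the limit was a memory preallocation rather than a row count, but the shape is the same: an internally generated input exceeded an assumption its consumer treated as guaranteed.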
As a result, customers whose rules used Cloudflare’s bot scores to block certain bots got false positives that cut off real traffic, while customers who didn’t use the generated bot score in their rules remained online.
For now, the post lists four specific plans to keep this kind of problem from happening again (the first of which is sketched below the list), even as the growing centralization of internet services may make outages like this inevitable:
- Hardening ingestion of Cloudflare-generated configuration files in the same way we would for user-generated input
- Enabling more global kill switches for features
- Eliminating the ability for core dumps or other error reports to overwhelm system resources
- Reviewing failure modes for error conditions across all core proxy modules
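The first item in particular amounts to treating internally generated files with the same suspicion as user input. A minimal sketch of what that could look like, again with hypothetical names and limits rather than Cloudflare’s actual code:

```python
# Illustrative only: validate a generated config the way you'd validate user
# input, and keep serving the last-known-good version when validation fails.

MAX_FEATURES = 200  # hypothetical cap, mirroring the sketch above

def ingest_features(rows: list[str], last_good: list[str]) -> list[str]:
    deduped = list(dict.fromkeys(rows))  # drop duplicate feature rows, keep order
    if not deduped or len(deduped) > MAX_FEATURES:
        # Reject the bad file and keep running on the previous configuration
        # rather than letting one malformed artifact take the proxy down.
        return last_good
    return deduped
```

The key design choice is falling back to a last-known-good configuration instead of failing hard, so a single malformed artifact can’t take the traffic path down with it.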