Critical Bug in quiche's Port of Linux CUBIC Congestion Control Permanently Stalls QUIC Connections – One-Line Fix Deployed
Cloudflare found a critical bug in the CUBIC congestion controller of its QUIC library quiche, ported from Linux, that permanently stalls QUIC connections after congestion collapse. A one-line fix restores recovery.
Urgent: Widespread Internet Impact Averted
Cloudflare engineers have discovered and fixed a critical bug in the CUBIC congestion controller of quiche, Cloudflare's open-source QUIC implementation. CUBIC is the default congestion control algorithm in Linux, and quiche's version follows the kernel's. The flaw could permanently stall QUIC connections after a congestion collapse: the congestion window (cwnd) locked at its minimum value and never grew again, effectively halting data transfer. A one-line code change resolved the issue.

The Symptom: Tests Failing 61% of the Time
The investigation began when Cloudflare's ingress proxy integration tests started failing erratically, in 61% of runs. In scenarios with heavy packet loss early in a connection, CUBIC failed to recover from congestion collapse. "Recovery after congestion collapse is exactly the regime a congestion controller exists to handle," said a Cloudflare networking engineer. "Yet most tests skip this corner case."
The bug was invisible in standard throughput tests but surfaced in real-world traffic patterns where connections experience early loss.
Background: CUBIC's Role and the Linux Kernel Change
CUBIC, standardized in RFC 9438, is the default congestion control algorithm in Linux. It governs how TCP and QUIC connections probe for bandwidth and respond to loss. Cloudflare's quiche uses CUBIC as its default, placing this code in the critical path for a significant share of internet traffic.
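For readers unfamiliar with the algorithm, the window growth curve defined in RFC 9438 can be sketched in a few lines. This is a minimal illustration, not quiche's or the kernel's code: the function names are made up for this example, windows are counted in packets, and only the constants C = 0.4 and beta = 0.7 and the cubic formula itself come from the RFC.

```python
# Minimal sketch of the CUBIC window curve from RFC 9438.
# Function names are illustrative; windows are in packets, time in seconds.

C = 0.4      # CUBIC scaling constant (RFC 9438)
BETA = 0.7   # multiplicative decrease factor (RFC 9438)


def cubic_k(w_max, cwnd_epoch):
    """Seconds the curve needs to climb from cwnd_epoch back to w_max."""
    return (max(w_max - cwnd_epoch, 0.0) / C) ** (1.0 / 3.0)


def w_cubic(t, w_max, cwnd_epoch):
    """Target window t seconds into the current congestion avoidance epoch."""
    k = cubic_k(w_max, cwnd_epoch)
    return C * (t - k) ** 3 + w_max


# After a loss at w_max = 100 packets, the epoch starts at BETA * w_max.
w_max = 100.0
cwnd_epoch = BETA * w_max
k = cubic_k(w_max, cwnd_epoch)

start = w_cubic(0.0, w_max, cwnd_epoch)      # equals cwnd_epoch
plateau = w_cubic(k, w_max, cwnd_epoch)      # equals w_max
probing = w_cubic(2 * k, w_max, cwnd_epoch)  # above w_max: probing for bandwidth
```

The curve starts at the reduced window, flattens as it approaches the previous maximum, then accelerates past it. Everything depends on t, the time elapsed in the current epoch, which is why errors in how t is tracked matter so much.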
The bug originated in a Linux kernel change meant to align CUBIC with the app-limited exclusion in RFC 9438 §4.2-12. That change addressed a real TCP problem, but when ported to quiche it triggered unexpected behavior: once a congestion collapse had reduced cwnd to its minimum, the algorithm never increased it again. "The fix was well-intentioned, but it exposed a subtle interaction in the QUIC state machine," explained a Cloudflare developer.
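To see why an app-limited exclusion is worth having at all, consider what happens to the CUBIC curve when idle time is counted as elapsed time. The sketch below is an illustration under assumptions, not the kernel change or quiche's port: `elapsed_for_cubic` and its `excluded` parameter are hypothetical names, and the scenario (2 seconds of sending followed by 30 seconds of silence) is invented for the example.

```python
# Illustrative only: why time spent app-limited or idle should not count
# toward CUBIC's elapsed time t. Names and the scenario are hypothetical.

C = 0.4  # CUBIC scaling constant (RFC 9438)


def w_cubic(t, w_max, cwnd_epoch):
    """CUBIC target window (packets) t seconds into the epoch."""
    k = (max(w_max - cwnd_epoch, 0.0) / C) ** (1.0 / 3.0)
    return C * (t - k) ** 3 + w_max


def elapsed_for_cubic(now, epoch_start, excluded):
    """Elapsed epoch time with idle/app-limited seconds removed (hypothetical helper)."""
    return max(now - epoch_start - excluded, 0.0)


# Epoch starts at t = 0 with w_max = 100 and cwnd = 70 packets.
# The sender is busy for 2 s, idle for 30 s, and resumes at t = 32 s.
naive = w_cubic(elapsed_for_cubic(32.0, 0.0, 0.0), 100.0, 70.0)
adjusted = w_cubic(elapsed_for_cubic(32.0, 0.0, 30.0), 100.0, 70.0)
```

With idle time counted, the target window comes out at several thousand packets, which would let the sender burst far beyond anything it has actually probed. With the idle time excluded, the target stays just under the previous maximum. The exclusion is sound in principle; the danger lies in deciding when a connection counts as idle.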

The Fix: An Elegant (Almost) One-Line Change
The solution was deceptively simple. By adjusting how CUBIC tracks the time since the last congestion event, the engineers broke the cycle that kept cwnd pinned at its minimum. The change ensures the algorithm properly resets its internal state after a recovery attempt. "We were thrilled to find such a clean fix for a bug that caused so much trouble," said the engineer. The patch has been merged into the quiche repository.
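The article does not reproduce the patch, so the toy model below only shows one way time tracking can pin a window at its floor. It is not quiche's code. It assumes, hypothetically, that the sender treats "nothing in flight" as "idle" and pushes the epoch start forward by the gap since some reference time. At minimum cwnd the whole flight is acknowledged before the next send, so that condition holds every round trip. The `idle_reference` parameter, the 2-packet floor, and the 100 ms RTT are all assumptions made for this sketch.

```python
# Toy model, not quiche's code: how an idle-time adjustment can freeze
# CUBIC's clock when the window is tiny. All names and values are assumptions.

C = 0.4          # CUBIC scaling constant (RFC 9438)
MIN_CWND = 2.0   # assumed window floor, in packets
RTT = 0.1        # assumed round-trip time, in seconds


def w_cubic(t, w_max, cwnd_epoch):
    """CUBIC target window (packets) t seconds into the epoch."""
    k = (max(w_max - cwnd_epoch, 0.0) / C) ** (1.0 / 3.0)
    return C * (t - k) ** 3 + w_max


def simulate(rounds, idle_reference):
    """Run `rounds` round trips after a collapse and return the final cwnd.

    idle_reference selects what the "idle gap" is measured from:
      "last_send": gap since the previous send (stall-prone in this model)
      "last_ack":  gap since in-flight data drained (hypothetical alternative)
    """
    w_max = 10.0
    cwnd = MIN_CWND
    epoch_start = last_send = last_ack = now = 0.0
    for _ in range(rounds):
        # Nothing is in flight here: the previous tiny flight was fully acked.
        reference = last_send if idle_reference == "last_send" else last_ack
        epoch_start += now - reference   # exclude the "idle" gap from t
        last_send = now
        now += RTT                       # the flight is acked one RTT later
        last_ack = now
        t = now - epoch_start
        cwnd = max(MIN_CWND, w_cubic(t, w_max, MIN_CWND))
    return cwnd


stalled = simulate(50, "last_send")    # t never advances past one RTT
recovered = simulate(50, "last_ack")   # t advances normally; cwnd climbs
```

In the first variant every round trip is mistaken for an idle period, the epoch start moves forward by exactly one RTT each time, and the elapsed time never grows, so the cubic curve is evaluated at the same point forever. In the second, a continuously sending connection has no idle gap and the window recovers. Whether the actual patch resembles this is not something the article establishes.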
What This Means
For Cloudflare users and the broader internet, this bug meant that any QUIC connection suffering early loss could remain stuck in a low-throughput state, leading to poor performance or complete stalls. With the fix, connections now properly recover after congestion collapse, restoring normal bandwidth probing.
This incident highlights the complexity of porting kernel-level optimizations to user-space stacks. "What works fine in TCP can break in QUIC if we don't carefully model the differences," the engineer noted. Cloudflare urges all users of quiche to update to the latest version immediately.
Further analysis of the bug and the fix is available in Cloudflare's engineering blog.