Nvidia RTX 5090 reset bug prompts $1,000 reward for a fix — cards become completely unresponsive and require a reboot after virtualization reset bug, also impacts RTX PRO 6000

2 months ago 99

(Image credit: AFOX)

Nvidia’s new RTX 5090 and RTX PRO 6000 GPUs are reportedly plagued by a reproducible virtualization reset bug that can leave the cards completely unresponsive until the host system is power-cycled.

CloudRift, a GPU cloud provider, published a detailed breakdown of the issue after encountering it on multiple Blackwell-equipped systems in production. The company has even issued a $1,000 public bug bounty for anyone able to identify a fix or root cause.

Reset bug bricks Blackwell

According to CloudRift’s logs, the bug occurs after a GPU has been passed through to a VM using KVM and VFIO. On guest shutdown or GPU reassignment, the host issues a PCIe function-level reset (FLR), which is a standard part of cleaning up a passthrough device. But instead of returning to a known-good state, the GPU fails to respond: “not ready 65535ms after FLR; giving up,” the kernel reports.
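For readers who want to check their own hosts for the same symptom, here is a minimal Python sketch that scans kernel log text for the timeout message quoted above. Only the quoted tail of the message comes from CloudRift's report; the optional PCI address prefix in the pattern and the function name `find_flr_timeouts` are assumptions made for illustration.

```python
import re

# Matches the kernel message quoted in the report. The optional PCI address
# prefix (for example "0000:01:00.0:") is an ASSUMPTION about how the line
# appears in a given log; only the quoted tail comes from the source.
FLR_TIMEOUT = re.compile(
    r"(?:(?P<bdf>[0-9a-fA-F]{4}:[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7]):\s*)?"
    r"not ready (?P<ms>\d+)ms after FLR; giving up"
)


def find_flr_timeouts(log_text):
    """Return (pci_address_or_None, milliseconds) for each FLR timeout line."""
    hits = []
    for line in log_text.splitlines():
        match = FLR_TIMEOUT.search(line)
        if match:
            hits.append((match.group("bdf"), int(match.group("ms"))))
    return hits
```

The function takes plain text, so it can be fed the captured output of `dmesg` or the contents of a saved log file.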

At this point, the card also becomes unreadable to lspci, which throws “unknown header type 7f” errors. CloudRift notes that the only way to restore normal operation is to power-cycle the entire machine. Tiny Corp, the AI start-up behind tinygrad, brought attention to the issue by reposting CloudRift’s findings on X.com with a blunt question: “Do 5090s and RTX PRO 6000s have a hardware defect? We’ve looked into this and can’t find a fix.”

Do 5090s and RTX PRO 6000s have a hardware defect? We've looked into this and can't find a fix. tl;dr the cards can get into a state where they don't listen to reset. https://t.co/7HgpBfn8Nd (September 6, 2025)
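The “unknown header type 7f” error from lspci is consistent with a device whose PCI configuration reads return all ones: the Header Type byte at offset 0x0E then reads 0xFF, and masking off the multi-function bit leaves 0x7f. The Python sketch below checks raw configuration-space bytes for that pattern. It is an illustration, not a diagnostic tool from CloudRift or Nvidia; the function names are hypothetical, the device address is a placeholder, and treating a short read as a failure is an assumption.

```python
from pathlib import Path

HEADER_TYPE_OFFSET = 0x0E  # standard PCI config-space offset of Header Type


def looks_unresponsive(config: bytes) -> bool:
    """True if config-space bytes look like a device that no longer answers."""
    if len(config) <= HEADER_TYPE_OFFSET:
        # ASSUMPTION: a read too short to contain the header is treated as suspect.
        return True
    vendor_id = int.from_bytes(config[0:2], "little")
    # Bit 7 of Header Type is the multi-function flag; the low 7 bits are the layout.
    header_layout = config[HEADER_TYPE_OFFSET] & 0x7F
    return vendor_id == 0xFFFF or header_layout == 0x7F


def check_device(bdf: str) -> bool:
    """Read a device's config space from Linux sysfs and test it.

    bdf is a placeholder PCI address such as "0000:01:00.0".
    """
    config = Path(f"/sys/bus/pci/devices/{bdf}/config").read_bytes()
    return looks_unresponsive(config)
```

Calling `check_device` with the address of a passed-through GPU after a guest shutdown would show whether the card still answers configuration reads. A `True` result matches the symptom described in this article but does not by itself identify the cause.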

Other users confirm similar failures

Threads across the Proxmox forums and Level1Techs community suggest that home users and other early adopters of the RTX 5090 are also encountering similar behavior.

In one case, a user reported a complete host hang after a Windows guest was shut down, with the GPU failing to reinitialize even after an OS-level reboot. In another case, a user said, “I found my host became unresponsive. Further debugging shows that the host CPU got soft lock [sic] after a FLR timeout, which is after a shutdown of LinuxVM. No issue for my previous 4080.”

Several users confirm that toggling PCIe ASPM or ACS settings does not mitigate the failure. No issues have been reported with older cards such as the RTX 4090, suggesting that the bug may be limited to Nvidia’s Blackwell family.


FLR is a critical feature in GPU passthrough configurations, allowing a device to be safely reset and reassigned between guests. If FLR is unreliable, then multi-tenant AI workloads and home lab setups using virtualization become risky, particularly when a single card failure takes down the entire host.

Nvidia has not yet officially acknowledged the issue, and there is no known mitigation at the time of writing.


Luke James is a freelance writer and journalist. Although his background is in legal, he has a personal interest in all things tech, especially hardware and microelectronics, and anything regulatory.
