NVIDIA’s flagship GPUs—specifically the GeForce RTX 5090 and the RTX PRO 6000—appear to be affected by a new bug that leaves them unresponsive under virtualization.
According to CloudRift, a GPU cloud provider for developers, these high-end GPUs began exhibiting issues after extended use in VM environments. After a few days of VM usage, the GPUs became completely unresponsive and could not be accessed again until the entire node was rebooted. The problem is reported to affect only the RTX 5090 and the RTX PRO 6000; other models, including the RTX 4090, Hopper-based H100, and Blackwell-based B200, are not currently affected.
The issue occurs when a GPU is passed through to a VM using VFIO, the Linux kernel framework for assigning PCI devices to guests. After a Function Level Reset (FLR), the GPU stops responding, triggering a kernel soft lockup that hangs both the host and the guest. The only recovery is to reboot the host machine, which is disruptive for CloudRift given the large number of guest machines it runs per node.
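The passthrough-and-reset sequence described above can be sketched with standard Linux sysfs interfaces. This is a minimal illustration, not CloudRift’s actual tooling: the PCI address `0000:01:00.0` is a placeholder, the commands require root, and the final step assumes the device supports FLR.

```shell
GPU=0000:01:00.0   # placeholder PCI address of the GPU being passed through

# Check whether the GPU advertises Function Level Reset support
# (look for "FLReset+" in the DevCap line of the output).
lspci -vv -s "$GPU" | grep -i flreset

# Unbind the GPU from its current driver and hand it to vfio-pci,
# which is how a host prepares a device for VM passthrough.
echo "$GPU"     > "/sys/bus/pci/devices/$GPU/driver/unbind"
echo vfio-pci   > "/sys/bus/pci/devices/$GPU/driver_override"
echo "$GPU"     > /sys/bus/pci/drivers/vfio-pci/bind

# Trigger a reset manually (an FLR, if the device supports one).
# On affected RTX 5090 / RTX PRO 6000 cards, this reset is the point
# after which the GPU reportedly stops responding.
echo 1 > "/sys/bus/pci/devices/$GPU/reset"
```

In normal operation the hypervisor (e.g. QEMU/KVM) performs the unbind and reset steps automatically when a guest starts or shuts down, which is why the hang surfaces when a VM is stopped.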
CloudRift first reported the crashes, and Proxmox users have since reproduced them, with one user experiencing a complete host crash after shutting down a Windows guest. NVIDIA has acknowledged the problem, stating that it has reproduced the issue and is working on a fix. Pending a full root-cause statement from NVIDIA, the bug appears specific to these Blackwell-based consumer and workstation GPUs.
Interestingly, CloudRift has offered a $1,000 bug bounty to anyone who can resolve or mitigate the issue. Given its impact on critical AI workloads, we expect NVIDIA to release a fix soon.