Troubleshooting Nvidia Server GPU Crashing Under Load

Crashes on an Nvidia server GPU under heavy load can be hard to diagnose, especially in high-performance computing environments. This article walks through a specific case of system instability with a Titan Black GPU in a workstation and outlines the troubleshooting steps taken to diagnose it.

The system is built around a six-core Xeon E5-1650 processor with 64GB of RAM and a 500GB SSD for boot and storage. A GTX 750 Ti drives the display, while the Titan Black is reserved for compute work. Despite seemingly adequate power and cooling, the system crashed under specific GPU workloads.

Initial investigation focused on the power supply unit (PSU). The Titan Black needs auxiliary power: its 6-pin connector was fed by a PCIe cable directly from the PSU, and because the PSU lacks an 8-pin connector, an LP4 to 8-pin PCI Express adapter supplied the 8-pin connector. Suspecting power delivery issues, the user rewired the LP4 connections so that each ran directly from the PSU, eliminating any daisy-chaining. This change did not resolve the crashes. The PSU carries an 80 Plus Gold certification and was initially believed to be sufficient.
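As a rough sanity check on the power-delivery theory, the nominal connector budget can be added up. The sketch below assumes the usual PCI Express limits (75 W from the slot, 75 W from a 6-pin connector, 150 W from an 8-pin connector) and Nvidia's published 250 W board power for the Titan Black. None of these figures were measured on this system, and the function name is purely illustrative.

```python
# Nominal PCI Express power limits in watts (specification values, not measurements).
CONNECTOR_LIMITS_W = {"slot": 75, "6-pin": 75, "8-pin": 150}

# Assumption: Nvidia's published board power (TDP) for the Titan Black.
TITAN_BLACK_TDP_W = 250


def power_headroom(connectors, board_power_w):
    """Return nominal watts available from the connectors minus the board's rated power."""
    available = sum(CONNECTOR_LIMITS_W[c] for c in connectors)
    return available - board_power_w


if __name__ == "__main__":
    headroom = power_headroom(["slot", "6-pin", "8-pin"], TITAN_BLACK_TDP_W)
    print(f"Nominal headroom: {headroom} W")
```

A positive figure here only shows that the connectors are rated for the card on paper. It says nothing about whether an LP4 adapter, or the PSU rail feeding it, can actually sustain its share of the load during short power spikes.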

Further testing showed that the crashes were triggered by specific workloads, notably matrix-matrix multiplication. Stress tests such as nbody -fp64 and CUDA-Z, even at heavy load settings, did not initially cause crashes or noticeably raise the Titan Black's fan speed, although they reported substantial double-precision floating-point performance. The matrix multiplication test, by contrast, crashed the system within one or two iterations without taxing the CPU, RAM, or the GTX 750 Ti display GPU.
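Because the fan speed barely moved before the crashes, it may help to record what the GPU reports in the seconds leading up to a failure. The following is a minimal sketch, not something the original troubleshooting used: it polls nvidia-smi with its `--query-gpu` option and forces each sample to disk so the last readings survive a hard hang. The field selection, the function names, and the log path in the usage note are assumptions, and an older board such as the Titan Black may report "[Not Supported]" for some fields.

```python
import os
import subprocess
import time

# Fields passed to `nvidia-smi --query-gpu`; some may be unsupported on older boards.
FIELDS = ["timestamp", "index", "temperature.gpu", "fan.speed", "power.draw", "clocks.sm"]


def parse_line(line):
    """Split one CSV line of nvidia-smi output into a dict keyed by field name."""
    values = [v.strip() for v in line.split(",")]
    return dict(zip(FIELDS, values))


def sample():
    """Query every GPU once and return a list of per-GPU dicts."""
    cmd = ["nvidia-smi", "--query-gpu=" + ",".join(FIELDS), "--format=csv,noheader,nounits"]
    out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    return [parse_line(line) for line in out.splitlines() if line.strip()]


def log_until_crash(path, interval_s=1.0):
    """Append samples to `path` until interrupted or the machine hangs."""
    with open(path, "a") as f:
        while True:
            for gpu in sample():
                f.write(",".join(gpu[k] for k in FIELDS) + "\n")
            f.flush()
            os.fsync(f.fileno())  # push the data to disk before the next sample
            time.sleep(interval_s)
```

Running `log_until_crash("gpu_log.csv")` in one terminal while the matrix multiplication test runs in another would leave a record of temperature, fan speed, and clocks right up to the last sample before the crash.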

The issue was narrowed down further using the nbody test. Increasing the problem size (n) with the -fp64 flag made the crash reliably reproducible. A threshold emerged: n=60000 consistently caused immediate failures, while n=55000 and below ran without issues. To isolate power delivery as the root cause, a separate, older PC power supply was then used to feed the Titan Black's 8-pin connector. The GPU fan immediately spun up to maximum speed and the system hung, so the supply was disconnected at once to prevent damage.
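The gap between the last passing and first failing problem size could be narrowed with a simple bisection. The sketch below is illustrative only: `runs_ok` is a hypothetical callback that reports whether nbody completed at a given n, and because a failure here hangs the whole machine, in practice each probe would be run by hand and its result recorded across reboots.

```python
def find_crash_threshold(runs_ok, lo, hi, step=1000):
    """Bisect between a passing size `lo` and a failing size `hi`.

    Assumes runs_ok(lo) is True and runs_ok(hi) is False.
    Returns (largest known passing n, smallest known failing n),
    narrowed until the two are at most `step` apart.
    """
    while hi - lo > step:
        mid = (lo + hi) // 2
        if runs_ok(mid):
            lo = mid  # mid passed, so the threshold lies above it
        else:
            hi = mid  # mid failed, so the threshold is at or below it
    return lo, hi


if __name__ == "__main__":
    # Simulated predicate with a made-up threshold, for demonstration only.
    simulated = lambda n: n < 57500
    print(find_crash_threshold(simulated, 55000, 60000))
```

Starting from the observed bounds of 55000 and 60000, three probes are enough to narrow the threshold to within 1000 bodies.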

Despite these troubleshooting steps, the root cause of the Nvidia server GPU crashing under specific loads remains unresolved. Power delivery has been investigated and daisy-chaining eliminated through direct PSU connections, yet the system still crashes consistently under specific, demanding GPU computations. This suggests a deeper, possibly hardware-related issue that requires further expert analysis.
