Nvidia engineer breaks and then quickly fixes AMD GPU performance in Linux

1 day ago

In a surprising turn of events, an Nvidia engineer pushed a fix to the Linux kernel that resolves a performance regression seen on AMD integrated and dedicated GPU hardware (via Phoronix). As it turns out, the same engineer had inadvertently introduced the problem in the first place with a set of kernel changes last week that were meant to increase the PCI BAR space to more than 10TiB. Those changes ended up incorrectly flagging the GPU as having limited addressing, which hampered performance, but thankfully the issue was quickly picked up and fixed.

In the open-source world, it's an unwritten rule that you fix what you break. The Linux kernel is open source and accepts contributions from anyone, all of which are reviewed before being merged, and responsible contributors are expected to help fix issues that arise from their changes. So, despite their rivalry in the GPU market, FOSS (Free and Open Source Software) is an avenue that bridges the chasm between AMD and Nvidia.

Radeon performance-breaking bug

The regression was caused by a commit intended to increase the PCI BAR space beyond 10TiB, likely for systems with large memory spaces. This indirectly reduced KASLR (kernel address space layout randomization) entropy on consumer x86 devices. That entropy determines how random the location of the kernel's data in memory is on each boot, which is a security measure. At the same time, the commit artificially inflated the upper bound of the memory range the kernel treats as accessible (direct_map_physmem_end), typically to 64TiB.

In Linux, memory is divided into different zones, one of which is the device zone, which can be associated with a GPU. The problem here is that when the kernel initialized device zone memory for Radeon GPUs, an associated variable (max_pfn) that represents the upper limit of RAM addressable by the kernel would artificially increase to 64TiB.

Since the GPU likely cannot address the entire 64TiB range, the kernel's dma_addressing_limited() check would return true for it. That result essentially restricts the GPU to the DMA32 zone, which offers only 4GB of memory, and this explains the performance regression.
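To make the arithmetic concrete, here is a minimal Python sketch of that kind of check. It is a simplified model, not the kernel's actual code: the function addressing_limited(), the 44-bit GPU DMA mask, and the 64GiB "normal" system are all hypothetical values chosen for illustration. Only the 64TiB inflated limit and the 4GB DMA32 size come from the description above.

```python
PAGE_SHIFT = 12          # 4 KiB pages, the usual x86 page size
GIB = 1 << 30
TIB = 1 << 40
DMA32_LIMIT = 1 << 32    # the DMA32 zone covers addresses below 4 GiB


def dma_mask(bits: int) -> int:
    """Highest address a device with a `bits`-wide DMA mask can reach."""
    return (1 << bits) - 1


def addressing_limited(device_mask_bits: int, max_pfn: int) -> bool:
    """Simplified stand-in for the check described above (hypothetical helper).

    The device counts as limited when its DMA mask cannot cover the
    highest physical address implied by max_pfn.
    """
    highest_addr = (max_pfn << PAGE_SHIFT) - 1
    return dma_mask(device_mask_bits) < highest_addr


# Hypothetical GPU that can address 44 bits (16 TiB) of physical memory.
GPU_MASK_BITS = 44

# Hypothetical machine with 64 GiB of RAM, before the regression.
normal_max_pfn = (64 * GIB) >> PAGE_SHIFT
# The same machine with the limit inflated to 64 TiB.
inflated_max_pfn = (64 * TIB) >> PAGE_SHIFT

print(addressing_limited(GPU_MASK_BITS, normal_max_pfn))    # False
print(addressing_limited(GPU_MASK_BITS, inflated_max_pfn))  # True
print(DMA32_LIMIT // GIB)                                   # 4
```

Under these assumptions, the same GPU flips from "not limited" to "limited" purely because the reported upper bound of memory grew, even though the installed RAM did not change.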

The good news is that the fix should take effect as soon as the pull request lands, right before the Linux 6.15 merge window closes today with the release of 6.15-rc1. Since a stable kernel typically follows after roughly seven or eight weekly release candidates, we can expect the stable 6.15 release to be available around late May or early June.
