Testing DirectStorage with GPU decompression — do Blackwell GPUs have the upper hand?

A GeForce RTX 5070 graphics card (Image credit: Future)

Microsoft first announced DirectStorage for PC back in 2020, with Forspoken being the first game to officially support it in early 2023. However, it wasn’t until Ratchet & Clank: Rift Apart was released later that year that we saw the full DirectStorage suite in action. Ratchet & Clank: Rift Apart was the first title to ship with GDeflate compression and support for GPU decompression of assets – a task that had previously been the responsibility of the CPU.

In theory, this should have facilitated more seamless streaming of assets with smoother performance, as the feature aimed to reduce the CPU bottleneck associated with the decompression of assets during gameplay. In practice, the opposite happened, in particular on Nvidia GPUs.

What is DirectStorage?

DirectStorage on PC aims to bring many of the benefits of the fast storage technology used in the PS5 and Xbox Series X|S. Its purpose is to allow games to make full use of NVMe SSDs with minimal CPU overhead, allowing for reduced load times, faster asset streaming, and larger, more dense worlds in games. DirectStorage 1.1 also added support for GPU decompression, which would shift the burden of decompression of game assets from the CPU to the GPU. This amplifies the amount of data that can be transferred through the SSD -> RAM -> VRAM pipeline.
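The amplification effect described above comes down to simple arithmetic: if assets stay compressed until they reach VRAM, each byte read from the SSD expands into more than one byte of usable data. The sketch below is only an illustration. The function name, the 7 GB/s read speed, and the 2:1 compression ratio are hypothetical values chosen for the example, not measurements from this article, and the model assumes the decompressor can keep pace with the drive.

```python
def effective_throughput_gb_per_s(raw_read_gb_per_s: float,
                                  compression_ratio: float) -> float:
    """Uncompressed data delivered per second when compressed data is read
    at raw_read_gb_per_s and expanded by compression_ratio.

    Simplified model: assumes decompression never becomes the bottleneck.
    """
    if raw_read_gb_per_s < 0 or compression_ratio <= 0:
        raise ValueError("read speed must be >= 0 and ratio must be > 0")
    return raw_read_gb_per_s * compression_ratio


# Hypothetical figures: a 7 GB/s SSD and assets that compress 2:1.
example = effective_throughput_gb_per_s(7.0, 2.0)
print(f"Effective throughput: {example:.1f} GB/s")
```

Real compression ratios vary by asset type, so the multiplier in practice is not a fixed number.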


Unlike CPUs, GPUs have thousands of cores, and they are also very efficient at performing repeatable tasks in parallel. GDeflate is a data-parallel compression scheme that is specifically optimized for GPU decompression.

GDeflate has two levels of parallelism. First, the original data stream is split into 64 KB tiles, and each tile is compressed separately. If the CPU does the decompression, then each tile can be decompressed by a different thread. If the GPU does the decompression, then each tile can be decompressed by a single thread group. Second, the data is arranged within a tile so that many lanes within a thread group can decompress that tile in parallel. GPU decompression not only saves CPU cycles, but also saves system interconnect bandwidth and on-disk footprint since the data remains compressed until it reaches VRAM.
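The first level of parallelism, independent 64 KB tiles, can be sketched on the CPU in a few lines. This is an illustration only: it uses Python's standard zlib (ordinary DEFLATE) as a stand-in, because GDeflate itself is not part of the standard library, and it does not model the second level, where lanes within a GPU thread group cooperate on a single tile. The function names and the sample data are hypothetical.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

TILE_SIZE = 64 * 1024  # 64 KB tiles, matching the tile size described above


def compress_tiles(data: bytes) -> list[bytes]:
    """Split data into fixed-size tiles and compress each one independently."""
    return [
        zlib.compress(data[offset:offset + TILE_SIZE])
        for offset in range(0, len(data), TILE_SIZE)
    ]


def decompress_tiles(tiles: list[bytes], workers: int = 4) -> bytes:
    """Decompress tiles on a thread pool; map() returns results in tile order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return b"".join(pool.map(zlib.decompress, tiles))


# Stand-in asset: 256 KB of repeating bytes, which splits into four tiles.
asset = bytes(range(256)) * 1024
tiles = compress_tiles(asset)
restored = decompress_tiles(tiles)
print(len(tiles), restored == asset)
```

Because no tile depends on another, the work can be handed to as many threads (or, on a GPU, thread groups) as there are tiles.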

DirectStorage with GPU Decompression
(Image credit: Tom's Hardware)

One benefit of data moving at a faster rate through this pipeline is that, theoretically, you would need to hold less data in system memory at any point in time, which could be extremely helpful given the skyrocketing prices of DDR. This will be especially true if game developers start to lean even more into the high bandwidth of NVMe storage. Another benefit, which is especially evident in Ratchet & Clank, is that textures load in faster with DirectStorage enabled. As you can see above, with it disabled, you get blurry textures until the higher resolution textures load in.

What is the problem?

When Ratchet & Clank: Rift Apart launched, many users reported that disabling DirectStorage by removing the dstorage.dll files from the game folder (and therefore falling back to CPU decompression) resulted in better performance – particularly in terms of more stable frametime. The issue affected mainly Nvidia GPUs, including the 4090 and 3090.

In early 2025, Marvel’s Spider-Man 2 was released on PC with DirectStorage and GPU decompression support. Once again, there were reports of improved performance when disabling DirectStorage and falling back to CPU decompression on Nvidia GPUs. In my testing on a 4090, disabling DirectStorage raised 1% lows in Spider-Man 2 by 18-25%, depending on resolution. These GPUs struggled to handle both rendering and decompression tasks simultaneously.

Do Blackwell GPUs suffer from this issue?

My initial tests were performed on a 5090, but I was pleasantly surprised to see that leaving DirectStorage enabled no longer tanked performance. However, the 5090 is an absolute beast, so this was not necessarily a sign that the Blackwell architecture is better suited to handle GPU decompression. For that, we need to test more Blackwell GPUs across the stack. Note that AMD Radeon GPUs never experienced this issue, so we will only be testing Nvidia cards in this article.

Test system

  • AMD Ryzen 7 9800X3D
  • 64GB (2x32GB) G.SKILL Flare X5 DDR5 @6200 MHz CL30
  • Crucial T700 Gen5 SSD
  • Asus ROG STRIX B850-F Gaming WiFi
  • Corsair Nautilus 360 RS AIO Cooler

DirectStorage Benchmark Charts
(Image credit: Tom's Hardware)

As you can see in both Spider-Man 2 and Ratchet & Clank, the RTX 5090 does not skip a beat with DirectStorage/GPU decompression enabled at any resolution. In fact, we now see some gains in average framerate and 1% lows in some cases.

Below, we test the RTX 5070 in the same games at the same settings at 1080p and 1440p.

DirectStorage Benchmark Charts
(Image credit: Tom's Hardware)

Similarly, the RTX 5070 sees some nice gains in 1% lows with DirectStorage enabled. At 1440p, the GPU load throughout the Spider-Man 2 benchmark is over 98%, so we are GPU-bound, and yet the 5070 has no issues rendering and decompressing assets simultaneously during gameplay.

Now for an even bigger test. Can the RTX 5060 maintain the same level of performance when it has to render and decompress assets on-the-fly?

DirectStorage Benchmark Charts
(Image credit: Tom's Hardware)

Indeed, it can. At 1080p, the GPU load throughout the Spider-Man 2 benchmark run is over 98%, which means that even when GPU-bound, the RTX 5060 does not lose any performance when it is tasked with decompressing assets during gameplay.

By contrast, you can see below how the RTX 4060 handles GPU decompression.

DirectStorage Benchmark Charts
(Image credit: Tom's Hardware)

The performance degradation can be quite significant in terms of the 1% lows, which indicate how smooth and stable the game feels when traversing the game world. In contrast with the RTX 5060, the 4060 struggles when it is tasked with decompressing assets.
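For readers unfamiliar with the metric, the sketch below shows one common way to derive average FPS and 1% lows from a list of frame times. It is an assumption for illustration, not necessarily the method used by the capture tool behind these charts: here the 1% low is the average framerate over the slowest 1% of frames, while some tools report the 99th-percentile frame time instead. The function name and sample data are hypothetical.

```python
def fps_summary(frame_times_ms: list[float]) -> tuple[float, float]:
    """Return (average FPS, 1% low FPS) from per-frame render times in ms.

    Assumed definition: the 1% low is the average framerate across the
    slowest 1% of frames (at least one frame).
    """
    if not frame_times_ms:
        raise ValueError("frame_times_ms must not be empty")
    average_fps = 1000.0 * len(frame_times_ms) / sum(frame_times_ms)
    slowest_first = sorted(frame_times_ms, reverse=True)
    worst = slowest_first[:max(1, len(slowest_first) // 100)]
    one_percent_low_fps = 1000.0 * len(worst) / sum(worst)
    return average_fps, one_percent_low_fps


# Hypothetical run: 99 smooth frames at 10 ms plus a single 40 ms hitch.
frame_times = [10.0] * 99 + [40.0]
avg_fps, low_fps = fps_summary(frame_times)
print(f"average: {avg_fps:.1f} FPS, 1% low: {low_fps:.1f} FPS")
```

A single hitch barely moves the average but dominates the 1% low, which is why the metric tracks perceived stutter more closely than average framerate does.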

However, GPU decompression is working as intended on the 40-series from a texture streaming perspective. Textures load in on time, just as they do on the 50-series.

Possibly the most interesting result, however, is what we see on a 4060 laptop.

Laptop specs

  • RTX 4060 Laptop
  • Intel Core i7-13620H
  • Gen4 SSD
  • 16GB DDR5

DirectStorage Benchmark Charts
(Image credit: Tom's Hardware)

As you can see, even on a system with a lower-end CPU, we still experience a drop in performance when shifting the decompression task from the CPU to the GPU on a previous-generation RTX card. This is true even at 720p, when we are CPU-bound (the GPU load in Spider-Man 2 throughout the benchmark is 76%).

From the tests, it is clear that Blackwell GPUs across the stack experience no performance degradation – from the 5090 all the way down to the 5060. Meanwhile, previous-generation GPUs still struggle with GPU decompression, at least the ones tested here.

Why do Blackwell GPUs handle GPU decompression better?

We only tested a limited selection of hardware and scenarios, but our results do show a few clear tendencies. What is less obvious is why the RTX 50-series handles GPU decompression better than its predecessors. The Blackwell data center GPUs have a dedicated decompression block, but nothing in the consumer Blackwell whitepaper indicates that the consumer GPUs have one.

One possible explanation is the addition of an improved scheduler in the Blackwell architecture. Despite its name – AI Management Processor (AMP) – the scheduler can benefit any asynchronous workload that runs alongside other graphics workloads, not just AI-related tasks. The AMP is implemented using a dedicated RISC-V processor, which isn’t anything new for Nvidia GPUs, as RTX cards since Turing have featured RISC-V-based GPU System Processors. What does appear to be new is the fact that it was built specifically for Windows Hardware-Accelerated GPU Scheduling (HAGS), which allows the GPU to handle its own memory more efficiently without having to rely on the CPU.

According to the whitepaper, AMP matches the Microsoft architectural model that describes a configurable scheduling core on the GPU through HAGS. The AMP appears to be smarter and more efficient than previous generation schedulers, and the faster and more efficient scheduling of asynchronous workloads that it facilitates could be what we are seeing with GPU decompression on Blackwell GPUs.

Dan Mateescu is a PC enthusiast with many years of experience benchmarking PC hardware. In 2021, he started his own YouTube channel called 'Compusemble' where he benchmarks hardware in video games and the latest tech demos.
