Presented by Solidigm
As inference workloads evolve from discrete question-and-answer exchanges into persistent, multi-step agentic systems, GPU availability is no longer the most critical AI bottleneck. Instead, the bottleneck has migrated from compute to context, says Jeff Harthorn, AI applied research lead at Solidigm.
"Why context management has become a primary bottleneck, more than GPU availability or compute efficiency, is the question of 2026," says Harthorn. "GPUs have gotten dramatically cheaper per FLOP. Model architectures and inference serving engines have all gotten much more efficient. But the thing that's grown faster than both of those is context. The persistent state that has to live between sessions has grown even faster than context itself."
Three trends are driving this at once. Context windows are growing dramatically, making individual inputs far larger than before. Agentic AI systems chain dozens or hundreds of model calls together, each generating state that must be tracked. And enterprises are requiring that inference state persist across sessions for audit, governance, and reuse. These trends compound each other, pushing context volumes beyond what any existing memory tier was designed to handle.
"Those three things are all happening at the same time, all of which are pushing context data and context memory into the stratosphere much more quickly than we're used to seeing," adds Ace Stryker, director of AI and ecosystem marketing at Solidigm.
The solution is a dedicated context tier emerging between GPU memory and bulk network storage: a layer of high-performance, high-density flash designed to hold key-value (KV) cache, the inference data that allows models to retain and reuse context, along with retrieval data, and to serve both at inference speed. Nvidia has formalized this architecture under the term CMX. Storage companies including Solidigm are building SSD products optimized for this workload.
"Storage has not been the first thing folks have thought about when they've been planning their enterprise infrastructure buildout," Stryker says. "In a lot of ways, it was a relatively small cost compared to compute, and it was a commodity. You just shopped around for the lowest dollar per gigabyte and called it good. But now, if your storage is not up to snuff, your ROI suffers, and it directly impacts your bottom line."
Why AI inference requires a different storage architecture than training
The storage architecture that AI systems rely on today was largely inherited from training workflows. Training is sequential and write-dominated, with data moving in large blocks to and from bulk object storage. The tier structure, with high-bandwidth memory on the GPU, fast NVMe in the server, and bulk storage over the network, serves that use case reasonably well.
However, inference is a different animal. Its I/O signature is fine-grained, latency-sensitive, and increasingly stateful. KV cache data and retrieval data each have distinct access patterns, but both need to be served quickly and reused across interactions. Neither fits cleanly within GPU high-bandwidth memory, which is expensive and physically constrained, nor within traditional bulk storage, which was never designed for active inference workloads.
"The architectural gap that's interesting to me right now isn't at the top of the stack or the bottom, it's right in the middle," Harthorn says. "A lot of what sits below the GPU HBM is being asked to do things it wasn't really designed for, which is where the most interesting systems work today is happening."
One of the most visible symptoms of this gap is recomputation. In inference, the pre-fill stage processes all of the context relevant to a given session before token generation can begin. When KV cache state isn't available in a fast, accessible tier, the system recomputes it, burning GPU cycles that produce no new value.
"A meaningful share of GPU cycles end up going to re-pre-filling," Harthorn explains. "During all of that calculated context, that's potentially compute that's being spent reproducing state, rather than doing new work. When you start looking at the problem that way, GPU utilization starts looking like it's partly a storage problem."
This reframing is driving renewed interest in a metric borrowed from networking: goodput, or useful tokens per dollar, rather than raw tokens per dollar.
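To make the distinction concrete, here is a minimal sketch of the two metrics. The function names and the token and cost figures are illustrative assumptions, not numbers from Solidigm; the only idea taken from the article is that tokens spent re-pre-filling context count toward raw throughput but not toward goodput.

```python
def raw_tokens_per_dollar(useful_tokens, recomputed_tokens, cost_dollars):
    """Every token the GPUs processed, including repeated pre-fill, per dollar."""
    return (useful_tokens + recomputed_tokens) / cost_dollars


def goodput_tokens_per_dollar(useful_tokens, cost_dollars):
    """Only tokens that represent new work, per dollar."""
    return useful_tokens / cost_dollars


# Hypothetical workload: 30% of processed tokens are recomputed context.
useful = 700_000
recomputed = 300_000
cost = 10.0  # dollars of GPU time, assumed

raw = raw_tokens_per_dollar(useful, recomputed, cost)
goodput = goodput_tokens_per_dollar(useful, cost)
print(f"raw: {raw:,.0f} tokens/$, goodput: {goodput:,.0f} tokens/$")
```

In this made-up case the two figures diverge by exactly the share of work that was recomputation, which is the gap a faster context tier is meant to close.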
The AI context memory tier and how it works
The industry's response is taking structural form. A new tier is emerging between GPU memory and traditional network storage, designed specifically to hold and serve inference context. It is a layer distinct from both the drives inside GPU servers (G3) and the storage servers reached over the network (G4), engineered to serve context data back to accelerators as rapidly as possible.
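Whether serving saved context from such a tier beats recomputing it is, at a first approximation, simple arithmetic. The sketch below is a back-of-the-envelope model, not a description of any vendor's implementation. It assumes a standard transformer whose KV cache holds one key and one value vector per layer, per KV head, per token; the model dimensions, tier bandwidth, and pre-fill rate are all hypothetical.

```python
def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bytes_per_value=2):
    """Approximate KV cache size for a standard transformer.

    The leading 2 accounts for storing both a key and a value tensor.
    bytes_per_value=2 assumes 16-bit values.
    """
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens


def should_fetch(tokens, layers, kv_heads, head_dim,
                 tier_bytes_per_s, prefill_tokens_per_s):
    """Return True if reading the cache back is faster than re-running pre-fill."""
    fetch_seconds = kv_cache_bytes(tokens, layers, kv_heads, head_dim) / tier_bytes_per_s
    recompute_seconds = tokens / prefill_tokens_per_s
    return fetch_seconds < recompute_seconds


# Hypothetical model shape and session length.
size = kv_cache_bytes(tokens=1000, layers=32, kv_heads=8, head_dim=128)
print(f"KV cache for 1,000 tokens: {size / 1e6:.1f} MB")

# Hypothetical tier at 10 GB/s versus pre-fill at 10,000 tokens/s.
print(should_fetch(1000, 32, 8, 128,
                   tier_bytes_per_s=10e9, prefill_tokens_per_s=10_000))
```

The model ignores queuing, network hops, and partial cache hits, so it only indicates which side of the trade-off a given configuration sits on.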
"If you're building a data center starting in the second half of this year, or the beginning of next year, you can't think about storage only living in two places," Stryker says. "Storage has to live in at least three places to handle the context memory tier, and that's likely to be a permanent fixture in how the infrastructure gets built going forward."
It's analogous to the emergence of object storage as a category, which didn't exist until enough workloads needed it. And once it did, it developed its own primitives, SLAs, cost models, and an ecosystem of vendors.
"The context tier looks like it might be on a similar arc," Harthorn says. "That volumetric pressure is causing the category to form, rather than any one vendor's road map."
For infrastructure leaders, this means actively planning for the new tier rather than treating it as optional. Deploying additional NAND at this layer reduces dependency on DRAM, which is orders of magnitude more expensive per gigabyte and constrained in both availability and thermal headroom.
"In terms of your investment effectiveness, you're laying out less cash to do it if you rely on the SSD layer in the way that Nvidia is now recommending and prescribing for a lot of use cases," Stryker adds.
What flash needs to deliver to support AI inference
Participating meaningfully in the inference stack places new demands on SSD technology. Tail latency, the worst-case performance of a drive, must be predictable, not just fast on average. An orchestration system that allocates GPU resources based on expected storage response times cannot tolerate unexpected multi-second delays. Consistent, observable performance matters more here than peak throughput.
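A small example shows why the average hides the problem. The sketch below computes a nearest-rank percentile over a set of read latencies; the sample values are invented for illustration and do not describe any particular drive.

```python
import math
from statistics import mean


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample that at least
    pct percent of all samples are less than or equal to."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical read latencies in milliseconds: mostly fast, two slow outliers.
latencies_ms = [0.1] * 98 + [2.0, 5.0]

print("mean:", mean(latencies_ms))
print("p50: ", percentile(latencies_ms, 50))
print("p99: ", percentile(latencies_ms, 99))
```

Here the median matches the typical read, while the 99th percentile is many times higher. An orchestrator that budgets around the mean would be caught out by exactly those outliers.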
Beyond latency, density becomes a critical concern, especially at hyperscale. In data centers where power, not cost, is the binding constraint, watts per petabyte becomes the operative metric. Floating gate NAND, the manufacturing approach at the core of Solidigm's products, is suited to that calculation. Network integration via NVMe over Fabrics, RDMA, and eventual CXL support is also essential, given the tight latency budgets of active inference pipelines.
"The drives have to have reliable performance characteristics, beyond the throughput side and being able to transfer as much data as possible as fast as possible, the way that training needed," Harthorn says. "Now it's about being able to do it very consistently, in a way that's very observable to the people operating and orchestrating these systems."
How enterprise AI leaders should plan for the context tier
The standards, software primitives, and best practices being established now will define how AI inference infrastructure operates for years to come. Solidigm is engaged in that process through standards bodies, partner lab collaborations, and published research, which is critical precisely because the category is still forming.
"The interesting question for the next couple of years isn't whether AI infrastructure needs more compute," Harthorn says. "It's whether it can use what it has more efficiently. A lot of that answer runs through this tier that is being built today."
Sponsored articles are content produced by a company that is either paying for the post or has a business relationship with VentureBeat, and they’re always clearly marked. For more information, contact [email protected].







