Nvidia may postpone volume ramp-up of Blackwell machines: TrendForce

Dell servers based on Nvidia GB200
(Image credit: CoreWeave)

Nvidia may have to postpone the volume ramp of next-generation AI servers based on the B200 and GB200 platforms because of overheating, high power consumption, and the need to optimize interconnections, according to a TrendForce report. The market research firm believes that mass production and peak shipments of Blackwell machines will occur in mid-2025, which would amount to a delay of nearly half a year. Nvidia has yet to confirm or deny the claims.

As expected, Nvidia and its partners can ship only limited quantities of Blackwell-based servers in 2024, as the company has to use its low-yielding B200 for them. Even so, Dell is already shipping Blackwell server racks. Although refined versions of Nvidia's B200 processors entered mass production in October and should therefore reach the company in January, TrendForce does not expect shipments of Blackwell-based servers to skyrocket immediately. According to the firm, because of overheating, power consumption, and requirements for higher-speed interconnects, mass production and peak shipments of B200 and GB200 will occur only between the second and third quarters of 2025.

Just several months ago, it was reported that an Nvidia NVL72 rack based on the GB200 platform with 72 B200 GPUs would consume 120 kW of power, which is already significantly higher than current AI server racks (typical high-density rack power is up to 20 kW, while an H100-based rack reportedly consumes around 40 kW). TrendForce now claims that Nvidia has updated the specification of the device and that it now consumes 140 kW, which is more than typical data centers can provide to a single rack.

The problem is that Nvidia's Blackwell GPUs were reportedly prone to overheating in racks equipped with 72 processors even when those racks consumed up to 120 kW. This issue has forced Nvidia to repeatedly revise its server rack designs, as overheating not only reduces GPU performance but also risks hardware damage. A power consumption of 140 kW per rack means further alterations to server designs, which could result in setbacks.

Increased power consumption means additional cooling requirements. Liquid cooling is essential for Blackwell servers, but modern sidecar coolant distribution units (CDUs) can only handle 60 kW to 80 kW of thermal power. To that end, cooling system providers are optimizing cold plate designs and aiming to double or triple the capacity of CDUs. TrendForce expects the performance of liquid-to-liquid in-row CDUs to exceed 1.3 MW, with further advancements possible, so excessive heat dissipation will eventually cease to be a major problem.
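
To put those cooling figures in perspective, here is a rough back-of-the-envelope sketch in Python. It uses only the numbers quoted above and assumes, as a simplification not made in the TrendForce report, that all 140 kW of rack power ends up as heat in the liquid loop and that no safety headroom is required. The function and constant names are illustrative only.

```python
import math

# Figures quoted in the article, in kW.
RACK_POWER_KW = 140          # updated NVL72 rack power, per TrendForce
SIDECAR_CDU_KW = (60, 80)    # capacity range of current sidecar CDUs
IN_ROW_CDU_KW = 1300         # 1.3 MW liquid-to-liquid in-row CDU


def cdus_needed(rack_kw: float, cdu_kw: float) -> int:
    """Whole CDUs required for one rack.

    Assumption: every watt drawn by the rack becomes heat that the
    liquid loop must remove, with no extra headroom.
    """
    return math.ceil(rack_kw / cdu_kw)


def racks_served(cdu_kw: float, rack_kw: float) -> int:
    """Whole racks a single CDU could serve under the same assumption."""
    return math.floor(cdu_kw / rack_kw)


if __name__ == "__main__":
    low, high = SIDECAR_CDU_KW
    print("Sidecar CDUs per rack (60 kW units):", cdus_needed(RACK_POWER_KW, low))
    print("Sidecar CDUs per rack (80 kW units):", cdus_needed(RACK_POWER_KW, high))
    # Doubling or tripling today's capacity, as vendors reportedly aim to do.
    for factor in (2, 3):
        for base in (low, high):
            covers = base * factor >= RACK_POWER_KW
            print(f"{base} kW x{factor} = {base * factor} kW, covers one rack: {covers}")
    print("Racks per 1.3 MW in-row CDU:", racks_served(IN_ROW_CDU_KW, RACK_POWER_KW))
```

Under these assumptions, a single current sidecar CDU cannot cool one 140 kW rack, which is consistent with vendors aiming to double or triple capacity. Real deployments would also have to account for air-cooled components and design margins, so the results are indicative only.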

However, according to the report, power consumption and heat management are not the only issues that Nvidia and its partners have to solve. TrendForce claims that Nvidia has to optimize its interconnections but doesn't elaborate on which interconnections must be optimized. 

It remains to be seen how the claimed teething problems with Nvidia's B200 and GB200 servers will affect the launch timeframe and availability of B200A, which is based on simplified Blackwell processors, and of the B300 and GB300 machines featuring refreshed Blackwell GPUs. B200A will likely consume considerably less power than B200/GB200. The refreshed B300-series Blackwell GPUs, by contrast, promise more memory and higher compute performance, which usually comes at the cost of higher power. These products will therefore likely consume even more than 140 kW per rack, necessitating even more sophisticated components and cooling.


Anton Shilov is a contributing writer at Tom’s Hardware. Over the past couple of decades, he has covered everything from CPUs and GPUs to supercomputers and from modern process technologies and latest fab tools to high-tech industry trends.
