xAI Colossus supercomputer with 100K H100 GPUs comes online — Musk lays out plans to double GPU count to 200K with 50K H100 and 50K H200

(Image credit: Elon Musk on X / Twitter)

Elon Musk's xAI has brought the world's most powerful AI training system online. The Colossus supercomputer uses as many as 100,000 Nvidia H100 GPUs for training and is set to expand with another 50,000 H100 and 50,000 H200 GPUs in the coming months.

"This weekend, the xAI team brought our Colossus 100K H100 training cluster online," Elon Musk wrote in an X post. "From start to finish, it was done in 122 days. Colossus is the most powerful AI training system in the world. Moreover, it will double in size to 200K (50K H200s) in a few months."

According to Michael Dell, CEO of Dell Technologies, the company developed and assembled the Colossus system quickly, which highlights the considerable experience the server maker has accumulated deploying AI servers during the AI boom of the past few years.

Elon Musk and his companies have been busy making supercomputer-related announcements recently. In late August, Tesla announced its Cortex AI cluster featuring 50,000 Nvidia H100 GPUs and 20,000 of Tesla's wafer-sized Dojo AI chips. Even before that, in late July, xAI kicked off AI training on the Memphis Supercluster, comprising 100,000 liquid-cooled H100 GPUs. This supercomputer likely has to consume at least 150 MW of power: the 100,000 H100 GPUs alone draw around 70 MW (roughly 700W each), before cooling, networking, and other infrastructure are factored in, as the sketch below illustrates.
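As a back-of-the-envelope check of those figures, here is a minimal sketch of the arithmetic. The ~700W per-GPU figure matches the H100 SXM module's rated power; the facility overhead multiplier is an assumption chosen to illustrate how the total could reach roughly 150 MW, not a measured value for this site.

```python
# Rough cluster power estimate: a minimal sketch, not site data.

GPU_COUNT = 100_000
WATTS_PER_H100 = 700        # H100 SXM rated board power, approximate
FACILITY_OVERHEAD = 2.15    # assumed multiplier (cooling, networking,
                            # storage, conversion losses) for illustration

gpu_power_mw = GPU_COUNT * WATTS_PER_H100 / 1e6
facility_power_mw = gpu_power_mw * FACILITY_OVERHEAD

print(f"GPU power alone: {gpu_power_mw:.0f} MW")                # ~70 MW
print(f"Estimated facility power: {facility_power_mw:.0f} MW")  # ~150 MW
```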

Although all of these clusters are formally operational and even training AI models, it is entirely unclear how many GPUs are actually online today. First, it takes time to debug and optimize the settings of such superclusters. Second, xAI needs to ensure the machines get enough power: while the company has been using 14 diesel generators to feed its Memphis supercomputer, even those were not enough to power all 100,000 H100 GPUs.
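For a sense of why on-site generation alone falls short, here is a hedged sketch assuming a typical ~2.5 MW rating per large mobile generator; the actual ratings of the units deployed in Memphis were not disclosed, so this number is an assumption for illustration only.

```python
# Why 14 generators may fall short of the GPU load: a minimal sketch.

GENERATORS = 14
MW_PER_GENERATOR = 2.5   # assumed rating for a large mobile genset
GPU_LOAD_MW = 70         # 100,000 H100s at ~700 W each (see above)

generator_capacity_mw = GENERATORS * MW_PER_GENERATOR
shortfall_mw = GPU_LOAD_MW - generator_capacity_mw

print(f"Generator capacity: {generator_capacity_mw:.0f} MW")   # ~35 MW
print(f"Shortfall vs. GPU load alone: {shortfall_mw:.0f} MW")  # ~35 MW
```

Under that assumption, the generators cover only about half of the GPUs' draw before facility overhead, which would explain the reports that they could not feed the full cluster.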

xAI's training of the Grok 2 large language model (LLM) required up to 20,000 Nvidia H100 GPUs, and Musk has predicted that future versions, such as Grok 3, will need even more resources, potentially around 100,000 H100 processors. To that end, xAI needs its vast data centers both to train Grok 3 and to run inference on the model afterward.


Anton Shilov is a contributing writer at Tom’s Hardware. Over the past couple of decades, he has covered everything from CPUs and GPUs to supercomputers, and from modern process technologies and the latest fab tools to high-tech industry trends.
