Making AI work is increasingly a matter of the network, latest benchmark test shows

The latest test of speed in training an artificial intelligence (AI) neural network is only partly about the fastest chips from Nvidia, AMD, and Intel. Increasingly, speed also depends on the connections between those chips: the computer networking approaches over which vendors and technologies are battling.

MLCommons, the consortium that benchmarks AI systems, on Wednesday announced the latest scores from Nvidia and others for what's called MLPerf Training, a twice-yearly report of how long it takes, in minutes, to train a neural network such as a large language model (LLM) "to convergence," meaning until the neural network can perform at a specified level of accuracy.

[Chart: the tasks that make up MLPerf Training v5.0. Source: MLCommons]

The latest results show how large the AI systems have become. The scaling of chips and related components is making AI computers ever more dependent on the connections between the chips. 

This round, called 5.0, is the twelfth installment of the training test. In the six years since the first round, 0.5, the number of GPUs used has soared from 32 chips to 8,192, the size of the largest system in the current round.

Because AI systems are scaling to thousands of chips, and, in the real world, tens of thousands, hundreds of thousands, and, eventually, millions of GPU chips, "the network, and the configuration of the network, and the algorithms used to map the problem onto the network, become more significant," said David Kanter, head of the MLCommons, in a media briefing to discuss the results.

Much of AI is a matter of simple math, linear algebra operations, such as a vector multiplied by a matrix. The magic happens when those operations are performed in parallel across many chips, with different versions of the data. 
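
To make that concrete, here is a minimal sketch in NumPy (an illustration, not code from any MLPerf submission; the shapes and random data are made up) of the vector-matrix product at the heart of a neural-network layer, and of how stacking inputs into a batch turns it into work that parallelizes naturally:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((512, 256))   # a layer's weight matrix (outputs x inputs)
x = rng.standard_normal(256)          # one input vector

y = W @ x                             # the core operation: a vector-matrix product, shape (512,)

# Training runs this over many inputs at once. Stacking 64 inputs into a
# batch turns the work into a matrix-matrix product, which is exactly the
# kind of math that can be split across many chips in parallel.
X = rng.standard_normal((64, 256))
Y = X @ W.T                           # shape (64, 512): one output row per input
print(y.shape, Y.shape)
```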

"One of the simplest ways to do that is with something called data parallelism, where you have the same [AI] model on multiple nodes," said Kanter, referring to parts of a multi-chip computer, called nodes, that can function independently of one another. "Then the data just comes in, and then you communicate those results," across all parts of the computer, he said. 

"Networking is quite intrinsic to this," added Kanter. "You'll often see different communications algorithms that get used for different topologies and different scales," referring to the arrangement of chips and how they're connected, the compute "topology."

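As a rough illustration of what Kanter is describing, here is a minimal data-parallel training step sketched with PyTorch's torch.distributed. It assumes a process group has already been initialized (for example, via torchrun); the helper name data_parallel_step is hypothetical, and this is not the code behind any of the submitted systems.

```python
import torch
import torch.distributed as dist

def data_parallel_step(model, loss_fn, inputs, targets, lr=1e-3):
    """One data-parallel step: every rank holds an identical copy of the
    model and computes gradients on its own shard of the data."""
    loss = loss_fn(model(inputs), targets)
    loss.backward()

    # Average gradients across all ranks with an all-reduce -- the
    # collective communication whose cost depends on network topology and scale.
    world_size = dist.get_world_size()
    for p in model.parameters():
        if p.grad is not None:
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
            p.grad /= world_size

    # Every rank applies the same averaged update, so the model copies
    # stay in sync without ever exchanging the weights themselves.
    with torch.no_grad():
        for p in model.parameters():
            if p.grad is not None:
                p -= lr * p.grad
                p.grad.zero_()
    return loss.item()
```
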
The largest system in this training round, with 8,192 chips, was submitted by Nvidia, whose chips, as usual, turned in the fastest scores for all of the benchmark tests. Nvidia's machine was built using its most widely deployed production part, the H100 GPU, in conjunction with 2,048 Intel CPU chips.

A more powerful system, however, made its debut: one built on Nvidia's combined CPU-GPU part, the Grace Blackwell GB200. It was entered into the test in a joint effort between IBM and AI cloud-hosting giant CoreWeave, in the form of a machine that takes up a whole equipment rack, called the NVL72.

The largest configuration submitted by CoreWeave and IBM carries 2,496 Blackwell GPUs and 1,248 Grace CPUs. (While the GB200 NVL72 was submitted by IBM and CoreWeave, the machine's design belongs to Nvidia.) 

The benchmark drew a record 201 performance submissions from 20 submitting organizations, including Nvidia, Advanced Micro Devices, ASUSTeK, Cisco Systems, CoreWeave, Dell Technologies, GigaComputing, Google Cloud, Hewlett Packard Enterprise, IBM, Krai, Lambda, Lenovo, MangoBoost, Nebius, Oracle, Quanta Cloud Technology, SCITIX, Supermicro, and TinyCorp.

The latest round of the benchmark consisted of seven individual tasks, including training the BERT large language model and training the Stable Diffusion image-generation model.

This round saw the addition of a new test of speed: how long it takes to fully train Meta Platforms' Llama 3.1 405B large language model. That task was completed in just under 21 minutes on the fastest system, the Nvidia 8,192 H100 machine. The Grace-Blackwell system with 2,496 GPUs was not far behind, at just over 27 minutes.
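
A rough, back-of-the-envelope comparison of those two results gives a sense of the per-chip gap, though it ignores differences in software stacks, networking, and scaling efficiency between the two systems:

```python
# Figures reported above, rounded to whole minutes.
h100_gpus, h100_minutes = 8192, 21      # Nvidia H100 system, just under 21 minutes
gb200_gpus, gb200_minutes = 2496, 27    # Grace-Blackwell NVL72 system, just over 27 minutes

# Total GPU-minutes is a crude proxy for how much work each chip did.
h100_gpu_minutes = h100_gpus * h100_minutes
gb200_gpu_minutes = gb200_gpus * gb200_minutes

print(f"H100 system:  {h100_gpu_minutes:,} GPU-minutes")
print(f"GB200 system: {gb200_gpu_minutes:,} GPU-minutes")
print(f"Rough per-GPU advantage for Blackwell: {h100_gpu_minutes / gb200_gpu_minutes:.1f}x")
```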

Full results and specs of the machines can be seen on the MLCommons site.

Within those numbers, there is no exact measure of how much of a role networking plays in giant systems. Test results from one generation of MLPerf to another show improvement on the same benchmarks, even with the same number of chips.

[Chart: MLPerf Training v5.0 64-chip results compared with the prior round. Source: MLCommons]

For example, the best time to train Stable Diffusion using 64 chips at a time dropped to three minutes from 10 in the prior round, last fall. How much of that drop is due to the chips getting better versus improved networking and systems engineering is hard to say.

Instead, participants in MLPerf pointed to several factors that can lead to measurable performance differences.

"The connection scalability is more important as you have to scale the size of the network," said Rachata Ausavarungnirun of MangoBoost, a maker of SmartNIC technology and software, in the same media briefing. MangoBoost submitted machines assembled with eight, 16, and 32 Advanced Micro Devices' MI300X GPUs, which compete with the Nvidia chips.

That element of connection scalability, said Ausavarungnirun, has to do with "not just how fast the compute will take or how fast the memory is, but how much of the network becomes the bottleneck and has to be accelerated. That gets more and more important as you grow" the number of chips.
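
A back-of-the-envelope model shows why that happens (the numbers below are hypothetical, not figures from the benchmark, and the function name is made up): in data parallelism, a ring all-reduce moves roughly twice the gradient size per GPU regardless of how many GPUs participate, while each GPU's share of the compute shrinks as chips are added, so the network claims a growing fraction of every step.

```python
def compute_fraction(num_gpus, total_compute_seconds, grad_gb, link_gb_per_sec):
    """Fraction of a training step spent on useful math rather than waiting
    on the gradient all-reduce (simplified model with no overlap)."""
    compute = total_compute_seconds / num_gpus      # compute is divided across GPUs
    comm = 2 * grad_gb / link_gb_per_sec            # all-reduce cost stays roughly flat
    return compute / (compute + comm)

# Hypothetical workload: 4,000 GPU-seconds of math per step, 10 GB of
# gradients, 50 GB/s of usable network bandwidth per GPU.
for n in (64, 512, 4096):
    print(f"{n:5d} GPUs -> {compute_fraction(n, 4000, 10, 50):.0%} of the step spent computing")
```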

Different networking technologies, such as Ethernet, and different networking protocols, such as TCP/IP, "have different characteristics in terms of how much effective throughput these different [AI] models are able to actually see," said Chetan Kapoor of CoreWeave, which submitted the Nvidia NVL72, in the same media briefing.

Such a difference in throughput "directly maps to the overall system utilization," he said, meaning it can improve or degrade how efficiently the chips are used to do those linear algebra operations.

"I think that's also a factor that the industry is making a lot of progress on, which is to continue to push the boundaries of effective network utilization," said Kapoor.

Part of Nvidia's achievement of "phenomenal scaling efficiency" is the communication going on inside its machines, said Dave Salvator, Nvidia's director of accelerated computing products, in a separate media briefing. 

The 2,496-way Grace-Blackwell NVL72 was able to achieve what's known as 90% scaling efficiency, meaning the performance of the machine improves almost in direct proportion to how many chips are connected together, said Salvator. 
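
Scaling efficiency is typically computed by comparing the speedup you actually get from adding chips to the speedup perfect scaling would give. A small illustration with made-up numbers (not the figures behind Nvidia's result):

```python
def scaling_efficiency(base_gpus, base_minutes, big_gpus, big_minutes):
    actual_speedup = base_minutes / big_minutes   # how much faster the big system really is
    ideal_speedup = big_gpus / base_gpus          # how much faster it would be if scaling were perfect
    return actual_speedup / ideal_speedup

# Hypothetical: a job that takes 270 minutes on 512 GPUs and 38 minutes on
# 4,096 GPUs (8x the chips) has scaled at roughly 89% efficiency.
print(f"{scaling_efficiency(512, 270, 4096, 38):.0%}")
```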

To reach that level of efficiency, Nvidia made the most of its NVLink communications technology that connects the chips, said Salvator. "It's also things like our collective communications libraries, called NCCL, and our ability to do things like overlap compute and communications to really get that best scaling efficiency," said Salvator.
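
Here is a sketch of what compute/communication overlap can look like in practice, using PyTorch's asynchronous all-reduce. It assumes PyTorch 2.1 or later and an already-initialized NCCL process group, and the function name is hypothetical; it illustrates the idea Salvator describes rather than Nvidia's or NCCL's internal implementation. Each layer's gradient transfer is launched the moment the gradient is ready, so the network works while earlier layers are still computing.

```python
import torch
import torch.distributed as dist

def backward_with_overlap(model, loss):
    """Run backward() while overlapping gradient all-reduces with compute."""
    handles = []

    def launch_allreduce(param):
        # Fires as soon as this parameter's gradient has been accumulated.
        handles.append(dist.all_reduce(param.grad, async_op=True))

    hooks = [p.register_post_accumulate_grad_hook(launch_allreduce)
             for p in model.parameters() if p.requires_grad]

    loss.backward()                     # compute and communication now interleave

    for h in handles:                   # wait for any transfers still in flight
        h.wait()
    world_size = dist.get_world_size()
    for p in model.parameters():        # average, since all_reduce summed the grads
        if p.grad is not None:
            p.grad /= world_size
    for hook in hooks:
        hook.remove()
```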

Although it's difficult to isolate the role of networking, and although results improve from one round to the next even with the same number of chips, the results reinforce the continued value of building ever-bigger systems, which has been an article of faith in the AI field. Increasing the number of chips dramatically reduces training time.

[Chart: MLPerf Training v5.0 speed-ups compared with Moore's Law. Source: MLCommons]

Kanter showed a graph comparing test-time improvements since the 0.5 round. The speed-up outpaces the improvement of any single computer chip, he said, precisely because building such a machine is a whole-system problem that includes factors such as network efficiency.

"What you can see is that through the combination of silicon architecture, algorithms, scale, everything, we're outpacing Moore's Law," said Kanter, referring to the decades-old semiconductor industry rule of thumb for progress in transistors. That speed-up is especially the case, he said, "on some of the most pressing workloads of the day, things like generative AI."

"This is actually setting a pretty high bar," said Kanter.
