Intel and AMD's new ACE CPU extensions bring an efficient AI-oriented instruction set to x86 — a new design makes matrix multiplication more power- and density-efficient


Almost everything you hear about "running an AI model" involves a GPU of some sort, but not every AI task is suited to that hardware. Smaller models or single-user, latency-sensitive operations can benefit from running on the CPU instead, as that avoids the overhead of shuffling data to and from the GPU. There are also many situations where there is no GPU available to begin with, or it's a meager integrated affair with limited capabilities. Intel and AMD have recently released the full specification for the ACE CPU extensions, which make it easier and more power-efficient to run those AI tasks on x86 processors.

ACE is a technical standard that reuses the existing AVX10 registers but adds silicon dedicated to matrix multiplication. This brings multiple benefits, but the key advantages are better power efficiency, easier development and optimization, and the reuse of AVX's 512-bit inputs. That last point makes for easy integration with existing designs, as it removes the need for ACE-specific inputs.

Matrix multiplication is the cornerstone of AI workloads: take a table of numbers, and run a multiplication-addition loop over the whole thing. This has always been possible with almost any CPU, though at limited speed. Even today, running these loops uses a lot of power, even when leveraging x86's AVX10 multiply-accumulate instructions. That approach is technically a hack, as AVX wasn't designed with 2D matrix multiplication in mind.
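For readers who want to see the "multiplication-addition loop" spelled out, here is a minimal sketch in plain Python. The function name `matmul` and the list-of-rows layout are illustrative choices, not anything taken from the ACE or AVX10 specifications, and a real implementation would use vector or matrix instructions rather than scalar loops.

```python
def matmul(a, b):
    """Multiply two matrices (lists of rows) with a plain multiply-accumulate loop."""
    rows, inner, cols = len(a), len(b), len(b[0])
    if any(len(row) != inner for row in a):
        raise ValueError("inner dimensions of a and b do not match")

    out = [[0] * cols for _ in range(rows)]
    for i in range(rows):            # each row of the left matrix
        for j in range(cols):        # each column of the right matrix
            acc = 0
            for k in range(inner):   # the multiply-accumulate step
                acc += a[i][k] * b[k][j]
            out[i][j] = acc
    return out


print(matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # [[19, 22], [43, 50]]
```

The innermost line is the operation that dedicated matrix hardware performs in bulk; everything around it is loop overhead that the CPU otherwise has to spend instructions on.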

For the same number of input vectors, ACE can perform 16x as many operations as AVX10. Note that this doesn't necessarily mean a 16x speedup, as that will depend on each individual implementation, but it's reasonable to expect that Intel and AMD will dedicate more silicon to this task in future designs to improve performance. Plus, as each ACE instruction performs more work than its equivalent AVX10 loop, there's less CPU instruction overhead and potentially better RAM bandwidth usage right off the bat.

The benefits go far beyond just using fewer instructions for the same thing. ACE is intended to be implementation-agnostic, meaning that ML frameworks and their underlying libraries (PyTorch, TensorFlow) can just write one code path instead of having multiple variations depending on the underlying hardware and its degree of AVX support.

ACE natively supports almost every data type used in ML operations (including but not limited to INT8, INT32, FP8, FP16, FP32, and BF16), and it can also use the Open Compute Project's MX block-scaled formats natively, something that AVX10 does not provide. Developers will also be able to move some NPU-specific workloads back to the CPU when they need something done immediately. In those situations, not having to deal with the differences between NPUs is a huge boon, too, as ACE offers a consistent target across x86 hardware.
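To give a rough idea of what "block-scaled" means, the sketch below quantizes a list of floats in blocks that each share a single power-of-two scale, which is the general idea behind the MX formats. This is a simplification and not the actual MX encoding: the block size of 32 follows the OCP MX specification as we understand it, but the signed 8-bit integer elements, the function names, and the rounding rule are assumptions made for illustration only.

```python
import math

BLOCK_SIZE = 32  # assumed block size; elements in a block share one scale


def quantize_block(values, elem_bits=8):
    """Return (shared_exponent, ints) for one block of floats."""
    max_int = 2 ** (elem_bits - 1) - 1  # 127 for 8-bit elements
    peak = max(abs(v) for v in values)
    if peak == 0:
        return 0, [0] * len(values)
    # Smallest power-of-two scale that keeps the largest value in range.
    exp = math.ceil(math.log2(peak / max_int))
    scale = 2.0 ** exp
    ints = [max(-max_int, min(max_int, round(v / scale))) for v in values]
    return exp, ints


def dequantize_block(exp, ints):
    """Rebuild approximate floats from a shared exponent and integers."""
    scale = 2.0 ** exp
    return [i * scale for i in ints]


def quantize(values, block_size=BLOCK_SIZE):
    """Split values into blocks and quantize each one independently."""
    return [
        quantize_block(values[start:start + block_size])
        for start in range(0, len(values), block_size)
    ]


exp, ints = quantize_block([0.5, -1.0, 0.25, 0.0])
print(exp, ints)                    # -6 [32, -64, 16, 0]
print(dequantize_block(exp, ints))  # [0.5, -1.0, 0.25, 0.0]
```

Because the scale is stored once per block rather than once per number, each element can stay very small while the block as a whole still covers a wide range of magnitudes.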


Bruno Ferreira is a contributing writer for Tom's Hardware. He has decades of experience with PC hardware and assorted sundries, alongside a career as a developer. He's obsessed with detail and has a tendency to ramble on the topics he loves. When not doing that, he's usually playing games, or at live music shows and festivals.
