This tiny AMD PC just ran a massive 397B AI Model that required a server room full of GPUs a year ago


AMD's Ryzen AI Halo recently went on sale for $4,000, sparking an interesting debate about how it compares to Nvidia's slightly pricier DGX Spark offering.

The configuration that the Ryzen AI Halo offers, however, has been on the market for a few months now, and while most OEMs and enterprise providers are offering the same flavor and configuration, Shenzhen-based memory and storage company Longsys has taken things a step further.

The storage giant demonstrated a localized version of a 397B-parameter AI model running on its own version of the Ryzen AI Halo, featuring the same 16-core Ryzen AI Max+ 395 and 128GB of RAM configuration.

How was the Ryzen AI Max+ 395 able to run such a massive model with only 128GB of RAM?

While the model being run was not explicitly named, it appears to be a customized version derived from Alibaba's Qwen 3.5 397B (A17B), a multimodal foundation model that uses a Mixture-of-Experts (MoE) architecture, the same approach that made the original DeepSeek such a potent challenger.

Even with INT4 quantization, the model's memory requirements far exceed what the demonstration device has on offer: only 96GB of a 128GB unified configuration is available to the GPU as VRAM, versus the estimated 200-250GB of VRAM the model needs to run.
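The gap can be checked with back-of-the-envelope arithmetic. The sketch below is an illustration rather than Longsys's own calculation: it counts weight storage only (no KV cache or activations), uses decimal gigabytes, and assumes the "A17B" suffix means roughly 17B parameters are active per token. The function name `weight_memory_gb` is made up for this example.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight storage in decimal GB (ignores KV cache and activations)."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


# Figures taken from the article; the active-parameter count is an assumption
# based on reading "A17B" as about 17B active parameters per token.
TOTAL_PARAMS_B = 397
ACTIVE_PARAMS_B = 17
GPU_BUDGET_GB = 96

all_experts_gb = weight_memory_gb(TOTAL_PARAMS_B, 4)   # every expert at INT4
active_gb = weight_memory_gb(ACTIVE_PARAMS_B, 4)       # per-token working set at INT4
shortfall_gb = all_experts_gb - GPU_BUDGET_GB          # what cannot sit in VRAM

print(f"All weights at INT4:  {all_experts_gb:.1f} GB")
print(f"Active weights/token: {active_gb:.1f} GB")
print(f"Shortfall vs 96GB:    {shortfall_gb:.1f} GB")
```

Under these assumptions the full set of weights is about twice the GPU's budget, while the weights touched for any single token are a small fraction of it, which is why keeping idle experts somewhere other than VRAM is plausible at all.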

The secret sauce is Longsys's recently unveiled custom SPU and iSA configuration, which can compress data in real time. The company says this allows it to fit as much as twice the amount of data on storage drives of up to 128GB, while a caching layer considerably reduces DRAM requirements.

The approach involves offloading experts that are not in active use to a large, fast storage buffer, from which the AI chip can reload them when needed.
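Longsys has not published how its offloading works, so the following is only a minimal sketch of the general idea: keep a fixed number of experts resident in fast memory and fetch the rest from storage on demand. It uses a plain least-recently-used (LRU) eviction policy as a stand-in, and the names `ExpertCache` and `fake_load` are hypothetical.

```python
from collections import OrderedDict
from typing import Callable, Hashable, List


class ExpertCache:
    """Keeps at most `capacity` experts in memory, evicting the least recently used."""

    def __init__(self, capacity: int, load_from_storage: Callable[[Hashable], object]):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self._load = load_from_storage      # slow path: read an expert from storage
        self._resident: "OrderedDict[Hashable, object]" = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, expert_id: Hashable) -> object:
        if expert_id in self._resident:
            self._resident.move_to_end(expert_id)   # mark as most recently used
            self.hits += 1
            return self._resident[expert_id]
        self.misses += 1
        weights = self._load(expert_id)             # bring the expert back from storage
        self._resident[expert_id] = weights
        if len(self._resident) > self.capacity:
            self._resident.popitem(last=False)      # drop the least recently used expert
        return weights

    def resident_ids(self) -> List[Hashable]:
        return list(self._resident)


# Stand-in for a storage read; a real system would load tensors from disk.
storage_reads = []

def fake_load(expert_id):
    storage_reads.append(expert_id)
    return f"weights-for-expert-{expert_id}"


cache = ExpertCache(capacity=2, load_from_storage=fake_load)
for expert_id in [0, 1, 0, 2, 1]:   # experts chosen by a hypothetical router
    cache.get(expert_id)

print("hits:", cache.hits, "misses:", cache.misses)
print("storage reads:", storage_reads)
print("resident experts:", cache.resident_ids())
```

Every miss in a scheme like this costs a storage read, which is why the speed of the storage buffer and the quality of the caching policy matter so much for inference throughput.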


In a press release, Longsys claimed its approach worked by targeting "the pain points of MoE LLMs", such as large parameter counts, rapid KV Cache expansion, and I/O latency that hampers inference efficiency.

"It leverages expert offloading, intelligent cache management, and predictive prefetch algorithms to efficiently resolve storage scheduling challenges and comprehensively improve local AI inference smoothness," the company added.

It is important to note that while the demonstration is an impressive feat, Longsys did not provide performance specifics such as tokens per second, an area where the Ryzen AI chip is relatively limited compared to most modern AI GPU offerings.

Regardless, the approach that essentially treats storage as memory suggests that localized AI might be able to run considerably larger models, and that memory might not be as hard a constraint for certain approaches.

It suggests that memory constraints can be circumvented by leveraging fast storage, allowing a frontier-level model that would otherwise require tens of thousands of dollars in AI hardware to run locally, which is no small feat. It also means that models previously confined to datacenters can now run on a device that fits in the palm of your hand.


Rahim Amir is a UAE-based tech writer who enjoys building PCs as much as he enjoys writing about them. He has been professionally writing about PC hardware since 2023, focusing on buyer’s guides, hardware reviews, and sponsored content and features related to tech.

Having built hundreds of gaming PCs and being an avid gamer in his spare time, Rahim tends to have stronger opinions about hardware than most. This is particularly on display when he gets his way with powerful, but minimalistic RGB builds even as Small Form Factor (SFF) PCs come a close second.
