
Huawei's CloudMatrix AI cluster takes a comparatively simple approach in its attempt to beat Nvidia, and the company's researchers and an outside firm claim it has worked, at least in one instance. A recent technical paper reports that a cluster of Ascend 910C chips has surpassed the performance of Nvidia's H800 chip in running DeepSeek's R1 LLM.
Huawei, in collaboration with Chinese AI startup SiliconFlow, published a technical paper finding that its CloudMatrix 384 cluster can outperform Nvidia hardware when running DeepSeek models. The cluster's hardware and software stack was found to outpace systems using Nvidia's H800, a variant of the H100 pared down for export to China, as well as the H100 itself, when running DeepSeek's 671-billion-parameter R1 model.
The CloudMatrix 384 is a brute-force solution for the company, which is barred from access to the leading edge of chip production. The CloudMatrix is a rack-scale system that combines 384 dual-chiplet HiSilicon Ascend 910C NPUs with 192 CPUs across 16 server racks, using optical connections for all intra- and inter-server communications to enable blisteringly quick interconnects.
The research paper contends that Huawei's goal with the CM384 was to "reshape the foundation of AI infrastructure," with one Huawei scientist adding that the paper itself was published "to build confidence within the domestic technology ecosystem in using Chinese-developed NPUs to outperform Nvidia’s GPUs."
On paper, the CloudMatrix 384 cluster can put out more raw compute than Nvidia's GB200 NVL72 system, delivering 300 PFLOPS of BF16 compute versus the NVL72's 180 BF16 PFLOPS. The Huawei cluster also has software to compete with Nvidia's for LLM serving: the CloudMatrix-Infer solution was claimed to process prompts (prefill) at 4.45 tokens per second per TFLOPS and generate responses (decode) at 1.29 tokens per second per TFLOPS, efficiency that the paper claims outpaces Nvidia's SGLang framework.
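For readers unfamiliar with the metric, the short sketch below shows what "tokens per second per TFLOPS" means in practice: serving throughput normalized by a chip's peak compute, which lets software stacks be compared across dissimilar hardware. The throughput and TFLOPS values in it are hypothetical placeholders, not figures from the paper.

```python
# Illustration of the "tokens per second per TFLOPS" metric: raw serving
# throughput divided by the accelerator's peak compute, so software efficiency
# can be compared across different chips.
# NOTE: both numbers below are hypothetical placeholders, not paper figures.
measured_decode_tokens_per_s = 1500.0   # hypothetical per-NPU decode throughput
chip_peak_tflops = 1000.0               # hypothetical per-NPU peak TFLOPS

efficiency = measured_decode_tokens_per_s / chip_peak_tflops
print(f"Decode efficiency: {efficiency:.2f} tokens/s per TFLOPS")  # 1.50 here
```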
Of course, the CloudMatrix 384 is not better than Nvidia's solutions across the board, and its major downside is power consumption and efficiency. The CloudMatrix draws nearly four times the power of Nvidia's GB200 NVL72, consuming a total of 559 kW compared to the NVL72's 145 kW. Cramming more chips into one system surpasses Nvidia in raw compute, but at the cost of per-watt efficiency, which is roughly 2.3x lower.
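That 2.3x figure follows directly from the numbers above; the sketch below is a back-of-the-envelope check that divides each system's quoted peak BF16 compute by its quoted power draw. Real-world efficiency under an actual serving workload will differ.

```python
# Back-of-the-envelope compute-per-watt check using the figures quoted above
# (peak BF16 compute divided by total system power draw).
cloudmatrix = {"pflops_bf16": 300, "power_kw": 559}   # Huawei CloudMatrix 384
nvl72 = {"pflops_bf16": 180, "power_kw": 145}         # Nvidia GB200 NVL72

cm_per_kw = cloudmatrix["pflops_bf16"] / cloudmatrix["power_kw"]   # ~0.54 PFLOPS/kW
nv_per_kw = nvl72["pflops_bf16"] / nvl72["power_kw"]               # ~1.24 PFLOPS/kW

print(f"CloudMatrix 384: {cm_per_kw:.2f} PFLOPS/kW")
print(f"GB200 NVL72:     {nv_per_kw:.2f} PFLOPS/kW")
print(f"Nvidia per-watt advantage: {nv_per_kw / cm_per_kw:.1f}x")  # ~2.3x
```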
However, Chinese customers interested in the CloudMatrix are banned from accessing Nvidia-powered AI clusters, so these comparisons matter slightly less to them. Not to mention, energy is abundant in mainland China, with electricity prices in the region falling nearly 40% over the last three years.
As Nvidia boss Jensen Huang shared at France's VivaTech earlier this month, Nvidia remains solidly ahead of Huawei in chip-for-chip performance. "Our technology is a generation ahead of theirs," Huang claimed, and Huawei is quick to agree internally. But, as Huang also added, "AI is a parallel problem, so if each one of the computers is not capable … just add more computers."
The CloudMatrix, 16 racks large and energy-hungry though it is, still presents a compelling choice for Chinese customers looking for peak LLM performance, thanks to its wicked-fast interconnects and solid software stack. For those looking for a deeper dive into the CloudMatrix 384, our article from its release gets much further into the weeds of what helps the "AI supernode" outpace Nvidia's offerings.