Tom’s Hardware
Jon Martindale

Nvidia claims software and hardware upgrades allow Blackwell Ultra GB300 to dominate MLPerf benchmarks — touts 45% DeepSeek R1 inference throughput increase over GB200

Image: Nvidia Blackwell Ultra server stack.

Nvidia has broken its own records in MLPerf benchmarks using its latest-generation Blackwell Ultra GB300 NVL72 rack-scale system, delivering what it claims is a 45% increase in inference performance over the Blackwell-based GB200 platform in DeepSeek R1 tests. By combining hardware improvements with software optimizations, Nvidia claims the top spot across a range of models, and it suggests this should be a primary consideration for developers building out "AI factories," since higher throughput could mean major gains in revenue generation.

Nvidia's Blackwell architecture is at the heart of its latest-generation RTX 50-series graphics cards, which offer the best gaming performance available, even if AMD's RX 9000-series arguably offers better bang for your buck. But it's also what's under the hood of the big AI-powering GPU stacks like the GB200 platform, which is being built into data centers all over the world to power next-generation AI applications. Blackwell Ultra, sold as GB300, is the enhanced version of that design with even more performance, and Nvidia has now used it to set some impressive MLPerf records.

The latest version of the MLPerf benchmark includes inference performance testing using the DeepSeek R1, Llama 3.1 405B, Llama 3.1 8B, and Whisper models, and the GB300 NVL72 stole the show in all of them. Nvidia claims a 45% increase in performance over GB200 when running the DeepSeek model, and up to five times the performance of older Hopper GPUs, although Nvidia does note those comparative Hopper results came from unverified third-party submissions.

Part of these performance enhancements comes from the more capable tensor cores in Blackwell Ultra, with Nvidia claiming "2X the attention-layer acceleration and 1.5X more AI compute FLOPS." However, the gains were also made possible by a range of important software improvements and optimizations.

Nvidia leaned heavily on its NVFP4 format for these benchmarks, quantizing the DeepSeek R1 weights into a compact 4-bit representation that reduces the overall model size and lets Blackwell Ultra accelerate the calculations for higher throughput while maintaining accuracy.
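NVFP4 is Nvidia's block-scaled 4-bit floating-point format: weights are snapped onto the FP4 (E2M1) grid, and small blocks of values share a scale factor so the limited grid still covers each block's dynamic range. As a rough illustration of the idea, here is a minimal NumPy sketch of block-scaled 4-bit quantization. It is a simplification, not Nvidia's implementation: real NVFP4 uses 16-value blocks with FP8 scales (plus a tensor-level scale) and runs in hardware, and the function names below are purely illustrative.

```python
import numpy as np

# Representable magnitudes of the FP4 E2M1 format NVFP4 is built on.
E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_blockwise_fp4(weights: np.ndarray, block_size: int = 16):
    """Quantize a 1-D weight vector to a block-scaled 4-bit format.

    Assumes len(weights) is a multiple of block_size. Each block shares
    one scale, chosen so the block's largest magnitude maps to the top
    of the E2M1 range (6.0). Real NVFP4 stores the scale in FP8; we
    keep float32 here for clarity.
    """
    blocks = weights.reshape(-1, block_size)
    scales = np.abs(blocks).max(axis=1, keepdims=True) / E2M1_GRID[-1]
    scales = np.where(scales == 0, 1.0, scales)  # avoid divide-by-zero
    scaled = blocks / scales
    # Snap each scaled value to the nearest E2M1 magnitude, keeping sign.
    idx = np.abs(np.abs(scaled)[..., None] - E2M1_GRID).argmin(axis=-1)
    return np.sign(scaled) * E2M1_GRID[idx], scales

def dequantize(quantized: np.ndarray, scales: np.ndarray) -> np.ndarray:
    return (quantized * scales).reshape(-1)

rng = np.random.default_rng(0)
w = rng.standard_normal(64).astype(np.float32)
q, s = quantize_blockwise_fp4(w)
print("max abs error:", np.abs(w - dequantize(q, s)).max())
```

Each weight needs only 4 bits (a sign plus one of eight magnitudes) instead of 16 or 32, which is where the model-size reduction comes from; the per-block scales are what keep the quantization error small enough to preserve accuracy.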

For other benchmarks, like the larger Llama 3.1 405B model, Nvidia was able to "shard" the model across multiple GPUs at once, enabling higher throughput while still meeting latency targets. This was only practical because of the 1.8 TBps of NVLink fabric bandwidth available to each of the rack's 72 GPUs, for a total aggregate bandwidth of roughly 130 TBps (72 x 1.8 TBps).
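The article doesn't detail the exact parallelism scheme, but "sharding" a model this way typically means tensor parallelism: each large weight matrix is split across GPUs, every GPU computes its slice of the output, and a collective operation over NVLink reassembles the result. The NumPy sketch below simulates a column-parallel split on a single machine; the 8-way split and function names are illustrative, and a real NVL72 deployment would run the gather as an NCCL collective over the NVLink fabric rather than a host-memory concatenate.

```python
import numpy as np

def shard_columns(weight: np.ndarray, num_gpus: int) -> list[np.ndarray]:
    """Split a weight matrix column-wise, one shard per GPU."""
    return np.split(weight, num_gpus, axis=1)

def tensor_parallel_matmul(x: np.ndarray, shards: list[np.ndarray]) -> np.ndarray:
    # Each "GPU" multiplies the input by its own weight shard; these
    # matmuls would run concurrently on separate devices.
    partials = [x @ w for w in shards]
    # All-gather step: reassemble the full activation. On real hardware
    # this is the traffic the NVLink fabric carries.
    return np.concatenate(partials, axis=-1)

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 1024))      # a small batch of activations
W = rng.standard_normal((1024, 4096))   # one large layer's weights
shards = shard_columns(W, num_gpus=8)   # illustrative 8-way split
out = tensor_parallel_matmul(x, shards)
assert np.allclose(out, x @ W)          # matches the unsharded result
```

The higher the per-GPU link bandwidth, the cheaper that gather step becomes, which is why a fat NVLink fabric lets a 405B-parameter model be split widely without blowing past latency limits.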

All of this is part of Nvidia's pitch for Blackwell Ultra as economically disruptive for "AI factory" development. Greater inference throughput from improved hardware and software optimizations makes GB300 a potentially more profitable platform in Nvidia's vision of a tokenized future for data center workloads. With GB300 shipments set to start this month, the timing of these new benchmark results seems like no coincidence.
