Blackwell Ultra: Pushing GPU Performance To New Levels
NVIDIA’s Blackwell platform is already known for powering modern AI inference, but the new Blackwell Ultra-based GB300 NVL72 system takes things even further. While this hardware is aimed mainly at data centers and AI clouds rather than home gaming PCs, it still represents the cutting edge of GPU and system design that eventually influences consumer hardware.
The original Blackwell platform has been widely adopted by major inference providers such as Baseten, DeepInfra, Fireworks AI and Together AI. These companies use Blackwell to cut the cost per AI token to as little as one tenth of what earlier generations required.
The new GB300 NVL72 system, based on the Blackwell Ultra GPU, builds on this with dramatically higher performance per watt and lower cost per token. This is especially important for two fast-growing AI uses:
- Agentic AI that performs multistep reasoning and actions
- AI coding assistants that operate on huge codebases with very long context windows
These workloads are extremely demanding. They need very low latency so interactions feel real time, and they need to handle large amounts of context when analyzing entire repositories of code. That is where GB300 NVL72 shines.
Throughput, Latency And Cost: How GB300 NVL72 Scales
Independent analysis shows how big the jump is compared with NVIDIA’s previous Hopper platform. When NVIDIA combines hardware advances with software optimizations, the improvements stack up.
The GB200 NVL72 system, based on the earlier Blackwell chips, already delivered more than ten times as many tokens per watt as Hopper, which translated to roughly one tenth the cost per token. Continuous software work from the teams behind NVIDIA TensorRT-LLM, NVIDIA Dynamo, Mooncake and SGLang has further boosted Blackwell performance, especially for the mixture-of-experts (MoE) architectures common in today’s large language models.
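The relationship between tokens per watt and cost per token is easy to see with a rough energy-only estimate. Assuming a rack that draws P kilowatts while sustaining R tokens per second, with electricity priced at c dollars per kilowatt-hour (all three symbols are placeholders for this sketch, not figures from NVIDIA):

$$
\text{energy cost per token} \approx \frac{P \cdot c}{3600 \cdot R}
$$

This is inversely proportional to R/P, so a tenfold jump in tokens per watt shows up as roughly one tenth of the energy cost per token. Real serving cost also includes hardware, networking and cooling, so the exact ratios differ, but the direction is the same.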
On top of these gains, the GB300 NVL72 pushes the frontier even more:
- Up to 50 times higher throughput per megawatt than Hopper
These numbers come from optimizing both the silicon and the software stack. Some key technical elements include:
- High performance GPU kernels tuned for efficiency and low latency
- NVIDIA NVLink Symmetric Memory for direct GPU to GPU memory access without going through the CPU
- Programmatic dependent launch, which lets the next kernel begin its setup work before the previous one has fully completed, reducing idle time between kernels (a minimal sketch follows this list)
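To make the last item concrete, below is a minimal CUDA sketch of programmatic dependent launch, which has been exposed in the CUDA runtime since version 11.8 and requires a Hopper-class or newer GPU (compute capability 9.0+). The kernels, sizes and variable names are illustrative placeholders, not anything from NVIDIA's inference stack.

```cuda
#include <cuda_runtime.h>

// Primary kernel: fills a buffer, then signals that its output is complete
// so a dependent kernel is allowed to start early.
__global__ void producer(float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = i * 0.5f;
    cudaTriggerProgrammaticLaunchCompletion();  // output is ready
}

// Secondary kernel: may begin its preamble (index math, loading constants)
// while the producer is still finishing, but must synchronize before
// reading the producer's output.
__global__ void consumer(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    cudaGridDependencySynchronize();            // wait for producer's signal
    if (i < n) out[i] = in[i] + 1.0f;
}

int main() {
    const int n = 1 << 20;
    float *a, *b;
    cudaMalloc(&a, n * sizeof(float));
    cudaMalloc(&b, n * sizeof(float));

    dim3 block(256), grid((n + block.x - 1) / block.x);
    producer<<<grid, block>>>(a, n);

    // Launch the consumer with the programmatic-dependent-launch attribute
    // so its startup can overlap with the tail of the producer.
    cudaLaunchAttribute attr{};
    attr.id = cudaLaunchAttributeProgrammaticStreamSerialization;
    attr.val.programmaticStreamSerializationAllowed = 1;

    cudaLaunchConfig_t cfg{};
    cfg.gridDim = grid;
    cfg.blockDim = block;
    cfg.attrs = &attr;
    cfg.numAttrs = 1;

    cudaLaunchKernelEx(&cfg, consumer, (const float*)a, b, n);
    cudaDeviceSynchronize();

    cudaFree(a);
    cudaFree(b);
    return 0;
}
```

In a real serving stack this kind of overlap is applied across long chains of small kernels, where the saved launch gaps add up to meaningful latency wins.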
All of this means that for real time agentic AI and interactive coding assistants, GB300 NVL72 can serve many more users at lower cost without sacrificing responsiveness.
Why Long Context AI Loves GB300 NVL72
One of the hardest problems in modern AI is handling very long context. For coding assistants or AI agents examining entire applications, context windows can easily reach 128,000 input tokens plus thousands of output tokens.
Both GB200 NVL72 and GB300 NVL72 can deliver ultra low latency for these workloads, but GB300 pulls ahead as context grows. For a representative workload using 128,000 token inputs and 8,000 token outputs, GB300 NVL72 delivers up to 1.5 times lower cost per token than GB200 NVL72.
The Blackwell Ultra GPU inside GB300 is designed specifically to handle this challenge. It offers approximately:
- 1.5 times higher NVFP4 compute performance compared with the earlier Blackwell chip
- Two times faster attention processing, the core operation in transformer-based models
As an AI agent reads more of a codebase, the context window grows. That improves understanding but also demands much more compute and memory bandwidth. GB300’s improved compute density and attention speed let it process these giant contexts efficiently, enabling more capable coding assistants that can truly reason over entire repositories.
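A rough rule of thumb shows why attention speed matters so much here. In a standard transformer layer with hidden size d and sequence length n (symbols used only for this estimate), the attention score and value matmuls cost on the order of:

$$
\text{FLOPs}_{\text{attention}} \approx 4\,n^{2}d \ \text{per layer}
$$

Because the cost grows with the square of the context length, going from an 8,000 token context to a 128,000 token one multiplies this term by roughly 256. That quadratic pressure is exactly what Blackwell Ultra's doubled attention throughput and higher NVFP4 compute are aimed at.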
Cloud Deployment And The Road To Rubin
Major cloud providers are already rolling out these systems. Microsoft Azure, CoreWeave and Oracle Cloud Infrastructure are deploying GB300 NVL72 clusters for low latency, long context AI workloads such as agentic coding and advanced coding assistants.
For these providers, the appeal is straightforward:
- Lower token costs mean they can offer more powerful AI services at better prices
- Higher performance per watt helps control data center power and cooling budgets
- Better long context performance unlocks new AI applications that were previously too expensive or too slow
CoreWeave in particular highlights that as inference becomes the center of AI production, long context performance and token efficiency are now critical metrics. Their AI cloud is designed to translate the raw gains from GB200 and GB300 into predictable performance and cost efficiency for customers running workloads at scale.
NVIDIA is not stopping at Blackwell and Blackwell Ultra. The next platform, Rubin, is positioned as another major leap. Rubin combines six new chips into a single AI supercomputer design. NVIDIA claims that for mixture of experts inference Rubin will deliver:
- Up to 10 times higher throughput per megawatt than Blackwell
For training next generation frontier models, Rubin is expected to train large mixture of experts models using only one fourth the number of GPUs required with Blackwell. That will further reduce infrastructure costs for companies at the cutting edge of AI.
For PC hardware enthusiasts and gamers, these platforms are far beyond desktop use. However, the same architectural ideas and software optimizations often filter down into future consumer GPUs. Features like better memory interconnects, smarter kernel scheduling and more efficient low precision compute can eventually improve gaming performance, AI features inside games and local AI tools running on consumer graphics cards.
Original article and image: https://blogs.nvidia.com/blog/data-blackwell-ultra-performance-lower-cost-agentic-ai/
