NVIDIA Smashes Graph500 Record With 8,192 H100 GPUs On Commercial Cloud

NVIDIA Sets a New Graph500 World Record

NVIDIA has hit a major milestone in high-performance computing, showing just how far modern GPU clusters have come and what that could mean for future PC and cloud gamers. The company reached a record-breaking 410 trillion traversed edges per second on the Graph500 breadth-first search benchmark, taking the number one spot on the 31st Graph500 list.

The most interesting part is that this result did not come from a secret government supercomputer. It ran on a commercially available cluster hosted by CoreWeave in a Dallas data center. The setup used 8,192 NVIDIA H100 GPUs to chew through a gigantic graph with 2.2 trillion vertices and 35 trillion edges.

To give that scale some context, imagine every person on Earth having around 150 friends. That would create a graph with about 1.2 trillion relationships. The NVIDIA and CoreWeave cluster could scan through all of those friend connections in roughly three milliseconds. That is the kind of speed that lets a system connect huge amounts of data almost instantly.
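For readers who want to sanity check that example, here is a quick back-of-the-envelope version in Python. The figures are rough illustrations (about 8 billion people, 150 connections each) combined with the 410 trillion traversed edges per second from the record run:

```python
# Rough, illustrative numbers: ~8 billion people with ~150 connections each,
# traversed at the record rate of 410 trillion edges per second.
people = 8e9
friends_per_person = 150
edges = people * friends_per_person      # about 1.2 trillion relationships

teps = 410e12                            # traversed edges per second (record run)
milliseconds = edges / teps * 1000
print(f"{edges:.1e} edges scanned in about {milliseconds:.1f} ms")
# -> 1.2e+12 edges scanned in about 2.9 ms
```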

Even more impressive, this win is not just brute force. Compared with other top systems on the Graph500 list, NVIDIA delivered more than double the performance using far fewer nodes. One competing top-ten system used about 9,000 nodes, while NVIDIA’s cluster used just over 1,000. That translates to roughly three times better performance per dollar, which matters a lot when thinking about the future of large-scale cloud gaming, AI and simulation workloads.

Why Graphs And Graph500 Matter

Graph500 is a benchmark designed to test how well a system handles graphs at scale. A graph is a way of modeling relationships between things. Social networks, recommendation systems, fraud detection, routing and many security tools are all built on graphs.

In a graph, individual things such as people or accounts are called vertices, and the links between them are called edges. Some vertices may have only a few connections while others have tens of thousands. That uneven structure makes graphs sparse and irregular, very different from smooth image grids or sequences of text that many AI models work on.

The Graph500 benchmark focuses on breadth-first search (BFS), a process that starts from a source vertex and systematically explores the graph layer by layer, touching every reachable vertex and edge as quickly as possible. Performance is measured in traversed edges per second, or TEPS.
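To make that concrete, here is a minimal single-machine sketch of BFS in Python. It is nothing like the distributed, GPU-accelerated code used in the actual benchmark runs, but it shows the layer-by-layer traversal and the edge count that the TEPS figure is built on:

```python
from collections import deque

def bfs_traversed_edges(adj, source):
    """Breadth-first search over an adjacency list, counting traversed edges.

    Edges traversed divided by runtime is the TEPS figure Graph500 reports.
    """
    visited = {source}
    frontier = deque([source])
    traversed = 0
    while frontier:
        vertex = frontier.popleft()
        for neighbor in adj[vertex]:      # every edge touched counts as traversed
            traversed += 1
            if neighbor not in visited:   # only unseen vertices join the next layer
                visited.add(neighbor)
                frontier.append(neighbor)
    return traversed

# Tiny example graph: vertex -> list of neighbors
adj = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2, 4], 4: [3]}
print(bfs_traversed_edges(adj, source=0))  # 10 edge traversals
```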

A high Graph500 score tells you that a system has:

  • Very fast interconnects between nodes
  • High memory bandwidth
  • Well-tuned software that can keep all that hardware busy

It is basically a way to test how quickly a computer can connect related pieces of data at massive scale. That is important not just for scientific simulations but also for anything that needs to connect users, content, transactions or events in real time. Those same capabilities are starting to matter more in modern gaming backends, recommendation engines and large multiplayer infrastructures.

Reinventing Graph Processing On GPUs

Traditionally, huge graph and sparse linear algebra workloads have leaned on big CPU clusters. As graphs grow to trillions of edges, CPUs end up spending a lot of time just shuffling data between nodes. That movement becomes a bottleneck.

Developers have used techniques like active messages to make this more efficient. Instead of hauling data around, you send small messages that perform work where the data already lives. That helps, but on classic systems those active messages still run on CPUs, which limits throughput and scalability.
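As a toy illustration of that idea, the short Python sketch below (with made-up names such as Partition and owner_of, not taken from any real framework) sends a small "visit this vertex" message to whichever partition owns the data, so the work happens where the data lives instead of the data being copied to the requester:

```python
# Toy model of active messages across two graph partitions.
# All names here are illustrative, not taken from a real messaging framework.

class Partition:
    """Owns part of the graph; messages sent to it are processed locally."""
    def __init__(self, adj):
        self.adj = adj          # adjacency lists for locally owned vertices
        self.inbox = []         # small "visit this vertex" messages
        self.visited = set()

    def process_inbox(self):
        """Run queued visit messages where the data lives, emit follow-up messages."""
        outgoing = []
        for vertex in self.inbox:
            if vertex not in self.visited:
                self.visited.add(vertex)
                outgoing.extend(self.adj.get(vertex, []))
        self.inbox.clear()
        return outgoing         # remote neighbors become messages to their owners

# Two partitions of a tiny four-vertex graph; owner_of says who holds a vertex.
parts = [Partition({0: [1, 2], 1: [3]}), Partition({2: [3], 3: []})]

def owner_of(v):
    return 0 if v < 2 else 1

parts[0].inbox.append(0)                       # start the search at vertex 0
while any(p.inbox for p in parts):
    for p in parts:
        for neighbor in p.process_inbox():
            parts[owner_of(neighbor)].inbox.append(neighbor)

print([sorted(p.visited) for p in parts])      # [[0, 1], [2, 3]]
```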

NVIDIA’s record-breaking run takes a very different approach. The team built a full-stack, GPU-focused solution from the ground up, using:

  • NVIDIA H100 GPUs for massive parallel processing and memory bandwidth
  • The CUDA platform and NVSHMEM programming model
  • Spectrum-X networking and InfiniBand GPUDirect Async, known as IBGDA, to let GPUs talk directly to the network

With IBGDA, GPUs no longer have to wait on CPUs to manage network traffic. Instead, GPUs can send and receive active messages directly over the InfiniBand network. NVIDIA redesigned message aggregation and communication so that hundreds of thousands of GPU threads can send active messages at the same time, instead of the few hundred threads typical on CPU-only systems.
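The aggregation part of that redesign can be sketched conceptually. The toy Python below is only a host-side model of the idea of coalescing many tiny messages into large per-destination batches; in the real system this logic runs inside GPU kernels, and the flushed batches go out directly over the InfiniBand network via NVSHMEM and IBGDA rather than through a print call:

```python
from collections import defaultdict

class MessageAggregator:
    """Illustrative model of per-destination message coalescing.

    Many producers (standing in for GPU threads) push tiny messages; once a
    destination's buffer fills up it is flushed as one large send instead of
    thousands of separate small ones.
    """
    def __init__(self, batch_size, send_fn):
        self.batch_size = batch_size
        self.send_fn = send_fn                  # the actual network send
        self.buffers = defaultdict(list)        # destination rank -> pending messages

    def push(self, dest_rank, message):
        buf = self.buffers[dest_rank]
        buf.append(message)
        if len(buf) >= self.batch_size:
            self.flush(dest_rank)

    def flush(self, dest_rank):
        if self.buffers[dest_rank]:
            self.send_fn(dest_rank, self.buffers[dest_rank])
            self.buffers[dest_rank] = []

    def flush_all(self):
        for dest in list(self.buffers):
            self.flush(dest)

# Usage: ten tiny "visit vertex" messages collapse into four batched sends.
agg = MessageAggregator(batch_size=4,
                        send_fn=lambda dest, batch: print(f"to rank {dest}: {batch}"))
for vertex in range(10):
    agg.push(dest_rank=vertex % 2, message=vertex)
agg.flush_all()
```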

The result is that the entire active messaging layer runs on GPUs. Messages are created, sent, received and processed all within GPU memory, with no CPU in the critical path. This setup fully exploits the parallelism of H100 GPUs and the speed of modern GPU focused networking.

Running on CoreWeave’s infrastructure, this design delivered more than double the performance of comparable Graph500 runs, while using a fraction of the node count and cost. It is a clear example of what a well integrated stack of GPUs, fast networking and tuned software can do when everything is designed to work together.

Why This Matters For The Future Of Computing And Gaming

At first glance, a graph benchmark might sound like a niche high performance computing topic. But the implications reach much further into AI, cloud services and eventually user facing applications such as gaming.

Many high-performance computing fields, such as fluid dynamics and weather forecasting, rely on sparse data structures and communication patterns similar to those Graph500 tests. For decades those workloads have been tied mainly to CPU-based systems. NVIDIA’s top result on Graph500, along with its two other entries in the top ten, shows that GPUs can now handle these massive, irregular workloads efficiently.

For the wider tech and gaming ecosystem, this points toward a future where commercially available GPU clusters can power huge simulations, real-time analytics and graph-heavy AI models that support next-generation experiences. As cloud providers like CoreWeave make these GPU-packed clusters more accessible, studios and developers will be able to tap into supercomputer-class performance without building their own on-premises systems.

In short, this record is more than a trophy. It is proof that GPU centric architectures are evolving beyond dense AI training into the complex, irregular workloads that underpin many modern applications. That shift will unlock new possibilities for everything from recommendation systems and security tools to large scale online worlds and cloud powered game logic.

Original article and image: https://blogs.nvidia.com/blog/h100-coreweave-graph500/
