SC25: NVIDIA’s Next Chapter of AI Supercomputing
At SC25 in St. Louis, NVIDIA showed what the next few years of AI supercomputing could look like. The company covered everything from compact, desktop-sized supercomputers to quantum-ready networks and power-aware data centers.
NVIDIA founder and CEO Jensen Huang made a surprise appearance to hype up Grace Blackwell, the company's next-generation architecture, and joked that the company is manufacturing supercomputers like chiclets. He also brought physical proof of that claim in the form of DGX Spark systems, which were given away to attendees.
DGX Spark is NVIDIA's new desktop AI supercomputer. It delivers a petaflop of AI performance and 128GB of unified memory in a small workstation form factor. That is enough power to run inference on models of up to 200 billion parameters and fine-tune them locally, without sending data to the cloud.
Because DGX Spark is built on Grace Blackwell, it combines NVIDIA CPUs, GPUs, high-speed networking, CUDA libraries and the full NVIDIA AI software stack in one box. Its unified memory plus the NVLink-C2C interconnect delivers around five times the bandwidth of PCIe Gen 5, which means much quicker data exchange between CPU and GPU and smoother training and fine-tuning for large models.
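None of that requires exotic tooling on the user's side. As a rough sketch of the local workflow DGX Spark is pitched at, here is how loading and prompting a large open-weights model might look with the Hugging Face transformers stack; the model name is illustrative, not something NVIDIA specifies.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

# Illustrative model choice; any open-weights model that fits in
# the 128GB of unified memory would work the same way.
model_id = "meta-llama/Llama-3.1-70B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",   # let accelerate place weights across unified memory
    torch_dtype="auto",  # use the checkpoint's native precision
)

generate = pipeline("text-generation", model=model, tokenizer=tokenizer)
out = generate("Explain NVLink-C2C in one paragraph.", max_new_tokens=128)
print(out[0]["generated_text"])
```

Everything, weights included, stays on the desk, which is the point of the no-cloud-round-trip pitch.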
New Brains for Physics, Simulation and AI Factories
A big theme at SC25 was using AI not just for chatbots and image generation, but to power physics, engineering and simulation workloads.
NVIDIA Apollo is a new family of open AI physics models. These models mix modern machine learning techniques, such as neural operators, transformers and diffusion models, with deep domain knowledge from areas such as semiconductor design, fluid dynamics, structural mechanics and weather modeling.
Companies like Applied Materials, Cadence, Lam Research, Siemens and Synopsys are already adopting Apollo to speed up design and simulation. NVIDIA will release pretrained checkpoints and reference workflows so developers can plug Apollo into their own tools for training, inference and benchmarking.
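Apollo's actual APIs aren't covered in this post, so the snippet below is not Apollo code. It is a minimal PyTorch sketch of the neural-operator building block this class of model relies on: a 1-D spectral convolution layer that learns a linear map over the lowest Fourier modes of a physical field.

```python
import torch
import torch.nn as nn

class SpectralConv1d(nn.Module):
    """One Fourier-neural-operator layer: FFT to frequency space, apply a
    learned complex-valued linear map to the lowest `modes` frequencies,
    then inverse-FFT back to the physical grid."""

    def __init__(self, channels: int, modes: int):
        super().__init__()
        self.modes = modes
        scale = 1.0 / channels
        self.weights = nn.Parameter(
            scale * torch.randn(channels, channels, modes, dtype=torch.cfloat)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, grid_points)
        x_ft = torch.fft.rfft(x)
        out_ft = torch.zeros_like(x_ft)
        out_ft[:, :, : self.modes] = torch.einsum(
            "bim,iom->bom", x_ft[:, :, : self.modes], self.weights
        )
        return torch.fft.irfft(out_ft, n=x.size(-1))

# A field sampled on a 128-point grid with 8 channels:
layer = SpectralConv1d(channels=8, modes=16)
y = layer(torch.randn(4, 8, 128))  # -> (4, 8, 128)
```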
On the software side, NVIDIA Warp is another piece of the physics puzzle. Warp is an open-source Python framework that lets developers write high-performance physics simulations that run on GPUs, up to 245 times faster than CPU-based versions.
Warp gives Python users a structured way to build 3D simulations for robotics, machine learning and digital twins with performance close to hand-tuned CUDA code. It integrates with popular ML stacks like PyTorch and JAX and connects to NVIDIA platforms such as Omniverse and PhysicsNeMo. Companies including Siemens, Neural Concept and Luminary Cloud are already using Warp to scale up their simulation pipelines.
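To give a flavor of the programming model, here is a minimal Warp kernel using the library's public Python API. The particle integrator itself is a made-up example, not something from NVIDIA's materials:

```python
import warp as wp

wp.init()

@wp.kernel
def integrate(positions: wp.array(dtype=wp.vec3),
              velocities: wp.array(dtype=wp.vec3),
              dt: float):
    tid = wp.tid()  # one thread per particle
    # Explicit Euler step under gravity.
    velocities[tid] = velocities[tid] + wp.vec3(0.0, -9.8, 0.0) * dt
    positions[tid] = positions[tid] + velocities[tid] * dt

n = 1024
positions = wp.zeros(n, dtype=wp.vec3, device="cuda")
velocities = wp.zeros(n, dtype=wp.vec3, device="cuda")

# The decorated Python function is JIT-compiled to a native CUDA kernel.
wp.launch(integrate, dim=n, inputs=[positions, velocities, 1.0 / 60.0])
```

Warp can also record kernel launches on a tape for automatic differentiation, which is how the same simulation code ends up inside PyTorch or JAX training loops.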
Under the hood of these AI factories, NVIDIA is pushing BlueField-4 DPUs as the processor for the operating system of AI infrastructure. BlueField-4 combines a 64-core Grace CPU with ConnectX-9 networking to offload networking, storage and security tasks from host CPUs and GPUs. This frees more compute for actual AI workloads while enabling zero-trust security and multi-tenant setups.
Storage vendors DDN, VAST Data and WEKA are building on BlueField-4 to move data smarter and faster:
- DDN is using BlueField-4 to drive next-generation AI factories and keep GPUs fed for AI and HPC.
- VAST Data is focusing on intelligent data movement and real-time efficiency in large AI clusters.
- WEKA is running its NeuralMesh architecture directly on BlueField-4, so storage services execute on the DPU itself.
Together, these moves effectively turn storage from a bottleneck into a performance multiplier for large AI and scientific jobs.
Faster Networks, Quantum Links and Smarter Power
As AI clusters grow, networking and power are quickly becoming the hardest problems. NVIDIA answered both with new tech across optics, quantum connectivity and data center control.
On the networking side, TACC, Lambda and CoreWeave announced plans to integrate NVIDIA Quantum-X Photonics co-packaged optics switches into their next-generation systems. These InfiniBand switches fuse electronic and photonic components in the same package, removing the need for traditional pluggable transceivers.
This design delivers about three and a half times better power efficiency and up to ten times higher resiliency. Jobs can run up to five times longer without interruption because the usual point of failure, the pluggable optic, is gone. Quantum-X800 InfiniBand switches, which power trillion-parameter-scale generative models at 800 gigabits per second end to end, also benefit from features like SHARPv4 in-network computation and FP8 support for more efficient training.
Looking beyond classical networks, more than a dozen major supercomputing centers around the world are adopting NVQLink, a universal interconnect that ties NVIDIA GPUs directly to quantum processors. NVQLink is built on the CUDA-Q software platform and delivers up to 40 petaflops of AI performance at FP4 precision in hybrid quantum-classical workflows.
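CUDA-Q itself already ships as a Python package, so the classical half of a hybrid workflow is ordinary Python. Here is a minimal sketch using the documented CUDA-Q API, building and sampling a GHZ-state kernel; the backend, simulated or real QPU, is chosen by configuration rather than code:

```python
import cudaq

@cudaq.kernel
def ghz(num_qubits: int):
    qubits = cudaq.qvector(num_qubits)
    h(qubits[0])                          # put the first qubit in superposition
    for i in range(num_qubits - 1):
        x.ctrl(qubits[i], qubits[i + 1])  # entangle the chain with CNOTs
    mz(qubits)                            # measure all qubits

# Sample the kernel; with NVQLink-class hardware the same call can target
# a real QPU sitting next to the GPUs.
counts = cudaq.sample(ghz, 4, shots_count=1000)
print(counts)  # expect roughly half '0000' and half '1111'
```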
Quantinuum's new Helios quantum processing unit was integrated with NVIDIA GPUs through NVQLink and achieved the first real-time decoding of scalable qLDPC quantum error correction codes. Thanks to microsecond-level latency, the system reached around 99 percent fidelity with correction, compared with around 95 percent without, at a reaction time of 60 microseconds.
Research centers across Asia-Pacific, Europe, the Middle East and the United States are using NVQLink to prototype real hybrid applications and error correction schemes, laying the groundwork for practical quantum-classical systems.
NVIDIA also announced a major collaboration with RIKEN in Japan to build two new GPU-accelerated supercomputers for scientific AI and quantum computing. These systems will use more than two thousand Blackwell GPUs in GB200 NVL4 nodes linked by Quantum-X800 networking, and are planned to be operational in 2026. They are part of Japan's broader sovereign AI push and the roadmap toward FugakuNEXT by 2030.
On the CPU side of the stack, Arm is adopting NVIDIA NVLink Fusion, the high-bandwidth coherent interconnect initially created for Grace Blackwell. NVLink Fusion lets Arm Neoverse CPUs connect directly into the NVLink ecosystem, so partners can build rack-scale systems where CPUs, GPUs and accelerators share memory and bandwidth more efficiently.
All of this needs power, and NVIDIA is treating power as a software problem. The NVIDIA Domain Power Service, or DPS, runs as a Kubernetes service and models how energy flows through a data center. It works with NVIDIA Omniverse DSX Blueprint and other tools to constrain and steer power dynamically so operators can squeeze more performance per megawatt without adding new hardware.
DPS can even talk to the power grid through APIs for automated demand response and load shedding. The idea is to make AI factories grid aware so every watt is used where it matters most.
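NVIDIA hasn't published the DPS API surface in this post, so the endpoint and payload below are entirely hypothetical; they only sketch the shape of an automated demand-response hook on the operator's side:

```python
import requests

# Hypothetical DPS endpoint; the real service's API may look nothing like this.
DPS_URL = "http://dps.cluster.internal/v1"

def shed_load(cap_megawatts: float, duration_seconds: int) -> None:
    """Ask the power service to cap facility draw during a grid
    demand-response event (illustrative request shape, not a real API)."""
    response = requests.post(
        f"{DPS_URL}/power-caps",
        json={"cap_mw": cap_megawatts, "duration_s": duration_seconds},
        timeout=10,
    )
    response.raise_for_status()

# Example: hold the AI factory to 40 MW for the next 15 minutes.
shed_load(40.0, 900)
```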
To cap off the performance story, NVIDIA and CoreWeave took first place on the 30th Graph500 breadth-first search (BFS) benchmark using 8,192 H100 GPUs. Their system hit 410 trillion traversed edges per second on a graph with 2.2 trillion vertices and 35 trillion edges, more than doubling the previous record. The run pulled together Hopper GPUs, Quantum-2 InfiniBand, CUDA, NVSHMEM and GPUDirect technologies into one massive graph-crunching machine.
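For context, the kernel Graph500 times is conceptually simple; the hard part is running it across trillions of edges on thousands of GPUs. Here is a serial Python sketch of the level-synchronous BFS being measured, which the benchmark scores in traversed edges per second (TEPS):

```python
def bfs_parents(adj: dict[int, list[int]], root: int) -> dict[int, int]:
    """Level-synchronous BFS: expand the frontier one level at a time and
    record a parent pointer for every reached vertex, as Graph500 requires."""
    parent = {root: root}
    frontier = [root]
    while frontier:
        next_frontier = []
        for u in frontier:
            for v in adj[u]:
                if v not in parent:       # first visit claims the vertex
                    parent[v] = u
                    next_frontier.append(v)
        frontier = next_frontier
    return parent

# Tiny undirected example graph as an adjacency list.
adj = {0: [1, 2], 1: [0, 3], 2: [0], 3: [1]}
print(bfs_parents(adj, 0))  # {0: 0, 1: 0, 2: 0, 3: 1}
```

At record scale, the frontier expansion is roughly what NVSHMEM and GPUDirect accelerate: GPUs exchange frontier vertices with peers directly over InfiniBand instead of staging data through host memory.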
Original article and image: https://blogs.nvidia.com/blog/accelerated-computing-networking-supercomputing-ai/
