How A New NVIDIA DGX B200 Is Powering AI Research
The Hao AI Lab at the University of California San Diego has just added a serious piece of hardware to its toolkit: an NVIDIA DGX B200 system. This is one of NVIDIA’s most powerful AI platforms, designed to deliver massive performance for training and running large language models.
For researchers at UC San Diego’s School of Computing, Information and Data Sciences and the San Diego Supercomputer Center, this system opens the door to faster experimentation, larger models, and more advanced AI applications. Assistant Professor Hao Zhang explains that the DGX B200 lets them prototype and test ideas far more quickly than with previous-generation hardware.
Alongside the DGX B200, the team also taps into NVIDIA H200 GPUs. Combined, this hardware gives them the compute power they need to work on cutting-edge AI problems like video generation, game-based benchmarks, and ultra-low-latency language model serving.
FastVideo And Lmgame Bench: Pushing AI Into Video And Games
Two of the most interesting projects that benefit directly from the DGX B200 are FastVideo and Lmgame Bench.
FastVideo aims to train a family of video generation models that can create a five-second video clip from a short text prompt in around five seconds of compute time. That might sound simple, but video generation is extremely demanding. Models need huge amounts of GPU memory and compute to handle high-resolution frames, smooth motion, and a consistent visual style.
With the DGX B200 and H200 GPUs, the Hao AI Lab can iterate faster on model architectures, training methods, and optimizations. That means they can push toward real time or near real time video generation from text, which has big implications for content creation, virtual worlds, and interactive applications.
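To get a rough feel for why this is demanding, here is a back-of-envelope sketch in Python. Apart from the five-second clip length, every number in it is an illustrative assumption, not FastVideo's actual configuration: the frame rate, resolution, compression factors, and latent channel count are simply plausible values for a modern latent video diffusion model.

```python
# Back-of-envelope size of the latent a text-to-video diffusion model works on
# for one short clip. All values below are illustrative assumptions, not
# FastVideo's actual settings.

seconds = 5                  # target clip length from the article
fps = 24                     # assumed frame rate
height, width = 720, 1280    # assumed output resolution
spatial_downsample = 8       # assumed VAE spatial compression
temporal_downsample = 4      # assumed VAE temporal compression
latent_channels = 16         # assumed latent channel count
bytes_per_value = 2          # fp16 / bf16

latent_frames = (seconds * fps) // temporal_downsample
latent_h, latent_w = height // spatial_downsample, width // spatial_downsample

tokens = latent_frames * latent_h * latent_w   # positions the backbone attends over
values = tokens * latent_channels              # total latent values per clip
print(f"latent shape: ({latent_frames}, {latent_channels}, {latent_h}, {latent_w})")
print(f"latent positions per clip: {tokens:,}")
print(f"latent size per denoising step: {values * bytes_per_value / 1e6:.1f} MB")
```

Even after aggressive compression, a transformer backbone has to attend over hundreds of thousands of latent positions per clip, and a diffusion model repeats that work across many denoising steps, which is where the GPU memory and compute demands come from.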
Lmgame Bench takes a very different angle. It is a benchmarking suite that evaluates large language models by letting them play popular video games such as Tetris and Super Mario Bros. Instead of simply measuring accuracy on question answering, Lmgame Bench looks at how well models can reason, plan, and act in dynamic environments.
With Lmgame Bench, users can:
- Test a single model to see how it performs on different game tasks
- Pit two models against each other and compare their behavior
- Measure practical performance, not just static benchmarks
This kind of benchmark is especially interesting for people who care about agents, game bots, and interactive AI systems. It shows how models respond over time, not just to one-off prompts.
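In spirit, this kind of benchmark is an agent loop: show the model the game state, ask for an action, apply it, and score the outcome. The sketch below is not Lmgame Bench's actual code or API; the toy number-line game and the query_model stub are hypothetical stand-ins, and in a real harness query_model would call the model under test.

```python
import random

def query_model(prompt: str) -> str:
    """Hypothetical stand-in for an LLM call. A real harness would send
    `prompt` to the model under test and return its chosen action."""
    # Placeholder policy so the sketch runs end to end: move toward the goal.
    state = dict(tok.split("=") for tok in prompt.split()[-2:])
    return "right" if int(state["agent"]) < int(state["goal"]) else "left"

def play_episode(max_steps: int = 20) -> bool:
    """One episode of a toy 1-D game: the model must walk the agent to the goal."""
    agent, goal = 0, random.randint(-5, 5)
    for _ in range(max_steps):
        if agent == goal:
            return True
        prompt = (
            "You control an agent on a number line. "
            "Reply with exactly one word, 'left' or 'right'. "
            f"agent={agent} goal={goal}"
        )
        action = query_model(prompt).strip().lower()
        agent += 1 if action == "right" else -1
    return agent == goal

if __name__ == "__main__":
    episodes = 50
    wins = sum(play_episode() for _ in range(episodes))
    print(f"win rate over {episodes} episodes: {wins / episodes:.0%}")
```

The real benchmark swaps in actual games such as Tetris and Super Mario Bros. along with richer metrics, and it can run two models side by side, but the observe, prompt, act, score structure is the same idea.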
Beyond these two flagship projects, the Hao AI Lab is also working on new ways to reduce latency for large language model serving. Their goal is to push models closer to real time responsiveness, even as they get larger and more capable.
From Throughput To Goodput: Smarter LLM Performance Tuning
A key system that originated in the Hao AI Lab is DistServe, which has influenced how modern AI infrastructure handles large language model inference. DistServe is closely tied to another important idea: disaggregated inference, which is already used in production platforms such as NVIDIA Dynamo.
Traditionally, LLM performance has been measured by throughput: how many tokens per second the system can generate. Higher throughput usually means lower cost per token, which is great for scaling. The problem is that throughput alone does not tell you what users actually experience. A system could have high throughput but still feel slow for individual requests.
That is where the DistServe team introduced a better metric called goodput. Goodput measures how much useful work the system does while still meeting user-specified latency targets, often called service level objectives (SLOs).
In other words:
- Throughput cares about raw speed and cost
- Goodput cares about speed while respecting user experience
By focusing on goodput, engineers can tune systems so that they are both efficient and responsive, instead of optimizing one at the expense of the other.
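As a concrete illustration, the sketch below scores a handful of simulated requests both ways. The latency targets and the simplified definition of goodput used here (requests per second that meet both of their SLOs) are assumptions made for the example; the precise formulation used by DistServe is in the paper.

```python
from dataclasses import dataclass

@dataclass
class Request:
    ttft: float         # time to first token, seconds (what prefill latency feels like)
    tpot: float         # time per output token, seconds (what decode latency feels like)
    output_tokens: int

# Illustrative SLOs: first token within 0.5 s, then each token within 50 ms.
TTFT_SLO = 0.5
TPOT_SLO = 0.05

def throughput(requests: list[Request], wall_clock_s: float) -> float:
    """Raw tokens generated per second, regardless of per-request latency."""
    return sum(r.output_tokens for r in requests) / wall_clock_s

def goodput(requests: list[Request], wall_clock_s: float) -> float:
    """Requests per second that met *both* latency SLOs (simplified definition)."""
    ok = [r for r in requests if r.ttft <= TTFT_SLO and r.tpot <= TPOT_SLO]
    return len(ok) / wall_clock_s

if __name__ == "__main__":
    requests = [
        Request(ttft=0.3, tpot=0.04, output_tokens=200),  # meets both SLOs
        Request(ttft=0.9, tpot=0.03, output_tokens=400),  # slow prefill
        Request(ttft=0.4, tpot=0.08, output_tokens=300),  # slow decode
    ]
    wall = 10.0
    print(f"throughput: {throughput(requests, wall):.0f} tokens/s")
    print(f"goodput:    {goodput(requests, wall):.2f} requests/s meeting SLOs")
```

Two deployments can report identical token throughput here, yet the one that lets slow prefills or slow decoding stall individual requests will score much lower on goodput, which is exactly the gap the metric is designed to expose.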
Disaggregated Inference: Splitting Prefill And Decode Across GPUs
To achieve high goodput, DistServe looks closely at how an LLM handles a user request. The process has two main phases:
- Prefill: The model reads the user’s input and generates the first token. This stage is very compute-intensive.
- Decode: The model generates the rest of the tokens one by one, using previous outputs as context. This stage is more memory-intensive.
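Here is a minimal sketch of the two phases, using a small open model via Hugging Face transformers purely for illustration (this is not the lab's serving stack): prefill is a single forward pass over the whole prompt that also builds the KV cache, and decode then feeds one token at a time while reusing that cache.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Small open model used purely for illustration.
name = "gpt2"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name).eval()

prompt = "Disaggregated inference separates prefill and decode because"
input_ids = tok(prompt, return_tensors="pt").input_ids

with torch.no_grad():
    # Prefill: one compute-heavy pass over the full prompt, producing the
    # first token and the KV cache for every prompt position.
    out = model(input_ids, use_cache=True)
    past = out.past_key_values
    next_id = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
    generated = [next_id]

    # Decode: a loop that feeds one token at a time and grows the cache.
    for _ in range(20):
        out = model(next_id, past_key_values=past, use_cache=True)
        past = out.past_key_values
        next_id = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
        generated.append(next_id)

print(prompt + tok.decode(torch.cat(generated, dim=-1)[0]))
```

The prefill pass runs large matrix multiplies over every prompt token at once, while each decode step has to move the model weights and the growing KV cache through memory just to produce a single token, which is why the two phases stress the hardware so differently.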
Historically, prefill and decode have run on the same GPU. That seems simple, but it forces the two phases to compete for the same resources. The result can be slower responses for users, especially when many requests hit the system at once.
The DistServe researchers discovered that splitting these two phases across different sets of GPUs can significantly increase goodput. This approach is called prefill and decode disaggregation.
In practice this means:
- One group of GPUs is dedicated to the compute-heavy prefill phase
- Another group of GPUs is dedicated to the memory-heavy decode phase
- The interference between the two phases is minimized
- Latency stays low while the system scales to handle more users
This disaggregated inference strategy allows AI platforms to grow in capacity without sacrificing responsiveness or answer quality. NVIDIA Dynamo, an open source framework that helps scale generative AI models at high efficiency and low cost, builds on these concepts to make disaggregated inference easier to deploy in real systems.
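One highly simplified way to picture the split, with Python's asyncio standing in for a scheduler (this is not DistServe's or NVIDIA Dynamo's actual implementation, and the worker counts, timings, and queue handoff are invented for the sketch): one pool of workers handles prefill, another handles decode, and each request carries a KV cache reference from the first pool to the second.

```python
import asyncio
from dataclasses import dataclass
from itertools import count

@dataclass
class Request:
    id: int
    prompt_tokens: int
    output_tokens: int
    kv_cache: str | None = None   # handle to the cache produced by prefill

async def prefill_worker(wid: int, prefill_q: asyncio.Queue, decode_q: asyncio.Queue):
    # Compute-heavy phase: process the whole prompt once, then hand off the KV cache.
    while True:
        req = await prefill_q.get()
        await asyncio.sleep(req.prompt_tokens * 0.001)    # pretend prefill cost
        req.kv_cache = f"kv-{req.id}"                     # stand-in for a real cache transfer
        await decode_q.put(req)
        prefill_q.task_done()

async def decode_worker(wid: int, decode_q: asyncio.Queue):
    # Memory-heavy phase: generate output tokens one by one from the transferred cache.
    while True:
        req = await decode_q.get()
        await asyncio.sleep(req.output_tokens * 0.0005)   # pretend decode cost
        print(f"request {req.id} finished on decode worker {wid} using {req.kv_cache}")
        decode_q.task_done()

async def main():
    prefill_q, decode_q = asyncio.Queue(), asyncio.Queue()
    ids = count()
    workers = [asyncio.create_task(prefill_worker(i, prefill_q, decode_q)) for i in range(2)]
    workers += [asyncio.create_task(decode_worker(i, decode_q)) for i in range(2)]

    # A small mixed workload: (prompt tokens, output tokens) per request.
    for prompt_toks, output_toks in [(512, 128), (2048, 64), (128, 256)]:
        await prefill_q.put(Request(next(ids), prompt_toks, output_toks))

    await prefill_q.join()     # all prefills done and handed off
    await decode_q.join()      # all outputs generated
    for w in workers:
        w.cancel()

asyncio.run(main())
```

In a real deployment the handoff is a GPU-to-GPU transfer of the KV cache, and the ratio of prefill to decode GPUs, along with each pool's parallelism strategy, can be tuned separately to the workload, which is a large part of where the goodput gains come from.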
With the DGX B200 powering their experiments, the Hao AI Lab continues to refine these ideas and apply them across different domains, from language to video to scientific research. The combination of cutting-edge hardware and smarter serving strategies is shaping the next generation of AI performance and directly influencing how large language models will serve users in real time.
Original article and image: https://blogs.nvidia.com/blog/ucsd-generative-ai-research-dgx-b200/
