How Snapchat Supercharges Data with NVIDIA GPUs
Snapchat is not just pushing out fun filters and AI stickers. Behind the scenes, its parent company, Snap, runs a huge amount of data processing to decide which new features make the cut. To keep up with billions of data points and tight time windows, the company has shifted a big part of its data infrastructure from CPUs to NVIDIA GPUs running on Google Cloud.
This move is not about graphics in the Snapchat app itself. It is about using GPU acceleration to crunch massive datasets faster and more cheaply with Apache Spark and NVIDIA data processing libraries. For anyone interested in GPUs, cloud infrastructure, and performance, this is a strong example of how modern services are evolving beyond traditional CPU-based setups.
Every feature that ships to Snapchat’s more than 940 million monthly active users goes through rigorous A/B testing. That means controlled experiments with subsets of users, where the team tracks engagement, performance, and monetization across thousands of metrics. Doing this at scale requires serious compute power and smart use of hardware.
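At its core, an experiment metrics job is a group-by aggregation: bucket users by variant, then average each metric. A minimal sketch in plain Python (the event schema and metric name here are hypothetical, not Snap's actual data model):

```python
from collections import defaultdict

# Hypothetical event records from one experiment: each row tags a user
# with an assigned variant and one engagement value.
events = [
    {"user": "u1", "variant": "control",   "opens": 3},
    {"user": "u2", "variant": "control",   "opens": 5},
    {"user": "u3", "variant": "treatment", "opens": 6},
    {"user": "u4", "variant": "treatment", "opens": 8},
]

def mean_metric_by_variant(rows, metric):
    """Group rows by variant and average one metric -- the kind of
    aggregation an experiment pipeline repeats across thousands of metrics."""
    totals = defaultdict(lambda: [0, 0])  # variant -> [running sum, count]
    for row in rows:
        acc = totals[row["variant"]]
        acc[0] += row[metric]
        acc[1] += 1
    return {variant: s / n for variant, (s, n) in totals.items()}

print(mean_metric_by_variant(events, "opens"))
# {'control': 4.0, 'treatment': 7.0}
```

At Snap's scale this same shape of computation runs over petabytes, which is exactly why it is distributed across a cluster rather than a single process.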
From CPU-Bound to GPU-Accelerated Data Pipelines
Snap relies on Apache Spark, a widely used distributed data processing framework, to run its experiments. The workload is intense: each morning, during a three-hour window, Snap processes more than 10 petabytes of data across thousands of experiments. Previously this all ran on CPU-based infrastructure, which limited how far the company could scale without costs exploding.
To solve this, Snap adopted Apache Spark accelerated by NVIDIA cuDF. cuDF is an open-source GPU DataFrame library designed for high-performance data processing that plugs into Spark with minimal friction. The key win for the engineering team is that they can run their existing Spark applications on NVIDIA GPUs with no code changes. That means they keep the same pipelines and logic while tapping into GPU acceleration under the hood.
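The "no code changes" part works because the GPU accelerator for Spark is switched on through configuration rather than application code. A hedged sketch of what that looks like in PySpark (this is a configuration fragment, not Snap's actual setup; it assumes a cluster whose executors have GPUs and the RAPIDS Accelerator jar on the classpath, and the resource amounts are illustrative):

```python
# Configuration sketch: enabling GPU acceleration for Spark via the
# RAPIDS SQL plugin. The DataFrame logic below the session builder
# is ordinary Spark code and stays unchanged.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("experiment-metrics")
    # Load the RAPIDS SQL plugin so supported operators run on the GPU.
    .config("spark.plugins", "com.nvidia.spark.SQLPlugin")
    .config("spark.rapids.sql.enabled", "true")
    # One GPU per executor, shared by several concurrent tasks.
    .config("spark.executor.resource.gpu.amount", "1")
    .config("spark.task.resource.gpu.amount", "0.25")
    .getOrCreate()
)

# Existing pipeline logic runs as-is; eligible scans, joins, and
# aggregations are executed on the GPU transparently.
df = spark.read.parquet("gs://example-bucket/experiment-events/")  # hypothetical path
df.groupBy("experiment_id", "variant").count().show()
```

Because acceleration lives entirely in the session configuration, the same job can be flipped between CPU and GPU execution without touching the pipeline code, which is what makes a migration like Snap's tractable across many pipelines.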
By running Spark on NVIDIA GPUs, Snap reports around 4x faster runtimes for these data processing jobs while using the same number of machines. Faster runtimes translate into more experiments completed in the same time window, or the option to scale back hardware while keeping performance steady.
Snap’s setup combines several layers of NVIDIA and Google Cloud technology, including:
- NVIDIA CUDA-X libraries, which provide GPU-optimized building blocks for data and AI workloads
- The NVIDIA cuDF library and its Spark accelerator, to speed up DataFrame operations on GPUs
- Google Kubernetes Engine, to manage containers and orchestrate the GPU-powered Spark clusters
- Google Cloud G2 virtual machines powered by NVIDIA L4 GPUs, to run the workloads
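On the orchestration side, landing a Spark executor on an L4-equipped G2 node comes down to standard Kubernetes GPU scheduling. A minimal sketch of a pod spec (the pod name and image are hypothetical; the `nvidia.com/gpu` resource and the `cloud.google.com/gke-accelerator` node label are GKE's standard mechanisms for requesting GPUs):

```yaml
# Sketch: schedule a container onto a GKE node with an NVIDIA L4 GPU.
apiVersion: v1
kind: Pod
metadata:
  name: spark-executor-gpu
spec:
  nodeSelector:
    # Pin the pod to nodes backed by L4 accelerators (G2 VMs).
    cloud.google.com/gke-accelerator: nvidia-l4
  containers:
    - name: executor
      image: example.com/spark-rapids:latest  # hypothetical image
      resources:
        limits:
          nvidia.com/gpu: 1  # request one GPU from the device plugin
```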
This full-stack approach lets Snap scale its data processing horizontally in the cloud while taking advantage of modern GPU hardware.
Cost Savings and Scaling for Future Features
Performance is not the only reason Snap moved to GPUs; cost and scalability were major drivers. The company had an ambitious roadmap to increase experimentation, which would have dramatically raised compute costs on a CPU-only setup.
After migrating to cuDF-accelerated Spark on NVIDIA GPUs, Snap measured significant benefits over a two-month period. The team reported around 76 percent daily cost savings on Google Kubernetes Engine compared with the previous CPU-only workflows. That is a huge reduction, especially when you are running thousands of experiments and processing petabytes of data every day.
Initially, Snap estimated it might need about 5,500 GPUs running concurrently to support all of its workloads at scale. By optimizing pipelines with NVIDIA experts and using the cuDF suite of microservices for qualification, testing, configuration, and tuning, the team brought that number down to around 2,100 GPUs running concurrently. In other words, the same or better throughput on less than half the originally projected GPU count.
The cuDF microservices also help automate much of the migration process. They can analyze Spark jobs, decide which ones are a good fit for GPU acceleration, and then configure them appropriately. This automation matters when you are dealing with many pipelines and trying to avoid manual tuning for every single job.
For Snap, experimentation is at the core of the company. Moving from CPU-based infrastructure to GPU-accelerated pipelines allows the team to run more experiments, track more metrics, and serve more users without blowing up the budget. That means they can test everything from visible features like AI-generated stickers and new map tools to behind-the-scenes changes like performance optimizations and compatibility updates for new operating systems.
So far, Snap has migrated its two biggest data pipelines to the new GPU-based setup, and the results have been strong enough that the company plans to expand this model to a broader range of production workloads. For developers and tech enthusiasts, this highlights a bigger industry trend: GPUs are no longer just for rendering frames in games or training neural networks. They are becoming a default choice for large-scale data processing, where parallel computation can dramatically cut both time and cost.
If you are working with big data or managing large A/B testing systems, Snap’s experience is a clear signal: combining Apache Spark with GPU acceleration through tools like NVIDIA cuDF, and running it on managed cloud infrastructure, can open up a lot more room for experimentation and innovation without requiring a full rewrite of your existing code.
Original article and image: https://blogs.nvidia.com/blog/snap-accelerated-data-processing/