
NVIDIA Launches a New Multilingual Speech AI Dataset and Models
Most of the world’s 7,000 languages are not available in AI tools today. NVIDIA is working to change that. The company has released a new open dataset called Granary and two speech AI models that support 25 European languages, including less common ones like Croatian, Estonian, and Maltese.
These new tools make it easier for developers to build helpful speech recognition and translation systems, such as chatbots and voice assistants, that work for people around the world. The Granary dataset is a huge collection of audio samples in different languages. It includes about 650,000 hours for speech recognition and 350,000 hours for speech translation. Anyone can use this dataset for free to train their own language models.
Along with Granary, NVIDIA released two AI models. The first is Canary-1b-v2, which can turn European speech into text and translate between English and 24 other languages with high accuracy. The second model is called Parakeet-tdt-0.6b-v3. It is designed for fast, large-scale transcription. Both models are available on the Hugging Face platform.
To create Granary, NVIDIA partnered with researchers from Carnegie Mellon University and Fondazione Bruno Kessler. Instead of needing humans to label all the audio data, they used an advanced processing system called NVIDIA NeMo Speech Data Processor. This tool organized the audio for AI training with less human effort. The process and tools are available for anyone to use on GitHub.
Granary is a helpful starting point for developers who want to work with Europe’s official languages, plus Russian and Ukrainian. It is especially valuable for languages that lack large amounts of high-quality training data. With Granary, developers can reach good accuracy with less training data than comparable datasets require.
The new Canary and Parakeet models show what is possible with Granary. Canary-1b-v2 is tuned for accuracy and can handle complex transcription and translation tasks. Parakeet-tdt-0.6b-v3 works quickly on longer audio files and can automatically identify the spoken language. Both models produce clear transcripts with accurate punctuation, capitalization, and word-level timestamps.
NVIDIA’s open-source approach means anyone can take these tools and methods to build or improve their own multilingual speech systems. This will help make AI voice technology more accessible and useful for speakers of many languages.
You can try out Granary and the new models on Hugging Face or learn more through NVIDIA’s GitHub pages.
Original article and image: https://blogs.nvidia.com/blog/speech-ai-dataset-models/