Dataocean AI Unveils GigaSpeech 2: Advancing Low-Resource Language Recognition

Thursday, 26 September 2024.
Dataocean AI, working with several academic and industry partners, has launched GigaSpeech 2, an open-source dataset designed to improve automatic speech recognition (ASR) for underserved languages. The project comprises about 30,000 hours of automatically transcribed audio in Thai, Indonesian, and Vietnamese. Models trained on it reportedly reach word accuracy above 97% for Thai and Indonesian and outperform industry-leading systems on Thai.

Collaborative Effort

GigaSpeech 2 is a collaboration between Dataocean AI and Shanghai Jiao Tong University, the Chinese University of Hong Kong, Tsinghua University, Pengcheng Lab, AISpeech, Birch AI, and Seasalt AI. The partnership pools the expertise and resources of these academic and industry groups to deliver a large, high-quality dataset to the global AI community.

A Technological Milestone

GigaSpeech 2 is designed to address the difficulty automatic speech recognition (ASR) systems have with low-resource languages. The dataset contains approximately 30,000 hours of automatically transcribed audio in Thai, Indonesian, and Vietnamese. It was built with a pipeline of data crawling, transcription with Whisper, forced alignment with TorchAudio, and iterative refinement of the labels using a Noisy Student Training (NST) strategy.
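
The iterative refinement step can be illustrated with a generic teacher/student loop. What follows is a minimal sketch of Noisy Student Training in general, not the GigaSpeech 2 implementation. The function name, the `train` and `transcribe` callbacks, the confidence threshold, and the stub data are all assumptions introduced for illustration, and the noise or augmentation applied to the student is assumed to happen inside `train`.

```python
from typing import Callable, List, Tuple, TypeVar

Model = TypeVar("Model")
Labeled = Tuple[str, str]  # (audio id, transcript)


def noisy_student_training(
    seed_data: List[Labeled],
    unlabeled_audio: List[str],
    train: Callable[[List[Labeled]], Model],
    transcribe: Callable[[Model, str], Tuple[str, float]],
    min_confidence: float,
    rounds: int,
) -> Tuple[Model, List[Labeled]]:
    """Generic teacher/student loop; every callback is a hypothetical placeholder."""
    teacher = train(seed_data)
    pseudo_labeled: List[Labeled] = []
    for _ in range(rounds):
        # The current teacher relabels the whole unlabeled pool.
        pseudo_labeled = []
        for audio_id in unlabeled_audio:
            text, confidence = transcribe(teacher, audio_id)
            if confidence >= min_confidence:  # drop unreliable pseudo-labels
                pseudo_labeled.append((audio_id, text))
        # The student is trained on seed plus pseudo-labeled data and
        # becomes the teacher for the next round.
        teacher = train(seed_data + pseudo_labeled)
    return teacher, pseudo_labeled


# Stub callbacks, for illustration only: the "model" is just the number of
# training samples, and confidence rises as that number grows.
def stub_train(samples: List[Labeled]) -> int:
    return len(samples)


def stub_transcribe(model: int, audio_id: str) -> Tuple[str, float]:
    difficulty = {"a": 0.0, "b": 0.08, "c": 0.5}[audio_id]
    return f"text-{audio_id}", 0.8 + 0.05 * model - difficulty


model, kept = noisy_student_training(
    [("s", "seed")], ["a", "b", "c"], stub_train, stub_transcribe,
    min_confidence=0.8, rounds=2,
)
print(model, [audio_id for audio_id, _ in kept])  # 3 ['a', 'b']
```

In the stub run, utterance "b" is rejected in the first round and accepted in the second, once the retrained teacher is more confident. That is the behavior the loop is meant to show: each round can admit pseudo-labels that the previous teacher could not produce reliably.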

Impressive Accuracy and Performance

ASR models trained on GigaSpeech 2 reportedly achieve word accuracy above 97% for both Thai and Indonesian. That result makes them competitive with industry-leading models such as OpenAI Whisper and Google USM Chirp, particularly for Thai. The models also use fewer parameters than Whisper large-v3, which makes them a resource-efficient option for commercial applications.
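
The article does not say how word accuracy is measured. A common convention, assumed here, is to take it as 1 minus the word error rate (WER), where WER is the word-level edit distance between reference and hypothesis divided by the reference length. The sketch below uses hypothetical function names and whitespace tokenization. Whitespace splitting is itself a simplification, since Thai text does not separate words with spaces and would need a word segmenter or character-level scoring.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def word_accuracy(reference: str, hypothesis: str) -> float:
    """Assumed definition: 1 - WER. Can be negative if insertions dominate."""
    return 1.0 - word_error_rate(reference, hypothesis)


print(word_accuracy("the cat sat down", "the dog sat down"))  # 0.75
```

Under this definition, a word accuracy above 97% corresponds to fewer than 3 word errors per 100 reference words.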

Broad Applications and Accessibility

Beyond its technical achievements, GigaSpeech 2 is intended to support applications across many domains, from agriculture to technology. The dataset is publicly available, so researchers and developers can build on it freely. By providing a high-quality, open-source resource, Dataocean AI and its partners aim to widen access to advanced ASR technology, particularly for languages that have historically been underrepresented in AI research.

The Future of ASR

As the AI industry evolves, inclusive and diverse datasets become increasingly important. GigaSpeech 2 is a significant step toward making advanced ASR technology more accessible and effective for low-resource languages. By fostering collaboration and innovation, initiatives like GigaSpeech 2 pave the way for a more inclusive digital future in which language barriers are reduced and technological advances are shared more equitably.
