National University of Singapore, Nanyang Technological University, and others open-source Mega-ASR to reduce ASR hallucinations and missing words under extreme noise
According to ME News, on May 22 (UTC+8), citing monitoring by Beating, a team from the National University of Singapore, Nanyang Technological University, and the Shanghai Artificial Intelligence Laboratory jointly open-sourced Mega-ASR, billed as the first robust speech recognition foundation model for all scenarios. The model targets hallucinations, missing words, and blank output in real-world speech recognition. Built on Qwen3-ASR 1.7B, it reportedly improves performance by up to nearly 30% over models such as Whisper, Gemini 3 Pro, and Seed-ASR in extremely complex acoustic environments. The project is available on GitHub, with all code and model weights released under the Apache 2.0 license.

For training, the research team built the Voices-in-the-wild-2M dataset, which contains 2.4 million samples totaling 11,000 hours. A simulation pipeline based on spectral physical characteristics synthesizes seven atomic acoustic effects (reverberation, echo, additive noise, far-field, frequency packet loss, bandwidth limitation, and clipping distortion) and combines them into 54 composite environmental scenarios. To keep training stable, the team filtered out samples with word error rates (WER) above 70% and then calibrated the dataset's difficulty distribution through physical plausibility checks.

On the training side, Mega-ASR introduces A2S-SFT, a progressive acoustics-to-semantics supervised fine-tuning approach that aligns audio features in stages to strengthen semantic recovery under heavy interference. For policy optimization, the model uses DG-WGPO, a reinforcement learning strategy with dual-granularity WER gating. When input audio quality is good and WER is low, the system focuses on character-level reconstruction of acoustic detail. When the audio is severely distorted and WER is high, the mechanism shifts to sentence-level semantic reconstruction, which the report says significantly reduces the hallucinations and missing words common in large models.

To offset the slight drop in recognition accuracy that can occur on clean audio, Mega-ASR adds a dynamic routing mechanism. The router assesses the quality of the incoming audio and decides whether to apply the LoRA fine-tuned weights, so that the model performs well in both clean and noisy conditions. (Source: ME)
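
To make the two WER-based steps in the report concrete, here is a minimal Python sketch, not the Mega-ASR implementation. It computes word error rate with a standard edit distance, applies the 70% filtering cutoff mentioned in the report, and picks a reconstruction granularity from the WER. The function names and the `GATE_WER` threshold are hypothetical; the report does not state where the gate switches from character-level to sentence-level.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def word_error_rate(reference, hypothesis):
    """WER = word-level edit distance / number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


MAX_WER = 0.70   # from the report: samples above 70% WER are filtered out
GATE_WER = 0.30  # hypothetical gate threshold, not given in the report


def keep_sample(reference, hypothesis):
    """Dataset filter: keep a sample only if its WER is at most MAX_WER."""
    return word_error_rate(reference, hypothesis) <= MAX_WER


def reconstruction_granularity(reference, hypothesis):
    """Illustrative gate: low WER -> character level, high WER -> sentence level."""
    wer = word_error_rate(reference, hypothesis)
    return "character" if wer <= GATE_WER else "sentence"
```

For example, `keep_sample("the cat sat on the mat", "the cat sat on a mat")` keeps the sample, since only one of the six reference words is wrong. A transcript with no words in common with its reference would be dropped by the filter and routed to the sentence-level branch of the gate.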


