Voice activity detection, RNN-based denoising, and loudness normalization to improve transcription accuracy for user-generated audio
Designed and implemented an audio preprocessing pipeline that denoises, normalizes, and segments user-recorded audio before transcription. The pipeline reduces hallucinations, skips non-speech content, and delivers more accurate transcriptions for fan engagement and issue-reporting flows.
User-generated audio from posts and voice reports often contained background noise, long silences, and music. Transcription models produced irrelevant or hallucinated text from this input, and running them on full-length, unfiltered audio drove up cost and hurt accuracy.
We built a multi-stage pipeline: (1) RNN-based denoising (FFmpeg arnndn) and broadcast loudness normalization to clean and level the signal; (2) a VAD cloud function to detect speech segments and produce speech-only audio; (3) silence detection to skip entirely silent clips; (4) optional LLM checks for hallucination and relevance. Transcription then runs only on the cleaned, speech-only stream, which improves accuracy and reduces spurious output.
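A minimal sketch of stages (1) and (3), assuming FFmpeg is available on the PATH. The filter settings (loudnorm targets, silence thresholds), sample rate, RNNoise model path, and helper names are illustrative assumptions, not the production values:

```python
import re

def build_clean_cmd(src: str, dst: str, model: str = "std.rnnn") -> list[str]:
    """Stage 1: ffmpeg command for RNN denoising (arnndn) followed by
    EBU R128 loudness normalization (loudnorm). Model path and loudness
    targets are illustrative."""
    filters = ",".join([
        f"arnndn=m={model}",              # RNN-based noise suppression
        "loudnorm=I=-16:TP=-1.5:LRA=11",  # level to a broadcast loudness target
    ])
    return ["ffmpeg", "-y", "-i", src, "-af", filters,
            "-ar", "16000", "-ac", "1", dst]  # mono 16 kHz for the transcriber

def build_silence_cmd(src: str) -> list[str]:
    """Stage 3: ffmpeg command that reports silent stretches on stderr."""
    return ["ffmpeg", "-i", src, "-af",
            "silencedetect=noise=-35dB:d=2", "-f", "null", "-"]

_SILENCE = re.compile(r"silence_duration:\s*([\d.]+)")

def is_mostly_silent(stderr_log: str, total_s: float,
                     threshold: float = 0.9) -> bool:
    """Skip a clip when silencedetect output says it is nearly all silence."""
    silent = sum(float(m) for m in _SILENCE.findall(stderr_log))
    return total_s > 0 and silent / total_s >= threshold
```

In this sketch, clips flagged by `is_mostly_silent` are dropped before transcription, and everything else proceeds to the VAD stage for speech-only segmentation.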