Voice activity detection, RNN-based denoising, and loudness normalization to improve transcription accuracy for user-generated audio
Designed and implemented an audio preprocessing pipeline that denoises, normalizes, and segments user-recorded audio before transcription. The pipeline reduces hallucinations, skips non-speech content, and delivers more accurate transcriptions for fan engagement and issue-reporting flows.
User-generated audio from posts and voice reports often contained background noise, long silences, and music. Transcription models produced irrelevant or hallucinated text from this input, and running them on full-length, unfiltered audio drove up cost and hurt accuracy.
We built a multi-stage pipeline: (1) RNN-based denoising (FFmpeg arnndn) and broadcast loudness normalization to clean and level the signal; (2) a VAD cloud function to detect speech segments and produce speech-only audio; (3) silence detection to skip entirely silent clips; (4) optional LLM checks for hallucination and relevance. Transcription then runs only on the cleaned, speech-only stream, which improves accuracy and reduces spurious output.
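A minimal sketch of stages (1) and (3), assuming FFmpeg is available on the PATH. The filter settings (loudnorm targets, silence thresholds), sample rate, RNNoise model path, and helper names are illustrative assumptions, not the production values:

```python
import re

def build_clean_cmd(src: str, dst: str, model: str = "std.rnnn") -> list[str]:
    """Stage 1: ffmpeg command for RNN denoising (arnndn) followed by
    EBU R128 loudness normalization (loudnorm). Model path and loudness
    targets are illustrative."""
    filters = ",".join([
        f"arnndn=m={model}",              # RNN-based noise suppression
        "loudnorm=I=-16:TP=-1.5:LRA=11",  # level to a broadcast loudness target
    ])
    return ["ffmpeg", "-y", "-i", src, "-af", filters,
            "-ar", "16000", "-ac", "1", dst]  # mono 16 kHz for the transcriber

def build_silence_cmd(src: str) -> list[str]:
    """Stage 3: ffmpeg command that reports silent stretches on stderr."""
    return ["ffmpeg", "-i", src, "-af",
            "silencedetect=noise=-35dB:d=2", "-f", "null", "-"]

_SILENCE = re.compile(r"silence_duration:\s*([\d.]+)")

def is_mostly_silent(stderr_log: str, total_s: float,
                     threshold: float = 0.9) -> bool:
    """Skip a clip when silencedetect output says it is nearly all silence."""
    silent = sum(float(m) for m in _SILENCE.findall(stderr_log))
    return total_s > 0 and silent / total_s >= threshold
```

In this sketch, clips flagged by `is_mostly_silent` are dropped before transcription, and everything else proceeds to the VAD stage for speech-only segmentation.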