AI / Machine Learning

FanKave

Voice activity detection, RNN-based denoising, and loudness normalization to improve transcription accuracy for user-generated audio

Client FanKave
Year 2024
Services
Audio Processing, Voice Activity Detection (VAD), API Integration, Transcription Pipeline
OVERVIEW

The Project

Designed and implemented an audio preprocessing pipeline that denoises, normalizes, and segments user-recorded audio before transcription. The pipeline reduces hallucinations, skips non-speech content, and delivers more accurate transcriptions for fan engagement and issue-reporting flows.
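The denoise-and-normalize stage can be sketched as a single FFmpeg filter chain. This is a minimal illustration, not the production code: the model path, file names, and loudness targets (I=-16, TP=-1.5, LRA=11) are assumptions, and `arnndn` requires a trained `.rnnn` model file on disk.

```python
import subprocess

def build_cleanup_cmd(src: str, dst: str, model: str = "models/std.rnnn") -> list[str]:
    """Build an FFmpeg command that denoises with the arnndn RNN filter and
    applies EBU R128 loudness normalization (loudnorm).
    NOTE: the model path and loudnorm targets here are illustrative placeholders."""
    filters = f"arnndn=m={model},loudnorm=I=-16:TP=-1.5:LRA=11"
    # Downmix to 16 kHz mono, a common input format for ASR models.
    return ["ffmpeg", "-y", "-i", src, "-af", filters, "-ar", "16000", "-ac", "1", dst]

cmd = build_cleanup_cmd("raw_clip.wav", "clean_clip.wav")
# subprocess.run(cmd, check=True)  # uncomment where FFmpeg is installed
```

Running denoising before loudness normalization matters: `loudnorm` measures the whole signal, so leveling a still-noisy clip would amplify the noise floor along with the speech.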

THE CHALLENGE

What We Solved

Challenge

User-generated audio from posts and voice reports often contained background noise, long silences, and music. The transcription model produced irrelevant or hallucinated text from these segments, and transcribing full-length audio wasted compute and hurt accuracy.

Solution

We built a multi-stage pipeline: (1) RNN-based denoising (FFmpeg arnndn) and broadcast loudness normalization to clean and level the signal; (2) a VAD cloud function to detect speech segments and produce speech-only audio; (3) silence detection to skip truly silent clips entirely; (4) optional LLM checks that flag hallucinated or off-topic transcripts. Transcription then runs on the cleaned, speech-only stream, yielding higher accuracy at lower processing cost.
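Steps (2) and (3) can be illustrated with a simple frame-energy check. The production pipeline uses a dedicated VAD service rather than this heuristic; the frame length and energy threshold below are illustrative assumptions, but the flow is the same: keep frames that carry speech-like energy, and skip clips where no frame does.

```python
import math

def frame_energies(samples, frame_len=320):
    """RMS energy per frame (320 samples = 20 ms at 16 kHz)."""
    return [
        math.sqrt(sum(s * s for s in samples[i:i + frame_len]) / frame_len)
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]

def speech_frames(samples, threshold=0.02, frame_len=320):
    """Indices of frames whose RMS exceeds the (illustrative) threshold."""
    return [i for i, e in enumerate(frame_energies(samples, frame_len)) if e > threshold]

def is_silent(samples, threshold=0.02, frame_len=320):
    """True when no frame clears the threshold; such clips never reach ASR."""
    return not speech_frames(samples, threshold, frame_len)

# Synthetic demo: 1 s of silence followed by 1 s of a 440 Hz tone at 16 kHz.
sr = 16000
silence = [0.0] * sr
tone = [0.3 * math.sin(2 * math.pi * 440 * t / sr) for t in range(sr)]
clip = silence + tone

assert is_silent(silence)          # silent clip is skipped outright
assert not is_silent(clip)         # mixed clip proceeds to segmentation
```

A real VAD (e.g. a trained model behind the cloud function) is far more robust to noise and music than raw energy, which is exactly why the denoising stage runs first: it widens the energy gap between speech and background.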

Project showcase
THE RESULTS

Numbers that matter

Higher Transcription Relevance: fewer nonsensical or off-topic transcriptions caused by noise
Improved Processing Efficiency: only speech segments are sent to ASR; silent or noise-only clips are skipped
Reduced False Positives: VAD and silence checks avoid transcribing empty or non-speech audio