AI-powered analysis of screen recordings that turns user-submitted videos into structured, time-ordered reproduction steps for faster triage and resolution
Built a screen recording analysis service that uses multimodal AI to watch user-submitted bug-report videos and produce structured, timestamped reproduction steps, error indicators, and a concise issue summary, so that support and engineering can triage and fix issues without re-watching long recordings.
Users often submit screen recordings when reporting bugs, but support agents and developers previously had to watch entire videos manually to extract reproduction steps. Transcriptions and logs were disconnected from what was happening on screen, and critical moments were easily missed or poorly documented.
We designed and implemented an analysis pipeline that accepts a video URL plus optional context (issue title, description, transcription, logs). The service downloads the recording, uploads it to Google's multimodal AI (Gemini), and uses a structured prompt to extract time-ordered key events, each with a timestamp (MM:SS), a short summary, a description, and a significance score (1–5). The model correlates on-screen actions with the logs and transcription and produces a single issue summary; we then validate and filter the results (e.g., discarding timestamps beyond the video duration) before returning structured JSON. Retries and schema validation ensure reliable, actionable output for downstream triage and issue resolution.
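As a rough illustration of the core analysis step, here is a minimal sketch assuming a Python implementation with the google-generativeai SDK and Pydantic v2; the model name, prompt wording, environment variable, and field names are illustrative placeholders rather than the exact production values:

```python
import os
import time

import google.generativeai as genai
from pydantic import BaseModel, Field

genai.configure(api_key=os.environ["GEMINI_API_KEY"])  # assumed env var name


class KeyEvent(BaseModel):
    timestamp: str = Field(pattern=r"^\d{2}:\d{2}$")  # MM:SS within the recording
    summary: str
    description: str
    significance: int = Field(ge=1, le=5)


class AnalysisResult(BaseModel):
    issue_summary: str
    events: list[KeyEvent]


PROMPT = """You are analysing a screen recording attached to a bug report.
Issue title: {title}
Issue description: {description}
Transcription: {transcription}
Logs: {logs}

Return JSON with an "issue_summary" string and an "events" array, where each
event has "timestamp" (MM:SS), "summary", "description", and "significance"
(1-5). Correlate on-screen actions with the transcription and logs."""


def analyze_recording(video_path: str, context: dict) -> AnalysisResult:
    # Upload the downloaded recording and wait until Gemini finishes processing it.
    video_file = genai.upload_file(path=video_path)
    while video_file.state.name == "PROCESSING":
        time.sleep(5)
        video_file = genai.get_file(video_file.name)
    if video_file.state.name != "ACTIVE":
        raise RuntimeError(f"Video upload failed: {video_file.state.name}")

    model = genai.GenerativeModel("gemini-1.5-pro")  # assumed model name
    response = model.generate_content(
        [video_file, PROMPT.format(**context)],
        generation_config=genai.GenerationConfig(
            response_mime_type="application/json",
        ),
    )
    # Pydantic enforces the schema the prompt asks for; a mismatch raises here.
    return AnalysisResult.model_validate_json(response.text)
```

Requesting a JSON response and validating it against an explicit schema is what keeps the output machine-consumable for downstream triage tooling.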
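The validation, filtering, and retry layer could look like the following sketch, which builds on the block above; the three-attempt cap and the helper names are assumptions for illustration:

```python
from pydantic import ValidationError


def mmss_to_seconds(timestamp: str) -> int:
    minutes, seconds = timestamp.split(":")
    return int(minutes) * 60 + int(seconds)


def filter_events(result: AnalysisResult, duration_s: float) -> AnalysisResult:
    # Drop events whose timestamps fall beyond the end of the recording and
    # keep the remainder in chronological order.
    kept = [e for e in result.events if mmss_to_seconds(e.timestamp) <= duration_s]
    kept.sort(key=lambda e: mmss_to_seconds(e.timestamp))
    return AnalysisResult(issue_summary=result.issue_summary, events=kept)


def analyze_with_retries(video_path: str, context: dict, duration_s: float,
                         attempts: int = 3) -> AnalysisResult:
    last_error: Exception | None = None
    for _ in range(attempts):
        try:
            return filter_events(analyze_recording(video_path, context), duration_s)
        except (ValidationError, ValueError) as exc:
            # Malformed JSON or schema violations trigger another attempt.
            last_error = exc
    raise RuntimeError("Recording analysis failed after retries") from last_error
```

The validated result can then be serialised (e.g., via `model_dump()`) as the structured JSON payload returned to support and engineering.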