Module B / Code recipe / 10-20 min
Extracting audio with FFmpeg, then transcribing
Turn MP4, MOV, or MKV into clean audio before transcription.
TL;DR
Extract a mono audio track from the MP4, MOV, or MKV with FFmpeg, transcribe that file, then return to the original video for the final export. Treat the steps as practical guidance, not a rigid rulebook.
Why it matters
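Captions make clips understandable without sound, searchable after publishing, and reviewable by editors before export. Extracting the audio first gives the transcription step a small, predictable file instead of a full video container. The goal is to help you make a stronger clip without taking away your creative freedom.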
Prerequisites
- An audio file, video file, URL, or exported clip
- A target output format such as SRT, VTT, burned-in MP4, or transcript text
Core concept
Caption work is part accuracy and part design. The workflow only works if viewers can read the result quickly on a phone.
Example
Scenario
Auto-captions are mostly correct, but the clip contains names, numbers, jargon, or fast speech.
Move
Apply the workflow to a short section first and proofread the result at phone size.
Result
The caption pass becomes readable and accurate enough that sound-off viewers can follow the clip.
How to do it
1. Use FFmpeg to extract a clean mono audio file before sending it to transcription.
2. Check the audio file by playing the first few seconds and a later section before uploading.
3. Transcribe the extracted audio and keep word timings if you will create captions.
4. Use the transcript to find cut points or caption lines, then return to the original video for final export.
5. Save the exact command that worked so future clips use the same reliable path.
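Steps 1 and 5 can be combined by keeping the FFmpeg command in a small script. The sketch below is one possible way to do that, not a required setup: the function names `build_extract_command` and `extract_audio` are hypothetical, the flags mirror the FFmpeg snippet in the reference section, and the `-y` overwrite flag is an added assumption.

```python
import shlex
import shutil
import subprocess
from pathlib import Path


def build_extract_command(video_path: str, audio_path: str) -> list[str]:
    """Return the FFmpeg argument list for a mono, 16 kHz audio extraction."""
    return [
        "ffmpeg",
        "-y",            # assumption: overwrite the output file if it exists
        "-i", video_path,
        "-vn",           # drop the video stream
        "-ac", "1",      # downmix to one channel (mono)
        "-ar", "16000",  # resample to 16 kHz
        audio_path,
    ]


def extract_audio(video_path: str, audio_path: str = "audio.wav") -> Path:
    """Run FFmpeg and return the path of the extracted audio file."""
    if shutil.which("ffmpeg") is None:
        raise RuntimeError("ffmpeg was not found on PATH")
    # check=True raises CalledProcessError if FFmpeg exits with an error.
    subprocess.run(build_extract_command(video_path, audio_path), check=True)
    return Path(audio_path)


if __name__ == "__main__":
    # Print the command so the exact line that worked can be saved and reused.
    print(shlex.join(build_extract_command("input.mp4", "audio.wav")))
```

Calling `extract_audio("input.mp4")` runs the same command and returns the path to `audio.wav`. Building the argument list separately from running it makes it easy to log or store the command alongside the clip.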
Expected output
A caption or transcript artifact that is proofread, timed, readable on a phone, and matched to the target platform.
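To show how word timings can become a timed caption file, here is a minimal sketch that groups words into short cues and renders them as SRT. The `Word` class, the function names, and the 32-character cue limit are assumptions for illustration; adjust the limit to whatever reads well on your target platform.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    start: float  # seconds from the start of the audio
    end: float    # seconds from the start of the audio


def format_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def words_to_srt(words: list[Word], max_chars: int = 32) -> str:
    """Group words into cues of at most max_chars and render them as SRT."""
    cues: list[list[Word]] = []
    current: list[Word] = []
    for word in words:
        candidate = " ".join(w.text for w in current + [word])
        if current and len(candidate) > max_chars:
            # Adding this word would make the cue too long, so start a new one.
            cues.append(current)
            current = [word]
        else:
            current.append(word)
    if current:
        cues.append(current)

    blocks = []
    for index, cue in enumerate(cues, start=1):
        text = " ".join(w.text for w in cue)
        timing = f"{format_timestamp(cue[0].start)} --> {format_timestamp(cue[-1].end)}"
        blocks.append(f"{index}\n{timing}\n{text}")
    return "\n\n".join(blocks) + "\n" if blocks else ""
```

Each cue starts at its first word's start time and ends at its last word's end time. The output still needs a human proofread for names, numbers, and jargon.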
Practice task
Produce a clean caption pass
1. Take a 20-30 second section of a real clip.
2. Apply the caption or transcript workflow from this lesson.
3. Proofread it with sound on, then watch it again with sound off at phone size.
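A quick automated pass can flag cues worth a second look before the sound-off review. The sketch below is only a rough aid and does not replace watching the clip at phone size. The function name and both thresholds (32 characters per line, 17 characters per second) are assumptions, not platform rules.

```python
def find_hard_to_read_cues(cues, max_line_chars=32, max_chars_per_second=17.0):
    """Return (cue_number, reason) pairs for cues that may be hard to read.

    Each cue is a (start_seconds, end_seconds, text) tuple.
    """
    problems = []
    for number, (start, end, text) in enumerate(cues, start=1):
        lines = text.splitlines() or [""]
        longest = max(len(line) for line in lines)
        if longest > max_line_chars:
            problems.append((number, f"line has {longest} characters"))

        duration = end - start
        characters = len(text.replace("\n", " "))
        if duration <= 0:
            problems.append((number, "cue has no duration"))
        elif characters / duration > max_chars_per_second:
            rate = characters / duration
            problems.append((number, f"{rate:.1f} characters per second"))
    return problems
```

For example, a 49-character cue shown for half a second is reported twice, once for line length and once for reading speed, while a 10-character cue shown for two seconds passes.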
Reference snippets
OpenAI speech-to-text request
curl --request POST \
  --url https://api.openai.com/v1/audio/transcriptions \
  --header "Authorization: Bearer $OPENAI_API_KEY" \
  --header "Content-Type: multipart/form-data" \
  --form file=@audio.mp3 \
  --form model=gpt-4o-transcribe
Python speech-to-text request
from openai import OpenAI

client = OpenAI()

with open("audio.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="gpt-4o-transcribe",
        file=audio_file,
    )

print(transcript.text)
Whisper word timestamps
curl https://api.openai.com/v1/audio/transcriptions \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: multipart/form-data" \
  -F file="@audio.mp3" \
  -F model="whisper-1" \
  -F response_format="verbose_json" \
  -F "timestamp_granularities[]=word"
Extract audio before transcription
ffmpeg -i input.mp4 -vn -ac 1 -ar 16000 audio.wav
Download audio from a URL for local review
yt-dlp -x --audio-format mp3 "https://example.com/video"
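Python word timestamps and candidate cut points

The Whisper word-timestamp request above can also be made from Python, and the returned timings can suggest where to cut. This is a sketch under assumptions: it requires the `openai` package and an `OPENAI_API_KEY` in the environment, and the helper names `transcribe_with_word_timings` and `find_pauses` and the 0.5-second pause threshold are hypothetical choices for illustration.

```python
def transcribe_with_word_timings(audio_path: str) -> list[dict]:
    """Send audio to the transcription endpoint and return word timings."""
    # Imported here so find_pauses() can be used without the SDK installed.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    with open(audio_path, "rb") as audio_file:
        transcript = client.audio.transcriptions.create(
            model="whisper-1",
            file=audio_file,
            response_format="verbose_json",
            timestamp_granularities=["word"],
        )
    # words can be missing if no word timings were returned.
    return [
        {"word": w.word, "start": w.start, "end": w.end}
        for w in (transcript.words or [])
    ]


def find_pauses(words: list[dict], min_gap: float = 0.5) -> list[tuple[float, float]]:
    """Return (gap_start, gap_end) spans where the speaker pauses.

    Pauses of at least min_gap seconds are treated as candidate cut points.
    """
    return [
        (prev["end"], nxt["start"])
        for prev, nxt in zip(words, words[1:])
        if nxt["start"] - prev["end"] >= min_gap
    ]
```

A typical use is `find_pauses(transcribe_with_word_timings("audio.wav"))`, then checking each returned span against the original video before cutting, since a pause in speech is not always a good edit point.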