Module A / Guide / 10-20 min
MacWhisper and whisper.cpp for fully local transcription
Transcribe private or large audio locally and export captions without cloud upload.
TL;DR
Run MacWhisper or whisper.cpp on your own machine when the audio is private, large, or has to be handled offline, then export captions (SRT, VTT) or transcript text without uploading anything. Treat the steps as practical guidance, not a rigid rulebook.
Why it matters
Captions make clips understandable without sound, searchable after publishing, and reviewable by editors before export. The goal is to help you make a stronger clip without taking away your creative freedom.
What you will learn
Prerequisites
- An audio file, video file, URL, or exported clip
- A target output format such as SRT, VTT, burned-in MP4, or transcript text
What you need
Core concept
Caption work is part accuracy and part design: an accurate transcript still fails if viewers cannot read it quickly on a phone.
Example
Scenario
Auto-captions are mostly correct, but the clip contains names, numbers, jargon, or fast speech.
Move
Apply the workflow to a short section first and proofread the result at phone size.
Result
The caption pass becomes readable and accurate enough that sound-off viewers can follow the clip.
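One way to handle recurring names and jargon is to keep a small correction glossary and apply it before proofreading. The sketch below is illustrative only: `apply_glossary` and the sample glossary entries are hypothetical, not part of MacWhisper or whisper.cpp, and a human read-through is still needed afterwards.

```python
import re


def apply_glossary(text: str, glossary: dict[str, str]) -> tuple[str, list[str]]:
    """Replace known mis-hearings and return the new text plus a change log.

    `glossary` maps a mis-heard phrase to its correct spelling. The entries
    are supplied by the editor; nothing here is detected automatically.
    """
    applied = []
    for wrong, right in glossary.items():
        # Whole-word, case-insensitive match so "mac whisper" also catches "Mac whisper".
        pattern = re.compile(rf"\b{re.escape(wrong)}\b", flags=re.IGNORECASE)
        # A function replacement keeps backslashes in `right` literal.
        text, count = pattern.subn(lambda _match: right, text)
        if count:
            applied.append(f"{wrong} -> {right} ({count}x)")
    return text, applied


if __name__ == "__main__":
    fixed, log = apply_glossary(
        "Mac whisper ran whisper cpp on the sample.",
        {"mac whisper": "MacWhisper", "whisper cpp": "whisper.cpp"},
    )
    print(fixed)
    print(log)
```

The change log makes it easy to spot-check each substitution against the audio instead of trusting the replacement blindly.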
How to do it
1. Choose local transcription when privacy, large files, or offline work matter more than cloud convenience.
2. Install the local app or binary and pick a model size that your machine can run comfortably.
3. Transcribe one short sample before processing a full episode.
4. Export captions or transcript text, then proofread jargon and speaker-specific terms.
5. Keep the source, transcript, and caption files organized so they can be reused in the edit.
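For the whisper.cpp route, the steps above can be sketched as a small helper that assembles one command per audio file and writes captions next to the source. This is a sketch under stated assumptions: the binary name `whisper-cli`, the model path `models/ggml-base.en.bin`, and the flags `-m`, `-f`, `-of`, `-osrt`, `-ovtt`, and `-otxt` reflect the whisper.cpp command line as commonly documented, so confirm them with `--help` on your own build (older builds name the binary `main`). The function only builds the argument list; it does not run anything.

```python
from pathlib import Path

# Output flags as commonly documented for the whisper.cpp CLI (verify with --help).
OUTPUT_FLAGS = {"srt": "-osrt", "vtt": "-ovtt", "txt": "-otxt"}


def build_whisper_command(
    audio: str,
    model: str = "models/ggml-base.en.bin",  # assumed model path
    binary: str = "whisper-cli",  # assumed binary name
    formats: tuple[str, ...] = ("srt", "vtt", "txt"),
) -> list[str]:
    """Return the argument list for one local whisper.cpp run."""
    unknown = [name for name in formats if name not in OUTPUT_FLAGS]
    if unknown:
        raise ValueError(f"unsupported formats: {unknown}")
    source = Path(audio)
    # -of takes the output path without an extension, so the caption
    # files land beside the source and share its base name.
    output_base = source.with_suffix("")
    command = [binary, "-m", model, "-f", str(source), "-of", str(output_base)]
    command += [OUTPUT_FLAGS[name] for name in formats]
    return command


if __name__ == "__main__":
    import shlex

    # Print the command for a short sample before committing to a full episode.
    print(shlex.join(build_whisper_command("episodes/ep01/sample.wav")))
```

Passing the returned list to `subprocess.run(command, check=True)` would execute it. Keeping the builder separate makes it easy to review the command on a short sample first.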
Expected output
A caption or transcript artifact that is proofread, timed, readable on a phone, and matched to the target platform.
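A rough automated check can flag cues that are likely to be hard to read before the manual phone-size review. The sketch below parses SRT text and reports long lines and fast cues. The limits of 42 characters per line and 17 characters per second are assumptions borrowed from common subtitling guidelines, not values from this lesson, so tune them for your platform.

```python
import re

TIMESTAMP = re.compile(r"(\d+):(\d{2}):(\d{2})[,.](\d{3})")


def to_seconds(stamp: str) -> float:
    """Convert an SRT timestamp such as 00:01:02,500 to seconds."""
    match = TIMESTAMP.search(stamp)
    if match is None:
        raise ValueError(f"not a timestamp: {stamp!r}")
    hours, minutes, seconds, millis = map(int, match.groups())
    return hours * 3600 + minutes * 60 + seconds + millis / 1000


def check_srt(text: str, max_chars: int = 42, max_cps: float = 17.0) -> list[tuple[str, str]]:
    """Return (cue number, problem) pairs for cues that may be hard to read."""
    problems = []
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = block.splitlines()
        if len(lines) < 3 or "-->" not in lines[1]:
            continue  # skip anything that is not a complete cue
        start, end = (to_seconds(part) for part in lines[1].split("-->"))
        caption_lines = lines[2:]
        for line in caption_lines:
            if len(line) > max_chars:
                problems.append((lines[0], f"line has {len(line)} characters"))
        duration = end - start
        total_chars = sum(len(line) for line in caption_lines)
        if duration <= 0:
            problems.append((lines[0], "cue has no positive duration"))
        elif total_chars / duration > max_cps:
            problems.append((lines[0], f"{total_chars / duration:.1f} characters per second"))
    return problems
```

An empty result does not mean the captions are correct. It only means no cue exceeded the assumed limits, so the sound-on and sound-off proofreading passes still apply.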
Practice task
Produce a clean caption pass
1. Take a 20-30 second section of a real clip.
2. Apply the caption or transcript workflow from this lesson.
3. Proofread it with sound on, then watch it again with sound off at phone size.
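To cut the practice section out of a longer clip, one option is to let ffmpeg trim and convert in a single pass. The helper below only builds the command. The function name, the 25-second default, and the `sample.wav` output name are illustrative choices, while the mono 16 kHz settings mirror the ffmpeg line in the reference snippets.

```python
def build_sample_command(
    source: str,
    start_seconds: float,
    duration_seconds: float = 25.0,  # illustrative default inside the 20-30 s range
    output: str = "sample.wav",  # hypothetical output name
) -> list[str]:
    """Return an ffmpeg argument list that extracts a short mono 16 kHz sample."""
    if start_seconds < 0 or duration_seconds <= 0:
        raise ValueError("start must be >= 0 and duration must be > 0")
    return [
        "ffmpeg",
        "-ss", f"{start_seconds:g}",  # seek to the start of the section
        "-t", f"{duration_seconds:g}",  # read only this many seconds
        "-i", source,
        "-vn",  # drop the video stream
        "-ac", "1",  # mono
        "-ar", "16000",  # 16 kHz sample rate
        output,
    ]


if __name__ == "__main__":
    import shlex

    print(shlex.join(build_sample_command("input.mp4", start_seconds=90)))
```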
Check your work
Common mistakes and fixes
Troubleshooting
Related resources
Reference snippets
OpenAI speech-to-text request
curl --request POST \
--url https://api.openai.com/v1/audio/transcriptions \
--header "Authorization: Bearer $OPENAI_API_KEY" \
--header "Content-Type: multipart/form-data" \
--form file=@audio.mp3 \
--form model=gpt-4o-transcribe
Python speech-to-text request
from openai import OpenAI

# The client reads OPENAI_API_KEY from the environment.
client = OpenAI()
with open("audio.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="gpt-4o-transcribe",
        file=audio_file,
    )
print(transcript.text)
Whisper word timestamps
curl https://api.openai.com/v1/audio/transcriptions \
-H "Authorization: Bearer $OPENAI_API_KEY" \
-H "Content-Type: multipart/form-data" \
-F file="@audio.mp3" \
-F model="whisper-1" \
-F response_format="verbose_json" \
-F "timestamp_granularities[]=word"
Extract audio before transcription
ffmpeg -i input.mp4 -vn -ac 1 -ar 16000 audio.wav
Download audio from a URL for local review
yt-dlp -x --audio-format mp3 "https://example.com/video"
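Word-level timestamps become useful once they are grouped into short caption cues. The sketch below assumes each word arrives as a dictionary with `word`, `start`, and `end` keys (times in seconds), which is how the `verbose_json` word list is understood to look. The 32-character cue limit is an arbitrary assumption aimed at phone screens, and `words_to_srt` is a hypothetical helper, not part of any library.

```python
def format_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, millis = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def words_to_srt(words: list[dict], max_chars: int = 32) -> str:
    """Group timed words into cues of at most max_chars and return SRT text."""
    cues: list[list[dict]] = []
    current: list[dict] = []
    for word in words:
        candidate = " ".join(w["word"].strip() for w in current + [word])
        # Start a new cue when adding this word would exceed the limit.
        if current and len(candidate) > max_chars:
            cues.append(current)
            current = []
        current.append(word)
    if current:
        cues.append(current)

    blocks = []
    for index, cue in enumerate(cues, start=1):
        text = " ".join(w["word"].strip() for w in cue)
        start = format_timestamp(cue[0]["start"])
        end = format_timestamp(cue[-1]["end"])
        blocks.append(f"{index}\n{start} --> {end}\n{text}")
    return "\n\n".join(blocks) + "\n" if blocks else ""
```

Splitting purely on character count ignores sentence boundaries and pauses, so treat the output as a first draft to adjust during proofreading.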