Section 3

Captions and Transcripts

Transcribe audio, video, and URLs, then ship captions that viewers can actually read.

Accurate transcripts and platform-ready captions for short-form clips, long-form uploads, and review workflows.

Who it is for

Creators, editors, accessibility reviewers, and developers handling MP3s, videos, URLs, SRT, VTT, ASS, and burned-in captions.

Time to first value

First transcript in 5-15 minutes

Lessons in this track

18 resources

Concept primer

Caption work starts by choosing the right transcription path for the source. Local audio, local video, and online URLs each have different failure points.

Formats matter. SRT is simple and widely supported, VTT fits web video, ASS supports styling, and burned-in captions are often required for social feeds where sidecar files are ignored.
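To make the SRT/VTT difference concrete, here is a minimal sketch that writes the same cues in both formats. The function names and the `(start, end, text)` cue tuples are hypothetical helpers for illustration, not part of any tool listed in this track. The main visible differences are the `WEBVTT` header, the numbered cues in SRT, and the decimal marker in timestamps (comma in SRT, period in VTT).

```python
def format_timestamp(seconds: float, decimal_marker: str) -> str:
    """Format seconds as HH:MM:SS<marker>mmm (marker is ',' for SRT, '.' for VTT)."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}{decimal_marker}{ms:03d}"


def to_srt(cues):
    """Render (start, end, text) tuples as SRT: numbered cues, comma decimals."""
    blocks = []
    for index, (start, end, text) in enumerate(cues, start=1):
        timing = f"{format_timestamp(start, ',')} --> {format_timestamp(end, ',')}"
        blocks.append(f"{index}\n{timing}\n{text}\n")
    return "\n".join(blocks)


def to_vtt(cues):
    """Render (start, end, text) tuples as WebVTT: header line, period decimals."""
    blocks = ["WEBVTT\n"]
    for start, end, text in cues:
        timing = f"{format_timestamp(start, '.')} --> {format_timestamp(end, '.')}"
        blocks.append(f"{timing}\n{text}\n")
    return "\n".join(blocks)
```

Calling `to_srt([(0.0, 1.5, "Hello")])` and `to_vtt([(0.0, 1.5, "Hello")])` on the same list gives two sidecar files with identical timing. ASS styling and burned-in rendering need separate tooling and are not covered by this sketch.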

Accuracy claims can be misleading. Clean studio audio with common vocabulary is easy; noisy panels, jargon, names, tickers, and accents need custom vocabulary and human proofreading.
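One way to test an accuracy claim on your own material is word error rate (WER): substitutions, deletions, and insertions divided by the number of words in a human-checked reference. The sketch below is a simplified illustration, assuming whitespace tokenization and lowercase comparison with no punctuation handling; the function name is hypothetical, and production scoring usually normalizes text more carefully.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Word-level edit distance, keeping only the previous row of the table.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)
```

Scoring a short sample of your noisiest, most jargon-heavy audio this way tends to be more informative than a vendor's headline figure, because the errors that matter (names, tickers, numbers) show up directly.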

For short-form clips, captions are also a design system. Line length, safe-zone placement, contrast, timing, and style consistency determine whether viewers can follow without sound.
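Line length and timing can be checked automatically before a human looks at style. The following is a rough sketch, not a standard: the three thresholds are assumed placeholder values that you would replace with your own style guide, and `check_cue` is a hypothetical helper name.

```python
import textwrap

# Assumed placeholder limits; tune them to your platform and style guide.
MAX_CHARS_PER_LINE = 32
MAX_LINES = 2
MAX_CHARS_PER_SECOND = 17.0


def check_cue(text: str, start: float, end: float):
    """Wrap a cue to the line limit and report readability problems."""
    lines = textwrap.wrap(text, width=MAX_CHARS_PER_LINE)
    duration = end - start
    # Reading speed in characters per second; a zero-length cue is unreadable.
    cps = len(text) / duration if duration > 0 else float("inf")

    problems = []
    if len(lines) > MAX_LINES:
        problems.append(f"{len(lines)} lines exceeds the limit of {MAX_LINES}")
    if cps > MAX_CHARS_PER_SECOND:
        problems.append(f"{cps:.1f} chars/sec exceeds {MAX_CHARS_PER_SECOND}")
    return lines, problems
```

Running this over every cue flags the ones that need to be split or retimed. Safe-zone placement, contrast, and style consistency still need a visual check on the target platform.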

Operating workflow

Step 1

Choose source path: local audio, local video, or URL.

Step 2

Transcribe with local AI, cloud AI, no-code tooling, or a human service, depending on how much risk an error carries.

Step 3

Export SRT, VTT, ASS, transcript text, or burned-in MP4 according to platform need.

Step 4

Proof names, jargon, numbers, punctuation, timing, and speaker labels.

Step 5

Check accessibility, safe zones, and readability before publishing.
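Part of the Step 5 accessibility check can be scripted. The sketch below computes the WCAG 2.x contrast ratio between caption text and a solid background color, using the relative-luminance formula from the WCAG definition. It assumes a flat caption box; text drawn directly over moving video has no single background color, so this is only a first filter. The function names are hypothetical.

```python
def relative_luminance(rgb):
    """WCAG 2.x relative luminance for an 8-bit sRGB color (r, g, b)."""
    def linearize(channel):
        c = channel / 255
        return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4

    r, g, b = (linearize(value) for value in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


def contrast_ratio(foreground, background):
    """Contrast ratio (lighter + 0.05) / (darker + 0.05), from 1.0 up to 21.0."""
    lum_a = relative_luminance(foreground)
    lum_b = relative_luminance(background)
    lighter, darker = max(lum_a, lum_b), min(lum_a, lum_b)
    return (lighter + 0.05) / (darker + 0.05)
```

White text on a black box gives the maximum ratio of 21:1. WCAG level AA asks for at least 4.5:1 for normal-size text, so `contrast_ratio(text_color, box_color) >= 4.5` is a reasonable gate for a caption template.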

Tool and option comparison

Tool / Option	Best for	Output options	Strength	Watch-out
Whisper local/API	Developers and private/local workflows	TXT, SRT, VTT, JSON by implementation	Strong baseline transcription and broad ecosystem	Needs setup and proofreading for jargon
AssemblyAI	API transcription and diarization	JSON, SRT, VTT	Speaker labels, word timings, custom vocabulary	Async polling and API cost management
Deepgram	Low-latency API workflows	JSON, captions by tooling	Fast speech-to-text and developer controls	Requires integration work
Descript	Non-coders editing by transcript	Transcript, captions, video export	Edit media by editing text	Less ideal for code pipelines
Otter	Meetings and simple drag/drop transcripts	Transcript exports	Easy no-code workflow	Not built for stylized social captions
Rev / GoTranscript / 3PlayMedia	High-stakes or compliance-heavy material	Human transcript and captions	Human review and higher reliability	Slower and more expensive
Submagic / CapCut / Captions.ai / Veed	Burned-in short-form captions	Styled MP4 exports	Fast social caption styling	Caption correctness still needs manual review
yt-dlp + transcription	YouTube and web media ingestion	Audio files, existing captions	Flexible URL workflows	Must respect rights, access, and platform terms

Reference snippets

OpenAI speech-to-text request

curl --request POST \
  --url https://api.openai.com/v1/audio/transcriptions \
  --header "Authorization: Bearer $OPENAI_API_KEY" \
  --header "Content-Type: multipart/form-data" \
  --form file=@audio.mp3 \
  --form model=gpt-4o-transcribe

Python speech-to-text request

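from openai import OpenAI

# The client reads the API key from the OPENAI_API_KEY environment variable.
client = OpenAI()

# Open the file in binary mode so the SDK can upload it as form data.
with open("audio.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="gpt-4o-transcribe",
        file=audio_file,
    )

# The default response exposes the transcript as .text.
print(transcript.text)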

Whisper word timestamps

curl https://api.openai.com/v1/audio/transcriptions \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: multipart/form-data" \
  -F file="@audio.mp3" \
  -F model="whisper-1" \
  -F response_format="verbose_json" \
  -F "timestamp_granularities[]=word"
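Word-level timestamps are most useful once they are grouped into caption cues. The sketch below assumes the response has been parsed into a list of dictionaries with `word`, `start`, and `end` keys; verify that shape against the actual response before relying on it. The helper names and the `max_chars` and `max_gap` defaults are illustrative assumptions, not recommended values.

```python
def make_cue(words):
    """Build one (start, end, text) cue from consecutive word entries."""
    text = " ".join(w["word"].strip() for w in words)
    return (words[0]["start"], words[-1]["end"], text)


def group_words_into_cues(words, max_chars=32, max_gap=0.6):
    """Start a new cue when the text would get too long or the speaker pauses."""
    cues = []
    current = []  # words collected for the cue being built
    for word in words:
        if current:
            candidate_text = make_cue(current + [word])[2]
            gap = word["start"] - current[-1]["end"]
            if len(candidate_text) > max_chars or gap > max_gap:
                cues.append(make_cue(current))
                current = []
        current.append(word)
    if current:
        cues.append(make_cue(current))
    return cues
```

The resulting tuples can be passed to any SRT or VTT writer. Breaking on pauses keeps cues aligned with natural phrasing, which usually reads better than splitting strictly by character count.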

Extract audio before transcription

ffmpeg -i input.mp4 -vn -ac 1 -ar 16000 audio.wav

Download audio from a URL for local review

yt-dlp -x --audio-format mp3 "https://example.com/video"

Lessons

Shareable learning path

18 lessons

Module A / Python + cURL

Speech-to-text API in 5 minutes: GPT-4o Transcribe and Whisper

Send an audio file to a speech-to-text API and return transcript text plus captions.

Speech-to-text API in 5 minutes: GPT-4o Transcribe and Whisper

MacWhisper and whisper.cpp for fully local transcription

Otter and Descript drag-and-drop for non-coders

Extracting audio with FFmpeg, then transcribing

Descript end-to-end: import, edit transcript, export captions

Speaker diarization: getting Speaker 1 and Speaker 2 right

yt-dlp + Whisper for YouTube

Twitch VODs, Twitter/X, Spotify, TikTok: what works and what does not

No-code URL-to-transcript tools

SRT, VTT, ASS, TTML decoded

Which platforms accept which caption formats

Submagic vs. CapCut vs. Captions.ai vs. Veed

Karaoke highlight, single-word pop, emoji injection: when each works

Building a brand-locked caption template

Custom vocabulary: stop the AI from mishearing your jargon

The 7 most common transcription errors and how to catch them

WCAG 2.x and platform accessibility rules

Multi-language workflows and translated captions
