Module A / Python + cURL
Speech-to-text API in 5 minutes: GPT-4o Transcribe and Whisper
Send an audio file to a speech-to-text API and return transcript text plus captions.
Section 3
Transcribe audio, video, and URLs, then ship captions that viewers can actually read.
Accurate transcripts and platform-ready captions for short-form clips, long-form uploads, and review workflows.
Who it is for
Creators, editors, accessibility reviewers, and developers handling MP3s, videos, URLs, SRT, VTT, ASS, and burned-in captions.
Time to first value
First transcript in 5-15 minutes
Lessons in this track
18 resources
Concept primer
Caption work starts by choosing the right transcription path for the source. Local audio, local video, and online URLs each have different failure points.
Formats matter. SRT is simple and widely supported, VTT fits web video, ASS supports styling, and burned-in captions are often required for social feeds where sidecar files are ignored.
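To make the difference between SRT and VTT concrete, here is a minimal sketch that writes the same cues in both formats. The helper names (`format_timestamp`, `to_srt`, `to_vtt`) and the `(start, end, text)` cue shape are assumptions made for illustration, not part of any tool listed on this page.

```python
def format_timestamp(seconds, separator):
    """Format seconds as HH:MM:SS<separator>mmm."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}{separator}{ms:03d}"


def to_srt(cues):
    """SRT: numbered cues, comma before milliseconds, blank line between cues."""
    blocks = []
    for index, (start, end, text) in enumerate(cues, start=1):
        blocks.append(
            f"{index}\n"
            f"{format_timestamp(start, ',')} --> {format_timestamp(end, ',')}\n"
            f"{text}\n"
        )
    return "\n".join(blocks)


def to_vtt(cues):
    """WebVTT: 'WEBVTT' header, dot before milliseconds, cue numbers optional."""
    blocks = ["WEBVTT\n"]
    for start, end, text in cues:
        blocks.append(
            f"{format_timestamp(start, '.')} --> {format_timestamp(end, '.')}\n"
            f"{text}\n"
        )
    return "\n".join(blocks)


# Hypothetical cues: (start seconds, end seconds, caption text).
cues = [(0.0, 2.5, "Welcome back."), (2.5, 5.04, "Today we cover captions.")]
print(to_srt(cues))
print(to_vtt(cues))
```

The two outputs differ only in the header, the cue numbering, and the millisecond separator, which is why converting between SRT and VTT is usually mechanical while ASS styling and burned-in captions need dedicated tooling.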
Accuracy claims can be misleading. Clean studio audio with common vocabulary is easy; noisy panels, jargon, names, tickers, and accents need custom vocabulary and human proofreading.
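One way to support the human proofreading pass is to flag words that almost, but not exactly, match a known term. The sketch below uses Python's standard `difflib`; the function name `flag_possible_mishearings`, the sample glossary, and the 0.8 similarity cutoff are illustrative assumptions, and the output is a list of suggestions for a reviewer rather than automatic corrections.

```python
import difflib
import re


def flag_possible_mishearings(transcript, glossary, cutoff=0.8):
    """Return (heard, suggested) pairs for words that nearly match a glossary term."""
    known = {term.lower(): term for term in glossary}
    flagged = []
    for word in re.findall(r"[A-Za-z0-9']+", transcript):
        lower = word.lower()
        if lower in known:
            continue  # exact match (ignoring case), nothing to flag
        matches = difflib.get_close_matches(lower, list(known), n=1, cutoff=cutoff)
        if matches:
            flagged.append((word, known[matches[0]]))
    return flagged


# Hypothetical glossary and transcript for illustration.
glossary = ["Deepgram", "AssemblyAI", "Whisper"]
transcript = "We tested Deepgrem and Wisper on the panel audio."
for heard, suggested in flag_possible_mishearings(transcript, glossary):
    print(f"Check '{heard}': did the speaker say '{suggested}'?")
```

A character-similarity check like this only catches near-spellings; terms that are misheard as entirely different words still need a human listening pass.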
For short-form clips, captions are also a design system. Line length, safe-zone placement, contrast, timing, and style consistency determine whether viewers can follow without sound.
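Line length and timing can be checked mechanically before a human reviews style. The following is a rough sketch; the limits (32 characters per line, 2 lines, 17 characters per second) are placeholder assumptions to tune per platform and audience, not rules from any platform, and `check_cue` is a hypothetical helper name.

```python
import textwrap

# Placeholder limits, assumed for illustration only.
MAX_CHARS_PER_LINE = 32
MAX_LINES = 2
MAX_CHARS_PER_SECOND = 17.0


def check_cue(text, start, end):
    """Wrap a cue into lines and list any readability problems found."""
    lines = textwrap.wrap(text, width=MAX_CHARS_PER_LINE)
    duration = end - start
    problems = []
    if len(lines) > MAX_LINES:
        problems.append(f"{len(lines)} lines, limit is {MAX_LINES}")
    if duration <= 0:
        problems.append("cue has no positive duration")
    elif len(text) / duration > MAX_CHARS_PER_SECOND:
        rate = len(text) / duration
        problems.append(f"{rate:.1f} chars/sec, limit is {MAX_CHARS_PER_SECOND}")
    return lines, problems


lines, problems = check_cue("Captions should stay short and readable.", 0.0, 3.0)
print(lines)     # wrapped lines for display
print(problems)  # empty list when the cue passes every check
```

Contrast, safe-zone placement, and style consistency are visual properties, so they still need to be reviewed on the rendered video.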
Operating workflow
Step 1
Choose source path: local audio, local video, or URL.
Step 2
Transcribe with local AI, cloud AI, no-code tooling, or a human service, chosen according to how risky errors would be for the material.
Step 3
Export SRT, VTT, ASS, transcript text, or burned-in MP4 according to platform need.
Step 4
Proofread names, jargon, numbers, punctuation, timing, and speaker labels.
Step 5
Check accessibility, safe zones, and readability before publishing.
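Step 1 can be sketched as a small routing function that decides which path a source should take. The function name `classify_source` is hypothetical, and the extension sets only cover the formats named in this track (MP3, WAV, MP4, MOV, MKV); a real pipeline would likely accept more.

```python
from pathlib import Path
from urllib.parse import urlparse

# Extensions limited to the formats mentioned in this track (an assumption).
AUDIO_EXTENSIONS = {".mp3", ".wav"}
VIDEO_EXTENSIONS = {".mp4", ".mov", ".mkv"}


def classify_source(source):
    """Return 'url', 'audio', or 'video' for a source string."""
    if urlparse(source).scheme in ("http", "https"):
        return "url"  # download first, subject to rights and platform terms
    suffix = Path(source).suffix.lower()
    if suffix in AUDIO_EXTENSIONS:
        return "audio"  # can go straight to transcription
    if suffix in VIDEO_EXTENSIONS:
        return "video"  # extract the audio track before transcription
    raise ValueError(f"Unrecognized source type: {source!r}")


for source in ["talk.mp3", "clips/interview.mkv", "https://example.com/video"]:
    print(source, "->", classify_source(source))
```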
| Tool / Option | Best for | Output options | Strength | Watch-out |
|---|---|---|---|---|
| Whisper local/API | Developers and private/local workflows | TXT, SRT, VTT, JSON (depends on implementation) | Strong baseline transcription and broad ecosystem | Needs setup and proofreading for jargon |
| AssemblyAI | API transcription and diarization | JSON, SRT, VTT | Speaker labels, word timings, custom vocabulary | Async polling and API cost management |
| Deepgram | Low-latency API workflows | JSON; captions via additional tooling | Fast speech-to-text and developer controls | Requires integration work |
| Descript | Non-coders editing by transcript | Transcript, captions, video export | Edit media by editing text | Less ideal for code pipelines |
| Otter | Meetings and simple drag/drop transcripts | Transcript exports | Easy no-code workflow | Not built for stylized social captions |
| Rev / GoTranscript / 3PlayMedia | High-stakes or compliance-heavy material | Human transcript and captions | Human review and higher reliability | Slower and more expensive |
| Submagic / CapCut / Captions.ai / Veed | Burned-in short-form captions | Styled MP4 exports | Fast social caption styling | Caption correctness still needs manual review |
| yt-dlp + transcription | YouTube and web media ingestion | Audio files, existing captions | Flexible URL workflows | Must respect rights, access, and platform terms |
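Several options above return word-level timings, which then have to be grouped into readable caption cues. The sketch below shows one simple grouping rule: start a new cue when the line would get too long or when the speaker pauses. The input shape (dicts with `word`, `start`, `end`), the helper names, and the thresholds are assumptions for illustration; check the actual response schema of the service you use.

```python
def _make_cue(words):
    """Collapse a run of word dicts into one (start, end, text) cue."""
    text = " ".join(w["word"] for w in words)
    return (words[0]["start"], words[-1]["end"], text)


def group_words(words, max_chars=32, max_gap=0.6):
    """Group word timings into cues, splitting on length or on a pause."""
    cues = []
    current = []
    for word in words:
        if current:
            candidate = " ".join(w["word"] for w in current) + " " + word["word"]
            gap = word["start"] - current[-1]["end"]
            if len(candidate) > max_chars or gap > max_gap:
                cues.append(_make_cue(current))
                current = []
        current.append(word)
    if current:
        cues.append(_make_cue(current))
    return cues


# Hypothetical word timings in seconds.
words = [
    {"word": "Ship", "start": 0.0, "end": 0.3},
    {"word": "readable", "start": 0.35, "end": 0.8},
    {"word": "captions", "start": 0.85, "end": 1.3},
    {"word": "Always", "start": 2.5, "end": 2.9},
]
for cue in group_words(words):
    print(cue)
```

Splitting on pauses tends to keep phrases together, but it does not know about sentence boundaries or speaker changes, so the grouped cues still need a proofreading pass.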
OpenAI speech-to-text request
curl --request POST \
--url https://api.openai.com/v1/audio/transcriptions \
--header "Authorization: Bearer $OPENAI_API_KEY" \
--header "Content-Type: multipart/form-data" \
--form file=@audio.mp3 \
--form model=gpt-4o-transcribe
Python speech-to-text request
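from openai import OpenAI

# The client reads the API key from the OPENAI_API_KEY environment variable.
client = OpenAI()

# Open the file in binary mode so the SDK can upload the raw bytes.
with open("audio.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="gpt-4o-transcribe",
        file=audio_file,
    )

print(transcript.text)
Whisper word timestamps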
curl https://api.openai.com/v1/audio/transcriptions \
-H "Authorization: Bearer $OPENAI_API_KEY" \
-H "Content-Type: multipart/form-data" \
-F file="@audio.mp3" \
-F model="whisper-1" \
-F response_format="verbose_json" \
-F "timestamp_granularities[]=word"
Extract audio before transcription
ffmpeg -i input.mp4 -vn -ac 1 -ar 16000 audio.wav
Download audio from a URL for local review
yt-dlp -x --audio-format mp3 "https://example.com/video"
Lessons
18 lessons
Track / Format
Module A / Python + cURL
Send an audio file to a speech-to-text API and return transcript text plus captions.
Module A / Guide
Transcribe private or large audio locally and export captions without cloud upload.
Module A / Walkthrough
Create a transcript from an MP3 or WAV without command-line setup.
Module B / Code recipe
Turn MP4, MOV, or MKV into clean audio before transcription.
Module B / Video + article
Use transcript-driven editing and export captions after corrections.
Module B / Guide
Identify speakers reliably in interviews, panels, and podcasts.
Module C / Code recipe
Download permissible YouTube audio or captions and produce a transcript.
Module C / Reference
Understand URL ingestion limits, cookies, RSS workarounds, and platform constraints.
Module C / Comparison
Pick a browser-based workflow when command-line tools are unnecessary.
Module D / Reference
Choose the right caption format for editing, upload, web playback, and styling.
Module D / Matrix
Map YouTube, LinkedIn, Vimeo, TikTok, Reels, Shorts, and X to the right delivery method.
Module E / Comparison
Compare burned-in caption tools by speed, style, correction workflow, and export quality.
Module E / Style guide
Use caption motion and emphasis intentionally instead of chasing every trend.
Module E / Template
Create repeatable caption fonts, colors, positions, strokes, and safe zones.
Module F / Guide
Feed names, protocols, tickers, product words, and community terms into transcription review.
Module F / Checklist
Find misheard jargon, speaker errors, off-sync timing, long lines, and hallucinated words.
Module F / Reference
Understand accessibility basics for captions, flashing content, transcript availability, and readability.
Module F / Workflow
Translate captions while preserving timing, speaker meaning, and platform formatting.
Cheat sheet
Further reading
What to learn next