Module A / Python + cURL / 10-20 min
Speech-to-text API in 5 minutes: GPT-4o Transcribe and Whisper
Send an audio file to a speech-to-text API and return transcript text plus captions.
TL;DR
Send a short audio file to a speech-to-text API, get back transcript text and timed captions, then proofread the result before publishing. Treat the steps as practical guidance, not a rigid rulebook.
Why it matters
Captions make clips understandable without sound, searchable after publishing, and reviewable by editors before export. The goal is to help you make a stronger clip without taking away your creative freedom.
What you will learn
Prerequisites
- An audio file, video file, URL, or exported clip
- A target output format such as SRT, VTT, burned-in MP4, or transcript text
What you need
Core concept
Caption work is part accuracy and part design. The workflow only works if viewers can read the result quickly on a phone.
Example
Scenario
Auto-captions are mostly correct, but the clip contains names, numbers, jargon, or fast speech.
Move
Apply the workflow to a short section first and proofread the result at phone size.
Result
The caption pass becomes readable and accurate enough that sound-off viewers can follow the clip.
How to do it
1. Prepare a short audio file and keep the first request small so setup errors are easy to diagnose.
2. Send the file to the speech-to-text API and request transcript text plus timed caption output when available (with whisper-1, response_format can be set to srt, vtt, or verbose_json).
3. Check names, jargon, and numbers manually; Whisper is strong, but it is not a final proofreader.
4. Export SRT or VTT only if the publishing workflow accepts sidecar captions.
5. Save the transcript, captions, and request settings so the same workflow can be repeated.
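The request-and-save steps above can be sketched in Python as follows. This is a minimal sketch, not a definitive implementation: the function name transcribe_and_save, the output file names, and the settings layout are hypothetical, and it assumes the client behaves like openai.OpenAI() and returns plain SRT text when whisper-1 is called with response_format set to srt.

```python
import json
from pathlib import Path


def transcribe_and_save(client, audio_path, out_dir, model="whisper-1"):
    """Request SRT captions for one audio file and save them with the settings used.

    `client` is assumed to behave like `openai.OpenAI()`. The function name,
    file names, and settings layout are illustrative, not part of any API.
    """
    audio_path = Path(audio_path)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    # Settings sent with the request; saved later so the run can be repeated.
    settings = {"model": model, "response_format": "srt"}

    with audio_path.open("rb") as audio_file:
        # Assumption: with response_format="srt" the call returns SRT text.
        srt_text = client.audio.transcriptions.create(file=audio_file, **settings)

    srt_path = out_dir / f"{audio_path.stem}.srt"
    srt_path.write_text(str(srt_text), encoding="utf-8")

    # Store the source file name and request settings next to the captions.
    settings_path = out_dir / f"{audio_path.stem}.settings.json"
    settings_path.write_text(
        json.dumps({"source": audio_path.name, **settings}, indent=2),
        encoding="utf-8",
    )
    return srt_path, settings_path
```

With the openai package installed and OPENAI_API_KEY set, a call such as transcribe_and_save(OpenAI(), "audio.mp3", "captions") would write audio.srt and audio.settings.json into the captions folder. Passing the client in as an argument keeps the function easy to try against a stub before spending API credits.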
Expected output
A saved transcript plus at least one caption output file or timed transcript that can be checked against the source audio.
Practice task
Produce a clean caption pass
1. Take a 20-30 second section of a real clip.
2. Apply the caption or transcript workflow from this lesson.
3. Proofread it with sound on, then watch it again with sound off at phone size.
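Before the sound-off pass, a small script can point at cues that are likely to be hard to read on a phone. The sketch below parses SRT text and flags cues with long lines or a high reading speed. The function names and both thresholds (32 characters per line, 17 characters per second) are assumptions chosen for illustration, not values from this lesson, and the parser only handles plain SRT cues without positioning data.

```python
import re

# Matches SRT (comma) and VTT-style (dot) timestamps such as 00:00:02,500.
TIME_RE = re.compile(r"(\d+):(\d{2}):(\d{2})[,.](\d{3})")


def to_seconds(stamp):
    """Convert an HH:MM:SS,mmm timestamp to seconds."""
    h, m, s, ms = (int(g) for g in TIME_RE.fullmatch(stamp.strip()).groups())
    return h * 3600 + m * 60 + s + ms / 1000


def flag_hard_to_read(srt_text, max_line_chars=32, max_chars_per_second=17.0):
    """Return (cue_number, reason) pairs for cues that exceed the assumed limits."""
    flagged = []
    for block in re.split(r"\n\s*\n", srt_text.strip()):
        lines = block.strip().splitlines()
        if len(lines) < 3 or "-->" not in lines[1]:
            continue  # skip anything that is not a complete cue
        number = lines[0].strip()
        start, end = (to_seconds(part) for part in lines[1].split("-->"))
        text_lines = lines[2:]
        duration = max(end - start, 0.001)  # avoid dividing by zero
        total_chars = sum(len(t) for t in text_lines)
        if any(len(t) > max_line_chars for t in text_lines):
            flagged.append((number, "line too long"))
        if total_chars / duration > max_chars_per_second:
            flagged.append((number, "reads too fast"))
    return flagged
```

A flagged cue is a prompt to look, not a verdict: the final check is still watching the clip at phone size with the sound off.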
Check your work
Common mistakes and fixes
Troubleshooting
Related resources
Reference snippets
OpenAI speech-to-text request
curl --request POST \
--url https://api.openai.com/v1/audio/transcriptions \
--header "Authorization: Bearer $OPENAI_API_KEY" \
--header "Content-Type: multipart/form-data" \
--form file=@audio.mp3 \
--form model=gpt-4o-transcribe

Python speech-to-text request
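from openai import OpenAI

# The client reads the API key from the OPENAI_API_KEY environment variable.
client = OpenAI()

# Open the audio in binary mode and send it for transcription.
with open("audio.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="gpt-4o-transcribe",
        file=audio_file,
    )

print(transcript.text)

Whisper word timestamps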
curl https://api.openai.com/v1/audio/transcriptions \
-H "Authorization: Bearer $OPENAI_API_KEY" \
-H "Content-Type: multipart/form-data" \
-F file="@audio.mp3" \
-F model="whisper-1" \
-F response_format="verbose_json" \
-F "timestamp_granularities[]=word"

Extract audio before transcription
ffmpeg -i input.mp4 -vn -ac 1 -ar 16000 audio.wav

Download audio from a URL for local review
yt-dlp -x --audio-format mp3 "https://example.com/video"
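Word timestamps to SRT cues

The word-level output from the Whisper word timestamps request can be grouped into caption cues. The sketch below assumes the parsed JSON response carries a top-level words list whose entries have word, start, and end fields (start and end in seconds). The function names and the grouping limits (32 characters, 3 seconds per cue) are hypothetical choices for illustration, not the API's own behavior.

```python
def format_timestamp(seconds):
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def words_to_srt(words, max_chars=32, max_duration=3.0):
    """Group word dicts ({'word', 'start', 'end'}) into SRT cue text."""
    cues = []
    current = []
    for w in words:
        candidate = " ".join(x["word"].strip() for x in current + [w])
        too_long = len(candidate) > max_chars
        too_slow = bool(current) and (w["end"] - current[0]["start"]) > max_duration
        # Start a new cue when adding this word would break either limit.
        if current and (too_long or too_slow):
            cues.append(current)
            current = []
        current.append(w)
    if current:
        cues.append(current)

    blocks = []
    for index, cue in enumerate(cues, start=1):
        text = " ".join(x["word"].strip() for x in cue)
        start = format_timestamp(cue[0]["start"])
        end = format_timestamp(cue[-1]["end"])
        blocks.append(f"{index}\n{start} --> {end}\n{text}\n")
    return "\n".join(blocks)
```

Grouping by both length and duration keeps each cue short enough to read at phone size. Names, jargon, and numbers in the generated cues still need the manual check described in the steps.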