Module A / Python + cURL / 10-20 min

Speech-to-text API in 5 minutes: GPT-4o Transcribe and Whisper

Send an audio file to a speech-to-text API and return transcript text plus captions.

TL;DR

Send a short audio file to a speech-to-text API, save the transcript, and produce captions you can check against the source audio. Treat the steps as practical guidance, not a rigid rulebook.

Why it matters

Captions make clips understandable without sound, searchable after publishing, and reviewable by editors before export. The goal is to help you make a stronger clip without taking away your creative freedom.

What you will learn

Decide whether a clip needs a plain transcript, a sidecar caption file, or burned-in captions.

Produce a usable transcript, caption file, or burned-in caption pass for one clip.

Catch the caption mistakes that most often hurt readability, accuracy, or platform fit.

Prerequisites

An audio file, video file, URL, or exported clip
A target output format such as SRT, VTT, burned-in MP4, or transcript text

What you need

A short WAV, MP3, or M4A sample under one minute.

A speech-to-text API key and terminal access.

cURL or Python installed locally.

A folder where you can save transcript and caption outputs.
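
Before the first request, it can help to confirm the items above in one pass. The following is a minimal sketch, not part of the original lesson: the function name preflight is hypothetical, the extension list mirrors the formats named above, and the 25 MB ceiling is the upload limit OpenAI documents for this endpoint at the time of writing.

```python
import os
from pathlib import Path

# Assumptions for this sketch: extensions come from the list above, and
# 25 MB is the documented upload limit for the OpenAI transcription endpoint.
ALLOWED_EXTENSIONS = {".wav", ".mp3", ".m4a"}
MAX_BYTES = 25 * 1024 * 1024


def preflight(audio_path, out_dir, env=None):
    """Return a list of human-readable problems; an empty list means ready."""
    env = os.environ if env is None else env
    problems = []
    path = Path(audio_path)
    if not path.is_file():
        problems.append(f"audio file not found: {path}")
    else:
        if path.suffix.lower() not in ALLOWED_EXTENSIONS:
            problems.append(f"unexpected extension: {path.suffix}")
        if path.stat().st_size > MAX_BYTES:
            problems.append("file is larger than 25 MB")
    # The OpenAI client reads its key from this environment variable by default.
    if not env.get("OPENAI_API_KEY"):
        problems.append("OPENAI_API_KEY is not set")
    if not Path(out_dir).is_dir():
        problems.append(f"output folder not found: {out_dir}")
    return problems
```

Calling preflight("audio.mp3", "outputs") and printing the returned list gives a quick checklist of what still needs fixing before any request is sent.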

Core concept

Caption work is part accuracy and part design. The workflow only works if viewers can read the result quickly on a phone.

Example

Scenario

Auto-captions are mostly correct, but the clip contains names, numbers, jargon, or fast speech.

Move

Apply the workflow to a short section first and proofread the result at phone size.

Result

The caption pass becomes readable and accurate enough that sound-off viewers can follow the clip.

How to do it

1. Prepare a short audio file and keep the first request small so setup errors are easy to diagnose.
2. Send the file to the speech-to-text API and request transcript text plus timed caption output when available. At the time of writing, OpenAI returns SRT, VTT, and word timestamps from whisper-1, while gpt-4o-transcribe returns plain text or JSON; check the current API reference before relying on this.
3. Check names, jargon, and numbers manually; Whisper is strong, but it is not a final proofreader.
4. Export SRT or VTT only if the publishing workflow accepts sidecar captions.
5. Save the transcript, captions, and request settings so the same workflow can be repeated.
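
The last step, saving outputs together with the settings that produced them, can be sketched as below. This is an illustration under assumptions: the function name save_run, the file names, and the shape of the settings dictionary are hypothetical and not prescribed by the lesson.

```python
import json
from pathlib import Path


def save_run(out_dir, transcript_text, captions_text, settings, caption_ext="srt"):
    """Write transcript, captions, and request settings side by side."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    # File names are placeholders chosen for this sketch.
    paths = {
        "transcript": out / "transcript.txt",
        "captions": out / f"captions.{caption_ext}",
        "settings": out / "settings.json",
    }
    paths["transcript"].write_text(transcript_text, encoding="utf-8")
    paths["captions"].write_text(captions_text, encoding="utf-8")
    # Sorted keys keep the settings file stable between runs, which makes diffs readable.
    paths["settings"].write_text(
        json.dumps(settings, indent=2, sort_keys=True), encoding="utf-8"
    )
    return paths
```

Storing the model name and response format next to the outputs means a later run on a different clip can reuse the exact same request.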

Expected output

A saved transcript plus at least one caption output file or timed transcript that can be checked against the source audio.

Practice task

Produce a clean caption pass

1. Take a 20-30 second section of a real clip.
2. Apply the caption or transcript workflow from this lesson.
3. Proofread it with sound on, then watch it again with sound off at phone size.

Check your work

Names, numbers, jargon, acronyms, and claims are correct.

Caption lines are short, timed well, and readable on a phone.

The final export uses the caption method the target platform will actually show.
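
Part of the readability check can be automated before the phone-size review. The sketch below scans SRT text and flags cues that look too long or too fast. The function name check_srt is hypothetical, and the thresholds (42 characters per line, two lines per cue, 17 characters per second) are placeholder defaults chosen for illustration, not values taken from this lesson; adjust them to the target platform.

```python
import re

# Matches an SRT timing line such as "00:00:01,000 --> 00:00:03,500".
TIMING = re.compile(
    r"(\d+):(\d{2}):(\d{2})[,.](\d{3})\s*-->\s*(\d+):(\d{2}):(\d{2})[,.](\d{3})"
)


def _seconds(h, m, s, ms):
    return int(h) * 3600 + int(m) * 60 + int(s) + int(ms) / 1000


def check_srt(srt_text, max_chars=42, max_lines=2, max_cps=17.0):
    """Return (cue_position, message) pairs for cues that may be hard to read."""
    warnings = []
    if not srt_text.strip():
        return warnings
    # Cues are separated by blank lines; position is counted from 1 in file order.
    blocks = re.split(r"\n\s*\n", srt_text.strip())
    for number, block in enumerate(blocks, start=1):
        lines = [ln.strip() for ln in block.splitlines() if ln.strip()]
        timing_at = next((i for i, ln in enumerate(lines) if TIMING.search(ln)), None)
        if timing_at is None:
            warnings.append((number, "no timing line found"))
            continue
        g = TIMING.search(lines[timing_at]).groups()
        duration = _seconds(*g[4:]) - _seconds(*g[:4])
        text_lines = lines[timing_at + 1:]
        if len(text_lines) > max_lines:
            warnings.append((number, f"{len(text_lines)} lines"))
        for ln in text_lines:
            if len(ln) > max_chars:
                warnings.append((number, f"line has {len(ln)} characters"))
        chars = sum(len(ln) for ln in text_lines)
        if duration <= 0:
            warnings.append((number, "end time is not after start time"))
        elif chars / duration > max_cps:
            warnings.append((number, f"{chars / duration:.1f} characters per second"))
    return warnings
```

An empty result does not mean the captions are correct; it only means no cue crossed the chosen thresholds, so the manual proofread still applies.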

Common mistakes and fixes

Do not finish this lesson without checking the exact words against the audio.

Do not let long caption blocks fill the screen on mobile.

Do not ignore names, numbers, acronyms, tickers, and niche terms.

Do not assume every platform will show sidecar caption files.

Do not export before checking caption placement at phone size.

Troubleshooting

If upload fails, test a smaller audio file and confirm the file path, model name, and API key.

If captions have wrong names or jargon, add a custom vocabulary note where supported and still proofread manually.

If timing is missing, request a timed output format (with whisper-1, set response_format to srt, vtt, or verbose_json) or use a transcription provider that returns word-level timestamps.
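
For the custom vocabulary note, one option with the OpenAI endpoint is its optional prompt field, which accepts a short piece of text to guide spelling. The sketch below builds such a string from a term list. The function name build_vocab_prompt and the 400-character cap are assumptions made for illustration; how strongly a prompt steers the output varies by model, so manual proofreading still applies.

```python
def build_vocab_prompt(terms, max_chars=400):
    """Join unique, non-empty terms into one short prompt string."""
    seen = set()
    kept = []
    length = 0
    for term in terms:
        term = term.strip()
        key = term.lower()
        # Skip blanks and case-insensitive duplicates.
        if not term or key in seen:
            continue
        extra = len(term) + (2 if kept else 0)  # 2 accounts for the ", " separator
        if length + extra > max_chars:
            break  # stop at a term boundary rather than cutting a word in half
        seen.add(key)
        kept.append(term)
        length += extra
    return ", ".join(kept)
```

The returned string can be passed as prompt=build_vocab_prompt(terms) in the Python request, or as an extra form field named prompt in the cURL request.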

Related resources

Caption QA checklist
SRT/VTT/ASS format reference
Custom vocabulary template

Reference snippets

OpenAI speech-to-text request

curl --request POST \
  --url https://api.openai.com/v1/audio/transcriptions \
  --header "Authorization: Bearer $OPENAI_API_KEY" \
  --header "Content-Type: multipart/form-data" \
  --form file=@audio.mp3 \
  --form model=gpt-4o-transcribe

Python speech-to-text request

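from openai import OpenAI

# The client reads the API key from the OPENAI_API_KEY environment variable.
client = OpenAI()

# Open the audio in binary mode; the SDK uploads it as multipart form data.
with open("audio.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="gpt-4o-transcribe",
        file=audio_file,
    )

# The default JSON response exposes the transcript as .text.
print(transcript.text)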

Whisper word timestamps

curl https://api.openai.com/v1/audio/transcriptions \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: multipart/form-data" \
  -F file="@audio.mp3" \
  -F model="whisper-1" \
  -F response_format="verbose_json" \
  -F "timestamp_granularities[]=word"
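
Word timestamps are not captions yet; they still need to be grouped into short cues. The sketch below shows one way to do that, assuming the saved JSON response contains a words list whose entries have word, start, and end fields, with times in seconds. The function names and the grouping limits (32 characters, 3 seconds per cue) are hypothetical choices for illustration.

```python
def srt_time(seconds):
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"


def words_to_srt(words, max_chars=32, max_seconds=3.0):
    """Group timed words into cues and return them as SRT text."""
    cues = []
    current = []
    for w in words:
        candidate = " ".join(x["word"].strip() for x in current + [w])
        too_long = len(candidate) > max_chars
        too_slow = bool(current) and w["end"] - current[0]["start"] > max_seconds
        # Start a new cue when adding this word would break either limit.
        if current and (too_long or too_slow):
            cues.append(current)
            current = []
        current.append(w)
    if current:
        cues.append(current)
    blocks = []
    for i, cue in enumerate(cues, start=1):
        text = " ".join(x["word"].strip() for x in cue)
        timing = f"{srt_time(cue[0]['start'])} --> {srt_time(cue[-1]['end'])}"
        blocks.append(f"{i}\n{timing}\n{text}")
    return "\n\n".join(blocks) + ("\n" if blocks else "")
```

With the cURL output saved to a file, json.load followed by words_to_srt(data["words"]) would produce caption text to proofread. The Python SDK returns objects rather than dictionaries, so attribute access would be needed there.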

Extract audio before transcription

ffmpeg -i input.mp4 -vn -ac 1 -ar 16000 audio.wav

Download audio from a URL for local review

yt-dlp -x --audio-format mp3 "https://example.com/video"