Module A / Walkthrough / 10-20 min
Otter and Descript drag-and-drop for non-coders
Create a transcript from an MP3 or WAV without command-line setup.
TL;DR
Upload an MP3 or WAV to a drag-and-drop tool such as Otter or Descript, correct the transcript inside the tool, and export it as text or captions. Treat the steps as practical guidance, not a rigid rulebook.
Why it matters
Captions make clips understandable without sound, searchable after publishing, and reviewable by editors before export. The goal is to help you make a stronger clip without taking away your creative freedom.
Prerequisites
- An audio file, video file, URL, or exported clip
- A target output format such as SRT, VTT, burned-in MP4, or transcript text
Core concept
Caption work is part accuracy and part design. The workflow only works if viewers can read the result quickly on a phone.
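One way to make "readable quickly on a phone" checkable is to measure line length and reading speed per caption. The sketch below is illustrative only: the `Cue` class, the function name, and both thresholds (42 characters per line, 17 characters per second) are assumptions, not values from this lesson or from Otter or Descript.

```python
from dataclasses import dataclass

# Assumed thresholds, not taken from this lesson; tune them for your platform.
MAX_CHARS_PER_LINE = 42
MAX_CHARS_PER_SECOND = 17.0


@dataclass
class Cue:
    start: float  # cue start, in seconds
    end: float    # cue end, in seconds
    text: str     # caption text; "\n" separates the lines of a two-line caption


def readability_issues(cue: Cue) -> list[str]:
    """Return human-readable warnings for one caption cue."""
    issues = []
    # Flag any single line that is too wide for a small screen.
    for line in cue.text.split("\n"):
        if len(line) > MAX_CHARS_PER_LINE:
            issues.append(f"line too long ({len(line)} chars): {line!r}")
    # Flag cues that show more text than a viewer can read in the time given.
    duration = cue.end - cue.start
    if duration <= 0:
        issues.append("cue has no duration")
    else:
        chars_per_second = len(cue.text.replace("\n", " ")) / duration
        if chars_per_second > MAX_CHARS_PER_SECOND:
            issues.append(f"too fast to read ({chars_per_second:.1f} chars/sec)")
    return issues


if __name__ == "__main__":
    cues = [
        Cue(0.0, 2.0, "Welcome back to the show."),
        Cue(2.0, 3.0, "Today we cover drag-and-drop transcription tools."),
    ]
    for cue in cues:
        for issue in readability_issues(cue):
            print(f"{cue.start:.1f}s: {issue}")
```

A check like this only narrows down where to look. Watching the clip at phone size remains the real test.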
Example
Scenario
Auto-captions are mostly correct, but the clip contains names, numbers, jargon, or fast speech.
Move
Apply the workflow to a short section first and proofread the result at phone size.
Result
The caption pass becomes readable and accurate enough that sound-off viewers can follow the clip.
How to do it
1. Use drag-and-drop transcription when speed and simplicity matter more than code-level control.
2. Upload a clean MP3 or WAV and wait for the transcript before editing.
3. Correct names, filler words, speaker labels, and paragraph breaks inside the tool.
4. Export transcript text or captions in the format your editor or platform needs.
5. Move to a more controlled workflow only if the no-code tool blocks your caption style, export, or privacy requirements.
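The correction step is normally done by hand inside the tool. If the same names keep coming out wrong across many exports, a small script over the exported transcript text is one possible shortcut. This is a minimal sketch: the glossary entries, the filler list, and the function name are all hypothetical, and the output still needs a human read.

```python
import re

# Hypothetical glossary: what the tool tends to produce -> the correct spelling.
CORRECTIONS = {
    "open ai": "OpenAI",
    "descript": "Descript",
    "otter": "Otter",
}

# Hypothetical filler list; keep fillers where they carry meaning or tone.
FILLERS = ["um", "uh"]


def clean_transcript(text: str) -> str:
    """Apply glossary corrections, then remove standalone filler words."""
    for wrong, right in CORRECTIONS.items():
        # \b keeps the match on whole words, so "otters" is left alone.
        text = re.sub(rf"\b{re.escape(wrong)}\b", right, text, flags=re.IGNORECASE)
    # Remove each filler together with the commas that set it off.
    fillers = "|".join(re.escape(word) for word in FILLERS)
    text = re.sub(rf",?\s*\b(?:{fillers})\b,?", "", text, flags=re.IGNORECASE)
    # Collapse any doubled spaces left behind.
    return re.sub(r"[ \t]{2,}", " ", text).strip()


if __name__ == "__main__":
    raw = "So we, um, tested otter and, uh, descript with open ai models."
    print(clean_transcript(raw))
    # prints: So we tested Otter and Descript with OpenAI models.
```

Blind find-and-replace can also change text that was already right, so compare the result against the audio before exporting.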
Expected output
A caption or transcript artifact that is proofread, timed, readable on a phone, and matched to the target platform.
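If the tool exports SRT but the target platform expects VTT, the two formats are close enough that a short script can bridge them. The sketch below assumes a well-formed SRT input and handles only the basics: it adds the `WEBVTT` header and switches the millisecond separator from a comma to a period. The function name is illustrative, and styling or positioning cues are out of scope.

```python
import re

# SRT timing line, e.g. "00:00:02,000 --> 00:00:04,500"
TIMING = re.compile(
    r"(\d{2}:\d{2}:\d{2}),(\d{3}) --> (\d{2}:\d{2}:\d{2}),(\d{3})"
)


def srt_to_vtt(srt_text: str) -> str:
    """Convert basic SRT caption text to WebVTT text."""
    # Normalize Windows line endings and drop a byte-order mark if present.
    body = srt_text.replace("\r\n", "\n").lstrip("\ufeff").strip()
    # WebVTT uses a period before the milliseconds; SRT uses a comma.
    body = TIMING.sub(r"\1.\2 --> \3.\4", body)
    # The numeric SRT counters are kept; WebVTT accepts them as cue identifiers.
    return "WEBVTT\n\n" + body + "\n"


if __name__ == "__main__":
    sample = (
        "1\n00:00:00,000 --> 00:00:02,000\nWelcome back.\n\n"
        "2\n00:00:02,000 --> 00:00:04,500\nLet's begin.\n"
    )
    print(srt_to_vtt(sample))
```

Many tools and platforms offer this conversion directly, so check the export menu first.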
Practice task
Produce a clean caption pass
1. Take a 20-30 second section of a real clip.
2. Apply the caption or transcript workflow from this lesson.
3. Proofread it with sound on, then watch it again with sound off at phone size.
Reference snippets
OpenAI speech-to-text request
curl --request POST \
--url https://api.openai.com/v1/audio/transcriptions \
--header "Authorization: Bearer $OPENAI_API_KEY" \
--header "Content-Type: multipart/form-data" \
--form file=@audio.mp3 \
--form model=gpt-4o-transcribe
Python speech-to-text request
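from openai import OpenAI
# The client reads the API key from the OPENAI_API_KEY environment variable.
client = OpenAI()
# Open the audio in binary mode and send it for transcription.
with open("audio.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="gpt-4o-transcribe",
        file=audio_file,
    )
print(transcript.text)
Whisper word timestamps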
curl https://api.openai.com/v1/audio/transcriptions \
-H "Authorization: Bearer $OPENAI_API_KEY" \
-H "Content-Type: multipart/form-data" \
-F file="@audio.mp3" \
-F model="whisper-1" \
-F response_format="verbose_json" \
-F "timestamp_granularities[]=word"
Extract audio before transcription
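# -vn drops the video stream, -ac 1 downmixes to mono, -ar 16000 resamples to 16 kHz
ffmpeg -i input.mp4 -vn -ac 1 -ar 16000 audio.wav
Download audio from a URL for local review
# -x keeps only the audio; --audio-format mp3 converts it to MP3 (requires ffmpeg)
yt-dlp -x --audio-format mp3 "https://example.com/video"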
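From word timestamps to SRT cues
The word-timestamps request above returns timing per word, which still has to be grouped into captions. The sketch below shows one possible grouping. It assumes the response contains a `words` list whose items carry `word`, `start`, and `end` fields; the length limit, the pause threshold, and the function names are illustrative choices, not part of any API.

```python
MAX_CHARS = 32   # assumed maximum characters per cue
MAX_GAP = 0.8    # assumed pause, in seconds, that forces a new cue


def format_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    ms = round(seconds * 1000)
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def words_to_srt(words: list[dict]) -> str:
    """Group timed words into short cues and render them as SRT text."""
    cues = []  # each cue is [start, end, text]
    for item in words:
        token = item["word"].strip()
        if not token:
            continue
        if cues:
            start, end, text = cues[-1]
            fits = len(text) + 1 + len(token) <= MAX_CHARS
            close = item["start"] - end <= MAX_GAP
            if fits and close:
                # Extend the current cue with this word.
                cues[-1] = [start, item["end"], f"{text} {token}"]
                continue
        # Otherwise start a new cue at this word.
        cues.append([item["start"], item["end"], token])
    blocks = [
        f"{index}\n{format_timestamp(start)} --> {format_timestamp(end)}\n{text}"
        for index, (start, end, text) in enumerate(cues, start=1)
    ]
    return "\n\n".join(blocks) + "\n" if blocks else ""


if __name__ == "__main__":
    # Hand-written sample in the assumed shape of the "words" list.
    sample_words = [
        {"word": "Hello", "start": 0.0, "end": 0.4},
        {"word": "world", "start": 0.5, "end": 0.9},
        {"word": "again", "start": 2.5, "end": 3.0},
    ]
    print(words_to_srt(sample_words))
```

Splitting on pauses and length is a rough heuristic. Breaking at sentence or phrase boundaries usually reads better and is worth a manual pass.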