Auto-Subtitles
Automatically transcribe speech and add word-level subtitles to any video.
Goal
Take a video with speech, transcribe the audio using AI, and overlay synchronized subtitles with word-by-word highlighting.
Full Example
from videopython import Video
from videopython.ai import AudioToText
from videopython.editing import TranscriptionOverlay
def add_subtitles(input_path: str, output_path: str):
    # Load video
    video = Video.from_path(input_path)

    # Transcribe audio
    transcriber = AudioToText()
    transcription = transcriber.transcribe(video)

    # Apply subtitle overlay. Geometry is resolution-independent by
    # default, so the same overlay works at any output resolution.
    overlay = TranscriptionOverlay(
        style="boxed",      # color/border/background/highlight preset
        region="bottom",    # top | center | bottom
        font_scale=0.055,   # fraction of frame height (auto-scales)
    )
    video = overlay.apply(video, transcription)

    # Save with burned-in subtitles
    video.save(output_path)

add_subtitles("interview.mp4", "interview_subtitled.mp4")
Step-by-Step Breakdown
1. Transcribe Audio
The result is a Transcription carrying segments with word-level timestamps:
for segment in transcription.segments:
    print(f"{segment.start:.2f}-{segment.end:.2f}: {segment.text}")
    for word in segment.words:
        print(f"  {word.word} ({word.start:.2f}-{word.end:.2f})")
See Text & Transcription for the full data classes.
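Because each segment carries its own start time, end time, and text, the same data can also be written to a sidecar subtitle file rather than burned into the frames. The following is a minimal sketch under stated assumptions: `Segment` is a hypothetical stand-in dataclass that mirrors only the `start`, `end`, and `text` attributes used in the loop above (it is not the videopython class), and `srt_timestamp` and `to_srt` are helper names invented for this illustration.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """Hypothetical stand-in with the attributes used above (not the library class)."""
    start: float
    end: float
    text: str


def srt_timestamp(seconds: float) -> str:
    """Format seconds as the HH:MM:SS,mmm timestamp used in SRT files."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments) -> str:
    """Render one numbered SRT cue per segment, separated by blank lines."""
    cues = []
    for index, segment in enumerate(segments, start=1):
        cues.append(
            f"{index}\n"
            f"{srt_timestamp(segment.start)} --> {srt_timestamp(segment.end)}\n"
            f"{segment.text.strip()}\n"
        )
    return "\n".join(cues)
```

Any object exposing `start`, `end`, and `text` should work with `to_srt`, so passing `transcription.segments` directly is plausible, though that has not been verified against the library here.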
Model options: tiny, base, small, medium, large, turbo
- Enable diarization with AudioToText(enable_diarization=True) when needed.
- VAD-gated language detection runs by default (enable_vad=True); pass enable_vad=False to skip it.
- Bias Whisper toward brand or proper-noun spellings with AudioToText(vocabulary=["Klarna", "Allegro"]). See Brand-name vocabulary biasing.
2. Configure Subtitle Style
overlay = TranscriptionOverlay(
    style="boxed",         # boxed | outline | clean | karaoke
    region="bottom",       # top | center | bottom
    font_scale=0.055,      # font height as a fraction of frame height
    font_filename=None,    # optional: path to a .ttf, or None for the bundled font
)
Key parameters (the recommended, resolution-independent surface):
- style -- A named look bundling text/highlight colors, border, and background so you express intent instead of a dozen numbers. boxed reproduces the historical defaults.
- region -- Which vertical safe-area band the box sits in: top, center, or bottom.
- font_scale -- Base font height as a fraction of frame height. Because it is relative, the same plan renders correctly whether the output is 480p or 4K, and an upstream face_crop/resize cannot make it overflow. It auto-shrinks toward min_font_scale if a cue would still not fit.
- font_filename -- Path to a TrueType font, or None for the bundled default.
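To make the relative sizing concrete, here is a rough sketch of how a fractional font_scale could map to pixels and shrink toward a floor when a cue is too wide. It is an illustration, not the library's algorithm: the helper names, the rounding, the 0.03 floor, the 0.6 character-width ratio, the 90% box width, and the single-line (no wrapping) cue are all assumptions made for this example.

```python
def font_height_px(font_scale: float, frame_height: int) -> int:
    """Convert a relative font scale into a pixel height for one frame size."""
    return max(1, round(font_scale * frame_height))


def fit_font_scale(
    text: str,
    frame_width: int,
    frame_height: int,
    font_scale: float = 0.055,
    min_font_scale: float = 0.03,  # assumed floor, not a documented default
    box_fraction: float = 0.9,     # assumed usable share of the frame width
    char_aspect: float = 0.6,      # assumed average glyph width / font height
    shrink: float = 0.9,           # assumed shrink factor per attempt
):
    """Return a scale at which a single-line cue fits, or None if none does."""
    max_width = box_fraction * frame_width
    scale = font_scale
    while True:
        estimated_width = len(text) * char_aspect * scale * frame_height
        if estimated_width <= max_width:
            return scale
        if scale <= min_font_scale:
            return None  # still too wide at the floor: report it up front
        scale = max(min_font_scale, scale * shrink)


if __name__ == "__main__":
    # Same scale, different outputs: the pixel size follows the frame height.
    for label, height in [("480p", 480), ("1080p", 1080), ("4K", 2160)]:
        print(f"{label}: {font_height_px(0.055, height)} px")
```

Returning None when nothing fits mirrors the idea of rejecting a plan before rendering instead of failing halfway through.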
Advanced overrides
The absolute fields (font_size, text_color, highlight_color,
background_color, position, anchor, box_width, margin, ...)
still exist as optional overrides for backward compatibility: leave them
unset to derive them from style/region/font_scale, or set one to pin it.
Avoid setting font_size -- an absolute size chosen without knowing the
final (post-transform) frame is exactly what used to overflow at render
time. With the relative surface, VideoEdit.validate() also rejects an
un-fittable plan up front instead of crashing mid-render.
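The "unset derives, set pins" rule can be pictured as a preset lookup followed by a selective merge. The sketch below shows only that pattern: `PRESETS` and `resolve_style` are hypothetical names, and every value in the table is a made-up placeholder, not a videopython default.

```python
# Hypothetical preset table. The field names echo the overrides listed above,
# but the values are placeholders invented for this example.
PRESETS = {
    "boxed": {"text_color": (255, 255, 255), "background_color": (0, 0, 0, 160)},
    "outline": {"text_color": (255, 255, 255), "background_color": None},
}


def resolve_style(style: str, **overrides) -> dict:
    """Start from the named preset, then apply only the overrides that were set."""
    if style not in PRESETS:
        raise ValueError(f"unknown style: {style!r}")
    resolved = dict(PRESETS[style])  # copy so the preset table is never mutated
    for name, value in overrides.items():
        if value is not None:  # None means "derive this field from the preset"
            resolved[name] = value
    return resolved
```

With this shape, `resolve_style("outline", text_color=(255, 255, 0))` pins one field and leaves the rest of the preset untouched.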
3. Apply Overlay
The overlay renders each word at its exact timestamp, highlighting the current word being spoken.
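Word-by-word highlighting comes down to answering, for each frame time, which word is being spoken. One way to do that lookup is a binary search over the word start times. This is a sketch of the idea rather than the overlay's actual implementation: `active_word_index` is a name invented here, words are plain `(text, start, end)` tuples, and they are assumed to be sorted by start time and non-overlapping.

```python
from bisect import bisect_right


def active_word_index(words, t):
    """Return the index of the word spoken at time t, or None during a gap.

    `words` is a list of (text, start, end) tuples, sorted by start time
    and non-overlapping (an assumption of this sketch).
    """
    starts = [start for _, start, _ in words]
    # Last word whose start is <= t, if there is one.
    index = bisect_right(starts, t) - 1
    if index >= 0 and t < words[index][2]:
        return index
    return None
```

A renderer could call this once per frame with `t = frame_number / fps` and draw the word at the returned index in the highlight color.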
Customization
Styling Options
# Minimal outlined subtitles near the bottom (no background box)
overlay = TranscriptionOverlay(style="clean", region="bottom")
# Bigger karaoke-style subtitles centered for short-form vertical video
overlay = TranscriptionOverlay(style="karaoke", region="center", font_scale=0.07)
# Need a specific look? Override individual fields on top of a preset:
overlay = TranscriptionOverlay(style="outline", text_color=(255, 255, 0))
Processing Long Videos
For long videos, transcription can take time. Consider processing in segments:
from videopython import Video
# Process first 5 minutes
video = Video.from_path("long_video.mp4", start_second=0, end_second=300)
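To cover a whole file rather than only its first five minutes, the same start_second/end_second pair can be stepped across the full duration. The helpers below are a sketch with invented names (`chunk_windows`, `shift_to_global`); they are pure Python and do not call videopython. The offset step assumes that timestamps from a chunk are relative to the start of that chunk, which is worth checking before relying on it.

```python
def chunk_windows(duration: float, chunk_seconds: float = 300.0):
    """Split [0, duration] into consecutive (start, end) windows in seconds."""
    if chunk_seconds <= 0:
        raise ValueError("chunk_seconds must be positive")
    duration = float(duration)
    windows = []
    start = 0.0
    while start < duration:
        end = min(start + chunk_seconds, duration)
        windows.append((start, end))
        start = end
    return windows


def shift_to_global(word_times, offset: float):
    """Move chunk-relative (text, start, end) tuples onto the full timeline."""
    return [(text, start + offset, end + offset) for text, start, end in word_times]
```

Each `(start, end)` pair can then be passed as start_second and end_second when loading a chunk. Words that straddle a window boundary may be cut off, so overlapping the windows slightly and removing duplicates is a possible refinement.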
Tips
- Font size: Prefer font_scale over absolute font_size. ~0.05-0.06 reads well across resolutions; bump toward 0.07-0.08 for mobile/short-form. It auto-scales, so you never re-tune it per output size.
- Contrast: The boxed/karaoke presets put text on a semi-transparent box that works on most backgrounds; outline/clean rely on a border instead.
- Position: region="bottom" is standard; region="center" suits short-form vertical video.
- Languages: Whisper supports 90+ languages. The API auto-detects language by default.