Skip to content

Auto-Subtitles

Automatically transcribe speech and add word-level subtitles to any video.

Goal

Take a video with speech, transcribe the audio using AI, and overlay synchronized subtitles with word-by-word highlighting.

Full Example

from videopython import Video
from videopython.ai import AudioToText
from videopython.editing import TranscriptionOverlay

def add_subtitles(input_path: str, output_path: str):
    # Load video
    video = Video.from_path(input_path)

    # Transcribe audio
    transcriber = AudioToText()
    transcription = transcriber.transcribe(video)

    # Apply subtitle overlay. Geometry is resolution-independent by
    # default, so the same overlay works at any output resolution.
    overlay = TranscriptionOverlay(
        style="boxed",      # color/border/background/highlight preset
        region="bottom",    # top | center | bottom
        font_scale=0.055,   # fraction of frame height (auto-scales)
    )
    video = overlay.apply(video, transcription)

    # Save with burned-in subtitles
    video.save(output_path)

add_subtitles("interview.mp4", "interview_subtitled.mp4")

Step-by-Step Breakdown

1. Transcribe Audio

transcriber = AudioToText()  # Local Whisper model
transcription = transcriber.transcribe(video)

The result is a Transcription carrying segments with word-level timestamps:

for segment in transcription.segments:
    print(f"{segment.start:.2f}-{segment.end:.2f}: {segment.text}")
    for word in segment.words:
        print(f"  {word.word} ({word.start:.2f}-{word.end:.2f})")

See Text & Transcription for the full data classes.

Model options:

  • tiny, base, small, medium, large, turbo
  • Enable diarization with AudioToText(enable_diarization=True) when needed.
  • VAD-gated language detection runs by default (enable_vad=True); pass enable_vad=False to skip it.
  • Bias Whisper toward brand or proper-noun spellings with AudioToText(vocabulary=["Klarna", "Allegro"]). See Brand-name vocabulary biasing.

2. Configure Subtitle Style

overlay = TranscriptionOverlay(
    style="boxed",       # boxed | outline | clean | karaoke
    region="bottom",     # top | center | bottom
    font_scale=0.055,    # font height as a fraction of frame height
    font_filename=None,  # optional: path to a .ttf, or None for the bundled font
)

Key parameters (the recommended, resolution-independent surface):

  • style -- A named look bundling text/highlight colors, border, and background so you express intent instead of a dozen numbers. boxed reproduces the historical defaults.
  • region -- Which vertical safe-area band the box sits in: top, center, or bottom.
  • font_scale -- Base font height as a fraction of frame height. Because it is relative, the same plan renders correctly whether the output is 480p or 4K, and an upstream face_crop/resize cannot make it overflow. It auto-shrinks toward min_font_scale if a cue would still not fit.
  • font_filename -- Path to a TrueType font, or None for the bundled default.

Advanced overrides

The absolute fields (font_size, text_color, highlight_color, background_color, position, anchor, box_width, margin, ...) still exist as optional overrides for back-compat: leave them unset to derive from style/region/font_scale, or set one to pin it. Prefer not setting font_size -- an absolute size chosen without knowing the final (post-transform) frame is exactly what used to overflow at render time. With the relative surface, VideoEdit.validate() also rejects an un-fittable plan up front instead of crashing mid-render.

3. Apply Overlay

video = overlay.apply(video, transcription)

The overlay renders each word at its exact timestamp, highlighting the current word being spoken.

Customization

Styling Options

# Minimal outlined subtitles near the bottom (no background box)
overlay = TranscriptionOverlay(style="clean", region="bottom")

# Bigger karaoke-style subtitles centered for short-form vertical video
overlay = TranscriptionOverlay(style="karaoke", region="center", font_scale=0.07)

# Need a specific look? Override individual fields on top of a preset:
overlay = TranscriptionOverlay(style="outline", text_color=(255, 255, 0))

Processing Long Videos

For long videos, transcription can take time. Consider processing in segments:

from videopython import Video

# Process first 5 minutes
video = Video.from_path("long_video.mp4", start_second=0, end_second=300)

Tips

  • Font size: Prefer font_scale over absolute font_size. ~0.05-0.06 reads well across resolutions; bump toward 0.07-0.08 for mobile/short-form. It auto-scales, so you never re-tune it per output size.
  • Contrast: The boxed/karaoke presets put text on a semi-transparent box that works on most backgrounds; outline/clean rely on a border instead.
  • Position: region="bottom" is standard; region="center" suits short-form vertical video.
  • Languages: Whisper supports 90+ languages. The API auto-detects language by default.