AI Dubbing

Dub videos into different languages or replace speech with custom text using voice cloning.

Local Pipeline

Video dubbing runs on a local pipeline that combines Whisper for transcription, MarianMT or Qwen3 for translation, Chatterbox Multilingual TTS for speech synthesis, and Demucs for source separation. The translation backend is selected automatically by default; see Translation Backend for details.

VideoDubber

Main class for video dubbing and voice revoicing.

Basic Dubbing

Translate speech to another language while preserving the original speaker's voice:

from videopython.ai.dubbing import VideoDubber
from videopython.base import Video

video = Video.from_path("video.mp4")
dubber = VideoDubber()

# Dub to Spanish with voice cloning
result = dubber.dub(
    video=video,
    target_lang="es",
    source_lang="en",
    preserve_background=True,  # Keep music and sound effects
    voice_clone=True,          # Clone original speaker's voice
)

# Save dubbed video
dubbed_video = video.add_audio(result.dubbed_audio, overlay=False)
dubbed_video.save("dubbed_video.mp4")

# Or use convenience method
dubbed_video = dubber.dub_and_replace(video, target_lang="es")
dubbed_video.save("dubbed_video.mp4")

Voice Revoicing

Replace speech with completely different text using the original speaker's voice:

from videopython.ai.dubbing import VideoDubber
from videopython.base import Video

video = Video.from_path("video.mp4")
dubber = VideoDubber()

# Make the person say something different
result = dubber.revoice(
    video=video,
    text="Hello everyone! This is a completely different message.",
    preserve_background=True,
)

print(f"Original duration: {result.original_duration:.1f}s")
print(f"New speech duration: {result.speech_duration:.1f}s")

# Save revoiced video (trimmed to speech length)
revoiced_video = dubber.revoice_and_replace(
    video=video,
    text="Hello everyone! This is a completely different message.",
)
revoiced_video.save("revoiced_video.mp4")

Progress Tracking

def on_progress(stage: str, progress: float) -> None:
    print(f"[{progress*100:5.1f}%] {stage}")

result = dubber.dub(
    video=video,
    target_lang="es",
    progress_callback=on_progress,
)
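
The callback receives a stage name and a progress value that the example above treats as a fraction between 0 and 1. Long runs can emit many updates, so a throttled reporter may be convenient. The following is a minimal sketch that assumes only the (stage, progress) signature; the class name ThrottledProgress and its sink and min_step parameters are illustrative and not part of videopython.

```python
from __future__ import annotations

from typing import Callable


class ThrottledProgress:
    """Callable matching (stage: str, progress: float) -> None.

    Hypothetical helper: forwards an update only when the stage changes,
    when progress has advanced by at least `min_step` since the last
    forwarded update, or when progress reaches 1.0.
    """

    def __init__(self, sink: Callable[[str], None] = print, min_step: float = 0.05) -> None:
        self.sink = sink
        self.min_step = min_step
        self._last_stage: str | None = None
        self._last_progress = 0.0

    def __call__(self, stage: str, progress: float) -> None:
        stage_changed = stage != self._last_stage
        advanced = progress - self._last_progress >= self.min_step
        if stage_changed or advanced or progress >= 1.0:
            self.sink(f"[{progress * 100:5.1f}%] {stage}")
            self._last_stage = stage
            self._last_progress = progress
```

An instance can be passed wherever a progress_callback is accepted, for example progress_callback=ThrottledProgress(min_step=0.1).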

Memory-Efficient Dubbing

The default pipeline keeps all four models (Whisper, Demucs, the translation backend, Chatterbox) resident in memory and operates on Video objects that hold every frame in RAM. For long or high-resolution sources, or on memory-constrained hardware, two options trade a modest amount of latency for a much lower memory ceiling.

Unload models between stages with low_memory=True:

# Each stage's model is released after it runs, so only one is resident at a time.
# Recommended for GPUs with <=12GB VRAM or hosts with <32GB RAM.
dubber = VideoDubber(low_memory=True)
dubbed_video = dubber.dub_and_replace(video, target_lang="es")

Skip frame loading with dub_file():

# Operates on file paths; extracts audio via ffmpeg, runs the pipeline on the
# audio only, and muxes the dubbed audio back into the source video using
# ffmpeg stream-copy (no video re-encode). Peak memory is bounded by model
# weights and the audio track, independent of video length and resolution.
dubber = VideoDubber(low_memory=True)
result = dubber.dub_file(
    input_path="long_video.mp4",
    output_path="dubbed.mp4",
    target_lang="es",
)

Use dub_file() when you don't need frame-level access in Python. Combine with low_memory=True for the smallest memory footprint. See Processing Large Videos for a worked example.
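
To make the stream-copy idea concrete, here is a rough sketch of the kind of ffmpeg invocation such a mux step could use. It is not the library's remux code: the function name build_mux_command is hypothetical, the dubbed audio is read from a file rather than streamed, and the AAC audio codec is an assumption. Only the video and subtitle streams are copied without re-encoding.

```python
from pathlib import Path


def build_mux_command(
    video: Path,
    dubbed_audio: Path,
    output: Path,
    keep_original_audio: bool = False,
) -> list[str]:
    """Build an ffmpeg argument list (illustrative, not videopython's own)."""
    cmd = [
        "ffmpeg", "-y",
        "-i", str(video),          # input 0: source video
        "-i", str(dubbed_audio),   # input 1: dubbed audio
        "-map", "0:v:0",           # video from the source
        "-map", "1:a:0",           # dubbed audio becomes audio track 0
    ]
    if keep_original_audio:
        cmd += ["-map", "0:a:0"]   # source audio becomes audio track 1
    cmd += [
        "-map", "0:s?",            # subtitles if present; "?" tolerates none
        "-c:v", "copy",            # no video re-encode
        "-c:s", "copy",
        "-c:a:0", "aac",           # assumed codec for the dubbed track
        "-disposition:a:0", "default",
    ]
    if keep_original_audio:
        cmd += ["-c:a:1", "copy", "-disposition:a:1", "0"]
    cmd.append(str(output))
    return cmd
```

The list can be handed to subprocess.run(cmd, check=True). Because the video stream is copied rather than decoded, the cost of this step depends on file size rather than on resolution.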

Whisper Model Selection

Pick the Whisper model size used for transcription. Larger models are more accurate but use more VRAM and run slower. The default is turbo, which gives near large-v3 quality at ~8x the speed of large (and ~2x the speed of small), so the out-of-the-box dubbing path is both accurate and fast.

# Even higher accuracy on very noisy or heavily accented audio
dubber = VideoDubber(whisper_model="large")

# Lower VRAM footprint for short clips
dubber = VideoDubber(whisper_model="tiny")

Supported sizes: tiny, base, small, medium, large, turbo.

Anti-Hallucination Knobs

VideoDubber forwards three Whisper decoder kwargs to AudioToText so dubbing inherits the same defaults — most importantly condition_on_previous_text=False, which prevents a single hallucinated filler from cascading through the whole dubbed track on noisy or sparse-speech inputs.

# Defaults already protect against the cascading-hallucination failure mode.
dubber = VideoDubber()

# Tighter no-speech gate for a film with heavy ambient music.
dubber = VideoDubber(no_speech_threshold=0.85)

See AudioToText for the full rationale.

Brand-Name Vocabulary

Pass a list of brand names, product names, or proper nouns that may appear in the source audio. The list is forwarded to AudioToText and biases Whisper's first-window decoder via initial_prompt, recovering near-mishears (e.g. Klarna → "carna") on brand-monitoring inputs.

dubber = VideoDubber(vocabulary=["Klarna", "Allegro", "InPost"])

See Brand-name vocabulary biasing for normalization rules and the token-budget guard.

Translation Backend

Two translation backends ship with the dubbing pipeline:

  • MarianMT (Helsinki-NLP): fast on CPU, segment-isolated translation. Covers ~30 high-resource language pairs out of the box.
  • Qwen3: Qwen3-4B-Instruct via llama-cpp-python (Q4_K_M GGUF, ~2.4 GB, downloaded on first use). Context-aware: prompts include a per-segment character budget derived from source duration and a low_confidence hint sourced from Whisper avg_logprob. If both Qwen parse retries fail for a segment and Marian supports the language pair, that segment falls back to Marian.

# Auto resolver: Qwen3 on GPU when supported, MarianMT on CPU.
dubber = VideoDubber(translator="auto")

# Force MarianMT (e.g. CPU machines where Qwen3 wall time is impractical).
dubber = VideoDubber(translator="marian")

# Force Qwen3. Logs a WARNING on CPU because Qwen3-4B Q4_K_M runs ~10-15x
# slower than Marian without GPU acceleration.
dubber = VideoDubber(translator="qwen3")

Hard failures from Qwen3, where both the primary call and the per-segment Marian fallback fail, are surfaced on DubbingResult.translation_failures as a list of segment indices; those segments appear on the result with empty translated text. The list is always empty under MarianTranslator.

If neither backend covers the requested pair the auto resolver raises UnsupportedLanguageError (importable from videopython.ai.dubbing).
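
The auto-resolution rule can be sketched as a small decision function. This is an illustration under stated assumptions, not the library's resolver: the coverage sets are made up, UnsupportedPairError is a local stand-in for UnsupportedLanguageError, and the choice to use Qwen3 on CPU when only Qwen3 covers the pair is a guess.

```python
class UnsupportedPairError(ValueError):
    """Local stand-in for the library's UnsupportedLanguageError."""


# Illustrative coverage only; the real backends define their own tables.
MARIAN_PAIRS = {("en", "es"), ("en", "fr"), ("en", "de")}
QWEN3_LANGS = {"en", "es", "fr", "de", "ja", "pl"}


def resolve_auto(source_lang: str, target_lang: str, has_gpu: bool) -> str:
    """Pick a backend name for translator="auto" (sketch)."""
    marian_ok = (source_lang, target_lang) in MARIAN_PAIRS
    qwen_ok = source_lang in QWEN3_LANGS and target_lang in QWEN3_LANGS

    if not (marian_ok or qwen_ok):
        raise UnsupportedPairError(f"No backend covers {source_lang}->{target_lang}")
    if has_gpu and qwen_ok:
        return "qwen3"   # GPU available and the pair is supported
    if marian_ok:
        return "marian"  # CPU host, or Qwen3 does not cover the pair
    return "qwen3"       # assumption: only Qwen3 covers the pair, so use it anyway
```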

Output Options for dub_file

dub_file() writes the dubbed video by stream-copying the source video and muxing the new audio. Two extras carry through automatically and one is opt-in:

  • Subtitles pass-through (automatic). Subtitle streams from the source video are stream-copied into the output by default. Sources without subtitles are tolerated.
  • Source loudness match (automatic). The dubbed audio is gain-matched to the source via BS.1770 integrated-loudness measurement (pyloudnorm, BSD-3) so the dub lands within ~1 LU of the source on dialogue-heavy mixes. Falls back to peak-amplitude match for clips shorter than 400 ms; post-gain peaks are clamped to 0.99.
  • keep_original_audio=True (opt-in). Retains the source audio as a secondary audio track behind the dubbed one. Useful for editorial A/B; the dubbed track stays the default-playback track.

result = dubber.dub_file(
    input_path="interview.mp4",
    output_path="interview_es.mp4",
    target_lang="es",
    keep_original_audio=True,  # source audio rides along as track #2
)
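
The source loudness match described above reduces to a gain computed from two loudness readings. The sketch below assumes the integrated loudness values (in LUFS) have already been measured, for instance with pyloudnorm, and reads "clamped" as a per-sample hard limit; the function name match_loudness and the plain-list sample format are illustrative and not the library's implementation.

```python
def match_loudness(
    samples: list[float],
    source_lufs: float,
    dub_lufs: float,
    peak_ceiling: float = 0.99,
) -> list[float]:
    """Scale dubbed samples so their loudness matches the source (sketch)."""
    # A loudness difference in LU is a gain in dB.
    gain_db = source_lufs - dub_lufs
    gain = 10 ** (gain_db / 20)
    # Apply the gain, then clamp any post-gain peak to the ceiling.
    return [max(-peak_ceiling, min(peak_ceiling, s * gain)) for s in samples]
```

For example, a dub measured at -26 LUFS against a source at -20 LUFS receives +6 dB, roughly a 2x linear gain, and any sample pushed past the ceiling is held at 0.99.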

Transcript Quality Gating

Even with condition_on_previous_text=False, sufficiently degenerate input (ambient music, mostly-silent windows misread as speech) can still produce unusable transcripts. The pipeline runs a cheap heuristic over the Whisper output and exposes the assessment on every result.

Three checks fire flags:

  • Dominant phrase — one phrase covers ≥70% of segment characters (catches cascades like the Japanese YouTube outro 「ご視聴ありがとうございました」).
  • Low decoder confidence — median avg_logprob < -1.5.
  • Sparse speech — speech-region duration is <5% of clip duration on inputs >30s.

The recommendation is "reject" when the dominance flag fires together with at least one other flag, "warn" when any single flag fires, "ok" otherwise. Single repetition alone (chants, song lyrics) only warns.
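
The three checks and the recommendation rule can be sketched in a few lines. This is a simplified illustration using the thresholds quoted above, not the pipeline's actual heuristic: the Segment dataclass is a stand-in, a "phrase" is approximated as a whole normalized segment text, and the flag names other than "dominant_phrase" are invented for the example.

```python
from collections import Counter
from dataclasses import dataclass
from statistics import median


@dataclass
class Segment:
    """Hypothetical stand-in for a Whisper segment."""
    text: str
    start: float
    end: float
    avg_logprob: float


def assess(segments: list[Segment], clip_duration: float) -> tuple[list[str], str]:
    """Return (flags, recommendation) for a transcript (sketch)."""
    flags: list[str] = []

    # Dominant phrase: one text accounts for >= 70% of all characters.
    chars: Counter[str] = Counter()
    for seg in segments:
        key = seg.text.strip().lower()
        chars[key] += len(key)
    total = sum(chars.values())
    if total and max(chars.values()) / total >= 0.70:
        flags.append("dominant_phrase")

    # Low decoder confidence: median avg_logprob below -1.5.
    if segments and median(s.avg_logprob for s in segments) < -1.5:
        flags.append("low_confidence")

    # Sparse speech: under 5% speech on clips longer than 30 s.
    speech = sum(s.end - s.start for s in segments)
    if clip_duration > 30 and speech / clip_duration < 0.05:
        flags.append("sparse_speech")

    if "dominant_phrase" in flags and len(flags) >= 2:
        return flags, "reject"
    return flags, "warn" if flags else "ok"
```

A chant repeated with confident decoding trips only the dominance check and therefore warns, while the same repetition combined with low confidence or sparse speech is rejected.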

result = dubber.dub(video, target_lang="es")

q = result.transcript_quality
if q is not None:
    print(q.recommendation)            # "ok" | "warn" | "reject"
    print(q.dominant_phrase_fraction)  # 0.0-1.0
    print(q.flags)                     # ["dominant_phrase", ...]

Use strict_quality=True to refuse low-quality transcripts before paying for Demucs, translation, and TTS:

from videopython.ai.dubbing import GarbageTranscriptError

dubber = VideoDubber(strict_quality=True)
try:
    dubber.dub(video, target_lang="es")
except GarbageTranscriptError as exc:
    print("Refused:", exc.quality.flags)

Timing Summary

DubbingResult.timing_summary aggregates the per-segment timing adjustments the synchronizer applied to fit translated speech into source durations. High truncation rates indicate translation produced text that was too long for the source's spoken regions — a quality red flag worth surfacing in eval harnesses or product UI.

result = dubber.dub(video, target_lang="es")

ts = result.timing_summary
if ts is not None:
    print(f"{ts.clean_count}/{ts.total_segments} clean")
    print(f"{ts.truncated_count} truncated, worst {ts.max_truncation_seconds:.2f}s")
    print(f"mean speed factor {ts.mean_speed_factor:.3f}")
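
In an eval harness, the summary can be turned into a pass/fail gate. The sketch below uses a stand-in dataclass carrying the same field names printed above; the function timing_gate and both thresholds are hypothetical and should be tuned per project.

```python
from dataclasses import dataclass


@dataclass
class TimingSummaryLike:
    """Stand-in with the fields shown above; not the library's class."""
    total_segments: int
    clean_count: int
    truncated_count: int
    max_truncation_seconds: float
    mean_speed_factor: float


def timing_gate(
    ts: TimingSummaryLike,
    max_truncation_rate: float = 0.10,    # hypothetical threshold
    max_truncation_seconds: float = 0.5,  # hypothetical threshold
) -> list[str]:
    """Return a list of human-readable problems; empty means the dub passes."""
    problems: list[str] = []
    if ts.total_segments == 0:
        return problems
    rate = ts.truncated_count / ts.total_segments
    if rate > max_truncation_rate:
        problems.append(f"{rate:.0%} of segments truncated")
    if ts.max_truncation_seconds > max_truncation_seconds:
        problems.append(f"worst truncation {ts.max_truncation_seconds:.2f}s")
    return problems
```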

Source-Prosody Expressiveness

ChatterboxMultilingualTTS.generate() exposes exaggeration, cfg_weight, and temperature knobs. The dubbing pipeline derives an Expressiveness profile per segment from source vocals RMS (relative to whole-vocals baseline) and forwards it to Chatterbox, so the dub tracks the source's loud/quiet shape instead of using flat defaults on every segment.

Three buckets, picked by-ear on cam1_1min.mp4:

RMS ratio vs baseline    exaggeration          cfg_weight
< 0.7× (calm)            0.3                   0.7
0.7×–1.3× (normal)       Chatterbox default    Chatterbox default
> 1.3× (dramatic)        0.85                  0.35

The Expressiveness dataclass is exported from videopython.ai.dubbing.
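
The bucket selection amounts to a threshold lookup on the RMS ratio. The sketch below mirrors the three buckets using a local Profile dataclass as a stand-in for the exported Expressiveness type, whose actual fields may differ; None here means "leave the Chatterbox default in place", and the handling of a silent baseline is an assumption.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Profile:
    """Stand-in for the library's Expressiveness dataclass."""
    exaggeration: Optional[float]
    cfg_weight: Optional[float]


def rms(samples: list[float]) -> float:
    """Root-mean-square level of a block of samples."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def pick_profile(segment_rms: float, baseline_rms: float) -> Profile:
    """Map a segment's level, relative to the whole-vocals baseline, to a bucket."""
    if baseline_rms <= 0:
        return Profile(None, None)  # assumption: silent baseline falls back to defaults
    ratio = segment_rms / baseline_rms
    if ratio < 0.7:
        return Profile(exaggeration=0.3, cfg_weight=0.7)    # calm
    if ratio > 1.3:
        return Profile(exaggeration=0.85, cfg_weight=0.35)  # dramatic
    return Profile(None, None)                              # normal
```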

Supplying a Pre-Computed Transcription

dub(), dub_and_replace(), and dub_file() accept an optional transcription argument. Pass a pre-computed Transcription to skip the internal Whisper step — useful when you've already transcribed (and possibly hand-edited) the source.

Per-speaker voice cloning is driven by speaker labels on the supplied transcription. Three cases:

Supplied transcription    enable_diarization    Behavior
Has speaker labels        any                   Use supplied speakers; enable_diarization ignored
No speakers               True                  Run pyannote on the audio, attach speakers to supplied words
No speakers               False                 Use as-is; all segments share a single voice clone

The diarize-on-supplied path requires word-level timings on the supplied transcription — transcriptions loaded from SRT (one synthetic word per block) are rejected.
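
The word-level requirement is easier to see with a sketch of how speaker turns might be overlaid onto words. The code below assigns each word the speaker whose turn overlaps it most; the Word dataclass, the (start, end, speaker) turn tuples, and the attach_speakers function are illustrative assumptions, not the library's data model or algorithm.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Word:
    """Hypothetical word with timings in seconds."""
    text: str
    start: float
    end: float
    speaker: Optional[str] = None


def attach_speakers(words: list[Word], turns: list[tuple[float, float, str]]) -> list[Word]:
    """Label each word with the speaker turn that overlaps it the most."""
    for word in words:
        best_speaker: Optional[str] = None
        best_overlap = 0.0
        for start, end, speaker in turns:
            overlap = min(word.end, end) - max(word.start, start)
            if overlap > best_overlap:
                best_speaker, best_overlap = speaker, overlap
        word.speaker = best_speaker  # stays None if no turn overlaps
    return words
```

With one synthetic word per SRT block, a block that spans two speakers could only receive a single label, which is consistent with the requirement for real word-level timings.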

# Workflow: transcribe, edit, then dub with per-speaker cloning
from videopython.ai.dubbing import VideoDubber
from videopython.ai.understanding.audio import AudioToText
from videopython.base import Video

video = Video.from_path("video.mp4")

# 1. Transcribe with diarization
transcriber = AudioToText(enable_diarization=True)
transcription = transcriber.transcribe(video)

# 2. Edit segment text in-place (correct misrecognitions, etc.)
for seg in transcription.segments:
    if "incorrect word" in seg.text:
        seg.text = seg.text.replace("incorrect word", "correct word")

# 3. Dub using the edited transcription. Speaker labels from step 1 are
#    preserved, so each speaker gets their own cloned voice.
dubber = VideoDubber()
dubbed_video = dubber.dub_and_replace(
    video=video,
    target_lang="es",
    transcription=transcription,
)

If you have a transcription without speakers and want per-speaker cloning, pass enable_diarization=True — pyannote will run standalone (skipping the Whisper re-transcription).

VideoDubber

Dubs videos into different languages using the local pipeline.

Accepts either a DubbingConfig or the same knobs as flat kwargs (device, low_memory, whisper_model, translator, etc.); the flat path builds a DubbingConfig internally. See DubbingConfig for the full knob list and defaults.

Source code in src/videopython/ai/dubbing/dubber.py
class VideoDubber:
    """Dubs videos into different languages using the local pipeline.

    Accepts either a :class:`DubbingConfig` or the same knobs as flat kwargs
    (``device``, ``low_memory``, ``whisper_model``, ``translator``, etc.) --
    the flat path builds a ``DubbingConfig`` internally. See
    :class:`DubbingConfig` for the full knob list and defaults.
    """

    def __init__(self, config: DubbingConfig | None = None, **kwargs: Any):
        if config is not None and kwargs:
            raise TypeError("Pass either `config=` or knob kwargs, not both")
        self.config = config or DubbingConfig(**kwargs)
        self._local_pipeline: Any = None
        logger.info(
            "VideoDubber initialized with %s",
            " ".join(f"{k}={v}" for k, v in self.config.init_log_fields().items()),
        )

    def _init_local_pipeline(self) -> None:
        from videopython.ai.dubbing.pipeline import LocalDubbingPipeline

        self._local_pipeline = LocalDubbingPipeline(config=self.config)

    def dub(
        self,
        video: Video,
        target_lang: str,
        source_lang: str | None = None,
        preserve_background: bool = True,
        voice_clone: bool = True,
        enable_diarization: bool = False,
        progress_callback: Callable[[str, float], None] | None = None,
        transcription: Any = None,
    ) -> DubbingResult:
        """Dub a video into a target language.

        Args:
            enable_diarization: Enable speaker diarization to clone each speaker's
                voice separately. With ``transcription=None``, runs alongside Whisper.
                With a supplied ``transcription`` that has no speakers, runs pyannote
                standalone and overlays speakers onto the supplied words. Ignored when
                the supplied transcription already has speaker labels.
            transcription: Optional pre-computed ``Transcription`` to skip the Whisper
                step. Speaker labels on the supplied transcription drive per-speaker
                voice cloning. If it has no speakers, pass ``enable_diarization=True``
                to add them via pyannote (requires word-level timings).
        """
        if self._local_pipeline is None:
            self._init_local_pipeline()

        return self._local_pipeline.process(
            source_audio=video.audio,
            target_lang=target_lang,
            source_lang=source_lang,
            preserve_background=preserve_background,
            voice_clone=voice_clone,
            enable_diarization=enable_diarization,
            progress_callback=progress_callback,
            transcription=transcription,
        )

    def dub_and_replace(
        self,
        video: Video,
        target_lang: str,
        source_lang: str | None = None,
        preserve_background: bool = True,
        voice_clone: bool = True,
        enable_diarization: bool = False,
        progress_callback: Callable[[str, float], None] | None = None,
        transcription: Any = None,
    ) -> Video:
        """Dub a video and return a new video with the dubbed audio.

        Args:
            transcription: Optional pre-computed ``Transcription`` to skip the Whisper
                step. Speaker labels on the supplied transcription drive per-speaker
                voice cloning. See ``dub()`` for the interaction with
                ``enable_diarization``.
        """
        result = self.dub(
            video=video,
            target_lang=target_lang,
            source_lang=source_lang,
            preserve_background=preserve_background,
            voice_clone=voice_clone,
            enable_diarization=enable_diarization,
            progress_callback=progress_callback,
            transcription=transcription,
        )
        return video.add_audio(result.dubbed_audio, overlay=False)

    def dub_file(
        self,
        input_path: str | Path,
        output_path: str | Path,
        target_lang: str,
        source_lang: str | None = None,
        preserve_background: bool = True,
        voice_clone: bool = True,
        enable_diarization: bool = False,
        progress_callback: Callable[[str, float], None] | None = None,
        transcription: Any = None,
        keep_original_audio: bool = False,
    ) -> DubbingResult:
        """Dub a video file in place on disk without loading video frames into memory.

        Extracts the audio track via ffmpeg, runs the dubbing pipeline on the
        audio only, then muxes the dubbed audio back into the source video
        using ffmpeg stream-copy (no video re-encode). Peak memory is bounded
        by model weights and the audio track — independent of video length and
        resolution.

        Use this instead of ``dub_and_replace`` when the source video is long
        or high-resolution and you don't need frame-level access in Python.

        Args:
            input_path: Path to the source video file.
            output_path: Path to write the dubbed video. Overwritten if it exists.
            target_lang: Target language code (e.g. ``"es"``, ``"fr"``).
            source_lang: Source language code, or ``None`` to auto-detect.
            preserve_background: Preserve background music/effects via source separation.
            voice_clone: Clone the source speaker's voice for the dubbed track.
            enable_diarization: Enable speaker diarization for per-speaker voice cloning.
                See ``dub()`` for the interaction with ``transcription``.
            progress_callback: Optional callback ``(stage: str, progress: float) -> None``.
            transcription: Optional pre-computed ``Transcription`` to skip the Whisper
                step. Speaker labels on the supplied transcription drive per-speaker
                voice cloning. If it has no speakers, pass ``enable_diarization=True``
                to add them via pyannote (requires word-level timings).
            keep_original_audio: If True, retain the source audio in the output
                as a secondary track behind the dubbed one (editorial A/B).

        Returns:
            ``DubbingResult`` with the dubbed audio, translated segments, and
            source transcription. The output video is written to ``output_path``.
        """
        from videopython.ai.dubbing.remux import replace_audio_stream_from_audio
        from videopython.audio import Audio

        input_path = Path(input_path)
        output_path = Path(output_path)

        if not input_path.exists():
            raise FileNotFoundError(f"Input video not found: {input_path}")

        logger.info("dub_file: loading audio from %s", input_path)
        source_audio = Audio.from_path(input_path)

        if self._local_pipeline is None:
            self._init_local_pipeline()

        result = self._local_pipeline.process(
            source_audio=source_audio,
            target_lang=target_lang,
            source_lang=source_lang,
            preserve_background=preserve_background,
            voice_clone=voice_clone,
            enable_diarization=enable_diarization,
            progress_callback=progress_callback,
            transcription=transcription,
        )

        # Stream the dubbed Audio directly into ffmpeg via stdin instead of
        # going through a temp WAV on disk. For a 2h dub the temp file would
        # be ~10 GB written-then-read; the streaming path drops both copies.
        replace_audio_stream_from_audio(
            video_path=input_path,
            audio=result.dubbed_audio,
            output_path=output_path,
            keep_original_audio=keep_original_audio,
        )

        return result

    def revoice(
        self,
        video: Video,
        text: str,
        preserve_background: bool = True,
        progress_callback: Callable[[str, float], None] | None = None,
    ) -> RevoiceResult:
        """Replace speech in a video with new text using voice cloning."""
        if self._local_pipeline is None:
            self._init_local_pipeline()

        return self._local_pipeline.revoice(
            source_audio=video.audio,
            text=text,
            preserve_background=preserve_background,
            progress_callback=progress_callback,
        )

    def revoice_and_replace(
        self,
        video: Video,
        text: str,
        preserve_background: bool = True,
        progress_callback: Callable[[str, float], None] | None = None,
    ) -> Video:
        """Revoice a video and return a new video with the revoiced audio."""
        result = self.revoice(
            video=video,
            text=text,
            preserve_background=preserve_background,
            progress_callback=progress_callback,
        )

        speech_duration = result.speech_duration
        video_duration = video.total_seconds

        if video_duration > speech_duration:
            from videopython.editing.transforms import CutSeconds

            output_video = CutSeconds(start=0, end=speech_duration).apply(video)
        else:
            output_video = video

        return output_video.add_audio(result.revoiced_audio, overlay=False)

    @staticmethod
    def get_supported_languages() -> dict[str, str]:
        from videopython.ai.generation.translation import TextTranslator

        return TextTranslator.get_supported_languages()

dub

dub(
    video: Video,
    target_lang: str,
    source_lang: str | None = None,
    preserve_background: bool = True,
    voice_clone: bool = True,
    enable_diarization: bool = False,
    progress_callback: Callable[[str, float], None]
    | None = None,
    transcription: Any = None,
) -> DubbingResult

Dub a video into a target language.

Parameters:

  • enable_diarization (bool, default False): Enable speaker diarization to clone each speaker's voice separately. With transcription=None, runs alongside Whisper. With a supplied transcription that has no speakers, runs pyannote standalone and overlays speakers onto the supplied words. Ignored when the supplied transcription already has speaker labels.
  • transcription (Any, default None): Optional pre-computed Transcription to skip the Whisper step. Speaker labels on the supplied transcription drive per-speaker voice cloning. If it has no speakers, pass enable_diarization=True to add them via pyannote (requires word-level timings).

dub_and_replace

dub_and_replace(
    video: Video,
    target_lang: str,
    source_lang: str | None = None,
    preserve_background: bool = True,
    voice_clone: bool = True,
    enable_diarization: bool = False,
    progress_callback: Callable[[str, float], None]
    | None = None,
    transcription: Any = None,
) -> Video

Dub a video and return a new video with the dubbed audio.

Parameters:

  • transcription (Any, default None): Optional pre-computed Transcription to skip the Whisper step. Speaker labels on the supplied transcription drive per-speaker voice cloning. See dub() for the interaction with enable_diarization.

dub_file

dub_file(
    input_path: str | Path,
    output_path: str | Path,
    target_lang: str,
    source_lang: str | None = None,
    preserve_background: bool = True,
    voice_clone: bool = True,
    enable_diarization: bool = False,
    progress_callback: Callable[[str, float], None]
    | None = None,
    transcription: Any = None,
    keep_original_audio: bool = False,
) -> DubbingResult

Dub a video file in place on disk without loading video frames into memory.

Extracts the audio track via ffmpeg, runs the dubbing pipeline on the audio only, then muxes the dubbed audio back into the source video using ffmpeg stream-copy (no video re-encode). Peak memory is bounded by model weights and the audio track — independent of video length and resolution.

Use this instead of dub_and_replace when the source video is long or high-resolution and you don't need frame-level access in Python.

Parameters:

  • input_path (str | Path, required): Path to the source video file.
  • output_path (str | Path, required): Path to write the dubbed video. Overwritten if it exists.
  • target_lang (str, required): Target language code (e.g. "es", "fr").
  • source_lang (str | None, default None): Source language code, or None to auto-detect.
  • preserve_background (bool, default True): Preserve background music/effects via source separation.
  • voice_clone (bool, default True): Clone the source speaker's voice for the dubbed track.
  • enable_diarization (bool, default False): Enable speaker diarization for per-speaker voice cloning. See dub() for the interaction with transcription.
  • progress_callback (Callable[[str, float], None] | None, default None): Optional callback (stage: str, progress: float) -> None.
  • transcription (Any, default None): Optional pre-computed Transcription to skip the Whisper step. Speaker labels on the supplied transcription drive per-speaker voice cloning. If it has no speakers, pass enable_diarization=True to add them via pyannote (requires word-level timings).
  • keep_original_audio (bool, default False): If True, retain the source audio in the output as a secondary track behind the dubbed one (editorial A/B).

Returns:

  • DubbingResult: the dubbed audio, translated segments, and source transcription. The output video is written to output_path.

Source code in src/videopython/ai/dubbing/dubber.py
def dub_file(
    self,
    input_path: str | Path,
    output_path: str | Path,
    target_lang: str,
    source_lang: str | None = None,
    preserve_background: bool = True,
    voice_clone: bool = True,
    enable_diarization: bool = False,
    progress_callback: Callable[[str, float], None] | None = None,
    transcription: Any = None,
    keep_original_audio: bool = False,
) -> DubbingResult:
    """Dub a video file in place on disk without loading video frames into memory.

    Extracts the audio track via ffmpeg, runs the dubbing pipeline on the
    audio only, then muxes the dubbed audio back into the source video
    using ffmpeg stream-copy (no video re-encode). Peak memory is bounded
    by model weights and the audio track — independent of video length and
    resolution.

    Use this instead of ``dub_and_replace`` when the source video is long
    or high-resolution and you don't need frame-level access in Python.

    Args:
        input_path: Path to the source video file.
        output_path: Path to write the dubbed video. Overwritten if it exists.
        target_lang: Target language code (e.g. ``"es"``, ``"fr"``).
        source_lang: Source language code, or ``None`` to auto-detect.
        preserve_background: Preserve background music/effects via source separation.
        voice_clone: Clone the source speaker's voice for the dubbed track.
        enable_diarization: Enable speaker diarization for per-speaker voice cloning.
            See ``dub()`` for the interaction with ``transcription``.
        progress_callback: Optional callback ``(stage: str, progress: float) -> None``.
        transcription: Optional pre-computed ``Transcription`` to skip the Whisper
            step. Speaker labels on the supplied transcription drive per-speaker
            voice cloning. If it has no speakers, pass ``enable_diarization=True``
            to add them via pyannote (requires word-level timings).
        keep_original_audio: If True, retain the source audio in the output
            as a secondary track behind the dubbed one (editorial A/B).

    Returns:
        ``DubbingResult`` with the dubbed audio, translated segments, and
        source transcription. The output video is written to ``output_path``.
    """
    from videopython.ai.dubbing.remux import replace_audio_stream_from_audio
    from videopython.audio import Audio

    input_path = Path(input_path)
    output_path = Path(output_path)

    if not input_path.exists():
        raise FileNotFoundError(f"Input video not found: {input_path}")

    logger.info("dub_file: loading audio from %s", input_path)
    source_audio = Audio.from_path(input_path)

    if self._local_pipeline is None:
        self._init_local_pipeline()

    result = self._local_pipeline.process(
        source_audio=source_audio,
        target_lang=target_lang,
        source_lang=source_lang,
        preserve_background=preserve_background,
        voice_clone=voice_clone,
        enable_diarization=enable_diarization,
        progress_callback=progress_callback,
        transcription=transcription,
    )

    # Stream the dubbed Audio directly into ffmpeg via stdin instead of
    # going through a temp WAV on disk. For a 2h dub the temp file would
    # be ~10 GB written-then-read; the streaming path drops both copies.
    replace_audio_stream_from_audio(
        video_path=input_path,
        audio=result.dubbed_audio,
        output_path=output_path,
        keep_original_audio=keep_original_audio,
    )

    return result
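The `progress_callback` parameter is documented as `(stage: str, progress: float) -> None`. A minimal sketch of a matching callback is shown here; the helper name `make_progress_logger`, the stage name, and the assumption that progress is reported as a fraction between 0.0 and 1.0 are illustrative and not confirmed by the source.

```python
from collections.abc import Callable


def make_progress_logger(sink: dict[str, float]) -> Callable[[str, float], None]:
    """Build a callback matching ``(stage: str, progress: float) -> None``."""

    def on_progress(stage: str, progress: float) -> None:
        # Keep only the most recent value reported for each stage.
        sink[stage] = progress
        print(f"[{stage}] {progress:.0%}")

    return on_progress


progress_by_stage: dict[str, float] = {}
callback = make_progress_logger(progress_by_stage)

# Simulated calls; the stage name used here is hypothetical.
callback("transcription", 0.5)
callback("transcription", 1.0)
```

The resulting `callback` could then be passed as `progress_callback=callback` to `dub_file()`.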

revoice

revoice(
    video: Video,
    text: str,
    preserve_background: bool = True,
    progress_callback: Callable[[str, float], None]
    | None = None,
) -> RevoiceResult

Replace speech in a video with new text using voice cloning.

Source code in src/videopython/ai/dubbing/dubber.py
def revoice(
    self,
    video: Video,
    text: str,
    preserve_background: bool = True,
    progress_callback: Callable[[str, float], None] | None = None,
) -> RevoiceResult:
    """Replace speech in a video with new text using voice cloning."""
    if self._local_pipeline is None:
        self._init_local_pipeline()

    return self._local_pipeline.revoice(
        source_audio=video.audio,
        text=text,
        preserve_background=preserve_background,
        progress_callback=progress_callback,
    )

revoice_and_replace

revoice_and_replace(
    video: Video,
    text: str,
    preserve_background: bool = True,
    progress_callback: Callable[[str, float], None]
    | None = None,
) -> Video

Revoice a video and return a new video with the revoiced audio.

Source code in src/videopython/ai/dubbing/dubber.py
def revoice_and_replace(
    self,
    video: Video,
    text: str,
    preserve_background: bool = True,
    progress_callback: Callable[[str, float], None] | None = None,
) -> Video:
    """Revoice a video and return a new video with the revoiced audio."""
    result = self.revoice(
        video=video,
        text=text,
        preserve_background=preserve_background,
        progress_callback=progress_callback,
    )

    speech_duration = result.speech_duration
    video_duration = video.total_seconds

    if video_duration > speech_duration:
        from videopython.editing.transforms import CutSeconds

        output_video = CutSeconds(start=0, end=speech_duration).apply(video)
    else:
        output_video = video

    return output_video.add_audio(result.revoiced_audio, overlay=False)
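The trimming rule in `revoice_and_replace` can be read as a small pure function: the video is cut only when it outlasts the generated speech, and it is never extended. The sketch below restates that rule; `output_video_duration` is a hypothetical name and not part of the library.

```python
def output_video_duration(video_duration: float, speech_duration: float) -> float:
    """Duration of the video returned by the trimming rule above."""
    if video_duration > speech_duration:
        # Video outlasts the generated speech: cut to [0, speech_duration].
        return speech_duration
    # Speech is as long as or longer than the video: the video is left as-is.
    return video_duration
```

When the speech runs longer than the video, the video keeps its original length; how `add_audio` treats the extra audio is not shown in this section.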

DubbingConfig

Knobs shared by VideoDubber and LocalDubbingPipeline. Both constructors accept either config=DubbingConfig(...) or the same knobs as flat kwargs; in the flat case the constructor builds a DubbingConfig internally.

from videopython.ai.dubbing import DubbingConfig, VideoDubber

# Flat kwargs (recommended for ad-hoc calls)
dubber = VideoDubber(device="cuda", low_memory=True, whisper_model="large")

# Explicit config (recommended for reusable presets)
config = DubbingConfig(
    device="cuda",
    low_memory=True,
    whisper_model="large",
    translator="qwen3",
    vocabulary=["Klarna", "Allegro"],
)
dubber = VideoDubber(config=config)
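Since `DubbingConfig` is declared with `frozen=True`, a preset cannot be mutated after construction; variants have to be derived as new objects. The sketch below illustrates that pattern with a stdlib frozen dataclass standing in for the real model. `PresetConfig` is hypothetical and carries only three of the fields; on the actual pydantic model the analogous call would be `config.model_copy(update={...})`.

```python
from dataclasses import FrozenInstanceError, dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class PresetConfig:
    """Hypothetical stand-in mirroring three DubbingConfig fields."""

    device: Optional[str] = None
    low_memory: bool = False
    whisper_model: str = "turbo"


gpu_preset = PresetConfig(device="cuda", whisper_model="large")

# Derive a variant instead of mutating the shared preset.
small_gpu_preset = replace(gpu_preset, low_memory=True)

try:
    gpu_preset.low_memory = True  # type: ignore[misc]
except FrozenInstanceError:
    # Frozen instances reject attribute assignment.
    mutation_blocked = True
else:
    mutation_blocked = False
```

Keeping presets immutable means one preset can be shared across several dubbers without one call site changing the settings seen by another.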

DubbingConfig

Bases: BaseModel

Knobs shared by VideoDubber and LocalDubbingPipeline.

Accepted as either config=DubbingConfig(...) or flat kwargs on the two constructors; the flat path builds a DubbingConfig internally.

Attributes:

Name Type Description
device str | None

Execution device (cpu, cuda, mps, or None for auto).

low_memory bool

When True, each pipeline stage (Whisper, Demucs, MarianMT, Chatterbox TTS) is unloaded from memory after it runs, so only one model is resident at a time. Trades per-run latency (~10-30s of extra model loads) for a much lower memory ceiling. Recommended for GPUs with <=12GB VRAM or hosts with <32GB RAM. Default False.

whisper_model WhisperModel

Whisper model size used for transcription. Larger models give better accuracy at the cost of VRAM and latency. One of tiny, base, small, medium, large, turbo. Default turbo.

condition_on_previous_text bool

Forwarded to AudioToText. Defaults to False (Whisper's own default is True). With conditioning on, a single hallucinated filler phrase cascades through the rest of the file. See AudioToText for the full rationale.

no_speech_threshold float

Forwarded to AudioToText. Whisper's no-speech gate; raise to drop more low-confidence windows.

logprob_threshold float | None

Forwarded to AudioToText. Whisper's average log-probability gate.

vocabulary list[str] | None

Forwarded to AudioToText. Optional list of brand names, product names, or proper nouns to bias Whisper's first-window decoder via initial_prompt. Recovers near-mishears (e.g. Klarna -> "carna") on brand-monitoring inputs without new model deps.

strict_quality bool

When True, the pipeline raises GarbageTranscriptError before Demucs/translation/TTS run if the transcript-quality heuristic returns "reject". When False (default), low-quality transcripts are logged at WARNING but processing continues. Either way the TranscriptQuality is exposed on DubbingResult for inspection.

translator TranslatorChoice

Translation backend to use. "auto" (default) picks Qwen3 on GPU, MarianMT on CPU; "marian" and "qwen3" force the named backend regardless of device. See videopython.ai.generation.qwen3.Qwen3Translator for tradeoffs (Qwen3 is slower on CPU but produces context-aware, length-budgeted output).

Source code in src/videopython/ai/dubbing/config.py
class DubbingConfig(BaseModel):
    """Knobs shared by :class:`VideoDubber` and :class:`LocalDubbingPipeline`.

    Accepted as either ``config=DubbingConfig(...)`` or flat kwargs on the
    two constructors; the flat path builds a ``DubbingConfig`` internally.

    Attributes:
        device: Execution device (``cpu``, ``cuda``, ``mps``, or ``None`` for auto).
        low_memory: When True, each pipeline stage (Whisper, Demucs, MarianMT,
            Chatterbox TTS) is unloaded from memory after it runs, so only one
            model is resident at a time. Trades per-run latency (~10-30s of
            extra model loads) for a much lower memory ceiling. Recommended
            for GPUs with <=12GB VRAM or hosts with <32GB RAM. Default False.
        whisper_model: Whisper model size used for transcription. Larger
            models give better accuracy at the cost of VRAM and latency. One
            of ``tiny``, ``base``, ``small``, ``medium``, ``large``, ``turbo``.
            Default ``turbo``.
        condition_on_previous_text: Forwarded to ``AudioToText``. Defaults to
            ``False`` (Whisper's own default is ``True``). With conditioning
            on, a single hallucinated filler phrase cascades through the rest
            of the file. See ``AudioToText`` for the full rationale.
        no_speech_threshold: Forwarded to ``AudioToText``. Whisper's
            no-speech gate; raise to drop more low-confidence windows.
        logprob_threshold: Forwarded to ``AudioToText``. Whisper's average
            log-probability gate.
        vocabulary: Forwarded to ``AudioToText``. Optional list of brand
            names, product names, or proper nouns to bias Whisper's
            first-window decoder via ``initial_prompt``. Recovers
            near-mishears (e.g. Klarna -> "carna") on brand-monitoring
            inputs without new model deps.
        strict_quality: When True, the pipeline raises
            :class:`GarbageTranscriptError` before Demucs/translation/TTS
            run if the transcript-quality heuristic returns ``"reject"``.
            When False (default), low-quality transcripts are logged at
            WARNING but processing continues. Either way the
            :class:`TranscriptQuality` is exposed on ``DubbingResult`` for
            inspection.
        translator: Translation backend to use. ``"auto"`` (default) picks
            Qwen3 on GPU, MarianMT on CPU; ``"marian"`` and ``"qwen3"`` force
            the named backend regardless of device. See
            :class:`videopython.ai.generation.qwen3.Qwen3Translator` for
            tradeoffs (Qwen3 is slower on CPU but produces context-aware,
            length-budgeted output).
    """

    model_config = ConfigDict(frozen=True)

    device: str | None = None
    low_memory: bool = False
    whisper_model: WhisperModel = "turbo"
    condition_on_previous_text: bool = False
    no_speech_threshold: float = 0.6
    logprob_threshold: float | None = -1.0
    vocabulary: list[str] | None = None
    strict_quality: bool = False
    translator: TranslatorChoice = "auto"

    def init_log_fields(self) -> dict[str, object]:
        """Subset of fields surfaced in the init-log line.

        Hand-picked so log noise stays bounded as the config grows.
        """
        return {
            "device": self.device.lower() if isinstance(self.device, str) else "auto",
            "low_memory": self.low_memory,
            "whisper_model": self.whisper_model,
            "translator": self.translator,
        }

init_log_fields

init_log_fields() -> dict[str, object]

Subset of fields surfaced in the init-log line.

Hand-picked so log noise stays bounded as the config grows.

Source code in src/videopython/ai/dubbing/config.py
def init_log_fields(self) -> dict[str, object]:
    """Subset of fields surfaced in the init-log line.

    Hand-picked so log noise stays bounded as the config grows.
    """
    return {
        "device": self.device.lower() if isinstance(self.device, str) else "auto",
        "low_memory": self.low_memory,
        "whisper_model": self.whisper_model,
        "translator": self.translator,
    }
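The `translator` field's `"auto"` rule described above (Qwen3 on GPU, MarianMT on CPU, explicit choices always win) can be sketched as a small function. This is an illustration only: `resolve_translator` is a hypothetical name, the sketch treats only `"cuda"` as a GPU device, and it ignores language-pair coverage, which the real resolver also takes into account.

```python
def resolve_translator(choice: str, device: str) -> str:
    """Return "qwen3" or "marian" following the rule described for ``translator``."""
    if choice in ("marian", "qwen3"):
        # Explicit choices win regardless of device.
        return choice
    if choice != "auto":
        raise ValueError(f"Unknown translator choice: {choice!r}")
    # Assumption: only CUDA counts as "GPU" in this sketch.
    return "qwen3" if device == "cuda" else "marian"
```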

DubbingResult

Result of a dubbing operation containing the dubbed audio and metadata.

result = dubber.dub(video, target_lang="es")

print(f"Translated {result.num_segments} segments")
print(f"Source language: {result.source_lang}")
print(f"Target language: {result.target_lang}")

# Access translated segments
for segment in result.translated_segments:
    print(f"'{segment.original_text}' -> '{segment.translated_text}'")

# Access voice samples used for cloning
for speaker, sample in result.voice_samples.items():
    print(f"{speaker}: {sample.metadata.duration_seconds:.1f}s sample")
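Beyond the fields printed above, a result also carries `translation_failures`, `transcript_quality`, and `timing_summary`. One possible way to turn them into a short list of warnings is sketched here; `dubbing_warnings` is a hypothetical helper, and the stand-in object and its flag string are made up for illustration.

```python
from types import SimpleNamespace


def dubbing_warnings(result) -> list[str]:
    """Collect human-readable warnings from a DubbingResult-like object."""
    warnings: list[str] = []
    if result.translation_failures:
        warnings.append(f"{len(result.translation_failures)} segment(s) failed translation")
    quality = result.transcript_quality
    if quality is not None and quality.recommendation != "ok":
        warnings.append(f"transcript quality: {quality.recommendation} {quality.flags}")
    timing = result.timing_summary
    if timing is not None and timing.truncated_count:
        warnings.append(f"{timing.truncated_count} segment(s) truncated")
    return warnings


# Hypothetical stand-in carrying only the fields read above.
fake_result = SimpleNamespace(
    translation_failures=[3],
    transcript_quality=SimpleNamespace(recommendation="warn", flags=["example_flag"]),
    timing_summary=SimpleNamespace(truncated_count=0),
)
print(dubbing_warnings(fake_result))
```

Both `transcript_quality` and `timing_summary` are optional on `DubbingResult`, which is why the helper checks for `None` before reading them.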

DubbingResult

Bases: BaseModel

Result of a video dubbing operation.

Attributes:

Name Type Description
dubbed_audio Audio

The final dubbed audio track.

translated_segments list[TranslatedSegment]

List of translated segments with timing.

source_transcription Transcription

Original transcription of the source audio.

source_lang str

Detected or specified source language.

target_lang str

Target language for dubbing.

separated_audio SeparatedAudio | None

Separated audio components (if preserve_background=True).

voice_samples dict[str, Audio]

Dictionary mapping speaker IDs to voice sample Audio.

timing_summary TimingSummary | None

Aggregate stats over per-segment timing adjustments.

transcript_quality TranscriptQuality | None

Heuristic quality assessment of the transcription (None when the pipeline returned early on an empty transcription).

translation_failures list[int]

Indices of segments where translation failed entirely. Populated by Qwen3Translator when both the primary call and the per-segment Marian fallback fail; those segments are dubbed with empty text. Always an empty list under MarianTranslator (Marian has no failure mode that drops segments).

Source code in src/videopython/ai/dubbing/models.py
class DubbingResult(BaseModel):
    """Result of a video dubbing operation.

    Attributes:
        dubbed_audio: The final dubbed audio track.
        translated_segments: List of translated segments with timing.
        source_transcription: Original transcription of the source audio.
        source_lang: Detected or specified source language.
        target_lang: Target language for dubbing.
        separated_audio: Separated audio components (if preserve_background=True).
        voice_samples: Dictionary mapping speaker IDs to voice sample Audio.
        timing_summary: Aggregate stats over per-segment timing adjustments.
        transcript_quality: Heuristic quality assessment of the transcription
            (None when the pipeline returned early on an empty transcription).
        translation_failures: Indices of segments where translation failed
            entirely. Used by Qwen3Translator when both the primary call and
            the per-segment Marian fallback fail; those segments are dubbed
            with empty text. Empty list under MarianTranslator (Marian has
            no failure mode that drops segments).
    """

    model_config = ConfigDict(arbitrary_types_allowed=True)

    dubbed_audio: Audio
    translated_segments: list[TranslatedSegment]
    source_transcription: Transcription
    source_lang: str
    target_lang: str
    separated_audio: SeparatedAudio | None = None
    voice_samples: dict[str, Audio] = Field(default_factory=dict)
    timing_summary: TimingSummary | None = None
    transcript_quality: TranscriptQuality | None = None
    translation_failures: list[int] = Field(default_factory=list)

    @property
    def num_segments(self) -> int:
        """Number of translated segments."""
        return len(self.translated_segments)

    @property
    def total_duration(self) -> float:
        """Total duration of the dubbed audio."""
        return self.dubbed_audio.metadata.duration_seconds

    def get_segments_by_speaker(self) -> dict[str, list[TranslatedSegment]]:
        """Group translated segments by speaker.

        Returns:
            Dictionary mapping speaker IDs to their segments.
        """
        segments_by_speaker: dict[str, list[TranslatedSegment]] = {}
        for segment in self.translated_segments:
            speaker = segment.speaker or "unknown"
            if speaker not in segments_by_speaker:
                segments_by_speaker[speaker] = []
            segments_by_speaker[speaker].append(segment)
        return segments_by_speaker

num_segments property

num_segments: int

Number of translated segments.

total_duration property

total_duration: float

Total duration of the dubbed audio.

get_segments_by_speaker

get_segments_by_speaker() -> dict[
    str, list[TranslatedSegment]
]

Group translated segments by speaker.

Returns:

Type Description
dict[str, list[TranslatedSegment]]

Dictionary mapping speaker IDs to their segments.

Source code in src/videopython/ai/dubbing/models.py
def get_segments_by_speaker(self) -> dict[str, list[TranslatedSegment]]:
    """Group translated segments by speaker.

    Returns:
        Dictionary mapping speaker IDs to their segments.
    """
    segments_by_speaker: dict[str, list[TranslatedSegment]] = {}
    for segment in self.translated_segments:
        speaker = segment.speaker or "unknown"
        if speaker not in segments_by_speaker:
            segments_by_speaker[speaker] = []
        segments_by_speaker[speaker].append(segment)
    return segments_by_speaker
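The grouped mapping lends itself to per-speaker statistics. As a rough sketch, the helper below sums each speaker's segment durations; `speaking_time_by_speaker` is a hypothetical name, and the stand-in segments and the `SPEAKER_00` label are illustrative rather than taken from the library.

```python
from types import SimpleNamespace


def speaking_time_by_speaker(segments_by_speaker: dict) -> dict[str, float]:
    """Sum segment durations for each speaker key."""
    return {
        speaker: sum(segment.duration for segment in segments)
        for speaker, segments in segments_by_speaker.items()
    }


# Stand-ins exposing only the ``duration`` property of TranslatedSegment.
grouped = {
    "SPEAKER_00": [SimpleNamespace(duration=2.0), SimpleNamespace(duration=1.5)],
    "unknown": [SimpleNamespace(duration=0.5)],
}
totals = speaking_time_by_speaker(grouped)
```

With a real result, `grouped` would come from `result.get_segments_by_speaker()`; segments without a speaker land under the `"unknown"` key.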

RevoiceResult

Result of a revoicing operation.

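result = dubber.revoice(video, text="New message here")

print(f"Text: {result.text}")
print(f"Speech duration: {result.speech_duration:.1f}s")
# voice_sample is typed Audio | None, so guard before reading its metadata
if result.voice_sample is not None:
    print(f"Voice sample: {result.voice_sample.metadata.duration_seconds:.1f}s")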

RevoiceResult

Bases: BaseModel

Result of a voice replacement operation.

Attributes:

Name Type Description
revoiced_audio Audio

The final audio with new speech.

text str

The text that was spoken.

separated_audio SeparatedAudio | None

Separated audio components (if preserve_background=True).

voice_sample Audio | None

Voice sample used for cloning.

original_duration float

Duration of the original audio.

speech_duration float

Duration of the generated speech.

Source code in src/videopython/ai/dubbing/models.py
class RevoiceResult(BaseModel):
    """Result of a voice replacement operation.

    Attributes:
        revoiced_audio: The final audio with new speech.
        text: The text that was spoken.
        separated_audio: Separated audio components (if preserve_background=True).
        voice_sample: Voice sample used for cloning.
        original_duration: Duration of the original audio.
        speech_duration: Duration of the generated speech.
    """

    model_config = ConfigDict(arbitrary_types_allowed=True)

    revoiced_audio: Audio
    text: str
    separated_audio: SeparatedAudio | None = None
    voice_sample: Audio | None = None
    original_duration: float = 0.0
    speech_duration: float = 0.0

    @property
    def total_duration(self) -> float:
        """Total duration of the revoiced audio."""
        return self.revoiced_audio.metadata.duration_seconds

total_duration property

total_duration: float

Total duration of the revoiced audio.

TranslatedSegment

Individual translated speech segment with timing information.

TranslatedSegment

Bases: BaseModel

A segment of translated text with timing information.

Attributes:

Name Type Description
original_segment _TranscriptionSegmentField

The original transcription segment.

translated_text str

The translated text.

source_lang str

Source language code (e.g., "en").

target_lang str

Target language code (e.g., "es").

speaker str | None

Speaker identifier if available.

start float

Start time in seconds.

end float

End time in seconds.

Source code in src/videopython/ai/dubbing/models.py
class TranslatedSegment(BaseModel):
    """A segment of translated text with timing information.

    Attributes:
        original_segment: The original transcription segment.
        translated_text: The translated text.
        source_lang: Source language code (e.g., "en").
        target_lang: Target language code (e.g., "es").
        speaker: Speaker identifier if available.
        start: Start time in seconds.
        end: End time in seconds.
    """

    model_config = ConfigDict(arbitrary_types_allowed=True)

    original_segment: _TranscriptionSegmentField
    translated_text: str
    source_lang: str
    target_lang: str
    speaker: str | None = None
    start: float = 0.0
    end: float = 0.0

    @model_validator(mode="after")
    def _default_timing_from_segment(self) -> TranslatedSegment:
        # ``start == end == 0.0`` is the dataclass-era sentinel for "use the
        # original segment's timing." Preserved so legacy callers (and the
        # dub cache wire format) keep working.
        if self.start == 0.0 and self.end == 0.0:
            self.start = self.original_segment.start
            self.end = self.original_segment.end
        if self.speaker is None:
            self.speaker = self.original_segment.speaker
        return self

    @property
    def original_text(self) -> str:
        """Get the original text from the segment."""
        return self.original_segment.text

    @property
    def duration(self) -> float:
        """Duration of the segment in seconds."""
        return self.end - self.start

original_text property

original_text: str

Get the original text from the segment.

duration property

duration: float

Duration of the segment in seconds.
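The validator's defaulting rule is easy to miss: `start == end == 0.0` means "inherit timing from the original segment", and a missing speaker is copied over as well. The sketch below reproduces that rule with plain dataclasses; `SourceSegment` and `SegmentSketch` are hypothetical stand-ins, not the library's models.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SourceSegment:
    """Hypothetical stand-in for the original transcription segment."""

    text: str
    start: float
    end: float
    speaker: Optional[str] = None


@dataclass
class SegmentSketch:
    """Hypothetical stand-in reproducing TranslatedSegment's defaulting rule."""

    original_segment: SourceSegment
    translated_text: str
    speaker: Optional[str] = None
    start: float = 0.0
    end: float = 0.0

    def __post_init__(self) -> None:
        # start == end == 0.0 is the sentinel for "use the original timing".
        if self.start == 0.0 and self.end == 0.0:
            self.start = self.original_segment.start
            self.end = self.original_segment.end
        if self.speaker is None:
            self.speaker = self.original_segment.speaker


source = SourceSegment(text="Hello", start=1.0, end=2.5, speaker="A")
inherited = SegmentSketch(original_segment=source, translated_text="Hola")
explicit = SegmentSketch(original_segment=source, translated_text="Hola", start=1.2, end=2.0)
```

Only the case where both values are 0.0 triggers the fallback, so a segment that starts at 0.0 with a non-zero `end` keeps its explicit timing.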

SeparatedAudio

Audio separated into vocals and background components.

SeparatedAudio

Bases: BaseModel

Audio separated into different components.

Attributes:

Name Type Description
vocals Audio

Isolated vocal/speech track.

background Audio

Combined background audio (music + effects).

music Audio | None

Isolated music track (if available).

effects Audio | None

Isolated sound effects track (if available).

original Audio

The original unseparated audio.

Source code in src/videopython/ai/dubbing/models.py
class SeparatedAudio(BaseModel):
    """Audio separated into different components.

    Attributes:
        vocals: Isolated vocal/speech track.
        background: Combined background audio (music + effects).
        music: Isolated music track (if available).
        effects: Isolated sound effects track (if available).
        original: The original unseparated audio.
    """

    model_config = ConfigDict(arbitrary_types_allowed=True)

    vocals: Audio
    background: Audio
    original: Audio
    music: Audio | None = None
    effects: Audio | None = None

    @property
    def has_detailed_separation(self) -> bool:
        """Check if music and effects are separated."""
        return self.music is not None and self.effects is not None

has_detailed_separation property

has_detailed_separation: bool

Check if music and effects are separated.

Expressiveness

Per-segment Chatterbox generate() knobs (exaggeration, cfg_weight, temperature). None on any field means "let Chatterbox use its default". The dubbing pipeline derives this from source vocals RMS automatically; the type is exposed for users who want to inspect or override per-segment values.

Expressiveness

Bases: BaseModel

Chatterbox generate() knobs derived from source-segment prosody.

None on any field means "let Chatterbox use its own default" -- avoids pinning the dub against future Chatterbox default changes.

Attributes:

Name Type Description
exaggeration float | None

Emotional intensity. Chatterbox default 0.5; 0.7+ produces dramatic output.

cfg_weight float | None

Classifier-free guidance weight. Chatterbox default 0.5; lower values (~0.3) slow pacing.

temperature float | None

Sampling temperature. Chatterbox default 0.8.

Source code in src/videopython/ai/dubbing/models.py
class Expressiveness(BaseModel):
    """Chatterbox ``generate()`` knobs derived from source-segment prosody.

    ``None`` on any field means "let Chatterbox use its own default" --
    avoids pinning the dub against future Chatterbox default changes.

    Attributes:
        exaggeration: Emotional intensity. Chatterbox default ``0.5``;
            ``0.7+`` produces dramatic output.
        cfg_weight: Classifier-free guidance weight. Chatterbox default
            ``0.5``; lower values (~``0.3``) slow pacing.
        temperature: Sampling temperature. Chatterbox default ``0.8``.
    """

    model_config = ConfigDict(frozen=True)

    exaggeration: float | None = None
    cfg_weight: float | None = None
    temperature: float | None = None

    def as_kwargs(self) -> dict[str, float]:
        """Knobs as a dict, dropping ``None`` entries.

        Suitable for ``**``-expansion into Chatterbox.
        """
        return {
            name: value
            for name, value in (
                ("exaggeration", self.exaggeration),
                ("cfg_weight", self.cfg_weight),
                ("temperature", self.temperature),
            )
            if value is not None
        }

as_kwargs

as_kwargs() -> dict[str, float]

Knobs as a dict, dropping None entries.

Suitable for **-expansion into Chatterbox.

Source code in src/videopython/ai/dubbing/models.py
def as_kwargs(self) -> dict[str, float]:
    """Knobs as a dict, dropping ``None`` entries.

    Suitable for ``**``-expansion into Chatterbox.
    """
    return {
        name: value
        for name, value in (
            ("exaggeration", self.exaggeration),
            ("cfg_weight", self.cfg_weight),
            ("temperature", self.temperature),
        )
        if value is not None
    }
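The effect of dropping `None` entries is that unset knobs never reach Chatterbox, so its own defaults apply. A minimal sketch of the same filtering, using a hypothetical free function `drop_none` rather than the model itself:

```python
from typing import Optional


def drop_none(**knobs: Optional[float]) -> dict[str, float]:
    """Keep only the knobs that were explicitly set, as as_kwargs() does."""
    return {name: value for name, value in knobs.items() if value is not None}


# Only exaggeration is pinned; the other two are left to the TTS defaults.
kwargs = drop_none(exaggeration=0.7, cfg_weight=None, temperature=None)
```

Here `kwargs` contains a single entry for `exaggeration`, which is the shape that would be `**`-expanded into the generate call.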

TimingSummary

Aggregate stats over per-segment timing adjustments applied by the synchronizer. Surfaces truncation and speed-change counts that translation-quality evaluation harnesses can compare across backends.

TimingSummary

Bases: BaseModel

Aggregate stats over per-segment timing adjustments.

Surfaces how aggressively the timing synchronizer had to compress or truncate dubbed segments to fit the source's spoken regions. High truncation rates indicate translation produced text too long for the source duration.

Source code in src/videopython/ai/dubbing/models.py
class TimingSummary(BaseModel):
    """Aggregate stats over per-segment timing adjustments.

    Surfaces how aggressively the timing synchronizer had to compress or
    truncate dubbed segments to fit the source's spoken regions. High
    truncation rates indicate translation produced text too long for the
    source duration.
    """

    total_segments: int
    clean_count: int
    stretched_count: int
    truncated_count: int
    mean_speed_factor: float
    max_truncation_seconds: float

    @classmethod
    def from_adjustments(cls, adjustments: list[TimingAdjustment]) -> TimingSummary:
        """Aggregate a list of TimingAdjustments into a TimingSummary."""
        total = len(adjustments)
        if total == 0:
            return cls(
                total_segments=0,
                clean_count=0,
                stretched_count=0,
                truncated_count=0,
                mean_speed_factor=1.0,
                max_truncation_seconds=0.0,
            )

        clean = 0
        stretched = 0
        truncated = 0
        speed_sum = 0.0
        max_truncation = 0.0
        for adj in adjustments:
            speed_sum += adj.speed_factor
            if adj.was_truncated:
                truncated += 1
                truncation = adj.original_duration - adj.actual_duration
                if truncation > max_truncation:
                    max_truncation = truncation
            elif abs(adj.speed_factor - 1.0) <= CLEAN_SPEED_TOLERANCE:
                clean += 1
            else:
                stretched += 1

        return cls(
            total_segments=total,
            clean_count=clean,
            stretched_count=stretched,
            truncated_count=truncated,
            mean_speed_factor=speed_sum / total,
            max_truncation_seconds=max_truncation,
        )

from_adjustments classmethod

from_adjustments(
    adjustments: list[TimingAdjustment],
) -> TimingSummary

Aggregate a list of TimingAdjustments into a TimingSummary.

Source code in src/videopython/ai/dubbing/models.py
@classmethod
def from_adjustments(cls, adjustments: list[TimingAdjustment]) -> TimingSummary:
    """Aggregate a list of TimingAdjustments into a TimingSummary."""
    total = len(adjustments)
    if total == 0:
        return cls(
            total_segments=0,
            clean_count=0,
            stretched_count=0,
            truncated_count=0,
            mean_speed_factor=1.0,
            max_truncation_seconds=0.0,
        )

    clean = 0
    stretched = 0
    truncated = 0
    speed_sum = 0.0
    max_truncation = 0.0
    for adj in adjustments:
        speed_sum += adj.speed_factor
        if adj.was_truncated:
            truncated += 1
            truncation = adj.original_duration - adj.actual_duration
            if truncation > max_truncation:
                max_truncation = truncation
        elif abs(adj.speed_factor - 1.0) <= CLEAN_SPEED_TOLERANCE:
            clean += 1
        else:
            stretched += 1

    return cls(
        total_segments=total,
        clean_count=clean,
        stretched_count=stretched,
        truncated_count=truncated,
        mean_speed_factor=speed_sum / total,
        max_truncation_seconds=max_truncation,
    )
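Because the summary stores raw counts, comparing backends usually means turning them into a rate first. A small sketch, assuming a hypothetical helper `truncation_rate` that is not part of the library:

```python
def truncation_rate(truncated_count: int, total_segments: int) -> float:
    """Share of segments that had to be truncated (0.0 for an empty summary)."""
    if total_segments == 0:
        # Mirrors the empty case above, where every count is zero.
        return 0.0
    return truncated_count / total_segments
```

With a real summary this would be called as `truncation_rate(summary.truncated_count, summary.total_segments)`; a high value suggests the translation is producing text too long for the source duration.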

TranscriptQuality

Heuristic quality assessment over a Whisper transcription. Surfaced on every DubbingResult; drives the optional strict_quality reject path.

TranscriptQuality

Bases: BaseModel

Quality assessment of a Whisper transcription.

Attributes:

Name Type Description
recommendation Recommendation

"ok" (continue), "warn" (continue, log), or "reject" (caller should refuse to dub if strict_quality).

dominant_phrase str | None

The repeating phrase that triggered the dominance flag, or None when the flag didn't fire.

dominant_phrase_fraction float

Character-count share of the most common normalized segment phrase. 0.0 when no segments.

median_avg_logprob float | None

Median of avg_logprob across segments that carry it; None when no segment had a logprob (e.g. SRT-loaded).

speech_fraction float

Sum of segment durations divided by the audio's wall-clock duration.

flags list[str]

Human-readable list of which checks fired.

Source code in src/videopython/ai/dubbing/quality.py
class TranscriptQuality(BaseModel):
    """Quality assessment of a Whisper transcription.

    Attributes:
        recommendation: ``"ok"`` (continue), ``"warn"`` (continue, log), or
            ``"reject"`` (caller should refuse to dub if strict_quality).
        dominant_phrase: The repeating phrase that triggered the dominance
            flag, or None when the flag didn't fire.
        dominant_phrase_fraction: Character-count share of the most common
            normalized segment phrase. 0.0 when no segments.
        median_avg_logprob: Median of ``avg_logprob`` across segments that
            carry it; None when no segment had a logprob (e.g. SRT-loaded).
        speech_fraction: Sum of segment durations divided by the audio's
            wall-clock duration.
        flags: Human-readable list of which checks fired.
    """

    recommendation: Recommendation
    dominant_phrase: str | None
    dominant_phrase_fraction: float
    median_avg_logprob: float | None
    speech_fraction: float
    flags: list[str] = Field(default_factory=list)
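To make `dominant_phrase_fraction` concrete, here is one possible reading of "character-count share of the most common normalized segment phrase". The normalization (strip and lowercase), the choice of "most common" by occurrence count, and the name `dominant_phrase_share` are all assumptions; the library's actual heuristic and thresholds are not shown in this section.

```python
from collections import Counter
from typing import Optional


def dominant_phrase_share(texts: list[str]) -> tuple[Optional[str], float]:
    """Return the most frequent normalized phrase and its share of all characters."""
    # Assumed normalization: trim whitespace and lowercase.
    normalized = [text.strip().lower() for text in texts if text.strip()]
    total_chars = sum(len(text) for text in normalized)
    if total_chars == 0:
        return None, 0.0
    phrase, count = Counter(normalized).most_common(1)[0]
    return phrase, (len(phrase) * count) / total_chars


phrase, share = dominant_phrase_share(["Thank you.", "thank you.", "Hello there"])
```

Unlike the real `dominant_phrase` field, this sketch always returns the phrase and applies no threshold to decide whether a flag should fire.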

GarbageTranscriptError

Raised by the pipeline when strict_quality=True and the transcript-quality heuristic returns recommendation="reject". Carries the triggering TranscriptQuality as error.quality for caller introspection.

GarbageTranscriptError

Bases: RuntimeError

Raised by the dubbing pipeline when strict_quality=True and the transcript heuristic returns recommendation="reject".

The triggering TranscriptQuality is attached as quality so callers can introspect the flags without re-running the pipeline.

Source code in src/videopython/ai/dubbing/quality.py
class GarbageTranscriptError(RuntimeError):
    """Raised by the dubbing pipeline when ``strict_quality=True`` and the
    transcript heuristic returns ``recommendation="reject"``.

    The triggering :class:`TranscriptQuality` is attached as ``quality`` so
    callers can introspect the flags without re-running the pipeline.
    """

    def __init__(self, message: str, quality: TranscriptQuality):
        super().__init__(message)
        self.quality = quality
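A caller running with `strict_quality=True` would typically catch the error and read the attached `quality`. The sketch below shows that handling pattern with stand-ins so it runs without the pipeline; `GarbageTranscriptErrorSketch`, `dub_or_flags`, `rejecting_dub`, and the flag string are hypothetical.

```python
from types import SimpleNamespace


class GarbageTranscriptErrorSketch(RuntimeError):
    """Hypothetical stand-in with the same shape as GarbageTranscriptError."""

    def __init__(self, message: str, quality):
        super().__init__(message)
        self.quality = quality


def dub_or_flags(run_dub):
    """Run a dubbing callable; on rejection return the quality flags instead."""
    try:
        return run_dub(), []
    except GarbageTranscriptErrorSketch as error:
        # The attached quality object explains why the transcript was rejected.
        return None, list(error.quality.flags)


def rejecting_dub():
    # Simulates a strict_quality=True run that hits a "reject" recommendation.
    quality = SimpleNamespace(recommendation="reject", flags=["example_flag"])
    raise GarbageTranscriptErrorSketch("transcript rejected", quality)


result, flags = dub_or_flags(rejecting_dub)
```

In real code the `except` clause would name `GarbageTranscriptError` and `run_dub` would wrap a call such as `dubber.dub(...)`.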

UnsupportedLanguageError

Raised by the translator auto-resolver when neither MarianMT nor Qwen3 covers the requested (source_lang, target_lang) pair. Carries both fields for caller introspection without parsing the message.

UnsupportedLanguageError

Bases: ValueError

Raised when no available translation backend supports a given (source, target) language pair.

Carries the requested pair so callers can introspect:

try:
    dubber.dub(video, target_lang="xh")
except UnsupportedLanguageError as e:
    print(f"No backend covers {e.source_lang}->{e.target_lang}")

Source code in src/videopython/ai/generation/translation.py
class UnsupportedLanguageError(ValueError):
    """Raised when no available translation backend supports a given
    ``(source, target)`` language pair.

    Carries the requested pair so callers can introspect:

        try:
            dubber.dub(video, target_lang="xh")
        except UnsupportedLanguageError as e:
            print(f"No backend covers {e.source_lang}->{e.target_lang}")
    """

    def __init__(self, source_lang: str, target_lang: str, message: str | None = None):
        self.source_lang = source_lang
        self.target_lang = target_lang
        super().__init__(message or f"No translation backend supports {source_lang}->{target_lang}")

Supported Languages

Get the list of supported languages:

languages = VideoDubber.get_supported_languages()
# {'en': 'English', 'es': 'Spanish', 'fr': 'French', ...}

Supported languages include: English, Spanish, French, German, Italian, Portuguese, Polish, Hindi, Arabic, Czech, Danish, Dutch, Finnish, Greek, Hebrew, Indonesian, Japanese, Korean, Malay, Norwegian, Romanian, Russian, Slovak, Swedish, Tamil, Thai, Turkish, Ukrainian, Vietnamese, Chinese.
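The mapping returned by `get_supported_languages()` can be used to validate a language code before starting a long dub. A minimal sketch, assuming the mapping is keyed by language code as in the example above; `require_supported` is a hypothetical helper, and lowercasing the input is a choice made here, not library behavior.

```python
def require_supported(lang: str, supported: dict[str, str]) -> str:
    """Return the normalized language code, or raise if it is not supported."""
    code = lang.strip().lower()
    if code not in supported:
        raise ValueError(f"Unsupported language code: {lang!r}")
    return code


# Subset of the mapping, copied from the example output above.
languages = {"en": "English", "es": "Spanish", "fr": "French"}
target = require_supported(" ES ", languages)
```

With the library installed, `languages` would instead be `VideoDubber.get_supported_languages()`.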