AI Dubbing

Dub videos into different languages or replace speech with custom text using voice cloning.

Installation

Dubbing is part of the [ai] extra:

pip install "videopython[ai]"

By default speech is synthesized locally with the bundled TextToSpeech (Chatterbox). Synthesis runs through the pluggable SpeechBackend protocol, so you can inject your own backend (e.g. an out-of-process / remote synthesizer) into VideoDubber to keep chatterbox out of the process.

Local Pipeline

Video dubbing runs as a local pipeline that combines Whisper for transcription, a local Ollama model for translation, Chatterbox Multilingual TTS for speech synthesis, and Demucs for source separation. See Translation Backend for details on the translation step.

VideoDubber

Main class for video dubbing and voice revoicing.

Basic Dubbing

Translate speech to another language while preserving the original speaker's voice:

from videopython.ai.dubbing import VideoDubber
from videopython.base import Video

video = Video.from_path("video.mp4")
dubber = VideoDubber()

# Dub to Spanish with voice cloning
result = dubber.dub(
    video=video,
    target_lang="es",
    source_lang="en",
    preserve_background=True,  # Keep music and sound effects
    voice_clone=True,          # Clone original speaker's voice
)

# Save dubbed video
dubbed_video = video.add_audio(result.dubbed_audio, overlay=False)
dubbed_video.save("dubbed_video.mp4")

# Or use convenience method
dubbed_video = dubber.dub_and_replace(video, target_lang="es")
dubbed_video.save("dubbed_video.mp4")

Voice Revoicing

Replace speech with completely different text using the original speaker's voice:

from videopython.ai.dubbing import VideoDubber
from videopython.base import Video

video = Video.from_path("video.mp4")
dubber = VideoDubber()

# Make the person say something different
result = dubber.revoice(
    video=video,
    text="Hello everyone! This is a completely different message.",
    preserve_background=True,
)

print(f"Original duration: {result.original_duration:.1f}s")
print(f"New speech duration: {result.speech_duration:.1f}s")

# Save revoiced video (trimmed to speech length)
revoiced_video = dubber.revoice_and_replace(
    video=video,
    text="Hello everyone! This is a completely different message.",
)
revoiced_video.save("revoiced_video.mp4")

Progress Tracking

def on_progress(stage: str, progress: float) -> None:
    print(f"[{progress*100:5.1f}%] {stage}")

result = dubber.dub(
    video=video,
    target_lang="es",
    progress_callback=on_progress,
)
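A callback does not have to print. As a minimal sketch, assuming only the (stage, progress) call shape shown above, a small callable object can record the latest value per stage so a UI or log can read it later. The stage names below are hypothetical, and the 0.0-1.0 range is inferred from the `progress*100` formatting rather than taken from the library.

```python
class ProgressRecorder:
    """Callable matching the (stage, progress) callback shape."""

    def __init__(self) -> None:
        # Latest progress reported for each stage, in first-seen order.
        self.latest: dict[str, float] = {}

    def __call__(self, stage: str, progress: float) -> None:
        # Clamp defensively; the 0.0-1.0 range is an assumption.
        self.latest[stage] = min(max(progress, 0.0), 1.0)

    def summary(self) -> str:
        return ", ".join(f"{stage}={value:.0%}" for stage, value in self.latest.items())


recorder = ProgressRecorder()
# Hypothetical stage names, used only to exercise the recorder.
recorder("transcribing", 0.25)
recorder("translating", 0.5)
recorder("transcribing", 1.0)
print(recorder.summary())
```

An instance can then be passed wherever a progress_callback is accepted, in place of on_progress.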

Memory-Efficient Dubbing

The default pipeline keeps all four models (Whisper, Demucs, the translation backend, Chatterbox) resident in memory and operates on Video objects that hold every frame in RAM. For long or high-resolution sources — or memory-constrained hardware — two flags trade a modest amount of latency for a much lower memory ceiling.

Unload models between stages with low_memory=True:

# Each stage's model is released after it runs, so only one is resident at a time.
# Recommended for GPUs with <=12GB VRAM or hosts with <32GB RAM.
dubber = VideoDubber(low_memory=True)
dubbed_video = dubber.dub_and_replace(video, target_lang="es")

Skip frame loading with dub_file():

# Operates on file paths; extracts audio via ffmpeg, runs the pipeline on the
# audio only, and muxes the dubbed audio back into the source video using
# ffmpeg stream-copy (no video re-encode). Peak memory is bounded by model
# weights and the audio track, independent of video length and resolution.
dubber = VideoDubber(low_memory=True)
result = dubber.dub_file(
    input_path="long_video.mp4",
    output_path="dubbed.mp4",
    target_lang="es",
)

Use dub_file() when you don't need frame-level access in Python. Combine with low_memory=True for the smallest memory footprint. See Processing Large Videos for a worked example.
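To see why skipping frame loading matters, here is a back-of-the-envelope estimate. It is a sketch under stated assumptions, not a measurement of videopython: frames are assumed to be held as uncompressed 8-bit RGB, and audio as 32-bit float samples. Both function names are hypothetical.

```python
def raw_frame_bytes(width: int, height: int, fps: float, seconds: float, channels: int = 3) -> int:
    """Bytes needed to hold every decoded frame as 8-bit samples (assumed layout)."""
    frame_count = round(fps * seconds)
    return width * height * channels * frame_count


def raw_audio_bytes(sample_rate: int, seconds: float, channels: int = 2, bytes_per_sample: int = 4) -> int:
    """Bytes needed to hold the decoded audio track (assumed 32-bit float samples)."""
    return round(sample_rate * seconds) * channels * bytes_per_sample


# Ten minutes of 1080p at 30 fps versus ten minutes of 44.1 kHz stereo audio.
video_bytes = raw_frame_bytes(1920, 1080, fps=30, seconds=600)
audio_bytes = raw_audio_bytes(44_100, seconds=600)
print(f"frames: {video_bytes / 1e9:.1f} GB, audio: {audio_bytes / 1e9:.2f} GB")
```

Under these assumptions the frames dominate by several orders of magnitude, which is the gap a path-based workflow avoids.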

Whisper Model Selection

Pick the Whisper model size used for transcription. Larger models are more accurate but use more VRAM and run slower. The default is turbo, which gives large-v3 quality at ~8x the speed of large (and ~2x faster than small), so the out-of-the-box dubbing path is both accurate and fast.

# Even higher accuracy on very noisy or heavily accented audio
dubber = VideoDubber(whisper_model="large")

# Lower VRAM footprint for short clips
dubber = VideoDubber(whisper_model="tiny")

Supported sizes: tiny, base, small, medium, large, turbo.

Anti-Hallucination Knobs

VideoDubber forwards three Whisper decoder kwargs to AudioToText so dubbing inherits the same defaults — most importantly condition_on_previous_text=False, which prevents a single hallucinated filler from cascading through the whole dubbed track on noisy or sparse-speech inputs.

# Defaults already protect against the cascading-hallucination failure mode.
dubber = VideoDubber()

# Tighter no-speech gate for a film with heavy ambient music.
dubber = VideoDubber(no_speech_threshold=0.85)

See AudioToText for the full rationale.

Brand-Name Vocabulary

Pass a list of brand names, product names, or proper nouns that may appear in the source audio. The list is forwarded to AudioToText and biases Whisper's first-window decoder via initial_prompt, recovering near-mishears (e.g. Klarna → "carna") on brand-monitoring inputs.

dubber = VideoDubber(vocabulary=["Klarna", "Allegro", "InPost"])

See Brand-name vocabulary biasing for normalization rules and the token-budget guard.

Translation Backend

Translation runs through OllamaTranslator — a single local Ollama text model. It sends the transcription segments under a structured-output schema and reads back length-budgeted, context-aware translations: the prompt carries a per-segment character budget (from source duration) and a low_confidence hint sourced from Whisper avg_logprob. Long sources are split into chunks that fit the model's context window, with one parse-retry on any segments the first pass misses.

# Translation always uses the local Ollama model. Pick the Ollama model + server
# (pull it first: `ollama pull qwen3.6:27b`).
dubber = VideoDubber(translator_model="qwen3.6:27b", translator_host="http://localhost:11434")

Segments the model never returns (after the retry) are surfaced on DubbingResult.translation_failures as a list of indices; those segments appear on the result with empty translated text.
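A caller will usually want to separate translated segments from failed ones before shipping a dub. The sketch below works on plain lists so it runs without videopython. It assumes the failure indices are positions in the segment list. The helper name is hypothetical, and only translation_failures itself comes from the text above.

```python
def split_by_failure(translated_texts, failure_indices):
    """Return (index, text) pairs that translated, plus the sorted failed indices."""
    # Ignore out-of-range indices instead of raising; a deliberate choice for this sketch.
    failed = {i for i in failure_indices if 0 <= i < len(translated_texts)}
    translated = [(i, text) for i, text in enumerate(translated_texts) if i not in failed]
    return translated, sorted(failed)


# Failed segments carry empty translated text, as described above.
texts = ["Hola a todos.", "", "Gracias por venir.", ""]
translated, missing = split_by_failure(texts, failure_indices=[1, 3])
failure_rate = len(missing) / len(texts)
print(f"{len(missing)} of {len(texts)} segments untranslated ({failure_rate:.0%})")
```

A failure rate like this can be logged or used to decide whether to retry with a different translator_model.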

The Ollama translator attempts any language pair, so the pipeline does not reject a requested target language up front.

Pluggable TTS Backend

Speech synthesis is decoupled from the pipeline behind a runtime_checkable SpeechBackend protocol (videopython.ai.dubbing._tts_backend). The bundled local TextToSpeech (Chatterbox) satisfies it structurally; when no backend is injected the pipeline constructs it lazily.

To dub without chatterbox in the process, inject any object exposing generate_audio(text, *, voice_sample_path=..., exaggeration=..., cfg_weight=..., temperature=...) -> Audio (e.g. an out-of-process or remote synthesizer):

from videopython.ai.dubbing import VideoDubber
from videopython.audio import Audio

class RemoteTTS:
    def generate_audio(self, text, voice_sample=None, voice_sample_path=None,
                       exaggeration=None, cfg_weight=None, temperature=None) -> Audio:
        ...  # call your remote/Modal synthesizer, return an Audio

dubber = VideoDubber(tts_backend=RemoteTTS())  # chatterbox never enters the process

videopython ships only the protocol plus the local backend — there is no reference remote/HTTP backend.
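Structural matching against a runtime_checkable protocol can be illustrated without importing videopython. The protocol below is a local stand-in that mirrors the documented generate_audio signature; it is not the library's SpeechBackend, and the return type is left as Any because Audio is not imported here.

```python
from typing import Any, Optional, Protocol, runtime_checkable


@runtime_checkable
class SpeechBackendSketch(Protocol):
    """Local stand-in for the documented protocol (assumption, not the real class)."""

    def generate_audio(
        self,
        text: str,
        *,
        voice_sample_path: Optional[str] = None,
        exaggeration: Optional[float] = None,
        cfg_weight: Optional[float] = None,
        temperature: Optional[float] = None,
    ) -> Any: ...


class EchoTTS:
    """No inheritance needed: having the method is enough."""

    def generate_audio(self, text, *, voice_sample_path=None, exaggeration=None,
                       cfg_weight=None, temperature=None):
        return f"<audio for {text!r}>"


class NotABackend:
    def speak(self, text):
        return text


print(isinstance(EchoTTS(), SpeechBackendSketch))
print(isinstance(NotABackend(), SpeechBackendSketch))
```

Note that isinstance() against a runtime-checkable protocol only checks that the method exists, not that its signature matches, so a backend with the wrong parameters would still pass this check and fail later at call time.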

Output Options for `dub_file`

dub_file() writes the dubbed video by stream-copying the source video and muxing the new audio. Two extras carry through automatically and one is opt-in:

Subtitles pass-through (automatic). Subtitle streams from the source video are stream-copied into the output by default. Sources without subtitles are tolerated.
Source loudness match (automatic). The dubbed audio is gain-matched to the source via BS.1770 integrated-loudness measurement (pyloudnorm, BSD-3) so the dub lands within ~1 LU of the source on dialogue-heavy mixes. Falls back to peak-amplitude match for clips shorter than 400 ms; post-gain peaks are clamped to 0.99.
keep_original_audio=True (opt-in). Retains the source audio as a secondary audio track behind the dubbed one. Useful for editorial A/B; the dubbed track stays the default-playback track.

result = dubber.dub_file(
    input_path="interview.mp4",
    output_path="interview_es.mp4",
    target_lang="es",
    keep_original_audio=True,  # source audio rides along as track #2
)
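The gain step of the loudness match can be sketched in plain Python. This only illustrates the arithmetic: it assumes the two integrated-loudness values (in LUFS) have already been measured, for example with pyloudnorm, and it reads "clamped to 0.99" as hard-clipping each sample, which is one possible interpretation. All function names are hypothetical.

```python
def loudness_match_gain(source_lufs: float, dubbed_lufs: float) -> float:
    """Linear gain that moves the dub's integrated loudness onto the source's."""
    gain_db = source_lufs - dubbed_lufs
    return 10.0 ** (gain_db / 20.0)


def peak_match_gain(source_samples, dubbed_samples) -> float:
    """Fallback for very short clips: match peak amplitude instead of loudness."""
    source_peak = max((abs(s) for s in source_samples), default=0.0)
    dubbed_peak = max((abs(s) for s in dubbed_samples), default=0.0)
    return source_peak / dubbed_peak if dubbed_peak > 0 else 1.0


def apply_gain(samples, gain: float, ceiling: float = 0.99):
    """Scale samples, then clip anything beyond +/- ceiling (assumed behavior)."""
    return [max(-ceiling, min(ceiling, s * gain)) for s in samples]


dub = [0.1, -0.3, 0.6]
gain = loudness_match_gain(source_lufs=-20.0, dubbed_lufs=-26.0)  # dub is 6 LU quieter
matched = apply_gain(dub, gain)
print(f"gain x{gain:.3f}, peak after clamp {max(abs(s) for s in matched):.2f}")
```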

Transcript Quality Gating

Even with condition_on_previous_text=False, sufficiently degenerate input (ambient music, mostly-silent windows misread as speech) can still produce unusable transcripts. The pipeline runs a cheap heuristic over the Whisper output and exposes the assessment on every result.

Three checks fire flags:

Dominant phrase — one phrase covers ≥70% of segment characters (catches cascades like the Japanese YouTube outro 「ご視聴ありがとうございました」).
Low decoder confidence — median avg_logprob < -1.5.
Sparse speech — speech-region duration is <5% of clip duration on inputs >30s.

The recommendation is "reject" when the dominant-phrase flag fires together with at least one other flag, "warn" when any single flag fires, and "ok" otherwise. Repetition alone (chants, song lyrics) only warns.

result = dubber.dub(video, target_lang="es")

q = result.transcript_quality
if q is not None:
    print(q.recommendation)            # "ok" | "warn" | "reject"
    print(q.dominant_phrase_fraction)  # 0.0-1.0
    print(q.flags)                     # ["dominant_phrase", ...]
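The three checks and the recommendation rule can be approximated in a few lines. This is a sketch of the heuristic as described, not the library's implementation. It treats each whole segment text as a "phrase", the flag names other than "dominant_phrase" are hypothetical, and it would also flag a one-segment transcript as dominant, which a real heuristic would presumably guard against.

```python
from collections import Counter
from statistics import median


def assess_transcript(segment_texts, avg_logprobs, speech_seconds, clip_seconds):
    """Return (recommendation, dominant_phrase_fraction, flags)."""
    flags = []

    # Dominant phrase: share of all segment characters taken by the most common text.
    chars_by_phrase = Counter()
    for text in segment_texts:
        phrase = text.strip().lower()
        chars_by_phrase[phrase] += len(phrase)
    total_chars = sum(chars_by_phrase.values())
    dominant_fraction = max(chars_by_phrase.values()) / total_chars if total_chars else 0.0
    if dominant_fraction >= 0.70:
        flags.append("dominant_phrase")

    # Low decoder confidence: median avg_logprob below -1.5.
    if avg_logprobs and median(avg_logprobs) < -1.5:
        flags.append("low_confidence")

    # Sparse speech: under 5% speech on clips longer than 30 s.
    if clip_seconds > 30 and speech_seconds / clip_seconds < 0.05:
        flags.append("sparse_speech")

    if "dominant_phrase" in flags and len(flags) >= 2:
        recommendation = "reject"
    elif flags:
        recommendation = "warn"
    else:
        recommendation = "ok"
    return recommendation, dominant_fraction, flags


cascade = ["Thanks for watching"] * 9 + ["hello"]
print(assess_transcript(cascade, [-2.0] * 10, speech_seconds=50, clip_seconds=60))
```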

Use strict_quality=True to refuse low-quality transcripts before paying for Demucs, translation, and TTS:

from videopython.ai.dubbing import GarbageTranscriptError

dubber = VideoDubber(strict_quality=True)
try:
    dubber.dub(video, target_lang="es")
except GarbageTranscriptError as exc:
    print("Refused:", exc.quality.flags)

Timing Summary

DubbingResult.timing_summary aggregates the per-segment timing adjustments the synchronizer applied to fit translated speech into source durations. High truncation rates indicate translation produced text that was too long for the source's spoken regions — a quality red flag worth surfacing in eval harnesses or product UI.

result = dubber.dub(video, target_lang="es")

ts = result.timing_summary
if ts is not None:
    print(f"{ts.clean_count}/{ts.total_segments} clean")
    print(f"{ts.truncated_count} truncated, worst {ts.max_truncation_seconds:.2f}s")
    print(f"mean speed factor {ts.mean_speed_factor:.3f}")
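For an eval harness, the summary can be reduced to a single pass/fail signal. The sketch below uses a stand-in dataclass carrying the field names shown above, so it runs without videopython. The thresholds are illustrative assumptions, not library defaults.

```python
from dataclasses import dataclass


@dataclass
class TimingSummarySketch:
    """Stand-in carrying the documented field names (not the library's class)."""
    total_segments: int
    clean_count: int
    truncated_count: int
    max_truncation_seconds: float
    mean_speed_factor: float


def truncation_rate(ts) -> float:
    """Fraction of segments whose synthesized speech had to be cut."""
    return ts.truncated_count / ts.total_segments if ts.total_segments else 0.0


def needs_review(ts, max_rate: float = 0.2, max_cut_seconds: float = 1.0) -> bool:
    # Hypothetical thresholds: tune them against your own material.
    return truncation_rate(ts) > max_rate or ts.max_truncation_seconds > max_cut_seconds


ts = TimingSummarySketch(total_segments=40, clean_count=28, truncated_count=10,
                         max_truncation_seconds=0.6, mean_speed_factor=1.08)
print(f"truncation rate {truncation_rate(ts):.0%}, review: {needs_review(ts)}")
```

Because the helpers only read attributes, they accept any object with the same field names.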

Source-Prosody Expressiveness

ChatterboxMultilingualTTS.generate() exposes exaggeration, cfg_weight, and temperature knobs. The dubbing pipeline derives an Expressiveness profile per segment from the RMS level of the source vocals in that segment, relative to the RMS of the whole vocals track, and forwards it to Chatterbox, so the dub tracks the source's loud/quiet shape instead of using flat defaults on every segment.

Three buckets, picked by ear on cam1_1min.mp4:

RMS ratio vs baseline	`exaggeration`	`cfg_weight`
`< 0.7×` (calm)	`0.3`	`0.7`
`0.7×–1.3×` (normal)	Chatterbox default	Chatterbox default
`> 1.3×` (dramatic)	`0.85`	`0.35`

The Expressiveness dataclass is exported from videopython.ai.dubbing.
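The bucket table maps directly onto a small lookup. The sketch below is a plain-Python illustration of that mapping, not the library's Expressiveness code. It returns None for the middle bucket to mean "keep Chatterbox defaults", and the function names are hypothetical.

```python
import math
from typing import Optional, Sequence, Tuple


def rms(samples: Sequence[float]) -> float:
    """Root-mean-square level of a block of samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples)) if samples else 0.0


def expressiveness_for(segment_rms: float, baseline_rms: float) -> Optional[Tuple[float, float]]:
    """Return (exaggeration, cfg_weight), or None to keep Chatterbox defaults."""
    if baseline_rms <= 0:
        return None  # silent baseline: no meaningful ratio, fall back to defaults
    ratio = segment_rms / baseline_rms
    if ratio < 0.7:      # calm
        return (0.3, 0.7)
    if ratio > 1.3:      # dramatic
        return (0.85, 0.35)
    return None          # normal: 0.7x to 1.3x inclusive


baseline = rms([0.2, -0.2, 0.2, -0.2])
quiet_segment = rms([0.1, -0.1])
loud_segment = rms([0.4, -0.4])
print(expressiveness_for(quiet_segment, baseline), expressiveness_for(loud_segment, baseline))
```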

Supplying a Pre-Computed Transcription

dub(), dub_and_replace(), and dub_file() accept an optional transcription argument. Pass a pre-computed Transcription to skip the internal Whisper step — useful when you've already transcribed (and possibly hand-edited) the source.

Per-speaker voice cloning is driven by speaker labels on the supplied transcription. Three cases:

Supplied transcription	`enable_diarization`	Behavior
Has speaker labels	any	Use supplied speakers; `enable_diarization` ignored
No speakers	`True`	Run pyannote on the audio, attach speakers to supplied words
No speakers	`False`	Use as-is; all segments share a single voice clone

The diarize-on-supplied path requires word-level timings on the supplied transcription — transcriptions loaded from SRT (one synthetic word per block) are rejected.
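The decision table, together with the word-timing requirement, amounts to the small function below. It is a sketch of the documented behavior, not the pipeline's code: the returned labels are hypothetical, and raising ValueError for the rejected case is an assumption about the error type.

```python
def speaker_strategy(has_speaker_labels: bool, enable_diarization: bool, has_word_timings: bool) -> str:
    """Pick how speakers are resolved for a supplied transcription."""
    if has_speaker_labels:
        # Supplied speakers win; enable_diarization is ignored.
        return "use_supplied_speakers"
    if not enable_diarization:
        # No labels and no diarization: one voice clone for every segment.
        return "single_voice_clone"
    if not has_word_timings:
        # E.g. a transcription loaded from SRT, with one synthetic word per block.
        raise ValueError("diarizing a supplied transcription requires word-level timings")
    return "diarize_and_attach"


print(speaker_strategy(True, False, True))
print(speaker_strategy(False, True, True))
print(speaker_strategy(False, False, False))
```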

# Workflow: transcribe, edit, then dub with per-speaker cloning
from videopython.ai.dubbing import VideoDubber
from videopython.ai.understanding.audio import AudioToText
from videopython.base import Video

video = Video.from_path("video.mp4")

# 1. Transcribe with diarization
transcriber = AudioToText(enable_diarization=True)
transcription = transcriber.transcribe(video)

# 2. Edit segment text in-place (correct misrecognitions, etc.)
for seg in transcription.segments:
    if "incorrect word" in seg.text:
        seg.text = seg.text.replace("incorrect word", "correct word")

# 3. Dub using the edited transcription. Speaker labels from step 1 are
#    preserved, so each speaker gets their own cloned voice.
dubber = VideoDubber()
dubbed_video = dubber.dub_and_replace(
    video=video,
    target_lang="es",
    transcription=transcription,
)

If you have a transcription without speakers and want per-speaker cloning, pass enable_diarization=True — pyannote will run standalone (skipping the Whisper re-transcription).

VideoDubber

Dubs videos into different languages using the local pipeline.

Accepts either a DubbingConfig or the same knobs as flat kwargs (device, low_memory, whisper_model, translator_model, etc.); the flat path builds a DubbingConfig internally. See DubbingConfig for the full knob list and defaults.

Source code in src/videopython/ai/dubbing/dubber.py

class VideoDubber:
    """Dubs videos into different languages using the local pipeline.

    Accepts either a :class:`DubbingConfig` or the same knobs as flat kwargs
    (``device``, ``low_memory``, ``whisper_model``, ``translator_model``, etc.)
    -- the flat path builds a ``DubbingConfig`` internally. See
    :class:`DubbingConfig` for the full knob list and defaults.
    """

    def __init__(
        self,
        config: DubbingConfig | None = None,
        *,
        tts_backend: SpeechBackend | None = None,
        **kwargs: Any,
    ):
        self.config = DubbingConfig.from_args(config, **kwargs)
        # Optional injected speech backend. None -> the pipeline lazily builds
        # the local chatterbox-backed TextToSpeech (from the [ai] extra). Inject
        # a SpeechBackend to run synthesis out-of-process (e.g. a remote/Modal
        # function) without loading chatterbox here.
        self._tts_backend = tts_backend
        self._local_pipeline: Any = None
        logger.info(
            "VideoDubber initialized with %s",
            " ".join(f"{k}={v}" for k, v in self.config.init_log_fields().items()),
        )

    def _init_local_pipeline(self) -> None:
        from videopython.ai.dubbing.pipeline import LocalDubbingPipeline

        self._local_pipeline = LocalDubbingPipeline(config=self.config, tts_backend=self._tts_backend)

    def dub(
        self,
        video: Video,
        target_lang: str,
        source_lang: str | None = None,
        preserve_background: bool = True,
        voice_clone: bool = True,
        enable_diarization: bool = False,
        progress_callback: Callable[[str, float], None] | None = None,
        transcription: Any = None,
    ) -> DubbingResult:
        """Dub a video into a target language.

        Args:
            enable_diarization: Enable speaker diarization to clone each speaker's
                voice separately. With ``transcription=None``, runs alongside Whisper.
                With a supplied ``transcription`` that has no speakers, runs pyannote
                standalone and overlays speakers onto the supplied words. Ignored when
                the supplied transcription already has speaker labels.
            transcription: Optional pre-computed ``Transcription`` to skip the Whisper
                step. Speaker labels on the supplied transcription drive per-speaker
                voice cloning. If it has no speakers, pass ``enable_diarization=True``
                to add them via pyannote (requires word-level timings).
        """
        if self._local_pipeline is None:
            self._init_local_pipeline()

        return self._local_pipeline.process(
            source_audio=video.audio,
            target_lang=target_lang,
            source_lang=source_lang,
            preserve_background=preserve_background,
            voice_clone=voice_clone,
            enable_diarization=enable_diarization,
            progress_callback=progress_callback,
            transcription=transcription,
        )

    def dub_and_replace(
        self,
        video: Video,
        target_lang: str,
        source_lang: str | None = None,
        preserve_background: bool = True,
        voice_clone: bool = True,
        enable_diarization: bool = False,
        progress_callback: Callable[[str, float], None] | None = None,
        transcription: Any = None,
    ) -> Video:
        """Dub a video and return a new video with the dubbed audio.

        Args:
            transcription: Optional pre-computed ``Transcription`` to skip the Whisper
                step. Speaker labels on the supplied transcription drive per-speaker
                voice cloning. See ``dub()`` for the interaction with
                ``enable_diarization``.
        """
        result = self.dub(
            video=video,
            target_lang=target_lang,
            source_lang=source_lang,
            preserve_background=preserve_background,
            voice_clone=voice_clone,
            enable_diarization=enable_diarization,
            progress_callback=progress_callback,
            transcription=transcription,
        )
        return video.add_audio(result.dubbed_audio, overlay=False)

    def dub_file(
        self,
        input_path: str | Path,
        output_path: str | Path,
        target_lang: str,
        source_lang: str | None = None,
        preserve_background: bool = True,
        voice_clone: bool = True,
        enable_diarization: bool = False,
        progress_callback: Callable[[str, float], None] | None = None,
        transcription: Any = None,
        keep_original_audio: bool = False,
    ) -> DubbingResult:
        """Dub a video file in place on disk without loading video frames into memory.

        Extracts the audio track via ffmpeg, runs the dubbing pipeline on the
        audio only, then muxes the dubbed audio back into the source video
        using ffmpeg stream-copy (no video re-encode). Peak memory is bounded
        by model weights and the audio track — independent of video length and
        resolution.

        Use this instead of ``dub_and_replace`` when the source video is long
        or high-resolution and you don't need frame-level access in Python.

        Args:
            input_path: Path to the source video file.
            output_path: Path to write the dubbed video. Overwritten if it exists.
            target_lang: Target language code (e.g. ``"es"``, ``"fr"``).
            source_lang: Source language code, or ``None`` to auto-detect.
            preserve_background: Preserve background music/effects via source separation.
            voice_clone: Clone the source speaker's voice for the dubbed track.
            enable_diarization: Enable speaker diarization for per-speaker voice cloning.
                See ``dub()`` for the interaction with ``transcription``.
            progress_callback: Optional callback ``(stage: str, progress: float) -> None``.
            transcription: Optional pre-computed ``Transcription`` to skip the Whisper
                step. Speaker labels on the supplied transcription drive per-speaker
                voice cloning. If it has no speakers, pass ``enable_diarization=True``
                to add them via pyannote (requires word-level timings).
            keep_original_audio: If True, retain the source audio in the output
                as a secondary track behind the dubbed one (editorial A/B).

        Returns:
            ``DubbingResult`` with the dubbed audio, translated segments, and
            source transcription. The output video is written to ``output_path``.
        """
        from videopython.ai.dubbing.remux import replace_audio_stream_from_audio
        from videopython.audio import Audio

        input_path = Path(input_path)
        output_path = Path(output_path)

        if not input_path.exists():
            raise FileNotFoundError(f"Input video not found: {input_path}")

        logger.info("dub_file: loading audio from %s", input_path)
        source_audio = Audio.from_path(input_path)

        if self._local_pipeline is None:
            self._init_local_pipeline()

        result = self._local_pipeline.process(
            source_audio=source_audio,
            target_lang=target_lang,
            source_lang=source_lang,
            preserve_background=preserve_background,
            voice_clone=voice_clone,
            enable_diarization=enable_diarization,
            progress_callback=progress_callback,
            transcription=transcription,
        )

        # Stream the dubbed Audio directly into ffmpeg via stdin instead of
        # going through a temp WAV on disk. For a 2h dub the temp file would
        # be ~10 GB written-then-read; the streaming path drops both copies.
        replace_audio_stream_from_audio(
            video_path=input_path,
            audio=result.dubbed_audio,
            output_path=output_path,
            keep_original_audio=keep_original_audio,
        )

        return result

    def revoice(
        self,
        video: Video,
        text: str,
        preserve_background: bool = True,
        progress_callback: Callable[[str, float], None] | None = None,
    ) -> RevoiceResult:
        """Replace speech in a video with new text using voice cloning."""
        if self._local_pipeline is None:
            self._init_local_pipeline()

        return self._local_pipeline.revoice(
            source_audio=video.audio,
            text=text,
            preserve_background=preserve_background,
            progress_callback=progress_callback,
        )

    def revoice_and_replace(
        self,
        video: Video,
        text: str,
        preserve_background: bool = True,
        progress_callback: Callable[[str, float], None] | None = None,
    ) -> Video:
        """Revoice a video and return a new video with the revoiced audio."""
        result = self.revoice(
            video=video,
            text=text,
            preserve_background=preserve_background,
            progress_callback=progress_callback,
        )

        speech_duration = result.speech_duration
        video_duration = video.total_seconds

        if video_duration > speech_duration:
            output_video = video[: round(speech_duration * video.fps)]
        else:
            output_video = video

        return output_video.add_audio(result.revoiced_audio, overlay=False)

    @staticmethod
    def get_supported_languages() -> dict[str, str]:
        from videopython.ai.dubbing.translation import OllamaTranslator

        return OllamaTranslator.get_supported_languages()

dub

dub(
    video: Video,
    target_lang: str,
    source_lang: str | None = None,
    preserve_background: bool = True,
    voice_clone: bool = True,
    enable_diarization: bool = False,
    progress_callback: Callable[[str, float], None] | None = None,
    transcription: Any = None,
) -> DubbingResult

Dub a video into a target language.

Parameters:

Name	Type	Description	Default
`enable_diarization`	`bool`	Enable speaker diarization to clone each speaker's voice separately. With `transcription=None`, runs alongside Whisper. With a supplied `transcription` that has no speakers, runs pyannote standalone and overlays speakers onto the supplied words. Ignored when the supplied transcription already has speaker labels.	`False`
`transcription`	`Any`	Optional pre-computed `Transcription` to skip the Whisper step. Speaker labels on the supplied transcription drive per-speaker voice cloning. If it has no speakers, pass `enable_diarization=True` to add them via pyannote (requires word-level timings).	`None`

Source code in src/videopython/ai/dubbing/dubber.py

def dub(
    self,
    video: Video,
    target_lang: str,
    source_lang: str | None = None,
    preserve_background: bool = True,
    voice_clone: bool = True,
    enable_diarization: bool = False,
    progress_callback: Callable[[str, float], None] | None = None,
    transcription: Any = None,
) -> DubbingResult:
    """Dub a video into a target language.

    Args:
        enable_diarization: Enable speaker diarization to clone each speaker's
            voice separately. With ``transcription=None``, runs alongside Whisper.
            With a supplied ``transcription`` that has no speakers, runs pyannote
            standalone and overlays speakers onto the supplied words. Ignored when
            the supplied transcription already has speaker labels.
        transcription: Optional pre-computed ``Transcription`` to skip the Whisper
            step. Speaker labels on the supplied transcription drive per-speaker
            voice cloning. If it has no speakers, pass ``enable_diarization=True``
            to add them via pyannote (requires word-level timings).
    """
    if self._local_pipeline is None:
        self._init_local_pipeline()

    return self._local_pipeline.process(
        source_audio=video.audio,
        target_lang=target_lang,
        source_lang=source_lang,
        preserve_background=preserve_background,
        voice_clone=voice_clone,
        enable_diarization=enable_diarization,
        progress_callback=progress_callback,
        transcription=transcription,
    )

dub_and_replace

dub_and_replace(
    video: Video,
    target_lang: str,
    source_lang: str | None = None,
    preserve_background: bool = True,
    voice_clone: bool = True,
    enable_diarization: bool = False,
    progress_callback: Callable[[str, float], None] | None = None,
    transcription: Any = None,
) -> Video

Dub a video and return a new video with the dubbed audio.

Parameters:

Name	Type	Description	Default
`transcription`	`Any`	Optional pre-computed `Transcription` to skip the Whisper step. Speaker labels on the supplied transcription drive per-speaker voice cloning. See `dub()` for the interaction with `enable_diarization`.	`None`

Source code in src/videopython/ai/dubbing/dubber.py

def dub_and_replace(
    self,
    video: Video,
    target_lang: str,
    source_lang: str | None = None,
    preserve_background: bool = True,
    voice_clone: bool = True,
    enable_diarization: bool = False,
    progress_callback: Callable[[str, float], None] | None = None,
    transcription: Any = None,
) -> Video:
    """Dub a video and return a new video with the dubbed audio.

    Args:
        transcription: Optional pre-computed ``Transcription`` to skip the Whisper
            step. Speaker labels on the supplied transcription drive per-speaker
            voice cloning. See ``dub()`` for the interaction with
            ``enable_diarization``.
    """
    result = self.dub(
        video=video,
        target_lang=target_lang,
        source_lang=source_lang,
        preserve_background=preserve_background,
        voice_clone=voice_clone,
        enable_diarization=enable_diarization,
        progress_callback=progress_callback,
        transcription=transcription,
    )
    return video.add_audio(result.dubbed_audio, overlay=False)

dub_file

dub_file(
    input_path: str | Path,
    output_path: str | Path,
    target_lang: str,
    source_lang: str | None = None,
    preserve_background: bool = True,
    voice_clone: bool = True,
    enable_diarization: bool = False,
    progress_callback: Callable[[str, float], None] | None = None,
    transcription: Any = None,
    keep_original_audio: bool = False,
) -> DubbingResult

Dub a video file in place on disk without loading video frames into memory.

Extracts the audio track via ffmpeg, runs the dubbing pipeline on the audio only, then muxes the dubbed audio back into the source video using ffmpeg stream-copy (no video re-encode). Peak memory is bounded by model weights and the audio track — independent of video length and resolution.

Use this instead of dub_and_replace when the source video is long or high-resolution and you don't need frame-level access in Python.

Parameters:

Name	Type	Description	Default
`input_path`	`str \| Path`	Path to the source video file.	required
`output_path`	`str \| Path`	Path to write the dubbed video. Overwritten if it exists.	required
`target_lang`	`str`	Target language code (e.g. `"es"`, `"fr"`).	required
`source_lang`	`str \| None`	Source language code, or `None` to auto-detect.	`None`
`preserve_background`	`bool`	Preserve background music/effects via source separation.	`True`
`voice_clone`	`bool`	Clone the source speaker's voice for the dubbed track.	`True`
`enable_diarization`	`bool`	Enable speaker diarization for per-speaker voice cloning. See `dub()` for the interaction with `transcription`.	`False`
`progress_callback`	`Callable[[str, float], None] \| None`	Optional callback `(stage: str, progress: float) -> None`.	`None`
`transcription`	`Any`	Optional pre-computed `Transcription` to skip the Whisper step. Speaker labels on the supplied transcription drive per-speaker voice cloning. If it has no speakers, pass `enable_diarization=True` to add them via pyannote (requires word-level timings).	`None`
`keep_original_audio`	`bool`	If True, retain the source audio in the output as a secondary track behind the dubbed one (editorial A/B).	`False`

Returns:

Type	Description
`DubbingResult`	`DubbingResult` with the dubbed audio, translated segments, and source transcription. The output video is written to `output_path`.

Source code in src/videopython/ai/dubbing/dubber.py

def dub_file(
    self,
    input_path: str | Path,
    output_path: str | Path,
    target_lang: str,
    source_lang: str | None = None,
    preserve_background: bool = True,
    voice_clone: bool = True,
    enable_diarization: bool = False,
    progress_callback: Callable[[str, float], None] | None = None,
    transcription: Any = None,
    keep_original_audio: bool = False,
) -> DubbingResult:
    """Dub a video file in place on disk without loading video frames into memory.

    Extracts the audio track via ffmpeg, runs the dubbing pipeline on the
    audio only, then muxes the dubbed audio back into the source video
    using ffmpeg stream-copy (no video re-encode). Peak memory is bounded
    by model weights and the audio track — independent of video length and
    resolution.

    Use this instead of ``dub_and_replace`` when the source video is long
    or high-resolution and you don't need frame-level access in Python.

    Args:
        input_path: Path to the source video file.
        output_path: Path to write the dubbed video. Overwritten if it exists.
        target_lang: Target language code (e.g. ``"es"``, ``"fr"``).
        source_lang: Source language code, or ``None`` to auto-detect.
        preserve_background: Preserve background music/effects via source separation.
        voice_clone: Clone the source speaker's voice for the dubbed track.
        enable_diarization: Enable speaker diarization for per-speaker voice cloning.
            See ``dub()`` for the interaction with ``transcription``.
        progress_callback: Optional callback ``(stage: str, progress: float) -> None``.
        transcription: Optional pre-computed ``Transcription`` to skip the Whisper
            step. Speaker labels on the supplied transcription drive per-speaker
            voice cloning. If it has no speakers, pass ``enable_diarization=True``
            to add them via pyannote (requires word-level timings).
        keep_original_audio: If True, retain the source audio in the output
            as a secondary track behind the dubbed one (editorial A/B).

    Returns:
        ``DubbingResult`` with the dubbed audio, translated segments, and
        source transcription. The output video is written to ``output_path``.
    """
    from videopython.ai.dubbing.remux import replace_audio_stream_from_audio
    from videopython.audio import Audio

    input_path = Path(input_path)
    output_path = Path(output_path)

    if not input_path.exists():
        raise FileNotFoundError(f"Input video not found: {input_path}")

    logger.info("dub_file: loading audio from %s", input_path)
    source_audio = Audio.from_path(input_path)

    if self._local_pipeline is None:
        self._init_local_pipeline()

    result = self._local_pipeline.process(
        source_audio=source_audio,
        target_lang=target_lang,
        source_lang=source_lang,
        preserve_background=preserve_background,
        voice_clone=voice_clone,
        enable_diarization=enable_diarization,
        progress_callback=progress_callback,
        transcription=transcription,
    )

    # Stream the dubbed Audio directly into ffmpeg via stdin instead of
    # going through a temp WAV on disk. For a 2h dub the temp file would
    # be ~10 GB written-then-read; the streaming path drops both copies.
    replace_audio_stream_from_audio(
        video_path=input_path,
        audio=result.dubbed_audio,
        output_path=output_path,
        keep_original_audio=keep_original_audio,
    )

    return result
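
The `progress_callback` parameter takes any callable of the form `(stage: str, progress: float) -> None`. The following is a minimal sketch of such a callback using only the standard library. The stage names and the assumption that `progress` is a fraction between 0.0 and 1.0 are illustrative and not confirmed by the source; `make_progress_logger` is a hypothetical helper.

```python
from typing import Callable, List

ProgressCallback = Callable[[str, float], None]


def make_progress_logger(sink: List[str]) -> ProgressCallback:
    """Return a callback matching ``(stage: str, progress: float) -> None``."""

    def on_progress(stage: str, progress: float) -> None:
        # Assumption: progress is a fraction in [0.0, 1.0]; clamp defensively.
        fraction = min(max(progress, 0.0), 1.0)
        sink.append(f"{stage}: {fraction:.0%}")

    return on_progress


# Simulate two pipeline notifications (stage names are made up).
lines: List[str] = []
callback = make_progress_logger(lines)
callback("transcribing", 0.25)
callback("synthesizing", 1.0)
print(lines)
```

A callback built this way could then be passed as `progress_callback=callback` to `dub_file`, `dub`, or `revoice`.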

revoice

revoice(
    video: Video,
    text: str,
    preserve_background: bool = True,
    progress_callback: Callable[[str, float], None] | None = None,
) -> RevoiceResult

Replace speech in a video with new text using voice cloning.

Source code in src/videopython/ai/dubbing/dubber.py

def revoice(
    self,
    video: Video,
    text: str,
    preserve_background: bool = True,
    progress_callback: Callable[[str, float], None] | None = None,
) -> RevoiceResult:
    """Replace speech in a video with new text using voice cloning."""
    if self._local_pipeline is None:
        self._init_local_pipeline()

    return self._local_pipeline.revoice(
        source_audio=video.audio,
        text=text,
        preserve_background=preserve_background,
        progress_callback=progress_callback,
    )

revoice_and_replace

revoice_and_replace(
    video: Video,
    text: str,
    preserve_background: bool = True,
    progress_callback: Callable[[str, float], None] | None = None,
) -> Video

Revoice a video and return a new video with the revoiced audio.

Source code in src/videopython/ai/dubbing/dubber.py

def revoice_and_replace(
    self,
    video: Video,
    text: str,
    preserve_background: bool = True,
    progress_callback: Callable[[str, float], None] | None = None,
) -> Video:
    """Revoice a video and return a new video with the revoiced audio."""
    result = self.revoice(
        video=video,
        text=text,
        preserve_background=preserve_background,
        progress_callback=progress_callback,
    )

    speech_duration = result.speech_duration
    video_duration = video.total_seconds

    if video_duration > speech_duration:
        output_video = video[: round(speech_duration * video.fps)]
    else:
        output_video = video

    return output_video.add_audio(result.revoiced_audio, overlay=False)
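
The trimming step above keeps only as many frames as the generated speech covers. The arithmetic can be sketched in isolation as follows; `frames_to_keep` is a hypothetical helper, not part of the library, and the explicit `min` stands in for the way a slice naturally caps at the end of the video.

```python
def frames_to_keep(
    speech_duration: float,
    video_duration: float,
    fps: float,
    total_frames: int,
) -> int:
    """Number of frames that survive the trim in revoice_and_replace."""
    if video_duration > speech_duration:
        # Same expression as the slice bound: round(speech_duration * fps).
        return min(total_frames, round(speech_duration * fps))
    # Speech is at least as long as the video: keep every frame.
    return total_frames


# A 10 s clip at 30 fps with 4 s of generated speech.
trimmed = frames_to_keep(4.0, 10.0, 30.0, 300)
# Speech longer than the clip leaves the video untouched.
untrimmed = frames_to_keep(12.0, 10.0, 30.0, 300)
print(trimmed, untrimmed)
```

Note that the video is never extended: when the speech runs longer than the source, the full video is kept and the audio is attached as-is.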

DubbingConfig

Knobs shared by VideoDubber and LocalDubbingPipeline. Both constructors accept either config=DubbingConfig(...) or the same knobs as flat kwargs (not both); the flat path builds a DubbingConfig internally.

from videopython.ai.dubbing import DubbingConfig, VideoDubber

# Flat kwargs (recommended for ad-hoc calls)
dubber = VideoDubber(device="cuda", low_memory=True, whisper_model="large")

# Explicit config (recommended for reusable presets)
config = DubbingConfig(
    device="cuda",
    low_memory=True,
    whisper_model="large",
    translator_model="qwen3.6:27b",
    vocabulary=["Klarna", "Allegro"],
)
dubber = VideoDubber(config=config)

DubbingConfig

Bases: BaseModel

Knobs shared by :class:VideoDubber and :class:LocalDubbingPipeline.

Accepted as either config=DubbingConfig(...) or flat kwargs on the two constructors; the flat path builds a DubbingConfig internally.

Attributes:

Name	Type	Description
`device`	`str \| None`	Execution device (`cpu`, `cuda`, `mps`, or `None` for auto).
`low_memory`	`bool`	When True, each pipeline stage (Whisper, Demucs, translation, Chatterbox TTS) is unloaded from memory after it runs, so only one model is resident at a time. Trades per-run latency (~10-30s of extra model loads) for a much lower memory ceiling. Recommended for GPUs with <=12GB VRAM or hosts with <32GB RAM. Default False.
`whisper_model`	`WhisperModel`	Whisper model size used for transcription. Larger models give better accuracy at the cost of VRAM and latency. One of `tiny`, `base`, `small`, `medium`, `large`, `turbo`. Default `turbo`.
`condition_on_previous_text`	`bool`	Forwarded to `AudioToText`. Defaults to `False` (Whisper's own default is `True`). With conditioning on, a single hallucinated filler phrase cascades through the rest of the file. See `AudioToText` for the full rationale.
`no_speech_threshold`	`float`	Forwarded to `AudioToText`. Whisper's no-speech gate; raise to drop more low-confidence windows.
`logprob_threshold`	`float \| None`	Forwarded to `AudioToText`. Whisper's average log-probability gate.
`vocabulary`	`list[str] \| None`	Forwarded to `AudioToText`. Optional list of brand names, product names, or proper nouns to bias Whisper's first-window decoder via `initial_prompt`. Recovers near-mishears (e.g. Klarna -> "carna") on brand-monitoring inputs without new model deps.
`strict_quality`	`bool`	When True, the pipeline raises :class:`GarbageTranscriptError` before Demucs/translation/TTS run if the transcript-quality heuristic returns `"reject"`. When False (default), low-quality transcripts are logged at WARNING but processing continues. Either way the :class:`TranscriptQuality` is exposed on `DubbingResult` for inspection.
`translator_model`	`str \| None`	Ollama tag for the translation model (`None` uses the translator's default).
`translator_host`	`str \| None`	Server URL for the translation model.

Source code in src/videopython/ai/dubbing/config.py

class DubbingConfig(BaseModel):
    """Knobs shared by :class:`VideoDubber` and :class:`LocalDubbingPipeline`.

    Accepted as either ``config=DubbingConfig(...)`` or flat kwargs on the
    two constructors; the flat path builds a ``DubbingConfig`` internally.

    Attributes:
        device: Execution device (``cpu``, ``cuda``, ``mps``, or ``None`` for auto).
        low_memory: When True, each pipeline stage (Whisper, Demucs, translation,
            Chatterbox TTS) is unloaded from memory after it runs, so only one
            model is resident at a time. Trades per-run latency (~10-30s of
            extra model loads) for a much lower memory ceiling. Recommended
            for GPUs with <=12GB VRAM or hosts with <32GB RAM. Default False.
        whisper_model: Whisper model size used for transcription. Larger
            models give better accuracy at the cost of VRAM and latency. One
            of ``tiny``, ``base``, ``small``, ``medium``, ``large``, ``turbo``.
            Default ``turbo``.
        condition_on_previous_text: Forwarded to ``AudioToText``. Defaults to
            ``False`` (Whisper's own default is ``True``). With conditioning
            on, a single hallucinated filler phrase cascades through the rest
            of the file. See ``AudioToText`` for the full rationale.
        no_speech_threshold: Forwarded to ``AudioToText``. Whisper's
            no-speech gate; raise to drop more low-confidence windows.
        logprob_threshold: Forwarded to ``AudioToText``. Whisper's average
            log-probability gate.
        vocabulary: Forwarded to ``AudioToText``. Optional list of brand
            names, product names, or proper nouns to bias Whisper's
            first-window decoder via ``initial_prompt``. Recovers
            near-mishears (e.g. Klarna -> "carna") on brand-monitoring
            inputs without new model deps.
        strict_quality: When True, the pipeline raises
            :class:`GarbageTranscriptError` before Demucs/translation/TTS
            run if the transcript-quality heuristic returns ``"reject"``.
            When False (default), low-quality transcripts are logged at
            WARNING but processing continues. Either way the
            :class:`TranscriptQuality` is exposed on ``DubbingResult`` for
            inspection.
        translator_model: Ollama tag for the translation model (``None`` uses the
            translator's default). ``translator_host`` sets the server URL.
    """

    model_config = ConfigDict(frozen=True)

    device: str | None = None
    low_memory: bool = False
    whisper_model: WhisperModel = "turbo"
    condition_on_previous_text: bool = False
    no_speech_threshold: float = 0.6
    logprob_threshold: float | None = -1.0
    vocabulary: list[str] | None = None
    strict_quality: bool = False
    translator_model: str | None = None
    translator_host: str | None = None

    @classmethod
    def from_args(cls, config: DubbingConfig | None = None, /, **kwargs: Any) -> DubbingConfig:
        """Resolve either a ``config`` object or flat knob kwargs into a config.

        The shared accept-one-or-the-other guard for ``VideoDubber`` and
        ``LocalDubbingPipeline``, which both take ``config=DubbingConfig(...)``
        or the flat kwargs (not both).
        """
        if config is not None and kwargs:
            raise TypeError("Pass either `config=` or knob kwargs, not both")
        return config or cls(**kwargs)

    def init_log_fields(self) -> dict[str, object]:
        """Subset of fields surfaced in the init-log line.

        Hand-picked so log noise stays bounded as the config grows.
        """
        return {
            "device": self.device.lower() if isinstance(self.device, str) else "auto",
            "low_memory": self.low_memory,
            "whisper_model": self.whisper_model,
            "translator_model": self.translator_model,
        }

from_args `classmethod`

from_args(
    config: DubbingConfig | None = None, /, **kwargs: Any
) -> DubbingConfig

Resolve either a config object or flat knob kwargs into a config.

The shared accept-one-or-the-other guard for VideoDubber and LocalDubbingPipeline, which both take config=DubbingConfig(...) or the flat kwargs (not both).

Source code in src/videopython/ai/dubbing/config.py

@classmethod
def from_args(cls, config: DubbingConfig | None = None, /, **kwargs: Any) -> DubbingConfig:
    """Resolve either a ``config`` object or flat knob kwargs into a config.

    The shared accept-one-or-the-other guard for ``VideoDubber`` and
    ``LocalDubbingPipeline``, which both take ``config=DubbingConfig(...)``
    or the flat kwargs (not both).
    """
    if config is not None and kwargs:
        raise TypeError("Pass either `config=` or knob kwargs, not both")
    return config or cls(**kwargs)
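
The accept-one-or-the-other guard can be tried out without installing the library. The sketch below mirrors the logic of `from_args` on a hypothetical stand-in class, `SketchConfig`, which is a frozen dataclass carrying only two of the knobs; the real `DubbingConfig` is a pydantic model.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass(frozen=True)
class SketchConfig:
    """Hypothetical stand-in for DubbingConfig with two of its knobs."""

    device: Optional[str] = None
    low_memory: bool = False

    @classmethod
    def from_args(
        cls, config: Optional["SketchConfig"] = None, /, **kwargs: Any
    ) -> "SketchConfig":
        # Reject the ambiguous case where both forms are supplied.
        if config is not None and kwargs:
            raise TypeError("Pass either `config=` or knob kwargs, not both")
        return config if config is not None else cls(**kwargs)


preset = SketchConfig(device="cuda", low_memory=True)
from_preset = SketchConfig.from_args(preset)       # returned unchanged
from_kwargs = SketchConfig.from_args(device="cpu")  # built from flat kwargs

try:
    SketchConfig.from_args(preset, device="cpu")
    rejected_both = False
except TypeError:
    rejected_both = True

print(from_preset, from_kwargs, rejected_both)
```

Because `config` is positional-only, a caller passes the preset as the first argument and can never collide with a knob that happens to be named `config`.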

init_log_fields

init_log_fields() -> dict[str, object]

Subset of fields surfaced in the init-log line.

Hand-picked so log noise stays bounded as the config grows.

Source code in src/videopython/ai/dubbing/config.py

def init_log_fields(self) -> dict[str, object]:
    """Subset of fields surfaced in the init-log line.

    Hand-picked so log noise stays bounded as the config grows.
    """
    return {
        "device": self.device.lower() if isinstance(self.device, str) else "auto",
        "low_memory": self.low_memory,
        "whisper_model": self.whisper_model,
        "translator_model": self.translator_model,
    }

DubbingResult

Result of a dubbing operation containing the dubbed audio and metadata.

result = dubber.dub(video, target_lang="es")

print(f"Translated {result.num_segments} segments")
print(f"Source language: {result.source_lang}")
print(f"Target language: {result.target_lang}")

# Access translated segments
for segment in result.translated_segments:
    print(f"'{segment.original_text}' -> '{segment.translated_text}'")

# Access voice samples used for cloning
for speaker, sample in result.voice_samples.items():
    print(f"{speaker}: {sample.metadata.duration_seconds:.1f}s sample")

DubbingResult

Bases: BaseModel

Result of a video dubbing operation.

Attributes:

Name	Type	Description
`dubbed_audio`	`Audio`	The final dubbed audio track.
`translated_segments`	`list[TranslatedSegment]`	List of translated segments with timing.
`source_transcription`	`Transcription`	Original transcription of the source audio.
`source_lang`	`str`	Detected or specified source language.
`target_lang`	`str`	Target language for dubbing.
`separated_audio`	`SeparatedAudio \| None`	Separated audio components (if preserve_background=True).
`voice_samples`	`dict[str, Audio]`	Dictionary mapping speaker IDs to voice sample Audio.
`timing_summary`	`TimingSummary \| None`	Aggregate stats over per-segment timing adjustments.
`transcript_quality`	`TranscriptQuality \| None`	Heuristic quality assessment of the transcription (None when the pipeline returned early on an empty transcription).
`translation_failures`	`list[int]`	Indices of segments the translator could not translate (missing after its parse-retry pass); those segments are dubbed with empty text.
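
Since failed segments are dubbed with empty text, it can be useful to list them after a run. The following is a small sketch that assumes the indices in `translation_failures` refer to positions in `translated_segments`; that mapping is an assumption, and `failed_segment_report` is a hypothetical helper. The `SimpleNamespace` objects only imitate the attributes used here.

```python
from types import SimpleNamespace
from typing import List


def failed_segment_report(result) -> List[str]:
    """Describe every segment the translator could not translate."""
    report = []
    for index in result.translation_failures:
        # Assumption: indices point into result.translated_segments.
        segment = result.translated_segments[index]
        report.append(f"segment {index}: {segment.original_text!r}")
    return report


# Stand-in for a DubbingResult with one failed segment.
fake_result = SimpleNamespace(
    translated_segments=[
        SimpleNamespace(original_text="Hello.", translated_text="Hola."),
        SimpleNamespace(original_text="Goodbye.", translated_text=""),
    ],
    translation_failures=[1],
)
report = failed_segment_report(fake_result)
print(report)
```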

Source code in src/videopython/ai/dubbing/models.py

class DubbingResult(BaseModel):
    """Result of a video dubbing operation.

    Attributes:
        dubbed_audio: The final dubbed audio track.
        translated_segments: List of translated segments with timing.
        source_transcription: Original transcription of the source audio.
        source_lang: Detected or specified source language.
        target_lang: Target language for dubbing.
        separated_audio: Separated audio components (if preserve_background=True).
        voice_samples: Dictionary mapping speaker IDs to voice sample Audio.
        timing_summary: Aggregate stats over per-segment timing adjustments.
        transcript_quality: Heuristic quality assessment of the transcription
            (None when the pipeline returned early on an empty transcription).
        translation_failures: Indices of segments the translator could not
            translate (missing after its parse-retry pass); those segments are
            dubbed with empty text.
    """

    model_config = ConfigDict(arbitrary_types_allowed=True)

    dubbed_audio: Audio
    translated_segments: list[TranslatedSegment]
    source_transcription: Transcription
    source_lang: str
    target_lang: str
    separated_audio: SeparatedAudio | None = None
    voice_samples: dict[str, Audio] = Field(default_factory=dict)
    timing_summary: TimingSummary | None = None
    transcript_quality: TranscriptQuality | None = None
    translation_failures: list[int] = Field(default_factory=list)

    @property
    def num_segments(self) -> int:
        """Number of translated segments."""
        return len(self.translated_segments)

    @property
    def total_duration(self) -> float:
        """Total duration of the dubbed audio."""
        return self.dubbed_audio.metadata.duration_seconds

    def get_segments_by_speaker(self) -> dict[str, list[TranslatedSegment]]:
        """Group translated segments by speaker.

        Returns:
            Dictionary mapping speaker IDs to their segments.
        """
        segments_by_speaker: dict[str, list[TranslatedSegment]] = {}
        for segment in self.translated_segments:
            speaker = segment.speaker or "unknown"
            if speaker not in segments_by_speaker:
                segments_by_speaker[speaker] = []
            segments_by_speaker[speaker].append(segment)
        return segments_by_speaker

num_segments `property`

num_segments: int

Number of translated segments.

total_duration `property`

total_duration: float

Total duration of the dubbed audio.

get_segments_by_speaker

get_segments_by_speaker() -> dict[str, list[TranslatedSegment]]

Group translated segments by speaker.

Returns:

Type	Description
`dict[str, list[TranslatedSegment]]`	Dictionary mapping speaker IDs to their segments.

Source code in src/videopython/ai/dubbing/models.py

def get_segments_by_speaker(self) -> dict[str, list[TranslatedSegment]]:
    """Group translated segments by speaker.

    Returns:
        Dictionary mapping speaker IDs to their segments.
    """
    segments_by_speaker: dict[str, list[TranslatedSegment]] = {}
    for segment in self.translated_segments:
        speaker = segment.speaker or "unknown"
        if speaker not in segments_by_speaker:
            segments_by_speaker[speaker] = []
        segments_by_speaker[speaker].append(segment)
    return segments_by_speaker
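
The same grouping rule (segments without a speaker fall under `"unknown"`) can be reused to derive per-speaker statistics. The sketch below totals speaking time per speaker; `SketchSegment` and `speaking_time_by_speaker` are hypothetical names, and the speaker labels are made up.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass
class SketchSegment:
    """Stand-in carrying only the fields used below."""

    speaker: Optional[str]
    start: float
    end: float


def speaking_time_by_speaker(segments: Iterable[SketchSegment]) -> Dict[str, float]:
    """Total seconds spoken per speaker, using "unknown" for unlabeled segments."""
    totals: Dict[str, float] = defaultdict(float)
    for segment in segments:
        totals[segment.speaker or "unknown"] += segment.end - segment.start
    return dict(totals)


segments = [
    SketchSegment("SPEAKER_00", 0.0, 2.0),
    SketchSegment(None, 2.0, 3.0),
    SketchSegment("SPEAKER_00", 3.0, 6.0),
]
totals = speaking_time_by_speaker(segments)
print(totals)
```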

RevoiceResult

Result of a revoicing operation.

result = dubber.revoice(video, text="New message here")

print(f"Text: {result.text}")
print(f"Speech duration: {result.speech_duration:.1f}s")
print(f"Voice sample: {result.voice_sample.metadata.duration_seconds:.1f}s")

RevoiceResult

Bases: BaseModel

Result of a voice replacement operation.

Attributes:

Name	Type	Description
`revoiced_audio`	`Audio`	The final audio with new speech.
`text`	`str`	The text that was spoken.
`separated_audio`	`SeparatedAudio \| None`	Separated audio components (if preserve_background=True).
`voice_sample`	`Audio \| None`	Voice sample used for cloning.
`original_duration`	`float`	Duration of the original audio.
`speech_duration`	`float`	Duration of the generated speech.

Source code in src/videopython/ai/dubbing/models.py

class RevoiceResult(BaseModel):
    """Result of a voice replacement operation.

    Attributes:
        revoiced_audio: The final audio with new speech.
        text: The text that was spoken.
        separated_audio: Separated audio components (if preserve_background=True).
        voice_sample: Voice sample used for cloning.
        original_duration: Duration of the original audio.
        speech_duration: Duration of the generated speech.
    """

    model_config = ConfigDict(arbitrary_types_allowed=True)

    revoiced_audio: Audio
    text: str
    separated_audio: SeparatedAudio | None = None
    voice_sample: Audio | None = None
    original_duration: float = 0.0
    speech_duration: float = 0.0

    @property
    def total_duration(self) -> float:
        """Total duration of the revoiced audio."""
        return self.revoiced_audio.metadata.duration_seconds

total_duration `property`

total_duration: float

Total duration of the revoiced audio.

TranslatedSegment

Individual translated speech segment with timing information.

TranslatedSegment

Bases: BaseModel

A segment of translated text with timing information.

Attributes:

Name	Type	Description
`original_segment`	`_TranscriptionSegmentField`	The original transcription segment.
`translated_text`	`str`	The translated text.
`source_lang`	`str`	Source language code (e.g., "en").
`target_lang`	`str`	Target language code (e.g., "es").
`speaker`	`str \| None`	Speaker identifier if available.
`start`	`float`	Start time in seconds.
`end`	`float`	End time in seconds.

Source code in src/videopython/ai/dubbing/models.py

class TranslatedSegment(BaseModel):
    """A segment of translated text with timing information.

    Attributes:
        original_segment: The original transcription segment.
        translated_text: The translated text.
        source_lang: Source language code (e.g., "en").
        target_lang: Target language code (e.g., "es").
        speaker: Speaker identifier if available.
        start: Start time in seconds.
        end: End time in seconds.
    """

    model_config = ConfigDict(arbitrary_types_allowed=True)

    original_segment: _TranscriptionSegmentField
    translated_text: str
    source_lang: str
    target_lang: str
    speaker: str | None = None
    start: float = 0.0
    end: float = 0.0

    @model_validator(mode="after")
    def _default_timing_from_segment(self) -> TranslatedSegment:
        # ``start == end == 0.0`` is the dataclass-era sentinel for "use the
        # original segment's timing." Preserved so legacy callers (and the
        # dub cache wire format) keep working.
        if self.start == 0.0 and self.end == 0.0:
            self.start = self.original_segment.start
            self.end = self.original_segment.end
        if self.speaker is None:
            self.speaker = self.original_segment.speaker
        return self

    @property
    def original_text(self) -> str:
        """Get the original text from the segment."""
        return self.original_segment.text

    @property
    def duration(self) -> float:
        """Duration of the segment in seconds."""
        return self.end - self.start
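
The validator above treats `start == end == 0.0` as "inherit the original segment's timing". The behaviour can be reproduced with plain dataclasses, as in this sketch; `SourceSegment` and `SketchTranslated` are hypothetical stand-ins, and `__post_init__` plays the role of the pydantic `model_validator`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SourceSegment:
    text: str
    start: float
    end: float
    speaker: Optional[str] = None


@dataclass
class SketchTranslated:
    original_segment: SourceSegment
    translated_text: str
    speaker: Optional[str] = None
    start: float = 0.0
    end: float = 0.0

    def __post_init__(self) -> None:
        # Both zero is the sentinel: copy timing from the source segment.
        if self.start == 0.0 and self.end == 0.0:
            self.start = self.original_segment.start
            self.end = self.original_segment.end
        # A missing speaker is inherited from the source segment as well.
        if self.speaker is None:
            self.speaker = self.original_segment.speaker


source = SourceSegment("Hello.", start=1.5, end=3.0, speaker="SPEAKER_00")
inherited = SketchTranslated(source, "Hola.")
explicit = SketchTranslated(source, "Hola.", start=2.0, end=2.5)
print(inherited.start, inherited.end, explicit.start, explicit.end)
```

Only the case where both values are zero triggers the fallback, so a segment that starts at 0.0 with a non-zero end keeps its explicit timing.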

original_text `property`

original_text: str

Get the original text from the segment.

duration `property`

duration: float

Duration of the segment in seconds.

SeparatedAudio

Audio separated into vocals and background components.

SeparatedAudio

Bases: BaseModel

Audio separated into different components.

Attributes:

Name	Type	Description
`vocals`	`Audio`	Isolated vocal/speech track.
`background`	`Audio`	Combined background audio (music + effects).
`music`	`Audio \| None`	Isolated music track (if available).
`effects`	`Audio \| None`	Isolated sound effects track (if available).
`original`	`Audio`	The original unseparated audio.

Source code in src/videopython/ai/dubbing/models.py

class SeparatedAudio(BaseModel):
    """Audio separated into different components.

    Attributes:
        vocals: Isolated vocal/speech track.
        background: Combined background audio (music + effects).
        music: Isolated music track (if available).
        effects: Isolated sound effects track (if available).
        original: The original unseparated audio.
    """

    model_config = ConfigDict(arbitrary_types_allowed=True)

    vocals: Audio
    background: Audio
    original: Audio
    music: Audio | None = None
    effects: Audio | None = None

    @property
    def has_detailed_separation(self) -> bool:
        """Check if music and effects are separated."""
        return self.music is not None and self.effects is not None

has_detailed_separation `property`

has_detailed_separation: bool

Check if music and effects are separated.

Expressiveness

Per-segment Chatterbox generate() knobs (exaggeration, cfg_weight, temperature). None on any field means "let Chatterbox use its default". The dubbing pipeline derives this from source vocals RMS automatically; the type is exposed for users who want to inspect or override per-segment values.

Expressiveness

Bases: BaseModel

Chatterbox generate() knobs derived from source-segment prosody.

None on any field means "let Chatterbox use its own default" -- avoids pinning the dub against future Chatterbox default changes.

Attributes:

Name	Type	Description
`exaggeration`	`float \| None`	Emotional intensity. Chatterbox default `0.5`; `0.7+` produces dramatic output.
`cfg_weight`	`float \| None`	Classifier-free guidance weight. Chatterbox default `0.5`; lower values (~`0.3`) slow pacing.
`temperature`	`float \| None`	Sampling temperature. Chatterbox default `0.8`.

Source code in src/videopython/ai/dubbing/models.py

class Expressiveness(BaseModel):
    """Chatterbox ``generate()`` knobs derived from source-segment prosody.

    ``None`` on any field means "let Chatterbox use its own default" --
    avoids pinning the dub against future Chatterbox default changes.

    Attributes:
        exaggeration: Emotional intensity. Chatterbox default ``0.5``;
            ``0.7+`` produces dramatic output.
        cfg_weight: Classifier-free guidance weight. Chatterbox default
            ``0.5``; lower values (~``0.3``) slow pacing.
        temperature: Sampling temperature. Chatterbox default ``0.8``.
    """

    model_config = ConfigDict(frozen=True)

    exaggeration: float | None = None
    cfg_weight: float | None = None
    temperature: float | None = None

    def as_kwargs(self) -> dict[str, float]:
        """Knobs as a dict, dropping ``None`` entries.

        Suitable for ``**``-expansion into Chatterbox.
        """
        return {
            name: value
            for name, value in (
                ("exaggeration", self.exaggeration),
                ("cfg_weight", self.cfg_weight),
                ("temperature", self.temperature),
            )
            if value is not None
        }

as_kwargs

as_kwargs() -> dict[str, float]

Knobs as a dict, dropping None entries.

Suitable for **-expansion into Chatterbox.

Source code in src/videopython/ai/dubbing/models.py

def as_kwargs(self) -> dict[str, float]:
    """Knobs as a dict, dropping ``None`` entries.

    Suitable for ``**``-expansion into Chatterbox.
    """
    return {
        name: value
        for name, value in (
            ("exaggeration", self.exaggeration),
            ("cfg_weight", self.cfg_weight),
            ("temperature", self.temperature),
        )
        if value is not None
    }
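
To illustrate why dropping `None` entries matters, the sketch below expands the resulting dict into a synthesizer call so that unset knobs fall back to the callee's defaults. `SketchExpressiveness` is a dataclass stand-in for the pydantic model, and `fake_generate` is a hypothetical function, not Chatterbox's real signature; its defaults are copied from the attribute table above.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class SketchExpressiveness:
    exaggeration: Optional[float] = None
    cfg_weight: Optional[float] = None
    temperature: Optional[float] = None

    def as_kwargs(self) -> Dict[str, float]:
        """Knobs as a dict, dropping None entries."""
        pairs = (
            ("exaggeration", self.exaggeration),
            ("cfg_weight", self.cfg_weight),
            ("temperature", self.temperature),
        )
        return {name: value for name, value in pairs if value is not None}


def fake_generate(text, exaggeration=0.5, cfg_weight=0.5, temperature=0.8):
    """Hypothetical synthesizer stand-in that just echoes its arguments."""
    return {
        "text": text,
        "exaggeration": exaggeration,
        "cfg_weight": cfg_weight,
        "temperature": temperature,
    }


knobs = SketchExpressiveness(exaggeration=0.7)
call = fake_generate("Hola", **knobs.as_kwargs())
print(knobs.as_kwargs(), call)
```

Passing `None` through explicitly would override the callee's defaults, which is exactly what the filtering avoids.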

TimingSummary

Aggregate stats over per-segment timing adjustments applied by the synchronizer. Surfaces truncation and speed-change counts that translation-quality evaluation harnesses can compare across backends.

TimingSummary

Bases: BaseModel

Aggregate stats over per-segment timing adjustments.

Surfaces how aggressively the timing synchronizer had to compress or truncate dubbed segments to fit the source's spoken regions. High truncation rates indicate translation produced text too long for the source duration.

Source code in src/videopython/ai/dubbing/models.py

class TimingSummary(BaseModel):
    """Aggregate stats over per-segment timing adjustments.

    Surfaces how aggressively the timing synchronizer had to compress or
    truncate dubbed segments to fit the source's spoken regions. High
    truncation rates indicate translation produced text too long for the
    source duration.
    """

    total_segments: int
    clean_count: int
    stretched_count: int
    truncated_count: int
    mean_speed_factor: float
    max_truncation_seconds: float

    @classmethod
    def from_adjustments(cls, adjustments: list[TimingAdjustment]) -> TimingSummary:
        """Aggregate a list of TimingAdjustments into a TimingSummary."""
        total = len(adjustments)
        if total == 0:
            return cls(
                total_segments=0,
                clean_count=0,
                stretched_count=0,
                truncated_count=0,
                mean_speed_factor=1.0,
                max_truncation_seconds=0.0,
            )

        clean = 0
        stretched = 0
        truncated = 0
        speed_sum = 0.0
        max_truncation = 0.0
        for adj in adjustments:
            speed_sum += adj.speed_factor
            if adj.was_truncated:
                truncated += 1
                truncation = adj.original_duration - adj.actual_duration
                if truncation > max_truncation:
                    max_truncation = truncation
            elif abs(adj.speed_factor - 1.0) <= CLEAN_SPEED_TOLERANCE:
                clean += 1
            else:
                stretched += 1

        return cls(
            total_segments=total,
            clean_count=clean,
            stretched_count=stretched,
            truncated_count=truncated,
            mean_speed_factor=speed_sum / total,
            max_truncation_seconds=max_truncation,
        )

from_adjustments `classmethod`

from_adjustments(
    adjustments: list[TimingAdjustment],
) -> TimingSummary

Aggregate a list of TimingAdjustments into a TimingSummary.

Source code in src/videopython/ai/dubbing/models.py

@classmethod
def from_adjustments(cls, adjustments: list[TimingAdjustment]) -> TimingSummary:
    """Aggregate a list of TimingAdjustments into a TimingSummary."""
    total = len(adjustments)
    if total == 0:
        return cls(
            total_segments=0,
            clean_count=0,
            stretched_count=0,
            truncated_count=0,
            mean_speed_factor=1.0,
            max_truncation_seconds=0.0,
        )

    clean = 0
    stretched = 0
    truncated = 0
    speed_sum = 0.0
    max_truncation = 0.0
    for adj in adjustments:
        speed_sum += adj.speed_factor
        if adj.was_truncated:
            truncated += 1
            truncation = adj.original_duration - adj.actual_duration
            if truncation > max_truncation:
                max_truncation = truncation
        elif abs(adj.speed_factor - 1.0) <= CLEAN_SPEED_TOLERANCE:
            clean += 1
        else:
            stretched += 1

    return cls(
        total_segments=total,
        clean_count=clean,
        stretched_count=stretched,
        truncated_count=truncated,
        mean_speed_factor=speed_sum / total,
        max_truncation_seconds=max_truncation,
    )
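
The raw counts are easier to compare across runs once they are turned into rates. The following sketch derives them from the summary fields; `timing_rates` is a hypothetical helper, and the `SimpleNamespace` merely imitates a `TimingSummary` with made-up numbers.

```python
from types import SimpleNamespace
from typing import Dict


def timing_rates(summary) -> Dict[str, float]:
    """Share of segments that were clean, stretched, or truncated."""
    total = summary.total_segments
    if total == 0:
        # Mirrors the empty case of from_adjustments: nothing to report.
        return {"clean": 0.0, "stretched": 0.0, "truncated": 0.0}
    return {
        "clean": summary.clean_count / total,
        "stretched": summary.stretched_count / total,
        "truncated": summary.truncated_count / total,
    }


summary = SimpleNamespace(
    total_segments=8,
    clean_count=4,
    stretched_count=2,
    truncated_count=2,
    mean_speed_factor=1.1,
    max_truncation_seconds=0.4,
)
rates = timing_rates(summary)
print(rates)
```

Each segment lands in exactly one of the three buckets, so the rates sum to 1.0 for any non-empty summary.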

TranscriptQuality

Heuristic quality assessment over a Whisper transcription. Surfaced on every DubbingResult; drives the optional strict_quality reject path.

TranscriptQuality

Bases: BaseModel

Quality assessment of a Whisper transcription.

Attributes:

Name	Type	Description
`recommendation`	`Recommendation`	`"ok"` (continue), `"warn"` (continue, log), or `"reject"` (caller should refuse to dub if strict_quality).
`dominant_phrase`	`str \| None`	The repeating phrase that triggered the dominance flag, or None when the flag didn't fire.
`dominant_phrase_fraction`	`float`	Character-count share of the most common normalized segment phrase. 0.0 when no segments.
`median_avg_logprob`	`float \| None`	Median of `avg_logprob` across segments that carry it; None when no segment had a logprob (e.g. SRT-loaded).
`speech_fraction`	`float`	Sum of segment durations divided by the audio's wall-clock duration.
`flags`	`list[str]`	Human-readable list of which checks fired.

Source code in src/videopython/ai/dubbing/quality.py

class TranscriptQuality(BaseModel):
    """Quality assessment of a Whisper transcription.

    Attributes:
        recommendation: ``"ok"`` (continue), ``"warn"`` (continue, log), or
            ``"reject"`` (caller should refuse to dub if strict_quality).
        dominant_phrase: The repeating phrase that triggered the dominance
            flag, or None when the flag didn't fire.
        dominant_phrase_fraction: Character-count share of the most common
            normalized segment phrase. 0.0 when no segments.
        median_avg_logprob: Median of ``avg_logprob`` across segments that
            carry it; None when no segment had a logprob (e.g. SRT-loaded).
        speech_fraction: Sum of segment durations divided by the audio's
            wall-clock duration.
        flags: Human-readable list of which checks fired.
    """

    recommendation: Recommendation
    dominant_phrase: str | None
    dominant_phrase_fraction: float
    median_avg_logprob: float | None
    speech_fraction: float
    flags: list[str] = Field(default_factory=list)
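
Two of these fields follow directly from their descriptions and can be sketched with the standard library. The functions below are illustrative re-derivations, not the library's implementation; in particular, returning 0.0 for a non-positive audio duration is an assumption.

```python
from statistics import median
from typing import Optional, Sequence, Tuple


def speech_fraction(
    segments: Sequence[Tuple[float, float]], audio_duration: float
) -> float:
    """Sum of (end - start) over segments, divided by the audio duration."""
    if audio_duration <= 0:
        # Assumption: treat empty or invalid audio as containing no speech.
        return 0.0
    return sum(end - start for start, end in segments) / audio_duration


def median_avg_logprob(logprobs: Sequence[Optional[float]]) -> Optional[float]:
    """Median over segments that carry a logprob; None when none do."""
    present = [value for value in logprobs if value is not None]
    return median(present) if present else None


fraction = speech_fraction([(0.0, 2.0), (3.0, 5.0)], 10.0)
typical = median_avg_logprob([-0.2, None, -0.6, -0.4])
missing = median_avg_logprob([None, None])
print(fraction, typical, missing)
```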

GarbageTranscriptError

Raised by the pipeline when strict_quality=True and the transcript-quality heuristic returns recommendation="reject". Carries the triggering TranscriptQuality as error.quality for caller introspection.

GarbageTranscriptError

Bases: AiError, RuntimeError

Raised by the dubbing pipeline when strict_quality=True and the transcript heuristic returns recommendation="reject".

The triggering :class:TranscriptQuality is attached as quality so callers can introspect the flags without re-running the pipeline.

Source code in src/videopython/ai/dubbing/quality.py

class GarbageTranscriptError(AiError, RuntimeError):
    """Raised by the dubbing pipeline when ``strict_quality=True`` and the
    transcript heuristic returns ``recommendation="reject"``.

    The triggering :class:`TranscriptQuality` is attached as ``quality`` so
    callers can introspect the flags without re-running the pipeline.
    """

    def __init__(self, message: str, quality: TranscriptQuality):
        super().__init__(message)
        self.quality = quality
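
A caller running with `strict_quality=True` would typically catch this error and inspect `error.quality` rather than letting the job crash. The sketch below shows that pattern with a stand-in exception class so it runs without the library; `SketchGarbageTranscriptError`, `dub_or_report`, and the flag string are all hypothetical.

```python
from types import SimpleNamespace


class SketchGarbageTranscriptError(RuntimeError):
    """Stand-in mirroring the constructor shown above."""

    def __init__(self, message: str, quality):
        super().__init__(message)
        self.quality = quality


def dub_or_report(run_dub):
    """Run a dubbing callable; on rejection, return the quality flags instead."""
    try:
        return {"status": "ok", "result": run_dub()}
    except SketchGarbageTranscriptError as error:
        return {"status": "rejected", "flags": list(error.quality.flags)}


def failing_dub():
    # Simulates the pipeline rejecting a transcript before Demucs/TTS run.
    quality = SimpleNamespace(recommendation="reject", flags=["dominant_phrase"])
    raise SketchGarbageTranscriptError("transcript rejected", quality)


outcome = dub_or_report(failing_dub)
print(outcome)
```

With the real library, the `except` clause would name `GarbageTranscriptError` and the callable would wrap a `dub` or `dub_file` call.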

Supported Languages

Get the mapping of supported language codes to language names:

languages = VideoDubber.get_supported_languages()
# {'en': 'English', 'es': 'Spanish', 'fr': 'French', ...}

Supported languages include: English, Spanish, French, German, Italian, Portuguese, Polish, Hindi, Arabic, Czech, Danish, Dutch, Finnish, Greek, Hebrew, Indonesian, Japanese, Korean, Malay, Norwegian, Romanian, Russian, Slovak, Swedish, Tamil, Thai, Turkish, Ukrainian, Vietnamese, Chinese.
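
Checking a requested code against this mapping before starting a long dubbing run avoids failing late. The following is a minimal sketch assuming the mapping has the `{code: name}` shape shown above; `resolve_language` is a hypothetical helper and the three-entry dict is a shortened stand-in for the real return value.

```python
from typing import Dict


def resolve_language(code: str, supported: Dict[str, str]) -> str:
    """Return the language name for a code, or raise if it is unsupported."""
    normalized = code.strip().lower()
    if normalized not in supported:
        known = ", ".join(sorted(supported))
        raise ValueError(f"Unsupported language {code!r}; expected one of: {known}")
    return supported[normalized]


# Shortened stand-in for VideoDubber.get_supported_languages().
supported = {"en": "English", "es": "Spanish", "fr": "French"}
name = resolve_language(" ES ", supported)

try:
    resolve_language("xx", supported)
    rejected = False
except ValueError:
    rejected = True

print(name, rejected)
```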
