AI Understanding

Analyze videos, transcribe audio, describe visual content, and track faces per shot.

For a single aggregate, serializable analysis object across multiple analyzers, see Video Analysis.

Local Model Support

Class	Local Model Family
SceneVLM	Ollama vision model
AudioToText	Whisper
AudioClassifier	AST
SemanticSceneDetector	TransNetV2
FaceShotTracker / FaceSmoothingTracker	OpenCV YuNet
ObjectDetector	D-FINE (COCO)

AudioToText

Anti-hallucination knobs

Three Whisper decoder kwargs are surfaced for tuning on noisy or sparse-speech audio:

from videopython.ai import AudioToText

# Defaults: condition_on_previous_text=False (the cascading-hallucination fix),
# no_speech_threshold=0.6, logprob_threshold=-1.0.
transcriber = AudioToText()

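# Lower no-speech threshold so more windows count as silence on a film with
# heavy ambient music (Whisper skips a window when its no_speech_prob exceeds
# this value and its avg_logprob is at or below logprob_threshold).
transcriber = AudioToText(no_speech_threshold=0.4)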

# Restore Whisper's upstream default conditioning (e.g. for clean podcasts
# where cross-window context helps disambiguate homophones).
transcriber = AudioToText(condition_on_previous_text=True)

Brand-name vocabulary biasing

Bias Whisper's first-window decoder toward a caller-supplied list of brand names, product names, or proper nouns via the native initial_prompt channel. This recovers near-mishears (e.g. Klarna → "carna", InPost → "in post") on brand-monitoring inputs without adding any model dependencies.

from videopython.ai import AudioToText
from videopython.base import Video

video = Video.from_path("video.mp4")

# Constructor default: applies to every transcribe() call on this instance.
transcriber = AudioToText(vocabulary=["Klarna", "Allegro", "InPost"])
result = transcriber.transcribe(video)

# Per-call override: useful when one transcriber serves multiple tenants.
result = transcriber.transcribe(video, vocabulary=["Pyszne", "Wolt"])

The list is normalized at construction (whitespace stripped, case-insensitive dedup, casing of the first occurrence preserved). Whisper reserves ~224 tokens for the prompt; longer lists are trimmed from the tail with a single WARNING log line naming the count dropped.
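The normalization and trimming rules above can be sketched in plain Python. This is an illustrative stand-in, not the library's own helper: the function names, the max_tokens parameter, the choice to drop empty entries, and the crude characters-to-tokens estimate are all assumptions (the real budget depends on Whisper's tokenizer).

```python
import logging

logger = logging.getLogger(__name__)


def normalize_vocabulary(vocabulary):
    """Strip whitespace and dedupe case-insensitively, keeping first-seen casing."""
    seen = set()
    result = []
    for term in vocabulary or []:
        cleaned = term.strip()
        key = cleaned.casefold()
        # Assumption: entries that are empty after stripping are dropped.
        if not cleaned or key in seen:
            continue
        seen.add(key)
        result.append(cleaned)
    return result


def estimate_tokens(term):
    # Assumption: about one token per four characters, plus one for a separator.
    return max(1, len(term) // 4) + 1


def trim_to_budget(terms, max_tokens=224):
    """Keep terms from the head of the list until the estimated budget is spent."""
    kept, used = [], 0
    for term in terms:
        cost = estimate_tokens(term)
        if used + cost > max_tokens:
            break
        kept.append(term)
        used += cost
    dropped = len(terms) - len(kept)
    if dropped:
        # One warning naming how many trailing terms were dropped.
        logger.warning("vocabulary trimmed: dropped %d trailing term(s)", dropped)
    return kept
```

Trimming from the tail means callers who care about specific names should list them first.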

VideoDubber and LocalDubbingPipeline accept the same vocabulary kwarg; it threads through to the underlying transcriber. Within VideoAnalyzer, pass it via analyzer_params:

from videopython.ai import VideoAnalyzer
from videopython.ai.video_analysis import VideoAnalysisConfig

config = VideoAnalysisConfig(
    analyzer_params={"audio_to_text": {"vocabulary": ["Klarna", "Allegro"]}}
)
analysis = VideoAnalyzer(config=config).analyze_path("brand_review.mp4")

Vocabulary biasing recovers names that Whisper almost heard correctly. It will not catch zero-prior names; an LLM correction pass would close that gap.

Per-segment confidence

TranscriptionSegment carries three optional confidence fields populated from the raw Whisper output: avg_logprob, no_speech_prob, and compression_ratio. They are None when not available (e.g. on the diarization-only path that builds segments from words without overlap match, or on transcripts loaded from formats that don't carry the metadata).

These signals feed the dubbing pipeline's transcript-quality gate (median avg_logprob is one of three reject flags) and the translator's confidence-aware translation prompt (segments below threshold get a low_confidence hint). They are also useful for downstream callers that want to drop low-quality segments before further processing.

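from videopython.ai import AudioToText
from videopython.base import Video

video = Video.from_path("video.mp4")
result = AudioToText().transcribe(video)
for segment in result.segments:
    if segment.avg_logprob is not None and segment.avg_logprob < -1.0:
        print(f"low confidence: {segment.text!r}")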
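A median-based quality check in the spirit of the transcript-quality gate might look like the sketch below. The Segment dataclass is a hypothetical stand-in for TranscriptionSegment, and the -1.0 threshold is a placeholder; the dubbing pipeline's actual thresholds are not documented in this section.

```python
from dataclasses import dataclass
from statistics import median
from typing import Optional


@dataclass
class Segment:
    # Hypothetical stand-in for TranscriptionSegment with only the fields used here.
    text: str
    avg_logprob: Optional[float] = None


def median_logprob(segments):
    """Median avg_logprob over segments that carry the field, or None if none do."""
    values = [s.avg_logprob for s in segments if s.avg_logprob is not None]
    return median(values) if values else None


def keep_confident(segments, threshold=-1.0):
    """Drop segments scored below the threshold; keep segments with no score."""
    return [s for s in segments if s.avg_logprob is None or s.avg_logprob >= threshold]
```

Keeping unscored segments is a design choice for this sketch: a None value means the metadata is missing, not that the segment is poor.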

AudioToText

Bases: ManagedPredictor

Transcription service for audio and video using local Whisper models.

Uses openai-whisper for transcription (with word-level timestamps) and pyannote-audio for optional speaker diarization. By default, Silero VAD runs before Whisper to gate language detection on a 30s window built from voiced regions only — fixes Whisper's tendency to lock onto the wrong language when the file opens with silence, music, or non-vocal credits. Disable with enable_vad=False to reproduce pre-0.27 behaviour.
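The voiced-window construction can be sketched without any model dependencies. The function below is a simplified illustration of the idea (voiced spans concatenated until 30 s of samples are collected), operating on a plain list instead of a NumPy array; the function name and signature are hypothetical.

```python
SAMPLE_RATE = 16_000          # matches whisper.audio.SAMPLE_RATE
N_SAMPLES = 30 * SAMPLE_RATE  # matches whisper.audio.N_SAMPLES (30 s window)


def voiced_window(samples, voiced_spans, sample_rate=SAMPLE_RATE, budget=N_SAMPLES):
    """Concatenate voiced spans (given in seconds) until the sample budget is filled."""
    window = []
    remaining = budget
    for start, end in voiced_spans:
        if remaining <= 0:
            break
        # Slice the span, then truncate so the window never exceeds the budget.
        chunk = samples[int(start * sample_rate):int(end * sample_rate)][:remaining]
        window.extend(chunk)
        remaining -= len(chunk)
    return window
```

The resulting window is what language detection would see in place of the file's first 30 seconds.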

Three Whisper decoder kwargs are surfaced for anti-hallucination tuning:

condition_on_previous_text defaults to False (Whisper's own default is True). With conditioning on, a single hallucinated filler phrase cascades through the rest of the file because each window's decoder is primed by the previous window's decoded text. Turning it off is the most commonly recommended fix for that failure mode; the cost on clean audio is small (slightly less context for ambiguous homophones across sentence boundaries).
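no_speech_threshold and logprob_threshold are forwarded with Whisper's documented defaults (0.6 and -1.0). Whisper treats a window as silence when its no-speech probability exceeds no_speech_threshold and its average log-probability is at or below logprob_threshold, so lowering no_speech_threshold biases toward dropping low-confidence windows instead of emitting filler.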
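How the two thresholds combine can be sketched as a small predicate. It follows the skip rule in openai-whisper's transcribe loop as found in recent releases; the function name is hypothetical and upstream behaviour may change.

```python
def should_skip_window(no_speech_prob, avg_logprob,
                       no_speech_threshold=0.6, logprob_threshold=-1.0):
    """Return True when a decoded window would be treated as silence."""
    if no_speech_threshold is None:
        return False
    skip = no_speech_prob > no_speech_threshold
    if logprob_threshold is not None and avg_logprob > logprob_threshold:
        # Confidently decoded text overrides the no-speech signal.
        skip = False
    return skip
```

Under this rule a window is skipped only when both signals agree: a window with no_speech_prob 0.7 and avg_logprob -1.5 is skipped at the default threshold of 0.6 but kept at 0.85.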

vocabulary biases Whisper's first-window decoder toward a caller-supplied list of brand names, product names, or proper nouns via the native initial_prompt channel. Recovers near-mishears (e.g. Klarna → "carna") without new model deps; will not catch zero-prior names. Per-call override is available on transcribe.

Source code in src/videopython/ai/understanding/audio.py

class AudioToText(ManagedPredictor):
    """Transcription service for audio and video using local Whisper models.

    Uses openai-whisper for transcription (with word-level timestamps) and
    pyannote-audio for optional speaker diarization. By default, Silero VAD
    runs before Whisper to gate language detection on a 30s window built from
    voiced regions only — fixes Whisper's tendency to lock onto the wrong
    language when the file opens with silence, music, or non-vocal credits.
    Disable with ``enable_vad=False`` to reproduce pre-0.27 behaviour.

    Three Whisper decoder kwargs are surfaced for anti-hallucination tuning:

    - ``condition_on_previous_text`` defaults to ``False`` (Whisper's own
      default is ``True``). With conditioning on, a single hallucinated filler
      phrase cascades through the rest of the file because each window's
      decoder is primed by the previous window's decoded text. Turning it off
      is the most commonly recommended fix for that failure mode; the cost on
      clean audio is small (slightly less context for ambiguous homophones
      across sentence boundaries).
    - ``no_speech_threshold`` and ``logprob_threshold`` are forwarded with
      Whisper's documented defaults (``0.6`` and ``-1.0``); lowering
      ``no_speech_threshold`` biases toward dropping low-confidence windows
      instead of emitting filler.

    ``vocabulary`` biases Whisper's first-window decoder toward a caller-
    supplied list of brand names, product names, or proper nouns via the
    native ``initial_prompt`` channel. Recovers near-mishears (e.g. Klarna
    → "carna") without new model deps; will not catch zero-prior names.
    Per-call override is available on :meth:`transcribe`.
    """

    PYANNOTE_DIARIZATION_MODEL = "pyannote/speaker-diarization-community-1"
    _model_attrs = ("_model", "_diarization_pipeline", "_vad_model")

    def __init__(
        self,
        model_name: Literal["tiny", "base", "small", "medium", "large", "turbo"] = "turbo",
        enable_diarization: bool = False,
        enable_vad: bool = True,
        condition_on_previous_text: bool = False,
        no_speech_threshold: float = 0.6,
        logprob_threshold: float | None = -1.0,
        vocabulary: list[str] | None = None,
        device: str | None = None,
    ):
        self.model_name = model_name
        self.enable_diarization = enable_diarization
        self.enable_vad = enable_vad
        self.condition_on_previous_text = condition_on_previous_text
        self.no_speech_threshold = no_speech_threshold
        self.logprob_threshold = logprob_threshold
        self.vocabulary = _normalize_vocabulary(vocabulary)
        self.device = select_device(device, mps_allowed=False)
        log_device_initialization(
            "AudioToText",
            requested_device=device,
            resolved_device=self.device,
        )
        self._model: Any = None
        self._diarization_pipeline: Any = None
        self._vad_model: Any = None

    def _transcribe_kwargs(self, language: str | None, vocabulary: list[str]) -> dict[str, Any]:
        """Kwargs threaded into ``whisper.Whisper.transcribe`` from both call sites.
        ``initial_prompt`` is omitted entirely on the no-vocab path."""
        kwargs: dict[str, Any] = {
            "word_timestamps": True,
            "language": language,
            "condition_on_previous_text": self.condition_on_previous_text,
            "no_speech_threshold": self.no_speech_threshold,
            "logprob_threshold": self.logprob_threshold,
        }
        prompt = _build_initial_prompt(vocabulary)
        if prompt is not None:
            kwargs["initial_prompt"] = prompt
        return kwargs

    def _init_local(self) -> None:
        """Initialize local Whisper model."""
        from videopython.ai._optional import require

        whisper = require("whisper", feature="AudioToText")

        # No revision pin: openai-whisper downloads weights by name from OpenAI's
        # own CDN, not via a HF from_pretrained repo, so there is no HF commit
        # SHA to pin (see videopython.ai._revisions module docstring).
        self._model = whisper.load_model(name=self.model_name, device=self.device)

    def _init_diarization(self) -> None:
        """Initialize pyannote speaker diarization pipeline."""
        import torch

        from videopython.ai._optional import require

        Pipeline = require("pyannote.audio", feature="AudioToText diarization").Pipeline

        self._diarization_pipeline = Pipeline.from_pretrained(
            self.PYANNOTE_DIARIZATION_MODEL, revision=pinned(self.PYANNOTE_DIARIZATION_MODEL)
        )
        self._diarization_pipeline.to(torch.device(self.device))

    def _init_vad(self) -> None:
        """Initialize Silero VAD model.

        The model is ~2 MB and CPU-fast (~5-15s for a 90 min movie); we keep
        it on CPU regardless of ``self.device`` since dispatch overhead would
        outweigh inference cost.
        """
        from videopython.ai._optional import require

        load_silero_vad = require("silero_vad", feature="AudioToText VAD").load_silero_vad

        self._vad_model = load_silero_vad()

    def _process_transcription_result(self, transcription_result: dict[str, Any]) -> Transcription:
        """Process raw transcription result into a Transcription object."""
        transcription_segments = []
        for segment in transcription_result["segments"]:
            transcription_words = [
                TranscriptionWord(word=word["word"], start=float(word["start"]), end=float(word["end"]))
                for word in segment.get("words", [])
            ]
            transcription_segment = TranscriptionSegment(
                start=segment["start"],
                end=segment["end"],
                text=segment["text"],
                words=transcription_words,
                avg_logprob=segment.get("avg_logprob"),
                no_speech_prob=segment.get("no_speech_prob"),
                compression_ratio=segment.get("compression_ratio"),
            )
            transcription_segments.append(transcription_segment)

        return Transcription(segments=transcription_segments, language=transcription_result.get("language"))

    @staticmethod
    def _assign_speakers_to_words(
        words: list[TranscriptionWord],
        diarization_result: Any,
    ) -> list[TranscriptionWord]:
        """Assign speaker labels to words based on diarization segment overlap.

        For each word, finds the diarization segment with the greatest time overlap
        and assigns that speaker. Words with no overlapping diarization segment get
        the nearest speaker by midpoint distance.
        """
        speaker_segments: list[tuple[float, float, str]] = []
        # pyannote-audio 4.x returns DiarizeOutput; use exclusive_speaker_diarization
        # (no overlapping turns) for cleaner word assignment.
        annotation = getattr(diarization_result, "exclusive_speaker_diarization", diarization_result)
        for turn, _, speaker in annotation.itertracks(yield_label=True):
            speaker_segments.append((turn.start, turn.end, speaker))

        if not speaker_segments:
            return words

        result = []
        for word in words:
            best_speaker: str | None = None
            best_overlap = 0.0

            for seg_start, seg_end, speaker in speaker_segments:
                overlap = max(0.0, min(word.end, seg_end) - max(word.start, seg_start))
                if overlap > best_overlap:
                    best_overlap = overlap
                    best_speaker = speaker

            if best_speaker is None:
                word_mid = (word.start + word.end) / 2.0
                best_dist = float("inf")
                for seg_start, seg_end, speaker in speaker_segments:
                    seg_mid = (seg_start + seg_end) / 2.0
                    dist = abs(word_mid - seg_mid)
                    if dist < best_dist:
                        best_dist = dist
                        best_speaker = speaker

            result.append(
                TranscriptionWord(
                    word=word.word,
                    start=word.start,
                    end=word.end,
                    speaker=best_speaker,
                )
            )
        return result

    def diarize_transcription(self, audio: Audio, transcription: Transcription) -> Transcription:
        """Attach speaker labels to a pre-computed transcription using pyannote.

        Useful when callers have a transcription (e.g. pre-computed and edited)
        but no speakers, and want per-speaker voice cloning in dubbing without
        re-running Whisper. Runs pyannote standalone on ``audio`` and overlays
        speakers onto the supplied transcription's words.

        Requires word-level timings: at least one segment must contain more
        than one word. Transcriptions loaded from SRT (one synthetic word per
        segment) will not produce useful speakers and are rejected.
        """
        import numpy as np
        import torch

        all_words: list[TranscriptionWord] = list(transcription.words)
        if not all_words:
            raise ValueError("Cannot diarize a transcription with no words.")

        if not any(len(seg.words) > 1 for seg in transcription.segments):
            raise ValueError(
                "Cannot diarize a transcription without word-level timings. "
                "Supplied transcription has at most one word per segment "
                "(e.g. loaded from SRT). Provide a transcription with "
                "word-level timings, or omit `transcription` to let the "
                "pipeline transcribe and diarize from scratch."
            )

        if self._diarization_pipeline is None:
            self._init_diarization()

        import whisper

        audio_mono = audio.to_mono().resample(whisper.audio.SAMPLE_RATE)
        waveform = torch.from_numpy(audio_mono.data.astype(np.float32)).unsqueeze(0)
        diarization_result = self._diarization_pipeline(
            {"waveform": waveform, "sample_rate": audio_mono.metadata.sample_rate}
        )

        all_words = self._assign_speakers_to_words(all_words, diarization_result)
        return Transcription(words=all_words, language=transcription.language)

    def _run_vad(self, audio_mono: Audio) -> list[tuple[float, float]]:
        """Return voiced spans in seconds using Silero VAD.

        Audio must already be mono at ``whisper.audio.SAMPLE_RATE`` (16 kHz),
        which is one of Silero's two supported rates.
        """
        import numpy as np
        import torch

        if self._vad_model is None:
            self._init_vad()

        from silero_vad import get_speech_timestamps

        waveform = torch.from_numpy(audio_mono.data.astype(np.float32))
        timestamps = get_speech_timestamps(
            waveform,
            self._vad_model,
            sampling_rate=audio_mono.metadata.sample_rate,
            return_seconds=True,
        )
        return [(float(ts["start"]), float(ts["end"])) for ts in timestamps]

    def _detect_language(self, audio_mono: Audio, voiced_spans: list[tuple[float, float]]) -> str:
        """Run Whisper language detection on a 30s window of voiced audio.

        Whisper's auto-detection only inspects the first 30s of input. When
        the file opens with silence/music/credits, that window contains no
        speech and detection picks the closest-looking thing (typically
        English). Concatenating voiced spans up to 30s and running
        ``model.detect_language()`` on the resulting mel fixes this.
        """
        import numpy as np
        import torch
        import whisper

        sample_rate = audio_mono.metadata.sample_rate
        chunks: list[np.ndarray] = []
        remaining = whisper.audio.N_SAMPLES
        for start, end in voiced_spans:
            if remaining <= 0:
                break
            chunk = audio_mono.data[int(start * sample_rate) : int(end * sample_rate)][:remaining]
            chunks.append(chunk)
            remaining -= len(chunk)

        voiced_audio = np.concatenate(chunks).astype(np.float32) if chunks else np.zeros(0, dtype=np.float32)
        padded = whisper.audio.pad_or_trim(torch.from_numpy(voiced_audio))
        mel = whisper.audio.log_mel_spectrogram(padded, n_mels=self._model.dims.n_mels).to(self._model.device)

        _, probs = self._model.detect_language(mel)
        return max(probs, key=probs.get)

    def _transcribe_with_diarization(
        self, audio_mono: Audio, language: str | None, vocabulary: list[str]
    ) -> Transcription:
        """Transcribe with word timestamps and assign speakers via pyannote."""
        import numpy as np
        import torch

        if self._diarization_pipeline is None:
            self._init_diarization()

        audio_data = audio_mono.data
        transcription_result = self._model.transcribe(audio=audio_data, **self._transcribe_kwargs(language, vocabulary))

        waveform = torch.from_numpy(audio_data.astype(np.float32)).unsqueeze(0)
        diarization_result = self._diarization_pipeline(
            {"waveform": waveform, "sample_rate": audio_mono.metadata.sample_rate}
        )

        transcription = self._process_transcription_result(transcription_result)

        # Capture original Whisper segments before flattening to words. The
        # diarization rebuild via Transcription(words=...) regroups by speaker,
        # which loses the per-segment confidence M1.3 plumbed through. We
        # re-attach by max-overlap match below so M2's confidence-aware
        # translation prompts have signal on the diarized path too.
        whisper_segments = transcription.segments

        all_words: list[TranscriptionWord] = []
        for seg in transcription.segments:
            all_words.extend(seg.words)

        if all_words:
            all_words = self._assign_speakers_to_words(all_words, diarization_result)

        rebuilt = Transcription(words=all_words, language=transcription.language)
        _attach_confidence_by_overlap(rebuilt.segments, whisper_segments)
        return rebuilt

    def _transcribe_local(self, audio: Audio, vocabulary: list[str]) -> Transcription:
        """Transcribe using local Whisper model.

        When ``enable_vad`` is True (default), Silero VAD locates voiced
        regions and a 30s voiced window is used for Whisper language
        detection -- avoiding the well-known failure where Whisper locks
        onto the wrong language because the first 30s of input is silence
        or music. The detected language is then passed into
        ``transcribe()`` so chunked decoding stays consistent. If VAD
        finds no speech, an empty Transcription is returned without
        invoking Whisper.
        """
        import whisper

        if self._model is None:
            self._init_local()

        audio_mono = audio.to_mono().resample(whisper.audio.SAMPLE_RATE)

        language: str | None = None
        if self.enable_vad:
            voiced_spans = self._run_vad(audio_mono)
            if not voiced_spans:
                return Transcription(segments=[])
            language = self._detect_language(audio_mono, voiced_spans)

        if self.enable_diarization:
            return self._transcribe_with_diarization(audio_mono, language, vocabulary)

        transcription_result = self._model.transcribe(
            audio=audio_mono.data, **self._transcribe_kwargs(language, vocabulary)
        )
        return self._process_transcription_result(transcription_result)

    def transcribe(self, media: Audio | Video, vocabulary: list[str] | None = None) -> Transcription:
        """Transcribe audio or video to text.

        ``vocabulary`` overrides the constructor default for this call only;
        a per-call list wins over the instance's vocabulary so one
        :class:`AudioToText` instance can serve multiple tenants. Pass
        ``None`` (the default) to use the constructor's list.
        """
        if isinstance(media, Video):
            if media.audio.is_silent:
                return Transcription(segments=[])
            audio = media.audio
        elif isinstance(media, Audio):
            if media.is_silent:
                return Transcription(segments=[])
            audio = media
        else:
            raise TypeError(f"Unsupported media type: {type(media)}. Expected Audio or Video.")

        effective_vocab = self.vocabulary if vocabulary is None else _normalize_vocabulary(vocabulary)
        return self._transcribe_local(audio, effective_vocab)

diarize_transcription

diarize_transcription(
    audio: Audio, transcription: Transcription
) -> Transcription

Attach speaker labels to a pre-computed transcription using pyannote.

Useful when callers have a transcription (e.g. pre-computed and edited) but no speakers, and want per-speaker voice cloning in dubbing without re-running Whisper. Runs pyannote standalone on audio and overlays speakers onto the supplied transcription's words.

Requires word-level timings: at least one segment must contain more than one word. Transcriptions loaded from SRT (one synthetic word per segment) will not produce useful speakers and are rejected.
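The overlay step follows the max-overlap rule used by _assign_speakers_to_words in the class source. A standalone sketch over plain tuples, with a hypothetical function name and made-up speaker turns, shows both the overlap match and the nearest-midpoint fallback:

```python
def assign_speaker(word_start, word_end, turns):
    """Pick the speaker whose turn overlaps the word most; else the nearest turn midpoint."""
    best_speaker, best_overlap = None, 0.0
    for turn_start, turn_end, speaker in turns:
        overlap = max(0.0, min(word_end, turn_end) - max(word_start, turn_start))
        if overlap > best_overlap:
            best_speaker, best_overlap = speaker, overlap
    if best_speaker is None and turns:
        # No overlapping turn: fall back to the turn with the closest midpoint.
        word_mid = (word_start + word_end) / 2.0
        best_speaker = min(turns, key=lambda t: abs(word_mid - (t[0] + t[1]) / 2.0))[2]
    return best_speaker


# Illustrative diarization turns as (start, end, speaker) tuples.
turns = [(0.0, 2.0, "SPEAKER_00"), (2.0, 5.0, "SPEAKER_01")]
```

A word spanning 1.8 s to 2.1 s goes to SPEAKER_00 because more of it falls inside the first turn, while a word at 6.0 s to 6.4 s overlaps nothing and goes to SPEAKER_01 by midpoint distance.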

Source code in src/videopython/ai/understanding/audio.py

def diarize_transcription(self, audio: Audio, transcription: Transcription) -> Transcription:
    """Attach speaker labels to a pre-computed transcription using pyannote.

    Useful when callers have a transcription (e.g. pre-computed and edited)
    but no speakers, and want per-speaker voice cloning in dubbing without
    re-running Whisper. Runs pyannote standalone on ``audio`` and overlays
    speakers onto the supplied transcription's words.

    Requires word-level timings: at least one segment must contain more
    than one word. Transcriptions loaded from SRT (one synthetic word per
    segment) will not produce useful speakers and are rejected.
    """
    import numpy as np
    import torch

    all_words: list[TranscriptionWord] = list(transcription.words)
    if not all_words:
        raise ValueError("Cannot diarize a transcription with no words.")

    if not any(len(seg.words) > 1 for seg in transcription.segments):
        raise ValueError(
            "Cannot diarize a transcription without word-level timings. "
            "Supplied transcription has at most one word per segment "
            "(e.g. loaded from SRT). Provide a transcription with "
            "word-level timings, or omit `transcription` to let the "
            "pipeline transcribe and diarize from scratch."
        )

    if self._diarization_pipeline is None:
        self._init_diarization()

    import whisper

    audio_mono = audio.to_mono().resample(whisper.audio.SAMPLE_RATE)
    waveform = torch.from_numpy(audio_mono.data.astype(np.float32)).unsqueeze(0)
    diarization_result = self._diarization_pipeline(
        {"waveform": waveform, "sample_rate": audio_mono.metadata.sample_rate}
    )

    all_words = self._assign_speakers_to_words(all_words, diarization_result)
    return Transcription(words=all_words, language=transcription.language)

transcribe

transcribe(
    media: Audio | Video,
    vocabulary: list[str] | None = None,
) -> Transcription

Transcribe audio or video to text.

vocabulary overrides the constructor default for this call only; a per-call list wins over the instance's vocabulary so one AudioToText instance can serve multiple tenants. Pass None (the default) to use the constructor's list.
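The override rule and the "no prompt when there is no vocabulary" behaviour can be sketched as follows. Both function names are hypothetical, the stripping step is a simplified stand-in for the library's normalization, and the comma-separated prompt format is an assumption, since the library's prompt builder is not shown here.

```python
def resolve_vocabulary(instance_vocab, call_vocab=None):
    """Per-call list wins; None falls back to the instance default."""
    if call_vocab is None:
        return list(instance_vocab)
    # Simplified normalization: strip whitespace and drop empty entries.
    return [term.strip() for term in call_vocab if term.strip()]


def decoder_kwargs(vocabulary):
    """Build decoder kwargs, leaving out initial_prompt when there is nothing to bias."""
    kwargs = {"word_timestamps": True}
    if vocabulary:
        # Assumption: a comma-separated prompt; the real format may differ.
        kwargs["initial_prompt"] = ", ".join(vocabulary)
    return kwargs
```

Because only None triggers the fallback, passing an empty list in this sketch switches biasing off for that one call instead of reusing the constructor's list.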

Source code in src/videopython/ai/understanding/audio.py

def transcribe(self, media: Audio | Video, vocabulary: list[str] | None = None) -> Transcription:
    """Transcribe audio or video to text.

    ``vocabulary`` overrides the constructor default for this call only;
    a per-call list wins over the instance's vocabulary so one
    :class:`AudioToText` instance can serve multiple tenants. Pass
    ``None`` (the default) to use the constructor's list.
    """
    if isinstance(media, Video):
        if media.audio.is_silent:
            return Transcription(segments=[])
        audio = media.audio
    elif isinstance(media, Audio):
        if media.is_silent:
            return Transcription(segments=[])
        audio = media
    else:
        raise TypeError(f"Unsupported media type: {type(media)}. Expected Audio or Video.")

    effective_vocab = self.vocabulary if vocabulary is None else _normalize_vocabulary(vocabulary)
    return self._transcribe_local(audio, effective_vocab)

AudioClassifier

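Detect and classify sounds, music, and audio events with timestamps using the Audio Spectrogram Transformer (AST). The AST authors report up to 0.485 mAP on AudioSet; the default checkpoint used here, MIT/ast-finetuned-audioset-10-10-0.4593, carries its own score (0.4593 mAP) in its name.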

Basic Usage

from videopython.ai import AudioClassifier
from videopython.base import Video

classifier = AudioClassifier(confidence_threshold=0.3)
video = Video.from_path("video.mp4")

result = classifier.classify(video)

# Clip-level predictions (overall audio content)
for label, confidence in result.clip_predictions.items():
    print(f"{label}: {confidence:.2f}")

# Timestamped events
for event in result.events:
    print(f"{event.start:.1f}s - {event.end:.1f}s: {event.label} ({event.confidence:.2f})")
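The timestamped events are the result of merging per-window detections of the same label. A minimal sketch of that merge, modelled on _merge_events in the class source, is shown below; the Event dataclass is a hypothetical stand-in for AudioEvent.

```python
from dataclasses import dataclass


@dataclass
class Event:
    # Hypothetical stand-in for AudioEvent.
    start: float
    end: float
    label: str
    confidence: float


def merge_events(events, gap_threshold=0.5):
    """Merge same-label events whose gap is at most gap_threshold seconds."""
    by_label = {}
    for event in events:
        by_label.setdefault(event.label, []).append(event)

    merged = []
    for label, group in by_label.items():
        group = sorted(group, key=lambda e: e.start)
        current = group[0]
        for nxt in group[1:]:
            if nxt.start - current.end <= gap_threshold:
                # max() on the end time guards against a shorter later event.
                current = Event(current.start, max(current.end, nxt.end), label,
                                max(current.confidence, nxt.confidence))
            else:
                merged.append(current)
                current = nxt
        merged.append(current)
    return sorted(merged, key=lambda e: e.start)
```

Since analysis windows overlap, a sound that persists across adjacent windows comes out as one long event carrying the highest confidence seen.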

AudioClassifier

Bases: ManagedPredictor

Audio event and sound classification using AST.
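Classification runs over a sliding window: 10-second chunks with a 5-second hop at 16 kHz, per the class constants in the source listing. The window layout can be sketched as below; the function name is hypothetical and it works in seconds, whereas the library works in samples.

```python
def window_starts(duration_s, chunk_s=10.0, hop_s=5.0):
    """Start times (seconds) of the analysis windows for a clip of duration_s."""
    if duration_s <= chunk_s:
        # Short clips are classified as a single window.
        return [0.0]
    starts, start = [], 0.0
    while start < duration_s:
        starts.append(start)
        start += hop_s
    return starts
```

Windows that run past the end of the clip are zero-padded to the full chunk length in the source, and the reported event end time is clamped to the clip duration.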

Source code in src/videopython/ai/understanding/classification.py

class AudioClassifier(ManagedPredictor):
    """Audio event and sound classification using AST."""

    _model_attrs = ("_model", "_processor")
    AST_SAMPLE_RATE: int = 16000
    AST_CHUNK_SECONDS: float = 10.0
    AST_HOP_SECONDS: float = 5.0

    def __init__(
        self,
        model_name: str = "MIT/ast-finetuned-audioset-10-10-0.4593",
        confidence_threshold: float = 0.3,
        top_k: int = 10,
        device: str | None = None,
    ):
        self.model_name = model_name
        self.confidence_threshold = confidence_threshold
        self.top_k = top_k
        self.device = select_device(device, mps_allowed=True)
        log_device_initialization(
            "AudioClassifier",
            requested_device=device,
            resolved_device=self.device,
        )

        self._model: Any = None
        self._processor: Any = None
        self._labels: list[str] = []

    def _init_local(self) -> None:
        """Initialize local AST model from HuggingFace."""
        from videopython.ai._optional import require

        _transformers = require("transformers", feature="AudioClassifier")
        ASTFeatureExtractor = _transformers.ASTFeatureExtractor
        ASTForAudioClassification = _transformers.ASTForAudioClassification

        self._processor = ASTFeatureExtractor.from_pretrained(self.model_name, revision=pinned(self.model_name))
        self._model = ASTForAudioClassification.from_pretrained(self.model_name, revision=pinned(self.model_name))
        self._model.to(self.device)
        self._model.eval()

        self._labels = [self._model.config.id2label[i] for i in range(len(self._model.config.id2label))]

    def _merge_events(self, events: list[AudioEvent], gap_threshold: float = 0.5) -> list[AudioEvent]:
        """Merge consecutive events of the same class."""
        if not events:
            return []

        events_by_label: dict[str, list[AudioEvent]] = {}
        for event in events:
            if event.label not in events_by_label:
                events_by_label[event.label] = []
            events_by_label[event.label].append(event)

        merged = []
        for label, label_events in events_by_label.items():
            sorted_events = sorted(label_events, key=lambda e: e.start)
            current = sorted_events[0]

            for next_event in sorted_events[1:]:
                if next_event.start - current.end <= gap_threshold:
                    current = AudioEvent(
                        start=current.start,
                        end=next_event.end,
                        label=label,
                        confidence=max(current.confidence, next_event.confidence),
                    )
                else:
                    merged.append(current)
                    current = next_event

            merged.append(current)

        return sorted(merged, key=lambda e: e.start)

    def _classify_local(self, audio: Audio) -> AudioClassification:
        """Classify audio using local AST model with sliding window."""
        import numpy as np
        import torch

        if self._model is None:
            self._init_local()

        audio_processed = audio.to_mono().resample(self.AST_SAMPLE_RATE)
        audio_data = audio_processed.data.astype(np.float32)

        chunk_samples = int(self.AST_CHUNK_SECONDS * self.AST_SAMPLE_RATE)
        hop_samples = int(self.AST_HOP_SECONDS * self.AST_SAMPLE_RATE)
        total_samples = len(audio_data)

        all_chunk_probs = []
        chunk_times = []

        if total_samples <= chunk_samples:
            chunks = [(0, audio_data)]
        else:
            chunks = []
            start = 0
            while start < total_samples:
                end = min(start + chunk_samples, total_samples)
                chunk = audio_data[start:end]
                if len(chunk) < chunk_samples:
                    chunk = np.pad(chunk, (0, chunk_samples - len(chunk)))
                chunks.append((start, chunk))
                start += hop_samples

        for start_sample, chunk in chunks:
            start_time = start_sample / self.AST_SAMPLE_RATE

            inputs = self._processor(
                chunk,
                sampling_rate=self.AST_SAMPLE_RATE,
                return_tensors="pt",
            )
            inputs = {k: v.to(self.device) for k, v in inputs.items()}

            with torch.no_grad():
                outputs = self._model(**inputs)
                logits = outputs.logits[0]
                probs = torch.sigmoid(logits).cpu().numpy()

            all_chunk_probs.append(probs)
            chunk_times.append(start_time)

        chunk_probs_array = np.array(all_chunk_probs)

        events = []
        for start_time, probs in zip(chunk_times, chunk_probs_array):
            end_time = start_time + self.AST_CHUNK_SECONDS
            top_indices = np.argsort(probs)[-self.top_k :][::-1]

            for class_idx in top_indices:
                confidence = float(probs[class_idx])
                if confidence >= self.confidence_threshold:
                    label = self._labels[class_idx]
                    events.append(
                        AudioEvent(
                            start=start_time,
                            end=min(end_time, total_samples / self.AST_SAMPLE_RATE),
                            label=label,
                            confidence=confidence,
                        )
                    )

        merged_events = self._merge_events(events)

        clip_preds = np.mean(chunk_probs_array, axis=0)
        top_clip_indices = np.argsort(clip_preds)[-self.top_k :][::-1]
        clip_predictions = {
            self._labels[idx]: float(clip_preds[idx])
            for idx in top_clip_indices
            if clip_preds[idx] >= self.confidence_threshold
        }

        return AudioClassification(events=merged_events, clip_predictions=clip_predictions)

    def classify(self, media: Audio | Video) -> AudioClassification:
        """Classify audio events in audio or video."""
        if isinstance(media, Video):
            if media.audio.is_silent:
                return AudioClassification(events=[], clip_predictions={})
            audio = media.audio
        elif isinstance(media, Audio):
            if media.is_silent:
                return AudioClassification(events=[], clip_predictions={})
            audio = media
        else:
            raise TypeError(f"Unsupported media type: {type(media)}. Expected Audio or Video.")

        return self._classify_local(audio)
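
The sliding-window chunking in _classify_local above can be illustrated in isolation. The following is a minimal sketch, not the library's code: it works on plain Python lists instead of NumPy arrays, and the function name split_into_chunks and its parameters are hypothetical stand-ins for the AST_CHUNK_SECONDS / AST_HOP_SECONDS arithmetic.

```python
from __future__ import annotations


def split_into_chunks(
    samples: list[float], chunk_samples: int, hop_samples: int
) -> list[tuple[int, list[float]]]:
    """Return (start_sample, chunk) pairs for overlapping windows.

    Hypothetical helper: chunks are zero-padded to chunk_samples, except
    when the whole clip is shorter than one chunk.
    """
    total = len(samples)
    if total <= chunk_samples:
        # A short clip is passed through as a single, unpadded chunk.
        return [(0, list(samples))]

    chunks = []
    start = 0
    while start < total:
        chunk = list(samples[start : start + chunk_samples])
        # Zero-pad the tail so every window has the same length.
        chunk += [0.0] * (chunk_samples - len(chunk))
        chunks.append((start, chunk))
        start += hop_samples
    return chunks


samples = [float(i) for i in range(10)]
chunks = split_into_chunks(samples, chunk_samples=4, hop_samples=2)
print([start for start, _ in chunks])  # [0, 2, 4, 6, 8]
print(chunks[-1][1])                   # [8.0, 9.0, 0.0, 0.0]
```

Because the loop runs while start is below the clip length, the last windows can consist mostly of padding, which is why event end times are clamped to the clip duration.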

classify

classify(media: Audio | Video) -> AudioClassification

Classify audio events in audio or video.

Source code in src/videopython/ai/understanding/classification.py

def classify(self, media: Audio | Video) -> AudioClassification:
    """Classify audio events in audio or video."""
    if isinstance(media, Video):
        if media.audio.is_silent:
            return AudioClassification(events=[], clip_predictions={})
        audio = media.audio
    elif isinstance(media, Audio):
        if media.is_silent:
            return AudioClassification(events=[], clip_predictions={})
        audio = media
    else:
        raise TypeError(f"Unsupported media type: {type(media)}. Expected Audio or Video.")

    return self._classify_local(audio)
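
The per-chunk and clip-level label selection shown in the AudioClassifier source (top-k by probability, then a confidence cut-off) can be sketched without NumPy. This is an illustrative approximation: the function name and the label strings are made up for the example, and the ordering of exactly tied scores may differ from np.argsort.

```python
from __future__ import annotations


def top_k_above_threshold(
    probs: list[float], labels: list[str], top_k: int, threshold: float
) -> dict[str, float]:
    """Keep the top_k most probable classes, then drop those below threshold."""
    # Rank class indices by probability, highest first.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    return {labels[i]: probs[i] for i in ranked if probs[i] >= threshold}


# Illustrative labels and scores; sigmoid outputs are independent per class,
# so they do not need to sum to 1.
labels = ["Silence", "Speech", "Music", "Applause"]
probs = [0.1, 0.8, 0.35, 0.6]
print(top_k_above_threshold(probs, labels, top_k=3, threshold=0.4))
# {'Speech': 0.8, 'Applause': 0.6}
```

Note that the threshold is applied after the top-k cut, so fewer than top_k labels (possibly none) can be returned.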

SceneVLM

SceneVLM describes scenes with a local Ollama vision model. The model kwarg takes the tag of an Ollama model you have already pulled (default qwen3.6:27b). It needs a running Ollama server and a vision-capable model that supports structured output.

analyze_scene() and analyze_frame() return a structured SceneDescription with three fields: a one-sentence caption, an open-ended list of subjects, and a shot_type drawn from a closed enum. The schema is passed to Ollama's format parameter, so the model returns valid JSON directly. If the model returns a shot_type outside the enum, the field is set to None.

from videopython.ai import SceneVLM

vlm = SceneVLM(model="llava")  # any pulled vision model
description = vlm.analyze_frame(frame_array)

print(description.caption)     # "A man in a cap speaks into a microphone."
print(description.subjects)    # ["man", "microphone", "cap"]
print(description.shot_type)   # "medium"

SceneVLM.unload() clears the Ollama client, so SceneVLM can be released the same way as the other predictors when running in low_memory mode.
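
How the raw JSON from the model is coerced into the structured result can be sketched as below. This is a stand-alone approximation of the validation visible in analyze_scene: SceneDescriptionSketch and ALLOWED_SHOT_TYPES are hypothetical names, and apart from "medium" (taken from the example above) the shot-type values are placeholders, since the library's real enum is not listed in this section.

```python
from __future__ import annotations

from dataclasses import dataclass, field

# Placeholder values: only "medium" appears in the documentation above.
ALLOWED_SHOT_TYPES = {"close-up", "medium", "wide"}


@dataclass
class SceneDescriptionSketch:
    caption: str = ""
    subjects: list[str] = field(default_factory=list)
    shot_type: str | None = None


def coerce_scene(data: dict) -> SceneDescriptionSketch:
    """Build a description from model JSON, tolerating missing or odd fields."""
    shot_type = data.get("shot_type")
    return SceneDescriptionSketch(
        caption=str(data.get("caption", "")),
        subjects=[str(s) for s in data.get("subjects", [])],
        # Anything outside the closed set becomes None rather than raising.
        shot_type=shot_type if shot_type in ALLOWED_SHOT_TYPES else None,
    )


desc = coerce_scene({"caption": "A man speaks.", "subjects": ["man", 3], "shot_type": "extreme"})
print(desc.subjects)   # ['man', '3']
print(desc.shot_type)  # None
```

Callers should therefore treat shot_type as optional even though the schema constrains it.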

SceneVLM

Bases: ManagedPredictor

Generates structured scene descriptions with a local Ollama vision model.

The model must be vision-capable and support Ollama's structured-output format; ollama pull <model> first. options are extra Ollama generation options merged over temperature=0.

Source code in src/videopython/ai/understanding/image.py

class SceneVLM(ManagedPredictor):
    """Generates structured scene descriptions with a local Ollama vision model.

    The model must be vision-capable and support Ollama's structured-output
    ``format``; ``ollama pull <model>`` first. ``options`` are extra Ollama
    generation options merged over ``temperature=0``.
    """

    def __init__(
        self,
        model: str = DEFAULT_SCENE_VLM_MODEL,
        *,
        host: str | None = None,
        options: dict[str, Any] | None = None,
    ) -> None:
        self._client = OllamaStructuredClient(model=model, host=host, options=options)

    def analyze_frame(self, image: np.ndarray | Image.Image, prompt: str | None = None) -> SceneDescription:
        """Analyze one frame and return a structured scene description."""
        return self.analyze_scene([image], prompt=prompt)

    def analyze_scene(self, images: list[np.ndarray | Image.Image], prompt: str | None = None) -> SceneDescription:
        """Analyze a scene's frames and return a structured description."""
        if not images:
            raise ValueError("`images` must contain at least one frame")
        frames = [_to_rgb_array(image) for image in images]
        data = self._client.generate_json(
            system=_SYSTEM_PROMPT, text=prompt or _USER_PROMPT, schema=_SCENE_SCHEMA, images=frames
        )
        shot_type = data.get("shot_type")
        return SceneDescription(
            caption=str(data.get("caption", "")),
            subjects=[str(s) for s in data.get("subjects", [])],
            shot_type=shot_type if shot_type in _SHOT_TYPES else None,
        )

    def unload(self) -> None:
        self._client.unload()

analyze_frame

analyze_frame(
    image: ndarray | Image, prompt: str | None = None
) -> SceneDescription

Analyze one frame and return a structured scene description.

Source code in src/videopython/ai/understanding/image.py

def analyze_frame(self, image: np.ndarray | Image.Image, prompt: str | None = None) -> SceneDescription:
    """Analyze one frame and return a structured scene description."""
    return self.analyze_scene([image], prompt=prompt)

analyze_scene

analyze_scene(
    images: list[ndarray | Image], prompt: str | None = None
) -> SceneDescription

Analyze a scene's frames and return a structured description.

Source code in src/videopython/ai/understanding/image.py

def analyze_scene(self, images: list[np.ndarray | Image.Image], prompt: str | None = None) -> SceneDescription:
    """Analyze a scene's frames and return a structured description."""
    if not images:
        raise ValueError("`images` must contain at least one frame")
    frames = [_to_rgb_array(image) for image in images]
    data = self._client.generate_json(
        system=_SYSTEM_PROMPT, text=prompt or _USER_PROMPT, schema=_SCENE_SCHEMA, images=frames
    )
    shot_type = data.get("shot_type")
    return SceneDescription(
        caption=str(data.get("caption", "")),
        subjects=[str(s) for s in data.get("subjects", [])],
        shot_type=shot_type if shot_type in _SHOT_TYPES else None,
    )

SemanticSceneDetector

ML-based scene boundary detection using TransNetV2. More accurate than histogram-based detection, especially for gradual transitions like fades and dissolves.

from videopython.ai import SemanticSceneDetector

detector = SemanticSceneDetector(threshold=0.5, min_scene_length=1.0)
scenes = detector.detect_streaming("video.mp4")

for scene in scenes:
    print(f"Scene: {scene.start:.1f}s - {scene.end:.1f}s ({scene.duration:.1f}s)")

SemanticSceneDetector

Bases: ManagedPredictor

ML-based scene detection using TransNetV2.

TransNetV2 is a neural network specifically designed for shot boundary detection, providing more accurate scene boundaries than histogram-based methods, especially for gradual transitions.

Uses the transnetv2-pytorch package with pretrained weights.

Example

>>> from videopython.ai.understanding import SemanticSceneDetector
>>> detector = SemanticSceneDetector()
>>> scenes = detector.detect_streaming("video.mp4")
>>> for scene in scenes:
...     print(f"Scene: {scene.start:.2f}s - {scene.end:.2f}s")

Source code in src/videopython/ai/understanding/temporal.py

class SemanticSceneDetector(ManagedPredictor):
    """ML-based scene detection using TransNetV2.

    TransNetV2 is a neural network specifically designed for shot boundary
    detection, providing more accurate scene boundaries than histogram-based
    methods, especially for gradual transitions.

    Uses the transnetv2-pytorch package with pretrained weights.

    Example:
        >>> from videopython.ai.understanding import SemanticSceneDetector
        >>> detector = SemanticSceneDetector()
        >>> scenes = detector.detect_streaming("video.mp4")
        >>> for scene in scenes:
        ...     print(f"Scene: {scene.start:.2f}s - {scene.end:.2f}s")
    """

    def __init__(
        self,
        threshold: float = 0.5,
        min_scene_length: float = 0.5,
        device: str | None = None,
    ):
        """Initialize the semantic scene detector.

        Args:
            threshold: Confidence threshold for scene boundaries (0.0-1.0).
                Higher values = fewer, more confident boundaries.
            min_scene_length: Minimum scene duration in seconds.
            device: Device to run on ('cuda', 'mps', 'cpu', or None for auto).
                Note: MPS may have numerical inconsistencies; use 'cpu' for
                reproducible results.
        """
        if not 0.0 <= threshold <= 1.0:
            raise ValueError("threshold must be between 0.0 and 1.0")
        if min_scene_length < 0:
            raise ValueError("min_scene_length must be non-negative")

        self.threshold = threshold
        self.min_scene_length = min_scene_length
        self.device: str | None = device
        self._model: Any = None

    def _init_local(self) -> None:
        """Load the TransNetV2 model with pretrained weights."""
        if self._model is not None:
            return

        from videopython.ai._optional import require

        TransNetV2 = require("transnetv2_pytorch", feature="SemanticSceneDetector").TransNetV2

        requested_device = self.device
        device = select_device(self.device, mps_allowed=True)
        log_device_initialization(
            "SemanticSceneDetector",
            requested_device=requested_device,
            resolved_device=device,
        )
        self.device = device
        self._model = TransNetV2(device=device)
        self._model.eval()

    def detect(self, video: Video) -> list[SceneBoundary]:
        """Detect scenes in a video using ML-based boundary detection.

        Note: This method requires saving video to a temporary file for
        TransNetV2 processing. For better performance, use detect_streaming()
        with a file path directly.

        Args:
            video: Video object to analyze.

        Returns:
            List of SceneBoundary objects representing detected scenes.
        """
        import tempfile

        if len(video.frames) == 0:
            return []

        if len(video.frames) == 1:
            return [SceneBoundary(start=0.0, end=video.total_seconds, start_frame=0, end_frame=1)]

        # Save video to temp file for TransNetV2 processing
        with tempfile.NamedTemporaryFile(suffix=".mp4", delete=True) as tmp:
            video.save(tmp.name)
            return self.detect_streaming(tmp.name)

    def detect_streaming(self, path: str | Path) -> list[SceneBoundary]:
        """Detect scenes from a video file.

        Uses TransNetV2 with pretrained weights for accurate shot boundary
        detection.

        Args:
            path: Path to video file.

        Returns:
            List of SceneBoundary objects representing detected scenes.
        """
        self._init_local()

        # Use TransNetV2's detect_scenes which handles everything internally
        raw_scenes = self._model.detect_scenes(str(path), threshold=self.threshold)

        # Convert to SceneBoundary objects
        scenes = []
        for scene_data in raw_scenes:
            start_frame = scene_data["start_frame"]
            end_frame = scene_data["end_frame"]
            start_time = float(scene_data["start_time"])
            end_time = float(scene_data["end_time"])

            scenes.append(
                SceneBoundary(
                    start=start_time,
                    end=end_time,
                    start_frame=start_frame,
                    end_frame=end_frame,
                )
            )

        if self.min_scene_length > 0:
            scenes = self._merge_short_scenes(scenes)

        return scenes

    def _merge_short_scenes(self, scenes: list[SceneBoundary]) -> list[SceneBoundary]:
        """Merge scenes that are shorter than min_scene_length.

        Args:
            scenes: List of scenes to process.

        Returns:
            List of scenes with short scenes merged into adjacent ones.
        """
        if not scenes:
            return scenes

        merged = [scenes[0]]

        for scene in scenes[1:]:
            last_scene = merged[-1]

            if last_scene.duration < self.min_scene_length:
                merged[-1] = SceneBoundary(
                    start=last_scene.start,
                    end=scene.end,
                    start_frame=last_scene.start_frame,
                    end_frame=scene.end_frame,
                )
            else:
                merged.append(scene)

        if len(merged) > 1 and merged[-1].duration < self.min_scene_length:
            second_last = merged[-2]
            last = merged[-1]
            merged[-2] = SceneBoundary(
                start=second_last.start,
                end=last.end,
                start_frame=second_last.start_frame,
                end_frame=last.end_frame,
            )
            merged.pop()

        return merged

    @classmethod
    def detect_from_path(
        cls,
        path: str | Path,
        threshold: float = 0.5,
        min_scene_length: float = 0.5,
    ) -> list[SceneBoundary]:
        """Convenience method for one-shot scene detection.

        Args:
            path: Path to video file.
            threshold: Scene boundary threshold (0.0-1.0).
            min_scene_length: Minimum scene duration in seconds.

        Returns:
            List of SceneBoundary objects representing detected scenes.
        """
        detector = cls(threshold=threshold, min_scene_length=min_scene_length)
        return detector.detect_streaming(path)
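
The effect of min_scene_length is easiest to see on a small input. The sketch below mirrors the merging rule of _merge_short_scenes in the class source, using a simplified, hypothetical Scene dataclass that carries only start and end times (the real SceneBoundary also tracks frame indices).

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Scene:
    """Hypothetical stand-in for SceneBoundary with times only."""

    start: float
    end: float

    @property
    def duration(self) -> float:
        return self.end - self.start


def merge_short_scenes(scenes: list[Scene], min_scene_length: float) -> list[Scene]:
    if not scenes:
        return []
    merged = [scenes[0]]
    for scene in scenes[1:]:
        last = merged[-1]
        if last.duration < min_scene_length:
            # The previous scene is too short: extend it over the current one.
            merged[-1] = Scene(start=last.start, end=scene.end)
        else:
            merged.append(scene)
    # A short final scene has no successor, so fold it into its predecessor.
    if len(merged) > 1 and merged[-1].duration < min_scene_length:
        merged[-2:] = [Scene(start=merged[-2].start, end=merged[-1].end)]
    return merged


scenes = [Scene(0.0, 0.2), Scene(0.2, 3.0), Scene(3.0, 3.1)]
print(merge_short_scenes(scenes, min_scene_length=0.5))
# [Scene(start=0.0, end=3.1)]
```

A short scene is merged forward into the scene that follows it; only a short scene at the very end is merged backward.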

__init__

__init__(
    threshold: float = 0.5,
    min_scene_length: float = 0.5,
    device: str | None = None,
)

Initialize the semantic scene detector.

Parameters:

Name	Type	Description	Default
`threshold`	`float`	Confidence threshold for scene boundaries (0.0-1.0). Higher values = fewer, more confident boundaries.	`0.5`
`min_scene_length`	`float`	Minimum scene duration in seconds.	`0.5`
`device`	`str \| None`	Device to run on ('cuda', 'mps', 'cpu', or None for auto). Note: MPS may have numerical inconsistencies; use 'cpu' for reproducible results.	`None`

Source code in src/videopython/ai/understanding/temporal.py

def __init__(
    self,
    threshold: float = 0.5,
    min_scene_length: float = 0.5,
    device: str | None = None,
):
    """Initialize the semantic scene detector.

    Args:
        threshold: Confidence threshold for scene boundaries (0.0-1.0).
            Higher values = fewer, more confident boundaries.
        min_scene_length: Minimum scene duration in seconds.
        device: Device to run on ('cuda', 'mps', 'cpu', or None for auto).
            Note: MPS may have numerical inconsistencies; use 'cpu' for
            reproducible results.
    """
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be between 0.0 and 1.0")
    if min_scene_length < 0:
        raise ValueError("min_scene_length must be non-negative")

    self.threshold = threshold
    self.min_scene_length = min_scene_length
    self.device: str | None = device
    self._model: Any = None

detect

detect(video: Video) -> list[SceneBoundary]

Detect scenes in a video using ML-based boundary detection.

Note: This method requires saving video to a temporary file for TransNetV2 processing. For better performance, use detect_streaming() with a file path directly.

Parameters:

Name	Type	Description	Default
`video`	`Video`	Video object to analyze.	required

Returns:

Type	Description
`list[SceneBoundary]`	List of SceneBoundary objects representing detected scenes.

Source code in src/videopython/ai/understanding/temporal.py

def detect(self, video: Video) -> list[SceneBoundary]:
    """Detect scenes in a video using ML-based boundary detection.

    Note: This method requires saving video to a temporary file for
    TransNetV2 processing. For better performance, use detect_streaming()
    with a file path directly.

    Args:
        video: Video object to analyze.

    Returns:
        List of SceneBoundary objects representing detected scenes.
    """
    import tempfile

    if len(video.frames) == 0:
        return []

    if len(video.frames) == 1:
        return [SceneBoundary(start=0.0, end=video.total_seconds, start_frame=0, end_frame=1)]

    # Save video to temp file for TransNetV2 processing
    with tempfile.NamedTemporaryFile(suffix=".mp4", delete=True) as tmp:
        video.save(tmp.name)
        return self.detect_streaming(tmp.name)

detect_streaming

detect_streaming(path: str | Path) -> list[SceneBoundary]

Detect scenes from a video file.

Uses TransNetV2 with pretrained weights for accurate shot boundary detection.

Parameters:

Name	Type	Description	Default
`path`	`str \| Path`	Path to video file.	required

Returns:

Type	Description
`list[SceneBoundary]`	List of SceneBoundary objects representing detected scenes.

Source code in src/videopython/ai/understanding/temporal.py

def detect_streaming(self, path: str | Path) -> list[SceneBoundary]:
    """Detect scenes from a video file.

    Uses TransNetV2 with pretrained weights for accurate shot boundary
    detection.

    Args:
        path: Path to video file.

    Returns:
        List of SceneBoundary objects representing detected scenes.
    """
    self._init_local()

    # Use TransNetV2's detect_scenes which handles everything internally
    raw_scenes = self._model.detect_scenes(str(path), threshold=self.threshold)

    # Convert to SceneBoundary objects
    scenes = []
    for scene_data in raw_scenes:
        start_frame = scene_data["start_frame"]
        end_frame = scene_data["end_frame"]
        start_time = float(scene_data["start_time"])
        end_time = float(scene_data["end_time"])

        scenes.append(
            SceneBoundary(
                start=start_time,
                end=end_time,
                start_frame=start_frame,
                end_frame=end_frame,
            )
        )

    if self.min_scene_length > 0:
        scenes = self._merge_short_scenes(scenes)

    return scenes

detect_from_path `classmethod`

detect_from_path(
    path: str | Path,
    threshold: float = 0.5,
    min_scene_length: float = 0.5,
) -> list[SceneBoundary]

Convenience method for one-shot scene detection.

Parameters:

Name	Type	Description	Default
`path`	`str \| Path`	Path to video file.	required
`threshold`	`float`	Scene boundary threshold (0.0-1.0).	`0.5`
`min_scene_length`	`float`	Minimum scene duration in seconds.	`0.5`

Returns:

Type	Description
`list[SceneBoundary]`	List of SceneBoundary objects representing detected scenes.

Source code in src/videopython/ai/understanding/temporal.py

@classmethod
def detect_from_path(
    cls,
    path: str | Path,
    threshold: float = 0.5,
    min_scene_length: float = 0.5,
) -> list[SceneBoundary]:
    """Convenience method for one-shot scene detection.

    Args:
        path: Path to video file.
        threshold: Scene boundary threshold (0.0-1.0).
        min_scene_length: Minimum scene duration in seconds.

    Returns:
        List of SceneBoundary objects representing detected scenes.
    """
    detector = cls(threshold=threshold, min_scene_length=min_scene_length)
    return detector.detect_streaming(path)

Face Tracking

Two face trackers are built on the same YuNet detector, each aimed at a different use case:

FaceShotTracker.track_shot(frames, frame_indices) returns a list of FaceTrack objects with stable ids within a shot, via IoU association — no embedding re-id, so a track does not survive across shot boundaries. This is the API the analyzer uses.
FaceSmoothingTracker.detect_and_track(frame, frame_index) / track_video(frames) are the single-subject smoothed-position APIs used by FaceTrackingCrop (see AI Transforms).

from videopython.ai import FaceShotTracker

tracker = FaceShotTracker()
tracks = tracker.track_shot(frames)

for track in tracks:
    print(f"track #{track.track_id}: {track.length} frames, "
          f"first frame {track.frame_indices[0]}")

FaceShotTracker

Bases: _FaceTrackerBase

Per-shot multi-track face association via IoU.

Detects faces on every input frame and stitches them into FaceTracks greedily by best IoU. Tracks do not survive across shot boundaries (IoU-only association; no embedding re-id). Used by the video-analysis pipeline to bind detections to subjects within one shot.
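
For readers unfamiliar with the metric, intersection over union (IoU) between two boxes can be computed as in this minimal sketch. The box layout (x1, y1, x2, y2) is an assumption made for the example; the library's own _bbox_iou helper and its bounding-box type are not shown in this section and may use a different representation.

```python
from __future__ import annotations

Box = tuple[float, float, float, float]  # assumed layout: (x1, y1, x2, y2)


def bbox_iou(a: Box, b: Box) -> float:
    """Intersection area divided by union area; 0.0 for disjoint boxes."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    # Width and height of the overlap, clamped at zero when boxes do not touch.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


print(bbox_iou((0, 0, 2, 2), (0, 0, 2, 2)))  # 1.0
print(bbox_iou((0, 0, 2, 2), (5, 5, 6, 6)))  # 0.0
```

A face that moves little between consecutive sampled frames keeps a high IoU with its previous box, which is what lets the tracker continue a track without any appearance embedding.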

Source code in src/videopython/ai/understanding/faces.py

class FaceShotTracker(_FaceTrackerBase):
    """Per-shot multi-track face association via IoU.

    Detects faces on every input frame and stitches them into ``FaceTrack``s
    greedily by best IoU. Tracks do not survive across shot boundaries
    (IoU-only association; no embedding re-id). Used by the video-analysis
    pipeline to bind detections to subjects within one shot.
    """

    def __init__(
        self,
        min_face_size: int = 30,
        batch_size: int = 16,
        iou_match_threshold: float = DEFAULT_IOU_MATCH_THRESHOLD,
        max_missed_frames: int = DEFAULT_MAX_MISSED_FRAMES,
    ):
        """Initialize the per-shot tracker.

        Args:
            min_face_size: Minimum face size in pixels for detection.
            batch_size: Batch size for detection. Default 16.
            iou_match_threshold: Minimum IoU between consecutive detections to
                continue an existing track.
            max_missed_frames: Consecutive frames a track may go without a
                detection before it is closed.
        """
        super().__init__(min_face_size=min_face_size)
        self.batch_size = batch_size
        self.iou_match_threshold = iou_match_threshold
        self.max_missed_frames = max_missed_frames
        logger.info("FaceShotTracker initialized (min_face_size=%s)", self.min_face_size)

    def track_shot(
        self,
        frames: list[np.ndarray] | np.ndarray,
        frame_indices: list[int] | None = None,
    ) -> list[FaceTrack]:
        """Per-shot multi-track association via IoU.

        Detection is run on every input frame (caller is expected to have
        already chosen the sampling cadence -- the analysis pipeline
        passes one frame per scene-VLM sample, lip-sync passes every
        frame in the shot). Tracks are stitched together greedily by
        best IoU above ``iou_match_threshold``; tracks with no match for
        ``max_missed_frames`` consecutive frames are closed and won't
        accept future associations.

        Track ids are integers starting at 1 within this shot. They are
        **not** stable across shots — embedding re-id is deferred.

        Args:
            frames: Frames in the shot (list or stacked ndarray).
            frame_indices: Source-video frame indices. Defaults to
                ``range(len(frames))`` when omitted.

        Returns:
            List of ``FaceTrack`` objects, one per distinct subject
            tracked in the shot.
        """
        if isinstance(frames, np.ndarray):
            frame_list = [frames[i] for i in range(frames.shape[0])] if frames.ndim == 4 else [frames]
        else:
            frame_list = list(frames)

        if not frame_list:
            return []

        if frame_indices is None:
            frame_indices = list(range(len(frame_list)))
        if len(frame_indices) != len(frame_list):
            raise ValueError("frame_indices length must match frames length")

        if self._detector is None:
            self._init_detector()
            assert self._detector is not None

        per_frame_detections: list[list[DetectedFace]] = []
        for batch_start in range(0, len(frame_list), self.batch_size):
            batch = frame_list[batch_start : batch_start + self.batch_size]
            per_frame_detections.extend(self._detector.detect_batch(batch))

        active: list[_OpenTrack] = []
        finished: list[_OpenTrack] = []
        next_id = 1

        for relative_idx, faces in enumerate(per_frame_detections):
            absolute_idx = frame_indices[relative_idx]
            available = [face for face in faces if face.bounding_box is not None]
            assignments: dict[int, DetectedFace] = {}

            for track in active:
                best_face: DetectedFace | None = None
                best_iou = self.iou_match_threshold
                last_box = track.last_box
                if last_box is None:
                    continue
                for face in available:
                    if face in assignments.values() or face.bounding_box is None:
                        continue
                    iou = _bbox_iou(last_box, face.bounding_box)
                    if iou > best_iou:
                        best_iou = iou
                        best_face = face
                if best_face is not None:
                    assignments[track.track_id] = best_face

            for track in active:
                if track.track_id in assignments:
                    face = assignments[track.track_id]
                    assert face.bounding_box is not None
                    track.frame_indices.append(absolute_idx)
                    track.boxes.append(face.bounding_box)
                    track.confidences.append(face.confidence)
                    track.last_box = face.bounding_box
                    track.missed = 0
                else:
                    track.missed += 1

            for face in available:
                if face in assignments.values() or face.bounding_box is None:
                    continue
                track = _OpenTrack(track_id=next_id, last_box=face.bounding_box)
                next_id += 1
                track.frame_indices.append(absolute_idx)
                track.boxes.append(face.bounding_box)
                track.confidences.append(face.confidence)
                active.append(track)

            still_active: list[_OpenTrack] = []
            for track in active:
                if track.missed > self.max_missed_frames:
                    finished.append(track)
                else:
                    still_active.append(track)
            active = still_active

        finished.extend(active)

        return [
            FaceTrack(
                track_id=track.track_id,
                frame_indices=track.frame_indices,
                boxes=track.boxes,
                confidences=track.confidences,
            )
            for track in finished
            if track.frame_indices
        ]
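
The per-frame association step in track_shot can be reduced to the following sketch. It is a simplified illustration under stated assumptions: boxes use an (x1, y1, x2, y2) layout, match_tracks and bbox_iou are hypothetical names, and bookkeeping such as missed-frame counters and confidences is left out.

```python
from __future__ import annotations

Box = tuple[float, float, float, float]  # assumed layout: (x1, y1, x2, y2)


def bbox_iou(a: Box, b: Box) -> float:
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def match_tracks(
    last_boxes: dict[int, Box], detections: list[Box], iou_threshold: float
) -> dict[int, int]:
    """Map each track id to the index of the detection it claims."""
    assignments: dict[int, int] = {}
    taken: set[int] = set()
    for track_id, last_box in last_boxes.items():
        best_idx = None
        best_iou = iou_threshold  # a match must be strictly better than this
        for idx, box in enumerate(detections):
            if idx in taken:
                continue
            iou = bbox_iou(last_box, box)
            if iou > best_iou:
                best_iou, best_idx = iou, idx
        if best_idx is not None:
            assignments[track_id] = best_idx
            taken.add(best_idx)
    return assignments


tracks = {1: (0, 0, 10, 10), 2: (20, 0, 30, 10)}
detections = [(21, 0, 31, 10), (1, 0, 11, 10), (50, 50, 60, 60)]
print(match_tracks(tracks, detections, iou_threshold=0.3))  # {1: 1, 2: 0}
```

Tracks claim detections in iteration order, so the result is greedy rather than globally optimal. Detections left unclaimed (index 2 here) are the ones that would start new tracks.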

__init__

__init__(
    min_face_size: int = 30,
    batch_size: int = 16,
    iou_match_threshold: float = DEFAULT_IOU_MATCH_THRESHOLD,
    max_missed_frames: int = DEFAULT_MAX_MISSED_FRAMES,
)

Initialize the per-shot tracker.

Parameters:

Name	Type	Description	Default
`min_face_size`	`int`	Minimum face size in pixels for detection.	`30`
`batch_size`	`int`	Batch size for detection. Default 16.	`16`
`iou_match_threshold`	`float`	Minimum IoU between consecutive detections to continue an existing track.	`DEFAULT_IOU_MATCH_THRESHOLD`
`max_missed_frames`	`int`	Consecutive frames a track may go without a detection before it is closed.	`DEFAULT_MAX_MISSED_FRAMES`

Source code in src/videopython/ai/understanding/faces.py

def __init__(
    self,
    min_face_size: int = 30,
    batch_size: int = 16,
    iou_match_threshold: float = DEFAULT_IOU_MATCH_THRESHOLD,
    max_missed_frames: int = DEFAULT_MAX_MISSED_FRAMES,
):
    """Initialize the per-shot tracker.

    Args:
        min_face_size: Minimum face size in pixels for detection.
        batch_size: Batch size for detection. Default 16.
        iou_match_threshold: Minimum IoU between consecutive detections to
            continue an existing track.
        max_missed_frames: Consecutive frames a track may go without a
            detection before it is closed.
    """
    super().__init__(min_face_size=min_face_size)
    self.batch_size = batch_size
    self.iou_match_threshold = iou_match_threshold
    self.max_missed_frames = max_missed_frames
    logger.info("FaceShotTracker initialized (min_face_size=%s)", self.min_face_size)

track_shot

track_shot(
    frames: list[ndarray] | ndarray,
    frame_indices: list[int] | None = None,
) -> list[FaceTrack]

Per-shot multi-track association via IoU.

Detection is run on every input frame (caller is expected to have already chosen the sampling cadence -- the analysis pipeline passes one frame per scene-VLM sample, lip-sync passes every frame in the shot). Tracks are stitched together greedily by best IoU above iou_match_threshold; tracks with no match for max_missed_frames consecutive frames are closed and won't accept future associations.

Track ids are integers starting at 1 within this shot. They are not stable across shots — embedding re-id is deferred.

Parameters:

Name	Type	Description	Default
`frames`	`list[ndarray] \| ndarray`	Frames in the shot (list or stacked ndarray).	required
`frame_indices`	`list[int] \| None`	Source-video frame indices. Defaults to `range(len(frames))` when omitted.	`None`

Returns:

Type	Description
`list[FaceTrack]`	List of `FaceTrack` objects, one per distinct subject tracked in the shot.

Source code in src/videopython/ai/understanding/faces.py

def track_shot(
    self,
    frames: list[np.ndarray] | np.ndarray,
    frame_indices: list[int] | None = None,
) -> list[FaceTrack]:
    """Per-shot multi-track association via IoU.

    Detection is run on every input frame (caller is expected to have
    already chosen the sampling cadence -- the analysis pipeline
    passes one frame per scene-VLM sample, lip-sync passes every
    frame in the shot). Tracks are stitched together greedily by
    best IoU above ``iou_match_threshold``; tracks with no match for
    ``max_missed_frames`` consecutive frames are closed and won't
    accept future associations.

    Track ids are integers starting at 1 within this shot. They are
    **not** stable across shots — embedding re-id is deferred.

    Args:
        frames: Frames in the shot (list or stacked ndarray).
        frame_indices: Source-video frame indices. Defaults to
            ``range(len(frames))`` when omitted.

    Returns:
        List of ``FaceTrack`` objects, one per distinct subject
        tracked in the shot.
    """
    if isinstance(frames, np.ndarray):
        frame_list = [frames[i] for i in range(frames.shape[0])] if frames.ndim == 4 else [frames]
    else:
        frame_list = list(frames)

    if not frame_list:
        return []

    if frame_indices is None:
        frame_indices = list(range(len(frame_list)))
    if len(frame_indices) != len(frame_list):
        raise ValueError("frame_indices length must match frames length")

    if self._detector is None:
        self._init_detector()
        assert self._detector is not None

    per_frame_detections: list[list[DetectedFace]] = []
    for batch_start in range(0, len(frame_list), self.batch_size):
        batch = frame_list[batch_start : batch_start + self.batch_size]
        per_frame_detections.extend(self._detector.detect_batch(batch))

    active: list[_OpenTrack] = []
    finished: list[_OpenTrack] = []
    next_id = 1

    for relative_idx, faces in enumerate(per_frame_detections):
        absolute_idx = frame_indices[relative_idx]
        available = [face for face in faces if face.bounding_box is not None]
        assignments: dict[int, DetectedFace] = {}

        for track in active:
            best_face: DetectedFace | None = None
            best_iou = self.iou_match_threshold
            last_box = track.last_box
            if last_box is None:
                continue
            for face in available:
                if face in assignments.values() or face.bounding_box is None:
                    continue
                iou = _bbox_iou(last_box, face.bounding_box)
                if iou > best_iou:
                    best_iou = iou
                    best_face = face
            if best_face is not None:
                assignments[track.track_id] = best_face

        for track in active:
            if track.track_id in assignments:
                face = assignments[track.track_id]
                assert face.bounding_box is not None
                track.frame_indices.append(absolute_idx)
                track.boxes.append(face.bounding_box)
                track.confidences.append(face.confidence)
                track.last_box = face.bounding_box
                track.missed = 0
            else:
                track.missed += 1

        for face in available:
            if face in assignments.values() or face.bounding_box is None:
                continue
            track = _OpenTrack(track_id=next_id, last_box=face.bounding_box)
            next_id += 1
            track.frame_indices.append(absolute_idx)
            track.boxes.append(face.bounding_box)
            track.confidences.append(face.confidence)
            active.append(track)

        still_active: list[_OpenTrack] = []
        for track in active:
            if track.missed > self.max_missed_frames:
                finished.append(track)
            else:
                still_active.append(track)
        active = still_active

    finished.extend(active)

    return [
        FaceTrack(
            track_id=track.track_id,
            frame_indices=track.frame_indices,
            boxes=track.boxes,
            confidences=track.confidences,
        )
        for track in finished
        if track.frame_indices
    ]
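
The matching loop above calls `_bbox_iou`, whose body is not shown in this section. The following is a minimal sketch of the same idea, assuming boxes are plain `(x, y, width, height)` tuples in normalized coordinates rather than `BoundingBox` objects. `bbox_iou` and `greedy_match` are hypothetical names used for illustration only, not the library's implementation.

```python
Box = tuple[float, float, float, float]  # (x, y, width, height), normalized


def bbox_iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two (x, y, width, height) boxes."""
    ax2, ay2 = a[0] + a[2], a[1] + a[3]
    bx2, by2 = b[0] + b[2], b[1] + b[3]
    inter_w = max(0.0, min(ax2, bx2) - max(a[0], b[0]))
    inter_h = max(0.0, min(ay2, by2) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0


def greedy_match(tracks: dict[int, Box], faces: list[Box], threshold: float) -> dict[int, int]:
    """Map track id -> face index; each track claims its best unclaimed face."""
    assignments: dict[int, int] = {}
    for track_id, last_box in tracks.items():
        best_idx, best_iou = None, threshold
        for idx, face in enumerate(faces):
            if idx in assignments.values():
                continue  # already claimed by an earlier track
            iou = bbox_iou(last_box, face)
            if iou > best_iou:  # strictly above the threshold, as in the loop above
                best_idx, best_iou = idx, iou
        if best_idx is not None:
            assignments[track_id] = best_idx
    return assignments
```

Because assignment is greedy in track order, an earlier track can claim a face that a later track overlaps more strongly. Faces left unassigned would start new tracks, as in the source above.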

FaceSmoothingTracker

Bases: _FaceTrackerBase

Single-subject face tracker with EMA position smoothing.

Selects one face per frame (selection_strategy) and returns a smoothed (cx, cy, w, h) tuple in normalized coords via detect_and_track / track_video. Used by FaceTrackingCrop to drive a follow-the-speaker crop.

Source code in src/videopython/ai/understanding/faces.py

class FaceSmoothingTracker(_FaceTrackerBase):
    """Single-subject face tracker with EMA position smoothing.

    Selects one face per frame (``selection_strategy``) and returns a smoothed
    ``(cx, cy, w, h)`` tuple in normalized coords via ``detect_and_track`` /
    ``track_video``. Used by ``FaceTrackingCrop`` to drive a follow-the-speaker
    crop.
    """

    def __init__(
        self,
        selection_strategy: Literal["largest", "centered", "index"] = "largest",
        face_index: int = 0,
        smoothing: float = 0.8,
        detection_interval: int = 3,
        min_face_size: int = 30,
        batch_size: int = 16,
    ):
        """Initialize the smoothing tracker.

        Args:
            selection_strategy: Which face to track — "largest" (biggest box),
                "centered" (closest to frame center), or "index" (``face_index``).
            face_index: Index of face to track when using the "index" strategy.
            smoothing: Exponential moving average factor (0-1). Higher = smoother.
            detection_interval: Run detection every N frames, hold position between.
            min_face_size: Minimum face size in pixels for detection.
            batch_size: Frames per detection batch in ``track_video``. Default 16.
        """
        super().__init__(min_face_size=min_face_size)
        self.selection_strategy = selection_strategy
        self.face_index = face_index
        self.smoothing = smoothing
        self.detection_interval = detection_interval
        self.batch_size = batch_size
        self._last_position: tuple[float, float] | None = None
        self._last_size: tuple[float, float] | None = None
        self._smoothed_position: tuple[float, float] | None = None
        self._smoothed_size: tuple[float, float] | None = None
        logger.info("FaceSmoothingTracker initialized (detection_interval=%s)", self.detection_interval)

    def _select_face(
        self,
        faces: list[DetectedFace],
        frame_width: int,
        frame_height: int,
    ) -> tuple[float, float, float, float] | None:
        """Select a face based on the configured strategy.

        Args:
            faces: List of DetectedFace objects.
            frame_width: Width of the frame.
            frame_height: Height of the frame.

        Returns:
            Tuple of (center_x, center_y, width, height) in normalized coords, or None.
        """
        faces_with_box = [(f, f.bounding_box) for f in faces if f.bounding_box is not None]
        if not faces_with_box:
            return None

        if self.selection_strategy == "largest":
            _, bbox = faces_with_box[0]
        elif self.selection_strategy == "centered":
            frame_center = (0.5, 0.5)
            _, bbox = min(
                faces_with_box,
                key=lambda fb: ((fb[1].center[0] - frame_center[0]) ** 2 + (fb[1].center[1] - frame_center[1]) ** 2),
            )
        elif self.selection_strategy == "index":
            idx = self.face_index if self.face_index < len(faces_with_box) else 0
            _, bbox = faces_with_box[idx]
        else:
            _, bbox = faces_with_box[0]

        return (bbox.center[0], bbox.center[1], bbox.width, bbox.height)

    def detect_and_track(
        self,
        frame: np.ndarray,
        frame_index: int,
    ) -> tuple[float, float, float, float] | None:
        """Detect face in frame and return smoothed position.

        Args:
            frame: Video frame as numpy array (H, W, 3).
            frame_index: Index of current frame.

        Returns:
            Tuple of (center_x, center_y, width, height) in normalized coords,
            or None if no face detected and no fallback available.
        """
        if self._detector is None:
            self._init_detector()
            assert self._detector is not None

        h, w = frame.shape[:2]

        if frame_index % self.detection_interval == 0:
            faces = self._detector.detect(frame)
            face_info = self._select_face(faces, w, h)
            if face_info is not None:
                self._last_position = (face_info[0], face_info[1])
                self._last_size = (face_info[2], face_info[3])
        elif self._last_position is not None and self._last_size is not None:
            face_info = (*self._last_position, *self._last_size)
        else:
            face_info = None

        return self._smooth(face_info)

    def _smooth(
        self,
        face_info: tuple[float, float, float, float] | None,
    ) -> tuple[float, float, float, float] | None:
        """Apply EMA smoothing, or replay the last smoothed value when no detection.

        Returns ``None`` when no detection has been seen yet.
        """
        if face_info is not None:
            cx, cy, fw, fh = face_info
            if self._smoothed_position is None:
                self._smoothed_position = (cx, cy)
                self._smoothed_size = (fw, fh)
            else:
                assert self._smoothed_size is not None
                alpha = 1 - self.smoothing
                self._smoothed_position = (
                    self._smoothed_position[0] * self.smoothing + cx * alpha,
                    self._smoothed_position[1] * self.smoothing + cy * alpha,
                )
                self._smoothed_size = (
                    self._smoothed_size[0] * self.smoothing + fw * alpha,
                    self._smoothed_size[1] * self.smoothing + fh * alpha,
                )
            return (*self._smoothed_position, *self._smoothed_size)

        if self._smoothed_position is not None and self._smoothed_size is not None:
            return (*self._smoothed_position, *self._smoothed_size)
        return None

    def reset(self) -> None:
        """Reset tracker state for a new video."""
        self._last_position = None
        self._last_size = None
        self._smoothed_position = None
        self._smoothed_size = None

    def track_video(
        self,
        frames: np.ndarray,
    ) -> list[tuple[float, float, float, float] | None]:
        """Track the face through a whole clip via batched per-frame detection.

        Detection runs on every frame (the YuNet detector is CPU-only), then each
        frame's selected face is EMA-smoothed.

        Args:
            frames: Video frames array of shape (N, H, W, 3).

        Returns:
            List of face positions (cx, cy, w, h) for each frame, or None where
            no face was detected and no fallback was available.
        """
        if self._detector is None:
            self._init_detector()
            assert self._detector is not None

        n_frames = len(frames)
        if n_frames == 0:
            return []

        h, w = frames[0].shape[:2]

        detections: list[list[DetectedFace]] = []
        for batch_start in range(0, n_frames, self.batch_size):
            batch = [frames[i] for i in range(batch_start, min(batch_start + self.batch_size, n_frames))]
            detections.extend(self._detector.detect_batch(batch))

        faces = [self._select_face(frame_faces, w, h) for frame_faces in detections]
        self.reset()
        return [self._smooth(face_info) for face_info in faces]

__init__

__init__(
    selection_strategy: Literal[
        "largest", "centered", "index"
    ] = "largest",
    face_index: int = 0,
    smoothing: float = 0.8,
    detection_interval: int = 3,
    min_face_size: int = 30,
    batch_size: int = 16,
)

Initialize the smoothing tracker.

Parameters:

Name	Type	Description	Default
`selection_strategy`	`Literal['largest', 'centered', 'index']`	Which face to track — "largest" (biggest box), "centered" (closest to frame center), or "index" (`face_index`).	`'largest'`
`face_index`	`int`	Index of face to track when using the "index" strategy.	`0`
`smoothing`	`float`	Exponential moving average factor (0-1). Higher = smoother.	`0.8`
`detection_interval`	`int`	Run detection every N frames, hold position between.	`3`
`min_face_size`	`int`	Minimum face size in pixels for detection.	`30`
`batch_size`	`int`	Frames per detection batch in `track_video`. Default 16.	`16`
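
To make the `smoothing` parameter concrete: each update keeps `smoothing` of the previous smoothed value and takes `1 - smoothing` from the new detection, and the first detection seeds the state directly. A small sketch on a single coordinate, with `ema_series` as a hypothetical helper name:

```python
def ema_series(values: list[float], smoothing: float) -> list[float]:
    """Exponential moving average; the first sample seeds the state."""
    if not 0.0 <= smoothing <= 1.0:
        raise ValueError("smoothing must be in [0, 1]")
    out: list[float] = []
    state = None
    for value in values:
        if state is None:
            state = value  # first detection: no history to blend with
        else:
            state = state * smoothing + value * (1 - smoothing)
        out.append(state)
    return out
```

With `smoothing=0.8`, a subject that jumps from 0.0 to 1.0 is followed only about 20% of the way on the first update and about 36% after the second; `smoothing=0.0` disables smoothing entirely.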

Source code in src/videopython/ai/understanding/faces.py

def __init__(
    self,
    selection_strategy: Literal["largest", "centered", "index"] = "largest",
    face_index: int = 0,
    smoothing: float = 0.8,
    detection_interval: int = 3,
    min_face_size: int = 30,
    batch_size: int = 16,
):
    """Initialize the smoothing tracker.

    Args:
        selection_strategy: Which face to track — "largest" (biggest box),
            "centered" (closest to frame center), or "index" (``face_index``).
        face_index: Index of face to track when using the "index" strategy.
        smoothing: Exponential moving average factor (0-1). Higher = smoother.
        detection_interval: Run detection every N frames, hold position between.
        min_face_size: Minimum face size in pixels for detection.
        batch_size: Frames per detection batch in ``track_video``. Default 16.
    """
    super().__init__(min_face_size=min_face_size)
    self.selection_strategy = selection_strategy
    self.face_index = face_index
    self.smoothing = smoothing
    self.detection_interval = detection_interval
    self.batch_size = batch_size
    self._last_position: tuple[float, float] | None = None
    self._last_size: tuple[float, float] | None = None
    self._smoothed_position: tuple[float, float] | None = None
    self._smoothed_size: tuple[float, float] | None = None
    logger.info("FaceSmoothingTracker initialized (detection_interval=%s)", self.detection_interval)

detect_and_track

detect_and_track(
    frame: ndarray, frame_index: int
) -> tuple[float, float, float, float] | None

Detect face in frame and return smoothed position.

Parameters:

Name	Type	Description	Default
`frame`	`ndarray`	Video frame as numpy array (H, W, 3).	required
`frame_index`	`int`	Index of current frame.	required

Returns:

Type	Description
`tuple[float, float, float, float] \| None`	Tuple of (center_x, center_y, width, height) in normalized coords, or None if no face detected and no fallback available.

Source code in src/videopython/ai/understanding/faces.py

def detect_and_track(
    self,
    frame: np.ndarray,
    frame_index: int,
) -> tuple[float, float, float, float] | None:
    """Detect face in frame and return smoothed position.

    Args:
        frame: Video frame as numpy array (H, W, 3).
        frame_index: Index of current frame.

    Returns:
        Tuple of (center_x, center_y, width, height) in normalized coords,
        or None if no face detected and no fallback available.
    """
    if self._detector is None:
        self._init_detector()
        assert self._detector is not None

    h, w = frame.shape[:2]

    if frame_index % self.detection_interval == 0:
        faces = self._detector.detect(frame)
        face_info = self._select_face(faces, w, h)
        if face_info is not None:
            self._last_position = (face_info[0], face_info[1])
            self._last_size = (face_info[2], face_info[3])
    elif self._last_position is not None and self._last_size is not None:
        face_info = (*self._last_position, *self._last_size)
    else:
        face_info = None

    return self._smooth(face_info)
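
As the source above shows, the detector is only invoked when `frame_index % detection_interval == 0`; other frames reuse the last raw detection. A short sketch of that schedule, with `detection_frames` as a hypothetical helper name:

```python
def detection_frames(n_frames: int, detection_interval: int) -> list[int]:
    """Frame indices on which a detector call would be made."""
    if detection_interval < 1:
        raise ValueError("detection_interval must be >= 1")
    return [i for i in range(n_frames) if i % detection_interval == 0]
```

With the default interval of 3, a 10-frame clip triggers detection on frames 0, 3, 6 and 9. Held frames still pass through the EMA step, so the smoothed output keeps converging toward the held position between detections.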

reset

reset() -> None

Reset tracker state for a new video.

Source code in src/videopython/ai/understanding/faces.py

def reset(self) -> None:
    """Reset tracker state for a new video."""
    self._last_position = None
    self._last_size = None
    self._smoothed_position = None
    self._smoothed_size = None

track_video

track_video(
    frames: ndarray,
) -> list[tuple[float, float, float, float] | None]

Track the face through a whole clip via batched per-frame detection.

Detection runs on every frame (the YuNet detector is CPU-only), so `detection_interval` is not applied on this path; each frame's selected face is then EMA-smoothed.

Parameters:

Name	Type	Description	Default
`frames`	`ndarray`	Video frames array of shape (N, H, W, 3).	required

Returns:

Type	Description
`list[tuple[float, float, float, float] \| None]`	List of face positions (cx, cy, w, h) for each frame, or None where no face was detected and no fallback was available.

Source code in src/videopython/ai/understanding/faces.py

def track_video(
    self,
    frames: np.ndarray,
) -> list[tuple[float, float, float, float] | None]:
    """Track the face through a whole clip via batched per-frame detection.

    Detection runs on every frame (the YuNet detector is CPU-only), then each
    frame's selected face is EMA-smoothed.

    Args:
        frames: Video frames array of shape (N, H, W, 3).

    Returns:
        List of face positions (cx, cy, w, h) for each frame, or None where
        no face was detected and no fallback was available.
    """
    if self._detector is None:
        self._init_detector()
        assert self._detector is not None

    n_frames = len(frames)
    if n_frames == 0:
        return []

    h, w = frames[0].shape[:2]

    detections: list[list[DetectedFace]] = []
    for batch_start in range(0, n_frames, self.batch_size):
        batch = [frames[i] for i in range(batch_start, min(batch_start + self.batch_size, n_frames))]
        detections.extend(self._detector.detect_batch(batch))

    faces = [self._select_face(frame_faces, w, h) for frame_faces in detections]
    self.reset()
    return [self._smooth(face_info) for face_info in faces]
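
The batching loop in `track_video` is a plain fixed-size slice over the frame sequence, with a shorter final batch when the length is not a multiple of `batch_size`. A generic sketch of that pattern, with `iter_batches` as a hypothetical helper name:

```python
from collections.abc import Iterator, Sequence


def iter_batches(items: Sequence, batch_size: int) -> Iterator[Sequence]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")
    for start in range(0, len(items), batch_size):
        yield items[start : start + batch_size]
```

Slicing past the end of a sequence is safe in Python, so no explicit `min(...)` bound is needed here.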

ObjectDetector

ObjectDetector runs a D-FINE COCO model and returns, for each frame, a list of DetectedObject sorted by descending confidence, each with a normalized bounding box. It is the object-detection counterpart to the face trackers and powers ObjectDetectionOverlay.

The D-FINE weights (Apache-2.0) download from HuggingFace on first use; class names come from the model config. Detection is gated by confidence_threshold and optionally restricted to class_filter. D-FINE uses VOC-style COCO names, so class_filter must use the model's exact spellings (e.g. motorbike, tvmonitor).

from videopython.ai import ObjectDetector
from videopython.base import Video

video = Video.from_path("street.mp4")

detector = ObjectDetector(model_name="ustc-community/dfine-nano-coco", class_filter=("person", "car"))
for obj in detector.detect(video.frames[0]):
    print(f"{obj.label} {obj.confidence:.2f} @ {obj.bounding_box}")

# Batched detection over several frames.
per_frame = detector.detect_batch(video.frames[:16])

ObjectDetector

Bases: DetectorBase[DetectedObject]

Lazy D-FINE COCO object detector returning normalized detections.

The D-FINE weights (default ustc-community/dfine-nano-coco) download from HuggingFace on first real use; class names come from the model config. Detection is gated by confidence_threshold and optionally restricted to class_filter, which is matched exactly against the model's own class names (D-FINE's VOC-style spelling, e.g. motorbike, tvmonitor).

Source code in src/videopython/ai/understanding/objects.py

class ObjectDetector(DetectorBase[DetectedObject]):
    """Lazy D-FINE COCO object detector returning normalized detections.

    The D-FINE weights (default ``ustc-community/dfine-nano-coco``) download from
    HuggingFace on first real use; class names come from the model config.
    Detection is gated by ``confidence_threshold`` and optionally restricted to
    ``class_filter`` (the model's own class names, matched exactly).
    """

    DEFAULT_CONFIDENCE_THRESHOLD = 0.5
    _FEATURE = "ObjectDetector"
    # Override the base sentinel/unload set: hold the model AND the processor.
    _model_attrs = ("_model", "_processor")

    def __init__(
        self,
        model_name: str = DEFAULT_MODEL,
        confidence_threshold: float = DEFAULT_CONFIDENCE_THRESHOLD,
        class_filter: tuple[str, ...] = (),
        backend: Backend = "auto",
    ):
        """Initialize the detector.

        Args:
            model_name: D-FINE COCO HuggingFace repo id (e.g.
                ``ustc-community/dfine-nano-coco``, ``...-small-coco``,
                ``...-medium-coco``, ``...-large-coco``). Downloaded on first use.
            confidence_threshold: Minimum detection confidence in ``[0, 1]``.
            class_filter: If non-empty, only these COCO class names are kept
                (D-FINE's VOC-style spelling, e.g. ``motorbike``/``tvmonitor``).
            backend: Detection device - ``"cpu"``, ``"gpu"``, or ``"auto"``.
        """
        super().__init__(backend=backend)
        self.model_name = model_name
        self.confidence_threshold = confidence_threshold
        self.class_filter = tuple(class_filter)
        self._model: Any = None
        self._processor: Any = None
        self._class_names: dict[int, str] = {}
        logger.info("ObjectDetector initialized with model=%s backend=%s", model_name, backend)

    def _load_model(self) -> None:
        from videopython.ai._optional import require

        tf = require("transformers", feature=self._FEATURE)
        revision = pinned(self.model_name)
        self._processor = tf.AutoImageProcessor.from_pretrained(self.model_name, revision=revision, use_fast=True)
        model = tf.DFineForObjectDetection.from_pretrained(self.model_name, revision=revision)
        model.eval()
        if self._resolve_device() == "cuda":
            model = model.to("cuda")
        self._model = model
        self._class_names = {int(k): v for k, v in model.config.id2label.items()}

    def _infer(self, images: list[np.ndarray]) -> list[list[DetectedObject]]:
        import torch

        device = self._resolve_device()
        inputs = self._processor(images=images, return_tensors="pt")
        if device == "cuda":
            inputs = inputs.to("cuda")
        with torch.no_grad():
            outputs = self._model(**inputs)
        # target_sizes is (height, width) per image; D-FINE letterboxes internally
        # so post-processing needs the original sizes to de-letterbox the boxes.
        target_sizes = torch.tensor([[img.shape[0], img.shape[1]] for img in images], device=device)
        results = self._processor.post_process_object_detection(
            outputs, target_sizes=target_sizes, threshold=self.confidence_threshold
        )
        return [self._parse(result, img.shape[1], img.shape[0]) for result, img in zip(results, images)]

    def _parse(self, result: dict[str, Any], img_w: int, img_h: int) -> list[DetectedObject]:
        detected: list[DetectedObject] = []
        scores = result["scores"].tolist()
        labels = result["labels"].tolist()
        boxes = result["boxes"].tolist()
        for score, label_id, (x1, y1, x2, y2) in zip(scores, labels, boxes):
            label = self._class_names.get(int(label_id), str(int(label_id)))
            if self.class_filter and label not in self.class_filter:
                continue
            # D-FINE boxes can sit slightly outside the frame; clamp before normalizing.
            x1 = min(max(x1, 0.0), img_w)
            x2 = min(max(x2, 0.0), img_w)
            y1 = min(max(y1, 0.0), img_h)
            y2 = min(max(y2, 0.0), img_h)
            detected.append(
                DetectedObject(
                    label=label,
                    confidence=float(score),
                    bounding_box=BoundingBox(
                        x=x1 / img_w,
                        y=y1 / img_h,
                        width=(x2 - x1) / img_w,
                        height=(y2 - y1) / img_h,
                    ),
                )
            )
        detected.sort(key=lambda d: d.confidence, reverse=True)
        return detected
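
The box conversion inside `_parse` can be read in isolation: pixel-space corner coordinates `(x1, y1, x2, y2)` are clamped to the frame and then turned into a normalized `(x, y, width, height)` box. A standalone sketch returning a plain tuple instead of a `BoundingBox`, with `normalize_xyxy` as a hypothetical helper name:

```python
def normalize_xyxy(
    x1: float, y1: float, x2: float, y2: float, img_w: int, img_h: int
) -> tuple[float, float, float, float]:
    """Clamp pixel corners to the frame, then return normalized (x, y, width, height)."""
    x1 = min(max(x1, 0.0), img_w)
    x2 = min(max(x2, 0.0), img_w)
    y1 = min(max(y1, 0.0), img_h)
    y2 = min(max(y2, 0.0), img_h)
    return (x1 / img_w, y1 / img_h, (x2 - x1) / img_w, (y2 - y1) / img_h)
```

Clamping before dividing keeps every component within `[0, 1]` even when the model predicts a box that pokes slightly outside the frame.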

__init__

__init__(
    model_name: str = DEFAULT_MODEL,
    confidence_threshold: float = DEFAULT_CONFIDENCE_THRESHOLD,
    class_filter: tuple[str, ...] = (),
    backend: Backend = "auto",
)

Initialize the detector.

Parameters:

Name	Type	Description	Default
`model_name`	`str`	D-FINE COCO HuggingFace repo id (e.g. `ustc-community/dfine-nano-coco`, `...-small-coco`, `...-medium-coco`, `...-large-coco`). Downloaded on first use.	`DEFAULT_MODEL`
`confidence_threshold`	`float`	Minimum detection confidence in `[0, 1]`.	`DEFAULT_CONFIDENCE_THRESHOLD`
`class_filter`	`tuple[str, ...]`	If non-empty, only these COCO class names are kept (D-FINE's VOC-style spelling, e.g. `motorbike`/`tvmonitor`).	`()`
`backend`	`Backend`	Detection device - `"cpu"`, `"gpu"`, or `"auto"`.	`'auto'`

Source code in src/videopython/ai/understanding/objects.py

def __init__(
    self,
    model_name: str = DEFAULT_MODEL,
    confidence_threshold: float = DEFAULT_CONFIDENCE_THRESHOLD,
    class_filter: tuple[str, ...] = (),
    backend: Backend = "auto",
):
    """Initialize the detector.

    Args:
        model_name: D-FINE COCO HuggingFace repo id (e.g.
            ``ustc-community/dfine-nano-coco``, ``...-small-coco``,
            ``...-medium-coco``, ``...-large-coco``). Downloaded on first use.
        confidence_threshold: Minimum detection confidence in ``[0, 1]``.
        class_filter: If non-empty, only these COCO class names are kept
            (D-FINE's VOC-style spelling, e.g. ``motorbike``/``tvmonitor``).
        backend: Detection device - ``"cpu"``, ``"gpu"``, or ``"auto"``.
    """
    super().__init__(backend=backend)
    self.model_name = model_name
    self.confidence_threshold = confidence_threshold
    self.class_filter = tuple(class_filter)
    self._model: Any = None
    self._processor: Any = None
    self._class_names: dict[int, str] = {}
    logger.info("ObjectDetector initialized with model=%s backend=%s", model_name, backend)

Scene Data Classes

These classes are used by scene and audio analyzers to represent analysis results:

SceneBoundary

SceneBoundary `dataclass`

Timing information for a detected scene.

A lightweight structure representing scene boundaries returned by scene detectors (e.g. videopython.ai.SemanticSceneDetector). This is a backbone type — higher-level scene analysis lives in orchestration packages.

Attributes:

Name	Type	Description
`start`	`float`	Scene start time in seconds
`end`	`float`	Scene end time in seconds
`start_frame`	`int`	Index of the first frame in this scene
`end_frame`	`int`	Index one past the last frame in this scene (exclusive end)

Source code in src/videopython/base/description.py

@dataclass
class SceneBoundary:
    """Timing information for a detected scene.

    A lightweight structure representing scene boundaries returned by
    scene detectors (e.g. ``videopython.ai.SemanticSceneDetector``). This
    is a backbone type — higher-level scene analysis lives in orchestration
    packages.

    Attributes:
        start: Scene start time in seconds
        end: Scene end time in seconds
        start_frame: Index of the first frame in this scene
        end_frame: Index one past the last frame in this scene (exclusive end)
    """

    start: float
    end: float
    start_frame: int
    end_frame: int

    @property
    def duration(self) -> float:
        """Duration of the scene in seconds."""
        return self.end - self.start

    @property
    def frame_count(self) -> int:
        """Number of frames in this scene."""
        return self.end_frame - self.start_frame

    def to_dict(self) -> dict[str, Any]:
        """Convert to dictionary for JSON serialization."""
        return {
            "start": self.start,
            "end": self.end,
            "start_frame": self.start_frame,
            "end_frame": self.end_frame,
        }

    @classmethod
    def from_dict(cls, data: dict[str, Any]) -> "SceneBoundary":
        """Create SceneBoundary from dictionary."""
        return cls(
            start=data["start"],
            end=data["end"],
            start_frame=data["start_frame"],
            end_frame=data["end_frame"],
        )
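
Because `end_frame` is exclusive, `frame_count` is simply `end_frame - start_frame`. The sketch below builds the four fields from a half-open frame range. It assumes times are derived as `frame_index / fps`, which this section does not state, and `boundary_fields` is a hypothetical helper rather than part of the library.

```python
def boundary_fields(start_frame: int, end_frame: int, fps: float) -> dict:
    """Build SceneBoundary-shaped fields from a half-open frame range [start, end)."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    if end_frame < start_frame:
        raise ValueError("end_frame must be >= start_frame")
    return {
        "start": start_frame / fps,  # assumption: time = frame index / fps
        "end": end_frame / fps,
        "start_frame": start_frame,
        "end_frame": end_frame,
    }
```

The resulting dictionary has the same keys that `SceneBoundary.from_dict` reads.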

duration `property`

duration: float

Duration of the scene in seconds.

frame_count `property`

frame_count: int

Number of frames in this scene.

to_dict

to_dict() -> dict[str, Any]

Convert to dictionary for JSON serialization.

Source code in src/videopython/base/description.py

def to_dict(self) -> dict[str, Any]:
    """Convert to dictionary for JSON serialization."""
    return {
        "start": self.start,
        "end": self.end,
        "start_frame": self.start_frame,
        "end_frame": self.end_frame,
    }

from_dict `classmethod`

from_dict(data: dict[str, Any]) -> 'SceneBoundary'

Create SceneBoundary from dictionary.

Source code in src/videopython/base/description.py

@classmethod
def from_dict(cls, data: dict[str, Any]) -> "SceneBoundary":
    """Create SceneBoundary from dictionary."""
    return cls(
        start=data["start"],
        end=data["end"],
        start_frame=data["start_frame"],
        end_frame=data["end_frame"],
    )

SceneDescription

SceneDescription `dataclass`

Structured visual scene description from the SceneVLM.

The v1 schema is intentionally narrow (caption + subjects + shot_type). Wider schemas lower the JSON parse rate on small models, and there is no eval data yet to justify that cost. Fields will be added in v2 as parse-rate measurements justify them: closed enums first, open lists last.

Attributes:

Name	Type	Description
`caption`	`str`	One-sentence summary of the scene.
`subjects`	`list[str]`	Open list of named subjects visible in the frames.
`shot_type`	`str \| None`	Closed enum framing the camera distance, or None when JSON parsing fell back to raw text.

Source code in src/videopython/base/description.py

@dataclass
class SceneDescription:
    """Structured visual scene description from the SceneVLM.

    The v1 schema is intentionally narrow (caption + subjects + shot_type).
    Wider schemas lower the JSON parse rate on small models, and there is no
    eval data yet to justify that cost. Fields will be added in v2 as
    parse-rate measurements justify them: closed enums first, open lists last.

    Attributes:
        caption: One-sentence summary of the scene.
        subjects: Open list of named subjects visible in the frames.
        shot_type: Closed enum framing the camera distance, or None
            when JSON parsing fell back to raw text.
    """

    caption: str
    subjects: list[str] = field(default_factory=list)
    shot_type: str | None = None

    def to_dict(self) -> dict[str, Any]:
        return {
            "caption": self.caption,
            "subjects": list(self.subjects),
            "shot_type": self.shot_type,
        }

    @classmethod
    def from_dict(cls, data: dict[str, Any]) -> "SceneDescription":
        return cls(
            caption=str(data["caption"]),
            subjects=[str(s) for s in data.get("subjects", [])],
            shot_type=data.get("shot_type"),
        )
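
The `shot_type` attribute is documented as `None` "when JSON parsing fell back to raw text". The actual fallback logic lives in SceneVLM and is not shown here; the sketch below is one plausible shape for it, under the assumption that an unparseable reply becomes the caption verbatim. `parse_scene_json` is a hypothetical name.

```python
import json


def parse_scene_json(raw: str) -> dict:
    """Parse a model reply into SceneDescription-shaped fields, else fall back to raw text."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        data = None
    if not isinstance(data, dict) or "caption" not in data:
        # Fallback: keep the raw reply as the caption, leave structured fields empty.
        return {"caption": raw.strip(), "subjects": [], "shot_type": None}
    return {
        "caption": str(data["caption"]),
        "subjects": [str(s) for s in (data.get("subjects") or [])],
        "shot_type": data.get("shot_type"),
    }
```

Either branch yields a dictionary that `SceneDescription.from_dict` accepts, since only `caption` is required there.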

BoundingBox

Bases: BaseModel

A bounding box for detected objects or crop regions in an image.

Coordinates are normalized to [0, 1] relative to image dimensions. Promoted to a Pydantic model so it can be embedded directly into Operation fields (e.g. KenBurns.start_region) and validated / serialised as part of an op's JSON wire format.

Source code in src/videopython/base/description.py

class BoundingBox(BaseModel):
    """A bounding box for detected objects or crop regions in an image.

    Coordinates are normalized to ``[0, 1]`` relative to image dimensions.
    Promoted to a Pydantic model so it can be embedded directly into
    ``Operation`` fields (e.g. ``KenBurns.start_region``) and validated /
    serialised as part of an op's JSON wire format.
    """

    model_config = ConfigDict(extra="forbid", frozen=True)

    x: float = Field(description="Left edge of the box, 0=left of the image.")
    y: float = Field(description="Top edge of the box, 0=top of the image.")
    width: float = Field(description="Width of the box, normalized to image width.")
    height: float = Field(description="Height of the box, normalized to image height.")

    @property
    def center(self) -> tuple[float, float]:
        """Center point of the bounding box."""
        return (self.x + self.width / 2, self.y + self.height / 2)

    @property
    def area(self) -> float:
        """Area of the bounding box (normalized)."""
        return self.width * self.height

    def to_dict(self) -> dict[str, Any]:
        """Backwards-compat alias for ``model_dump()``."""
        return self.model_dump()

    @classmethod
    def from_dict(cls, data: dict[str, Any]) -> BoundingBox:
        """Backwards-compat alias for ``model_validate(data)``."""
        return cls.model_validate(data)
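
Since all four fields are normalized to `[0, 1]`, drawing or cropping requires scaling back to pixels. A minimal sketch of that conversion on plain floats; `to_pixel_rect` is a hypothetical helper, and rounding to the nearest pixel is an illustrative choice rather than the library's convention.

```python
def to_pixel_rect(
    x: float, y: float, width: float, height: float, img_w: int, img_h: int
) -> tuple[int, int, int, int]:
    """Convert a normalized box to integer (left, top, right, bottom) pixel coordinates."""
    left = round(x * img_w)
    top = round(y * img_h)
    right = round((x + width) * img_w)
    bottom = round((y + height) * img_h)
    return (left, top, right, bottom)
```

For a box with `x=0.25, y=0.5, width=0.5, height=0.25` on a 1920x1080 frame this gives `(480, 540, 1440, 810)`, and the corresponding `center` would be `(0.5, 0.625)`.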

center `property`

center: tuple[float, float]

Center point of the bounding box.

area `property`

area: float

Area of the bounding box (normalized).

to_dict

to_dict() -> dict[str, Any]

Backwards-compat alias for model_dump().

Source code in src/videopython/base/description.py

def to_dict(self) -> dict[str, Any]:
    """Backwards-compat alias for ``model_dump()``."""
    return self.model_dump()

from_dict `classmethod`

from_dict(data: dict[str, Any]) -> BoundingBox

Backwards-compat alias for model_validate(data).

Source code in src/videopython/base/description.py

@classmethod
def from_dict(cls, data: dict[str, Any]) -> BoundingBox:
    """Backwards-compat alias for ``model_validate(data)``."""
    return cls.model_validate(data)

DetectedObject

DetectedObject `dataclass`

An object detected in a video frame.

Attributes:

Name	Type	Description
`label`	`str`	Name/class of the detected object (e.g., "person", "car", "dog")
`confidence`	`float`	Detection confidence score between 0 and 1
`bounding_box`	`BoundingBox \| None`	Optional bounding box location of the object

Source code in src/videopython/base/description.py

@dataclass
class DetectedObject:
    """An object detected in a video frame.

    Attributes:
        label: Name/class of the detected object (e.g., "person", "car", "dog")
        confidence: Detection confidence score between 0 and 1
        bounding_box: Optional bounding box location of the object
    """

    label: str
    confidence: float
    bounding_box: BoundingBox | None = None

    def to_dict(self) -> dict[str, Any]:
        """Convert to dictionary for JSON serialization."""
        return {
            "label": self.label,
            "confidence": self.confidence,
            "bounding_box": self.bounding_box.to_dict() if self.bounding_box else None,
        }

    @classmethod
    def from_dict(cls, data: dict[str, Any]) -> DetectedObject:
        """Create DetectedObject from dictionary."""
        return cls(
            label=data["label"],
            confidence=data["confidence"],
            bounding_box=BoundingBox.from_dict(data["bounding_box"]) if data.get("bounding_box") else None,
        )

to_dict

to_dict() -> dict[str, Any]

Convert to dictionary for JSON serialization.

Source code in src/videopython/base/description.py

def to_dict(self) -> dict[str, Any]:
    """Convert to dictionary for JSON serialization."""
    return {
        "label": self.label,
        "confidence": self.confidence,
        "bounding_box": self.bounding_box.to_dict() if self.bounding_box else None,
    }

from_dict `classmethod`

from_dict(data: dict[str, Any]) -> DetectedObject

Create DetectedObject from dictionary.

Source code in src/videopython/base/description.py

@classmethod
def from_dict(cls, data: dict[str, Any]) -> DetectedObject:
    """Create DetectedObject from dictionary."""
    return cls(
        label=data["label"],
        confidence=data["confidence"],
        bounding_box=BoundingBox.from_dict(data["bounding_box"]) if data.get("bounding_box") else None,
    )
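
The dictionaries returned by `to_dict()` are plain JSON-compatible data, so they can be stored and filtered without the library. The sketch below works on hand-written payloads in that shape; the sample values are invented, and the `x`/`y`/`width`/`height` keys inside `bounding_box` are an assumption based on the `BoundingBox` properties above.

```python
import json

# Payloads in the shape produced by DetectedObject.to_dict(); values are invented.
payload = [
    {
        "label": "person",
        "confidence": 0.92,
        # Assumed BoundingBox dump shape (normalized 0-1 coordinates).
        "bounding_box": {"x": 0.1, "y": 0.2, "width": 0.3, "height": 0.5},
    },
    {"label": "dog", "confidence": 0.41, "bounding_box": None},
    {"label": "car", "confidence": 0.77, "bounding_box": None},
]

# Round-trip through a JSON string, as a cache or API response would.
restored = json.loads(json.dumps(payload))

# Keep only detections at or above a caller-chosen confidence threshold.
MIN_CONFIDENCE = 0.5
confident_labels = [d["label"] for d in restored if d["confidence"] >= MIN_CONFIDENCE]
```

With the real classes, each restored dictionary could then be passed to `DetectedObject.from_dict`, which tolerates a missing or `None` `bounding_box`.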

DetectedFace

DetectedFace `dataclass`

A face detected in a video frame.

Attributes:

Name	Type	Description
`bounding_box`	`BoundingBox | None`	Bounding box location of the face (normalized 0-1 coordinates). May be None for cloud backends that only return face counts.
`confidence`	`float`	Detection confidence score between 0 and 1

Source code in src/videopython/base/description.py

@dataclass
class DetectedFace:
    """A face detected in a video frame.

    Attributes:
        bounding_box: Bounding box location of the face (normalized 0-1 coordinates).
            May be None for cloud backends that only return face counts.
        confidence: Detection confidence score between 0 and 1
    """

    bounding_box: BoundingBox | None = None
    confidence: float = 1.0

    @property
    def center(self) -> tuple[float, float] | None:
        """Center point of the face bounding box, or None if no bounding box."""
        return self.bounding_box.center if self.bounding_box else None

    @property
    def area(self) -> float | None:
        """Area of the face bounding box (normalized), or None if no bounding box."""
        return self.bounding_box.area if self.bounding_box else None

    def to_dict(self) -> dict[str, Any]:
        """Convert to dictionary for JSON serialization."""
        return {
            "bounding_box": self.bounding_box.to_dict() if self.bounding_box else None,
            "confidence": self.confidence,
        }

    @classmethod
    def from_dict(cls, data: dict[str, Any]) -> DetectedFace:
        """Create DetectedFace from dictionary."""
        return cls(
            bounding_box=BoundingBox.from_dict(data["bounding_box"]) if data.get("bounding_box") else None,
            confidence=data.get("confidence", 1.0),
        )

center `property`

center: tuple[float, float] | None

Center point of the face bounding box, or None if no bounding box.

area `property`

area: float | None

Area of the face bounding box (normalized), or None if no bounding box.

to_dict

to_dict() -> dict[str, Any]

Convert to dictionary for JSON serialization.

Source code in src/videopython/base/description.py

def to_dict(self) -> dict[str, Any]:
    """Convert to dictionary for JSON serialization."""
    return {
        "bounding_box": self.bounding_box.to_dict() if self.bounding_box else None,
        "confidence": self.confidence,
    }

from_dict `classmethod`

from_dict(data: dict[str, Any]) -> DetectedFace

Create DetectedFace from dictionary.

Source code in src/videopython/base/description.py

@classmethod
def from_dict(cls, data: dict[str, Any]) -> DetectedFace:
    """Create DetectedFace from dictionary."""
    return cls(
        bounding_box=BoundingBox.from_dict(data["bounding_box"]) if data.get("bounding_box") else None,
        confidence=data.get("confidence", 1.0),
    )
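
Since `center` and `area` can be `None`, callers need to guard before comparing faces. A minimal sketch of picking the largest face, using a hypothetical `FaceSketch` stand-in that exposes only the optional `area` and the `confidence` field (not the real `DetectedFace` class):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FaceSketch:
    """Hypothetical stand-in exposing the same optional `area` as DetectedFace."""

    area: Optional[float]
    confidence: float = 1.0


faces = [
    FaceSketch(area=0.02),
    FaceSketch(area=None),  # e.g. a backend that reported a face without a box
    FaceSketch(area=0.09, confidence=0.8),
]

# Compare against None explicitly: a zero-area box is falsy but still a box.
with_boxes = [face for face in faces if face.area is not None]

# `default=None` keeps this safe when no face has a bounding box at all.
largest = max(with_boxes, key=lambda face: face.area, default=None)
```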

DetectedText

DetectedText `dataclass`

Text detected in a video frame.

Attributes:

Name	Type	Description
`text`	`str`	OCR text content
`confidence`	`float`	Detection confidence score between 0 and 1
`bounding_box`	`BoundingBox | None`	Optional normalized bounding box for the text region

Source code in src/videopython/base/description.py

@dataclass
class DetectedText:
    """Text detected in a video frame.

    Attributes:
        text: OCR text content
        confidence: Detection confidence score between 0 and 1
        bounding_box: Optional normalized bounding box for the text region
    """

    text: str
    confidence: float
    bounding_box: BoundingBox | None = None

    def to_dict(self) -> dict[str, Any]:
        """Convert to dictionary for JSON serialization."""
        return {
            "text": self.text,
            "confidence": self.confidence,
            "bounding_box": self.bounding_box.to_dict() if self.bounding_box else None,
        }

    @classmethod
    def from_dict(cls, data: dict[str, Any]) -> "DetectedText":
        """Create DetectedText from dictionary."""
        return cls(
            text=data["text"],
            confidence=data["confidence"],
            bounding_box=BoundingBox.from_dict(data["bounding_box"]) if data.get("bounding_box") else None,
        )

to_dict

to_dict() -> dict[str, Any]

Convert to dictionary for JSON serialization.

Source code in src/videopython/base/description.py

def to_dict(self) -> dict[str, Any]:
    """Convert to dictionary for JSON serialization."""
    return {
        "text": self.text,
        "confidence": self.confidence,
        "bounding_box": self.bounding_box.to_dict() if self.bounding_box else None,
    }

from_dict `classmethod`

from_dict(data: dict[str, Any]) -> 'DetectedText'

Create DetectedText from dictionary.

Source code in src/videopython/base/description.py

@classmethod
def from_dict(cls, data: dict[str, Any]) -> "DetectedText":
    """Create DetectedText from dictionary."""
    return cls(
        text=data["text"],
        confidence=data["confidence"],
        bounding_box=BoundingBox.from_dict(data["bounding_box"]) if data.get("bounding_box") else None,
    )
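
To crop or draw a detected text region, the normalized box has to be scaled to pixel coordinates. The helper below is a sketch, not part of videopython: `to_pixel_rect` is a hypothetical name, and it assumes the box dictionary uses `x`/`y` for the top-left corner plus `width`/`height`, which matches how `BoundingBox.center` is computed.

```python
def to_pixel_rect(
    box: dict[str, float], frame_width: int, frame_height: int
) -> tuple[int, int, int, int]:
    """Scale a normalized box to (left, top, right, bottom) pixel coordinates."""
    left = round(box["x"] * frame_width)
    top = round(box["y"] * frame_height)
    right = round((box["x"] + box["width"]) * frame_width)
    bottom = round((box["y"] + box["height"]) * frame_height)
    return left, top, right, bottom


# Invented example: a caption strip in the lower half of a 1920x1080 frame.
text_box = {"x": 0.25, "y": 0.5, "width": 0.5, "height": 0.25}
rect = to_pixel_rect(text_box, frame_width=1920, frame_height=1080)
```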

FaceTrack

FaceTrack `dataclass`

A face tracked across consecutive frames within a single shot.

Tracks are produced by IoU association with no embedding re-identification, so a track does not survive across shot/scene boundaries. `frame_indices` and `boxes` are parallel lists of equal length.

Attributes:

Name	Type	Description
`track_id`	`int`	Stable id within the shot the track was produced in. Not globally unique across scenes.
`frame_indices`	`list[int]`	Source-video frame indices for each detection.
`boxes`	`list[BoundingBox]`	Per-frame bounding boxes (normalized 0-1 coords).
`confidences`	`list[float]`	Per-frame detection confidence in [0, 1].

Source code in src/videopython/base/description.py

@dataclass
class FaceTrack:
    """A face tracked across consecutive frames within a single shot.

    Tracks are produced by IoU association — no embedding re-id, so a
    track does not survive across shot/scene boundaries. ``frame_indices``
    and ``boxes`` are parallel lists of equal length.

    Attributes:
        track_id: Stable id within the shot the track was produced in.
            Not globally unique across scenes.
        frame_indices: Source-video frame indices for each detection.
        boxes: Per-frame bounding boxes (normalized 0-1 coords).
        confidences: Per-frame detection confidence in [0, 1].
    """

    track_id: int
    frame_indices: list[int]
    boxes: list[BoundingBox]
    confidences: list[float] = field(default_factory=list)

    @property
    def length(self) -> int:
        """Number of frames in this track."""
        return len(self.frame_indices)

    def to_dict(self) -> dict[str, Any]:
        return {
            "track_id": self.track_id,
            "frame_indices": list(self.frame_indices),
            "boxes": [box.to_dict() for box in self.boxes],
            "confidences": list(self.confidences),
        }

    @classmethod
    def from_dict(cls, data: dict[str, Any]) -> "FaceTrack":
        return cls(
            track_id=int(data["track_id"]),
            frame_indices=[int(i) for i in data.get("frame_indices", [])],
            boxes=[BoundingBox.from_dict(b) for b in data.get("boxes", [])],
            confidences=[float(c) for c in data.get("confidences", [])],
        )

length `property`

length: int

Number of frames in this track.
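
For readers unfamiliar with IoU association, the sketch below computes intersection-over-union for two normalized boxes. It illustrates the metric only; the matching strategy and threshold used by the trackers are not shown in this section, and the `(x, y, width, height)` tuple layout is chosen here for brevity.

```python
Box = tuple[float, float, float, float]  # (x, y, width, height), normalized 0-1


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    a_right, a_bottom = a[0] + a[2], a[1] + a[3]
    b_right, b_bottom = b[0] + b[2], b[1] + b[3]

    # Overlap extent along each axis, clamped to zero when the boxes are disjoint.
    overlap_w = max(0.0, min(a_right, b_right) - max(a[0], b[0]))
    overlap_h = max(0.0, min(a_bottom, b_bottom) - max(a[1], b[1]))
    intersection = overlap_w * overlap_h

    union = a[2] * a[3] + b[2] * b[3] - intersection
    return intersection / union if union > 0 else 0.0


previous_box = (0.0, 0.0, 0.5, 0.5)
shifted_box = (0.25, 0.0, 0.5, 0.5)  # same face, moved right by half its width
distant_box = (0.6, 0.6, 0.2, 0.2)   # no overlap with previous_box

overlap_score = iou(previous_box, shifted_box)   # about 0.333
disjoint_score = iou(previous_box, distant_box)  # 0.0
```

A tracker built on this metric would typically extend a track with the detection that scores highest against the track's last box, which is why identity is lost once a cut removes all overlap.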

AudioEvent

AudioEvent `dataclass`

A detected audio event with timestamp.

Attributes:

Name	Type	Description
`start`	`float`	Start time in seconds
`end`	`float`	End time in seconds
`label`	`str`	Name of the detected sound (e.g., "Music", "Speech", "Dog bark")
`confidence`	`float`	Detection confidence score between 0 and 1

Source code in src/videopython/base/description.py

@dataclass
class AudioEvent:
    """A detected audio event with timestamp.

    Attributes:
        start: Start time in seconds
        end: End time in seconds
        label: Name of the detected sound (e.g., "Music", "Speech", "Dog bark")
        confidence: Detection confidence score between 0 and 1
    """

    start: float
    end: float
    label: str
    confidence: float

    @property
    def duration(self) -> float:
        """Duration of the audio event in seconds."""
        return self.end - self.start

    def to_dict(self) -> dict[str, Any]:
        """Convert to dictionary for JSON serialization."""
        return {
            "start": self.start,
            "end": self.end,
            "label": self.label,
            "confidence": self.confidence,
        }

    @classmethod
    def from_dict(cls, data: dict[str, Any]) -> AudioEvent:
        """Create AudioEvent from dictionary."""
        return cls(
            start=data["start"],
            end=data["end"],
            label=data["label"],
            confidence=data["confidence"],
        )

duration `property`

duration: float

Duration of the audio event in seconds.

to_dict

to_dict() -> dict[str, Any]

Convert to dictionary for JSON serialization.

Source code in src/videopython/base/description.py

def to_dict(self) -> dict[str, Any]:
    """Convert to dictionary for JSON serialization."""
    return {
        "start": self.start,
        "end": self.end,
        "label": self.label,
        "confidence": self.confidence,
    }

from_dict `classmethod`

from_dict(data: dict[str, Any]) -> AudioEvent

Create AudioEvent from dictionary.

Source code in src/videopython/base/description.py

@classmethod
def from_dict(cls, data: dict[str, Any]) -> AudioEvent:
    """Create AudioEvent from dictionary."""
    return cls(
        start=data["start"],
        end=data["end"],
        label=data["label"],
        confidence=data["confidence"],
    )
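
A common use of per-event timestamps is totalling how long each sound class is present. The sketch below works on dictionaries in the `AudioEvent.to_dict()` shape; the events are invented and `seconds_per_label` is a hypothetical helper, not a library function.

```python
from collections import defaultdict

# Events in the shape produced by AudioEvent.to_dict(); values are invented.
events = [
    {"start": 0.0, "end": 2.5, "label": "Music", "confidence": 0.9},
    {"start": 2.5, "end": 4.0, "label": "Speech", "confidence": 0.8},
    {"start": 6.0, "end": 7.5, "label": "Music", "confidence": 0.3},
]


def seconds_per_label(
    events: list[dict], min_confidence: float = 0.5
) -> dict[str, float]:
    """Sum event durations (end - start) per label, skipping weak detections."""
    totals: dict[str, float] = defaultdict(float)
    for event in events:
        if event["confidence"] >= min_confidence:
            totals[event["label"]] += event["end"] - event["start"]
    return dict(totals)


totals = seconds_per_label(events)
```

The low-confidence "Music" event at 6.0 s is dropped, so only the first 2.5 s of music is counted.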

AudioClassification

AudioClassification `dataclass`

Complete audio classification results.

Attributes:

Name	Type	Description
`events`	`list[AudioEvent]`	List of detected audio events with timestamps
`clip_predictions`	`dict[str, float]`	Overall class probabilities for the entire audio clip

Source code in src/videopython/base/description.py

@dataclass
class AudioClassification:
    """Complete audio classification results.

    Attributes:
        events: List of detected audio events with timestamps
        clip_predictions: Overall class probabilities for the entire audio clip
    """

    events: list[AudioEvent]
    clip_predictions: dict[str, float] = field(default_factory=dict)

    def to_dict(self) -> dict[str, Any]:
        """Convert to dictionary for JSON serialization."""
        return {
            "events": [event.to_dict() for event in self.events],
            "clip_predictions": self.clip_predictions,
        }

    @classmethod
    def from_dict(cls, data: dict[str, Any]) -> "AudioClassification":
        """Create AudioClassification from dictionary."""
        return cls(
            events=[AudioEvent.from_dict(event) for event in data.get("events", [])],
            clip_predictions={k: float(v) for k, v in data.get("clip_predictions", {}).items()},
        )

to_dict

to_dict() -> dict[str, Any]

Convert to dictionary for JSON serialization.

Source code in src/videopython/base/description.py

def to_dict(self) -> dict[str, Any]:
    """Convert to dictionary for JSON serialization."""
    return {
        "events": [event.to_dict() for event in self.events],
        "clip_predictions": self.clip_predictions,
    }

from_dict `classmethod`

from_dict(data: dict[str, Any]) -> 'AudioClassification'

Create AudioClassification from dictionary.

Source code in src/videopython/base/description.py

@classmethod
def from_dict(cls, data: dict[str, Any]) -> "AudioClassification":
    """Create AudioClassification from dictionary."""
    return cls(
        events=[AudioEvent.from_dict(event) for event in data.get("events", [])],
        clip_predictions={k: float(v) for k, v in data.get("clip_predictions", {}).items()},
    )
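
`from_dict` coerces every `clip_predictions` value with `float(...)`, which matters when the payload comes from a store that serialized numbers as strings or integers. A minimal sketch of the same coercion followed by a top-k lookup, using an invented JSON payload:

```python
import json

# Invented payload; note the string and integer scores.
raw = '{"events": [], "clip_predictions": {"Music": "0.7", "Speech": 0.2, "Silence": 0}}'
data = json.loads(raw)

# Same coercion as AudioClassification.from_dict applies to clip_predictions.
clip_predictions = {
    label: float(score) for label, score in data.get("clip_predictions", {}).items()
}

# Two most probable clip-level classes, highest score first.
top_two = sorted(clip_predictions.items(), key=lambda item: item[1], reverse=True)[:2]
```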
