AI Effects

AI-powered, shape-preserving effects. Unlike plain effects, these run a model on the frames they process, so they live in videopython.ai and the core editing layer keeps no AI dependency.

ObjectDetectionOverlay

Detects objects in every frame with a YOLOv8-COCO model and composites tidy, colour-coded bounding boxes with class labels (and optional confidence). The detector (ObjectDetector) is constructed internally; the box/label drawing is done by the AI-free renderer videopython.base.draw_detections.

from videopython.ai import ObjectDetectionOverlay
from videopython.editing import VideoEdit, SegmentConfig

# Default: per-class colours, confidence shown, detect every 2nd frame.
edit = VideoEdit(segments=[SegmentConfig(source="street.mp4", start=0, end=5, operations=[
    ObjectDetectionOverlay(),
])])
edit.run_to_file("annotated.mp4")

# Only people and cars, detect every frame, larger model for accuracy.
edit = VideoEdit(segments=[SegmentConfig(source="street.mp4", start=0, end=5, operations=[
    ObjectDetectionOverlay(
        class_filter=["person", "car"],
        detection_interval=1,
        model_size="s",
    ),
])])
edit.run_to_file("annotated.mp4")

In a JSON editing plan (it is exposed in the LLM-facing schema):

{
  "op": "object_detection_overlay",
  "class_filter": ["person", "car", "dog"],
  "confidence_threshold": 0.4,
  "detection_interval": 2,
  "window": {"start": 0, "stop": 5}
}
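Before a plan like this reaches the editor, it can help to check it against the bounds documented for the effect's fields. The snippet below is only a sketch of such a check using the standard library: check_overlay_op is a hypothetical helper, not part of videopython, and the bounds are copied by hand from the field definitions further down this page.

```python
import json

# Allowed values copied by hand from the documented field definitions.
ALLOWED_MODEL_SIZES = {"n", "s", "m"}


def check_overlay_op(op: dict) -> list[str]:
    """Hypothetical helper: return a list of problems, empty when the op looks valid."""
    problems = []
    if op.get("op") != "object_detection_overlay":
        problems.append("op must be 'object_detection_overlay'")
    threshold = op.get("confidence_threshold", 0.5)
    if not 0 <= threshold <= 1:
        problems.append("confidence_threshold must be within 0-1")
    interval = op.get("detection_interval", 2)
    if not isinstance(interval, int) or interval < 1:
        problems.append("detection_interval must be an integer >= 1")
    if op.get("model_size", "n") not in ALLOWED_MODEL_SIZES:
        problems.append("model_size must be 'n', 's' or 'm'")
    return problems


plan_op = json.loads("""
{
  "op": "object_detection_overlay",
  "class_filter": ["person", "car", "dog"],
  "confidence_threshold": 0.4,
  "detection_interval": 2,
  "window": {"start": 0, "stop": 5}
}
""")
print(check_overlay_op(plan_op))  # prints []
```

The library performs its own validation when the plan is loaded; a pre-check like this is only useful for rejecting bad plans early, for example in a service that accepts plans from an LLM.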

Performance

object_detection_overlay is streamable — memory stays bounded on long clips — but detection is compute-bound: a YOLO forward pass runs per sampled frame. To cap cost:

  • window: restrict the overlay (and therefore detection) to a time range.
  • detection_interval: run detection every Nth frame and hold the boxes in between (default 2). Higher is faster, but fast-moving objects show more lag.
  • class_filter: keep only the listed classes, so fewer boxes are drawn.
  • model_size: "n" (nano, default, fastest) → "s" (small) → "m" (medium, most accurate).
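To get a feel for how window and detection_interval trade off, the number of detector forward passes can be estimated with simple arithmetic. This is a rough sketch, not library code: estimated_forward_passes is a hypothetical helper, and it assumes the window contains round(seconds × fps) frames and that detection runs on every frame whose 0-based index within the window is a multiple of detection_interval.

```python
import math


def estimated_forward_passes(window_seconds: float, fps: float, detection_interval: int = 2) -> int:
    """Count frames in the window whose 0-based index is a multiple of detection_interval."""
    if detection_interval < 1:
        raise ValueError("detection_interval must be >= 1")
    # Assumption: the window holds round(seconds * fps) frames.
    total_frames = round(window_seconds * fps)
    return math.ceil(total_frames / detection_interval)


# A 5-second window at 30 fps is 150 frames.
print(estimated_forward_passes(5, 30, detection_interval=1))  # 150
print(estimated_forward_passes(5, 30, detection_interval=2))  # 75
print(estimated_forward_passes(5, 30, detection_interval=4))  # 38
```

Halving the window or doubling the interval both roughly halve the number of passes; the per-pass cost is then set by model_size and the device.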

ObjectDetectionOverlay

Bases: Effect

Detect objects per frame and overlay labelled bounding boxes.

Runs a YOLOv8-COCO detector and composites tidy, colour-coded boxes with class labels (and optional confidence) onto every frame in the window.

Detection runs on a detection_interval cadence in the streaming path and boxes are held between detections, so the cost is compute-bound, not memory-bound: "streamable" here means bounded memory, not bounded compute. On long clips, cap cost with window (limit the time range), a larger detection_interval, a class_filter, and/or the smaller model_size. Only streaming_init and process_frame are overridden; the streaming engine drives that contract for bounded-memory execution.

Source code in src/videopython/ai/effects.py
class ObjectDetectionOverlay(Effect):
    """Detect objects per frame and overlay labelled bounding boxes.

    Runs a YOLOv8-COCO detector and composites tidy, colour-coded boxes with
    class labels (and optional confidence) onto every frame in the window.

    Detection runs on a ``detection_interval`` cadence in the streaming path and
    boxes are held between detections, so the cost is *compute*-bound, not
    *memory*-bound: ``"streamable"`` here means bounded memory, not bounded
    compute. On long clips, cap cost with ``window`` (limit the time range),
    a larger ``detection_interval``, a ``class_filter``, and/or the smaller
    ``model_size``. Only ``streaming_init`` and ``process_frame`` are
    overridden; the streaming engine drives that contract for bounded-memory
    execution.
    """

    op: Literal["object_detection_overlay"] = "object_detection_overlay"

    confidence_threshold: float = Field(0.5, ge=0, le=1, description="Minimum detection confidence to draw a box, 0-1.")
    class_filter: list[str] | None = Field(
        None,
        description='Only draw these COCO class names, e.g. ["person", "car", "dog"]. Null draws all classes.',
    )
    show_confidence: bool = Field(True, description="Append the detection confidence as a percentage to each label.")
    box_color: tuple[int, int, int] | None = Field(
        None,
        description="Fixed box color as [R, G, B] (0-255) for every box, or null for distinct per-class colors.",
    )
    line_thickness: float = Field(
        0.003,
        gt=0,
        le=0.05,
        description="Box stroke width as a fraction of the frame's longer side (0.003 = ~6px at 1920x1080).",
    )
    label_font_size: float = Field(
        0.022,
        gt=0,
        le=0.2,
        description="Label text height as a fraction of the frame's longer side (0.022 = ~42px at 1920x1080).",
    )
    detection_interval: int = Field(
        2,
        ge=1,
        description="Run detection every Nth frame and reuse the last result in between. Higher is faster.",
    )
    model_size: Literal["n", "s", "m"] = Field(
        "n",
        description=(
            "YOLOv8 model size: 'n' (nano, fastest), 's' (small), 'm' (medium, most accurate). "
            "Larger detects better but is slower."
        ),
    )
    backend: Literal["cpu", "gpu", "auto"] = Field(
        "auto",
        description="Detection device: 'cpu', 'gpu', or 'auto'.",
        json_schema_extra={"llm_hidden": True},
    )

    _detector: ObjectDetector | None = PrivateAttr(default=None)
    _last: list[DetectedObject] = PrivateAttr(default_factory=list)

    def _style(self) -> DetectionStyle:
        return DetectionStyle(
            box_color=self.box_color,
            line_thickness=self.line_thickness,
            show_confidence=self.show_confidence,
            label_font_size=self.label_font_size,
            min_confidence=self.confidence_threshold,
        )

    def _init_detector(self) -> None:
        """Build the detector lazily. Single patch point for tests."""
        if self._detector is None:
            self._detector = ObjectDetector(
                model_name=f"yolov8{self.model_size}.pt",
                confidence_threshold=self.confidence_threshold,
                class_filter=tuple(self.class_filter or ()),
                backend=self.backend,
            )

    def streaming_init(self, total_frames: int, fps: float, width: int, height: int, **_context: Any) -> None:
        self._last = []
        self._init_detector()

    def process_frame(self, frame: np.ndarray, frame_index: int) -> np.ndarray:
        if self._detector is None:
            self._init_detector()
        assert self._detector is not None
        # frame_index is 0-based within the effect's window, so frame 0 always
        # detects; intermediate frames reuse the last result.
        if frame_index % self.detection_interval == 0:
            self._last = self._detector.detect(frame)
        return draw_detections(frame, self._last, self._style())
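The detect-and-hold cadence in process_frame can be tried in isolation with a stub detector. The sketch below is illustrative only: HoldLastResult and fake_detect are hypothetical names, and the stub simply records which frames it was asked to process.

```python
from typing import Callable


class HoldLastResult:
    """Run `detect` every `interval` frames and reuse the last result in between."""

    def __init__(self, detect: Callable[[object], list], interval: int) -> None:
        self.detect = detect
        self.interval = interval
        self.last: list = []

    def __call__(self, frame: object, frame_index: int) -> list:
        # Index 0 always detects; other frames detect only on multiples of interval.
        if frame_index % self.interval == 0:
            self.last = self.detect(frame)
        return self.last


calls = []


def fake_detect(frame):
    # Stand-in for a real detector: records the call and echoes the frame id.
    calls.append(frame)
    return [f"box@{frame}"]


held = HoldLastResult(fake_detect, interval=3)
results = [held(frame, index) for index, frame in enumerate(range(7))]

print(calls)       # [0, 3, 6]
print(results[4])  # ['box@3']
```

Frame 4 is drawn with the boxes detected on frame 3, which is the lag that a larger detection_interval introduces for fast-moving objects.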

Renderer

The drawing is a pure, AI-free function reusable with any list of DetectedObject. Colours are deterministic per class, so a class is the same colour in every frame and across runs.

from videopython.base import DetectionStyle, class_color, draw_detections

frame = draw_detections(frame, detections, DetectionStyle(show_confidence=False))

draw_detections

draw_detections(
    frame: ndarray,
    detections: list[DetectedObject],
    style: DetectionStyle = DetectionStyle(),
) -> np.ndarray

Return a copy of frame with detections drawn as labelled boxes.

Shape-preserving: the result is an (H, W, 3) uint8 array with the same shape as the input. When nothing is drawn (an empty detections list, or every detection skipped because it has no bounding_box or falls below min_confidence), the call is a no-op that returns the input frame object itself rather than a copy. Boxes are clamped to the frame, so off-frame coordinates clip cleanly instead of raising. Label chips flip inside the box when they would overflow the top edge and clamp horizontally so they never leave the frame.

Parameters:

Name        Type                  Description                                              Default
frame       ndarray               Source frame as (H, W, 3) uint8 (RGB).                   required
detections  list[DetectedObject]  Objects to draw; each uses its normalized bounding_box.  required
style       DetectionStyle        Visual styling (colours, stroke width, label options).   DetectionStyle()

Returns:

Type     Description
ndarray  A new (H, W, 3) uint8 frame with the overlays composited on.

Source code in src/videopython/base/draw_detections.py
def draw_detections(
    frame: np.ndarray,
    detections: list[DetectedObject],
    style: DetectionStyle = DetectionStyle(),
) -> np.ndarray:
    """Return a copy of ``frame`` with ``detections`` drawn as labelled boxes.

    Shape-preserving: the result is the same ``(H, W, 3)`` ``uint8`` array. An
    empty ``detections`` list (or one filtered out by ``min_confidence``) is a
    no-op that returns ``frame`` unchanged. Boxes are clamped to the frame, so
    off-frame coordinates clip cleanly instead of raising. Label chips flip
    inside the box when they would overflow the top edge and clamp horizontally
    so they never leave the frame.

    Args:
        frame: Source frame as ``(H, W, 3)`` ``uint8`` (RGB).
        detections: Objects to draw; each uses its normalized ``bounding_box``.
        style: Visual styling (colours, stroke width, label options).

    Returns:
        A new ``(H, W, 3)`` ``uint8`` frame with the overlays composited on.
    """
    if not detections:
        return frame

    h, w = frame.shape[:2]
    scale = max(h, w)
    thickness = max(1, round(style.line_thickness * scale))
    font_px = max(8, round(style.label_font_size * scale))
    font = load_font(style.font, font_px)

    canvas = Image.new("RGBA", (w, h), (0, 0, 0, 0))
    draw = ImageDraw.Draw(canvas)

    drew_any = False
    for det in detections:
        box = det.bounding_box
        if box is None or det.confidence < style.min_confidence:
            continue
        drew_any = True
        color = style.box_color or class_color(det.label)

        x0 = max(0, min(w - 1, int(box.x * w)))
        y0 = max(0, min(h - 1, int(box.y * h)))
        x1 = max(0, min(w - 1, int((box.x + box.width) * w)))
        y1 = max(0, min(h - 1, int((box.y + box.height) * h)))
        draw.rectangle((x0, y0, x1, y1), outline=(*color, 255), width=thickness)

        text = det.label.title()
        if style.show_confidence:
            text = f"{text} {det.confidence * 100:.0f}%"

        tb = draw.textbbox((0, 0), text, font=font)
        text_w, text_h = tb[2] - tb[0], tb[3] - tb[1]
        pad = max(2, thickness)
        chip_w, chip_h = text_w + 2 * pad, text_h + 2 * pad

        # Flip the chip inside the box when it would overflow the top edge,
        # and clamp horizontally so it never leaves the frame.
        chip_y = y0 - chip_h if y0 - chip_h >= 0 else y0
        chip_x = max(0, min(x0, w - chip_w))
        draw.rectangle(
            (chip_x, chip_y, chip_x + chip_w, chip_y + chip_h),
            fill=(*color, style.label_bg_alpha),
        )
        draw.text(
            (chip_x + pad - tb[0], chip_y + pad - tb[1]),
            text,
            font=font,
            fill=(*style.label_text_color, 255),
        )

    if not drew_any:
        return frame

    out = Image.fromarray(frame).convert("RGBA")
    out.alpha_composite(canvas)
    return np.array(out.convert("RGB"), dtype=np.uint8)
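The geometry in the listing above (normalized box to clamped pixel corners, then label chip placement) can be checked without Pillow. The following is a minimal sketch that mirrors those expressions; to_pixel_box and place_chip are hypothetical helper names, not functions exported by videopython.

```python
def to_pixel_box(x, y, width, height, frame_w, frame_h):
    """Convert a normalized (x, y, width, height) box to clamped pixel corners."""

    def clamp(value, upper):
        return max(0, min(upper, int(value)))

    x0 = clamp(x * frame_w, frame_w - 1)
    y0 = clamp(y * frame_h, frame_h - 1)
    x1 = clamp((x + width) * frame_w, frame_w - 1)
    y1 = clamp((y + height) * frame_h, frame_h - 1)
    return x0, y0, x1, y1


def place_chip(x0, y0, chip_w, chip_h, frame_w):
    """Put the label chip above the box, or inside it when there is no room above."""
    chip_y = y0 - chip_h if y0 - chip_h >= 0 else y0
    chip_x = max(0, min(x0, frame_w - chip_w))
    return chip_x, chip_y


# A box that starts left of the frame and runs past the right edge.
print(to_pixel_box(-0.1, 0.5, 1.5, 0.25, 1920, 1080))  # (0, 540, 1919, 810)
# Too close to the top edge: the chip flips inside the box.
print(place_chip(x0=100, y0=10, chip_w=80, chip_h=30, frame_w=1920))  # (100, 10)
# Near the right edge: the chip shifts left so it stays in the frame.
print(place_chip(x0=1900, y0=200, chip_w=80, chip_h=30, frame_w=1920))  # (1840, 170)
```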

DetectionStyle dataclass

Styling for draw_detections.

Lengths expressed as a fraction of the frame's longer side are resolution-independent: the same style reads consistently at 1080p and 4k.
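As a rough illustration of how fractional lengths resolve to pixels, the sketch below applies the same rule draw_detections uses (scale by the longer side, round, then apply a pixel floor of 1 for strokes and 8 for fonts). resolve_pixels is a hypothetical helper written for this example.

```python
def resolve_pixels(fraction: float, height: int, width: int, minimum: int) -> int:
    """Scale a fractional length by the frame's longer side, with a pixel floor."""
    return max(minimum, round(fraction * max(height, width)))


for name, (height, width) in {"1080p": (1080, 1920), "4K": (2160, 3840)}.items():
    stroke = resolve_pixels(0.003, height, width, minimum=1)
    font = resolve_pixels(0.022, height, width, minimum=8)
    print(name, stroke, font)
# 1080p 6 42
# 4K 12 84
```

Both values double from 1080p to 4K, so the overlay occupies the same share of the picture at either resolution.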

Source code in src/videopython/base/draw_detections.py
@dataclass(frozen=True)
class DetectionStyle:
    """Styling for :func:`draw_detections`.

    Lengths expressed as a fraction of the frame's longer side are
    resolution-independent: the same style reads consistently at 1080p and 4k.
    """

    box_color: tuple[int, int, int] | None = None
    """Fixed ``(R, G, B)`` for every box, or ``None`` for per-class colours."""
    line_thickness: float = 0.003
    """Box stroke width as a fraction of ``max(height, width)`` (~6px at 1920x1080)."""
    show_confidence: bool = True
    """Append the confidence as a whole-number percent to each label."""
    label_font_size: float = 0.022
    """Label text height as a fraction of ``max(height, width)`` (~42px at 1920x1080)."""
    label_text_color: tuple[int, int, int] = (255, 255, 255)
    """Colour of the label text drawn on the chip."""
    label_bg_alpha: int = 200
    """Opacity (0-255) of the label chip background."""
    min_confidence: float = 0.0
    """Detections below this confidence are skipped."""
    font: str | None = None
    """Bundled font name or path; ``None`` uses the default font."""

box_color class-attribute instance-attribute

box_color: tuple[int, int, int] | None = None

Fixed (R, G, B) for every box, or None for per-class colours.

line_thickness class-attribute instance-attribute

line_thickness: float = 0.003

Box stroke width as a fraction of max(height, width) (~6px at 1920x1080).

show_confidence class-attribute instance-attribute

show_confidence: bool = True

Append the confidence as a whole-number percent to each label.

label_font_size class-attribute instance-attribute

label_font_size: float = 0.022

Label text height as a fraction of max(height, width) (~42px at 1920x1080).

label_text_color class-attribute instance-attribute

label_text_color: tuple[int, int, int] = (255, 255, 255)

Colour of the label text drawn on the chip.

label_bg_alpha class-attribute instance-attribute

label_bg_alpha: int = 200

Opacity (0-255) of the label chip background.

min_confidence class-attribute instance-attribute

min_confidence: float = 0.0

Detections below this confidence are skipped.

font class-attribute instance-attribute

font: str | None = None

Bundled font name or path; None uses the default font.

class_color

class_color(label: str) -> tuple[int, int, int]

Deterministic RGB colour for a class label.

Common COCO classes get a reserved Material hue; everything else maps md5(label) -> HSV hue at fixed saturation/value. md5 (not the salted built-in hash) is used so colours are stable across processes and test runs.

Source code in src/videopython/base/draw_detections.py
def class_color(label: str) -> tuple[int, int, int]:
    """Deterministic RGB colour for a class label.

    Common COCO classes get a reserved Material hue; everything else maps
    ``md5(label) -> HSV hue`` at fixed saturation/value. ``md5`` (not the
    salted built-in ``hash``) is used so colours are stable across processes
    and test runs.
    """
    reserved = _RESERVED_COLORS.get(label)
    if reserved is not None:
        return reserved
    digest = int(hashlib.md5(label.encode("utf-8")).hexdigest(), 16)
    hue = (digest % 360) / 360.0
    r, g, b = colorsys.hsv_to_rgb(hue, 0.7, 0.95)
    return int(r * 255), int(g * 255), int(b * 255)
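The hashing scheme can be exercised on its own with the standard library. This is a sketch under stated assumptions: RESERVED is a hypothetical one-entry stand-in for the library's private reserved-colour table (its real contents are not shown on this page), and stable_color is an illustrative copy of the logic above rather than the exported function.

```python
import colorsys
import hashlib

# Hypothetical stand-in for the private reserved-colour table.
RESERVED = {"person": (244, 67, 54)}


def stable_color(label: str) -> tuple[int, int, int]:
    """Reserved colour if one exists, otherwise an md5-derived hue at fixed S and V."""
    reserved = RESERVED.get(label)
    if reserved is not None:
        return reserved
    digest = int(hashlib.md5(label.encode("utf-8")).hexdigest(), 16)
    hue = (digest % 360) / 360.0
    r, g, b = colorsys.hsv_to_rgb(hue, 0.7, 0.95)
    return int(r * 255), int(g * 255), int(b * 255)


color = stable_color("toaster")
print(color == stable_color("toaster"))  # True
print(max(color), min(color))            # 242 72
```

Because saturation and value are fixed, every hashed colour has the same brightest channel (242) and darkest channel (72); only the hue differs between labels, which keeps unreserved classes visually consistent with each other.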