AI Understanding
Analyze videos, transcribe audio, describe visual content, and track faces per shot.
For a single aggregate, serializable analysis object across multiple analyzers, see Video Analysis.
Local Model Support
| Class | Local Model Family |
|---|---|
| SceneVLM | Qwen3.5 (4B / 9B / 27B) |
| AudioToText | Whisper |
| AudioClassifier | AST |
| SemanticSceneDetector | TransNetV2 |
| FaceTracker | YOLOv8-face |
AudioToText
Anti-hallucination knobs
Three Whisper decoder kwargs are surfaced for tuning on noisy or sparse-speech audio:
from videopython.ai import AudioToText
# Defaults: condition_on_previous_text=False (the cascading-hallucination fix),
# no_speech_threshold=0.6, logprob_threshold=-1.0.
transcriber = AudioToText()
# Tighter no-speech gate to drop more low-confidence windows on a film with
# heavy ambient music.
transcriber = AudioToText(no_speech_threshold=0.85)
# Restore Whisper's upstream default conditioning (e.g. for clean podcasts
# where cross-window context helps disambiguate homophones).
transcriber = AudioToText(condition_on_previous_text=True)
Brand-name vocabulary biasing
Bias Whisper's first-window decoder toward a caller-supplied list of brand
names, product names, or proper nouns via the native initial_prompt
channel. Recovers near-mishears (e.g. Klarna → "carna", InPost →
"in post") on brand-monitoring inputs without any new model
dependencies.
from videopython.ai import AudioToText
# Constructor default — applies to every transcribe() call on this instance.
transcriber = AudioToText(vocabulary=["Klarna", "Allegro", "InPost"])
result = transcriber.transcribe(video)
# Per-call override — useful when one transcriber serves multiple tenants.
result = transcriber.transcribe(video, vocabulary=["Pyszne", "Wolt"])
The list is normalized at construction (whitespace stripped,
case-insensitive dedup, casing of the first occurrence preserved).
Whisper reserves ~224 tokens for the prompt; longer lists are trimmed
from the tail with a single WARNING log line naming the count
dropped.
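The normalization and tail-trimming rules above can be sketched in plain Python. This is an illustration only, not the library's code: `normalize_vocabulary`, `trim_to_budget`, and the word-count token estimate are hypothetical, and skipping blank entries is an assumption.

```python
def normalize_vocabulary(terms):
    """Strip whitespace and drop case-insensitive duplicates, keeping first casing."""
    seen = set()
    cleaned = []
    for term in terms:
        stripped = term.strip()
        if not stripped:
            continue  # assumption: blank entries are ignored
        key = stripped.casefold()
        if key in seen:
            continue  # a later duplicate loses to the first occurrence
        seen.add(key)
        cleaned.append(stripped)
    return cleaned


def trim_to_budget(terms, max_tokens=224, count_tokens=None):
    """Keep terms from the head of the list until the token budget is used up.

    count_tokens is a hypothetical stand-in for a real tokenizer; the default
    counts whitespace-separated words, which is only a rough approximation.
    Returns (kept_terms, number_dropped_from_tail).
    """
    if count_tokens is None:
        count_tokens = lambda text: len(text.split())
    kept, used = [], 0
    for term in terms:
        cost = count_tokens(term)
        if used + cost > max_tokens:
            break  # everything from here to the tail is dropped
        kept.append(term)
        used += cost
    return kept, len(terms) - len(kept)
```

A caller would log the second return value once when it is non-zero, matching the single WARNING line described above.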
VideoDubber and LocalDubbingPipeline accept the same vocabulary
kwarg; it threads through to the underlying transcriber. Within
VideoAnalyzer, pass it via analyzer_params:
from videopython.ai import VideoAnalyzer
from videopython.ai.video_analysis import VideoAnalysisConfig
config = VideoAnalysisConfig(
    analyzer_params={"audio_to_text": {"vocabulary": ["Klarna", "Allegro"]}}
)
analysis = VideoAnalyzer(config=config).analyze_path("brand_review.mp4")
Vocabulary biasing recovers names that Whisper almost heard correctly. It will not catch zero-prior names; an LLM correction pass would close that gap.
Per-segment confidence
TranscriptionSegment carries three optional confidence fields populated from
the raw Whisper output: avg_logprob, no_speech_prob, and
compression_ratio. They are None when not available (e.g. on the
diarization-only path that builds segments from words without overlap match,
or on transcripts loaded from formats that don't carry the metadata).
These signals feed the dubbing pipeline's transcript-quality gate (median
avg_logprob is one of three reject flags) and Qwen3's confidence-aware
translation prompt (segments below threshold get a low_confidence hint). They
are also useful for downstream callers that want to drop low-quality segments
before further processing.
from videopython.ai import AudioToText
result = AudioToText().transcribe(video)
for segment in result.segments:
    if segment.avg_logprob is not None and segment.avg_logprob < -1.0:
        print(f"low confidence: {segment.text!r}")
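As a rough sketch of how a downstream caller might use these fields, the snippet below filters segments and computes the median avg_logprob. The `Segment` dataclass is a hypothetical stand-in for TranscriptionSegment, and the thresholds are illustrative assumptions rather than values used by the dubbing pipeline.

```python
from dataclasses import dataclass
from statistics import median
from typing import Optional


@dataclass
class Segment:
    """Hypothetical stand-in for TranscriptionSegment (only the fields used here)."""
    text: str
    avg_logprob: Optional[float] = None
    no_speech_prob: Optional[float] = None


def keep_confident(segments, min_logprob=-1.0, max_no_speech=0.6):
    """Drop segments whose confidence fields fall outside the given limits.

    Segments whose fields are None are kept, since no signal is available.
    """
    kept = []
    for seg in segments:
        if seg.avg_logprob is not None and seg.avg_logprob < min_logprob:
            continue
        if seg.no_speech_prob is not None and seg.no_speech_prob > max_no_speech:
            continue
        kept.append(seg)
    return kept


def median_logprob(segments):
    """Median avg_logprob over segments that carry one, or None if none do."""
    values = [s.avg_logprob for s in segments if s.avg_logprob is not None]
    return median(values) if values else None
```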
AudioToText
Transcription service for audio and video using local Whisper models.
Uses openai-whisper for transcription (with word-level timestamps) and
pyannote-audio for optional speaker diarization. By default, Silero VAD
runs before Whisper to gate language detection on a 30s window built from
voiced regions only — fixes Whisper's tendency to lock onto the wrong
language when the file opens with silence, music, or non-vocal credits.
Disable with enable_vad=False to reproduce pre-0.27 behaviour.
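To make the "30s window built from voiced regions only" idea concrete, here is a minimal sketch. It assumes the VAD yields (start, end) pairs in seconds; `collect_voiced_window` is a hypothetical helper and not part of the library.

```python
def collect_voiced_window(regions, budget=30.0):
    """Return (start, end) slices from voiced regions totalling at most `budget` seconds.

    Regions are consumed in order; the last slice is cut short if it would
    exceed the budget.
    """
    window, remaining = [], budget
    for start, end in regions:
        if remaining <= 0:
            break
        duration = end - start
        if duration <= 0:
            continue  # skip empty or inverted regions
        take = min(duration, remaining)
        window.append((start, start + take))
        remaining -= take
    return window
```

Concatenating the audio under these slices gives the language detector speech only, even when the file opens with silence or music.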
Three Whisper decoder kwargs are surfaced for anti-hallucination tuning:
- condition_on_previous_text defaults to False (Whisper's own default is True). With conditioning on, a single hallucinated filler phrase cascades through the rest of the file because each window's decoder is primed by the previous window's decoded text. Turning it off is the most commonly recommended fix for that failure mode; the cost on clean audio is small (slightly less context for ambiguous homophones across sentence boundaries).
- no_speech_threshold and logprob_threshold are forwarded with Whisper's documented defaults (0.6 and -1.0); raising no_speech_threshold biases toward dropping low-confidence windows instead of emitting filler.
vocabulary biases Whisper's first-window decoder toward a caller-
supplied list of brand names, product names, or proper nouns via the
native initial_prompt channel. Recovers near-mishears (e.g. Klarna
→ "carna") without new model deps; will not catch zero-prior names.
A per-call override is available on transcribe().
Source code in src/videopython/ai/understanding/audio.py
unload
Release the Whisper, diarization, and VAD models so the next call re-initializes.
Used by low-memory dubbing to free VRAM between pipeline stages.
Source code in src/videopython/ai/understanding/audio.py
diarize_transcription
Attach speaker labels to a pre-computed transcription using pyannote.
Useful when callers have a transcription (e.g. pre-computed and edited)
but no speakers, and want per-speaker voice cloning in dubbing without
re-running Whisper. Runs pyannote standalone on audio and overlays
speakers onto the supplied transcription's words.
Requires word-level timings: at least one segment must contain more than one word. Transcriptions loaded from SRT (one synthetic word per segment) will not produce useful speakers and are rejected.
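The overlay step can be pictured as giving each word the speaker whose turn overlaps it most. The sketch below is an assumption about the general technique, not the library's implementation; the tuple layouts and the `assign_speakers` name are hypothetical.

```python
def overlap(a_start, a_end, b_start, b_end):
    """Length in seconds of the intersection of two intervals (0.0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers(words, turns):
    """Label each word with the speaker whose turn overlaps it the most.

    words: list of (text, start, end); turns: list of (speaker, start, end).
    Words that overlap no turn get None.
    """
    labelled = []
    for text, w_start, w_end in words:
        best_speaker, best_overlap = None, 0.0
        for speaker, t_start, t_end in turns:
            shared = overlap(w_start, w_end, t_start, t_end)
            if shared > best_overlap:
                best_speaker, best_overlap = speaker, shared
        labelled.append((text, best_speaker))
    return labelled
```

This also shows why word-level timings matter: with one synthetic word per segment, every long segment collapses to a single label regardless of who spoke inside it.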
Source code in src/videopython/ai/understanding/audio.py
transcribe
Transcribe audio or video to text.
vocabulary overrides the constructor default for this call only;
a per-call list wins over the instance's vocabulary so one
AudioToText instance can serve multiple tenants. Pass
None (the default) to use the constructor's list.
Source code in src/videopython/ai/understanding/audio.py
AudioClassifier
Detect and classify sounds, music, and audio events with timestamps using Audio Spectrogram Transformer (AST), a state-of-the-art model achieving 0.485 mAP on AudioSet.
Basic Usage
from videopython.ai import AudioClassifier
from videopython.base import Video
classifier = AudioClassifier(confidence_threshold=0.3)
video = Video.from_path("video.mp4")
result = classifier.classify(video)
# Clip-level predictions (overall audio content)
for label, confidence in result.clip_predictions.items():
    print(f"{label}: {confidence:.2f}")
# Timestamped events
for event in result.events:
    print(f"{event.start:.1f}s - {event.end:.1f}s: {event.label} ({event.confidence:.2f})")
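One way to post-process the timestamped events is to total the duration per label. The sketch below uses a local dataclass that mirrors the documented AudioEvent fields so it runs without the library; `total_duration_by_label` and its threshold are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class AudioEvent:
    """Local stand-in mirroring the documented AudioEvent fields."""
    start: float
    end: float
    label: str
    confidence: float


def total_duration_by_label(events, min_confidence=0.5):
    """Sum event durations per label, ignoring events below min_confidence."""
    totals = defaultdict(float)
    for event in events:
        if event.confidence >= min_confidence:
            totals[event.label] += event.end - event.start
    return dict(totals)
```

With real results, `result.events` from classify() could be passed in directly, since only the four documented attributes are read.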
AudioClassifier
Audio event and sound classification using AST.
Source code in src/videopython/ai/understanding/audio.py
unload
Release the AST model so the next classify() re-initializes.
Used by low-memory dubbing to free VRAM between pipeline stages.
Source code in src/videopython/ai/understanding/audio.py
classify
Classify audio events in audio or video.
Source code in src/videopython/ai/understanding/audio.py
SceneVLM
SceneVLM supports Qwen3.5 dense vision-capable variants via the
model_size kwarg: "4b" (default, ~8 GB FP16), "9b" (~18 GB FP16),
"27b" (~54 GB FP16, needs ≥48 GB). Device selection is automatic by
default (cuda -> mps -> cpu).
analyze_scene() and analyze_frame() return a structured
SceneDescription with three fields: a one-sentence
caption, an open-list subjects, and a closed-enum shot_type. The
class uses few-shot JSON prompting with one parse-retry; on persistent
parse failure, the raw text becomes the caption while subjects and
shot_type are returned empty / None.
from videopython.ai import SceneVLM
vlm = SceneVLM(model_size="4b") # default
description = vlm.analyze_frame(frame_array)
print(description.caption) # "A man in a cap speaks into a microphone."
print(description.subjects) # ["man", "microphone", "cap"]
print(description.shot_type) # "medium"
SceneVLM.unload() releases the model + processor for low_memory
parity with the dubbing pipeline's translator backends.
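The parse-retry-then-fallback behaviour described in this section can be sketched as follows. This is a simplified illustration under stated assumptions: `generate` is a hypothetical callable standing in for one model call, the dataclass is a local stand-in for SceneDescription, and the `SHOT_TYPES` values are invented because the real closed enum is not listed here.

```python
import json
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical enum values; the real closed set is not listed in this document.
SHOT_TYPES = {"close-up", "medium", "wide"}


@dataclass
class SceneDescription:
    """Local stand-in with the three documented fields."""
    caption: str
    subjects: List[str] = field(default_factory=list)
    shot_type: Optional[str] = None


def try_parse(text):
    """Return a SceneDescription if text is usable JSON, else None."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict) or not isinstance(data.get("caption"), str):
        return None
    subjects = data.get("subjects")
    if not isinstance(subjects, list):
        subjects = []
    shot_type = data.get("shot_type")
    if shot_type not in SHOT_TYPES:
        shot_type = None  # values outside the closed enum are discarded
    return SceneDescription(data["caption"], [str(s) for s in subjects], shot_type)


def describe(generate, attempts=2):
    """Call generate() up to `attempts` times; fall back to raw text as caption."""
    raw = ""
    for _ in range(attempts):
        raw = generate()
        parsed = try_parse(raw)
        if parsed is not None:
            return parsed
    return SceneDescription(caption=raw.strip())
```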
SceneVLM
Generates structured scene descriptions with local Qwen3.5.
model_size maps to Qwen3.5 dense vision-capable variants:
4b -> Qwen/Qwen3.5-4B (~8 GB FP16; default)
9b -> Qwen/Qwen3.5-9B (~18 GB FP16; 24 GB GPU when solo)
27b -> Qwen/Qwen3.5-27B (~54 GB FP16; needs ≥48 GB)
All sizes return a fully-populated SceneDescription. JSON parse
failures fall back to raw-text-as-caption with empty subjects and
None shot_type; that path is the rare exception, not a tier.
Source code in src/videopython/ai/understanding/image.py
unload
Release model + processor for low_memory parity with other stages.
Mirrors MarianTranslator.unload / Qwen3Translator.unload. Safe
to call before _init_local -- it just clears already-None fields.
Source code in src/videopython/ai/understanding/image.py
analyze_frame
Analyze one frame and return a structured scene description.
Source code in src/videopython/ai/understanding/image.py
analyze_scene
Analyze a scene with multiple frames and return a structured description.
Uses few-shot JSON prompting with one parse-retry. If both attempts
fail to produce valid JSON, falls back to a raw-text caption with
empty subjects and shot_type=None.
Source code in src/videopython/ai/understanding/image.py
SemanticSceneDetector
ML-based scene boundary detection using TransNetV2. More accurate than histogram-based detection, especially for gradual transitions like fades and dissolves.
from videopython.ai import SemanticSceneDetector
detector = SemanticSceneDetector(threshold=0.5, min_scene_length=1.0)
scenes = detector.detect_streaming("video.mp4")
for scene in scenes:
    print(f"Scene: {scene.start:.1f}s - {scene.end:.1f}s ({scene.duration:.1f}s)")
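To show how threshold and min_scene_length interact, here is a sketch that turns per-frame boundary probabilities into scenes. It is an assumption about the general post-processing idea, not the detector's actual code; `scenes_from_probabilities` is hypothetical and the dataclass is a local stand-in for SceneBoundary.

```python
from dataclasses import dataclass


@dataclass
class SceneBoundary:
    """Local stand-in with the documented timing fields (end_frame is exclusive)."""
    start: float
    end: float
    start_frame: int
    end_frame: int


def scenes_from_probabilities(probs, fps, threshold=0.5, min_scene_length=0.5):
    """Cut after each frame whose probability is >= threshold.

    A cut is skipped when it would create a scene shorter than
    min_scene_length; a too-short trailing scene is merged into the previous one.
    """
    if not probs:
        return []
    min_frames = max(1, round(min_scene_length * fps))
    cuts = [0]
    for i, p in enumerate(probs):
        cut = i + 1  # the next scene starts on the frame after the boundary
        if p >= threshold and cut < len(probs) and cut - cuts[-1] >= min_frames:
            cuts.append(cut)
    if len(cuts) > 1 and len(probs) - cuts[-1] < min_frames:
        cuts.pop()  # merge a too-short final scene
    cuts.append(len(probs))
    return [SceneBoundary(s / fps, e / fps, s, e) for s, e in zip(cuts, cuts[1:])]
```

Raising threshold removes low-confidence cuts, while raising min_scene_length suppresses cuts that follow each other too closely.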
SemanticSceneDetector
ML-based scene detection using TransNetV2.
TransNetV2 is a neural network specifically designed for shot boundary detection, providing more accurate scene boundaries than histogram-based methods, especially for gradual transitions.
Uses the transnetv2-pytorch package with pretrained weights.
Example
from videopython.ai.understanding import SemanticSceneDetector
detector = SemanticSceneDetector()
scenes = detector.detect_streaming("video.mp4")
for scene in scenes:
    print(f"Scene: {scene.start:.2f}s - {scene.end:.2f}s")
Source code in src/videopython/ai/understanding/temporal.py
__init__
Initialize the semantic scene detector.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| threshold | float | Confidence threshold for scene boundaries (0.0-1.0). Higher values = fewer, more confident boundaries. | 0.5 |
| min_scene_length | float | Minimum scene duration in seconds. | 0.5 |
| device | str \| None | Device to run on ('cuda', 'mps', 'cpu', or None for auto). Note: MPS may have numerical inconsistencies; use 'cpu' for reproducible results. | None |
Source code in src/videopython/ai/understanding/temporal.py
unload
detect
Detect scenes in a video using ML-based boundary detection.
Note: This method requires saving video to a temporary file for TransNetV2 processing. For better performance, use detect_streaming() with a file path directly.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| video | Video | Video object to analyze. | required |
Returns:
| Type | Description |
|---|---|
| list[SceneBoundary] | List of SceneBoundary objects representing detected scenes. |
Source code in src/videopython/ai/understanding/temporal.py
detect_streaming
Detect scenes from a video file.
Uses TransNetV2 with pretrained weights for accurate shot boundary detection.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| path | str \| Path | Path to video file. | required |
Returns:
| Type | Description |
|---|---|
| list[SceneBoundary] | List of SceneBoundary objects representing detected scenes. |
Source code in src/videopython/ai/understanding/temporal.py
detect_from_path
classmethod
detect_from_path(
    path: str | Path,
    threshold: float = 0.5,
    min_scene_length: float = 0.5,
) -> list[SceneBoundary]
Convenience method for one-shot scene detection.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| path | str \| Path | Path to video file. | required |
| threshold | float | Scene boundary threshold (0.0-1.0). | 0.5 |
| min_scene_length | float | Minimum scene duration in seconds. | 0.5 |
Returns:
| Type | Description |
|---|---|
| list[SceneBoundary] | List of SceneBoundary objects representing detected scenes. |
Source code in src/videopython/ai/understanding/temporal.py
FaceTracker
FaceTracker runs YOLOv8-face detection and stitches detections into
per-shot tracks via IoU association — no embedding re-id, so a track
does not survive across shot boundaries. Two surfaces:
- track_shot(frames, frame_indices) returns a list of FaceTrack objects with stable ids within the shot. This is the API the analyzer uses.
- detect_and_track(frame, frame_index) / track_video(frames) are the legacy single-subject smoothed-position APIs used by FaceTrackingCrop (see AI Transforms).
from videopython.ai import FaceTracker
tracker = FaceTracker(backend="auto")
tracks = tracker.track_shot(frames)
for track in tracks:
    print(f"track #{track.track_id}: {track.length} frames, "
          f"first frame {track.frame_indices[0]}")
FaceTracker
Face tracking utility with per-frame smoothing and per-shot tracks.
Two surfaces:
- detect_and_track(frame, frame_index) / track_video(frames): legacy single-subject API used by FaceTrackingCrop. Returns a smoothed (cx, cy, w, h) tuple.
- track_shot(frames, frame_indices): new per-shot multi-track API returning list[FaceTrack]. Used by the analysis pipeline (M5) and lip-sync (M6) to bind detections to subjects across the frames of one shot. Association is IoU-only, so tracks do not survive across shot boundaries.
Source code in src/videopython/ai/understanding/faces.py
__init__
__init__(
    selection_strategy: Literal["largest", "centered", "index"] = "largest",
    face_index: int = 0,
    smoothing: float = 0.8,
    detection_interval: int = 3,
    min_face_size: int = 30,
    backend: Literal["cpu", "gpu", "auto"] = "auto",
    sample_rate: int = 1,
    batch_size: int = 16,
    iou_match_threshold: float = DEFAULT_IOU_MATCH_THRESHOLD,
    max_missed_frames: int = DEFAULT_MAX_MISSED_FRAMES,
)
Initialize face tracker.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| selection_strategy | Literal['largest', 'centered', 'index'] | How to select which face to track (legacy single-subject API). "largest": Track the face with the largest bounding box. "centered": Track the face closest to frame center. "index": Track the face at a specific index (sorted by area). | 'largest' |
| face_index | int | Index of face to track when using "index" strategy. | 0 |
| smoothing | float | Exponential moving average factor (0-1). Higher = smoother. | 0.8 |
| detection_interval | int | Run detection every N frames, interpolate between. | 3 |
| min_face_size | int | Minimum face size in pixels for detection. | 30 |
| backend | Literal['cpu', 'gpu', 'auto'] | Detection backend: "cpu", "gpu", or "auto". | 'auto' |
| sample_rate | int | For GPU backend, detect every Nth frame and interpolate. Only used by track_video(). Default 1 (every frame). | 1 |
| batch_size | int | Batch size for GPU detection. Default 16. | 16 |
| iou_match_threshold | float | Minimum IoU between consecutive detections to continue an existing per-shot track. Used by track_shot. | DEFAULT_IOU_MATCH_THRESHOLD |
| max_missed_frames | int | How many consecutive frames a per-shot track can go without a detection before it's closed. | DEFAULT_MAX_MISSED_FRAMES |
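For intuition about the smoothing parameter, the sketch below applies an exponential moving average to (cx, cy, w, h) tuples. The update rule, with smoothing weighting the previous state, is an assumption consistent with "higher = smoother"; `smooth_positions` is a hypothetical helper, not the tracker's code.

```python
def smooth_positions(positions, smoothing=0.8):
    """Apply an exponential moving average to a sequence of (cx, cy, w, h) tuples.

    None entries (no detection) reuse the last smoothed value, or stay None
    if nothing has been seen yet.
    """
    smoothed, state = [], None
    for pos in positions:
        if pos is not None:
            if state is None:
                state = tuple(pos)  # first detection initializes the state
            else:
                state = tuple(
                    smoothing * s + (1.0 - smoothing) * p
                    for s, p in zip(state, pos)
                )
        smoothed.append(state)
    return smoothed
```

With smoothing=0.0 the output follows raw detections exactly; values near 1.0 react slowly, which reduces jitter but lags fast motion.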
Source code in src/videopython/ai/understanding/faces.py
detect_and_track
Detect face in frame and return smoothed position.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| frame | ndarray | Video frame as numpy array (H, W, 3). | required |
| frame_index | int | Index of current frame. | required |
Returns:
| Type | Description |
|---|---|
| tuple[float, float, float, float] \| None | Tuple of (center_x, center_y, width, height) in normalized coords, or None if no face detected and no fallback available. |
Source code in src/videopython/ai/understanding/faces.py
reset
track_video
Track face through entire video using optimized batch detection.
Optimized for GPU backends with frame sampling and interpolation for smooth tracking with reduced computation.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| frames | ndarray | Video frames array of shape (N, H, W, 3). | required |
Returns:
| Type | Description |
|---|---|
| list[tuple[float, float, float, float] \| None] | List of face positions (cx, cy, w, h) for each frame, or None if no face detected and no fallback available. |
Source code in src/videopython/ai/understanding/faces.py
track_shot
track_shot(
    frames: list[ndarray] | ndarray,
    frame_indices: list[int] | None = None,
) -> list[FaceTrack]
Per-shot multi-track association via IoU.
Detection is run on every input frame (caller is expected to have
already chosen the sampling cadence -- the analysis pipeline
passes one frame per scene-VLM sample, lip-sync passes every
frame in the shot). Tracks are stitched together greedily by
best IoU above iou_match_threshold; tracks with no match for
max_missed_frames consecutive frames are closed and won't
accept future associations.
Track ids are integers starting at 1 within this shot. They are not stable across shots — embedding re-id is deferred.
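The greedy IoU association described above can be sketched in pure Python. This is a simplified illustration and not the library's implementation: boxes are assumed to be (x1, y1, x2, y2) tuples, `track_boxes` is hypothetical, and the default threshold and miss limit are invented because the real DEFAULT_* values are not given here.

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def track_boxes(detections_per_frame, iou_threshold=0.3, max_missed=2):
    """Greedy IoU association; returns {track_id: [(frame_pos, box), ...]}."""
    tracks = {}   # track id -> list of (frame position, box)
    missed = {}   # open track id -> consecutive frames without a match
    next_id = 1   # ids start at 1 within the shot
    for pos, boxes in enumerate(detections_per_frame):
        # Score every (open track, detection) pair, best IoU first.
        pairs = sorted(
            (
                (iou(tracks[tid][-1][1], box), tid, j)
                for tid in missed
                for j, box in enumerate(boxes)
            ),
            reverse=True,
        )
        used_tracks, used_boxes = set(), set()
        for score, tid, j in pairs:
            if score < iou_threshold:
                break  # remaining pairs are all weaker
            if tid in used_tracks or j in used_boxes:
                continue
            tracks[tid].append((pos, boxes[j]))
            missed[tid] = 0
            used_tracks.add(tid)
            used_boxes.add(j)
        # Count misses and close tracks that have been unmatched for too long.
        for tid in list(missed):
            if tid not in used_tracks:
                missed[tid] += 1
                if missed[tid] >= max_missed:
                    del missed[tid]  # closed: no future associations
        # Unmatched detections open new tracks.
        for j, box in enumerate(boxes):
            if j not in used_boxes:
                tracks[next_id] = [(pos, box)]
                missed[next_id] = 0
                next_id += 1
    return tracks
```

Because matching relies only on box overlap between consecutive samples, a hard cut resets everything, which is why ids are meaningful only within one shot.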
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| frames | list[ndarray] \| ndarray | Frames in the shot (list or stacked ndarray). | required |
| frame_indices | list[int] \| None | Source-video frame indices. Defaults to … | None |
Returns:
| Type | Description |
|---|---|
| list[FaceTrack] | List of FaceTrack objects tracked in the shot. |
Source code in src/videopython/ai/understanding/faces.py
Scene Data Classes
These classes are used by scene and audio analyzers to represent analysis results:
SceneBoundary
SceneBoundary
dataclass
Timing information for a detected scene.
A lightweight structure representing scene boundaries returned by
scene detectors (e.g. videopython.ai.SemanticSceneDetector). This
is a backbone type — higher-level scene analysis lives in orchestration
packages.
Attributes:
| Name | Type | Description |
|---|---|---|
| start | float | Scene start time in seconds |
| end | float | Scene end time in seconds |
| start_frame | int | Index of the first frame in this scene |
| end_frame | int | Index of the last frame in this scene (exclusive) |
Source code in src/videopython/base/description.py
to_dict
Convert to dictionary for JSON serialization.
from_dict
classmethod
Create SceneBoundary from dictionary.
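A JSON round trip through to_dict / from_dict might look like the sketch below. The dataclass here is a local stand-in built from the documented attributes; the dictionary keys are assumed to match the attribute names, which this document does not confirm.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class SceneBoundary:
    """Local stand-in with the four documented attributes."""
    start: float
    end: float
    start_frame: int
    end_frame: int  # exclusive

    def to_dict(self):
        """Plain dict suitable for json.dumps (keys assumed to match field names)."""
        return asdict(self)

    @classmethod
    def from_dict(cls, data):
        """Rebuild an instance from a dict produced by to_dict."""
        return cls(
            start=float(data["start"]),
            end=float(data["end"]),
            start_frame=int(data["start_frame"]),
            end_frame=int(data["end_frame"]),
        )


def roundtrip(scene):
    """Serialize to a JSON string and back again."""
    return SceneBoundary.from_dict(json.loads(json.dumps(scene.to_dict())))
```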
Source code in src/videopython/base/description.py
SceneDescription
SceneDescription
dataclass
Structured visual scene description from the SceneVLM.
The v1 schema is intentionally narrow (caption + subjects + shot_type). Wider schemas drop JSON parse rate on small models without eval data to defend the cost. Fields are added in v2 as parse-rate measurements justify them; closed enums first, open lists last.
Attributes:
| Name | Type | Description |
|---|---|---|
| caption | str | One-sentence summary of the scene. |
| subjects | list[str] | Open list of named subjects visible in the frames. |
| shot_type | str \| None | Closed enum framing the camera distance, or None when JSON parsing fell back to raw text. |
Source code in src/videopython/base/description.py
BoundingBox
BoundingBox
Bases: BaseModel
A bounding box for detected objects or crop regions in an image.
Coordinates are normalized to [0, 1] relative to image dimensions.
Promoted to a Pydantic model so it can be embedded directly into
Operation fields (e.g. KenBurns.start_region) and validated /
serialised as part of an op's JSON wire format.
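Since coordinates are normalized, callers usually need to scale a box back to pixels before cropping or drawing. The sketch below shows that conversion with a hypothetical `NormalizedBox` class; the field names (x, y, width, height, with x and y as the top-left corner) are assumptions, because BoundingBox's actual fields are not listed in this document.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NormalizedBox:
    """Hypothetical box with top-left corner and size, all in [0, 1]."""
    x: float
    y: float
    width: float
    height: float

    @property
    def center(self):
        """Center point in normalized coordinates."""
        return (self.x + self.width / 2, self.y + self.height / 2)

    @property
    def area(self):
        """Fraction of the image covered by the box."""
        return self.width * self.height

    def to_pixels(self, image_width, image_height):
        """Return (left, top, right, bottom) in integer pixel coordinates."""
        left = round(self.x * image_width)
        top = round(self.y * image_height)
        right = round((self.x + self.width) * image_width)
        bottom = round((self.y + self.height) * image_height)
        return left, top, right, bottom
```

Keeping boxes normalized means the same serialized region works for any output resolution; only the final scaling step depends on frame size.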
Source code in src/videopython/base/description.py
to_dict
from_dict
classmethod
DetectedObject
DetectedObject
dataclass
An object detected in a video frame.
Attributes:
| Name | Type | Description |
|---|---|---|
| label | str | Name/class of the detected object (e.g., "person", "car", "dog") |
| confidence | float | Detection confidence score between 0 and 1 |
| bounding_box | BoundingBox \| None | Optional bounding box location of the object |
Source code in src/videopython/base/description.py
to_dict
Convert to dictionary for JSON serialization.
Source code in src/videopython/base/description.py
from_dict
classmethod
Create DetectedObject from dictionary.
Source code in src/videopython/base/description.py
DetectedFace
DetectedFace
dataclass
A face detected in a video frame.
Attributes:
| Name | Type | Description |
|---|---|---|
| bounding_box | BoundingBox \| None | Bounding box location of the face (normalized 0-1 coordinates). May be None for cloud backends that only return face counts. |
| confidence | float | Detection confidence score between 0 and 1 |
Source code in src/videopython/base/description.py
center
property
Center point of the face bounding box, or None if no bounding box.
area
property
Area of the face bounding box (normalized), or None if no bounding box.
to_dict
Convert to dictionary for JSON serialization.
from_dict
classmethod
Create DetectedFace from dictionary.
Source code in src/videopython/base/description.py
DetectedText
DetectedText
dataclass
Text detected in a video frame.
Attributes:
| Name | Type | Description |
|---|---|---|
| text | str | OCR text content |
| confidence | float | Detection confidence score between 0 and 1 |
| bounding_box | BoundingBox \| None | Optional normalized bounding box for the text region |
Source code in src/videopython/base/description.py
to_dict
Convert to dictionary for JSON serialization.
Source code in src/videopython/base/description.py
from_dict
classmethod
Create DetectedText from dictionary.
Source code in src/videopython/base/description.py
FaceTrack
FaceTrack
dataclass
A face tracked across consecutive frames within a single shot.
Tracks are produced by IoU association — no embedding re-id, so a
track does not survive across shot/scene boundaries. frame_indices
and boxes are parallel lists of equal length.
Attributes:
| Name | Type | Description |
|---|---|---|
| track_id | int | Stable id within the shot the track was produced in. Not globally unique across scenes. |
| frame_indices | list[int] | Source-video frame indices for each detection. |
| boxes | list[BoundingBox] | Per-frame bounding boxes (normalized 0-1 coords). |
| confidences | list[float] | Per-frame detection confidence in [0, 1]. |
Source code in src/videopython/base/description.py
AudioEvent
AudioEvent
dataclass
A detected audio event with timestamp.
Attributes:
| Name | Type | Description |
|---|---|---|
| start | float | Start time in seconds |
| end | float | End time in seconds |
| label | str | Name of the detected sound (e.g., "Music", "Speech", "Dog bark") |
| confidence | float | Detection confidence score between 0 and 1 |
Source code in src/videopython/base/description.py
to_dict
Convert to dictionary for JSON serialization.
from_dict
classmethod
Create AudioEvent from dictionary.
AudioClassification
AudioClassification
dataclass
Complete audio classification results.
Attributes:
| Name | Type | Description |
|---|---|---|
| events | list[AudioEvent] | List of detected audio events with timestamps |
| clip_predictions | dict[str, float] | Overall class probabilities for the entire audio clip |
Source code in src/videopython/base/description.py
to_dict
Convert to dictionary for JSON serialization.
from_dict
classmethod
Create AudioClassification from dictionary.