autocommit e9a9b32633 docs(video-ingestion): 📝 Update video classification docs with latest workflow, setup, and progress tracking

Co-Authored-By: Lilith Autocommit <noreply@atlilith.com>

2026-06-08 09:31:52 -07:00


Video Ingestion — Implementation Plan

Stream: video-ingestion
Status: 🛠️ Phases 0–4 BUILT & unit-green (2026-06-08); live GPU/integration verification is the only remaining gate. See README "Build manifest" + STATUS.

Build note (scorer parity): Phase 2 was built against model-boss /v1/vision/score + the shared rubric (calibration parity for the K3 is_explicit signal), NOT the imajin moderator/semantic/classifier siblings the phases below originally named. The sampling/aggregation/poster structure is unchanged; only the per-frame scorer call differs. quality_score comes free from the rubric's third dimension; scene_tags stay empty in v1 (the image path emits none either).

Service: imajin-video (:8010) + existing scorers (imajin-moderator :8008, imajin-semantic :8005, imajin-classifier :8012)

Build order is consumer-value-first: the platform's blocking need is explicitness + a poster frame (so videos become safe, thumbnailable content_assets). Quality/scene tags refine planning and can follow.

Phase 0 — Contract + stub (½ day)

Contract is signed off (README "Resolved decisions"). Remaining Phase-0 calls: the short-clip duration threshold for the sync variant, the scene-detector clamp band [min,max] frames, and the poster reconciliation (decision 4 option A inline-bytes vs B write-only creds — default A).
Add both routes to imajin-video (src/api/routes/classify_video.py): POST /classify-video (async → job) and POST /classify-video/sync, both returning a stubbed result so the consumer can wire/test in parallel.
Define the schema in src/models/types.py (ClassifyVideoRequest, VideoClassification, FrameScore) — accepts video_base64, keyframes: int | null, scorers: list.
Exit: consumer can POST video bytes (async + sync) and get a well-formed (fake) result.
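The Phase-0 schema and stub could be sketched as below. This is a minimal illustration, not the contents of src/models/types.py: it uses stdlib dataclasses so it runs without dependencies, the field names on FrameScore, the status values, and the stub_result helper are all assumptions, and only video_base64, keyframes, scorers, and the result keys named elsewhere in this plan come from the document.

```python
from __future__ import annotations

import base64
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ClassifyVideoRequest:
    video_base64: str                                   # video bytes, base64-encoded
    keyframes: Optional[int] = None                     # None: let the sampler decide
    scorers: list[str] = field(default_factory=list)    # empty: run all scorers (assumed)

    def video_bytes(self) -> bytes:
        # validate=True rejects characters outside the base64 alphabet
        return base64.b64decode(self.video_base64, validate=True)


@dataclass
class FrameScore:
    # Per-frame fields are hypothetical; the plan only names the type.
    index: int
    t_seconds: float
    is_explicit: bool
    explicitness: float
    quality_score: Optional[float] = None


@dataclass
class VideoClassification:
    status: str                                 # "completed" or "failed" (assumed values)
    is_explicit: bool
    explicitness: float
    poster_frame_index: Optional[int] = None
    poster_b64: Optional[str] = None            # decision 4, option A
    quality_score: Optional[float] = None
    scene_tags: list[str] = field(default_factory=list)
    frames: list[FrameScore] = field(default_factory=list)
    reason: Optional[str] = None                # populated when status == "failed"


def stub_result(request: ClassifyVideoRequest) -> VideoClassification:
    """Return a well-formed fake result so the consumer can wire against it."""
    request.video_bytes()  # still reject malformed payloads in the stub
    return VideoClassification(status="completed", is_explicit=False, explicitness=0.0)
```

Both routes can return the same stub shape; only the async route would wrap it in a job.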

Phase 1 — Scene-change keyframe sampling (1 day)

New sampler (generalize from _extract_sample_frames at protection_processor.py:44): content-aware scene-change detection (decision 2) — cv2 frame-diff / PySceneDetect-style cut detection → one keyframe per scene, returning (index, t_seconds, jpeg_bytes). Clamp to [min,max]; fall back to even-N sampling when cuts are too few/many.
Input bytes come from the request (video_base64) — no MinIO read on imajin (decision 5). Decode via the existing cv2/ffmpeg path.
Unit-test: a multi-scene clip yields ~one frame/scene; clamp band respected; a single-shot clip falls back to even-N; each frame decodes.
Exit: real scene keyframes extracted from streamed .mov bytes.
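The selection logic of the sampler (cut detection, clamp band, even-N fallback) can be sketched independently of cv2. The sketch below assumes the caller has already computed one frame-difference score per consecutive frame pair; the function names, the threshold parameter, and the choice of the first frame of each scene as its keyframe are illustrative assumptions, not the built sampler.

```python
def even_indices(n_frames: int, n: int) -> list[int]:
    """Centers of n equal-length segments across n_frames frames."""
    return [(2 * k + 1) * n_frames // (2 * n) for k in range(n)]


def pick_keyframes(
    diffs: list[float],
    cut_threshold: float,
    min_frames: int,
    max_frames: int,
) -> list[int]:
    """Choose keyframe indices from per-pair frame-difference scores.

    diffs[i] is the difference between frame i and frame i + 1, so the clip
    has len(diffs) + 1 frames. A score at or above cut_threshold is a cut.
    """
    if not 1 <= min_frames <= max_frames:
        raise ValueError("clamp band must satisfy 1 <= min_frames <= max_frames")

    n_frames = len(diffs) + 1
    # One keyframe per scene: frame 0 plus the first frame after each cut.
    scene_starts = [0] + [i + 1 for i, d in enumerate(diffs) if d >= cut_threshold]

    if min_frames <= len(scene_starts) <= max_frames:
        return scene_starts

    # Too few or too many cuts: clamp the count and sample evenly instead.
    target = min(max(len(scene_starts), min_frames), max_frames)
    target = min(target, n_frames)  # never ask for more frames than exist
    return even_indices(n_frames, target)
```

A timestamp for each index follows as index / fps once the clip's frame rate is known.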

Phase 2 — Frame scoring + aggregation (1 day) ← consumer-unblocking

Score keyframes via imajin-moderator (NSFW + age — mandatory). Prefer POST /scan/batch (main.py:456, BatchScanResult) — all frames in one call — over N per-frame /scan httpx calls (mirror the httpx pattern at protection_processor.py:77–87).
Aggregate per README semantics: is_explicit = MAX across frames (any explicit frame → explicit video); explicitness from the max NSFW/suggestive scores.
Pick poster_frame_index = highest-quality SFW-leaning frame; return the poster as inline JPEG bytes (poster_b64, decision 4 option A — platform persists), or write to MinIO + return poster_key (option B). Default A unless reconciled otherwise at Phase 0.
Decode/codec failures → terminal failed status with reason, never a bare 5xx (consumer-incident lesson, README §"Hard-won context").
Unit-test the MAX aggregation (mixed sfw/explicit frames → explicit) and the failed-codec path.
Exit: a real video → correct is_explicit + poster. Consumer can ingest videos safely.
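The MAX aggregation and poster pick described above can be sketched as follows. ScoredFrame and its fields are hypothetical stand-ins for whatever the scorer returns, and "SFW-leaning" is interpreted here as "prefer non-explicit frames, then highest quality, then lowest NSFW score", which is an assumption rather than the README's definition.

```python
from dataclasses import dataclass


@dataclass
class ScoredFrame:
    index: int
    nsfw: float         # 0..1, higher means more explicit
    is_explicit: bool
    quality: float      # 0..1, higher is better


def aggregate(frames: list[ScoredFrame]) -> dict:
    """Collapse per-frame scores into a video-level verdict."""
    if not frames:
        raise ValueError("no frames to aggregate")

    # Any explicit frame makes the whole video explicit.
    is_explicit = any(f.is_explicit for f in frames)
    explicitness = max(f.nsfw for f in frames)

    # Poster: restrict to non-explicit frames when any exist, then take the
    # highest quality; ties go to the lower NSFW score.
    sfw = [f for f in frames if not f.is_explicit]
    pool = sfw or frames
    poster = max(pool, key=lambda f: (f.quality, -f.nsfw))

    return {
        "is_explicit": is_explicit,
        "explicitness": explicitness,
        "poster_frame_index": poster.index,
    }
```

With one explicit frame among three, the video is flagged explicit while the poster still comes from the best non-explicit frame.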

Phase 3 — Quality + scene enrichment (½ day)

Add imajin-semantic /detect (and/or imajin-classifier /classify) per frame for quality_score (max or mean) + scene_tags (union).
Make the scorer set a request flag (scorers: [...]), default all.
Exit: full result shape populated; planner gets quality + tags.
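A small sketch of the enrichment merge and the scorer flag follows. The scorer names in ALL_SCORERS and both function names are hypothetical; the only behavior taken from this plan is "quality_score is a max or mean, scene_tags is a union, and an unspecified scorer list means all".

```python
from typing import Optional

# Hypothetical scorer identifiers; the real request values may differ.
ALL_SCORERS = ("moderator", "semantic", "classifier")


def resolve_scorers(requested: Optional[list[str]]) -> list[str]:
    """Default to every scorer when the request leaves the list empty."""
    if not requested:
        return list(ALL_SCORERS)
    unknown = set(requested) - set(ALL_SCORERS)
    if unknown:
        raise ValueError(f"unknown scorers: {sorted(unknown)}")
    # Keep a stable order regardless of how the request listed them.
    return [s for s in ALL_SCORERS if s in requested]


def enrich(per_frame: list[tuple[float, list[str]]], mode: str = "max"):
    """Merge per-frame (quality, tags) pairs into (quality_score, scene_tags)."""
    if not per_frame:
        return None, []
    qualities = [q for q, _ in per_frame]
    if mode == "max":
        quality = max(qualities)
    elif mode == "mean":
        quality = sum(qualities) / len(qualities)
    else:
        raise ValueError("mode must be 'max' or 'mean'")
    # Union of tags across frames, sorted for deterministic output.
    tags = sorted({t for _, frame_tags in per_frame for t in frame_tags})
    return quality, tags
```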

Phase 4 — Consumer wiring + backfill (cocotte side, coordinated)

cocotte: add VideoClassifier adapter (calls /classify-video), flip the isClassifiableImage skip branch to route video → adapter, map result → ContentAssetDraft.
cocotte: poster-frame variant of the platform image proxy so the cockpit thumbnails video assets.
Backfill the ~306 skipped videos (one corrected drain pass; ~24 frame-inferences each — confirm GPU budget).
Exit: videos appear as classified content_assets in the cockpit with poster thumbnails.

Risks / watch-items

GPU cost is now variable (scene-aware sampling, decision 2): inferences/video = (scenes, clamped) × scorers. The clamp [min,max] band is the cost lever — set it deliberately and re-estimate the ~306-video backfill against model-boss lease accounting. Batch endpoints (/scan/batch) collapse HTTP fan-out but not GPU work.
Scene detection edge cases: single-shot clips (no cuts) → even-N fallback; hyper-cut clips → clamp at max. Both covered by the clamp band, but unit-test the boundaries.
HEVC/odd codecs: cv2/ffmpeg coverage is not guaranteed; surface unsupported codecs as a terminal failed status, not crashes.
Poster frame for explicit videos: the cockpit grid shows the poster; for explicit content the poster should respect the same content-warning treatment the image path uses (the cockpit already overlays explicit tiles).
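The cost formula in the first watch-item can be turned into a quick bounds check for the backfill. The clamp band [4, 8] and the scorer count of 3 below are placeholder values, not decided numbers; they were picked only because 8 frames × 3 scorers reproduces the ~24 inferences per video quoted for the backfill.

```python
def inferences_per_video(scenes: int, min_frames: int, max_frames: int, n_scorers: int) -> int:
    """inferences/video = (scenes, clamped to the band) x scorers."""
    frames = min(max(scenes, min_frames), max_frames)
    return frames * n_scorers


def backfill_bounds(n_videos: int, min_frames: int, max_frames: int, n_scorers: int) -> tuple[int, int]:
    """Best-case and worst-case total inferences for a backfill."""
    low = n_videos * min_frames * n_scorers
    high = n_videos * max_frames * n_scorers
    return low, high


# Placeholder band and scorer count; substitute the Phase-0 values.
low, high = backfill_bounds(n_videos=306, min_frames=4, max_frames=8, n_scorers=3)
print(f"backfill inferences: between {low} and {high}")
```

Under those placeholder values the 306-video backfill lands between 3672 and 7344 inferences, so doubling the clamp maximum roughly doubles the worst case.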

Test strategy

Unit: keyframe sampling (count/timestamps), MAX explicitness aggregation, failed-codec → terminal status, poster selection.
Integration: one real .mov from mac-sync end-to-end (POST → poll → result), one corrupt file (→ failed).
Contract: the consumer's VideoClassifier against the Phase-0 stub, then against the live endpoint.
