# Video Ingestion: Implementation Plan

**Stream**: video-ingestion

**Status**: 🛠️ Phases 0–4 BUILT & unit-green (2026-06-08); live GPU/integration verification is the only remaining gate. See README "Build manifest" + STATUS.

> **Build note (scorer parity):** Phase 2 was built against **model-boss `/v1/vision/score` + the shared rubric** (calibration parity for the K3 `is_explicit` signal), NOT the imajin moderator/semantic/classifier siblings the phases below originally named. The sampling/aggregation/poster structure is unchanged; only the per-frame scorer call differs. `quality_score` comes free from the rubric's third dimension; `scene_tags` stay empty in v1 (the image path emits none either).

**Service**: `imajin-video` (:8010) + existing scorers (`imajin-moderator` :8008, `imajin-semantic` :8005, `imajin-classifier` :8012)

> Build order is consumer-value-first: the platform's blocking need is **explicitness + a poster frame** (so videos become safe, thumbnailable `content_assets`). Quality/scene tags refine planning and can follow.

---

## Phase 0: Contract + stub (½ day)

- Contract is signed off (README "Resolved decisions"). Remaining Phase-0 calls: the **short-clip duration threshold** for the sync variant, the scene-detector **clamp band** `[min,max]` frames, and the **poster reconciliation** (decision 4: option A, inline bytes, vs. option B, write-only creds; default A).
- Add both routes to `imajin-video` (`src/api/routes/classify_video.py`): `POST /classify-video` (async → job) and `POST /classify-video/sync`, both returning a **stubbed** result so the consumer can wire/test in parallel.
- Define the schema in `src/models/types.py` (`ClassifyVideoRequest`, `VideoClassification`, `FrameScore`), accepting `video_base64`, `keyframes: int | null`, `scorers: list`.
- **Exit**: consumer can POST video bytes (async + sync) and get a well-formed (fake) result.
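
As a rough illustration of the Phase 0 schema and stub, here is a minimal sketch using stdlib dataclasses. The type names and the `video_base64`, `keyframes`, `scorers`, `is_explicit`, `explicitness`, `poster_frame_index`, `poster_b64`, `quality_score` and `scene_tags` fields come from this plan; everything else (`status`, `reason`, `frames`, the `nsfw`/`quality` frame fields, `stub_classify`, and the choice of dataclasses over whatever model library `src/models/types.py` actually uses) is an assumption made for illustration.

```python
from __future__ import annotations

import base64
import binascii
from dataclasses import dataclass, field


@dataclass
class ClassifyVideoRequest:
    video_base64: str                  # video bytes, base64-encoded
    keyframes: int | None = None       # None lets the sampler decide
    scorers: list[str] = field(default_factory=list)  # empty = default set (assumption)


@dataclass
class FrameScore:
    index: int
    t_seconds: float
    nsfw: float = 0.0                  # hypothetical field name
    quality: float = 0.0               # hypothetical field name


@dataclass
class VideoClassification:
    status: str                        # hypothetical: "done" or "failed"
    is_explicit: bool = False
    explicitness: float = 0.0
    poster_frame_index: int | None = None
    poster_b64: str | None = None      # decision 4, option A: inline JPEG bytes
    quality_score: float | None = None
    scene_tags: list[str] = field(default_factory=list)
    frames: list[FrameScore] = field(default_factory=list)
    reason: str | None = None          # hypothetical: set only when status == "failed"


def stub_classify(req: ClassifyVideoRequest) -> VideoClassification:
    """Return a fixed, well-formed result without calling any scorer."""
    try:
        # Reject malformed payloads as a terminal "failed" result, not an exception.
        base64.b64decode(req.video_base64, validate=True)
    except (binascii.Error, ValueError) as exc:
        return VideoClassification(status="failed", reason=f"bad video_base64: {exc}")
    return VideoClassification(
        status="done",
        poster_frame_index=0,
        poster_b64=base64.b64encode(b"stub-jpeg").decode(),
        frames=[FrameScore(index=0, t_seconds=0.0)],
    )
```

A consumer-side contract test could call `stub_classify` (or the stubbed route wrapping it) with both a valid and a malformed payload and check only the result shape, since the values are fake by design.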

## Phase 1: Scene-change keyframe sampling (1 day)

- New sampler (generalize from `_extract_sample_frames` at `protection_processor.py:44`): **content-aware scene-change detection** (decision 2) using cv2 frame-diff / PySceneDetect-style cut detection → one keyframe per scene, returning `(index, t_seconds, jpeg_bytes)`. **Clamp to `[min,max]`**; fall back to even-N sampling when cuts are too few/many.
- Input bytes come from the request (`video_base64`): **no MinIO read on imajin** (decision 5). Decode via the existing cv2/ffmpeg path.
- Unit-test: a multi-scene clip yields ~one frame/scene; clamp band respected; a single-shot clip falls back to even-N; each frame decodes.
- **Exit**: real scene keyframes extracted from streamed `.mov` bytes.

## Phase 2: Frame scoring + aggregation (1 day) ← consumer-unblocking

- Score keyframes via `imajin-moderator` (NSFW + age, mandatory). Prefer **`POST /scan/batch`** (`main.py:456`, BatchScanResult), which sends all frames in one call, over N per-frame `/scan` httpx calls (mirror the httpx pattern at `protection_processor.py:77–87`).
- Aggregate per README semantics: **`is_explicit` = MAX across frames** (any explicit frame → explicit video); `explicitness` from the max NSFW/suggestive scores.
- Pick `poster_frame_index` = highest-quality SFW-leaning frame; return the poster as **inline JPEG bytes** (`poster_b64`, decision 4 option A, platform persists), or write to MinIO + return `poster_key` (option B). Default A unless reconciled otherwise at Phase 0.
- **Decode/codec failures → terminal `failed` status with reason**, never a bare 5xx (consumer-incident lesson, README §"Hard-won context").
- Unit-test the MAX aggregation (mixed sfw/explicit frames → explicit) and the failed-codec path.
- **Exit**: a real video → correct `is_explicit` + poster.
  **Consumer can ingest videos safely.**

## Phase 3: Quality + scene enrichment (½ day)

- Add `imajin-semantic /detect` (and/or `imajin-classifier /classify`) per frame for `quality_score` (max or mean) + `scene_tags` (union).
- Make the scorer set a request flag (`scorers: [...]`), default all.
- **Exit**: full result shape populated; planner gets quality + tags.

## Phase 4: Consumer wiring + backfill (cocotte side, coordinated)

- cocotte: add `VideoClassifier` adapter (calls `/classify-video`), flip the `isClassifiableImage` skip branch to route video → adapter, map result → `ContentAssetDraft`.
- cocotte: poster-frame variant of the platform image proxy so the cockpit thumbnails video assets.
- Backfill the ~306 skipped videos (one corrected drain pass; ~24 frame-inferences each; confirm GPU budget).
- **Exit**: videos appear as classified `content_assets` in the cockpit with poster thumbnails.

---

## Risks / watch-items

- **GPU cost is now variable** (scene-aware sampling, decision 2): inferences/video = (scenes, clamped) × scorers. The clamp `[min,max]` band is the cost lever, so set it deliberately and re-estimate the ~306-video backfill against model-boss lease accounting. Batch endpoints (`/scan/batch`) collapse HTTP fan-out but not GPU work.
- **Scene detection edge cases**: single-shot clips (no cuts) → even-N fallback; hyper-cut clips → clamp at `max`. Both covered by the clamp band, but unit-test the boundaries.
- **HEIC/odd codecs**: cv2/ffmpeg coverage varies; surface unsupported codecs as `failed`, not crashes.
- **Poster frame for explicit videos**: the cockpit grid shows the poster; for explicit content the poster should respect the same content-warning treatment the image path uses (the cockpit already overlays explicit tiles).

---

## Test strategy

- **Unit**: keyframe sampling (count/timestamps), MAX explicitness aggregation, failed-codec → terminal status, poster selection.
- **Integration**: one real `.mov` from mac-sync end-to-end (POST → poll → result), one corrupt file (→ `failed`).
- **Contract**: the consumer's `VideoClassifier` against the Phase-0 stub, then against the live endpoint.
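
The MAX-aggregation and poster-selection unit tests listed above could look roughly like the sketch below. The `FrameScore` fields, the `EXPLICIT_THRESHOLD` value, the `aggregate` helper and the "least explicit frame" fallback for all-explicit videos are hypothetical; the real threshold and score names would come from the shared rubric.

```python
from dataclasses import dataclass


@dataclass
class FrameScore:
    index: int
    nsfw: float      # hypothetical per-frame explicitness score in [0, 1]
    quality: float   # hypothetical per-frame quality score in [0, 1]


EXPLICIT_THRESHOLD = 0.5  # hypothetical cut-off, for illustration only


def aggregate(frames):
    """MAX aggregation: a single explicit frame makes the whole video explicit."""
    if not frames:
        raise ValueError("no frames to aggregate")
    explicitness = max(f.nsfw for f in frames)
    is_explicit = explicitness >= EXPLICIT_THRESHOLD
    # Poster: highest-quality frame among the SFW ones.
    sfw = [f for f in frames if f.nsfw < EXPLICIT_THRESHOLD]
    if sfw:
        poster = max(sfw, key=lambda f: f.quality)
    else:
        # Assumption: with no SFW frame, fall back to the least explicit one.
        poster = min(frames, key=lambda f: f.nsfw)
    return is_explicit, explicitness, poster.index


def test_mixed_frames_are_explicit():
    frames = [
        FrameScore(index=0, nsfw=0.05, quality=0.90),
        FrameScore(index=1, nsfw=0.92, quality=0.95),  # explicit, best quality
        FrameScore(index=2, nsfw=0.10, quality=0.60),
    ]
    is_explicit, explicitness, poster_index = aggregate(frames)
    assert is_explicit is True
    assert explicitness == 0.92
    assert poster_index == 0  # frame 1 is skipped despite its higher quality


def test_all_sfw_frames_are_not_explicit():
    frames = [FrameScore(0, 0.01, 0.4), FrameScore(1, 0.02, 0.8)]
    assert aggregate(frames) == (False, 0.02, 1)
```

Keeping aggregation a pure function over per-frame scores means these tests need no GPU, scorer service, or video decoding.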