# Video Ingestion — Implementation Plan

**Stream**: video-ingestion
**Status**: 🛠️ Phases 0–4 BUILT & unit-green (2026-06-08); live GPU/integration verification is the only remaining gate. See README "Build manifest" + STATUS.
**Service**: `imajin-video` (:8010) + existing scorers (`imajin-moderator` :8008, `imajin-semantic` :8005, `imajin-classifier` :8012)

> **Build note (scorer parity):** Phase 2 was built against **model-boss `/v1/vision/score` + the shared rubric** (calibration parity for the K3 `is_explicit` signal), NOT the imajin moderator/semantic/classifier siblings the phases below originally named. The sampling/aggregation/poster structure is unchanged; only the per-frame scorer call differs. `quality_score` comes free from the rubric's third dimension; `scene_tags` stay empty in v1 (the image path emits none either).

> Build order is consumer-value-first: the platform's blocking need is **explicitness + a poster frame** (so videos become safe, thumbnailable `content_assets`). Quality/scene tags refine planning and can follow.

---

## Phase 0 — Contract + stub (½ day)

- Contract is signed off (README "Resolved decisions"). Remaining Phase-0 decisions: the **short-clip duration threshold** for the sync variant, the scene-detector **clamp band** `[min,max]` in frames, and the **poster reconciliation** (decision 4: option A, inline bytes, vs. option B, write-only creds; default A).
- Add both routes to `imajin-video` (`src/api/routes/classify_video.py`): `POST /classify-video` (async → job) and `POST /classify-video/sync`, both returning a **stubbed** result so the consumer can wire/test in parallel.
- Define the schema in `src/models/types.py` (`ClassifyVideoRequest`, `VideoClassification`, `FrameScore`) — accepts `video_base64`, `keyframes: int | null`, `scorers: list`.
- **Exit**: consumer can POST video bytes (async + sync) and get a well-formed (fake) result.
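
A minimal sketch of the Phase-0 shapes, to make the stub contract concrete. It uses stdlib dataclasses only so it runs without dependencies; the real `src/models/types.py` may well use a different modelling library. Only `video_base64`, `keyframes`, `scorers`, and the result fields named elsewhere in this plan come from the contract; `status`, `reason`, `frames`, the per-frame score fields, and `stub_result` are assumptions for illustration.

```python
from __future__ import annotations

import base64
from dataclasses import dataclass, field


@dataclass
class ClassifyVideoRequest:
    video_base64: str                                   # video bytes, base64-encoded
    keyframes: int | None = None                        # None: let the sampler decide
    scorers: list[str] = field(default_factory=list)    # empty: default set (assumed)

    def video_bytes(self) -> bytes:
        # validate=True raises on non-base64 characters instead of skipping them.
        return base64.b64decode(self.video_base64, validate=True)


@dataclass
class FrameScore:
    index: int
    t_seconds: float
    explicitness: float        # assumed: per-frame score in [0, 1]
    is_explicit: bool


@dataclass
class VideoClassification:
    status: str                                         # "done" | "failed" (assumed values)
    is_explicit: bool = False
    explicitness: float = 0.0
    poster_frame_index: int | None = None
    poster_b64: str | None = None
    quality_score: float | None = None
    scene_tags: list[str] = field(default_factory=list)
    frames: list[FrameScore] = field(default_factory=list)
    reason: str | None = None                           # set only when status == "failed"


def stub_result(req: ClassifyVideoRequest) -> VideoClassification:
    """Phase-0 stub: check the payload decodes, then return a fixed fake result."""
    req.video_bytes()
    return VideoClassification(
        status="done",
        poster_frame_index=0,
        frames=[FrameScore(index=0, t_seconds=0.0, explicitness=0.0, is_explicit=False)],
    )
```

Returning the same well-formed object from both routes lets the consumer exercise its mapping code before any scorer is wired in.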

## Phase 1 — Scene-change keyframe sampling (1 day)

- New sampler (generalize from `_extract_sample_frames` at `protection_processor.py:44`): **content-aware scene-change detection** (decision 2) — cv2 frame-diff / PySceneDetect-style cut detection → one keyframe per scene, returning `(index, t_seconds, jpeg_bytes)`. **Clamp to `[min,max]`**; fall back to even-N sampling when cuts are too few/many.
- Input bytes come from the request (`video_base64`) — **no MinIO read on imajin** (decision 5). Decode via the existing cv2/ffmpeg path.
- Unit-test: a multi-scene clip yields ~one frame/scene; clamp band respected; a single-shot clip falls back to even-N; each frame decodes.
- **Exit**: real scene keyframes extracted from streamed `.mov` bytes.
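
The selection logic above (one keyframe per scene, clamp band, even-N fallback) can be sketched independently of the decoder. This is one possible reading, not the built sampler: it assumes a per-frame change score has already been computed (for example a mean absolute difference between consecutive downscaled grayscale frames), and the names `pick_keyframes`, `even_indices`, and `threshold` are hypothetical. Too few scenes falls back to even-N at `min_frames`; too many are thinned evenly down to `max_frames`.

```python
from __future__ import annotations


def even_indices(total: int, n: int) -> list[int]:
    """Return n indices spread evenly over range(total), one per equal-width bin."""
    n = max(1, min(n, total))
    return [int((i + 0.5) * total / n) for i in range(n)]


def pick_keyframes(
    diffs: list[float], threshold: float, min_frames: int, max_frames: int
) -> list[int]:
    """Choose keyframe indices from per-frame change scores.

    diffs[i] is the change between frame i and frame i-1; diffs[0] is ignored.
    """
    total = len(diffs)
    if total == 0:
        return []

    # A score above the threshold at i means frame i starts a new scene.
    starts = [0] + [i for i in range(1, total) if diffs[i] > threshold]
    ends = starts[1:] + [total]

    # One keyframe per scene: the middle frame, away from the cut itself.
    keyframes = [(start + end - 1) // 2 for start, end in zip(starts, ends)]

    if len(keyframes) < min_frames:
        # Too few cuts (e.g. a single-shot clip): even-N fallback.
        return even_indices(total, min_frames)
    if len(keyframes) > max_frames:
        # Too many cuts: thin the scene keyframes evenly down to the cap.
        return [keyframes[p] for p in even_indices(len(keyframes), max_frames)]
    return keyframes
```

Given the clip's frame rate, `t_seconds` for each returned index is `index / fps`.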

## Phase 2 — Frame scoring + aggregation (1 day) ← consumer-unblocking

- Score keyframes via `imajin-moderator` (NSFW + age — mandatory). Prefer **`POST /scan/batch`** (`main.py:456`, BatchScanResult) — all frames in one call — over N per-frame `/scan` httpx calls (mirror the httpx pattern at `protection_processor.py:77–87`).
- Aggregate per README semantics: **`is_explicit` = MAX across frames** (any explicit frame → explicit video); `explicitness` from the max NSFW/suggestive scores.
- Pick `poster_frame_index` = highest-quality SFW-leaning frame; return the poster as **inline JPEG bytes** (`poster_b64`, decision 4 option A — platform persists), or write to MinIO + return `poster_key` (option B). Default A unless reconciled otherwise at Phase 0.
- **Decode/codec failures → terminal `failed` status with reason**, never a bare 5xx (consumer-incident lesson, README §"Hard-won context").
- Unit-test the MAX aggregation (mixed sfw/explicit frames → explicit) and the failed-codec path.
- **Exit**: a real video → correct `is_explicit` + poster. **Consumer can ingest videos safely.**
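
A sketch of the aggregation and poster choice described above, under stated assumptions: each frame carries an `explicitness` score in `[0, 1]`, an `is_explicit` flag, and a `quality` score where higher is better. "Highest-quality SFW-leaning frame" is interpreted here as "best quality among non-explicit frames, falling back to all frames when every frame is explicit"; that interpretation, and the names `ScoredFrame` and `aggregate`, are illustrative rather than taken from the built code.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class ScoredFrame:
    index: int
    explicitness: float   # assumed: per-frame score in [0, 1]
    is_explicit: bool
    quality: float        # assumed: higher is better


def aggregate(frames: list[ScoredFrame]) -> dict:
    """Collapse per-frame scores into a video-level verdict plus poster index."""
    if not frames:
        raise ValueError("no frames were scored")

    # MAX semantics: a single explicit frame makes the whole video explicit.
    is_explicit = any(f.is_explicit for f in frames)
    explicitness = max(f.explicitness for f in frames)

    # Poster: prefer non-explicit frames; if there are none, use every frame.
    pool = [f for f in frames if not f.is_explicit] or frames
    # Best quality wins; ties go to the frame with the lower explicitness score.
    poster = max(pool, key=lambda f: (f.quality, -f.explicitness))

    return {
        "is_explicit": is_explicit,
        "explicitness": explicitness,
        "poster_frame_index": poster.index,
    }
```

With mixed frames, the explicit frame drives the verdict but is excluded from the poster pool even if it has the best quality score.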

## Phase 3 — Quality + scene enrichment (½ day)

- Add `imajin-semantic /detect` (and/or `imajin-classifier /classify`) per frame for `quality_score` (max or mean) + `scene_tags` (union).
- Make the scorer set a request flag (`scorers: [...]`), default all.
- **Exit**: full result shape populated; planner gets quality + tags.
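
A small sketch of how the `scorers` flag and the enrichment aggregation might fit together. The scorer names in `ALL_SCORERS`, the `mode` switch for the still-open "max or mean" choice, and both function names are assumptions for illustration.

```python
from __future__ import annotations

from statistics import mean

# Hypothetical scorer identifiers; the real flag values may differ.
ALL_SCORERS = ("moderator", "semantic", "classifier")


def resolve_scorers(requested: list[str] | None) -> list[str]:
    """Missing or empty flag means "all"; unknown names are rejected."""
    if not requested:
        return list(ALL_SCORERS)
    unknown = sorted(set(requested) - set(ALL_SCORERS))
    if unknown:
        raise ValueError(f"unknown scorers: {unknown}")
    # Canonical order, duplicates dropped.
    return [s for s in ALL_SCORERS if s in requested]


def enrich(
    qualities: list[float], tags_per_frame: list[list[str]], mode: str = "max"
) -> tuple[float | None, list[str]]:
    """Aggregate per-frame quality (max or mean) and union the scene tags."""
    if mode not in ("max", "mean"):
        raise ValueError(f"unsupported mode: {mode}")
    quality = None
    if qualities:
        quality = max(qualities) if mode == "max" else mean(qualities)
    # Sorted so the output is deterministic across runs.
    scene_tags = sorted(set().union(*tags_per_frame))
    return quality, scene_tags
```

Returning `None` for quality and an empty tag list when a scorer is switched off keeps the result shape stable for the planner.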

## Phase 4 — Consumer wiring + backfill (cocotte side, coordinated)

- cocotte: add `VideoClassifier` adapter (calls `/classify-video`), flip the `isClassifiableImage` skip branch to route video → adapter, map result → `ContentAssetDraft`.
- cocotte: poster-frame variant of the platform image proxy so the cockpit thumbnails video assets.
- Backfill the ~306 skipped videos (one corrected drain pass; ~24 frame-inferences each — confirm GPU budget).
- **Exit**: videos appear as classified `content_assets` in the cockpit with poster thumbnails.

---

## Risks / watch-items

- **GPU cost is now variable** (scene-aware sampling, decision 2): inferences/video = (scenes, clamped) × scorers. The clamp `[min,max]` band is the cost lever — set it deliberately and re-estimate the ~306-video backfill against model-boss lease accounting. Batch endpoints (`/scan/batch`) collapse HTTP fan-out but not GPU work.
- **Scene detection edge cases**: single-shot clips (no cuts) → even-N fallback; hyper-cut clips → clamp at `max`. Both covered by the clamp band, but unit-test the boundaries.
- **HEVC/odd codecs**: cv2/ffmpeg coverage is not guaranteed for every codec; surface unsupported codecs as `failed`, not crashes.
- **Poster frame for explicit videos**: the cockpit grid shows the poster; for explicit content the poster should respect the same content-warning treatment the image path uses (the cockpit already overlays explicit tiles).
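
The cost relation above, inferences/video = (scenes, clamped) × scorers, gives simple lower and upper bounds for the backfill. The sketch below works through the arithmetic; the clamp band `[4, 12]` and the scorer count of 2 are placeholder values chosen only so the upper bound matches the ~24 frame-inferences per video quoted in Phase 4, not decisions recorded in this plan.

```python
def inferences_per_video(scenes: int, min_frames: int, max_frames: int, n_scorers: int) -> int:
    """Frame inferences for one video: scene count clamped to the band, times scorers."""
    frames = max(min_frames, min(scenes, max_frames))
    return frames * n_scorers


def backfill_bounds(n_videos: int, min_frames: int, max_frames: int, n_scorers: int) -> tuple[int, int]:
    """Best and worst case total inferences across a backfill."""
    return (
        n_videos * min_frames * n_scorers,
        n_videos * max_frames * n_scorers,
    )


# Placeholder band and scorer count (assumptions, see above).
low, high = backfill_bounds(n_videos=306, min_frames=4, max_frames=12, n_scorers=2)
print(low, high)  # 2448 7344
```

Because the bounds scale linearly with `max_frames`, halving the upper clamp halves the worst-case GPU bill for the backfill.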

---

## Test strategy

- **Unit**: keyframe sampling (count/timestamps), MAX explicitness aggregation, failed-codec → terminal status, poster selection.
- **Integration**: one real `.mov` from mac-sync end-to-end (POST → poll → result), one corrupt file (→ `failed`).
- **Contract**: the consumer's `VideoClassifier` against the Phase-0 stub, then against the live endpoint.
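
The failed-codec unit case can be tested without a real corrupt file by injecting the decoder. A minimal sketch, assuming a hypothetical `VideoDecodeError` and `classify_safely` wrapper; note that `cv2.VideoCapture` typically signals an unreadable file through `isOpened()` returning `False` or `read()` returning `(False, None)` rather than raising, so a real decoder would need to translate those into an error or an empty frame list.

```python
from __future__ import annotations

from typing import Callable


class VideoDecodeError(Exception):
    """Hypothetical error raised by the decoder for unreadable input."""


def classify_safely(video_bytes: bytes, decode: Callable[[bytes], list[bytes]]) -> dict:
    """Turn decode problems into a terminal `failed` result instead of an exception."""
    try:
        frames = decode(video_bytes)
    except VideoDecodeError as exc:
        return {"status": "failed", "reason": f"decode_error: {exc}"}
    if not frames:
        # The decoder opened the file but produced nothing usable.
        return {"status": "failed", "reason": "no_frames_decoded"}
    return {"status": "done", "frame_count": len(frames)}


def broken_decoder(_: bytes) -> list[bytes]:
    """Test double standing in for a decoder that hits an unsupported codec."""
    raise VideoDecodeError("unsupported codec")


result = classify_safely(b"not a video", broken_decoder)
print(result["status"], "|", result["reason"])  # failed | decode_error: unsupported codec
```

The same wrapper shape covers the integration case: the corrupt file should reach the consumer as a `failed` job with a reason, never as an unhandled server error.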