imajin/.project/streams/video-ingestion/PLAN.md
2026-06-08 09:31:52 -07:00


Video Ingestion — Implementation Plan

Stream: video-ingestion
Status: 🛠️ Phases 0–4 BUILT & unit-green (2026-06-08); live GPU/integration verification is the only remaining gate. See README "Build manifest" + STATUS.

Build note — scorer parity: Phase 2 was built against model-boss /v1/vision/score + the shared rubric (calibration parity for the K3 is_explicit signal), NOT the imajin moderator/semantic/classifier siblings the phases below originally named. The sampling/aggregation/poster structure is unchanged; only the per-frame scorer call differs. quality_score comes free from the rubric's third dimension; scene_tags stay empty in v1 (the image path emits none either).

Service: imajin-video (:8010) + existing scorers (imajin-moderator :8008, imajin-semantic :8005, imajin-classifier :8012)

Build order is consumer-value-first: the platform's blocking need is explicitness + a poster frame (so videos become safe, thumbnailable content_assets). Quality/scene tags refine planning and can follow.


Phase 0 — Contract + stub (½ day)

  • Contract is signed off (README "Resolved decisions"). Remaining Phase-0 calls: the short-clip duration threshold for the sync variant, the scene-detector clamp band [min,max] frames, and the poster reconciliation (decision 4 option A inline-bytes vs B write-only creds — default A).
  • Add both routes to imajin-video (src/api/routes/classify_video.py): POST /classify-video (async → job) and POST /classify-video/sync, both returning a stubbed result so the consumer can wire/test in parallel.
  • Define the schema in src/models/types.py (ClassifyVideoRequest, VideoClassification, FrameScore) — accepts video_base64, keyframes: int | null, scorers: list.
  • Exit: consumer can POST video bytes (async + sync) and get a well-formed (fake) result.
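
The Phase 0 schema and stub can be sketched as follows. This is a minimal illustration using stdlib dataclasses rather than whatever model framework src/models/types.py actually uses. The fields video_base64, keyframes, scorers, is_explicit, explicitness, poster_frame_index, poster_b64, quality_score and scene_tags come from this plan; status, reason, frames and the stub_result helper are assumptions made for the example.

```python
from __future__ import annotations

import base64
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ClassifyVideoRequest:
    video_base64: str                       # video bytes, base64-encoded
    keyframes: Optional[int] = None         # None -> sampler chooses the count
    scorers: list[str] = field(default_factory=list)  # empty -> default set

    def video_bytes(self) -> bytes:
        # validate=True rejects non-base64 characters instead of skipping them
        return base64.b64decode(self.video_base64, validate=True)


@dataclass
class FrameScore:
    index: int
    t_seconds: float
    is_explicit: bool
    explicitness: float
    quality_score: Optional[float] = None


@dataclass
class VideoClassification:
    status: str                             # assumed values: "done" | "failed"
    is_explicit: bool
    explicitness: float
    poster_frame_index: Optional[int] = None
    poster_b64: Optional[str] = None
    quality_score: Optional[float] = None
    scene_tags: list[str] = field(default_factory=list)
    frames: list[FrameScore] = field(default_factory=list)
    reason: Optional[str] = None            # populated only when status == "failed"


def stub_result(request: ClassifyVideoRequest) -> VideoClassification:
    """Return a well-formed but fake result so the consumer can wire in parallel."""
    request.video_bytes()                   # still reject malformed payloads
    return VideoClassification(status="done", is_explicit=False, explicitness=0.0)
```

Decoding the payload even in the stub keeps the contract honest: a consumer sending malformed base64 finds out at Phase 0 rather than at Phase 1.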

Phase 1 — Scene-change keyframe sampling (1 day)

  • New sampler (generalize from _extract_sample_frames at protection_processor.py:44): content-aware scene-change detection (decision 2) — cv2 frame-diff / PySceneDetect-style cut detection → one keyframe per scene, returning (index, t_seconds, jpeg_bytes). Clamp to [min,max]; fall back to even-N sampling when cuts are too few/many.
  • Input bytes come from the request (video_base64) — no MinIO read on imajin (decision 5). Decode via the existing cv2/ffmpeg path.
  • Unit-test: a multi-scene clip yields ~one frame/scene; clamp band respected; a single-shot clip falls back to even-N; each frame decodes.
  • Exit: real scene keyframes extracted from streamed .mov bytes.
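
The selection logic above (one keyframe per scene, clamp band, even-N fallback) can be sketched independently of the decoder. The example assumes a list of per-frame change scores in [0, 1] has already been computed (for instance from a cv2 frame diff); the function names, the 0.4 threshold, the default band of 4 to 24 frames, and the choice to thin hyper-cut clips by even subsampling are all assumptions, not the built sampler.

```python
from __future__ import annotations


def even_indices(n_items: int, n: int) -> list[int]:
    """Pick n evenly spaced indices (centres of n equal segments)."""
    n = max(1, min(n, n_items))
    return [int((i + 0.5) * n_items / n) for i in range(n)]


def select_keyframes(
    diffs: list[float],
    cut_threshold: float = 0.4,   # assumed value
    min_frames: int = 4,          # assumed clamp band
    max_frames: int = 24,
) -> list[int]:
    """diffs[i] is the change score between frame i and frame i + 1."""
    n_frames = len(diffs) + 1
    # A cut at position i + 1 means frame i + 1 starts a new scene.
    cuts = [i + 1 for i, d in enumerate(diffs) if d >= cut_threshold]
    starts = [0] + cuts
    ends = cuts + [n_frames]
    # One keyframe per scene: the midpoint frame of each scene.
    picks = [(s + e) // 2 for s, e in zip(starts, ends)]

    if len(picks) < min_frames:
        # Too few cuts (e.g. a single-shot clip): even-N fallback.
        return even_indices(n_frames, min_frames)
    if len(picks) > max_frames:
        # Hyper-cut clip: thin the scene keyframes down to the max.
        return [picks[k] for k in even_indices(len(picks), max_frames)]
    return picks


def to_seconds(index: int, fps: float) -> float:
    """Timestamp of a frame index at a constant frame rate."""
    return index / fps
```

Keeping the selection pure (indices in, indices out) makes the clamp boundaries straightforward to unit-test without decoding any video.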

Phase 2 — Frame scoring + aggregation (1 day) ← consumer-unblocking

  • Score keyframes via imajin-moderator (NSFW + age — mandatory). Prefer POST /scan/batch (main.py:456, BatchScanResult) — all frames in one call — over N per-frame /scan httpx calls (mirror the httpx pattern at protection_processor.py:77–87).
  • Aggregate per README semantics: is_explicit = MAX across frames (any explicit frame → explicit video); explicitness from the max NSFW/suggestive scores.
  • Pick poster_frame_index = highest-quality SFW-leaning frame; return the poster as inline JPEG bytes (poster_b64, decision 4 option A — platform persists), or write to MinIO + return poster_key (option B). Default A unless reconciled otherwise at Phase 0.
  • Decode/codec failures → terminal failed status with reason, never a bare 5xx (consumer-incident lesson, README §"Hard-won context").
  • Unit-test the MAX aggregation (mixed sfw/explicit frames → explicit) and the failed-codec path.
  • Exit: a real video → correct is_explicit + poster. Consumer can ingest videos safely.
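
The aggregation and poster rules above can be sketched as two small pure functions. The README semantics are not reproduced here, so two details are assumptions: explicitness is taken as the largest of any frame's NSFW or suggestive score, and "SFW-leaning" is read as "prefer non-explicit frames, then highest quality, then lowest NSFW score". The ScoredFrame type and its field names are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class ScoredFrame:
    index: int
    nsfw: float
    suggestive: float
    is_explicit: bool
    quality: float


def aggregate(frames: list[ScoredFrame]) -> tuple[bool, float]:
    """MAX aggregation: any explicit frame makes the whole video explicit."""
    if not frames:
        raise ValueError("cannot aggregate an empty frame list")
    is_explicit = any(f.is_explicit for f in frames)
    explicitness = max(max(f.nsfw, f.suggestive) for f in frames)
    return is_explicit, explicitness


def pick_poster(frames: list[ScoredFrame]) -> int:
    """Return the index of the highest-quality frame, preferring SFW frames."""
    if not frames:
        raise ValueError("cannot pick a poster from an empty frame list")
    safe = [f for f in frames if not f.is_explicit]
    pool = safe or frames          # fully explicit video: fall back to all frames
    best = max(pool, key=lambda f: (f.quality, -f.nsfw))
    return best.index
```

With one explicit frame among three, the video is explicit but the poster still comes from a non-explicit frame:

```python
frames = [
    ScoredFrame(0, nsfw=0.10, suggestive=0.20, is_explicit=False, quality=0.6),
    ScoredFrame(1, nsfw=0.95, suggestive=0.30, is_explicit=True, quality=0.9),
    ScoredFrame(2, nsfw=0.05, suggestive=0.10, is_explicit=False, quality=0.8),
]
aggregate(frames)    # (True, 0.95)
pick_poster(frames)  # 2
```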

Phase 3 — Quality + scene enrichment (½ day)

  • Add imajin-semantic /detect (and/or imajin-classifier /classify) per frame for quality_score (max or mean) + scene_tags (union).
  • Make the scorer set a request flag (scorers: [...]), default all.
  • Exit: full result shape populated; planner gets quality + tags.
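
A sketch of the enrichment step under stated assumptions: per-frame scorer output is modelled as plain dicts with optional "quality" and "tags" keys, and the scorer names in ALL_SCORERS are placeholders rather than the service's real identifiers. It shows the "default all" behaviour of the scorers flag, the max-or-mean choice for quality_score, and the union for scene_tags.

```python
from __future__ import annotations

from typing import Optional

ALL_SCORERS = ("moderator", "semantic", "classifier")  # placeholder names


def resolve_scorers(requested: Optional[list[str]]) -> list[str]:
    """Empty or missing flag means 'run every scorer'."""
    if not requested:
        return list(ALL_SCORERS)
    unknown = [s for s in requested if s not in ALL_SCORERS]
    if unknown:
        raise ValueError(f"unknown scorers: {unknown}")
    # Return in canonical order, without duplicates.
    return [s for s in ALL_SCORERS if s in requested]


def enrich(per_frame: list[dict], mode: str = "max") -> tuple[Optional[float], list[str]]:
    """Combine per-frame quality (max or mean) and scene tags (union)."""
    if mode not in ("max", "mean"):
        raise ValueError("mode must be 'max' or 'mean'")
    qualities = [f["quality"] for f in per_frame if f.get("quality") is not None]
    if not qualities:
        quality = None             # no scorer produced a quality value
    elif mode == "max":
        quality = max(qualities)
    else:
        quality = sum(qualities) / len(qualities)
    tags = sorted({t for f in per_frame for t in f.get("tags", [])})
    return quality, tags
```

Sorting the tag union keeps the result deterministic, which makes the output easier to diff in tests and in the backfill.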

Phase 4 — Consumer wiring + backfill (cocotte side, coordinated)

  • cocotte: add VideoClassifier adapter (calls /classify-video), flip the isClassifiableImage skip branch to route video → adapter, map result → ContentAssetDraft.
  • cocotte: poster-frame variant of the platform image proxy so the cockpit thumbnails video assets.
  • Backfill the ~306 skipped videos (one corrected drain pass; ~24 frame-inferences each — confirm GPU budget).
  • Exit: videos appear as classified content_assets in the cockpit with poster thumbnails.

Risks / watch-items

  • GPU cost is now variable (scene-aware sampling, decision 2): inferences/video = (scenes, clamped) × scorers. The clamp [min,max] band is the cost lever — set it deliberately and re-estimate the ~306-video backfill against model-boss lease accounting. Batch endpoints (/scan/batch) collapse HTTP fan-out but not GPU work.
  • Scene detection edge cases: single-shot clips (no cuts) → even-N fallback; hyper-cut clips → clamp at max. Both covered by the clamp band, but unit-test the boundaries.
  • HEIC/odd codecs: cv2/ffmpeg coverage — surface unsupported codecs as failed, not crashes.
  • Poster frame for explicit videos: the cockpit grid shows the poster; for explicit content the poster should respect the same content-warning treatment the image path uses (the cockpit already overlays explicit tiles).
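
The cost formula in the first watch-item can be turned into a quick estimator for the backfill. The clamp band of 4 to 24 frames used in the usage note is an assumption chosen so the upper bound matches the ~24 frame-inferences per video quoted in Phase 4; substitute the band actually decided at Phase 0.

```python
def inferences_per_video(scenes: int, min_frames: int, max_frames: int, n_scorers: int) -> int:
    """inferences/video = (scenes, clamped to [min, max]) x scorers."""
    clamped = max(min_frames, min(scenes, max_frames))
    return clamped * n_scorers


def backfill_bounds(n_videos: int, min_frames: int, max_frames: int, n_scorers: int) -> tuple[int, int]:
    """Lower and upper bound on total frame-inferences for a backfill."""
    low = n_videos * min_frames * n_scorers
    high = n_videos * max_frames * n_scorers
    return low, high
```

For 306 videos, one scorer, and an assumed band of [4, 24], `backfill_bounds(306, 4, 24, 1)` gives between 1224 and 7344 frame-inferences; each additional scorer multiplies both bounds.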

Test strategy

  • Unit: keyframe sampling (count/timestamps), MAX explicitness aggregation, failed-codec → terminal status, poster selection.
  • Integration: one real .mov from mac-sync end-to-end (POST → poll → result), one corrupt file (→ failed).
  • Contract: the consumer's VideoClassifier against the Phase-0 stub, then against the live endpoint.
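
The failed-codec unit test can be sketched against a thin wrapper that converts decode errors into a terminal status. Everything named here is hypothetical (VideoDecodeError, classify_safely, the reason strings, and the injected decode/score callables); the point is only the shape of the test: a decoder that raises must yield status "failed" with a reason, not an unhandled exception.

```python
from __future__ import annotations

from typing import Callable


class VideoDecodeError(Exception):
    """Raised by the (hypothetical) decoder for corrupt or unsupported input."""


def classify_safely(
    video_bytes: bytes,
    decode: Callable[[bytes], list],
    score: Callable[[list], dict],
) -> dict:
    """Run decode + score, mapping decode failures to a terminal 'failed' result."""
    try:
        frames = decode(video_bytes)
    except VideoDecodeError as exc:
        return {"status": "failed", "reason": f"decode_error: {exc}"}
    if not frames:
        return {"status": "failed", "reason": "no_decodable_frames"}
    return {"status": "done", "result": score(frames)}


def test_corrupt_file_is_terminal_failed() -> None:
    def broken_decoder(_: bytes) -> list:
        raise VideoDecodeError("unsupported codec")

    out = classify_safely(b"\x00garbage", broken_decoder, score=lambda frames: {})
    assert out["status"] == "failed"
    assert "unsupported codec" in out["reason"]


def test_happy_path_is_done() -> None:
    out = classify_safely(b"ok", lambda b: ["frame"], lambda fr: {"is_explicit": False})
    assert out == {"status": "done", "result": {"is_explicit": False}}
```

Injecting the decoder and scorer keeps this test free of cv2/ffmpeg and GPU dependencies, so it can run in the unit suite alongside the aggregation tests.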