Video Ingestion — Implementation Plan
Stream: video-ingestion
Status: 🛠️ Phases 0–4 BUILT & unit-green (2026-06-08); live GPU/integration verification is the only remaining gate. See README "Build manifest" + STATUS.
Build note (scorer parity): Phase 2 was built against model-boss /v1/vision/score + the shared rubric (calibration parity for the K3 is_explicit signal), NOT the imajin moderator/semantic/classifier siblings the phases below originally named. The sampling/aggregation/poster structure is unchanged; only the per-frame scorer call differs. quality_score comes free from the rubric's third dimension; scene_tags stay empty in v1 (the image path emits none either).
Service: imajin-video (:8010) + existing scorers (imajin-moderator :8008, imajin-semantic :8005, imajin-classifier :8012)
Build order is consumer-value-first: the platform's blocking need is explicitness + a poster frame (so videos become safe, thumbnailable content_assets). Quality/scene tags refine planning and can follow.
Phase 0 — Contract + stub (½ day)
- Contract is signed off (README "Resolved decisions"). Remaining Phase-0 calls: the short-clip duration threshold for the sync variant, the scene-detector clamp band [min,max] frames, and the poster reconciliation (decision 4: option A inline bytes vs. option B write-only creds; default A).
- Add both routes to imajin-video (src/api/routes/classify_video.py): POST /classify-video (async → job) and POST /classify-video/sync, both returning a stubbed result so the consumer can wire/test in parallel.
- Define the schema in src/models/types.py (ClassifyVideoRequest, VideoClassification, FrameScore); the request accepts video_base64, keyframes: int | null, scorers: list.
- Exit: consumer can POST video bytes (async + sync) and get a well-formed (fake) result.
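A minimal sketch of what the Phase-0 schema and stub could look like. It uses stdlib dataclasses to stay self-contained (the real src/models/types.py may well use another modelling library). Only the three type names and the video_base64, keyframes, and scorers request fields come from the plan; the per-frame score fields, the status values, and the stub_classification helper are assumptions for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ClassifyVideoRequest:
    video_base64: str                      # base64-encoded video bytes
    keyframes: Optional[int] = None        # None: let the sampler decide
    scorers: List[str] = field(default_factory=list)  # empty: default set (assumed)


@dataclass
class FrameScore:
    index: int            # frame index in the source video
    t_seconds: float      # timestamp of the frame
    nsfw: float           # hypothetical per-frame score field
    is_explicit: bool


@dataclass
class VideoClassification:
    status: str                        # "done" or "failed" (assumed values)
    is_explicit: bool
    explicitness: float
    poster_frame_index: Optional[int]
    poster_b64: Optional[str]
    quality_score: Optional[float]
    scene_tags: List[str]
    frames: List[FrameScore]
    reason: Optional[str] = None       # populated only when status == "failed"


def stub_classification(req: ClassifyVideoRequest) -> VideoClassification:
    """Return a well-formed fake result without looking at the video bytes."""
    frame = FrameScore(index=0, t_seconds=0.0, nsfw=0.0, is_explicit=False)
    return VideoClassification(
        status="done",
        is_explicit=False,
        explicitness=0.0,
        poster_frame_index=0,
        poster_b64="",
        quality_score=None,
        scene_tags=[],
        frames=[frame],
    )
```

Both routes can return the same stub body during Phase 0, so the consumer exercises the full result shape before any real sampling or scoring exists.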
Phase 1 — Scene-change keyframe sampling (1 day)
- New sampler (generalize from _extract_sample_frames at protection_processor.py:44): content-aware scene-change detection (decision 2) using cv2 frame-diff / PySceneDetect-style cut detection → one keyframe per scene, returning (index, t_seconds, jpeg_bytes). Clamp to [min,max]; fall back to even-N sampling when cuts are too few/many.
- Input bytes come from the request (video_base64); no MinIO read on imajin (decision 5). Decode via the existing cv2/ffmpeg path.
- Unit-test: a multi-scene clip yields ~one frame/scene; clamp band respected; a single-shot clip falls back to even-N; each frame decodes.
- Exit: real scene keyframes extracted from streamed .mov bytes.
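The clamp-and-fallback rule can be sketched independently of the cut detector. The snippet below assumes the detector has already produced a list of scene-start frame indices; select_keyframes and even_n_indices are hypothetical names, and falling back to even-N at max_frames for hyper-cut clips is one reading of "too many cuts", not a confirmed design.

```python
from typing import List


def even_n_indices(total_frames: int, n: int) -> List[int]:
    """Pick n indices spread evenly over the clip (midpoint of each of n equal segments)."""
    n = max(1, min(n, total_frames))
    return [int((i + 0.5) * total_frames / n) for i in range(n)]


def select_keyframes(
    scene_starts: List[int],
    total_frames: int,
    min_frames: int,
    max_frames: int,
) -> List[int]:
    """One keyframe index per detected scene, kept within [min_frames, max_frames]."""
    if total_frames <= 0:
        return []
    # Drop out-of-range and duplicate cut positions.
    scenes = sorted({i for i in scene_starts if 0 <= i < total_frames})
    if len(scenes) < min_frames:
        # Too few cuts (e.g. a single-shot clip): even-N at the lower bound.
        return even_n_indices(total_frames, min_frames)
    if len(scenes) > max_frames:
        # Too many cuts (hyper-cut clip): even-N at the upper bound.
        return even_n_indices(total_frames, max_frames)
    return scenes
```

Keeping this function pure (indices in, indices out) makes the boundary cases cheap to unit-test without decoding any video.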
Phase 2 — Frame scoring + aggregation (1 day) ← consumer-unblocking
- Score keyframes via imajin-moderator (NSFW + age, mandatory). Prefer POST /scan/batch (main.py:456, BatchScanResult), which sends all frames in one call, over N per-frame /scan httpx calls (mirror the httpx pattern at protection_processor.py:77–87).
- Aggregate per README semantics: is_explicit = MAX across frames (any explicit frame → explicit video); explicitness from the max NSFW/suggestive scores.
- Pick poster_frame_index = highest-quality SFW-leaning frame; return the poster as inline JPEG bytes (poster_b64, decision 4 option A, platform persists), or write to MinIO + return poster_key (option B). Default A unless reconciled otherwise at Phase 0.
- Decode/codec failures → terminal failed status with reason, never a bare 5xx (consumer-incident lesson, README §"Hard-won context").
- Unit-test the MAX aggregation (mixed sfw/explicit frames → explicit) and the failed-codec path.
- Exit: a real video → correct is_explicit + poster. Consumer can ingest videos safely.
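The MAX aggregation and poster pick described above might look roughly like this. The FrameScore fields (nsfw, suggestive, quality, all assumed to lie in 0..1) are illustrative, not the real scorer response shape; taking explicitness as the maximum over both scores, and still choosing a poster when every frame is explicit, are assumptions.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class FrameScore:
    index: int
    nsfw: float         # assumed range 0..1
    suggestive: float   # assumed range 0..1
    quality: float      # assumed range 0..1
    is_explicit: bool


def aggregate(frames: List[FrameScore]) -> Tuple[bool, float, Optional[int]]:
    """Return (is_explicit, explicitness, poster_frame_index) for one video."""
    if not frames:
        return False, 0.0, None
    # MAX semantics: a single explicit frame makes the whole video explicit.
    is_explicit = any(f.is_explicit for f in frames)
    explicitness = max(max(f.nsfw, f.suggestive) for f in frames)
    # Poster: prefer non-explicit frames; rank by quality, then by lower NSFW.
    safe = [f for f in frames if not f.is_explicit]
    pool = safe or frames  # assumption: fall back to all frames if none is safe
    poster = max(pool, key=lambda f: (f.quality, -f.nsfw))
    return is_explicit, explicitness, poster.index
```

In a mixed clip the explicit frame drives is_explicit and explicitness while the poster still comes from the safe frames, which is the behaviour the unit test in this phase is meant to pin down.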
Phase 3 — Quality + scene enrichment (½ day)
- Add imajin-semantic /detect (and/or imajin-classifier /classify) per frame for quality_score (max or mean) + scene_tags (union).
- Make the scorer set a request flag (scorers: [...]), default all.
- Exit: full result shape populated; planner gets quality + tags.
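As a rough illustration of the enrichment step, the sketch below resolves the scorers flag (empty or missing means all) and folds per-frame results into quality_score and scene_tags. The scorer names, the per-frame dict keys "quality" and "tags", and both function names are hypothetical; the real /detect and /classify response shapes are not specified in this plan.

```python
from statistics import mean
from typing import Dict, Iterable, List, Optional

ALL_SCORERS = ("moderator", "semantic", "classifier")  # hypothetical identifiers


def resolve_scorers(requested: Optional[Iterable[str]]) -> List[str]:
    """Empty or missing flag selects every scorer; unknown names are rejected."""
    wanted = list(requested or [])
    if not wanted:
        return list(ALL_SCORERS)
    unknown = [s for s in wanted if s not in ALL_SCORERS]
    if unknown:
        raise ValueError(f"unknown scorers: {unknown}")
    # Keep a stable order regardless of how the caller listed them.
    return [s for s in ALL_SCORERS if s in wanted]


def enrich(frame_results: List[Dict], how: str = "max") -> Dict:
    """Fold per-frame results into quality_score (max or mean) and scene_tags (union)."""
    qualities = [r["quality"] for r in frame_results if r.get("quality") is not None]
    if not qualities:
        quality = None
    elif how == "max":
        quality = max(qualities)
    else:
        quality = mean(qualities)
    tags = sorted({t for r in frame_results for t in r.get("tags", [])})
    return {"quality_score": quality, "scene_tags": tags}
```

Sorting the tag union keeps the output deterministic, which simplifies contract tests on the consumer side.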
Phase 4 — Consumer wiring + backfill (cocotte side, coordinated)
- cocotte: add VideoClassifier adapter (calls /classify-video), flip the isClassifiableImage skip branch to route video → adapter, map result → ContentAssetDraft.
- cocotte: poster-frame variant of the platform image proxy so the cockpit thumbnails video assets.
- Backfill the ~306 skipped videos (one corrected drain pass; ~24 frame-inferences each; confirm GPU budget).
- Exit: videos appear as classified content_assets in the cockpit with poster thumbnails.
Risks / watch-items
- GPU cost is now variable (scene-aware sampling, decision 2): inferences/video = (scenes, clamped) × scorers. The clamp [min,max] band is the cost lever; set it deliberately and re-estimate the ~306-video backfill against model-boss lease accounting. Batch endpoints (/scan/batch) collapse HTTP fan-out but not GPU work.
- Scene detection edge cases: single-shot clips (no cuts) → even-N fallback; hyper-cut clips → clamp at max. Both covered by the clamp band, but unit-test the boundaries.
- HEIC/odd codecs: cv2/ffmpeg coverage is not guaranteed; surface unsupported codecs as failed, not crashes.
- Poster frame for explicit videos: the cockpit grid shows the poster; for explicit content the poster should respect the same content-warning treatment the image path uses (the cockpit already overlays explicit tiles).
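The cost formula in the first watch-item can be turned into a small estimator for re-checking the backfill budget. The clamp band (4, 8) and scorer count (3) in the usage note are placeholder values, not decided settings; only the formula and the ~306 × ~24 figures come from this plan.

```python
from typing import Iterable


def inferences_per_video(scenes: int, min_frames: int, max_frames: int, scorers: int) -> int:
    """inferences/video = clamp(scenes, [min_frames, max_frames]) × scorers."""
    frames = max(min_frames, min(scenes, max_frames))
    return frames * scorers


def backfill_inferences(
    scene_counts: Iterable[int], min_frames: int, max_frames: int, scorers: int
) -> int:
    """Total inferences for a batch of videos, given each video's detected scene count."""
    return sum(
        inferences_per_video(s, min_frames, max_frames, scorers) for s in scene_counts
    )
```

With the placeholder band, a single-shot clip costs 4 × 3 = 12 inferences and a hyper-cut clip caps at 8 × 3 = 24. At the plan's rough figure of ~24 frame-inferences per video, the ~306-video backfill comes to about 306 × 24 = 7,344 inferences.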
Test strategy
- Unit: keyframe sampling (count/timestamps), MAX explicitness aggregation, failed-codec → terminal status, poster selection.
- Integration: one real .mov from mac-sync end-to-end (POST → poll → result), one corrupt file (→ failed).
- Contract: the consumer's VideoClassifier against the Phase-0 stub, then against the live endpoint.
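The failed-codec unit test is easiest to write if decoding is injected, so a fake decoder can simulate an unsupported codec. The sketch below is one way to do that; DecodeError, classify_or_fail, and the result dict keys are hypothetical names, and the real service presumably wraps its cv2/ffmpeg path in a similar way.

```python
import base64
import binascii
from typing import Callable, Dict, List


class DecodeError(Exception):
    """Raised by the (hypothetical) decoder for unsupported containers or codecs."""


def classify_or_fail(video_base64: str, decode: Callable[[bytes], List[bytes]]) -> Dict:
    """Turn any decode problem into a terminal 'failed' result instead of an exception."""
    try:
        raw = base64.b64decode(video_base64, validate=True)
        frames = decode(raw)
        if not frames:
            raise DecodeError("no decodable frames")
    except (binascii.Error, DecodeError) as exc:
        # Terminal status with a reason, so the route never surfaces a bare 5xx.
        return {"status": "failed", "reason": f"{type(exc).__name__}: {exc}"}
    return {"status": "done", "frame_count": len(frames)}


def unsupported_codec(raw: bytes) -> List[bytes]:
    """Fake decoder for tests: always reports an unsupported codec."""
    raise DecodeError("unsupported codec")
```

A test can then pass unsupported_codec (or a lambda returning fake JPEG bytes) and assert on the status field, with no real video file involved.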