Chobit is an interactive AI companion — a multi-platform Godot 4 app with a 3D VRM avatar, voice interaction, and pluggable LLM backend. Godot is the avatar runtime; all ML/GPU inference runs on external services via model-boss.
| `interrupted` | Face-to-Face | Brief surprise, then back to listening |
| Return to `idle` | Desktop Gaze | Gradual drift back to screen tracking |
The transition is a smooth blend, not a snap — the avatar's gaze target interpolates between cursor-space and face-space over ~0.5s.
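A minimal sketch of such a blend, written in Python for illustration rather than GDScript. The smoothstep easing curve, the `GazeBlend` class, and the `blend_target` helper are assumptions; the design only states that the gaze target interpolates over roughly 0.5s.

```python
from dataclasses import dataclass


def smoothstep(t: float) -> float:
    """Ease-in/ease-out curve on [0, 1]; inputs outside that range are clamped."""
    t = max(0.0, min(1.0, t))
    return t * t * (3.0 - 2.0 * t)


@dataclass
class GazeBlend:
    """Tracks progress of a gaze-mode transition (hypothetical helper)."""
    duration: float = 0.5  # seconds, matching the ~0.5s figure in the text
    elapsed: float = 0.0

    def advance(self, delta: float) -> float:
        """Advance by one frame of `delta` seconds; return the blend weight in [0, 1]."""
        self.elapsed = min(self.duration, self.elapsed + delta)
        return smoothstep(self.elapsed / self.duration)


def blend_target(cursor_target, face_target, weight):
    """Linearly interpolate between two 3D points given as tuples of floats."""
    return tuple(c + (f - c) * weight for c, f in zip(cursor_target, face_target))
```

Calling `advance` once per frame and feeding the returned weight to `blend_target` yields a gaze target that starts in cursor-space and settles in face-space without a visible snap. Swapping the two arguments gives the return drift to Desktop Gaze.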
## Motion Mirroring System
A showcase feature where the avatar mimics the user's gestures detected via webcam. This is **methodologically distinct** from skeleton-driven tracking:
### Mirroring (what we do) vs Tracking (what we don't)
| **Tracking** (rejected) | Map user skeleton → avatar skeleton in real-time | Puppet-like, jittery, uncanny |
Mirroring means the avatar is a personality that *responds* to what the user does, not a marionette driven by the user's body. The avatar waves back when you wave — it doesn't replicate your exact arm angle.
### Gesture Classification Pipeline
```
Webcam Frame
│
▼
Pose Detection (MediaPipe / lightweight model)
│
▼
Gesture Classifier
├── wave → play wave_back animation
├── head_cock → play head_tilt animation (mirrored)
├── nod → play nod animation
├── head_shake → play head_shake animation
├── lean_forward → play lean_in animation
├── hand_raise → play greeting animation
├── thumbs_up → play happy_react animation
└── unknown → no action (ignore)
│
▼
Animation Trigger (via EventBus)
│
▼
AnimationTree plays the corresponding animation
with personality variation (speed, amplitude randomization)
```
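The last two stages of the pipeline could be sketched as a lookup table plus a small randomizer. This is an illustration in Python, not the project's GDScript; the gesture and animation names come from the diagram, while the `animation_for` function, the request dictionary shape, and the numeric ranges are assumptions.

```python
import random

# Gesture label -> animation name, taken from the pipeline diagram.
GESTURE_ANIMATIONS = {
    "wave": "wave_back",
    "head_cock": "head_tilt",
    "nod": "nod",
    "head_shake": "head_shake",
    "lean_forward": "lean_in",
    "hand_raise": "greeting",
    "thumbs_up": "happy_react",
}


def animation_for(gesture, rng=random):
    """Return an animation request for a gesture label, or None when no action is taken."""
    name = GESTURE_ANIMATIONS.get(gesture)
    if name is None:
        return None  # "unknown" and any unrecognized label are ignored
    return {
        "animation": name,
        "mirrored": gesture == "head_cock",  # only head_tilt is marked as mirrored
        # Hypothetical ranges: the design only says speed and amplitude are randomized.
        "speed": rng.uniform(0.9, 1.1),
        "amplitude": rng.uniform(0.8, 1.2),
    }
```

Passing a seeded `random.Random` instance as `rng` makes the variation reproducible, which helps when tuning how much randomness still reads as the same gesture.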
### Key Properties
- **Deliberate delay** — 0.2-0.5s response time feels natural, not robotic
- **Personality variance** — same gesture doesn't always trigger the exact same animation
- **Selective response** — avatar doesn't mirror everything; chooses what to react to
- **Layered on conversation** — mirroring active in Face-to-Face mode, can overlay on speaking/listening animations
- **Graceful when no camera** — falls back to Desktop Gaze only, no degraded experience
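One way to read these properties together is as a small response policy that decides whether to react and how long to wait. The sketch below is Python for illustration; the `plan_response` function, the mode strings, and the 0.7 reaction probability are assumptions, and only the 0.2-0.5s delay range comes from the list above.

```python
import random


def plan_response(gesture, mode, camera_available, rng=random, react_probability=0.7):
    """Return the delay in seconds before reacting, or None to skip the gesture."""
    if not camera_available:
        return None  # no camera: Desktop Gaze only, mirroring is simply off
    if mode != "face_to_face":
        return None  # mirroring is only active in Face-to-Face mode
    if gesture == "unknown":
        return None
    if rng.random() > react_probability:
        return None  # selective response: sometimes let the gesture pass
    return rng.uniform(0.2, 0.5)  # deliberate delay from the design notes
```

Returning `None` for every "do nothing" case keeps the caller simple: it schedules an animation only when it gets a number back.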
### Gesture Detection Approach
Two viable approaches (decision deferred to implementation):
1. **MediaPipe Holistic**: full pose/hand/face landmarks, classify from landmark positions. Runs in a separate process, sends classified gestures to Godot via local socket.
2. **Lightweight CNN classifier**: trained on gesture classes directly from webcam frames. Simpler pipeline, less accurate, runs in-process.
Either way, the Godot side only receives gesture labels (strings) — the detection pipeline is opaque to the animation system.
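To illustrate how narrow that interface can be, here is a sketch of a receiver that turns raw socket bytes into validated gesture labels. It is written in Python for illustration; the newline-delimited UTF-8 wire format and the `GestureLabelDecoder` class are assumptions, since the design does not specify a protocol.

```python
# Labels the animation system knows how to react to (from the pipeline diagram).
KNOWN_GESTURES = frozenset({
    "wave", "head_cock", "nod", "head_shake",
    "lean_forward", "hand_raise", "thumbs_up",
})


class GestureLabelDecoder:
    """Accumulates bytes from a socket and returns complete, known gesture labels."""

    def __init__(self):
        self._buffer = b""

    def feed(self, chunk):
        """Add received bytes; return the labels completed by this chunk."""
        self._buffer += chunk
        # Everything before the last newline is complete; the remainder stays buffered.
        *lines, self._buffer = self._buffer.split(b"\n")
        labels = []
        for line in lines:
            label = line.decode("utf-8", errors="replace").strip()
            if label in KNOWN_GESTURES:  # drop "unknown" and malformed input
                labels.append(label)
        return labels
```

Because the decoder only ever emits strings from a fixed set, either detection approach can sit behind it without the animation side changing.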
## Conversation Loop
```
1. VAD detects speech end
└─▶ AudioEffectCapture buffer captured by Godot audio server
2. Audio sent to STT service
└─▶ HTTP POST to chatterbox-tts-service /api/stt
└─▶ Returns transcribed text
3. Text + history sent to LLM backend
└─▶ HTTP streaming request (SSE or chunked response)
└─▶ Tokens arrive incrementally
4. SentenceStream buffers tokens into complete sentences
└─▶ Each sentence immediately sent to TTS
└─▶ First sentence plays while LLM still generates
5. EmotionExtractor strips [emotion] tags from each sentence
└─▶ AnimationTree transitions to matching expression
```
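Steps 4 and 5 can be sketched as two small pieces: a buffer that releases complete sentences as tokens arrive, and a function that separates `[emotion]` tags from the text. This is Python for illustration only; the class and function bodies, the regex-based sentence boundary, and the tag syntax `\[(\w+)\]` are assumptions about how `SentenceStream` and `EmotionExtractor` might behave.

```python
import re

# A sentence ends at ., ! or ? followed by whitespace (simplistic: "Dr. Smith" would split).
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")
_EMOTION_TAG = re.compile(r"\[(\w+)\]\s*")


class SentenceStream:
    """Buffers streamed tokens and releases complete sentences."""

    def __init__(self):
        self._buffer = ""

    def push(self, token):
        """Add a token; return any sentences it completed."""
        self._buffer += token
        *complete, self._buffer = _SENTENCE_END.split(self._buffer)
        return complete

    def flush(self):
        """Return whatever remains when the LLM stream ends."""
        rest, self._buffer = self._buffer.strip(), ""
        return [rest] if rest else []


def extract_emotion(sentence):
    """Return (clean_text, emotions) with [emotion] tags removed from the text."""
    emotions = _EMOTION_TAG.findall(sentence)
    return _EMOTION_TAG.sub("", sentence).strip(), emotions
```

In this sketch a sentence is released only once the whitespace after its terminator arrives, so the final sentence of a response needs the `flush` call. The clean text would go to TTS and the emotion names to the animation side.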
Mobile OSes don't support transparent overlay windows — Miku owns the full screen. The background behind the avatar is configurable with four modes:
| Mode | Source | Use case |
|------|--------|----------|
| **Camera feed** | Rear/front `CameraFeed` → viewport background | AR-style, companion in the real world. Front camera doubles as face tracking input. |
| **Rendered environment** | 3D scene (bedroom, park, abstract) | Virtual pet aesthetic, configurable themes |
| **Camera blur** | Camera feed → Gaussian blur shader | Softer AR look, less visual noise |
| **Solid/gradient** | Flat color or gradient | Battery-friendly fallback, clean aesthetic |
The background layer renders behind the avatar in the viewport. The avatar, lighting, and UI are identical to desktop — only the background differs. Desktop has transparency as its implicit "background mode" and doesn't use this system.
GPU coordination is handled by model-boss on the backend. The Godot app is a pure client: it makes HTTP requests to services that internally acquire GPU leases:
- **Whisper STT**: Lease acquired per transcription request
- **Chatterbox TTS**: Lease acquired per synthesis request
- **LLM inference**: Lease held during streaming response
Concurrent TTS + STT (for interruption handling) is automatically coordinated by model-boss's priority queue.
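As a conceptual illustration of priority-ordered leasing, the sketch below grants one lease at a time to the highest-priority waiter. It is not model-boss's API or implementation: the `LeaseQueue` class, its methods, and the priority numbers are all hypothetical, chosen only to show why an STT request made during an interruption could be served ahead of queued TTS work.

```python
import heapq
import itertools


class LeaseQueue:
    """Grants one GPU lease at a time, lowest priority number first (hypothetical)."""

    def __init__(self):
        self._heap = []
        self._order = itertools.count()  # tie-breaker: FIFO within the same priority

    def request(self, service, priority):
        """Queue a lease request for a service."""
        heapq.heappush(self._heap, (priority, next(self._order), service))

    def grant_next(self):
        """Return the service that should receive the lease next, or None if idle."""
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]
```

From the Godot side none of this is visible; the app only observes that its HTTP requests complete in an order that keeps interruption handling responsive.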
## VRM Model Format
Chobit uses VRM models (`.vrm` files) loaded via the VRM4Godot addon: