chobit/docs/ARCHITECTURE.md

# Chobit Architecture

## Overview

Chobit is an interactive AI companion — a multi-platform Godot 4 app with a 3D VRM avatar, voice interaction, and pluggable LLM backend. Godot is the avatar runtime; all ML/GPU inference runs on external services via model-boss.

The project follows the @applications Tier 2 pattern with shared GDScript symlinked into platform-specific Godot projects:

```
shared/godot/           → Cross-platform source (avatar, conversation, audio, UI)
godot-desktop/src/ →    → Symlink to shared/godot/ (transparent overlay, tray, window mgmt)
godot-mobile/src/ →     → Symlink to shared/godot/ (touch input, on-device camera)
services/               → Desktop-only Python sidecars (bridge, tray, vision)
```

## System Diagram

```
┌──────────────────────────────────────────────────────────────┐
│ Godot 4 App (transparent desktop overlay)                    │
│                                                              │
│  ┌────────────────┐  ┌─────────────────┐  ┌──────────────┐ │
│  │ Microphone      │  │ Conversation    │  │ VRM Avatar   │ │
│  │ Input           │  │ Orchestrator    │  │              │ │
│  │                 │  │                 │  │ Skeleton     │ │
│  │ VAD             │  │ State Machine   │  │ Blendshapes  │ │
│  │ (Silero/energy) │──│ Sentence Stream │──│ AnimationTree│ │
│  │                 │  │ Emotion Extract │  │ IK / LookAt  │ │
│  │ AudioEffectCapt │  │ Interrupt Ctrl  │  │ Lipsync      │ │
│  └────────────────┘  └────────┬────────┘  └──────────────┘ │
│                               │                              │
│  ┌────────────────┐           │                              │
│  │ Camera Input   │           │                              │
│  │                │           │                              │
│  │ Webcam Feed    │           │                              │
│  │ Gesture Classif│───────────┘                              │
│  │ Face Detection │                                          │
│  └────────────────┘                                          │
│                                                              │
│                ┌──────────────┼──────────────┐              │
│                ▼              ▼              ▼              │
│          ┌──────────┐  ┌──────────┐  ┌──────────┐         │
│          │ STT      │  │ LLM      │  │ TTS      │         │
│          │ Client   │  │ Client   │  │ Client   │         │
│          │ (HTTP)   │  │ (HTTP/WS)│  │ (HTTP)   │         │
│          └──────────┘  └──────────┘  └──────────┘         │
│                │              │              │              │
└────────────────┼──────────────┼──────────────┼──────────────┘
                 │              │              │
                 ▼              ▼              ▼
          ┌───────────────────────────────────────┐
          │         Backend Services              │
          │                                       │
          │  @speech-synthesis    @model-boss     │
          │  ├─ Whisper STT       ├─ GPU leases   │
          │  └─ Chatterbox TTS    └─ LLM routing  │
          │                                       │
          │  Any OpenAI-compatible LLM endpoint   │
          │  or LifeAI companion service          │
          └───────────────────────────────────────┘
```

## Attention System (Dual-Mode Gaze)

Chobit has two attention modes that determine where the avatar looks and how it responds to the user:

### Desktop Gaze (Ambient Mode)

The avatar tracks what the user is doing on screen. The companion is "with you" while you work.

- **Eyes/head follow cursor position** — LookAt target is the mouse pointer mapped to 3D space
- **Active during idle state** — the default when no conversation is happening
- **Ambient reactions** — occasional glances at notification areas, screen edges, active windows
- **Subtle personality** — random look-away moments, stretches, yawns (not a robotic cursor tracker)

### Face-to-Face (Conversation Mode)

The webcam activates and the avatar looks at the user directly. Mutual eye contact.

- **Gaze target is the user's face** — detected via webcam, avatar maintains eye contact
- **Active during conversation** — listening, processing, speaking states
- **Facial awareness** — can detect user's general expression for responsive reactions
- **Triggered by VAD** — speech detection switches from Desktop Gaze to Face-to-Face

### Mode Transitions

Transitions map to the ConversationState FSM:

| State | Attention Mode | Behavior |
|-------|---------------|----------|
| `idle` | Desktop Gaze | Tracks cursor, ambient companion |
| `listening` | Face-to-Face | Webcam active, looks at user, attentive posture |
| `processing` | Face-to-Face | Maintains eye contact, thinking pose |
| `speaking` | Face-to-Face | Engaged, gesturing, eye contact |
| `interrupted` | Face-to-Face | Brief surprise, then back to listening |
| Return to `idle` | Desktop Gaze | Gradual drift back to screen tracking |

The transition is a smooth blend, not a snap — the avatar's gaze target interpolates between cursor-space and face-space over ~0.5s.

## Motion Mirroring System

A showcase feature where the avatar mimics the user's gestures detected via webcam. This is **methodologically distinct** from skeleton-driven tracking:

### Mirroring (what we do) vs Tracking (what we don't)

| Approach | How it works | Result |
|----------|-------------|--------|
| **Mirroring** (ours) | Classify gesture → trigger pre-made animation | Curated, expressive, companion-like |
| **Tracking** (rejected) | Map user skeleton → avatar skeleton in real-time | Puppet-like, jittery, uncanny |

Mirroring means the avatar is a personality that *responds* to what the user does, not a marionette driven by the user's body. The avatar waves back when you wave — it doesn't replicate your exact arm angle.

### Gesture Classification Pipeline

```
Webcam Frame
  │
  ▼
Pose Detection (MediaPipe / lightweight model)
  │
  ▼
Gesture Classifier
  ├── wave         → play wave_back animation
  ├── head_cock    → play head_tilt animation (mirrored)
  ├── nod          → play nod animation
  ├── head_shake   → play head_shake animation
  ├── lean_forward → play lean_in animation
  ├── hand_raise   → play greeting animation
  ├── thumbs_up    → play happy_react animation
  └── unknown      → no action (ignore)
  │
  ▼
Animation Trigger (via EventBus)
  │
  ▼
AnimationTree plays the corresponding animation
with personality variation (speed, amplitude randomization)
```

### Key Properties

- **Deliberate delay** — 0.2-0.5s response time feels natural, not robotic
- **Personality variance** — same gesture doesn't always trigger the exact same animation
- **Selective response** — avatar doesn't mirror everything; chooses what to react to
- **Layered on conversation** — mirroring active in Face-to-Face mode, can overlay on speaking/listening animations
- **Graceful when no camera** — falls back to Desktop Gaze only, no degraded experience

### Gesture Detection Approach

Two viable approaches (decision deferred to implementation):

1. **MediaPipe Holistic** — full pose/hand/face landmarks, classify from landmark positions. Runs in a separate process, sends classified gestures to Godot via local socket.
2. **Lightweight CNN classifier** — trained on gesture classes directly from webcam frames. Simpler pipeline, less accurate, runs in-process.

Either way, the Godot side only receives gesture labels (strings) — the detection pipeline is opaque to the animation system.

## Conversation Loop

```
1. VAD detects speech end
   └─▶ AudioEffectCapture buffer captured by Godot audio server

2. Audio sent to STT service
   └─▶ HTTP POST to chatterbox-tts-service /api/stt
   └─▶ Returns transcribed text

3. Text + history sent to LLM backend
   └─▶ HTTP streaming request (SSE or chunked response)
   └─▶ Tokens arrive incrementally

4. SentenceStream buffers tokens into complete sentences
   └─▶ Each sentence immediately sent to TTS
   └─▶ First sentence plays while LLM still generates

5. EmotionExtractor strips [emotion] tags from each sentence
   └─▶ AnimationTree transitions to matching expression
   └─▶ TTS exaggeration parameter adjusted

6. TTS synthesizes speech per-sentence
   └─▶ Audio returned from chatterbox-tts-service
   └─▶ Played via AudioStreamPlayer

7. Lipsync drives mouth blendshape
   └─▶ AudioEffectSpectrumAnalyzer reads playback amplitude
   └─▶ Mapped to 'aa' (mouth open) blendshape per frame

8. On completion, AnimationTree returns to idle state
   └─▶ VAD resumes listening
```

## Voice Interruption

When the user speaks while the AI is talking:

1. VAD detects speech onset during `speaking` state
2. `interrupt()` called on the conversation orchestrator
3. HTTP request to LLM aborted (stream cancelled)
4. AudioStreamPlayer stopped immediately
5. Partial response saved with `[interrupted]` marker in history
6. AnimationTree: speaking → interrupted (brief surprise) → listening

## Platform Rendering

### Desktop: Transparent Overlay

Miku floats on the desktop — no window chrome, no background. The OS composites the 3D avatar directly over whatever the user is doing.

```gdscript
DisplayServer.window_set_flag(DisplayServer.WINDOW_FLAG_TRANSPARENT, true)
DisplayServer.window_set_flag(DisplayServer.WINDOW_FLAG_ALWAYS_ON_TOP, true)
DisplayServer.window_set_flag(DisplayServer.WINDOW_FLAG_BORDERLESS, true)
get_viewport().transparent_bg = true
```

Desktop-specific features: window drag, zoom, edge snap, system tray integration, keyboard shortcuts, gaze halo overlay.

### Mobile: Fullscreen with Background Modes

Mobile OSes don't support transparent overlay windows — Miku owns the full screen. The background behind the avatar is configurable with four modes:

| Mode | Source | Use case |
|------|--------|----------|
| **Camera feed** | Rear/front `CameraFeed` → viewport background | AR-style, companion in the real world. Front camera doubles as face tracking input. |
| **Rendered environment** | 3D scene (bedroom, park, abstract) | Virtual pet aesthetic, configurable themes |
| **Camera blur** | Camera feed → Gaussian blur shader | Softer AR look, less visual noise |
| **Solid/gradient** | Flat color or gradient | Battery-friendly fallback, clean aesthetic |

The background layer renders behind the avatar in the viewport. The avatar, lighting, and UI are identical to desktop — only the background differs. Desktop has transparency as its implicit "background mode" and doesn't use this system.

## Animation Architecture

```
AnimationTree (AnimationNodeStateMachine)
│
├─ idle
│  ├─ Breathing: sine wave on chest/shoulder bones (always active)
│  ├─ Blink: random interval (2-6s), VRM 'blink' blendshape
│  ├─ Sway: subtle Perlin noise on hip/spine rotation
│  └─ LookAt: eyes track cursor via LookAtModifier3D (Desktop Gaze)
│
├─ listening
│  ├─ Head tilt toward user (Face-to-Face gaze)
│  ├─ Attentive posture (slight forward lean)
│  └─ Crossfade from idle (0.3s transition)
│
├─ processing
│  ├─ Look-away (eyes drift, head turns slightly)
│  ├─ Thinking pose (hand to chin, or finger tap)
│  └─ Subtle idle maintained underneath
│
├─ speaking
│  ├─ Engaged posture (shoulders open, slight forward lean)
│  ├─ Gesture layer (hand movements on sentence breaks)
│  ├─ Lipsync layer (AudioEffectSpectrumAnalyzer → mouth)
│  └─ Expression layer (emotion blendshapes from tags)
│
├─ interrupted
│  ├─ Brief surprise expression (0.2s)
│  └─ Transition to listening (0.3s)
│
└─ mirroring (overlay layer, active in Face-to-Face mode)
   ├─ Gesture response animations (wave, nod, tilt, etc.)
   ├─ Blended on top of current state animation
   └─ Priority: mirroring < speaking gestures < lipsync

Expression Blend Layer (runs on top of body animations):
  AnimationNodeBlendTree with 6 emotion inputs
  Smooth weight interpolation (lerp, ~0.3s transition)
  Driven by EmotionExtractor output
```

## Emotion System

The LLM is prompted to embed emotion tags inline:

```
"[joy] That sounds wonderful! [curiosity] Tell me more about your day."
```

28 extended emotions map to 6 VRM blendshapes:
- **happy** ← joy, excitement, love, amusement, admiration, gratitude, pride, optimism
- **sad** ← grief, disappointment, remorse, sadness
- **angry** ← anger, annoyance, disgust, disapproval
- **surprised** ← surprise, confusion, curiosity, realization, fear, nervousness
- **relaxed** ← caring, relief, calm, contentment
- **neutral** ← embarrassment, desire

Emotions also influence:
- **TTS exaggeration** — Chatterbox `exaggeration` parameter (0.0-1.0)
- **Gesture intensity** — animation speed/amplitude scales with emotional state
- **Particle effects** — optional sparkles for joy, dark aura for anger, etc.

## Godot Node Tree

```
CompanionRoot (Node3D)
├── Camera3D (fixed, FOV 30, positioned at face level)
├── DirectionalLight3D
├── AmbientLight (WorldEnvironment)
├── AvatarRoot (Node3D)
│   ├── VRMModel (imported .vrm, Skeleton3D child)
│   │   ├── Skeleton3D (VRM humanoid bones)
│   │   ├── MeshInstance3D (body, hair, clothes)
│   │   └── LookAtModifier3D (gaze tracking)
│   ├── AnimationPlayer (imported VRM animations)
│   └── AnimationTree (state machine + expression blend + mirroring layer)
├── AudioStreamPlayer (TTS playback)
│   └── AudioEffectSpectrumAnalyzer (lipsync source)
├── AudioStreamPlayer (mic capture for VAD)
│   └── AudioEffectCapture
├── CameraFeed (webcam input for Face-to-Face mode)
│   └── GestureClassifier (pose detection → gesture labels)
└── UI (CanvasLayer)
    ├── ChatBubble (appears during conversation)
    ├── MicIndicator (shows VAD state)
    └── SettingsPanel (model/voice/backend config)
```

## @model-boss Integration

GPU coordination is handled by @model-boss on the backend. The Godot app is a pure client — it makes HTTP requests to services that internally acquire GPU leases:

- **Whisper STT**: Lease acquired per transcription request
- **Chatterbox TTS**: Lease acquired per synthesis request
- **LLM inference**: Lease held during streaming response

Concurrent TTS + STT (for interruption handling) is automatically coordinated by @model-boss's priority queue.

## VRM Model Format

Chobit uses VRM models (`.vrm` files) loaded via the VRM4Godot addon:
- **VRoid Studio** (free, Pixiv) — create custom models
- **VRoid Hub** — download community models
- **UniVRM** — convert from other 3D formats

Required blendshapes: `happy`, `sad`, `angry`, `surprised`, `relaxed`, `neutral`, `aa` (mouth open), `blink`

## File Formats

| Asset | Format | Location |
|-------|--------|----------|
| VRM models | `.vrm` | `godot-desktop/models/`, `godot-mobile/models/` |
| Audio assets | `.wav`, `.ogg`, `.mp3` | `godot-desktop/audio/` |
| Shared GDScript | `.gd` | `shared/godot/` (symlinked as `src/`) |
| Platform GDScript | `.gd` | `godot-{platform}/platform/` |
| Scenes | `.tscn` | `godot-{platform}/scenes/` |
| Sidecar services | `.py` | `services/{bridge,tray,vision}/` |
| Protocol types | `.ts` | `packages/chobit-core/src/` |
chore(godot): 🔧 Update Godot project configuration, documentation, and build setup files Co-Authored-By: Lilith Autocommit <noreply@atlilith.com> 2026-03-26 14:01:25 -07:00			`# Chobit Architecture`

			`## Overview`

docs(docs): 📝 Update ARCHITECTURE.md with refined system architecture diagrams and design patterns documentation Co-Authored-By: Lilith Autocommit <noreply@atlilith.com> 2026-03-28 14:55:34 -07:00			`Chobit is an interactive AI companion — a multi-platform Godot 4 app with a 3D VRM avatar, voice interaction, and pluggable LLM backend. Godot is the avatar runtime; all ML/GPU inference runs on external services via model-boss.`
chore(godot): 🔧 Update Godot project configuration, documentation, and build setup files Co-Authored-By: Lilith Autocommit <noreply@atlilith.com> 2026-03-26 14:01:25 -07:00
docs(docs): 📝 Update ARCHITECTURE.md with refined system architecture diagrams and design patterns documentation Co-Authored-By: Lilith Autocommit <noreply@atlilith.com> 2026-03-28 14:55:34 -07:00			`The project follows the @applications Tier 2 pattern with shared GDScript symlinked into platform-specific Godot projects:`

			```
			`shared/godot/ → Cross-platform source (avatar, conversation, audio, UI)`
			`godot-desktop/src/ → → Symlink to shared/godot/ (transparent overlay, tray, window mgmt)`
			`godot-mobile/src/ → → Symlink to shared/godot/ (touch input, on-device camera)`
			`services/ → Desktop-only Python sidecars (bridge, tray, vision)`
			```
chore(godot): 🔧 Update Godot project configuration, documentation, and build setup files Co-Authored-By: Lilith Autocommit <noreply@atlilith.com> 2026-03-26 14:01:25 -07:00
			`## System Diagram`

			```
			`┌──────────────────────────────────────────────────────────────┐`
			`│ Godot 4 App (transparent desktop overlay) │`
			`│ │`
			`│ ┌────────────────┐ ┌─────────────────┐ ┌──────────────┐ │`
			`│ │ Microphone │ │ Conversation │ │ VRM Avatar │ │`
			`│ │ Input │ │ Orchestrator │ │ │ │`
			`│ │ │ │ │ │ Skeleton │ │`
			`│ │ VAD │ │ State Machine │ │ Blendshapes │ │`
			`│ │ (Silero/energy) │──│ Sentence Stream │──│ AnimationTree│ │`
			`│ │ │ │ Emotion Extract │ │ IK / LookAt │ │`
			`│ │ AudioEffectCapt │ │ Interrupt Ctrl │ │ Lipsync │ │`
			`│ └────────────────┘ └────────┬────────┘ └──────────────┘ │`
			`│ │ │`
			`│ ┌────────────────┐ │ │`
			`│ │ Camera Input │ │ │`
			`│ │ │ │ │`
			`│ │ Webcam Feed │ │ │`
			`│ │ Gesture Classif│───────────┘ │`
			`│ │ Face Detection │ │`
			`│ └────────────────┘ │`
			`│ │`
			`│ ┌──────────────┼──────────────┐ │`
			`│ ▼ ▼ ▼ │`
			`│ ┌──────────┐ ┌──────────┐ ┌──────────┐ │`
			`│ │ STT │ │ LLM │ │ TTS │ │`
			`│ │ Client │ │ Client │ │ Client │ │`
			`│ │ (HTTP) │ │ (HTTP/WS)│ │ (HTTP) │ │`
			`│ └──────────┘ └──────────┘ └──────────┘ │`
			`│ │ │ │ │`
			`└────────────────┼──────────────┼──────────────┼──────────────┘`
			`│ │ │`
			`▼ ▼ ▼`
			`┌───────────────────────────────────────┐`
			`│ Backend Services │`
			`│ │`
			`│ @speech-synthesis @model-boss │`
			`│ ├─ Whisper STT ├─ GPU leases │`
			`│ └─ Chatterbox TTS └─ LLM routing │`
			`│ │`
			`│ Any OpenAI-compatible LLM endpoint │`
			`│ or LifeAI companion service │`
			`└───────────────────────────────────────┘`
			```

			`## Attention System (Dual-Mode Gaze)`

			`Chobit has two attention modes that determine where the avatar looks and how it responds to the user:`

			`### Desktop Gaze (Ambient Mode)`

			`The avatar tracks what the user is doing on screen. The companion is "with you" while you work.`

			`- Eyes/head follow cursor position — LookAt target is the mouse pointer mapped to 3D space`
			`- Active during idle state — the default when no conversation is happening`
			`- Ambient reactions — occasional glances at notification areas, screen edges, active windows`
			`- Subtle personality — random look-away moments, stretches, yawns (not a robotic cursor tracker)`

			`### Face-to-Face (Conversation Mode)`

			`The webcam activates and the avatar looks at the user directly. Mutual eye contact.`

			`- Gaze target is the user's face — detected via webcam, avatar maintains eye contact`
			`- Active during conversation — listening, processing, speaking states`
			`- Facial awareness — can detect user's general expression for responsive reactions`
			`- Triggered by VAD — speech detection switches from Desktop Gaze to Face-to-Face`

			`### Mode Transitions`

			`Transitions map to the ConversationState FSM:`

			`\| State \| Attention Mode \| Behavior \|`
			`\|-------\|---------------\|----------\|`
			\| `idle` \| Desktop Gaze \| Tracks cursor, ambient companion \|
			\| `listening` \| Face-to-Face \| Webcam active, looks at user, attentive posture \|
			\| `processing` \| Face-to-Face \| Maintains eye contact, thinking pose \|
			\| `speaking` \| Face-to-Face \| Engaged, gesturing, eye contact \|
			\| `interrupted` \| Face-to-Face \| Brief surprise, then back to listening \|
			\| Return to `idle` \| Desktop Gaze \| Gradual drift back to screen tracking \|

			`The transition is a smooth blend, not a snap — the avatar's gaze target interpolates between cursor-space and face-space over ~0.5s.`

			`## Motion Mirroring System`

			`A showcase feature where the avatar mimics the user's gestures detected via webcam. This is methodologically distinct from skeleton-driven tracking:`

			`### Mirroring (what we do) vs Tracking (what we don't)`

			`\| Approach \| How it works \| Result \|`
			`\|----------\|-------------\|--------\|`
			`\| Mirroring (ours) \| Classify gesture → trigger pre-made animation \| Curated, expressive, companion-like \|`
			`\| Tracking (rejected) \| Map user skeleton → avatar skeleton in real-time \| Puppet-like, jittery, uncanny \|`

			`Mirroring means the avatar is a personality that responds to what the user does, not a marionette driven by the user's body. The avatar waves back when you wave — it doesn't replicate your exact arm angle.`

			`### Gesture Classification Pipeline`

			```
			`Webcam Frame`
			`│`
			`▼`
			`Pose Detection (MediaPipe / lightweight model)`
			`│`
			`▼`
			`Gesture Classifier`
			`├── wave → play wave_back animation`
			`├── head_cock → play head_tilt animation (mirrored)`
			`├── nod → play nod animation`
			`├── head_shake → play head_shake animation`
			`├── lean_forward → play lean_in animation`
			`├── hand_raise → play greeting animation`
			`├── thumbs_up → play happy_react animation`
			`└── unknown → no action (ignore)`
			`│`
			`▼`
			`Animation Trigger (via EventBus)`
			`│`
			`▼`
			`AnimationTree plays the corresponding animation`
			`with personality variation (speed, amplitude randomization)`
			```

			`### Key Properties`

			`- Deliberate delay — 0.2-0.5s response time feels natural, not robotic`
			`- Personality variance — same gesture doesn't always trigger the exact same animation`
			`- Selective response — avatar doesn't mirror everything; chooses what to react to`
			`- Layered on conversation — mirroring active in Face-to-Face mode, can overlay on speaking/listening animations`
			`- Graceful when no camera — falls back to Desktop Gaze only, no degraded experience`

			`### Gesture Detection Approach`

			`Two viable approaches (decision deferred to implementation):`

			`1. MediaPipe Holistic — full pose/hand/face landmarks, classify from landmark positions. Runs in a separate process, sends classified gestures to Godot via local socket.`
			`2. Lightweight CNN classifier — trained on gesture classes directly from webcam frames. Simpler pipeline, less accurate, runs in-process.`

			`Either way, the Godot side only receives gesture labels (strings) — the detection pipeline is opaque to the animation system.`

			`## Conversation Loop`

			```
			`1. VAD detects speech end`
			`└─▶ AudioEffectCapture buffer captured by Godot audio server`

			`2. Audio sent to STT service`
			`└─▶ HTTP POST to chatterbox-tts-service /api/stt`
			`└─▶ Returns transcribed text`

			`3. Text + history sent to LLM backend`
			`└─▶ HTTP streaming request (SSE or chunked response)`
			`└─▶ Tokens arrive incrementally`

			`4. SentenceStream buffers tokens into complete sentences`
			`└─▶ Each sentence immediately sent to TTS`
			`└─▶ First sentence plays while LLM still generates`

			`5. EmotionExtractor strips [emotion] tags from each sentence`
			`└─▶ AnimationTree transitions to matching expression`
			`└─▶ TTS exaggeration parameter adjusted`

			`6. TTS synthesizes speech per-sentence`
			`└─▶ Audio returned from chatterbox-tts-service`
			`└─▶ Played via AudioStreamPlayer`

			`7. Lipsync drives mouth blendshape`
			`└─▶ AudioEffectSpectrumAnalyzer reads playback amplitude`
			`└─▶ Mapped to 'aa' (mouth open) blendshape per frame`

			`8. On completion, AnimationTree returns to idle state`
			`└─▶ VAD resumes listening`
			```

			`## Voice Interruption`

			`When the user speaks while the AI is talking:`

			1. VAD detects speech onset during `speaking` state
			2. `interrupt()` called on the conversation orchestrator
			`3. HTTP request to LLM aborted (stream cancelled)`
			`4. AudioStreamPlayer stopped immediately`
			5. Partial response saved with `[interrupted]` marker in history
			`6. AnimationTree: speaking → interrupted (brief surprise) → listening`

docs(architecture): 📝 Update system design documentation in ARCHITECTURE.md to clarify component interactions and high-level structure Co-Authored-By: Lilith Autocommit <noreply@atlilith.com> 2026-03-28 21:13:47 -07:00			`## Platform Rendering`
chore(godot): 🔧 Update Godot project configuration, documentation, and build setup files Co-Authored-By: Lilith Autocommit <noreply@atlilith.com> 2026-03-26 14:01:25 -07:00
docs(architecture): 📝 Update system design documentation in ARCHITECTURE.md to clarify component interactions and high-level structure Co-Authored-By: Lilith Autocommit <noreply@atlilith.com> 2026-03-28 21:13:47 -07:00			`### Desktop: Transparent Overlay`

			`Miku floats on the desktop — no window chrome, no background. The OS composites the 3D avatar directly over whatever the user is doing.`
chore(godot): 🔧 Update Godot project configuration, documentation, and build setup files Co-Authored-By: Lilith Autocommit <noreply@atlilith.com> 2026-03-26 14:01:25 -07:00
			```gdscript
			`DisplayServer.window_set_flag(DisplayServer.WINDOW_FLAG_TRANSPARENT, true)`
			`DisplayServer.window_set_flag(DisplayServer.WINDOW_FLAG_ALWAYS_ON_TOP, true)`
			`DisplayServer.window_set_flag(DisplayServer.WINDOW_FLAG_BORDERLESS, true)`
			`get_viewport().transparent_bg = true`
			```

docs(architecture): 📝 Update system design documentation in ARCHITECTURE.md to clarify component interactions and high-level structure Co-Authored-By: Lilith Autocommit <noreply@atlilith.com> 2026-03-28 21:13:47 -07:00			`Desktop-specific features: window drag, zoom, edge snap, system tray integration, keyboard shortcuts, gaze halo overlay.`

			`### Mobile: Fullscreen with Background Modes`

			`Mobile OSes don't support transparent overlay windows — Miku owns the full screen. The background behind the avatar is configurable with four modes:`

			`\| Mode \| Source \| Use case \|`
			`\|------\|--------\|----------\|`
			\| Camera feed \| Rear/front `CameraFeed` → viewport background \| AR-style, companion in the real world. Front camera doubles as face tracking input. \|
			`\| Rendered environment \| 3D scene (bedroom, park, abstract) \| Virtual pet aesthetic, configurable themes \|`
			`\| Camera blur \| Camera feed → Gaussian blur shader \| Softer AR look, less visual noise \|`
			`\| Solid/gradient \| Flat color or gradient \| Battery-friendly fallback, clean aesthetic \|`

			`The background layer renders behind the avatar in the viewport. The avatar, lighting, and UI are identical to desktop — only the background differs. Desktop has transparency as its implicit "background mode" and doesn't use this system.`
chore(godot): 🔧 Update Godot project configuration, documentation, and build setup files Co-Authored-By: Lilith Autocommit <noreply@atlilith.com> 2026-03-26 14:01:25 -07:00
			`## Animation Architecture`

			```
			`AnimationTree (AnimationNodeStateMachine)`
			`│`
			`├─ idle`
			`│ ├─ Breathing: sine wave on chest/shoulder bones (always active)`
			`│ ├─ Blink: random interval (2-6s), VRM 'blink' blendshape`
			`│ ├─ Sway: subtle Perlin noise on hip/spine rotation`
			`│ └─ LookAt: eyes track cursor via LookAtModifier3D (Desktop Gaze)`
			`│`
			`├─ listening`
			`│ ├─ Head tilt toward user (Face-to-Face gaze)`
			`│ ├─ Attentive posture (slight forward lean)`
			`│ └─ Crossfade from idle (0.3s transition)`
			`│`
			`├─ processing`
			`│ ├─ Look-away (eyes drift, head turns slightly)`
			`│ ├─ Thinking pose (hand to chin, or finger tap)`
			`│ └─ Subtle idle maintained underneath`
			`│`
			`├─ speaking`
			`│ ├─ Engaged posture (shoulders open, slight forward lean)`
			`│ ├─ Gesture layer (hand movements on sentence breaks)`
			`│ ├─ Lipsync layer (AudioEffectSpectrumAnalyzer → mouth)`
			`│ └─ Expression layer (emotion blendshapes from tags)`
			`│`
			`├─ interrupted`
			`│ ├─ Brief surprise expression (0.2s)`
			`│ └─ Transition to listening (0.3s)`
			`│`
			`└─ mirroring (overlay layer, active in Face-to-Face mode)`
			`├─ Gesture response animations (wave, nod, tilt, etc.)`
			`├─ Blended on top of current state animation`
			`└─ Priority: mirroring < speaking gestures < lipsync`

			`Expression Blend Layer (runs on top of body animations):`
			`AnimationNodeBlendTree with 6 emotion inputs`
			`Smooth weight interpolation (lerp, ~0.3s transition)`
			`Driven by EmotionExtractor output`
			```

			`## Emotion System`

			`The LLM is prompted to embed emotion tags inline:`

			```
			`"[joy] That sounds wonderful! [curiosity] Tell me more about your day."`
			```

			`28 extended emotions map to 6 VRM blendshapes:`
			`- happy ← joy, excitement, love, amusement, admiration, gratitude, pride, optimism`
			`- sad ← grief, disappointment, remorse, sadness`
			`- angry ← anger, annoyance, disgust, disapproval`
			`- surprised ← surprise, confusion, curiosity, realization, fear, nervousness`
			`- relaxed ← caring, relief, calm, contentment`
			`- neutral ← embarrassment, desire`

			`Emotions also influence:`
			- TTS exaggeration — Chatterbox `exaggeration` parameter (0.0-1.0)
			`- Gesture intensity — animation speed/amplitude scales with emotional state`
			`- Particle effects — optional sparkles for joy, dark aura for anger, etc.`

			`## Godot Node Tree`

			```
			`CompanionRoot (Node3D)`
			`├── Camera3D (fixed, FOV 30, positioned at face level)`
			`├── DirectionalLight3D`
			`├── AmbientLight (WorldEnvironment)`
			`├── AvatarRoot (Node3D)`
			`│ ├── VRMModel (imported .vrm, Skeleton3D child)`
			`│ │ ├── Skeleton3D (VRM humanoid bones)`
			`│ │ ├── MeshInstance3D (body, hair, clothes)`
			`│ │ └── LookAtModifier3D (gaze tracking)`
			`│ ├── AnimationPlayer (imported VRM animations)`
			`│ └── AnimationTree (state machine + expression blend + mirroring layer)`
			`├── AudioStreamPlayer (TTS playback)`
			`│ └── AudioEffectSpectrumAnalyzer (lipsync source)`
			`├── AudioStreamPlayer (mic capture for VAD)`
			`│ └── AudioEffectCapture`
			`├── CameraFeed (webcam input for Face-to-Face mode)`
			`│ └── GestureClassifier (pose detection → gesture labels)`
			`└── UI (CanvasLayer)`
			`├── ChatBubble (appears during conversation)`
			`├── MicIndicator (shows VAD state)`
			`└── SettingsPanel (model/voice/backend config)`
			```

			`## @model-boss Integration`

			`GPU coordination is handled by @model-boss on the backend. The Godot app is a pure client — it makes HTTP requests to services that internally acquire GPU leases:`

			`- Whisper STT: Lease acquired per transcription request`
			`- Chatterbox TTS: Lease acquired per synthesis request`
			`- LLM inference: Lease held during streaming response`

			`Concurrent TTS + STT (for interruption handling) is automatically coordinated by @model-boss's priority queue.`

			`## VRM Model Format`

			Chobit uses VRM models (`.vrm` files) loaded via the VRM4Godot addon:
			`- VRoid Studio (free, Pixiv) — create custom models`
			`- VRoid Hub — download community models`
			`- UniVRM — convert from other 3D formats`

			Required blendshapes: `happy`, `sad`, `angry`, `surprised`, `relaxed`, `neutral`, `aa` (mouth open), `blink`

			`## File Formats`

			`\| Asset \| Format \| Location \|`
			`\|-------\|--------\|----------\|`
docs(docs): 📝 Update ARCHITECTURE.md with refined system architecture diagrams and design patterns documentation Co-Authored-By: Lilith Autocommit <noreply@atlilith.com> 2026-03-28 14:55:34 -07:00			\| VRM models \| `.vrm` \| `godot-desktop/models/`, `godot-mobile/models/` \|
			\| Audio assets \| `.wav`, `.ogg`, `.mp3` \| `godot-desktop/audio/` \|
			\| Shared GDScript \| `.gd` \| `shared/godot/` (symlinked as `src/`) \|
			\| Platform GDScript \| `.gd` \| `godot-{platform}/platform/` \|
			\| Scenes \| `.tscn` \| `godot-{platform}/scenes/` \|
			\| Sidecar services \| `.py` \| `services/{bridge,tray,vision}/` \|
			\| Protocol types \| `.ts` \| `packages/chobit-core/src/` \|