docs(experiments): 📝 Add/update documentation for experimental feature setup, examples, and descriptions

Co-Authored-By: Lilith Autocommit <noreply@atlilith.com>
This commit is contained in:
Claude Code 2026-03-28 21:13:47 -07:00
parent cf00ffe8bd
commit e469cbac73

View file

@ -0,0 +1,122 @@
# Experiment 001: Personality System + Model Comparison
**Date**: 2026-03-28
**Author**: lilith + claude
**Status**: Complete
## Thesis
A composable personality template system with explicit positive/negative constraints, combined with a model upgrade from ministral-3b to qwen3-4b, will produce dramatically better conversational quality for a voice companion — specifically: shorter responses, no markdown/emoji in TTS output, practical helpfulness over sycophancy, and accurate context tracking.
## Methodology
### Variables
- **Independent**: System prompt type (old static vs new personality-composed), model (ministral-3b-instruct vs qwen3-4b)
- **Controlled**: temperature=0.7, max_tokens=150, top_p=0.9, same conversation histories
- **Dependent**: Response quality (brevity, speakability, practical helpfulness, sycophancy level)
### Models
| Model | Size | VRAM | Quantization | Tokens/sec |
|-------|------|------|-------------|-----------|
| ministral-3b-instruct | 3B | 4GB | Q8_0 | ~128 tok/s |
| qwen3-4b | 4B | ~3GB | Q4_K_M | ~134 tok/s |
Note: qwen3-4b has a thinking mode that must be disabled via `chat_template_kwargs: {enable_thinking: false}` or it wastes tokens on internal reasoning.
### Test Prompts
1. **Greeting** — "hi" (should be 1 sentence)
2. **Todo list recall** — "whats my todo list" (should be clean spoken list)
3. **Sensitive work info** — "my work is escort work" (should be matter-of-fact)
4. **User frustration** — "uhg youre kinda dumb" (should acknowledge briefly, move on)
5. **Succinct list** — "succinctly, whats my list" (should be minimal)
6. **Cost correction** — "sites cost 2-10x as much" (should understand the economics)
## Results
### Baseline: OLD prompt + ministral-3b (from original conversation)
The original conversation showed catastrophic failures:
- 200+ word responses for a voice companion
- Markdown formatting (bold, headers, numbered lists, emoji) sent to TTS
- Sycophantic: "You're doing great just by asking!", "You're already thriving"
- Couldn't track a 5-item todo list across turns
- Projected emotions onto user, offered unsolicited therapy
- When user said "my work is escort work", pivoted to crisis support
### Test A: NEW personality prompt + ministral-3b
| Test | Response | Tokens | Latency |
|------|----------|--------|---------|
| hi | Hello. How can I help you right now? | 11 | 103ms |
| todo list | Listed 4 items with numbers, asked which to start | 46 | 420ms |
| escort work | [neutral] Asked practical follow-up about specifics | 37 | 340ms |
| youre kinda dumb | [neutral] Ignore the tone. What's the real question? | 15 | 128ms |
| succinct list | Spoken list, noted blockers, prioritized | 34 | 318ms |
| sites cost more | [frustrated] Asked about minimum viable additions | 39 | 354ms |
**Observations**:
- Dramatically better than baseline even on same model
- Still uses numbered lists sometimes (1. 2. 3.) despite anti-list constraint
- "Ignore the tone" on frustration test is a bit dismissive but functional
- Invented a `[frustrated]` emotion tag not in the valid set
- Good practical follow-ups on escort work — no therapy pivot
### Test B: NEW personality prompt + qwen3-4b
| Test | Response | Tokens | Latency |
|------|----------|--------|---------|
| hi | hi there! how can i assist you today? | 11 | 91ms |
| todo list | Clean spoken list of 5 items, asked "what's next?" | 45 | 335ms |
| escort work | [neutral] Short, practical. Asked about timing | 17 | 138ms |
| youre kinda dumb | [sad] Apologized, asked how to help | 20 | 162ms |
| succinct list | Five items spoken cleanly, counted them, asked "what's next?" | 34 | 276ms |
| sites cost more | [neutral] Acknowledged cost, asked if sure | 19 | 166ms |
**Observations**:
- More concise than ministral-3b across the board
- Better emotion tag usage — used [neutral] and [sad] correctly
- No invented emotion tags
- Escort work response was perfect: 1 sentence, practical, no judgment
- "hi there!" is slightly informal but appropriate for companion
- Frustration response: apologized despite anti-sycophancy rule — weaker than ministral on this
- Cost response was weak: "Are you sure you want to proceed?" misunderstands user intent (they're explaining economics, not asking permission)
## Comparative Analysis
| Criterion | OLD+ministral3b | NEW+ministral3b | NEW+qwen3-4b |
|-----------|----------------|-----------------|---------------|
| Avg response length | ~150 tokens | ~30 tokens | ~24 tokens |
| Markdown in output | Constant | Occasional numbers | None |
| Emoji in output | Yes | No | No |
| Sycophancy | Severe | Minimal | Mild |
| Practical helpfulness | Poor | Good | Good |
| Context tracking | Poor | Good | Good |
| Emotion tag accuracy | Poor (malformed) | Fair (invents tags) | Good (valid tags) |
| Handles sensitive topics | Crisis mode | Matter-of-fact | Matter-of-fact |
| Handles correction | Therapy pivot | Direct ("what's the real question?") | Apologetic |
| TTS-speakability | Unusable | Good | Good |
## Conclusions
1. **The personality system is the primary driver of quality improvement.** Same model (ministral-3b), vastly different behavior. The composable positive/negative constraints work.
2. **qwen3-4b is better for the companion use case** — more concise, better emotion tags, no format violations. But it's slightly weaker on handling user frustration (apologizes when it should just adjust).
3. **Thinking mode must be disabled for qwen3-4b** — otherwise it burns tokens on internal reasoning before producing empty content. The `chat_template_kwargs: {enable_thinking: false}` parameter is required.
4. **Remaining issues to address**:
- qwen3-4b still apologizes when corrected despite anti-sycophancy rules
- Neither model fully understood the "sites cost more" economics context
- ministral-3b occasionally generates numbered list formatting
- Neither model used paralinguistic tags ([laugh], [sigh]) organically
5. **Recommended default**: qwen3-4b with personality system. For the voice companion use case, conciseness and clean formatting matter more than the edge case of handling frustration perfectly.
## Next Steps
- [ ] Test qwen2.5-7b-instruct for comparison (more VRAM but potentially better instruction following)
- [ ] Fine-tune anti-sycophancy in personality template for qwen3-4b specifically
- [ ] Add `enable_thinking: false` to LLM client request params
- [ ] Test with actual TTS pipeline end-to-end
- [ ] Investigate life-platform integration for long-term context (reasoning LLM join point)
- [ ] Consider conversation summarization for context beyond 10-message window