GPU Coordination
Managing GPU resources across multiple services with GPUBoss and Redis.
Overview
The @imajin platform runs multiple GPU-intensive services:
- imajin-prompt: LLM inference (DeepSeek R1 70B)
- imajin-diffusion: Diffusion model inference
GPUBoss coordinates VRAM allocation across these services to prevent out-of-memory (OOM) errors.
Architecture
sequenceDiagram
participant Service as Service
participant Boss as GPUBoss
participant Redis as Redis
participant GPU as GPU VRAM
Service->>Boss: Request VRAM lease (8GB)
Boss->>Redis: Check available VRAM
Redis-->>Boss: 16GB available on cuda:0
Boss->>Redis: Register lease (8GB, cuda:0)
Boss-->>Service: Lease granted (cuda:0)
Service->>GPU: Load model
Note over Service,GPU: Model inference
Service->>Boss: Release lease
Boss->>Redis: Clear lease
Boss-->>Service: Lease released
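The request, grant, and release steps in the diagram can be sketched with an in-memory stand-in for the Redis state. This is a minimal illustration only, not the GPUBoss implementation: the `InMemoryLedger` class, its method names, and the first-fit device choice are all assumptions made for the example.

```python
import uuid
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class InMemoryLedger:
    """Hypothetical stand-in for the Redis state: total VRAM and active leases."""

    total_gb: dict                                # device -> total VRAM in GB
    leases: dict = field(default_factory=dict)    # lease_id -> (device, vram_gb)

    def free_gb(self, device: str) -> int:
        """VRAM on `device` that is not covered by an active lease."""
        used = sum(gb for dev, gb in self.leases.values() if dev == device)
        return self.total_gb[device] - used

    def acquire(self, vram_gb: int) -> Optional[tuple]:
        """Grant a lease on the first device with enough room, else return None."""
        for device in self.total_gb:
            if self.free_gb(device) >= vram_gb:
                lease_id = uuid.uuid4().hex
                self.leases[lease_id] = (device, vram_gb)
                return lease_id, device
        return None

    def release(self, lease_id: str) -> None:
        """Clear a lease; releasing an unknown lease is a no-op."""
        self.leases.pop(lease_id, None)


# Mirror the diagram: 16GB available on cuda:0, each request asks for 8GB.
ledger = InMemoryLedger(total_gb={"cuda:0": 16})
first = ledger.acquire(8)    # granted
second = ledger.acquire(8)   # granted, device is now full
third = ledger.acquire(8)    # None: no room until a lease is released
ledger.release(first[0])
```

After the release, 8GB is free again on `cuda:0`, so a retry of the third request would succeed.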
Configuration
Redis Setup
# Docker
docker run -d -p 6379:6379 --name redis redis
# System service
sudo systemctl start redis
Service Configuration
# config.yaml
gpu:
  enabled: true
  redis_url: redis://localhost:6379
  priority: "normal"  # low, normal, high
Priority Levels
| Priority | Use Case |
|---|---|
| low | Background tasks, batch processing |
| normal | Standard requests |
| high | User-facing, latency-sensitive |
Higher-priority services are granted VRAM leases first when there is contention.
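One way to picture this ordering is a priority queue of pending lease requests. The sketch below is an assumption about how such a queue could behave, not a description of GPUBoss internals: the numeric ranks, the first-come-first-served tie-breaking, and the service names are all hypothetical.

```python
import heapq
import itertools

# Lower rank is served first. The ranks themselves are an assumption.
PRIORITY_RANK = {"high": 0, "normal": 1, "low": 2}

# Monotonic counter so requests of equal priority keep their arrival order.
_arrival = itertools.count()


def enqueue(queue: list, service: str, priority: str) -> None:
    """Add a pending lease request to the queue."""
    heapq.heappush(queue, (PRIORITY_RANK[priority], next(_arrival), service))


def next_service(queue: list) -> str:
    """Pop the request that should receive the next lease."""
    return heapq.heappop(queue)[2]


pending = []
enqueue(pending, "batch-job", "low")
enqueue(pending, "chat-request", "high")
enqueue(pending, "thumbnail", "normal")
enqueue(pending, "chat-request-2", "high")

order = [next_service(pending) for _ in range(4)]
print(order)  # ['chat-request', 'chat-request-2', 'thumbnail', 'batch-job']
```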
Device Assignment
Multi-GPU Setup
Assign different models to different GPUs:
# imajin-diffusion
export IMAGE_GEN_PHOTOREALISTIC_DEVICE=cuda:0
export IMAGE_GEN_ANIME_DEVICE=cuda:1
This allows parallel generation with both models.
Single-GPU Setup
All services share one GPU, coordinated by GPUBoss:
export IMAGE_GEN_PHOTOREALISTIC_DEVICE=cuda:0
export IMAGE_GEN_ANIME_DEVICE=cuda:0
GPUBoss ensures only one model is loaded at a time.
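To check which of the two setups a given environment describes, the device variables can be grouped by their value. This is a small sketch under assumptions: the helper names are hypothetical, and the `cuda:0` default for an unset variable is a guess rather than documented behavior.

```python
import os
from collections import defaultdict

DEVICE_VARS = (
    "IMAGE_GEN_PHOTOREALISTIC_DEVICE",
    "IMAGE_GEN_ANIME_DEVICE",
)


def group_by_device(env, default="cuda:0"):
    """Map each device to the variables that point at it.

    The default for an unset variable is an assumption.
    """
    groups = defaultdict(list)
    for var in DEVICE_VARS:
        groups[env.get(var, default)].append(var)
    return dict(groups)


def can_run_in_parallel(env) -> bool:
    """True when no two models share a device."""
    return all(len(names) == 1 for names in group_by_device(env).values())


multi_gpu = {
    "IMAGE_GEN_PHOTOREALISTIC_DEVICE": "cuda:0",
    "IMAGE_GEN_ANIME_DEVICE": "cuda:1",
}
single_gpu = {
    "IMAGE_GEN_PHOTOREALISTIC_DEVICE": "cuda:0",
    "IMAGE_GEN_ANIME_DEVICE": "cuda:0",
}

print(can_run_in_parallel(multi_gpu))   # True
print(can_run_in_parallel(single_gpu))  # False

# Against the real process environment:
# print(group_by_device(os.environ))
```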
VRAM Requirements
| Model | Approximate VRAM |
|---|---|
| DeepSeek R1 70B (Q4) | 40GB |
| DeepSeek R1 70B (Q8) | 70GB |
| Diffusion (photorealistic) | 8GB |
| Diffusion (anime) | 8GB |
| Cultural classifier | 4GB |
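These estimates can be used for a quick capacity check before deciding what to place on a device. The sketch below only sums the table values, so it ignores activation memory and framework overhead. The `headroom_gb` helper is hypothetical, and the 24GB card is an assumed example size.

```python
# Approximate VRAM per model in GB, copied from the table above.
VRAM_GB = {
    "DeepSeek R1 70B (Q4)": 40,
    "DeepSeek R1 70B (Q8)": 70,
    "Diffusion (photorealistic)": 8,
    "Diffusion (anime)": 8,
    "Cultural classifier": 4,
}


def headroom_gb(models, gpu_gb: int) -> int:
    """GB left over after loading `models`; negative means they do not fit."""
    return gpu_gb - sum(VRAM_GB[name] for name in models)


# Assumed example: a single 24GB GPU.
print(headroom_gb(["Diffusion (photorealistic)", "Cultural classifier"], 24))  # 12
print(headroom_gb(["DeepSeek R1 70B (Q4)"], 24))                               # -16
```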
Lease Lifecycle
1. Request Lease
async with gpu_boss.lease(vram_gb=8, priority="normal") as device:
    # device = "cuda:0"
    model = load_model(device)
    result = model.generate(...)
2. Automatic Release
Leases are automatically released when:
- Context manager exits
- Service shuts down
- Timeout expires (configurable)
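The timeout case can be illustrated by comparing each lease's acquisition time against a configured limit. How GPUBoss actually stores timestamps and expires leases is not specified here, so the data layout (lease ID mapped to an epoch timestamp) and the `expired_leases` helper are assumptions.

```python
import time
from typing import Optional


def expired_leases(acquired_at: dict, timeout_s: float,
                   now: Optional[float] = None) -> list:
    """Return the IDs of leases held longer than `timeout_s` seconds.

    `acquired_at` maps lease_id -> acquisition time in epoch seconds
    (a hypothetical layout chosen for this sketch).
    """
    now = time.time() if now is None else now
    return sorted(
        lease_id
        for lease_id, started in acquired_at.items()
        if now - started > timeout_s
    )


leases = {"lease-a": 100.0, "lease-b": 190.0}

# At t=200 with a 60 second timeout, only lease-a (held for 100s) has expired.
print(expired_leases(leases, timeout_s=60, now=200.0))  # ['lease-a']
```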
3. Manual Release
lease_id = await gpu_boss.acquire(vram_gb=8)
try:
    ...  # use the GPU
finally:
    # Always release, even if the GPU work raised an exception
    await gpu_boss.release(lease_id)
Monitoring
Check GPU Status
nvidia-smi
Check Redis Leases
redis-cli keys "gpuboss:*"
redis-cli hgetall "gpuboss:leases"
Service Health
curl http://localhost:8003/health
# { "gpu_available": true, "vram_total": 24576, "vram_free": 16384 }
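The health payload can be turned into a quick summary. The sketch below parses the sample response shown above. It assumes `vram_total` and `vram_free` are reported in MiB, which the values suggest but the source does not state, and `summarize_health` is a hypothetical helper.

```python
import json


def summarize_health(body: str) -> dict:
    """Convert the raw health JSON into GiB and a used-VRAM fraction.

    Assumes the VRAM fields are in MiB.
    """
    data = json.loads(body)
    total, free = data["vram_total"], data["vram_free"]
    return {
        "gpu_available": data["gpu_available"],
        "free_gib": free / 1024,
        "used_fraction": (total - free) / total,
    }


sample = '{ "gpu_available": true, "vram_total": 24576, "vram_free": 16384 }'
summary = summarize_health(sample)
print(summary["free_gib"])  # 16.0
```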
Troubleshooting
OOM Despite Coordination
- Check for leaked leases: redis-cli keys "gpuboss:*"
- Verify VRAM estimates match actual usage
- Use a lower-precision quantization (e.g. Q4 instead of Q8) or a smaller batch size
Slow Lease Acquisition
- Check Redis latency: redis-cli --latency
- Verify priority settings
- Check for long-running leases blocking the queue
Service Can't Get GPU
# Check what's holding leases
redis-cli hgetall "gpuboss:leases"
# Force-release all leases (use with caution: this also drops leases that are still active)
redis-cli del "gpuboss:leases"
Best Practices
- Request minimum needed VRAM - Don't over-request
- Use appropriate priority - Reserve "high" for user-facing requests
- Handle lease failures gracefully - Return 503 if GPU unavailable
- Set reasonable timeouts - Prevent indefinite waits
- Monitor VRAM usage - Track actual vs. requested
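The failure-handling and timeout advice can be combined by bounding the wait for a lease and mapping a timeout to a 503 response. This is a sketch under assumptions: `handle_request` and the two stand-in acquire coroutines are hypothetical, and a real handler would also release the lease in a `finally` block once the GPU work is done.

```python
import asyncio


async def handle_request(acquire, timeout_s: float):
    """Wait up to `timeout_s` for a lease; answer 503 if none arrives."""
    try:
        lease_id = await asyncio.wait_for(acquire(), timeout=timeout_s)
    except asyncio.TimeoutError:
        return 503, "GPU unavailable, retry later"
    return 200, f"running with lease {lease_id}"


async def fast_acquire():
    """Stand-in for an acquire call that succeeds immediately."""
    return "lease-1"


async def stuck_acquire():
    """Stand-in for an acquire call blocked behind a long-running lease."""
    await asyncio.sleep(10)
    return "never-granted"


ok = asyncio.run(handle_request(fast_acquire, timeout_s=0.05))
busy = asyncio.run(handle_request(stuck_acquire, timeout_s=0.05))
print(ok)    # (200, 'running with lease lease-1')
print(busy)  # (503, 'GPU unavailable, retry later')
```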
Related
- Configuration - Redis URL configuration
- Service Topology - Service dependencies