apricot-health/README.md
Natalie dafbabee41 feat(@packages/apricot-health): add power-fault monitoring and mitigation tools
Co-Authored-By: Lilith Autocommit <noreply@atlilith.com>
2026-04-17 23:18:47 -07:00


# apricot-health
Power-fault diagnostics and mitigation for **apricot** — a Threadripper 2990WX / X399 AORUS XTREME-CF / dual RTX 3090 rig on an open wet-bench, hit by random hard power-offs whose root cause is still being isolated (aging PSU caps, VRM degradation, or both).
## What's in here
| Component | What it does |
|---|---|
| `scripts/apricot-crash-logger` | High-frequency (10 Hz) sensor snapshotter. Captures GPU / CPU / NVMe / motherboard-rail telemetry to `~/apricot-crash.log`, fsync'd every second, so samples up to roughly the last second before a hard power-off survive on disk. |
| `scripts/apricot-rail-watchdog` | Tails the crash log, learns a per-chip baseline for `in5` on each `it8628/hwmonN`, and alerts on deviations > `DEVIATION_MV` (default 30 mV). Optionally invokes a mitigation hook. |
| `scripts/apricot-rail-mitigate` | Root-only emergency responder: drops GPU power caps and pins CPU governor to `powersave` for `HOLD_SECONDS` (default 60), then restores. Fired by the watchdog via sudoers. |
| `scripts/apricot-rail-mitigate-trigger` | User-space shim that `sudo`s into `apricot-rail-mitigate` (scoped NOPASSWD). |
| `scripts/apricot-cstate-tune` | Disables deep CPU C-states (C2+) so Vcore stays at a higher baseline, reducing VRM transient-demand magnitude. Oneshot systemd unit at boot. |
| `scripts/apricot-rasdaemon-setup` | Installs + enables `rasdaemon` for detailed AMD MCA/MCE decoding into a sqlite DB. |
| `modprobe.d/it87.conf` | `force_id=0x8628 ignore_resource_conflict=1` — binds IT8628E SuperIO so voltage/fan/temp rails are exposed in `/sys/class/hwmon`. |
| `modules-load.d/it87.conf` | Loads `it87` at boot. |
| `sudoers.d/apricot-health` | NOPASSWD rule for `lilith` to invoke the mitigation entrypoint (scoped to one command). |
| `systemd/*.service` | Three units — one root (`apricot-cstate-tune`), two user (`apricot-crash-monitor`, `apricot-rail-watchdog`). |
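
To make the `it8628/hwmonN` wording above concrete, here is a minimal sketch of how a rail such as `in5` could be read from every IT8628 chip once `it87` is bound. It is not the watchdog's actual code: `read_rail` and the `HWMON_ROOT` override are hypothetical names, and it assumes the driver reports the chip name as `it8628` and exposes the rail as `in5_input` in millivolts (the standard hwmon sysfs convention).

```sh
#!/bin/sh
# Sketch only: print "<hwmonN> <millivolts>" for one rail on every it8628 chip.
# HWMON_ROOT is a hypothetical override so the function can be pointed at a fake tree.
read_rail() {
    rail="${1:-in5}"
    root="${HWMON_ROOT:-/sys/class/hwmon}"
    for dir in "$root"/hwmon*; do
        # Skip entries without a readable chip name (also covers an unmatched glob).
        [ -r "$dir/name" ] || continue
        # Assumption: the it87 driver names the IT8628E "it8628".
        [ "$(cat "$dir/name")" = "it8628" ] || continue
        [ -r "$dir/${rail}_input" ] || continue
        # hwmon voltage inputs are reported in millivolts.
        printf '%s %s\n' "${dir##*/}" "$(cat "$dir/${rail}_input")"
    done
}
```

Calling `read_rail in5` on the host prints one line per matching chip; if nothing is printed, the `it87` module is probably not loaded or did not bind.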
## Install
```sh
./install.sh # targets HOST=apricot by default
HOST=other-host ./install.sh # or override
```
Idempotent. Re-run to push updates.
## Tuning
All runtime behavior is env-overridable through systemd drop-ins:
```sh
systemctl --user edit apricot-rail-watchdog
# [Service]
# Environment=DEVIATION_MV=50 BASELINE_SAMPLES=40 RAIL_KEY=in5
```
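
For scripted setups where opening an editor is inconvenient, the same drop-in can be written directly. The sketch below is an illustration rather than part of this package: `write_dropin` and `DROPIN_ROOT` are hypothetical names, and the default path is the per-user location that `systemctl --user edit` normally writes to.

```sh
#!/bin/sh
# Sketch only: write an override.conf drop-in for a user unit without an editor.
# DROPIN_ROOT is a hypothetical override; the default is systemd's per-user unit directory.
write_dropin() {
    unit="$1"; shift
    dir="${DROPIN_ROOT:-$HOME/.config/systemd/user}/$unit.service.d"
    mkdir -p "$dir"
    {
        echo '[Service]'
        # One Environment= line per KEY=VALUE argument.
        for kv in "$@"; do
            echo "Environment=$kv"
        done
    } > "$dir/override.conf"
}
# Usage (reload and restart so the new values take effect):
#   write_dropin apricot-rail-watchdog DEVIATION_MV=50 BASELINE_SAMPLES=40
#   systemctl --user daemon-reload
#   systemctl --user restart apricot-rail-watchdog
```

`systemctl --user show apricot-rail-watchdog -p Environment` should then list the overridden values.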
Key knobs:
- `INTERVAL` (crash-logger) — sample period in seconds; `0.1` = 10 Hz.
- `DEVIATION_MV` (watchdog) — deviation from learned baseline that triggers an alert.
- `MITIGATE_CMD` (watchdog) — path to mitigation hook; empty = alert only.
- `GPU_LIMIT_SAFE` (mitigate) — wattage to clamp GPUs to during mitigation.
- `HOLD_SECONDS` (mitigate) — how long to hold the safe state.
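
To illustrate how `DEVIATION_MV` and `BASELINE_SAMPLES` interact, the following is a rough sketch of a baseline-then-compare loop. It is not the watchdog's real implementation: `check_rail` is a hypothetical name, the baseline is assumed to be a plain mean of the first `BASELINE_SAMPLES` readings, the fallback of 40 samples is a guess (only the 30 mV default is documented above), and the input is simplified to one millivolt value per line.

```sh
#!/bin/sh
# Sketch only: learn a mean baseline, then flag readings that stray too far from it.
# Reads one millivolt value per line on stdin.
check_rail() {
    awk -v n="${BASELINE_SAMPLES:-40}" -v dev="${DEVIATION_MV:-30}" '
        # Learning phase: accumulate the first n samples into a mean baseline.
        NR <= n { sum += $1; if (NR == n) base = sum / n; next }
        # Watch phase: alert when the absolute deviation exceeds the threshold.
        {
            d = $1 - base
            if (d < 0) d = -d
            if (d > dev)
                printf "ALERT sample=%d mv=%d baseline=%.1f delta=%.1f\n", NR, $1, base, d
        }
    '
}
```

A larger `BASELINE_SAMPLES` gives a steadier baseline at the cost of a longer blind period after start-up; a larger `DEVIATION_MV` trades sensitivity for fewer false alerts.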
## Outputs
- `~/apricot-crash.log` — per-sample telemetry.
- `~/apricot-rail-alerts.log` — watchdog alerts + baselines.
- `journalctl --user -u apricot-rail-watchdog` — live alerts (WARNING priority).
- `journalctl -u apricot-cstate-tune` — one-shot C-state tune result at boot.
- `/var/lib/rasdaemon/ras-mc_event.db` (after rasdaemon setup) — decoded MCEs.
## Post-mortem flow when a crash happens
1. `ssh apricot` (after it comes back — BIOS "AC Back: Power On" auto-restarts).
2. `grep -n '^=== session start' ~/apricot-crash.log | tail -5` — find the new session boundary.
3. The lines immediately above the new session marker are the previous session's final samples, i.e. the last moments recorded before death.
4. `tail ~/apricot-rail-alerts.log` — did the watchdog see rail deviation before the event?
5. `journalctl -b -1 --no-pager | tail -40` — kernel's last words (often normal; hard-off gives no panic).
6. SMART unsafe-shutdown counter: `sudo smartctl -a /dev/nvme0 | grep -i unsafe` — should increment by 1.
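
Steps 2 and 3 can be combined into a small helper. This is a sketch under the assumption that every session begins with a line starting `=== session start` (the pattern grepped for above); `last_before_crash` and its default of 50 lines are hypothetical.

```sh
#!/bin/sh
# Sketch only: print the last N log lines written before the newest session marker.
last_before_crash() {
    log="${1:-$HOME/apricot-crash.log}"
    n="${2:-50}"
    # Line number of the most recent session marker.
    marker=$(grep -n '^=== session start' "$log" | tail -n 1 | cut -d: -f1)
    # No marker at all: nothing to report.
    [ -n "$marker" ] || return 1
    # Marker on the first line: no earlier session precedes it.
    [ "$marker" -gt 1 ] || return 0
    # Everything above the marker belongs to earlier sessions; keep its tail.
    head -n "$((marker - 1))" "$log" | tail -n "$n"
}
```

For example, `last_before_crash ~/apricot-crash.log 100` shows the final 100 lines the previous session managed to write.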
## Diagnosis so far
See [`docs/DIAGNOSIS.md`](docs/DIAGNOSIS.md).