# apricot hard-off diagnosis Running log of the investigation. Newest findings at top. ## Platform - Gigabyte X399 AORUS XTREME-CF, 8 years old, open-frame wet-bench (no mineral oil; "wet" refers to open-air test bench). - AMD Threadripper 2990WX (32-core, 250 W TDP). - 2× NVIDIA RTX 3090 (stock 370 W cap each). - 2× NVMe + 3× SATA. - 2× Corsair PSUs: - **HX1500i** — was producing audible coil-whine before the split; now carries only drives + Molex. - **HX1200** — now carries mobo + CPU + both GPUs. - Fedora Bluefin (ostree), kernel 6.17.12-200.fc42. - Non-ECC memory (`amd64_edac` cannot bind). ## Failure signature (consistent across all events) 1. Journal cuts abruptly mid-operation. No `Reached target Shutdown`, no `systemd-shutdown`, no kernel panic. 2. Next boot runs `XFS (dm-0): Starting recovery` — filesystem wasn't unmounted cleanly. 3. NVMe SMART `Unsafe Shutdowns` increments by 1 on each event. Current ratio ~66 % of all power cycles are unclean. 4. BIOS "AC Back: Power On" (inferred from behavior) auto-restarts the box after each event; earlier events where the box stayed dark likely latched PSU OCP/UVP protection. 5. No MCE / thermal-throttle / OOM / hung-task entries. → The kernel never runs a shutdown — the 12 V plane collapses from under it. Classic PSU OCP/UVP or VRM brownout. ## Timeline of captured crashes | Timestamp (PDT) | GPU 0 | GPU 1 | CPU Tctl | Load profile | |---|---|---|---|---| | 2026-04-16 15:58:06 | 158 W | **368 W** (pegged) | — | Sustained high — GPU 1 inference under load | | 2026-04-17 03:22:54 | 117 W | 25 W (idle) | 70 °C | **Near-idle** — background auto-commit + tor-manager only | | 2026-04-17 11:15:42 | 20 W | **368 W** (pegged) | 72 °C | High GPU 1 load | | 2026-04-17 21:35:10 | 117 W | 129 W | 69 °C | Moderate, both GPUs in P2 | Crashes span idle-to-sustained-peak — no consistent load correlation. ## Rail observations (it8628 SuperIO, after binding via `it87 force_id=0x8628`) Stable rails during normal operation: - `in5` on chip 1 (hwmon3/hwmon8 depending on boot order): **852 mV steady** → likely +12 V scaled ~14:1 → ~11.9 V actual. - `in5` on chip 2: **1632 mV steady** → likely +5 V scaled ~3:1 → ~4.9 V actual. **Key observation 2026-04-17**: Between crashes, `in5` on chip 1 collapsed from **852 mV → 408 mV** twice (18:50:43-45, 19:20:50-52), recovering within one sample. Roughly a **50 % rail drop** — probably a ~12 V → ~5.7 V momentary sag. System survived both. Demonstrates the supply is visibly failing at slow timescales, not only at the microsecond scale that causes a hard-off. ## What has been ruled out - **Thermal**: all CPU/GPU/NVMe temps well below throttle thresholds at every crash. - **OOM / hung task**: journal shows none. - **MCE**: `edac_mce_amd` loaded, no events logged. - **Graceful shutdown path**: no systemd shutdown-target progression. - **nvidia-oc daemon**: fixed independently — was thrashing sqlite locks; not related to crashes. - **HX1500i as sole cause**: crashes continued after moving all load off it onto HX1200. ## What's consistent with observations - **Aging filter caps on PSU and/or motherboard VRM**. Both the squealing HX1500i *and* the HX1200 have produced visible rail excursions. Board is 8 years old. - **Load-independent failure**: crashes happen at both idle and peak load, but the in5 rail drops caught by the watchdog indicate intermittent supply failure decoupled from workload. ## What remains to rule out (physical) - Visual inspection of VRM caps on the board (open bench, trivial). - Multimeter back-probe of 12 V at the 24-pin during load, to watch for sag below 11.4 V. - Swap to a third known-good PSU for a day. - Reseat EPS12V / 24-pin connectors (oxidation on 8-year-old pins is plausible). ## Software stack currently deployed - **10 Hz telemetry logger** (`apricot-crash-monitor.service`) — writes ~/apricot-crash.log, fsync per second. - **Rail watchdog** (`apricot-rail-watchdog.service`) — baseline-learning on `in5`, 30 mV deviation threshold, invokes mitigation on trigger. - **Emergency mitigation** (`apricot-rail-mitigate`) — drops GPU cap to 250 W, pins CPU governor to powersave, holds 60 s, restores. - **C-state tune** (`apricot-cstate-tune.service`) — disables C2+ at boot to reduce VRM transient demand. - **IT8628E binding** (`/etc/modprobe.d/it87.conf` + `/etc/modules-load.d/it87.conf`) — SuperIO sensors auto-load with correct `force_id`. - **rasdaemon** — optional, via `apricot-rasdaemon-setup`. ## Non-software fixes kept separate from this package - nvidia-oc WAL-mode patch (upstreamed via ACS to `origin/master` of the nvidia-oc repo, commit `bea1934`).