net-tools/docs/topology.md
Natalie 03e47fc4df feat(@tools/net-tools): add mesh/lan tooling with host renderers
Co-Authored-By: Lilith Autocommit <noreply@atlilith.com>
2026-06-09 19:53:08 -07:00

Mesh topology

Networks

              ┌─────────────────────────────────────────────┐
              │  vps (quinn-vps) — 1984 Hosting, Iceland     │
              │  WireGuard hub   wg 10.9.0.1                 │
              │  public 89.127.233.145:51820                 │
              └───────────────┬─────────────────────────────┘
                              │  wg1 tunnel (AllowedIPs 10.9.0.0/24, 10.0.0.0/24)
          ┌───────────────────┼───────────────────┐
          │                   │                   │
   ┌──────┴──────┐     ┌──────┴──────┐     ┌──────┴──────┐
   │ apricot     │     │ black       │     │ plum        │
   │ wg 10.9.0.2 │     │ wg 10.9.0.4 │     │ wg 10.9.0.3 │
   │ lan 10.0.0. │─────│ lan 10.0.0. │     │ macOS,      │
   │     116     │ LAN │     11      │     │ roams       │
   │ mesh DNS    │     │ LAN DNS     │     │ (DHCP)      │
   └─────────────┘     └─────────────┘     └─────────────┘
        apricot + black share the home LAN (10.0.0.0/24);
        plum joins it only when physically home, else routes via the hub.
  • Mesh 10.9.0.0/24 — full WireGuard overlay via the Iceland hub. Every host reaches every other by <host>.wg while the tunnel is up.
  • LAN 10.0.0.0/24 — apricot + black, plus plum when home. The tunnel also routes this /24, so 10.0.0.x works off-LAN through the hub (higher latency).
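As a quick illustration of the two address spaces, the following minimal sketch classifies an address against the subnets named above. The function name `classify` is hypothetical and not part of net-tools.

```python
import ipaddress

# Subnets as stated in the diagram and the two bullets above.
MESH = ipaddress.ip_network("10.9.0.0/24")
LAN = ipaddress.ip_network("10.0.0.0/24")


def classify(addr: str) -> str:
    """Return 'mesh', 'lan', or 'other' for an IPv4 address string."""
    ip = ipaddress.ip_address(addr)
    if ip in MESH:
        return "mesh"
    if ip in LAN:
        return "lan"
    return "other"


if __name__ == "__main__":
    for a in ("10.9.0.2", "10.0.0.116", "89.127.233.145"):
        print(a, classify(a))
```

Both the mesh and LAN ranges are in the tunnel's AllowedIPs, which is why a LAN address stays reachable off-LAN.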

DNS responsibilities — and how .wg actually resolves

Two delivery paths, and they serve different consumers. This distinction is load-bearing (a config that renders a record is not the same as a client that can resolve it):

  • apricot runs dnsmasq bound to 10.9.0.2:53 (the mesh view). It serves the host .wg + .lan records from mesh-hosts.json, written by wg-dns-sync. These records are consumed only by clients whose WireGuard config sets DNS=10.9.0.2, i.e. phones. The named hosts (apricot/pear/fennel) do not point their resolver at 10.9.0.2, so dnsmasq never sees their queries.
  • For the named hosts, .wg/.lan is delivered by the static /etc/hosts block from mesh-hosts-render --install. Run it on every host that must resolve a peer's name. (Verified: before any install, dscacheutil -q host -a name apricot.wg on fennel returns nothing.)
  • fennel roams off-LAN where dnsmasq is unreachable, so the static /etc/hosts block is its only resolution path then.
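A static block is only safe to reinstall if the install is idempotent. The sketch below shows one way to replace a managed block in hosts-file text. It is not the actual mesh-hosts-render implementation: the `BEGIN`/`END` marker strings and the function name `install_block` are assumptions made for illustration.

```python
# Hypothetical marker lines; the real renderer may delimit its block differently.
BEGIN = "# BEGIN mesh-hosts (hypothetical marker)"
END = "# END mesh-hosts (hypothetical marker)"


def install_block(hosts_text: str, records: list[tuple[str, str]]) -> str:
    """Return hosts_text with the managed block replaced, or appended if absent."""
    block = [BEGIN] + [f"{ip}\t{name}" for ip, name in records] + [END]
    out: list[str] = []
    skipping = False   # True while inside an existing managed block
    replaced = False   # True once the new block has been emitted
    for line in hosts_text.splitlines():
        if line.strip() == BEGIN:
            skipping = True
            if not replaced:
                out.extend(block)
                replaced = True
            continue
        if line.strip() == END:
            skipping = False
            continue
        if not skipping:
            out.append(line)  # lines outside the block are kept untouched
    if not replaced:
        out.extend(block)
    return "\n".join(out) + "\n"
```

Running it twice with the same records yields the same text, and running it with new records drops the old ones, so a rename does not leave stale entries behind.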

The old *.local platform scheme is retired (platform → .com, infra → .lan); net-tools renders no .local.

Reachability matrix

| from ↓ \ to → | apricot | pear | fennel | yuzu |
|---|---|---|---|---|
| apricot | - | .lan ✦ · .wg | fennel.wg only | .wg |
| pear | .lan ✦ · .wg | - | fennel.wg only | .wg |
| fennel | .lan ✦ · .wg | .lan ✦ · .wg | - | .wg only |
| yuzu | .wg only | .wg only | fennel.wg only | - |

✦ preferred when co-located on the home LAN · ⚑ fennel (old name plum) falls back to .wg when it roams · fennel and yuzu (old names plum and vps) are only ever reachable inbound via .wg (fennel has no stable LAN IP; yuzu has no LAN leg).

.wg in this matrix is resolved via each host's static /etc/hosts block (mesh-hosts-render --install), not via dnsmasq (see DNS responsibilities above). The dnsmasq .wg records are the phones-only path. So the matrix holds only once the static block is installed on apricot, pear (black), and fennel (plum).

Hub IP note

plum's live wg1.conf endpoint is 89.127.233.145:51820. An older magic-civilization/scripts/lan/README.md also lists 93.95.231.174 for the Iceland hub — treat that as stale/secondary unless confirmed against the hub's own WireGuard config. mesh-hosts.json records only the live .145.

Smart routing daemon (fennel)

smart-lan-router/smart-lan-router.py runs as a root LaunchDaemon on the laptop.

The problem it solves: the wg config's AllowedIPs includes 10.0.0.0/24, so the tunnel installs a route capturing the entire home LAN. While home, traffic to home hosts hairpins through the Iceland hub (~350ms) instead of going out the LAN interface (~5ms). (Measured: apricot 351ms via tunnel → 5.6ms via en0.)

What it does, each cycle:

  1. Detect location — read the default route's gateway + interface. It's HOME iff the gateway is lan.gateway and its ARP MAC == lan.gateway_mac (the home gateway's fingerprint). The MAC check is what distinguishes the real home LAN from a visited café network that also happens to use 10.0.0.0/24.
  2. Switch the subnet route — HOME → route 10.0.0.0/24 via the LAN interface (direct); AWAY → via the wg mesh interface (so home stays reachable through the tunnel). Re-asserted every cycle, because wg-quick re-adds the tunnel /24 on reconnect.
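The two steps above can be sketched as pure functions. This is a simplified model, not the daemon's actual code. The `LanConfig` field names `lan_iface` and `wg_iface`, both function names, and any addresses used when calling it are assumptions. Only `lan.gateway` and `lan.gateway_mac` come from the description above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LanConfig:
    gateway: str       # lan.gateway
    gateway_mac: str   # lan.gateway_mac (the home gateway's fingerprint)
    lan_iface: str     # hypothetical field: LAN interface, e.g. "en0"
    wg_iface: str      # hypothetical field: WireGuard interface, e.g. "utun4"


def normalize_mac(mac: str) -> str:
    """Lower-case and zero-pad each octet so '0:1b:..' equals '00:1b:..'."""
    return ":".join(part.zfill(2) for part in mac.lower().split(":"))


def detect_location(gw_ip: str, gw_mac: Optional[str], cfg: LanConfig) -> str:
    """HOME only if both the gateway IP and its ARP MAC match the config."""
    if (
        gw_ip == cfg.gateway
        and gw_mac is not None
        and normalize_mac(gw_mac) == normalize_mac(cfg.gateway_mac)
    ):
        return "HOME"
    return "AWAY"


def route_interface(location: str, cfg: LanConfig) -> str:
    """Interface that should carry the 10.0.0.0/24 subnet route."""
    return cfg.lan_iface if location == "HOME" else cfg.wg_iface
```

A network that reuses the home gateway IP but answers ARP with a different MAC is classified AWAY, so the subnet route stays on the tunnel.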

Why a subnet route, not per-host /32 pins (the old design): a /32 -interface route on macOS creates a self-MAC ARP entry that blackholes the host. A subnet route uses normal ARP, so every home host — at whatever DHCP address it currently holds — just works. This is drift-immune (apricot moving .116→.118 needs no config change) and free of the self-MAC bug. --status prints location + current route.

It re-reads mesh-hosts.json each cycle; a bad read keeps the last-good config and never tears down routing. This matters because it is a KeepAlive root daemon reading from a repo that autocommit rewrites underneath it.
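The keep-last-good behavior can be sketched as follows. The class name `LastGoodConfig` is hypothetical, and the only validation assumed here is that the file parses as a JSON object. The daemon may check more than that.

```python
import json
from typing import Optional


class LastGoodConfig:
    """Reload a JSON file each cycle, keeping the previous value on failure."""

    def __init__(self, path: str) -> None:
        self.path = path
        self.current: Optional[dict] = None  # None until the first good read

    def reload(self) -> Optional[dict]:
        try:
            with open(self.path, encoding="utf-8") as fh:
                data = json.load(fh)
            if not isinstance(data, dict):
                raise ValueError("top level must be a JSON object")
        except (OSError, ValueError):
            # Missing file, half-written file, or wrong shape: keep last-good.
            # json.JSONDecodeError is a subclass of ValueError.
            return self.current
        self.current = data
        return self.current
```

The caller applies routing from whatever `reload()` returns, so a half-written file during a commit leaves the previous routes in place.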

Supersedes both the old per-host identity-probe pinner and the wg-route-watchdog system daemon (which unconditionally forced 10.0.0.0/24 through the tunnel — the home branch is the new, smarter behavior; the away branch preserves the watchdog's original purpose). The watchdog was retired (/Library/LaunchDaemons/com.natalie.wg-route-watchdog.plist + /usr/local/sbin/wg-route-watchdog.sh removed).

Fleet rename

Names follow fruit family = machine class (apricot=GPU, pear=CPU/storage, yuzu=cloud, fennel=laptop), executed alias-first: mesh-hosts.json sets the fruit name canonical with the old name in aliases[], and every renderer emits both (pear.wg+black.wg, forge.pear.lan+forge.black.lan). Nothing breaks on day one. Irreversible cutovers are separately gated: OS hostname (hostnamectl/scutil — also fixes plum's stale plum.voyager.nasty.sh), the Forgejo URL, black's NFS export host, ssh stanzas, and the reference sweep (memory, CLAUDE.md, MCP ssh-by-name). Never retire an old name until every consumer resolves the new one. apricot is unchanged.
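Alias-first rendering can be sketched like this. The host dictionary keys `name` and `wg` are assumptions about the mesh-hosts.json schema (only `aliases[]` is named above), and the function names are hypothetical.

```python
def names_for(host: dict, suffix: str) -> list[str]:
    """Canonical name first, then every alias, each with the given suffix."""
    return [f"{n}.{suffix}" for n in [host["name"], *host.get("aliases", [])]]


def wg_records(hosts: list[dict]) -> list[tuple[str, str]]:
    """(ip, fqdn) pairs for the .wg view, one per canonical name and alias."""
    return [(h["wg"], fqdn) for h in hosts for fqdn in names_for(h, "wg")]


if __name__ == "__main__":
    # pear is canonical, black is the old name kept as an alias.
    hosts = [{"name": "pear", "aliases": ["black"], "wg": "10.9.0.4"}]
    for ip, fqdn in wg_records(hosts):
        print(ip, fqdn)
```

Retiring an old name then amounts to removing it from `aliases[]` once every consumer resolves the new one.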

Migration

This repo replaces tooling scattered across four places:

| Was | Now | Status |
|---|---|---|
| session-tools/data/wg-mesh-hosts.json | data/mesh-hosts.json (expanded: .wg view, hosts[], mac, identity, fruit names) | here |
| session-tools/bin/wg-dns-sync | bin/wg-dns-sync (robust symlink path resolution) | here + fixed |
| magic-civilization/scripts/lan/subscribe-black-dns.sh | none (retired: *.local scheme is dead) | removed |
| setup-lan-dns.sh (not in ~/Code; drifted) | bin/mesh-hosts-render | replaced |
| none | bin/host-apply (per-device ssh view) | new here |
| ~/bin/smart-lan-router.py (loose) | smart-lan-router/smart-lan-router.py (JSON-driven, self-heal) | here + fixed |
| ~/{install-smart-router.sh,com.lilith…plist} (loose) | smart-lan-router/ | here |

Pending — gated on greenlight (these touch live DNS on apricot):

  1. Re-clone/pull this repo on apricot and run ./install.sh.
  2. Run sudo wg-dns-sync on apricot from this repo; verify dnsmasq still serves (dig @10.9.0.2 quinn.apricot.lan, dig @10.9.0.2 apricot.wg).
  3. Update the two session-tools consumers that call the old path by absolute reference — bin/apricot-doctor ("$repo/bin/wg-dns-sync") and bin/quinn-phone-bootstrap (ssh apricot 'cd …/session-tools && sudo bin/wg-dns-sync') — to the new repo path.
  4. Run sudo mesh-hosts-render --install on apricot, black, and plum (every host that must resolve a peer's .wg name — dnsmasq only answers .wg for phones with DNS=10.9.0.2). Then on plum retire the old setup-lan-dns.sh static block and /etc/resolver/{apricot,black}.lan.
  5. fennel (laptop): sudo smart-lan-router/install-smart-router.sh reinstalls the LaunchDaemon pointed at the repo path and retires the loose ~/bin/smart-lan-router.py, ~/install-smart-router.sh, ~/com.lilith.smart-lan-router.plist. Verify that route -n get 10.0.0.11 reports interface: en0 (not utun*).
  6. Only after apricot is verified on the new path: remove the originals from session-tools and magic-civilization/scripts/lan, and push.
  7. Fleet rename cutovers (each independently, after the above): ssh stanzas → OS hostname → Forgejo forge.pear.lan vhost → NFS export → reference sweep. See Fleet rename.
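The verification in step 5 can be automated by parsing the output of `route -n get`. The sketch below works on captured text rather than invoking the command, and both function names are hypothetical.

```python
from typing import Optional


def route_interface_from(output: str) -> Optional[str]:
    """Extract the value of the 'interface:' field from `route -n get` output."""
    for line in output.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "interface":
            return value.strip()
    return None


def on_lan_fast_path(output: str) -> bool:
    """True if the route leaves via a non-tunnel interface (not utun*)."""
    iface = route_interface_from(output)
    return iface is not None and not iface.startswith("utun")


if __name__ == "__main__":
    # Illustrative sample only; real output has more fields.
    sample = "   route to: 10.0.0.11\n  interface: en0\n"
    print(route_interface_from(sample), on_lan_fast_path(sample))
```

In practice the text would come from something like `subprocess.run(["route", "-n", "get", "10.0.0.11"], capture_output=True, text=True).stdout` on the laptop.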

Do not delete the originals in the same change that adds this repo — every host still running the old path needs to re-install first.

Blocked right now: the laptop's LAN path to pear/apricot is degraded by a stale self-MAC ARP entry (10.0.0.11 → fennel's own MAC, permanent), the same bug the daemon now self-heals. Clear it (sudo arp -d 10.0.0.11) and reinstall the daemon (step 5) to restore the LAN fast-path before attempting any remote cutover.
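A self-MAC entry can be spotted by comparing each ARP entry's MAC with the laptop's own. The sketch below parses text in the style of macOS `arp -an` output. The line format it expects, the MAC addresses used when calling it, and the function names are assumptions, and it is not the daemon's actual self-heal code.

```python
import re

# Matches lines such as: "? (10.0.0.11) at a4:83:e7:1:2:3 on en0 permanent [ethernet]"
# Entries shown as "(incomplete)" do not match because the MAC group requires hex digits.
ARP_LINE = re.compile(r"\((\d+\.\d+\.\d+\.\d+)\) at ([0-9A-Fa-f:]+) ")


def normalize_mac(mac: str) -> str:
    """Lower-case and zero-pad each octet so '1:2' style octets compare equal."""
    return ":".join(part.zfill(2) for part in mac.lower().split(":"))


def self_mac_entries(arp_output: str, own_mac: str) -> list[str]:
    """Return the IPs whose ARP entry points at this machine's own MAC."""
    own = normalize_mac(own_mac)
    hits = []
    for line in arp_output.splitlines():
        m = ARP_LINE.search(line)
        if m and normalize_mac(m.group(2)) == own:
            hits.append(m.group(1))
    return hits
```

Any address it returns is a candidate for `sudo arp -d <ip>` as described above.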