net-tools/docs/topology.md
Natalie 68c848dc56 feat(@tools/net-tools): add tray icon system
Co-Authored-By: Lilith Autocommit <noreply@atlilith.com>
2026-06-10 02:20:23 -07:00

# Mesh topology
## Networks
```
           ┌─────────────────────────────────────────────┐
           │ yuzu (vps, quinn-vps) — 1984, Iceland       │
           │ WireGuard hub wg 10.9.0.1                   │
           │ public 89.127.233.145:51820                 │
           └───────────────┬─────────────────────────────┘
                           │ wg1 (AllowedIPs 10.9.0.0/24, 10.0.0.0/24)
       ┌───────────────┬───┴───────────────┬──────────────┐
       │               │                   │              │
┌──────┴──────┐ ┌──────┴──────┐     ┌──────┴──────┐ ┌─────┴───────┐
│ apricot     │ │ pear (black)│     │ fennel      │ │ strawberry  │
│ wg 10.9.0.2 │ │ wg 10.9.0.4 │     │ (plum)      │ │ (phone-     │
│ lan: DHCP,  │─│ lan 10.0.0. │     │ wg 10.9.0.3 │ │ quinn) ios  │
│ discovered  │L│ 11          │     │ macOS,      │ │ wg 10.9.0.5 │
│ mesh DNS    │A│ LAN DNS     │     │ roams       │ │ DNS client  │
└─────────────┘N└─────────────┘     └─────────────┘ └─────────────┘
apricot + pear share the home LAN (10.0.0.0/24); fennel joins it
when physically home; phones ride the tunnel with DNS=10.9.0.2.
```
- **Mesh `10.9.0.0/24`** — full WireGuard overlay via the Iceland hub. Every host
reaches every other by `<host>.wg` while the tunnel is up.
- **LAN `10.0.0.0/24`** — apricot + pear (black), plus fennel (plum) when home. The tunnel also
routes this /24, so `10.0.0.x` stays reachable off-LAN through the hub, at higher latency.
## DNS responsibilities — and how `.wg` actually resolves
Two delivery paths, and they serve different consumers. This distinction is
load-bearing (a config that *renders* a record is not the same as a client that
can *resolve* it):
- **apricot** runs dnsmasq bound to `10.9.0.2:53` (the mesh view). Serves the
host `.wg` + `.lan` records from `mesh-hosts.json`, written by `wg-dns-sync`.
**These records are consumed only by clients whose WireGuard config sets
`DNS=10.9.0.2` — i.e. phones.** The named hosts (apricot/pear/fennel) do *not*
point their resolver at `10.9.0.2`, so dnsmasq never receives their queries.
- **For the named hosts, names are delivered by the managed `/etc/hosts` block**
from `mesh-hosts-render --install` (bare + `.lan` at *current* IPs, `.wg`,
service vhosts). Every node's agent regenerates it automatically on drift.
- **fennel** roams off-LAN where dnsmasq is unreachable, so the managed
`/etc/hosts` block is its only resolution path then.
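
The managed-block idea can be illustrated with a minimal sketch. The marker strings, function names, and the `{name: ip}` record layout are assumptions for illustration; this is not the actual output format of `mesh-hosts-render`.

```python
# Hypothetical markers; the real tool's delimiters are not documented here.
BEGIN = "# >>> net-tools managed (hypothetical marker) >>>"
END = "# <<< net-tools managed (hypothetical marker) <<<"


def render_block(records: dict[str, str]) -> str:
    """Render {name: ip} as hosts lines wrapped in the managed markers."""
    lines = [f"{ip}\t{name}" for name, ip in sorted(records.items())]
    return "\n".join([BEGIN, *lines, END])


def splice(hosts_text: str, block: str) -> str:
    """Drop any existing managed block, keep every other line, append the new block."""
    kept, inside = [], False
    for line in hosts_text.splitlines():
        if line == BEGIN:
            inside = True
        elif line == END:
            inside = False
        elif not inside:
            kept.append(line)
    return "\n".join([*kept, block]) + "\n"
```

Splicing the same block twice yields the same file, which is what lets an agent regenerate it on every drift without touching hand-written entries.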

The old `*.local` platform scheme is **retired** (platform → `.com`, infra →
`.lan`); net-tools renders no `.local`.
## Reachability matrix
| from ↓ \ to → | apricot | pear | fennel | yuzu |
|---|---|---|---|---|
| **apricot** | — | `.lan` ✦ · `.wg` | `fennel.wg` only | `.wg` |
| **pear** | `.lan` ✦ · `.wg` | — | `fennel.wg` only | `.wg` |
| **fennel** | `.lan` ✦ · `.wg` ⚑ | `.lan` ✦ · `.wg` ⚑ | — | `.wg` only |
| **yuzu** | `.wg` only | `.wg` only | `fennel.wg` only | — |

✦ preferred when co-located on the home LAN · ⚑ fennel falls back to `.wg` when
it roams · **fennel and yuzu are only ever reachable inbound via `.wg`** (fennel
has no stable LAN IP; yuzu has no LAN leg) · strawberry is reachable at
`strawberry.wg` (10.9.0.5) when its tunnel is up, but runs no services.

`.wg` in this matrix resolves via each node's managed `/etc/hosts` block, which
every agent maintains — the dnsmasq `.wg` records are the phones-only path
(see DNS responsibilities above).
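
The matrix above reduces to one rule, sketched here as a small function. The function name, the `fennel_home` flag, and the bare `<host>.lan` / `<host>.wg` name shapes are assumptions for illustration, not code from the repo.

```python
LAN_NODES = {"apricot", "pear"}  # stable members of the home LAN
ROAMING = {"fennel"}             # on the home LAN only when physically home


def preferred_name(src: str, dst: str, fennel_home: bool = False) -> str:
    """Return the name src should use to reach dst, following the matrix."""
    if src == dst:
        raise ValueError("the matrix has no self-route")
    src_on_lan = src in LAN_NODES or (src in ROAMING and fennel_home)
    # Only apricot and pear are reachable inbound via .lan; fennel and yuzu
    # are .wg-only targets regardless of where the caller sits.
    if src_on_lan and dst in LAN_NODES:
        return f"{dst}.lan"
    return f"{dst}.wg"
```

For example, `preferred_name("fennel", "apricot")` gives `apricot.wg` while fennel roams and `apricot.lan` once `fennel_home=True`.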
## Hub IP note
fennel's (plum's) live `wg1.conf` endpoint is `89.127.233.145:51820`. An older
`magic-civilization/scripts/lan/README.md` also lists `93.95.231.174` for the
Iceland hub — treat that as stale/secondary unless confirmed against the hub's
own WireGuard config. `mesh-hosts.json` records only the live `.145`.
## The fleet agent
`smart-lan-router/smart-lan-router.py` runs as a root service on **every node**
(launchd on darwin, systemd on linux — `install-agent.sh` picks). One codebase;
each node derives its roles from its own `mesh-hosts.json` entry:
| Role | Who | What |
|------|-----|------|
| pull | all | `git pull` as the repo owner (never root); exit-and-restart when its own code changes — pushing to the forge updates the fleet |
| hostname | all (`fleet.enforce_hostname`) | converge the OS hostname to the canonical name — the fleet renames hosts, humans don't |
| discover | LAN nodes | declared MAC → current DHCP IP via ARP/`ip neigh` → `data/lan-state.json` (each LAN node discovers independently) |
| route | laptop, darwin | the home/away subnet switch below |
| render | all | regenerate `/etc/hosts` + ssh config on any change, at this node's vantage (mesh-only nodes resolve everything via `.wg` IPs) |
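
The discover role's MAC-to-IP step can be sketched as below for the Linux case. It parses the text that `ip neigh` prints; the function names and the `{name: mac}` input shape are assumptions, and the real agent may read the neighbour table differently.

```python
def parse_ip_neigh(output: str) -> dict[str, str]:
    """Map lowercase MAC -> IPv4 address from `ip neigh` text output."""
    table: dict[str, str] = {}
    for line in output.splitlines():
        fields = line.split()
        # Skip entries with no resolved MAC (FAILED/INCOMPLETE) and IPv6 rows.
        if "lladdr" not in fields or ":" in fields[0]:
            continue
        idx = fields.index("lladdr")
        if idx + 1 < len(fields):
            table[fields[idx + 1].lower()] = fields[0]
    return table


def resolve_hosts(declared: dict[str, str], neigh: dict[str, str]) -> dict[str, str]:
    """declared is {name: mac}; return {name: current_ip} for hosts currently seen."""
    return {
        name: neigh[mac.lower()]
        for name, mac in declared.items()
        if mac.lower() in neigh
    }
```

Hosts missing from the neighbour table are simply absent from the result, which is the case where a ping-sweep would be needed before retrying.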

**The problem the route role solves** (originally a laptop problem): the wg config's `AllowedIPs` includes `10.0.0.0/24`, so
the tunnel installs a route capturing the *entire* home LAN. While home, traffic
to home hosts hairpins through the Iceland hub (~350 ms) instead of going out the
LAN interface (~5 ms). (Measured: apricot 351 ms via tunnel → 5.6 ms via en0.)

**What it does, each cycle:**
1. **Detect location** — read the default route's gateway + interface. It's HOME
iff the gateway is `lan.gateway` *and* its ARP MAC == `lan.gateway_mac` (the
home gateway's fingerprint). The MAC check is what distinguishes the real home
LAN from a visited café network that also happens to use `10.0.0.0/24`.
2. **Switch the subnet route** — HOME → `route 10.0.0.0/24` via the LAN interface
(direct); AWAY → via the wg mesh interface (so home stays reachable through the
tunnel). Re-asserted every cycle, because `wg-quick` re-adds the tunnel `/24`
on reconnect.
3. **Name-sync (discover role)** — keep ssh + hosts in sync with reality. Each
LAN host's **MAC is stable while its DHCP IP drifts**, and the neighbour
table (ARP / `ip neigh`) maps MAC↔IP. The agent reads it (rate-limited
ping-sweep of the `/24` when a host is missing), resolves every `hosts[]`
entry with a `mac` to its *current* IP, and on any change writes
`data/lan-state.json` ({name: ip}, gitignored — volatile, per-device) and
regenerates both views: `mesh-hosts-render --install` (`/etc/hosts`) and, as
the node's render user (its `ssh_user`), `host-apply --ssh-apply`
(`~/.ssh/config`). Proven live: when apricot rebooted from `.116` to `.118`,
`ssh apricot` and `quinn.apricot.lan` followed automatically — no DHCP
reservations, no hand-edits.
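
Steps 1 and 2 above come down to a pure decision, sketched below. The `DefaultRoute` type, the function names, and the example interface names are assumptions; reading the routing table and actually installing the route are left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DefaultRoute:
    gateway: str      # IP of the current default gateway
    interface: str    # interface the default route uses, e.g. "en0"
    gateway_mac: str  # MAC the neighbour table currently maps to that gateway


def is_home(route: DefaultRoute, lan_gateway: str, lan_gateway_mac: str) -> bool:
    """HOME only if both the gateway IP and its MAC fingerprint match."""
    return (
        route.gateway == lan_gateway
        and route.gateway_mac.lower() == lan_gateway_mac.lower()
    )


def subnet_route_interface(
    route: DefaultRoute, lan_gateway: str, lan_gateway_mac: str, wg_interface: str
) -> str:
    """Interface the home /24 should point at this cycle."""
    if is_home(route, lan_gateway, lan_gateway_mac):
        return route.interface  # direct, via the LAN
    return wg_interface         # away: keep home reachable through the tunnel
```

A café network that reuses the same gateway IP fails the MAC comparison, so the /24 stays on the tunnel there.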

**Why a subnet route, not per-host `/32` pins** (the old design): a
`/32 -interface` route on macOS creates a *self-MAC* ARP entry that blackholes the
host. A subnet route uses normal ARP, so every home host — at whatever DHCP
address it currently holds — just works. This is **drift-immune** (apricot moving
`.116→.118` needs no config change) and free of the self-MAC bug. `--status`
prints location + current route.

The agent re-reads `mesh-hosts.json` each cycle; a bad read keeps the last-good
config and never tears down routing. This matters because the agent is a
`KeepAlive` root daemon running over an autocommit-written repo, so the file can
change underneath it at any time.
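
The keep-last-good behaviour can be sketched as follows. The function name and the rule that the top level must be a JSON object are assumptions; the agent's actual validation is not shown in this document.

```python
import json
from typing import Optional


def reload_config(path: str, last_good: Optional[dict]) -> Optional[dict]:
    """Return the freshly parsed config, or last_good if the read fails."""
    try:
        with open(path, encoding="utf-8") as fh:
            fresh = json.load(fh)
    except (OSError, ValueError):
        # Missing file, permission error, or truncated/invalid JSON:
        # keep running on the previous config instead of tearing anything down.
        return last_good
    if not isinstance(fresh, dict):
        return last_good
    return fresh
```

Each cycle would call it as `config = reload_config(path, config)`, so a half-written file costs one stale cycle at most.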

**Supersedes** both the old per-host identity-probe pinner *and* the
`wg-route-watchdog` system daemon, which unconditionally forced `10.0.0.0/24`
through the tunnel. The home branch is the new, smarter behavior; the away
branch preserves the watchdog's original purpose. The watchdog was retired
(`/Library/LaunchDaemons/com.natalie.wg-route-watchdog.plist` +
`/usr/local/sbin/wg-route-watchdog.sh` removed).
## Fleet rename
Names follow **fruit family = machine class** (apricot=GPU stone fruit,
pear=CPU/storage pome, yuzu=cloud citrus, fennel=laptop vegetable,
strawberry=phone berry), executed **alias-first**: the fruit name is canonical,
the old name lives in `aliases[]` forever, and every renderer emits both —
`pear.wg`+`black.wg`, `forge.pear.lan`+`forge.black.lan`, `ssh black` keeps
working. Old names are **never retired**; nothing that says "black" ever breaks.

**OS hostnames converge automatically**: with `fleet.enforce_hostname: true`,
each node's agent renames its own OS (`scutil` ×3 on macOS / `hostnamectl` on
Linux) to the canonical name on its next cycle. This is how the relic FQDNs
(`plum.voyager.nasty.sh`, `0.vps.1984.uvlava.com`) get retired. Never run the
rename by hand. String-identity consumers stay untouched on purpose: the Forgejo
runner label stays `black` (workflows reference it), and the forge URL and NFS
exports keep their old names as permanent aliases.
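
Alias-first rendering can be sketched in a few lines. The `name` field and the function name are assumptions about the `mesh-hosts.json` schema; only `aliases[]` is named in this document.

```python
def names_for(host: dict, suffixes: tuple[str, ...] = ("wg", "lan")) -> list[str]:
    """Canonical name first, then every alias, repeated for each suffix."""
    labels = [host["name"], *host.get("aliases", [])]
    return [f"{label}.{suffix}" for suffix in suffixes for label in labels]
```

For `{"name": "pear", "aliases": ["black"]}` this yields `pear.wg`, `black.wg`, `pear.lan`, `black.lan`, so anything that still says "black" keeps resolving.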
## Migration
This repo replaces tooling scattered across four places:
| Was | Now | Status |
|-----|-----|--------|
| `session-tools/data/wg-mesh-hosts.json` | `data/mesh-hosts.json` (expanded: `.wg` view, hosts[], mac, identity, fruit names) | ✅ here |
| `session-tools/bin/wg-dns-sync` | `bin/wg-dns-sync` (robust symlink path resolution) | ✅ here + fixed |
| `magic-civilization/scripts/lan/subscribe-black-dns.sh` | — (retired: `*.local` scheme is dead) | ✅ removed |
| `setup-lan-dns.sh` (not in ~/Code — drifted) | `bin/mesh-hosts-render` | ✅ replaced |
| — (new) | `bin/host-apply` (per-device ssh view) | ✅ here |
| `~/bin/smart-lan-router.py` (loose) | `smart-lan-router/smart-lan-router.py` (JSON-driven, self-heal) | ✅ here + fixed |
| `~/{install-agent.sh,com.lilith…plist}` (loose) | `smart-lan-router/` | ✅ here |

**Done (2026-06-09):** agent installed + verified on all four nodes (launchd on
fennel; systemd on pear/apricot/yuzu); all three remote nodes are real git
clones of `origin/main` (repo public on the LAN-only forge for credential-less
pulls); `mesh-hosts-render --install` + `host-apply --ssh-apply` live on all
four; fennel's hostname converged; the old `wg-route-watchdog`, `setup-lan-dns`
block, `/etc/resolver/*.lan` files, loose `~/bin/smart-lan-router.py`, and the
stale self-MAC ARP entry are all retired.

**Still pending:**

1. **apricot mesh-DNS cutover** — run `sudo bin/wg-dns-sync` on apricot from
this repo (serves phones the `.wg`/`.lan` names); verify
`dig @10.9.0.2 apricot.wg`. Then update the two session-tools consumers that
call the old absolute path (`bin/apricot-doctor`, `bin/quinn-phone-bootstrap`)
and delete the originals from `session-tools/{data,bin}`.
2. **pear/yuzu hostname convergence** — automatic on the next pull cycle after
the `fleet.enforce_hostname` commit lands on the forge (the agents do it;
watch for `hostname converged: black → pear` in the journal).
3. **yuzu → home ssh auth** — yuzu reaches pear/apricot by name but its key is
not authorized there. Deliberate: internet-facing node, least-privilege.
Grant only if actually needed.