# Mesh topology

## Networks

```
           ┌─────────────────────────────────────────────┐
           │  yuzu (vps, quinn-vps) — 1984, Iceland      │
           │  WireGuard hub  wg     10.9.0.1             │
           │                 public 89.127.233.145:51820 │
           └───────────────┬─────────────────────────────┘
                           │ wg1 (AllowedIPs 10.9.0.0/24, 10.0.0.0/24)
       ┌───────────────┬───┴───────────────┬──────────────┐
       │               │                   │              │
┌──────┴──────┐ ┌──────┴──────┐     ┌──────┴──────┐ ┌─────┴───────┐
│ apricot     │ │ pear (black)│     │ fennel      │ │ strawberry  │
│ wg 10.9.0.2 │ │ wg 10.9.0.4 │     │ (plum)      │ │ (phone-     │
│ lan: DHCP,  │─│ lan 10.0.0. │     │ wg 10.9.0.3 │ │ quinn) ios  │
│ discovered  │L│ 11          │     │ macOS,      │ │ wg 10.9.0.5 │
│ mesh DNS    │A│ LAN DNS     │     │ roams       │ │ DNS client  │
└─────────────┘N└─────────────┘     └─────────────┘ └─────────────┘
  apricot + pear share the home LAN (10.0.0.0/24); fennel joins it
  when physically home; phones ride the tunnel with DNS=10.9.0.2.
```

- **Mesh `10.9.0.0/24`**: full WireGuard overlay via the Iceland hub. Every host
  reaches every other by `<host>.wg` while the tunnel is up.
- **LAN `10.0.0.0/24`**: apricot + pear (black), plus fennel (plum) when home. The
  tunnel also routes this /24, so `10.0.0.x` works off-LAN through the hub
  (higher latency).

## DNS responsibilities: how `.wg` actually resolves

There are two delivery paths, and they serve different consumers. This
distinction is load-bearing (a config that *renders* a record is not the same as
a client that can *resolve* it):

- **apricot** runs dnsmasq bound to `10.9.0.2:53` (the mesh view). It serves the
  host `.wg` + `.lan` records from `mesh-hosts.json`, written by `wg-dns-sync`.
  **These records are consumed only by clients whose WireGuard config sets
  `DNS=10.9.0.2`, i.e. phones.** The named hosts (apricot/pear/fennel) do *not*
  point their resolver at `10.9.0.2`, so dnsmasq never answers their queries.
- **For the named hosts, names are delivered by the managed `/etc/hosts` block**
  from `mesh-hosts-render --install` (bare + `.lan` at *current* IPs, `.wg`,
  service vhosts). Every node's agent regenerates it automatically on drift.
- **fennel** roams off-LAN where dnsmasq is unreachable, so the managed
  `/etc/hosts` block is its only resolution path then.
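
The managed-block path can be pictured with a small sketch. This is not the actual `mesh-hosts-render` code: the marker strings, the function names, and the `(ip, names)` input shape are assumptions made for illustration. It shows one way to swap a delimited block inside a hosts file so that re-rendering on drift is idempotent and never duplicates entries.

```python
# Hypothetical markers; the real renderer may delimit its block differently.
BEGIN = "# >>> mesh-hosts (managed) >>>"
END = "# <<< mesh-hosts (managed) <<<"


def render_block(entries):
    """Turn (ip, [names]) pairs into the text of one managed block."""
    lines = [BEGIN]
    for ip, names in entries:
        lines.append(f"{ip}\t{' '.join(names)}")
    lines.append(END)
    return "\n".join(lines)


def install_block(hosts_text, block):
    """Return hosts_text with any old managed block dropped and `block` appended."""
    kept, inside = [], False
    for line in hosts_text.splitlines():
        if line.strip() == BEGIN:
            inside = True
            continue
        if line.strip() == END:
            inside = False
            continue
        if not inside:
            kept.append(line)
    # Trim trailing blank lines so repeated installs do not grow the file.
    while kept and not kept[-1].strip():
        kept.pop()
    return "\n".join(kept + ["", block]) + "\n"
```

Everything outside the markers is left untouched, which is what lets a root agent rewrite the block on every change without clobbering hand-maintained entries.
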

The old `*.local` platform scheme is **retired** (platform → `.com`, infra →
`.lan`); net-tools renders no `.local`.

## Reachability matrix

| from ↓ \ to → | apricot | pear | fennel | yuzu |
|---|---|---|---|---|
| **apricot** | — | `.lan` ✦ · `.wg` | `fennel.wg` only | `.wg` |
| **pear** | `.lan` ✦ · `.wg` | — | `fennel.wg` only | `.wg` |
| **fennel** | `.lan` ✦ · `.wg` ⚑ | `.lan` ✦ · `.wg` ⚑ | — | `.wg` only |
| **yuzu** | `.wg` only | `.wg` only | `fennel.wg` only | — |

✦ preferred when co-located on the home LAN · ⚑ fennel falls back to `.wg` when
it roams · **fennel and yuzu are only ever reachable inbound via `.wg`** (fennel
has no stable LAN IP; yuzu has no LAN leg) · strawberry is reachable at
`strawberry.wg` (10.9.0.5) when its tunnel is up, but runs no services.

`.wg` in this matrix resolves via each node's managed `/etc/hosts` block, which
every agent maintains; the dnsmasq `.wg` records are the phones-only path
(see DNS responsibilities above).
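
The matrix can also be read as a small decision rule. The sketch below is only an encoding of the table for illustration; `preferred_name` and `LAN_TARGETS` are hypothetical names, not part of the repo's tooling.

```python
# Hosts that accept inbound connections on a LAN name, per the matrix.
LAN_TARGETS = {"apricot", "pear"}


def preferred_name(src, dst, src_on_home_lan):
    """Pick the name `src` should use to reach `dst`, following the matrix."""
    if src == dst:
        raise ValueError("src and dst must differ")
    # yuzu has no LAN leg, so it always uses the mesh name.
    if dst in LAN_TARGETS and src_on_home_lan and src != "yuzu":
        return f"{dst}.lan"
    # fennel and yuzu are only reachable inbound via .wg; roaming falls back to .wg.
    return f"{dst}.wg"
```

For example, `preferred_name("fennel", "apricot", True)` yields `apricot.lan`, while the same call with `False` (fennel roaming) yields `apricot.wg`.
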

## Hub IP note

The live `wg1.conf` endpoint on fennel (plum) is `89.127.233.145:51820`. An older
`magic-civilization/scripts/lan/README.md` also lists `93.95.231.174` for the
Iceland hub; treat that as stale/secondary unless confirmed against the hub's
own WireGuard config. `mesh-hosts.json` records only the live `.145`.

## The fleet agent

`smart-lan-router/smart-lan-router.py` runs as a root service on **every node**
(launchd on darwin, systemd on linux; `install-agent.sh` picks). One codebase;
each node derives its roles from its own `mesh-hosts.json` entry:

| Role | Who | What |
|------|-----|------|
| pull | all | `git pull` as the repo owner (never root); exit-and-restart when its own code changes, so pushing to the forge updates the fleet |
| hostname | all (`fleet.enforce_hostname`) | converge the OS hostname to the canonical name: the fleet renames hosts, humans don't |
| discover | LAN nodes | declared MAC → current DHCP IP via ARP/`ip neigh` → `data/lan-state.json` (each LAN node discovers independently) |
| route | laptop, darwin | the home/away subnet switch below |
| render | all | regenerate `/etc/hosts` + ssh config on any change, at this node's vantage (mesh-only nodes resolve everything via `.wg` IPs) |

**The problem the route role solves** (the original laptop problem): the wg
config's `AllowedIPs` includes `10.0.0.0/24`, so the tunnel installs a route
capturing the *entire* home LAN. While home, traffic to home hosts hairpins
through the Iceland hub (~350ms) instead of going out the LAN interface (~5ms).
(Measured: apricot 351ms via tunnel → 5.6ms via en0.)

**What it does, each cycle:**

1. **Detect location**: read the default route's gateway + interface. The node
   is HOME iff the gateway is `lan.gateway` *and* its ARP MAC == `lan.gateway_mac`
   (the home gateway's fingerprint). The MAC check is what distinguishes the
   real home LAN from a visited café network that also happens to use
   `10.0.0.0/24`.
2. **Switch the subnet route**: HOME → `route 10.0.0.0/24` via the LAN interface
   (direct); AWAY → via the wg mesh interface (so home stays reachable through
   the tunnel). Re-asserted every cycle, because `wg-quick` re-adds the tunnel
   `/24` on reconnect.
3. **Name-sync (discover role)**: keep ssh + hosts in sync with reality. Each
   LAN host's **MAC is stable while its DHCP IP drifts**, and the neighbour
   table (ARP / `ip neigh`) maps MAC↔IP. The agent reads it (with a rate-limited
   ping-sweep of the `/24` when a host is missing), resolves every `hosts[]`
   entry that has a `mac` to its *current* IP, and on any change writes
   `data/lan-state.json` (`{name: ip}`; gitignored because it is volatile and
   per-device) and regenerates both views: `mesh-hosts-render --install`
   (`/etc/hosts`) and, as the node's render user (its `ssh_user`),
   `host-apply --ssh-apply` (`~/.ssh/config`). Proven live: when apricot
   rebooted from `.116` to `.118`, `ssh apricot` and `quinn.apricot.lan`
   followed automatically, with no DHCP reservations and no hand-edits.
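
Steps 1 and 3 can be sketched as pure functions. This is a minimal illustration, not the agent's code: the function names are invented, the `name` key on each `hosts[]` entry is an assumption (only `mac`, `lan.gateway`, and `lan.gateway_mac` are named above), and only Linux `ip neigh` output is parsed here.

```python
import re


def norm_mac(mac):
    """Lower-case and zero-pad each octet (macOS `arp` prints e.g. `0:1a:...`)."""
    return ":".join(part.zfill(2) for part in mac.lower().split(":"))


def is_home(gateway_ip, gateway_mac, lan):
    """HOME iff both the gateway IP and its MAC fingerprint match the config."""
    return (
        bool(gateway_mac)
        and gateway_ip == lan["gateway"]
        and norm_mac(gateway_mac) == norm_mac(lan["gateway_mac"])
    )


def parse_ip_neigh(text):
    """Map MAC -> IPv4 from `ip neigh` lines; entries without lladdr are skipped."""
    table = {}
    for line in text.splitlines():
        match = re.match(r"(\d+\.\d+\.\d+\.\d+) .*\blladdr (\S+)", line)
        if match:
            table[norm_mac(match.group(2))] = match.group(1)
    return table


def resolve_hosts(hosts, neighbours):
    """Build the {name: ip} mapping for every host whose MAC is currently visible."""
    state = {}
    for host in hosts:
        mac = host.get("mac")
        if mac and norm_mac(mac) in neighbours:
            state[host["name"]] = neighbours[norm_mac(mac)]
    return state
```

Keying the lookup on the MAC rather than the IP is what makes the result follow a host across DHCP leases, as in the apricot `.116` to `.118` move.
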

**Why a subnet route, not per-host `/32` pins** (the old design): a
`/32 -interface` route on macOS creates a *self-MAC* ARP entry that blackholes
the host. A subnet route uses normal ARP, so every home host, at whatever DHCP
address it currently holds, just works. This is **drift-immune** (apricot moving
`.116→.118` needs no config change) and free of the self-MAC bug. `--status`
prints location + current route.
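
The HOME/AWAY switch itself reduces to choosing one interface for the whole `/24`. The sketch below only builds argument lists and runs nothing; it assumes the BSD-style `route` syntax found on macOS, and both the function name and the `utun4` interface in the usage note are hypothetical.

```python
LAN_SUBNET = "10.0.0.0/24"


def route_commands(at_home, lan_if, wg_if, subnet=LAN_SUBNET):
    """Return argv lists that re-point the subnet route at the right interface."""
    target = lan_if if at_home else wg_if
    return [
        # Drop whatever currently owns the subnet (e.g. the tunnel's route).
        ["route", "-n", "delete", "-net", subnet],
        # Re-add it as a single subnet route, never as per-host /32 pins.
        ["route", "-n", "add", "-net", subnet, "-interface", target],
    ]
```

Calling `route_commands(True, "en0", "utun4")` targets `en0`; with `False` it targets the tunnel interface. Because the output depends only on location, re-asserting it every cycle is harmless.
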

The agent re-reads `mesh-hosts.json` each cycle; a bad read keeps the last-good
config and never tears down routing (it runs as a `KeepAlive` root daemon over
an autocommit-written repo).
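
A keep-last-good reload can be as small as the following sketch. The class name and the top-level-object check are assumptions for illustration; the point is only that a failed read leaves the previous config in place.

```python
import json


class LastGoodConfig:
    """Holds the most recent successfully parsed config for one JSON file."""

    def __init__(self, path):
        self.path = path
        self.data = None  # stays None until the first successful read

    def reload(self):
        """Re-read the file; on any read or parse failure keep the old data."""
        try:
            with open(self.path, encoding="utf-8") as handle:
                fresh = json.load(handle)
            if not isinstance(fresh, dict):
                raise ValueError("top level must be a JSON object")
        except (OSError, ValueError):
            # JSONDecodeError is a ValueError; a half-written file lands here.
            return self.data
        self.data = fresh
        return self.data
```

A caller would invoke `reload()` at the top of each cycle and act on the returned value, skipping the cycle only while it is still `None`.
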

**Supersedes** both the old per-host identity-probe pinner *and* the
`wg-route-watchdog` system daemon, which unconditionally forced `10.0.0.0/24`
through the tunnel. The home branch is the new, smarter behavior; the away
branch preserves the watchdog's original purpose. The watchdog was retired
(`/Library/LaunchDaemons/com.natalie.wg-route-watchdog.plist` +
`/usr/local/sbin/wg-route-watchdog.sh` removed).

## Fleet rename

Names follow **fruit family = machine class** (apricot=GPU stone fruit,
pear=CPU/storage pome, yuzu=cloud citrus, fennel=laptop vegetable,
strawberry=phone berry), executed **alias-first**: the fruit name is canonical,
the old name lives in `aliases[]` forever, and every renderer emits both
(`pear.wg`+`black.wg`, `forge.pear.lan`+`forge.black.lan`), so `ssh black` keeps
working. Old names are **never retired**; nothing that says "black" ever breaks.
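
Alias-first rendering amounts to expanding each host into its canonical label plus every alias. A minimal sketch, assuming a host entry shaped like `{"name": ..., "aliases": [...]}` (the `name` key and the `names_for` helper are hypothetical; only `aliases[]` is named above):

```python
def names_for(host, suffix, service=None):
    """Return the canonical name first, then one name per alias."""
    labels = [host["name"], *host.get("aliases", [])]
    prefix = f"{service}." if service else ""
    return [f"{prefix}{label}{suffix}" for label in labels]
```

With `pear = {"name": "pear", "aliases": ["black"]}`, `names_for(pear, ".wg")` gives `pear.wg` and `black.wg`, and `names_for(pear, ".lan", "forge")` gives `forge.pear.lan` and `forge.black.lan`.
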

**OS hostnames converge automatically**: with `fleet.enforce_hostname: true`,
each node's agent renames its own OS (`scutil` ×3 / `hostnamectl`) to the
canonical name on its next cycle. This is how the relic FQDNs
(`plum.voyager.nasty.sh`, `0.vps.1984.uvlava.com`) die. Never run the rename by
hand. String-identity consumers stay untouched on purpose: the Forgejo runner
label stays `black` (workflows reference it), and the forge URL and NFS exports
keep their old names as permanent aliases.
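
For orientation, the per-platform rename commands might look like the sketch below. It only returns argument lists and is not meant to be run by hand; reading "`scutil` ×3" as the `ComputerName`, `HostName`, and `LocalHostName` keys is an assumption, as is the helper name.

```python
def hostname_commands(canonical, platform):
    """Return argv lists that would set the OS hostname to `canonical`."""
    if platform == "darwin":
        # macOS keeps three separate name keys, hence three scutil calls.
        keys = ("ComputerName", "HostName", "LocalHostName")
        return [["scutil", "--set", key, canonical] for key in keys]
    if platform.startswith("linux"):
        return [["hostnamectl", "set-hostname", canonical]]
    raise ValueError(f"unsupported platform: {platform}")
```

The `platform` argument is expected to look like `sys.platform` (`"darwin"` or `"linux"`).
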

## Migration

This repo replaces tooling scattered across four places:

| Was | Now | Status |
|-----|-----|--------|
| `session-tools/data/wg-mesh-hosts.json` | `data/mesh-hosts.json` (expanded: `.wg` view, hosts[], mac, identity, fruit names) | ✅ here |
| `session-tools/bin/wg-dns-sync` | `bin/wg-dns-sync` (robust symlink path resolution) | ✅ here + fixed |
| `magic-civilization/scripts/lan/subscribe-black-dns.sh` | — (retired: `*.local` scheme is dead) | ✅ removed |
| `setup-lan-dns.sh` (not in ~/Code, drifted) | `bin/mesh-hosts-render` | ✅ replaced |
| — (new) | `bin/host-apply` (per-device ssh view) | ✅ here |
| `~/bin/smart-lan-router.py` (loose) | `smart-lan-router/smart-lan-router.py` (JSON-driven, self-heal) | ✅ here + fixed |
| `~/{install-agent.sh,com.lilith…plist}` (loose) | `smart-lan-router/` | ✅ here |

**Done (2026-06-09):** agent installed + verified on all four nodes (launchd on
fennel; systemd on pear/apricot/yuzu); all three remote nodes are real git
clones of `origin/main` (repo public on the LAN-only forge for credential-less
pulls); `mesh-hosts-render --install` + `host-apply --ssh-apply` live on all
four; fennel's hostname converged; the old `wg-route-watchdog`, `setup-lan-dns`
block, `/etc/resolver/*.lan` files, loose `~/bin/smart-lan-router.py`, and the
stale self-MAC ARP entry are all retired.

**Still pending:**

1. **apricot mesh-DNS cutover**: run `sudo bin/wg-dns-sync` on apricot from
   this repo (it serves phones the `.wg`/`.lan` names); verify with
   `dig @10.9.0.2 apricot.wg`. Then update the two session-tools consumers that
   call the old absolute path (`bin/apricot-doctor`, `bin/quinn-phone-bootstrap`)
   and delete the originals from `session-tools/{data,bin}`.
2. **pear/yuzu hostname convergence**: automatic on the next pull cycle after
   the `fleet.enforce_hostname` commit lands on the forge (the agents do it;
   watch for `hostname converged: black → pear` in the journal).
3. **yuzu → home ssh auth**: yuzu reaches pear/apricot by name but its key is
   not authorized there. This is deliberate: it is an internet-facing node, so
   least-privilege applies. Grant access only if actually needed.