# net-tools

Mesh/LAN tooling for the four-host **wg1 mesh** + home LAN, built around one
source of truth ([`data/mesh-hosts.json`](data/mesh-hosts.json)).

Components:
- **`bin/net`** — **the one command**: `status · whoami · doctor · issues ·
sync · up · down · enroll phone · gui`. Imports the agent as a library, so
every surface shares one implementation. The renderers (`host-apply`,
`mesh-hosts-render`, `wg-dns-sync`, `fleet-status`) remain as internals/direct
tools.
- **[`data/known-issues.json`](data/known-issues.json)** — the **triage
registry**: features that are known-broken or intentionally parked. `net
issues` lists them; `net doctor <host>` annotates each host with its parked
features (`⚠ KNOWN-…`) so a triaged problem is never re-investigated from
scratch. An optional per-issue `probe` (same shape as a host `identity`)
lets `doctor` flag an issue as *maybe-resolved* when it starts passing.
- **`gui/`** — Mesh control, the for-dummies window (`net gui`): plain-language
status per device; right-click for the power tools (copy address, ssh here,
diagnose path, `.wg` address). Every menu item is a `net` verb.
- **`tray/`** — the macOS menu-bar fleet tray (tunnel control + live fleet view).
- **[`smart-lan-router/`](smart-lan-router/)** — the **fleet agent**: one
service, identical on every node, roles derived from each node's entry in the
source of truth. It pulls the repo (config + its own code), discovers hosts'
current IPs by MAC, converges OS hostnames, switches the laptop's home/away
route, and regenerates the local views. (The home gateway is a dumb Xfinity
box with no API; all intelligence lives in the fleet.)

Everything that needs a host address, MAC, or identity probe derives from one
file: [`data/mesh-hosts.json`](data/mesh-hosts.json). Never hardcode a mesh IP,
MAC, or identity URL anywhere else — add it here and regenerate.
## The four hosts — fruit family encodes machine class
| Class | Canonical | Old alias | LAN | WG mesh | Public |
|-------|-----------|-----------|-----|---------|--------|
| GPU compute (stone fruit) | **apricot** | — | *DHCP, discovered* | `10.9.0.2` | — |
| CPU / storage (pome) | **pear** | `black` | `10.0.0.11` | `10.9.0.4` | — |
| laptop (vegetable) | **fennel** | `plum` | *roams* | `10.9.0.3` | — |
| cloud hub (citrus) | **yuzu** | `vps`,`quinn-vps` | — | `10.9.0.1` | `89.127.233.145` |
| phone (berry) | **strawberry** | `phone-quinn` | — | `10.9.0.5` | — |

LAN IPs are *live state*, not promises: agents discover them by MAC
(`data/lan-state.json`); the table's fixed entries are just today's DHCP truth.
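
The MAC-based discovery described above can be sketched as follows. This is a minimal illustration, not the agent's actual code: it assumes the neighbour table is read as `ip neigh` text on Linux, and the `declared_macs` mapping, the MAC addresses, and the sample output are all invented for the example.

```python
import re

# One IPv4 line of `ip neigh` output looks like:
#   10.0.0.11 dev enp3s0 lladdr 02:00:00:00:00:0b STALE
NEIGH_RE = re.compile(
    r"^(?P<ip>\d+\.\d+\.\d+\.\d+)\s.*\blladdr\s+(?P<mac>[0-9a-fA-F:]{17})\b"
)

def discover(neigh_output, declared_macs):
    """Return {host: current_ip} for every declared MAC seen in the table.

    declared_macs maps host name -> MAC address (hypothetical shape).
    """
    by_mac = {}
    for line in neigh_output.splitlines():
        match = NEIGH_RE.match(line.strip())
        if match:  # lines without an lladdr (e.g. FAILED) never match
            by_mac[match.group("mac").lower()] = match.group("ip")
    return {
        host: by_mac[mac.lower()]
        for host, mac in declared_macs.items()
        if mac.lower() in by_mac
    }

# Invented sample data: locally administered MACs, not real fleet values.
SAMPLE = """\
10.0.0.1 dev enp3s0 lladdr 02:00:00:00:00:fe REACHABLE
10.0.0.11 dev enp3s0 lladdr 02:00:00:00:00:0b STALE
10.0.0.99 dev enp3s0 FAILED
fe80::1 dev enp3s0 lladdr 02:00:00:00:00:fe router STALE
"""
DECLARED = {"pear": "02:00:00:00:00:0B", "apricot": "02:00:00:00:00:0a"}

print(discover(SAMPLE, DECLARED))
```

Hosts whose MAC is absent from the table are simply left out of the result, which is one plausible way to treat a host that is currently off the LAN.
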

The rename is **alias-first**: the fruit name is canonical and the old name is a
permanent alias. Every renderer emits **both**, so `pear.wg` *and* `black.wg`
resolve and `ssh black` keeps working forever. OS hostnames are converged **by the
fleet itself**: `fleet.enforce_hostname: true` makes each agent rename its own
node (never run `hostnamectl`/`scutil` by hand). Old names are never retired —
the forge URL, NFS paths, and every `.git/config` keep resolving untouched.
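
As a rough sketch of what "emits both" means for a renderer, the function below produces one `.wg` record per canonical name and per alias. The entry shape (`name`, `aliases`, `wg`) is an assumption made for this example; the real `mesh-hosts.json` schema may differ.

```python
def wg_records(hosts):
    """Return hosts-file lines: the canonical name first, then every alias."""
    lines = []
    for host in hosts:
        for name in [host["name"], *host.get("aliases", [])]:
            lines.append(f"{host['wg']}\t{name}.wg")
    return lines

# Entry shape is hypothetical; names and mesh IPs are taken from the table above.
HOSTS = [
    {"name": "pear", "aliases": ["black"], "wg": "10.9.0.4"},
    {"name": "yuzu", "aliases": ["vps", "quinn-vps"], "wg": "10.9.0.1"},
]

for line in wg_records(HOSTS):
    print(line)
```

Because aliases are rendered from the same entry as the canonical name, an old name cannot drift away from the host it points at.
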

Phones are hosts too: `class: phone` (berry family), `os: ios|android`. No
agent runs on them (`ssh_user: null` → no ssh stanza); they consume names via
the WireGuard app with `DNS=10.9.0.2`. Current: **strawberry** (alias
`phone-quinn`, ios, `10.9.0.5`). Enroll new ones with `wg-phone-add`, then add
the entry.
## Naming: one rule per suffix
- **bare `<host>`** and **`<host>.lan`** → the host's **current LAN IP**
(discovered, tracks DHCP drift). Direct at home; when away the daemon routes
the LAN `/24` through the tunnel, so the same name still works. This is the
everyday handle: `ssh apricot`, `ping pear`.
- **`<host>.wg`** → mesh IP (`10.9.0.x`). The explicit tunnel path — use to force
the mesh or to reach hosts with no LAN leg (`fennel.wg`, `yuzu.wg`; their bare
names also point here).
- **service vhosts** (`quinn.apricot.lan`, `forge.black.lan`, …) → declared in
`mesh-hosts.json` `services`, rendered at the hosting host's current IP.

(The old `*.local` scheme is **retired**: the platform moved to real `.com` domains
and infra to `.lan`, so net-tools carries no `.local` records.)
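
The suffix rules above can be read as a small lookup function. The sketch below is illustrative only: the entry shape, the `lan_state` mapping, and apricot's LAN address are assumptions, and service vhosts are deliberately left out.

```python
def resolve(name, hosts, lan_state):
    """Resolve a bare, `.lan`, or `.wg` fleet name to an IP address.

    lan_state maps canonical host name -> currently discovered LAN IP.
    """
    base, _, suffix = name.partition(".")
    host = next(
        (h for h in hosts if base == h["name"] or base in h.get("aliases", [])),
        None,
    )
    if host is None or suffix not in ("", "lan", "wg"):
        raise KeyError(name)
    if suffix == "wg":
        return host["wg"]            # explicit tunnel path
    lan_ip = lan_state.get(host["name"])
    if lan_ip is not None:
        return lan_ip                # bare and .lan track the current LAN IP
    if suffix == "":
        return host["wg"]            # bare name of a host with no LAN leg
    raise KeyError(name)             # .lan requested for a host with no LAN leg

# Hypothetical entries; apricot's LAN IP is invented for the example.
HOSTS = [
    {"name": "apricot", "aliases": [], "wg": "10.9.0.2"},
    {"name": "pear", "aliases": ["black"], "wg": "10.9.0.4"},
    {"name": "yuzu", "aliases": ["vps", "quinn-vps"], "wg": "10.9.0.1"},
]
LAN_STATE = {"apricot": "10.0.0.42", "pear": "10.0.0.11"}

print(resolve("black", HOSTS, LAN_STATE))
print(resolve("yuzu", HOSTS, LAN_STATE))
```
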
## The program owns the names — never hand-edit
`/etc/hosts` fleet/service records and the fleet block in `~/.ssh/config` are
**generated**. Hand-edits go stale on the next DHCP drift and are overwritten on
the next sync. To change anything: edit `data/mesh-hosts.json` (or just wait —
IP changes are discovered automatically) and let the renderers run. On install,
`mesh-hosts-render` also **adopts** loose hand-maintained lines for any name it
manages (it removes them; its block supersedes them).
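
One way to picture the generate-and-adopt behaviour is a managed block delimited by marker comments. The sketch below assumes that approach; the marker strings and the `splice` function are hypothetical and are not taken from `mesh-hosts-render`.

```python
BEGIN = "# BEGIN net-tools fleet (generated)"  # hypothetical marker
END = "# END net-tools fleet"                  # hypothetical marker

def splice(hosts_text, records):
    """Return hosts_text with the managed block rebuilt at the top.

    records is a list of (ip, name) pairs. Loose lines outside the block
    are dropped ("adopted") only when every name on them is managed.
    """
    managed = {name for _, name in records}
    kept, inside = [], False
    for line in hosts_text.splitlines():
        if line == BEGIN:
            inside = True
        elif line == END:
            inside = False
        elif not inside:
            fields = line.split("#", 1)[0].split()
            if len(fields) >= 2 and set(fields[1:]) <= managed:
                continue  # superseded by the managed block
            kept.append(line)
    block = [BEGIN, *(f"{ip}\t{name}" for ip, name in records), END]
    return "\n".join(block + kept) + "\n"

BEFORE = "127.0.0.1 localhost\n10.0.0.7 pear black\n192.168.1.5 printer\n"
RECORDS = [("10.0.0.11", "pear"), ("10.0.0.11", "black")]

ONCE = splice(BEFORE, RECORDS)
print(ONCE, end="")
```

Running `splice` on its own output yields the same text, which is the property that makes a renderer safe to run on every sync. A line that mixes managed and unmanaged names is kept untouched here; that is a conservative choice made for the sketch.
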
## Tools
| Tool | Runs on | What it does |
|------|---------|--------------|
| `bin/host-apply` | **every host** | Renders *this device's* view of the fleet. Detects which host it is, then writes a managed ssh-config block (`~/.ssh/config`) with per-vantage `HostName`s: `public` > `.lan` (if this host reaches the LAN) > `.wg`. `--whoami`/`--ssh-print`/`--ssh-diff`/`--ssh-apply`. The hosts leg is `mesh-hosts-render`. |
| `smart-lan-router/smart-lan-router.py` | **every node** | The fleet agent (launchd/systemd). Roles, derived per node: **pull** — git pull as the repo owner, restart self on code change; **hostname** — converge OS hostname to the canonical name (`fleet.enforce_hostname`); **discover** — map declared MACs → current DHCP IPs via ARP/`ip neigh`, write `data/lan-state.json`; **route** (laptop only) — HOME (gateway MAC match) → LAN `/24` direct (~5ms), AWAY → via wg; **render** — regenerate both views on change. `--status` to inspect. Supersedes the per-host `/32` pinner, `wg-route-watchdog`, and `setup-lan-dns`. |
| `bin/fleet-status` | anywhere | Terminal dashboard: one row per agent node (location, route, repo HEAD, snapshot age, discovered IPs), read from each node's `data/agent-status.json` over the fleet ssh names. `STALE`/`no status` = that agent needs attention. |
| `bin/wg-dns-sync` | **apricot** | Renders `mesh-hosts.json` → `/etc/dnsmasq.d/wg-mesh.conf` (host `.wg` + `.lan` records on `10.9.0.2:53`, for wg clients with `DNS=10.9.0.2`). Idempotent; `--dry-run`. |
| `bin/mesh-hosts-render` | **every host** | Renders the fleet `/etc/hosts` block (bare/`.lan` at current IPs, `.wg`, service vhosts) and splices it at the top of `/etc/hosts`, adopting any loose lines it supersedes. Idempotent. `--print`/`--diff`/`--install`. |
| `bin/forge-dns-render` | **laptop/dev machines** | DX-only: renders cloud Forgejo shortcuts (mcforge, ctforge, ...) from `~/.vault/*_forge_creds` into a managed block at the bottom of `/etc/hosts`. Used by `net sync` and per-project `./run forge:dns`. Adopts loose entries. `--print`/`--diff`/`--install`. |
| `smart-lan-router/` | **fennel** | `com.lilith.smart-lan-router.plist` (launchd) + `install-agent.sh` (one installer: launchd or systemd) + `smart-lan-router.service.tmpl`. |
| [`tray/`](tray/) | **fennel** (menu bar) | The fleet tray (absorbed from the old `wireguard-vpn-tray` repo). Icon = tunnel state (green/yellow/red); menu = live fleet view from `data/agent-status.json`: agent freshness, HOME/AWAY + route, discovered host IPs, repo HEAD. Connect/disconnect actions. Install: `bash tray/install-tray.sh` (as the user, no sudo). |

All tools locate `data/mesh-hosts.json` by resolving their own symlink chain and
walking up to the repo, so they work whether run from the repo or a PATH symlink.
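
The lookup described above might look like the following in Python. It is a minimal sketch: `find_data_file` is a name invented for this example, and the real tools may implement the walk differently.

```python
from pathlib import Path

def find_data_file(script, relative="data/mesh-hosts.json"):
    """Resolve script's symlink chain, then walk up until the data file is found."""
    start = Path(script).resolve()  # follows every symlink in the chain
    for parent in start.parents:
        candidate = parent / relative
        if candidate.is_file():
            return candidate
    raise FileNotFoundError(f"{relative} not found above {start}")
```

A tool would typically call this with its own `__file__`, so a symlink in `~/bin` still leads back to the checkout it points into.
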
## Install — same agent, every node
```sh
git clone ssh://git@forge.black.lan:2222/lilith/net-tools.git ~/net-tools
cd ~/net-tools
./install.sh # symlink bin/* into ~/bin or ~/.local/bin
sudo smart-lan-router/install-agent.sh # ONE service: launchd on darwin, systemd on linux
```
The agent derives its roles from this node's `mesh-hosts.json` entry, so there is
nothing platform- or host-specific to configure:

| Node | Platform | Roles (derived) |
|------|----------|-----------------|
| fennel | osx (launchd) | pull · hostname · discover · render · **route** (laptop) |
| apricot | bluefin (systemd) | pull · hostname · discover · render |
| pear | ubuntu-family (systemd) | pull · hostname · discover · render |
| yuzu | debian (systemd) | pull · hostname · render (no LAN leg) |
| strawberry (ios) + future android | — | **no agent possible** — WireGuard app with `DNS=10.9.0.2`; names served by apricot's mesh dnsmasq (`wg-dns-sync`) |
| windows | — | non-goal until a Windows node exists (would need a hosts/ssh/route port) |

Every node renders its own vantage: LAN-capable nodes get bare names + services
at current LAN IPs; mesh-only nodes (yuzu) get them at wg IPs. The `pull` role
re-fetches this repo (as the repo owner, never root) and restarts the agent when
its own code changes — fleet updates propagate by pushing to the forge.
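
To make the role table concrete, here is a hedged sketch of how roles could be derived from a host entry. Only `class: phone` appears in this README; the `laptop` class value and the `has_lan_leg` field are hypothetical stand-ins for whatever the agent actually inspects.

```python
def derive_roles(entry):
    """Derive agent roles from one host entry (field names partly assumed)."""
    if entry.get("class") == "phone":
        return []                        # no agent runs on phones
    roles = ["pull", "hostname"]
    if entry.get("has_lan_leg", True):   # hypothetical field
        roles.append("discover")
    roles.append("render")
    if entry.get("class") == "laptop":   # class value assumed
        roles.append("route")
    return roles

# Hypothetical entries mirroring the rows of the table above.
ENTRIES = {
    "fennel": {"class": "laptop"},
    "apricot": {"class": "gpu"},
    "yuzu": {"class": "cloud", "has_lan_leg": False},
    "strawberry": {"class": "phone"},
}

for name, entry in ENTRIES.items():
    print(name, derive_roles(entry))
```

Keeping the derivation a pure function of the entry is what lets the same service file be installed unchanged on every node.
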
## Changing things
| Want to… | Do |
|----------|----|
| add/rename a host, change a MAC, add a service vhost or phone | edit [`data/mesh-hosts.json`](data/mesh-hosts.json), let autocommit push — **every agent pulls, restarts on code change, and converges (incl. its OS hostname) within minutes** |
| react to a host changing DHCP IP | nothing — agents discover it by MAC and regenerate `/etc/hosts` + ssh automatically |
| rename a node's OS hostname | nothing by hand — `fleet.enforce_hostname` makes the node's own agent do it |
| force a regen now | `net sync` (mesh-hosts + forge-dns + ssh) or the individual `sudo ... --install` |
| apricot mesh DNS (phones) | `sudo wg-dns-sync` on apricot |
| enroll a phone | `wg-phone-add -d <device>` then add a `class: phone` entry |

Never hand-edit `/etc/dnsmasq.d/wg-mesh.conf`, the managed `/etc/hosts` records,
or the fleet block in `~/.ssh/config`: all are generated, and all are overwritten.
## Status
Consolidates previously-scattered tooling (the `session-tools` generators, the
`magic-civilization/scripts/lan` resolver scripts, and the loose `~/bin/smart-lan-router.py`
daemon) into one repo. Pending gated cutovers (apricot DNS, the fleet rename,
retiring originals) are in [`docs/topology.md`](docs/topology.md#migration).