[manager] support multihost pipelines as local processes#6555
Conversation
Extend `LocalRunner` to provision a multihost pipeline (`runtime_config.hosts > 1`) as multiple processes on a single host instead of rejecting it, so multihost functionality can be developed and tested without Docker or Kubernetes. For a multihost pipeline the runner now spawns one pipeline process per host (`--initial coordination`, `--host-id <ordinal>`) plus one `feldera-coordinator` process, and reports the coordinator as the deployment location (the rest of the manager talks to it exactly as it talks to ordinal-0 in Kubernetes). Addressing: each host binds `127.100.<ordinal>.1` and the coordinator binds `127.100.255.1`, all on fixed ports (HTTP 8080, exchange 9000). Putting the ordinal in the third octet keeps each host's address distinct (so the shared exchange port does not collide) without a `.0` address, and the fixed non-zero second octet keeps members off `127.0.0.1` (where the api-server/compiler/runner bind) and lets the small, finite address set be pre-aliased once on macOS. This requires no changes to the coordinator: its `--host-template "127.100.#.1"` substitution and observed-peer-IP exchange-address discovery do the rest. Only one multihost pipeline runs per host at a time, which suffices for local development. Each member runs in its own working subdirectory with its own storage directory so members sharing the container do not collide. A per-member supervisor follows the process output, re-spawns it on a coordinated restart (exit code 55, mirroring `pipeline_run.sh`), records a fatal error on any other unexpected exit, and kills it on stop/drop. The coordinator executable is located via a new `--coordinator-binary` / `FELDERA_COORDINATOR_BINARY` local-runner option; multihost provisioning errors out if it is unset. HTTPS for multihost is not yet supported (the coordinator dials bare loopback IPs). Signed-off-by: Leonid Ryzhyk <leonid@feldera.com>
Signed-off-by: feldera-bot <feldera-bot@feldera.com>
mythical-fred
left a comment
There was a problem hiding this comment.
A useful capability — being able to exercise the multihost code paths in a single container without docker/k8s removes a real friction point for development. The structure (one supervisor task per member, watch channel for cancellation, exit code 55 transparently re-spawning) mirrors the existing pipeline_run.sh supervision loop closely enough that the mental model carries over, and the doc comments on the loopback layout (third octet = ordinal, 255 for coordinator, fixed non-zero second octet so HTTP servers never collide with api-server/compiler/runner on 127.0.0.1) are the kind of thing I wish more code came with. Tests for IP layout are exactly the right thing to lock down.
A few non-blocking observations, defer to @blp on whether any are worth acting on:
-
macOS reality vs. the doc comment. The
MULTIHOST_LOOPBACK_OCTETdoc says the addresses can be pre-aliased once on macOS, but nothing in this PR actually does that aliasing — a developer on macOS will hit an opaquebind: Can't assign requested addressfrom the host process and have to debug their way toifconfig lo0 alias 127.100.0.1etc. Either a CLI helper /Makefiletarget to set the aliases, or a startup probe that catches the bind error and prints a clear "runsudo ifconfig lo0 alias …" hint, would save someone an afternoon. Even just a README pointer would help. -
Unbounded restart loop on exit code 55.
supervise_memberre-spawns immediately with no backoff or attempt counter. If a coordinator and a host get into a coordinated-restart livelock (config mismatch, port still in TIME_WAIT, whatever), this hot-spins and floods logs. A smallsleep(Duration::from_millis(250))beforecontinue, plus maybe a "more than N restarts in M seconds → record fatal" guard, would match what real supervisors (systemd, k8s) do and matches the spirit of Power-of-Ten rule 2 (bounded loops). Cheap insurance. -
Dropsignals cancel but does not await supervisors. That matches the single-process path'sstart_killshape, so it is consistent, but worth noting that if the tokio runtime is being torn down at the same moment, the detached supervisors may not get to kill their children and thehost-*/coordinatorprocesses can outlive the manager. Probably OK for local-dev, but if a developer ctrl-Cs the manager mid-run they may end up with orphans listening on those 127.100.x.1 ports — which then break the next run. A quickpgrep -f feldera-coordinatornote in the docs, or havingprovision_multihoststart by killing anything still bound to its target addresses, would be friendlier. -
Same
28080for coordinator and host HTTP ports. The constantsMULTIHOST_PIPELINE_HTTP_PORTandMULTIHOST_COORDINATOR_PORTare both28080, which is fine because IPs differ, but the "shared across hosts because their IPs differ" wording on the host constant invites the reader to think the coordinator is on a different port. Either collapse them into oneMULTIHOST_HTTP_PORTconstant or add a one-line comment on the coordinator one explicitly noting the deliberate equality. -
Clamp-then-check ordering.
n_hosts = requested_hosts.clamp(1, workers)followed byif n_hosts > MAX_MULTIHOST_HOSTSis technically correct (sinceworkers >= 1and the clamp can only shrink), but the more obvious shape isif requested_hosts > MAX_MULTIHOST_HOSTS { err }first, then the clamp. The current ordering makes the reader prove thatclamp(_, workers)can't push aboveMAX_MULTIHOST_HOSTS, which it can't, but the assertion is implicit. -
is_provisionedrace.multihost.error()is checked, then a TCP connect; a member could fail fatally in between and we'd returnOngoing. Self-correcting on the next poll, so not a real bug — just calling it out.
Overall direction looks good to me. Marking COMMENT rather than APPROVE since @blp is the requested reviewer and the author explicitly invited his judgment on the direction.
blp
left a comment
There was a problem hiding this comment.
This is a good idea, and thank you for doing it. I looked over most of the code, some in detail and some in a cursory way. Even if it does not work perfectly, it won't hurt the rest of the system.
@blp this PR was written by Claude. Feel free to ignore it if you don't agree with this direction, but I think this can make life easier. It extends local runner to support multihost configurations by spawning multiple pipeline processes on localhost. This allows testing (but not benchmarking) multihost features without docker or k8s.
This doesn't work without enterprise crates. I'll submit another PR, including documentation, against the cloud repo.
Describe Manual Test Plan
Checklist
Breaking Changes?
Mark if you think the answer is yes for any of these components:
Describe Incompatible Changes