[manager] support multihost pipelines as local processes #6555

Merged: ryzhyk merged 2 commits into main from local-runner-multihost on Jun 28, 2026
@ryzhyk ryzhyk commented Jun 28, 2026

@blp this PR was written by Claude. Feel free to ignore it if you don't agree with this direction, but I think this can make life easier. It extends the local runner to support multihost configurations by spawning multiple pipeline processes on localhost. This allows testing (but not benchmarking) multihost features without Docker or k8s.

This doesn't work without enterprise crates. I'll submit another PR, including documentation, against the cloud repo.

Describe Manual Test Plan

Checklist

  • Unit tests added/updated
  • Integration tests added/updated
  • Documentation updated
  • Changelog updated

Breaking Changes?

Mark if you think the answer is yes for any of these components:

Describe Incompatible Changes

Extend `LocalRunner` to provision a multihost pipeline
(`runtime_config.hosts > 1`) as multiple processes on a single host
instead of rejecting it, so multihost functionality can be developed and
tested without Docker or Kubernetes.

For a multihost pipeline the runner now spawns one pipeline process per
host (`--initial coordination`, `--host-id <ordinal>`) plus one
`feldera-coordinator` process, and reports the coordinator as the
deployment location (the rest of the manager talks to it exactly as it
talks to ordinal-0 in Kubernetes).

Addressing: each host binds `127.100.<ordinal>.1` and the coordinator
binds `127.100.255.1`, all on fixed ports (HTTP 8080, exchange 9000).
Putting the ordinal in the third octet keeps each host's address distinct
(so the shared exchange port does not collide) without a `.0` address, and
the fixed non-zero second octet keeps members off `127.0.0.1` (where the
api-server/compiler/runner bind) and lets the small, finite address set be
pre-aliased once on macOS. This requires no changes to the coordinator:
its `--host-template "127.100.#.1"` substitution and observed-peer-IP
exchange-address discovery do the rest. Only one multihost pipeline runs
per host at a time, which suffices for local development.
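
The address layout above can be sketched as follows. This is only an illustration of the scheme described in the commit message, not the runner's actual code: the constant and function names (`LOOPBACK_SECOND_OCTET`, `COORDINATOR_ORDINAL`, `host_addr`, `coordinator_addr`, `expand_template`) are hypothetical, and the `#` substitution is assumed to be a plain string replacement.

```rust
use std::net::Ipv4Addr;

// Hypothetical constants mirroring the layout described above; the real
// names in the runner may differ.
const LOOPBACK_SECOND_OCTET: u8 = 100;
const COORDINATOR_ORDINAL: u8 = 255;

/// Address of the host with the given ordinal: 127.100.<ordinal>.1.
/// Returns `None` for the ordinal reserved for the coordinator.
fn host_addr(ordinal: u8) -> Option<Ipv4Addr> {
    if ordinal == COORDINATOR_ORDINAL {
        return None;
    }
    Some(Ipv4Addr::new(127, LOOPBACK_SECOND_OCTET, ordinal, 1))
}

/// Address of the coordinator: 127.100.255.1.
fn coordinator_addr() -> Ipv4Addr {
    Ipv4Addr::new(127, LOOPBACK_SECOND_OCTET, COORDINATOR_ORDINAL, 1)
}

/// Expand a host template such as "127.100.#.1" by replacing `#` with the
/// ordinal (assumed here to be a simple textual substitution).
fn expand_template(template: &str, ordinal: u8) -> String {
    template.replace('#', &ordinal.to_string())
}
```

Under these assumptions, `expand_template("127.100.#.1", n)` and `host_addr(n)` agree for every non-reserved ordinal, which is why the coordinator would need no changes: it only ever sees the template.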

Each member runs in its own working subdirectory with its own storage
directory so members sharing the container do not collide. A per-member
supervisor follows the process output, re-spawns it on a coordinated
restart (exit code 55, mirroring `pipeline_run.sh`), records a fatal error
on any other unexpected exit, and kills it on stop/drop.
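
A minimal sketch of the per-member decision described above, written as a pure function so it can be read in isolation. The names `SupervisorAction` and `on_member_exit` are hypothetical, and treating every exit without a stop request (including exit code 0 and termination by a signal) as unexpected is an assumption; the PR's supervisor may draw the line differently.

```rust
/// Exit code that signals a coordinated restart (55, per the description above).
const COORDINATED_RESTART_EXIT_CODE: i32 = 55;

/// What the supervisor does after a member process exits (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
enum SupervisorAction {
    /// Spawn the member again.
    Respawn,
    /// The exit was requested by stop/drop; nothing more to do.
    Stopped,
    /// Record a fatal error with the given message.
    Fatal(String),
}

/// Decide how to react to a member exit. `exit_code` follows the shape of
/// `std::process::ExitStatus::code()`: `None` means killed by a signal.
fn on_member_exit(name: &str, exit_code: Option<i32>, stop_requested: bool) -> SupervisorAction {
    if stop_requested {
        return SupervisorAction::Stopped;
    }
    match exit_code {
        Some(COORDINATED_RESTART_EXIT_CODE) => SupervisorAction::Respawn,
        Some(code) => {
            SupervisorAction::Fatal(format!("{name} exited unexpectedly with code {code}"))
        }
        None => SupervisorAction::Fatal(format!("{name} was terminated by a signal")),
    }
}
```

Keeping the decision separate from process handling makes the three outcomes (re-spawn, fatal error, clean stop) easy to unit-test without spawning anything.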

The coordinator executable is located via a new
`--coordinator-binary` / `FELDERA_COORDINATOR_BINARY` local-runner option;
multihost provisioning errors out if it is unset. HTTPS for multihost is
not yet supported (the coordinator dials bare loopback IPs).

Signed-off-by: Leonid Ryzhyk <leonid@feldera.com>
@ryzhyk ryzhyk requested a review from blp June 28, 2026 07:23
@ryzhyk ryzhyk added Pipeline manager Pipeline manager (API, API server, runner, compiler server) multihost Related to multihost or distributed pipelines labels Jun 28, 2026
Signed-off-by: feldera-bot <feldera-bot@feldera.com>

@mythical-fred mythical-fred left a comment

A useful capability — being able to exercise the multihost code paths in a single container without docker/k8s removes a real friction point for development. The structure (one supervisor task per member, watch channel for cancellation, exit code 55 transparently re-spawning) mirrors the existing pipeline_run.sh supervision loop closely enough that the mental model carries over, and the doc comments on the loopback layout (third octet = ordinal, 255 for coordinator, fixed non-zero second octet so HTTP servers never collide with api-server/compiler/runner on 127.0.0.1) are the kind of thing I wish more code came with. Tests for IP layout are exactly the right thing to lock down.

A few non-blocking observations, defer to @blp on whether any are worth acting on:

  1. macOS reality vs. the doc comment. The `MULTIHOST_LOOPBACK_OCTET` doc says the addresses can be pre-aliased once on macOS, but nothing in this PR actually does that aliasing. A developer on macOS will hit an opaque `bind: Can't assign requested address` from the host process and have to debug their way to `ifconfig lo0 alias 127.100.0.1` etc. Either a CLI helper / Makefile target to set the aliases, or a startup probe that catches the bind error and prints a clear "run `sudo ifconfig lo0 alias …`" hint, would save someone an afternoon. Even just a README pointer would help.

  2. Unbounded restart loop on exit code 55. `supervise_member` re-spawns immediately with no backoff or attempt counter. If a coordinator and a host get into a coordinated-restart livelock (config mismatch, port still in `TIME_WAIT`, whatever), this hot-spins and floods logs. A small `sleep(Duration::from_millis(250))` before `continue`, plus maybe a "more than N restarts in M seconds → record fatal" guard, would match what real supervisors (systemd, k8s) do and is in the spirit of Power-of-Ten rule 2 (bounded loops). Cheap insurance.

  3. `Drop` signals cancellation but does not await the supervisors. That matches the single-process path's `start_kill` shape, so it is consistent, but worth noting that if the tokio runtime is being torn down at the same moment, the detached supervisors may not get to kill their children, and the `host-*` / coordinator processes can outlive the manager. Probably OK for local dev, but if a developer ctrl-Cs the manager mid-run they may end up with orphans listening on those `127.100.x.1` ports, which then break the next run. A quick `pgrep -f feldera-coordinator` note in the docs, or having `provision_multihost` start by killing anything still bound to its target addresses, would be friendlier.

  4. Same 28080 for coordinator and host HTTP ports. The constants `MULTIHOST_PIPELINE_HTTP_PORT` and `MULTIHOST_COORDINATOR_PORT` are both 28080, which is fine because IPs differ, but the "shared across hosts because their IPs differ" wording on the host constant invites the reader to think the coordinator is on a different port. Either collapse them into one `MULTIHOST_HTTP_PORT` constant or add a one-line comment on the coordinator one explicitly noting the deliberate equality.

  5. Clamp-then-check ordering. `n_hosts = requested_hosts.clamp(1, workers)` followed by `if n_hosts > MAX_MULTIHOST_HOSTS` is technically correct (since `workers >= 1` and the clamp can only shrink), but the more obvious shape is `if requested_hosts > MAX_MULTIHOST_HOSTS { err }` first, then the clamp. The current ordering makes the reader prove that `clamp(_, workers)` can't push above `MAX_MULTIHOST_HOSTS`, which it can't, but the assertion is implicit.

  6. `is_provisioned` race. `multihost.error()` is checked, then a TCP connect; a member could fail fatally in between and we'd return `Ongoing`. Self-correcting on the next poll, so not a real bug, just calling it out.
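
The "more than N restarts in M seconds" guard suggested in item 2 could look roughly like the sliding-window limiter below. This is a sketch under assumptions, not code from the PR: `RestartBudget` and `try_restart` are hypothetical names, and the limits are whatever the caller chooses. The supervisor would call `try_restart` before each re-spawn and record a fatal error when it returns `false`.

```rust
use std::collections::VecDeque;
use std::time::{Duration, Instant};

/// Sliding-window restart limiter (hypothetical helper): allows at most
/// `max_restarts` restarts within any `window`-long interval.
struct RestartBudget {
    max_restarts: usize,
    window: Duration,
    recent: VecDeque<Instant>,
}

impl RestartBudget {
    fn new(max_restarts: usize, window: Duration) -> Self {
        Self { max_restarts, window, recent: VecDeque::new() }
    }

    /// Record a restart attempt at `now`. Returns `false` when the budget is
    /// exhausted, in which case the caller should stop re-spawning.
    fn try_restart(&mut self, now: Instant) -> bool {
        // Forget restarts that have aged out of the window.
        while let Some(&oldest) = self.recent.front() {
            if now.duration_since(oldest) >= self.window {
                self.recent.pop_front();
            } else {
                break;
            }
        }
        if self.recent.len() >= self.max_restarts {
            return false;
        }
        self.recent.push_back(now);
        true
    }
}
```

Passing `now` in explicitly, rather than calling `Instant::now()` inside, keeps the limiter deterministic in tests; a short sleep before each re-spawn, as the review suggests, would be a separate and complementary measure.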

Overall direction looks good to me. Marking COMMENT rather than APPROVE since @blp is the requested reviewer and the author explicitly invited his judgment on the direction.

@blp blp left a comment

This is a good idea, and thank you for doing it. I looked over most of the code, some in detail and some in a cursory way. Even if it does not work perfectly, it won't hurt the rest of the system.

@ryzhyk ryzhyk added this pull request to the merge queue Jun 28, 2026
Merged via the queue into main with commit 74bf013 Jun 28, 2026
1 check passed
@ryzhyk ryzhyk deleted the local-runner-multihost branch June 28, 2026 16:59
Labels

multihost Related to multihost or distributed pipelines Pipeline manager Pipeline manager (API, API server, runner, compiler server)

4 participants