[manager] support multihost pipelines as local processes by ryzhyk · Pull Request #6555 · feldera/feldera

ryzhyk · 2026-06-28T07:23:35Z

@blp this PR was written by Claude. Feel free to ignore it if you don't agree with this direction, but I think this can make life easier. It extends local runner to support multihost configurations by spawning multiple pipeline processes on localhost. This allows testing (but not benchmarking) multihost features without docker or k8s.

This doesn't work without enterprise crates. I'll submit another PR, including documentation, against the cloud repo.

Describe Manual Test Plan

Checklist

Unit tests added/updated
Integration tests added/updated
Documentation updated
Changelog updated

Breaking Changes?

Mark if you think the answer is yes for any of these components:

OpenAPI / REST HTTP API / feldera-types / manager (What is a breaking change?)
Feldera SQL (Syntax, Semantics)
feldera-sqllib (incl. dependencies fxp, etc.) (What is a breaking change?)
Python SDK (What is a breaking change?)
fda (CLI arguments)
Adapters (including configuration)
Storage Format / Checkpoints
Others (specify)

Describe Incompatible Changes

Extend `LocalRunner` to provision a multihost pipeline (`runtime_config.hosts > 1`) as multiple processes on a single host instead of rejecting it, so multihost functionality can be developed and tested without Docker or Kubernetes. For a multihost pipeline the runner now spawns one pipeline process per host (`--initial coordination`, `--host-id <ordinal>`) plus one `feldera-coordinator` process, and reports the coordinator as the deployment location (the rest of the manager talks to it exactly as it talks to ordinal-0 in Kubernetes). Addressing: each host binds `127.100.<ordinal>.1` and the coordinator binds `127.100.255.1`, all on fixed ports (HTTP 8080, exchange 9000). Putting the ordinal in the third octet keeps each host's address distinct (so the shared exchange port does not collide) without a `.0` address, and the fixed non-zero second octet keeps members off `127.0.0.1` (where the api-server/compiler/runner bind) and lets the small, finite address set be pre-aliased once on macOS. This requires no changes to the coordinator: its `--host-template "127.100.#.1"` substitution and observed-peer-IP exchange-address discovery do the rest. Only one multihost pipeline runs per host at a time, which suffices for local development. Each member runs in its own working subdirectory with its own storage directory so members sharing the container do not collide. A per-member supervisor follows the process output, re-spawns it on a coordinated restart (exit code 55, mirroring `pipeline_run.sh`), records a fatal error on any other unexpected exit, and kills it on stop/drop. The coordinator executable is located via a new `--coordinator-binary` / `FELDERA_COORDINATOR_BINARY` local-runner option; multihost provisioning errors out if it is unset. HTTPS for multihost is not yet supported (the coordinator dials bare loopback IPs). Signed-off-by: Leonid Ryzhyk <leonid@feldera.com>

Signed-off-by: feldera-bot <feldera-bot@feldera.com>

mythical-fred

A useful capability — being able to exercise the multihost code paths in a single container without docker/k8s removes a real friction point for development. The structure (one supervisor task per member, watch channel for cancellation, exit code 55 transparently re-spawning) mirrors the existing pipeline_run.sh supervision loop closely enough that the mental model carries over, and the doc comments on the loopback layout (third octet = ordinal, 255 for coordinator, fixed non-zero second octet so HTTP servers never collide with api-server/compiler/runner on 127.0.0.1) are the kind of thing I wish more code came with. Tests for IP layout are exactly the right thing to lock down.

A few non-blocking observations, defer to @blp on whether any are worth acting on:

macOS reality vs. the doc comment. The MULTIHOST_LOOPBACK_OCTET doc says the addresses can be pre-aliased once on macOS, but nothing in this PR actually does that aliasing — a developer on macOS will hit an opaque bind: Can't assign requested address from the host process and have to debug their way to ifconfig lo0 alias 127.100.0.1 etc. Either a CLI helper / Makefile target to set the aliases, or a startup probe that catches the bind error and prints a clear "run sudo ifconfig lo0 alias …" hint, would save someone an afternoon. Even just a README pointer would help.
Unbounded restart loop on exit code 55. supervise_member re-spawns immediately with no backoff or attempt counter. If a coordinator and a host get into a coordinated-restart livelock (config mismatch, port still in TIME_WAIT, whatever), this hot-spins and floods logs. A small sleep(Duration::from_millis(250)) before continue, plus maybe a "more than N restarts in M seconds → record fatal" guard, would match what real supervisors (systemd, k8s) do and matches the spirit of Power-of-Ten rule 2 (bounded loops). Cheap insurance.
Drop signals cancel but does not await supervisors. That matches the single-process path's start_kill shape, so it is consistent, but worth noting that if the tokio runtime is being torn down at the same moment, the detached supervisors may not get to kill their children and the host-* / coordinator processes can outlive the manager. Probably OK for local-dev, but if a developer ctrl-Cs the manager mid-run they may end up with orphans listening on those 127.100.x.1 ports — which then break the next run. A quick pgrep -f feldera-coordinator note in the docs, or having provision_multihost start by killing anything still bound to its target addresses, would be friendlier.
Same 28080 for coordinator and host HTTP ports. The constants MULTIHOST_PIPELINE_HTTP_PORT and MULTIHOST_COORDINATOR_PORT are both 28080, which is fine because IPs differ, but the "shared across hosts because their IPs differ" wording on the host constant invites the reader to think the coordinator is on a different port. Either collapse them into one MULTIHOST_HTTP_PORT constant or add a one-line comment on the coordinator one explicitly noting the deliberate equality.
Clamp-then-check ordering. n_hosts = requested_hosts.clamp(1, workers) followed by if n_hosts > MAX_MULTIHOST_HOSTS is technically correct (since workers >= 1 and the clamp can only shrink), but the more obvious shape is if requested_hosts > MAX_MULTIHOST_HOSTS { err } first, then the clamp. The current ordering makes the reader prove that clamp(_, workers) can't push above MAX_MULTIHOST_HOSTS, which it can't, but the assertion is implicit.
is_provisioned race. multihost.error() is checked, then a TCP connect; a member could fail fatally in between and we'd return Ongoing. Self-correcting on the next poll, so not a real bug — just calling it out.

Overall direction looks good to me. Marking COMMENT rather than APPROVE since @blp is the requested reviewer and the author explicitly invited his judgment on the direction.

blp

This is a good idea, and thank you for doing it. I looked over most of the code, some in detail and some in a cursory way. Even if it does not work perfectly, it won't hurt the rest of the system.

ryzhyk requested a review from blp June 28, 2026 07:23

ryzhyk added Pipeline manager Pipeline manager (API, API server, runner, compiler server) multihost Related to multihost or distributed pipelines labels Jun 28, 2026

[ci] apply automatic fixes

2d5e8e3

Signed-off-by: feldera-bot <feldera-bot@feldera.com>

mythical-fred reviewed Jun 28, 2026

View reviewed changes

blp approved these changes Jun 28, 2026

View reviewed changes

ryzhyk added this pull request to the merge queue Jun 28, 2026

Merged via the queue into main with commit 74bf013 Jun 28, 2026
1 check passed

ryzhyk deleted the local-runner-multihost branch June 28, 2026 16:59

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[manager] support multihost pipelines as local processes#6555

[manager] support multihost pipelines as local processes#6555
ryzhyk merged 2 commits into
mainfrom
local-runner-multihost

ryzhyk commented Jun 28, 2026

Uh oh!

mythical-fred left a comment

Uh oh!

blp left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Uh oh!

Conversation

ryzhyk commented Jun 28, 2026

Describe Manual Test Plan

Checklist

Breaking Changes?

Describe Incompatible Changes

Uh oh!

mythical-fred left a comment

Choose a reason for hiding this comment

Uh oh!

blp left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants