libnetwork: ref-count service aliases for VIP DNS records by firecow · Pull Request #52236 · moby/moby

firecow · 2026-03-26T12:05:09Z

- What I did

Fixed stale VIP DNS records for network aliases persisting after service updates in Docker Swarm.

- How I did it

Each container has its own service config, so there is no single canonical alias list per service while a rolling update is in flight. Aliases are now ref-counted at the service level (service.aliasRefs):

addServiceBinding increments the ref count for each alias the task brings, and adds the VIP DNS record on a 0→1 transition.
rmServiceBinding decrements on full removal, and deletes the VIP DNS record on a 1→0 transition.
Each backend stores its own alias list (lbBackend.aliases) so the decrement uses the task's actual claim, even from cleanupServiceBindings.

The VIP-alias add/remove paths in addEndpointNameResolution / deleteEndpointNameResolution are removed; the svcName VIP record stays where it is. addServiceBinding also diffs against any prior aliases stored on the same endpoint, so a re-bind (disable→enable, or alias-set change for the same eID) only adjusts ref counts by the delta.

The previous version of this PR removed stale aliases as soon as the first new task in a rolling update arrived, which would have broken old tasks still serving on the old alias. Ref-counting keeps the alias registered until the last task referencing it is gone.

- How to verify it

# Step 1: deploy with alias
services:
  echo-server:
    deploy: { replicas: 2, endpoint_mode: vip }
    image: jmalloc/echo-server
    networks:
      default: { aliases: [testalias] }

# Step 2: redeploy without alias
services:
  echo-server:
    deploy: { replicas: 2, endpoint_mode: vip }
    image: jmalloc/echo-server

Verified end-to-end on a single-node swarm:

Scenario	`testalias`	`tasks.testalias`
After deploy with alias	resolves to VIP	resolves to backend IPs
Mid rolling-update removing alias (one old task still up)	still resolves to VIP	resolves only to old task IP
After rolling-update removing alias	NXDOMAIN/SERVFAIL	NXDOMAIN
After rolling-update re-adding alias	resolves to VIP	resolves to new backend IPs

- Human readable description for the release notes

Fix stale VIP DNS records for swarm service network aliases not being removed during rolling updates.

corhere

Great find! Though I am not confident the fix is quite so straightforward. For better or worse, there is a distinct service config for every container. So I don't think it would be correct to assume that the service config for the most recently created container is the canonical config for all containers referencing the same service ID. With this change as written, the old aliases would be deleted once the first task in the rolling update is started, while old tasks are still running that expect the old alias to still exist. New aliases should be added at the start of a rolling update and the old aliases deleted at the end. I'm afraid the aliases for a service may have to be ref-counted.

firecow · 2026-04-29T08:54:11Z

@corhere thanks for catching this — you were right. Force-pushed a rewrite that ref-counts aliases at the service level instead of treating any one task's alias list as canonical:

service.aliasRefs map[string]int tracks how many tasks claim each alias.
VIP DNS record is added on the 0→1 transition (first task with the alias) and removed on the 1→0 transition (last task referencing it leaves).
Each backend stores its own alias slice on lbBackend.aliases, so rmServiceBinding (and cleanupServiceBindings) decrement by the task's actual claim instead of a shared s.aliases.
addServiceBinding diffs against any prior aliases for the same eID, so a re-bind (disable→enable, or alias set change) only adjusts the ref counts by the delta.

Verified the mid-flight case end-to-end: during a rolling update that removes an alias, the alias still resolves while at least one old task still references it, and only disappears once the last old task is gone — which is the property the previous version was missing.

Re-requesting review.

corhere

Thank you! At a glance this change looks sane but I'll need to trace deeper through libnetwork before I can be confident in the correctness of this PR.

Copilot

Pull request overview

This PR fixes stale VIP DNS records for Swarm service network aliases during rolling updates by ref-counting aliases so VIP alias records persist until the last task using an alias is gone.

Changes:

Move VIP-alias DNS record management out of addEndpointNameResolution / deleteEndpointNameResolution and into service-binding lifecycle.
Introduce alias ref-counting (service.aliasRefs) and per-backend alias tracking (lbBackend.aliases) to correctly add/remove VIP alias records on 0→1 / 1→0 transitions.
Update cleanup flow to pass backend-specific alias lists so cleanupServiceBindings decrements the correct aliases.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File	Description
daemon/libnetwork/service_common.go	Adds alias ref-counting logic, per-backend alias storage, and centralizes VIP-alias record add/remove in service binding paths.
daemon/libnetwork/service.go	Replaces `service.aliases` with `service.aliasRefs` and adds `lbBackend.aliases` to support accurate decrements.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

firecow · 2026-04-30T11:48:37Z

@corhere Force-pushed: per-LB aliasRefs (per-network, not per-service), add bool helper inlined into the diff loops, your loop-shape and slices.Clone suggestions applied, and added TestServiceAliasRefCounting covering rolling-update persistence, per-network scoping, and same-eID rebind.

corhere

I am satisfied that this change is logically sound. I just have some nits to pick with commentary and tests.

If you ever need to resolve merge conflicts or otherwise sync with the base branch of a PR, please rebase and force-push instead of merging the base branch into your PR branch. We want to keep a clean Git history.

corhere · 2026-05-11T21:33:04Z

+	// Keyed per-network because Swarm services can attach to multiple
+	// networks with different alias sets.


What do you mean by "keyed per-network" here?

corhere · 2026-05-11T21:43:56Z

+//  2. Aliases are scoped per-network: configuring an alias on network A
+//     does not publish a VIP DNS record for it on network B, even though
+//     both networks belong to the same service.


Networks do not "belong to" services.

Suggested change

// 2. Aliases are scoped per-network: configuring an alias on network A

// does not publish a VIP DNS record for it on network B, even though

// both networks belong to the same service.

// 2. Aliases are scoped per-network: configuring an alias on network A

// does not publish a VIP DNS record for it on network B, or vice

// versa.

corhere · 2026-05-11T21:50:44Z

+		// Old task with alias "old".
+		assert.NilError(t, c.addServiceBinding(svcName, svcID, n1.ID(), "ep-old", "ctr-old",
+			vip1, nil, []string{"old"}, nil, net.ParseIP("172.20.0.10"), "test"))
+		assert.Check(t, is.Equal(true, resolves(t, n1, "old")), "alias 'old' should resolve after first task")


Suggested change

assert.Check(t, is.Equal(true, resolves(t, n1, "old")), "alias 'old' should resolve after first task")

assert.Check(t, resolves(t, n1, "old"), "alias 'old' should resolve after first task")

It's shorter to type, and yields more helpful failure messages automatically. Save the Equal assertions for comparing non-boolean values.

During rolling updates, VIP DNS records for service aliases were never cleaned up when aliases were removed from the service spec. VIP alias records were only added when the first backend joined a network (addService=true) and only removed when the last backend left (rmService=true). During rolling updates there is always at least one backend, so neither condition triggered and stale aliases persisted. Each container has its own service config, so there is no single canonical alias list per service while a rolling update is in flight. Instead, ref-count aliases per (network, alias) on each loadBalancer: add the VIP DNS record on the 0->1 transition (first task on that network claiming the alias) and remove it on the 1->0 transition (last task referencing it has left). Ref counts are per-network because a Swarm service can attach to multiple networks with different alias sets, and a VIP DNS record only makes sense on the network that actually has the alias configured. Aliases now survive a rolling update for as long as any task -- old or new -- still references them on that network, and new aliases register as soon as the first task carrying them starts. Signed-off-by: Mads Jon Nielsen <madsjon@gmail.com>

firecow · 2026-05-12T12:39:39Z

@corhere Addressed your nits and rebased onto current master (was previously sync'd via merge — apologies, won't do that again):

Reworded the aliasRefs comment on loadBalancer to be explicit: the map is keyed by alias, and the per-network scoping comes from loadBalancer being one-per-network (since a service can configure different alias sets per network).
Test comment: dropped the "belong to the same service" phrasing per your suggestion.
Test assertions: swept assert.Check(t, is.Equal(true/false, resolves(...))) → assert.Check(t, resolves(...)) / assert.Check(t, !resolves(...)) across the whole test, and removed the now-unused is import.

No code changes, just commentary and tests. Re-requesting review.

corhere

LGTM! Thank you so much for fixing this bug!

vvoland

Thanks, LGTM!

github-actions Bot added the area/daemon Core Engine label Mar 26, 2026

vvoland added area/networking Networking area/swarm labels Mar 26, 2026

vvoland requested a review from corhere March 26, 2026 15:36

vvoland assigned corhere Mar 26, 2026

corhere reviewed Apr 17, 2026

View reviewed changes

Comment thread daemon/libnetwork/service_common.go Outdated

Comment thread daemon/libnetwork/service_common.go Outdated

firecow force-pushed the fix-stale-network-alias-vip-dns branch 3 times, most recently from 27e010b to 3045a0c Compare April 29, 2026 08:53

firecow changed the title ~~libnetwork: reconcile VIP DNS when service aliases change~~ libnetwork: ref-count service aliases for VIP DNS records Apr 29, 2026

corhere added this to the 29.4.2 milestone Apr 29, 2026

corhere added impact/changelog area/networking/dns Networking kind/bugfix PR's that fix bugs labels Apr 29, 2026

corhere modified the milestones: 29.4.2, 29.5.0 Apr 29, 2026

corhere reviewed Apr 29, 2026

View reviewed changes

Comment thread daemon/libnetwork/service_common.go Outdated

Comment thread daemon/libnetwork/service_common.go Outdated

Comment thread daemon/libnetwork/service_common.go Outdated

corhere requested review from akerouanton, Copilot and robmry April 29, 2026 22:39

Copilot started reviewing on behalf of corhere April 29, 2026 22:39 View session

Copilot AI reviewed Apr 29, 2026

View reviewed changes

Comment thread daemon/libnetwork/service_common.go Outdated

Comment thread daemon/libnetwork/service_common.go Outdated

firecow force-pushed the fix-stale-network-alias-vip-dns branch from 3045a0c to c09bf96 Compare April 30, 2026 08:36

corhere force-pushed the fix-stale-network-alias-vip-dns branch from 634a509 to c09bf96 Compare May 11, 2026 21:16

corhere reviewed May 11, 2026

View reviewed changes

firecow force-pushed the fix-stale-network-alias-vip-dns branch from c09bf96 to 1143f41 Compare May 12, 2026 12:39

corhere approved these changes May 12, 2026

View reviewed changes

corhere requested a review from thaJeztah May 12, 2026 18:27

vvoland approved these changes May 13, 2026

View reviewed changes

vvoland merged commit 2cb4bbd into moby:master May 13, 2026
187 of 188 checks passed

		// Keyed per-network because Swarm services can attach to multiple
		// networks with different alias sets.

	assert.Check(t, is.Equal(true, resolves(t, n1, "old")), "alias 'old' should resolve after first task")
	assert.Check(t, resolves(t, n1, "old"), "alias 'old' should resolve after first task")

Conversation

firecow commented Mar 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

corhere left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

firecow commented Apr 29, 2026

Uh oh!

corhere left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

firecow commented Apr 30, 2026

Uh oh!

corhere left a comment

Choose a reason for hiding this comment

Uh oh!

corhere May 11, 2026

Choose a reason for hiding this comment

Uh oh!

corhere May 11, 2026

Choose a reason for hiding this comment

Uh oh!

corhere May 11, 2026

Choose a reason for hiding this comment

Uh oh!

firecow commented May 12, 2026

Uh oh!

corhere left a comment

Choose a reason for hiding this comment

Uh oh!

vvoland left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

firecow commented Mar 26, 2026 •

edited

Loading