Skip to content

feat: add single tailnet support to pgcoord#9351

Merged
coadler merged 24 commits into
mainfrom
colin/single-pgcoord
Sep 21, 2023
Merged

feat: add single tailnet support to pgcoord#9351
coadler merged 24 commits into
mainfrom
colin/single-pgcoord

Conversation

@coadler

@coadler coadler commented Aug 25, 2023

Copy link
Copy Markdown
Contributor

This adds support for the single tailnet expieriment to pgcoord. Afer this there aren't any major blockers for making it GA.

@coadler coadler self-assigned this Aug 25, 2023
@coadler coadler force-pushed the colin/single-pgcoord branch from 20a5183 to 4a4bbdb Compare August 25, 2023 22:47
@coadler coadler marked this pull request as ready for review August 25, 2023 23:25
@coadler coadler requested a review from spikecurtis August 28, 2023 02:17
Comment thread coderd/database/migrations/000152_pg_coordinator_single_tailnet.up.sql Outdated
Comment thread coderd/database/dbtestutil/db.go Outdated
db = database.New(sqlDB)

err = migrations.Up(sqlDB)
require.NoError(t, err)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if this is required, how were our tests working before?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is only if you specify a database URL to run tests against. I needed to run a standalone postgres to see the logs and previously it wouldn't run the migrations. I can prob remove this, I'm not sure if it has any greater effects.

Comment thread enterprise/tailnet/pgcoord.go Outdated
Comment thread enterprise/tailnet/pgcoord.go Outdated
Comment thread enterprise/tailnet/pgcoord.go Outdated
Comment thread coderd/database/migrations/000152_pg_coordinator_single_tailnet.up.sql Outdated
Comment thread enterprise/tailnet/pgcoord.go Outdated
Comment thread enterprise/tailnet/pgcoord_test.go Outdated
Comment thread enterprise/tailnet/pgcoord_test.go Outdated
require.NoError(t, err)

assertEventuallyNoClientsForAgent(ctx, t, store, agent1.id)
assertEventuallyNoAgents(ctx, t, store, agent1.id)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some additional test cases I'd like to see:

  1. MultiAgent with multiple coordinators
  2. MultiAgent UpdateSelf first, then SubscribeAgent --- verify the agent gets the update.
  3. UnsubscribeAgent --- verify that the agent no longer gets updates about the client and vice versa
  4. Subscribe to multiple agents

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll work on implementing these after the rest is solid 👍

@github-actions github-actions Bot added the stale This issue is like stale bread. label Sep 7, 2023
@coadler coadler force-pushed the colin/single-pgcoord branch from 1cd56e6 to a7b39df Compare September 7, 2023 19:20
@matifali matifali removed the stale This issue is like stale bread. label Sep 7, 2023
@coadler coadler requested a review from spikecurtis September 8, 2023 00:33
Comment thread enterprise/tailnet/pgcoord.go Outdated
delete(q.conns, c)

if _, ok := q.clientSubscriptions[c.UniqueID()]; !ok {
q.clientSubscriptions[c.UniqueID()] = map[uuid.UUID]struct{}{}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a memory leak: you add to q.clientSubscriptions but never remove.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You still never delete anything from q.clientSubscriptions

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whoops, I think I got this now.

Comment thread coderd/database/migrations/000155_pg_coordinator_single_tailnet.up.sql Outdated
Comment thread enterprise/tailnet/pgcoord.go Outdated
Comment thread enterprise/tailnet/pgcoord.go Outdated
Comment thread coderd/database/migrations/000155_pg_coordinator_single_tailnet.up.sql Outdated
@coadler coadler requested a review from spikecurtis September 16, 2023 01:49
Comment thread enterprise/tailnet/pgcoord.go Outdated
Comment thread enterprise/tailnet/pgcoord.go Outdated
delete(q.conns, c)

if _, ok := q.clientSubscriptions[c.UniqueID()]; !ok {
q.clientSubscriptions[c.UniqueID()] = map[uuid.UUID]struct{}{}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You still never delete anything from q.clientSubscriptions

Comment thread enterprise/tailnet/pgcoord.go
Comment thread enterprise/tailnet/pgcoord.go Outdated
kind: c.Kind(),
}
cm, ok := q.mappers[mk]
if ok {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the fact that you're checking ok here smells funny and indicates that we're not keeping the data structures in sync.

It should be an invariant in this code that a client subscription in the q.clientSubscriptions map implies that a corresponding mapper exists. They are always added together and always removed together while holding the mutex.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How do you think we should handle a case where the mapper doesn't exist. Panic?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, panic, or at the very least, drop a Critical log

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found a pretty easy way to trigger this invariant. It's possible for (*querier).cleanupConn to race with a call to removeClientSubscription, such that removeClientSubscription is trying to remove a subscription that no longer exists. I'm not too sure how to prevent this, except by ignoring duplicates.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The invariant is that a client subscription exists in q.clientSubscriptions AND in the mappers, or neither. It can never be in one but not the other when the mutex is unlocked.

You're right though, that in this design, it's possible for removeClientSubscription to race with cleanupConn such that you can try to remove a subscription that was already cleaned up. What you've done is fine and follows the invariant --- if the subscription is not in q.clientSubscriptions that means it's already cleaned up from the mappers too, and removeClientSubscription can return early.

Comment thread enterprise/tailnet/pgcoord.go Outdated
@coadler coadler requested a review from spikecurtis September 20, 2023 04:14
Comment thread enterprise/tailnet/pgcoord.go
Comment thread enterprise/tailnet/pgcoord.go Outdated
if sub.active {
q.newClientSubscription(sub.q, sub.agentID)
} else {
q.removeClientSubscription(sub.q, sub.agentID)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you send a subscription with agentID unset on remove. The subscriber handles this case, but then it forwards it to the querier which I don't think handles the case.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's technically handled by checking the clientSubscriptions map, since there will never be an agent subscription with uuid.Nil. I had an explicit check but just replaced it with the map lookup, since it essentially does the same thing. Might be good to comment.

Comment thread enterprise/tailnet/pgcoord_test.go
Comment thread enterprise/tailnet/multiagent_test.go Outdated
@coadler coadler requested a review from spikecurtis September 20, 2023 15:39

@spikecurtis spikecurtis left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@coadler coadler merged commit c900b5f into main Sep 21, 2023
@coadler coadler deleted the colin/single-pgcoord branch September 21, 2023 19:30
@github-actions github-actions Bot locked and limited conversation to collaborators Sep 21, 2023
@coadler coadler linked an issue Sep 21, 2023 that may be closed by this pull request
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Single Tailnet: Implement for HA coordinator

3 participants