Skip to content

feat: add a dependency management graph for agents#20208

Merged
SasSwart merged 31 commits into
mainfrom
jjs/coder-10352
Oct 24, 2025
Merged

feat: add a dependency management graph for agents#20208
SasSwart merged 31 commits into
mainfrom
jjs/coder-10352

Conversation

@SasSwart
Copy link
Copy Markdown
Contributor

@SasSwart SasSwart commented Oct 8, 2025

Relates to coder/internal#1093

This is the first of N pull requests to allow coder script ordering.
It introduces what is for now dead code, but paves the way for various interfaces that allow coder scripts and other processes to depend on one another via CLI commands and terraform configurations.

The next step is to add reactivity to the graph, such that changes in the status of one vertex will propagate and allow other vertices to change their own statuses.

Concurrency and stress testing yield the following:

CPU Profile:
Screenshot 2025-10-17 at 10 38 52

Mem Profile:
Screenshot 2025-10-17 at 10 38 01

Predictably, lock contention and memory allocation are the largest components of this system under stress. Nothing seems untoward.

@SasSwart SasSwart changed the title add in memory lock coordinator for script ordering use feat: allow orchestration and ordering of coder scripts Oct 8, 2025
@github-actions github-actions Bot added the stale This issue is like stale bread. label Oct 16, 2025
@SasSwart SasSwart removed the stale This issue is like stale bread. label Oct 16, 2025
@SasSwart SasSwart changed the title feat: allow orchestration and ordering of coder scripts feat: add a dependency management graph for agents Oct 16, 2025
@SasSwart SasSwart marked this pull request as ready for review October 16, 2025 08:45
@SasSwart SasSwart requested review from johnstcn and mafredri October 16, 2025 08:47
Comment thread agent/unit/graph.go
Comment thread agent/unit/unit_test.go Outdated
Comment thread agent/unit/unit_test.go Outdated
Copy link
Copy Markdown
Member

@mafredri mafredri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice documentation and starting point. 👍🏻

The DependencyCoordinator seems premature as it's currently just a thin wrapper that exposes the underlying graph. Consider:

  • Not embedding Graph directly (breaks encapsulation)
  • Using domain-specific terminology (e.g., AddDependency instead of AddEdge)
  • Adding validation at this layer (cycles, self-references)

This would make the business logic clearer and easier to validate.

Comment thread agent/unit/unit.go Outdated
Comment thread agent/unit/unit.go Outdated
Comment thread agent/unit/graph.go Outdated
Comment thread agent/unit/graph.go
Comment thread agent/unit/graph.go Outdated
Comment thread agent/unit/unit_test.go Outdated
Comment thread agent/unit/unit_test.go Outdated
Comment thread agent/unit/graph_test.go Outdated
Comment thread agent/unit/graph_test.go
}
})

t.Run("ConcurrentReadWrite", func(t *testing.T) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we could potentially just keep this concurrency test and drop the others.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've removed redundant tests here 6ff5772

Comment thread agent/unit/graph_test.go Outdated
Comment on lines +65 to +78
// saveDOTFile saves a DOT representation of the graph to a file
func saveDOTFile(t *testing.T, graph *unit.Graph[Status, *Unit], filename string) {
outputDir := createTestOutputDir(t)
dot, err := graph.ToDOT(filename)
if err != nil {
t.Logf("Failed to generate DOT for %s: %v", filename, err)
return
}

dotPath := filepath.Join(outputDir, filename+".dot")
err = os.WriteFile(dotPath, []byte(dot), 0o600)
require.NoError(t, err, "failed to write DOT file")
t.Logf("Saved DOT file: %s", dotPath)
}
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not use t.TempDir() here?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My intention was for these files to be visible in a predictable place after the tests run. But I'm not sure why I preferred that to having them in a temp dir for during test debugging. I think your suggestion is a better approach.

Applied in c528868

Comment thread agent/unit/graph_test.go Outdated
expected, err := os.ReadFile(goldenFile)
require.NoError(t, err, "read golden file, run \"make gen/golden-files\" and commit the changes")

require.Equal(t, string(expected), dot, "golden file mismatch (-want +got): %s, run \"make gen/golden-files\", verify and commit the changes", goldenFile)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use cmp.Diff for a nicer diff output.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done in d79d0a3

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done in d79d0a3

Comment thread agent/unit/graph_test.go Outdated
Comment thread agent/unit/graph_test.go Outdated
Comment on lines +122 to +159
// Check for forward edge
vertices := graph.GetForwardAdjacentVertices(unit1)
require.Len(t, vertices, 2)
// Unit 1 depends on the completion of Unit2
require.Contains(t, vertices, DependencyEdge{
From: unit1,
To: unit2,
Edge: StatusCompleted,
})
// Unit 1 depends on the start of Unit3
require.Contains(t, vertices, DependencyEdge{
From: unit1,
To: unit3,
Edge: StatusStarted,
})

// Check for reverse edges
unit2ReverseEdges := graph.GetReverseAdjacentVertices(unit2)
require.Len(t, unit2ReverseEdges, 1)
// Unit 2 must be completed before Unit 1 can start
require.Contains(t, unit2ReverseEdges, DependencyEdge{
From: unit1,
To: unit2,
Edge: StatusCompleted,
})

unit3ReverseEdges := graph.GetReverseAdjacentVertices(unit3)
require.Len(t, unit3ReverseEdges, 1)
// Unit 3 must be started before Unit 1 can complete
require.Contains(t, unit3ReverseEdges, DependencyEdge{
From: unit1,
To: unit3,
Edge: StatusStarted,
})

// Assert on the DOT representation using golden file
assertDOTGraph(t, graph, "ForwardAndReverseEdges")
})
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could this be converted to a table-driven test?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If done, the table driven test could also assert post-execution that there was a table test for each golden file in the testdata. Leaving unused ones behind in a refactor could be confusing.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in 813c783. They aren't good table tests. The tests aren't really well suited to be table tests in their current form. But we have accomplished an assertion to check that every test case has a golden file and matches it.

Once I implement proper serialization they will look better as table tests

Comment thread agent/unit/graph_test.go Outdated
}
})

t.Run("StressTest", func(t *testing.T) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would this be better off as a benchmark instead?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think so too, stress tests shouldn't be run as part of the regular test suite, making benchmark a good fit.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've turned this into a benchmark in 6ff5772

Comment thread agent/unit/graph_test.go Outdated
Comment on lines +113 to +120
graph := &unit.Graph[Status, *Unit]{}
unit1 := &Unit{Name: "unit1", Status: StatusPending}
unit2 := &Unit{Name: "unit2", Status: StatusPending}
unit3 := &Unit{Name: "unit3", Status: StatusPending}
err := graph.AddEdge(unit1, unit2, StatusCompleted)
require.NoError(t, err)
err = graph.AddEdge(unit1, unit3, StatusStarted)
require.NoError(t, err)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it make more sense to load the graph from the golden DOT instead? A testing pattern I find works well is

  • Load serialized representation
  • Make assertions about data structure
  • Serialize data structure
  • Assert no differences in before versus after ser/deser

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mind if we skip this? It would work well, but the serialization is a little tricky given that marshalling to DOT doesn't store the metadata in the rest of this wrapper and so unmarshaling from a golden file doesn't provide a complete Graph.

I could write complete serialize() and deserialize() methods for the Graph, or I could try to get rid of the wrapper. Getting rid of the wrapper is premature in my opinion and it would be a better use of my time to implement the rest of this feature before I further optimise tests that already prove the functionality.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can skip it. Just a suggestion.

Comment thread agent/unit/graph.go Outdated
g.mu.Lock()
defer g.mu.Unlock()

// Initialize the graph if it doesn't exist
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// Initialize the graph if it doesn't exist
// Initialize the graph on first use.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done in 813c783

Comment thread agent/unit/graph.go Outdated
g.nextID = 1
}

// Get or create IDs for vertices
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// Get or create IDs for vertices

I'd suggest removing all low-value comments. Look especially close at comments that only explain what, not why.

To me the comment and code are saying the exact same thing: getOrCreateVertexID.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in 813c783

Comment thread agent/unit/graph_test.go Outdated
)

// Test types for thread safety testing.
// values in production might differ from these test values.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems a bit vague and premature if we don't even know what production values are going to be. Perhaps name the type testStatus to make the intent clear?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done in c528868

Comment thread agent/unit/graph_test.go Outdated
)

// Unit is a test type for the graph.
// Production types might be more complex.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This also seems a bit vague and doesn't add value, I'd suggest taking a close look at all test comments to remove any superfluous ones.

Consider naming testUnit (like above).

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done in c528868

Comment thread agent/unit/graph_test.go Outdated
// createTestOutputDir creates a directory for test outputs.
// This directory should be listed in the .gitignore file as:
// test-output/
func createTestOutputDir(t *testing.T) string {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could be replaced by t.TempDir().

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yip. done in c528868

Comment thread agent/unit/graph.go
// The next ID to assign to a vertex.
nextID int64
// Store edge types by "fromID->toID" key. This is used to lookup the edge type for a given edge.
edgeTypes map[string]EdgeType
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion: While not typically enforced for unexported fields/types. Go comments are typically written starting with the name, example "edgeTypes are indexed by "fromID->toID" and can be used to ...".

Comment thread agent/unit/graph_test.go Outdated
var UpdateGoldenFiles = flag.Bool("update", false, "update .golden files")

// assertDOTGraph asserts that the graph's DOT representation matches the golden file
func assertDOTGraph(t *testing.T, graph *unit.Graph[Status, *Unit], goldenName string) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given the assert/require nomenclature, this function requires rather than asserts.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated to actually assert in c528868

Comment thread agent/unit/graph_test.go Outdated
Comment on lines +122 to +159
// Check for forward edge
vertices := graph.GetForwardAdjacentVertices(unit1)
require.Len(t, vertices, 2)
// Unit 1 depends on the completion of Unit2
require.Contains(t, vertices, DependencyEdge{
From: unit1,
To: unit2,
Edge: StatusCompleted,
})
// Unit 1 depends on the start of Unit3
require.Contains(t, vertices, DependencyEdge{
From: unit1,
To: unit3,
Edge: StatusStarted,
})

// Check for reverse edges
unit2ReverseEdges := graph.GetReverseAdjacentVertices(unit2)
require.Len(t, unit2ReverseEdges, 1)
// Unit 2 must be completed before Unit 1 can start
require.Contains(t, unit2ReverseEdges, DependencyEdge{
From: unit1,
To: unit2,
Edge: StatusCompleted,
})

unit3ReverseEdges := graph.GetReverseAdjacentVertices(unit3)
require.Len(t, unit3ReverseEdges, 1)
// Unit 3 must be started before Unit 1 can complete
require.Contains(t, unit3ReverseEdges, DependencyEdge{
From: unit1,
To: unit3,
Edge: StatusStarted,
})

// Assert on the DOT representation using golden file
assertDOTGraph(t, graph, "ForwardAndReverseEdges")
})
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If done, the table driven test could also assert post-execution that there was a table test for each golden file in the testdata. Leaving unused ones behind in a refactor could be confusing.

Comment thread agent/unit/graph_test.go Outdated

graph := &unit.Graph[Status, *Unit]{}
var wg sync.WaitGroup
const numGoroutines = 100
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are a bit extreme, I think perhaps 10 should suffice.

It's also a good idea to increase the chance of collisions e.g. via:

sync := make(chan struct{})
for range 10; go func() {
	// prepare ...
	<-sync
	// work ...
}
close(sync)

Comment thread agent/unit/graph_test.go Outdated
to := &Unit{Name: fmt.Sprintf("unit-%d-%d", goroutineID, j+1), Status: StatusPending}
err := graph.AddEdge(from, to, StatusCompleted)

mu.Lock()
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Avoid synchronizing goroutines when you're trying to create collisions. Keep errors local until done, then store them.

(This and above is applicable to the other concurrency tests as well.)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done in df67702

Copy link
Copy Markdown
Member

@mtojek mtojek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@SasSwart I might be missing a bit of context here, but I’d like to better understand where this is heading.

From what I can see, this PR introduces a wrapper around the gonum library, and since gonum already provides a rich graph API, I’m wondering what additional value we’re aiming to add here. Have you written down the next implementation steps somewhere? It’d be great to make sure we’re aligned on the direction before going too deep.

The next step is to add reactivity to the graph, such that changes in the status of one vertex will propagate and allow other vertices to change their own statuses.

That makes sense conceptually, though I’m curious whether the full async trigger mechanism is necessary, especially given that users will likely only have 3–5 scripts running on average. Would a simpler orchestrator that periodically checks the status of running scripts and reacts accordingly be enough to achieve the same result, but with less moving parts?

Also, did we explore any existing task orchestration libraries before starting this implementation? For example:

I’m mostly trying to understand whether this design solves something that current tools can’t, or if we might be able to simplify by leaning on existing solutions.

@mafredri @johnstcn WDYT too?

@SasSwart
Copy link
Copy Markdown
Contributor Author

@mtojek The wrapper is not around gonum specifically. Rather, it is meant as a facade that allows us to swap out the underlying implementation if we need to. I put it behind a wrapper precisely because if we ever want to switch it out with any of the alternatives you've mentioned, that will be easy. This has already shown its worth when I needed to switch from a POC hand rolled implementation to gonum and the facade made that simpler to do.

Right now, second guessing gonum is not the most productive way to ship this quickly. We have something that works for the graphing aspect. The remaining work is to implement:

  • reactivity to changes in the graph (pub/sub)
  • plumbing this into the rest of the ecosystem

I estimate that I can get reactivity implemented in a day, and plumbing in 1.5 days.
Then, we just need to add 1.5 days for code review adjustments.

I have half a day left after my flight today, then I have tomorrow and Friday.
If all goes to plan, we should be able to wrap all components of this work up in time for the code freeze next week.

@SasSwart
Copy link
Copy Markdown
Contributor Author

@mtojek as you've requested internally, I'll take an hour or so away from working on this PR to document the sub-issues and next steps and then I will request approval from @johnstcn or @mafredri (whichever is available first).

@SasSwart SasSwart requested review from johnstcn and mafredri October 24, 2025 13:17
Comment thread agent/unit/graph.go
Comment on lines +49 to +55
if g.gonumGraph == nil {
g.gonumGraph = simple.NewDirectedGraph()
g.vertexToID = make(map[VertexType]int64)
g.idToVertex = make(map[int64]VertexType)
g.edgeTypes = make(map[string]EdgeType)
g.nextID = 1
}
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like this is the only place where we do this "lazy" init. Is the reasoning here to allow a zero value of a Graph to be useful?

Alternatives:

  • Add a constructor
  • Extract this to e.g. initGraph() and ensure this gets called on every function call

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we do this because @mafredri didn't want a constructor 😅

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm down with not having a constructor, but it just seems like a potential foot-gun if this is the only place where we check if g.gonumGraph is not nil.

Comment thread agent/unit/graph_test.go
operationCount := 0

for operationCount < 50 {
operation := float32(randInt(100)) / 100.0
Copy link
Copy Markdown
Member

@johnstcn johnstcn Oct 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't read closely enough before, but since this is the only place you're using randInt here, might I suggest:

Suggested change
operation := float32(randInt(100)) / 100.0
operation := must(cryptorand.Float32())

(Or just use cryptorand.Float64 as it's already defined)

In cryptorand/numbers.go:

// Float32 returns a random number in [0.0,1.0) as a float32.
func Float32() (float32, error) {
	rng, cs := secureRand()
	return rng.Float32(), cs.err
}

This utility function also gets sprinkled a lot throughout the codebase:

func must[T any](v T, err error) T {
  if err != nil {
    panic(err)
  }
}

Comment thread agent/unit/graph_test.go
Comment on lines +228 to +231
const numWriters = 50
const numReaders = 100
const operationsPerWriter = 1000
const operationsPerReader = 2000
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we really need so many ops/readers/writers here? I think we could probably get similar enough assurances with the benchmark while running with reduced concurrency here.

@SasSwart SasSwart merged commit 6c62136 into main Oct 24, 2025
25 checks passed
@SasSwart SasSwart deleted the jjs/coder-10352 branch October 24, 2025 14:18
@github-actions github-actions Bot locked and limited conversation to collaborators Oct 24, 2025
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants