feat: add a dependency management graph for agents by SasSwart · Pull Request #20208 · coder/coder

SasSwart · 2025-10-08T09:34:17Z

This is the first of N pull requests to allow coder script ordering.
It introduces what is for now dead code, but paves the way for various interfaces that allow coder scripts and other processes to depend on one another via CLI commands and terraform configurations.

The next step is to add reactivity to the graph, such that changes in the status of one vertex will propagate and allow other vertices to change their own statuses.

Concurrency and stress testing yield the following:

CPU Profile:

Mem Profile:

Predictably, lock contention and memory allocation are the largest components of this system under stress. Nothing seems untoward.

mafredri

Nice documentation and starting point. 👍🏻

The DependencyCoordinator seems premature as it's currently just a thin wrapper that exposes the underlying graph. Consider:

Not embedding Graph directly (breaks encapsulation)
Using domain-specific terminology (e.g., AddDependency instead of AddEdge)
Adding validation at this layer (cycles, self-references)

This would make the business logic clearer and easier to validate.

… multiple other units

johnstcn · 2025-10-17T08:56:20Z

+		}
+	})
+
+	t.Run("ConcurrentReadWrite", func(t *testing.T) {


I think we could potentially just keep this concurrency test and drop the others.

I've removed redundant tests here 6ff5772

johnstcn · 2025-10-17T08:58:56Z

+// saveDOTFile saves a DOT representation of the graph to a file
+func saveDOTFile(t *testing.T, graph *unit.Graph[Status, *Unit], filename string) {
+	outputDir := createTestOutputDir(t)
+	dot, err := graph.ToDOT(filename)
+	if err != nil {
+		t.Logf("Failed to generate DOT for %s: %v", filename, err)
+		return
+	}
+
+	dotPath := filepath.Join(outputDir, filename+".dot")
+	err = os.WriteFile(dotPath, []byte(dot), 0o600)
+	require.NoError(t, err, "failed to write DOT file")
+	t.Logf("Saved DOT file: %s", dotPath)
+}


Why not use t.TempDir() here?

My intention was for these files to be visible in a predictable place after the tests run. But I'm not sure why I preferred that to having them in a temp dir for during test debugging. I think your suggestion is a better approach.

Applied in c528868

johnstcn · 2025-10-17T08:59:10Z

+	expected, err := os.ReadFile(goldenFile)
+	require.NoError(t, err, "read golden file, run \"make gen/golden-files\" and commit the changes")
+
+	require.Equal(t, string(expected), dot, "golden file mismatch (-want +got): %s, run \"make gen/golden-files\", verify and commit the changes", goldenFile)


Use cmp.Diff for a nicer diff output.

done in d79d0a3

done in d79d0a3

johnstcn · 2025-10-17T09:02:20Z

+		// Check for forward edge
+		vertices := graph.GetForwardAdjacentVertices(unit1)
+		require.Len(t, vertices, 2)
+		// Unit 1 depends on the completion of Unit2
+		require.Contains(t, vertices, DependencyEdge{
+			From: unit1,
+			To:   unit2,
+			Edge: StatusCompleted,
+		})
+		// Unit 1 depends on the start of Unit3
+		require.Contains(t, vertices, DependencyEdge{
+			From: unit1,
+			To:   unit3,
+			Edge: StatusStarted,
+		})
+
+		// Check for reverse edges
+		unit2ReverseEdges := graph.GetReverseAdjacentVertices(unit2)
+		require.Len(t, unit2ReverseEdges, 1)
+		// Unit 2 must be completed before Unit 1 can start
+		require.Contains(t, unit2ReverseEdges, DependencyEdge{
+			From: unit1,
+			To:   unit2,
+			Edge: StatusCompleted,
+		})
+
+		unit3ReverseEdges := graph.GetReverseAdjacentVertices(unit3)
+		require.Len(t, unit3ReverseEdges, 1)
+		// Unit 3 must be started before Unit 1 can complete
+		require.Contains(t, unit3ReverseEdges, DependencyEdge{
+			From: unit1,
+			To:   unit3,
+			Edge: StatusStarted,
+		})
+
+		// Assert on the DOT representation using golden file
+		assertDOTGraph(t, graph, "ForwardAndReverseEdges")
+	})


Could this be converted to a table-driven test?

If done, the table driven test could also assert post-execution that there was a table test for each golden file in the testdata. Leaving unused ones behind in a refactor could be confusing.

Done in 813c783. They aren't good table tests. The tests aren't really well suited to be table tests in their current form. But we have accomplished an assertion to check that every test case has a golden file and matches it.

Once I implement proper serialization they will look better as table tests

johnstcn · 2025-10-17T09:02:46Z

+		}
+	})
+
+	t.Run("StressTest", func(t *testing.T) {


Would this be better off as a benchmark instead?

I think so too, stress tests shouldn't be run as part of the regular test suite, making benchmark a good fit.

I've turned this into a benchmark in 6ff5772

johnstcn · 2025-10-17T09:05:46Z

+		graph := &unit.Graph[Status, *Unit]{}
+		unit1 := &Unit{Name: "unit1", Status: StatusPending}
+		unit2 := &Unit{Name: "unit2", Status: StatusPending}
+		unit3 := &Unit{Name: "unit3", Status: StatusPending}
+		err := graph.AddEdge(unit1, unit2, StatusCompleted)
+		require.NoError(t, err)
+		err = graph.AddEdge(unit1, unit3, StatusStarted)
+		require.NoError(t, err)


Would it make more sense to load the graph from the golden DOT instead? A testing pattern I find works well is

Load serialized representation

Make assertions about data structure

Serialize data structure

Assert no differences in before versus after ser/deser

Do you mind if we skip this? It would work well, but the serialization is a little tricky given that marshalling to DOT doesn't store the metadata in the rest of this wrapper and so unmarshaling from a golden file doesn't provide a complete Graph.

I could write complete serialize() and deserialize() methods for the Graph, or I could try to get rid of the wrapper. Getting rid of the wrapper is premature in my opinion and it would be a better use of my time to implement the rest of this feature before I further optimise tests that already prove the functionality.

We can skip it. Just a suggestion.

mafredri · 2025-10-17T10:59:29Z

+	g.mu.Lock()
+	defer g.mu.Unlock()
+
+	// Initialize the graph if it doesn't exist


Suggested change

// Initialize the graph if it doesn't exist

// Initialize the graph on first use.

done in 813c783

mafredri · 2025-10-17T11:00:58Z

+		g.nextID = 1
+	}
+
+	// Get or create IDs for vertices


Suggested change

// Get or create IDs for vertices

I'd suggest removing all low-value comments. Look especially close at comments that only explain what, not why.

To me the comment and code are saying the exact same thing: getOrCreateVertexID.

Done in 813c783

mafredri · 2025-10-17T11:04:19Z

+)
+
+// Test types for thread safety testing.
+// values in production might differ from these test values.


This seems a bit vague and premature if we don't even know what production values are going to be. Perhaps name the type testStatus to make the intent clear?

done in c528868

mafredri · 2025-10-17T11:05:49Z

+)
+
+// Unit is a test type for the graph.
+// Production types might be more complex.


This also seems a bit vague and doesn't add value, I'd suggest taking a close look at all test comments to remove any superfluous ones.

Consider naming testUnit (like above).

done in c528868

mafredri · 2025-10-17T11:08:37Z

+// createTestOutputDir creates a directory for test outputs.
+// This directory should be listed in the .gitignore file as:
+// test-output/
+func createTestOutputDir(t *testing.T) string {


Could be replaced by t.TempDir().

yip. done in c528868

mafredri · 2025-10-17T11:14:04Z

+	// The next ID to assign to a vertex.
+	nextID int64
+	// Store edge types by "fromID->toID" key. This is used to lookup the edge type for a given edge.
+	edgeTypes map[string]EdgeType


Suggestion: While not typically enforced for unexported fields/types. Go comments are typically written starting with the name, example "edgeTypes are indexed by "fromID->toID" and can be used to ...".

mafredri · 2025-10-17T11:15:16Z

+var UpdateGoldenFiles = flag.Bool("update", false, "update .golden files")
+
+// assertDOTGraph asserts that the graph's DOT representation matches the golden file
+func assertDOTGraph(t *testing.T, graph *unit.Graph[Status, *Unit], goldenName string) {


Given the assert/require nomenclature, this function requires rather than asserts.

updated to actually assert in c528868

mafredri · 2025-10-17T11:18:43Z

+		// Check for forward edge
+		vertices := graph.GetForwardAdjacentVertices(unit1)
+		require.Len(t, vertices, 2)
+		// Unit 1 depends on the completion of Unit2
+		require.Contains(t, vertices, DependencyEdge{
+			From: unit1,
+			To:   unit2,
+			Edge: StatusCompleted,
+		})
+		// Unit 1 depends on the start of Unit3
+		require.Contains(t, vertices, DependencyEdge{
+			From: unit1,
+			To:   unit3,
+			Edge: StatusStarted,
+		})
+
+		// Check for reverse edges
+		unit2ReverseEdges := graph.GetReverseAdjacentVertices(unit2)
+		require.Len(t, unit2ReverseEdges, 1)
+		// Unit 2 must be completed before Unit 1 can start
+		require.Contains(t, unit2ReverseEdges, DependencyEdge{
+			From: unit1,
+			To:   unit2,
+			Edge: StatusCompleted,
+		})
+
+		unit3ReverseEdges := graph.GetReverseAdjacentVertices(unit3)
+		require.Len(t, unit3ReverseEdges, 1)
+		// Unit 3 must be started before Unit 1 can complete
+		require.Contains(t, unit3ReverseEdges, DependencyEdge{
+			From: unit1,
+			To:   unit3,
+			Edge: StatusStarted,
+		})
+
+		// Assert on the DOT representation using golden file
+		assertDOTGraph(t, graph, "ForwardAndReverseEdges")
+	})


If done, the table driven test could also assert post-execution that there was a table test for each golden file in the testdata. Leaving unused ones behind in a refactor could be confusing.

mafredri · 2025-10-17T11:21:04Z

+
+		graph := &unit.Graph[Status, *Unit]{}
+		var wg sync.WaitGroup
+		const numGoroutines = 100


These are a bit extreme, I think perhaps 10 should suffice.

It's also a good idea to increase the chance of collisions e.g. via:

sync := make(chan struct{}) for range 10; go func() { // prepare ... <-sync // work ... } close(sync)

mafredri · 2025-10-17T11:22:23Z

+					to := &Unit{Name: fmt.Sprintf("unit-%d-%d", goroutineID, j+1), Status: StatusPending}
+					err := graph.AddEdge(from, to, StatusCompleted)
+
+					mu.Lock()


Avoid synchronizing goroutines when you're trying to create collisions. Keep errors local until done, then store them.

(This and above is applicable to the other concurrency tests as well.)

done in df67702

mtojek

@SasSwart I might be missing a bit of context here, but I’d like to better understand where this is heading.

From what I can see, this PR introduces a wrapper around the gonum library, and since gonum already provides a rich graph API, I’m wondering what additional value we’re aiming to add here. Have you written down the next implementation steps somewhere? It’d be great to make sure we’re aligned on the direction before going too deep.

The next step is to add reactivity to the graph, such that changes in the status of one vertex will propagate and allow other vertices to change their own statuses.

That makes sense conceptually, though I’m curious whether the full async trigger mechanism is necessary, especially given that users will likely only have 3–5 scripts running on average. Would a simpler orchestrator that periodically checks the status of running scripts and reacts accordingly be enough to achieve the same result, but with less moving parts?

Also, did we explore any existing task orchestration libraries before starting this implementation? For example:

goyek/goyek
Roemer/gotaskr
go-task/task
(though I’m not sure if it offers a usable library API)

I’m mostly trying to understand whether this design solves something that current tools can’t, or if we might be able to simplify by leaning on existing solutions.

@mafredri @johnstcn WDYT too?

SasSwart · 2025-10-22T09:23:08Z

@mtojek The wrapper is not around gonum specifically. Rather, it is meant as a facade that allows us to swap out the underlying implementation if we need to. I put it behind a wrapper precisely because if we ever want to switch it out with any of the alternatives you've mentioned, that will be easy. This has already shown its worth when I needed to switch from a POC hand rolled implementation to gonum and the facade made that simpler to do.

Right now, second guessing gonum is not the most productive way to ship this quickly. We have something that works for the graphing aspect. The remaining work is to implement:

reactivity to changes in the graph (pub/sub)
plumbing this into the rest of the ecosystem

I estimate that I can get reactivity implemented in a day, and plumbing in 1.5 days.
Then, we just need to add 1.5 days for code review adjustments.

I have half a day left after my flight today, then I have tomorrow and Friday.
If all goes to plan, we should be able to wrap all components of this work up in time for the code freeze next week.

SasSwart · 2025-10-22T09:32:18Z

@mtojek as you've requested internally, I'll take an hour or so away from working on this PR to document the sub-issues and next steps and then I will request approval from @johnstcn or @mafredri (whichever is available first).

johnstcn · 2025-10-24T13:22:17Z

+	if g.gonumGraph == nil {
+		g.gonumGraph = simple.NewDirectedGraph()
+		g.vertexToID = make(map[VertexType]int64)
+		g.idToVertex = make(map[int64]VertexType)
+		g.edgeTypes = make(map[string]EdgeType)
+		g.nextID = 1
+	}


It looks like this is the only place where we do this "lazy" init. Is the reasoning here to allow a zero value of a Graph to be useful?

Alternatives:

Add a constructor

Extract this to e.g. initGraph() and ensure this gets called on every function call

we do this because @mafredri didn't want a constructor 😅

I'm down with not having a constructor, but it just seems like a potential foot-gun if this is the only place where we check if g.gonumGraph is not nil.

johnstcn · 2025-10-24T13:28:37Z

+				operationCount := 0
+
+				for operationCount < 50 {
+					operation := float32(randInt(100)) / 100.0


I didn't read closely enough before, but since this is the only place you're using randInt here, might I suggest:

Suggested change

operation := float32(randInt(100)) / 100.0

operation := must(cryptorand.Float32())

(Or just use cryptorand.Float64 as it's already defined)

In cryptorand/numbers.go:

// Float32 returns a random number in [0.0,1.0) as a float32. func Float32() (float32, error) { rng, cs := secureRand() return rng.Float32(), cs.err }

This utility function also gets sprinkled a lot throughout the codebase:

func must[T any](v T, err error) T { if err != nil { panic(err) } }

johnstcn · 2025-10-24T13:42:04Z

+		const numWriters = 50
+		const numReaders = 100
+		const operationsPerWriter = 1000
+		const operationsPerReader = 2000


Do we really need so many ops/readers/writers here? I think we could probably get similar enough assurances with the benchmark while running with reduced concurrency here.

add in memory lock coordinator for script ordering use

8660ced

github-actions Bot assigned SasSwart Oct 8, 2025

SasSwart changed the title ~~add in memory lock coordinator for script ordering use~~ feat: allow orchestration and ordering of coder scripts Oct 8, 2025

SasSwart added 4 commits October 8, 2025 13:52

rename sync to unit

804bd54

Document the use of agent units

293d310

elaborate documentation

75def32

Refactor agent/unit for consistency

f9aa288

github-actions Bot added the stale This issue is like stale bread. label Oct 16, 2025

SasSwart removed the stale This issue is like stale bread. label Oct 16, 2025

Implement a bidirectional graph for agent unit dependency management

9dbfbfb

SasSwart changed the title ~~feat: allow orchestration and ordering of coder scripts~~ feat: add a dependency management graph for agents Oct 16, 2025

SasSwart marked this pull request as ready for review October 16, 2025 08:45

SasSwart requested review from johnstcn and mafredri October 16, 2025 08:47

Update names to appease the linter

542484c

johnstcn reviewed Oct 16, 2025

View reviewed changes

Comment thread agent/unit/graph.go

Comment thread agent/unit/unit_test.go Outdated

johnstcn reviewed Oct 16, 2025

View reviewed changes

Comment thread agent/unit/unit_test.go Outdated

Document major components of the new agent dependency management system

2eab36b

mafredri reviewed Oct 16, 2025

View reviewed changes

SasSwart added 11 commits October 16, 2025 11:39

simplify graph init

6fd8725

add cycle checking and tests

20225b2

switch to gonum

610babb

add DOT based golden file testing

f3c3720

generate golden files

f1c0a90

Add test case to ensure that a unit can depend on the same status for…

8c31399

… multiple other units

update docs

5d2fb9e

turn embedded struct into attribute to preserve encapsulation

775ea06

remove defunct file

91d3235

remove defunct file

5ecedfe

add tests for concurrency

b964ca8

make -B fmt lint

1a57847

johnstcn reviewed Oct 17, 2025

View reviewed changes

add documentation and improve concurrency test

becc4e7

mafredri reviewed Oct 17, 2025

View reviewed changes

SasSwart mentioned this pull request Oct 20, 2025

feat: allow execution ordering for coder_script #10352

Closed

mtojek reviewed Oct 21, 2025

View reviewed changes

SasSwart added 9 commits October 23, 2025 11:34

simplify the randInt test function

ce51bad

convert TestGraph to table test and improve documentation

813c783

simplify tests

c528868

use cmp.Diff to provide more useful test outputs

d79d0a3

make fmt

e6751ce

increase the chances of collisions on graph concurrency tests

291cea8

remove redundant graph tests

6ff5772

fix make gen nilpointer

49be2b0

increase the chances of a race condition in agent graph tests

df67702

SasSwart requested review from johnstcn and mafredri October 24, 2025 13:17

strip carriage returns from graph goldens files

722f652

johnstcn approved these changes Oct 24, 2025

View reviewed changes

SasSwart mentioned this pull request Oct 24, 2025

add a graph to workspace agents that tracks dependencies between scripts coder/internal#1093

Closed

SasSwart merged commit 6c62136 into main Oct 24, 2025
25 checks passed

SasSwart deleted the jjs/coder-10352 branch October 24, 2025 14:18

github-actions Bot locked and limited conversation to collaborators Oct 24, 2025

	// Initialize the graph if it doesn't exist
	// Initialize the graph on first use.

	operation := float32(randInt(100)) / 100.0
	operation := must(cryptorand.Float32())

Conversation

SasSwart commented Oct 8, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

mafredri left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

SasSwart commented Oct 8, 2025 •

edited

Loading

mtojek left a comment •

edited

Loading

johnstcn Oct 24, 2025 •

edited

Loading