Skip to content
Draft
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
102 changes: 102 additions & 0 deletions .coder/plans/PLAN-00f509fa-4804-4f18-b262-d1a164d7e269.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,102 @@
# RFC: Register Workspace Agent Scripts as Sync Units

## Problem

The `agent/unit.Manager` (sync unit system) and `agent/agentscripts.Runner`
(script runner) are currently independent. Scripts run by the agent are not
visible in `coder exp sync list`, and there is no way to observe script
lifecycle through the dependency coordination system.

## Goal

Every script managed by the script runner should be registered as a sync
unit so that:

1. All scripts are immediately visible via `coder exp sync list` as soon as
the script runner initializes (bulk registration, all at once).
2. As the script runner kicks off each script, its unit status transitions
to `started`.
3. When a script finishes, its unit status transitions to `completed`.

## Design

### Unit name

Each script's `DisplayName` is used as the unit name (`unit.ID`). This is
user-friendly but not guaranteed unique. Uniqueness will be addressed in a
follow-up.

### Architecture: shared `unit.Manager`

Currently, `unit.NewManager()` is created inside `agentsocket.NewServer()`,
hidden from the rest of the agent. This PR lifts the `unit.Manager` out:

```
agent.init()
├─ unitManager := unit.NewManager()
├─ scriptRunner := agentscripts.New(opts) // receives unitManager
└─ socketServer := agentsocket.NewServer(logger, unitManager, ...) // receives same instance
```

The agent becomes the composition root that owns the manager and passes it
to both consumers.

### Changes by layer

#### 1. `agent/agentsocket/server.go`

- `NewServer` takes `*unit.Manager` as an explicit required parameter
(not an Option, since Options are shared with the client).
- The service no longer creates its own manager; it uses the one provided.

#### 2. `agent/agentscripts/agentscripts.go`

- New optional field `UnitManager *unit.Manager` in `Options`.
- **`Init()`**: after storing scripts, registers each script as a unit
using `script.DisplayName` as the `unit.ID`. Registration errors for
empty display names are logged and skipped; duplicate registration
errors are tolerated (idempotent).
- **`run()`**: at the start of execution (before `cmd.Start()`), updates
the unit status to `StatusStarted`. After execution completes (in the
deferred completion block), updates the unit status to `StatusComplete`.
- All unit manager calls are guarded by a nil check, so the script runner
works without a unit manager (backwards compatible, tests unaffected).

#### 3. `agent/agent.go`

- `init()` creates `unit.NewManager()` before the script runner and socket
server.
- Passes the manager to `agentscripts.New()` via `Options.UnitManager`.
- Passes the manager to `agentsocket.NewServer()` as a required argument.
- `initSocketServer()` updated to accept and forward the manager.

### Status lifecycle per script

```
Init() → Register(displayName) → pending
run() before start → UpdateStatus(_, StatusStarted) → started
run() after complete → UpdateStatus(_, StatusComplete) → completed
```

For cron scripts that run multiple times, the cycle repeats:
`completed → started → completed → ...`

### What this does NOT change

- No proto/DRPC changes; the existing `SyncList` RPC from PR #26443
surfaces the units as-is.
- No new CLI commands.
- No uniqueness enforcement on `DisplayName` (follow-up).
- No dependency edges between scripts (follow-up).
- Stop scripts and cron scripts follow the same pattern.

## Files changed

| File | Change |
|------|--------|
| `agent/unit/manager.go` | No changes needed |
| `agent/agentsocket/server.go` | `NewServer` takes `*unit.Manager` param |
| `agent/agentsocket/service.go` | No changes needed |
| `agent/agentscripts/agentscripts.go` | Add `UnitManager` to Options; register in Init; update status in run |
| `agent/agent.go` | Create shared manager; wire to both subsystems |
| Tests | Update `NewServer` call sites; add script-unit integration test |
5 changes: 5 additions & 0 deletions agent/agent.go
Original file line number Diff line number Diff line change
Expand Up @@ -52,6 +52,7 @@ import (
"github.com/coder/coder/v2/agent/proto"
"github.com/coder/coder/v2/agent/proto/resourcesmonitor"
"github.com/coder/coder/v2/agent/reconnectingpty"
"github.com/coder/coder/v2/agent/unit"
"github.com/coder/coder/v2/agent/usershell"
"github.com/coder/coder/v2/agent/x/agentdesktop"
"github.com/coder/coder/v2/agent/x/agentmcp"
Expand Down Expand Up @@ -367,6 +368,7 @@ type agent struct {
socketServerEnabled bool
socketPath string
socketServer *agentsocket.Server
unitManager *unit.Manager

derpTLSConfig *tls.Config
}
Expand Down Expand Up @@ -449,6 +451,7 @@ func (a *agent) init() {
panic(err)
}
a.sshServer = sshSrv
a.unitManager = unit.NewManager()
a.scriptRunner = agentscripts.New(agentscripts.Options{
LogDir: a.logDir,
DataDirBase: a.scriptDataDir,
Expand All @@ -458,6 +461,7 @@ func (a *agent) init() {
GetScriptLogger: func(logSourceID uuid.UUID) agentscripts.ScriptLogger {
return a.logSender.GetScriptLogger(logSourceID)
},
UnitManager: a.unitManager,
})
// Register runner metrics. If the prom registry is nil, the metrics
// will not report anywhere.
Expand Down Expand Up @@ -553,6 +557,7 @@ func (a *agent) initSocketServer() {

server, err := agentsocket.NewServer(
a.logger.Named("socket"),
a.unitManager,
agentsocket.WithPath(a.socketPath),
)
if err != nil {
Expand Down
62 changes: 62 additions & 0 deletions agent/agentscripts/agentscripts.go
Original file line number Diff line number Diff line change
Expand Up @@ -23,6 +23,7 @@ import (
"cdr.dev/slog/v3"
"github.com/coder/coder/v2/agent/agentssh"
"github.com/coder/coder/v2/agent/proto"
"github.com/coder/coder/v2/agent/unit"
"github.com/coder/coder/v2/coderd/database/dbtime"
"github.com/coder/coder/v2/codersdk"
"github.com/coder/coder/v2/codersdk/agentsdk"
Expand Down Expand Up @@ -56,6 +57,9 @@ type Options struct {
SSHServer *agentssh.Server
Filesystem afero.Fs
GetScriptLogger func(logSourceID uuid.UUID) ScriptLogger
// UnitManager, when set, causes the runner to register each script as a
// sync unit during Init and to update unit status as scripts execute.
UnitManager *unit.Manager
}

// New creates a runner for the provided scripts.
Expand Down Expand Up @@ -139,6 +143,29 @@ func (r *Runner) Init(scripts []codersdk.WorkspaceAgentScript, scriptCompleted S
}
r.Logger.Info(r.cronCtx, "initializing agent scripts", slog.F("script_count", len(scripts)), slog.F("log_dir", r.LogDir))

// Register all scripts as sync units so they are immediately visible
// via "coder exp sync list". Registration happens before any script
// executes, giving observers a complete picture of pending work.
if r.UnitManager != nil {
for _, script := range scripts {
unitName := scriptUnitName(script)
if unitName == "" {
r.Logger.Warn(r.cronCtx, "skipping unit registration for script with no usable name",
slog.F("script_id", script.ID))
continue
}
if err := r.UnitManager.Register(unit.ID(unitName)); err != nil {
if errors.Is(err, unit.ErrUnitAlreadyRegistered) {
r.Logger.Warn(r.cronCtx, "duplicate unit name during script registration",
slog.F("unit_name", unitName),
slog.F("script_id", script.ID))
continue
}
return xerrors.Errorf("register script unit %q: %w", unitName, err)
}
}
}

err := r.Filesystem.MkdirAll(r.ScriptBinDir(), 0o700)
if err != nil {
return xerrors.Errorf("create script bin dir: %w", err)
Expand Down Expand Up @@ -335,6 +362,9 @@ func (r *Runner) run(ctx context.Context, script codersdk.WorkspaceAgentScript,
cmd.Stdout = io.MultiWriter(fileWriter, infoW)
cmd.Stderr = io.MultiWriter(fileWriter, errW)

// Mark the script's sync unit as started before execution begins.
r.setUnitStatus(ctx, script, unit.StatusStarted)

start := dbtime.Now()
defer func() {
end := dbtime.Now()
Expand Down Expand Up @@ -404,6 +434,10 @@ func (r *Runner) run(ctx context.Context, script codersdk.WorkspaceAgentScript,
if err != nil {
logger.Warn(ctx, "reporting script completed: track command goroutine", slog.Error(err))
}

// Mark the script's sync unit as completed after execution finishes,
// regardless of whether the script succeeded or failed.
r.setUnitStatus(ctx, script, unit.StatusComplete)
}()

err = cmd.Start()
Expand Down Expand Up @@ -488,3 +522,31 @@ func (r *Runner) isClosed() bool {
return false
}
}

// setUnitStatus updates the sync unit status for a script.
// It is a no-op when no UnitManager is configured or when the script
// has no usable unit name.
func (r *Runner) setUnitStatus(ctx context.Context, script codersdk.WorkspaceAgentScript, status unit.Status) {
unitName := scriptUnitName(script)
if r.UnitManager == nil || unitName == "" {
return
}
if err := r.UnitManager.UpdateStatus(unit.ID(unitName), status); err != nil {
// Log but do not fail; unit tracking is best-effort relative
// to script execution.
r.Logger.Warn(ctx, "failed to update unit status",
slog.F("unit_name", unitName),
slog.F("status", string(status)),
slog.Error(err))
}
}

// scriptUnitName returns the unit name for a script. It prefers
// ResourceAddress (the Terraform resource address, unique across
// modules) and falls back to DisplayName.
func scriptUnitName(script codersdk.WorkspaceAgentScript) string {
if script.ResourceAddress != "" {
return script.ResourceAddress
}
return script.DisplayName
}
124 changes: 124 additions & 0 deletions agent/agentscripts/agentscripts_test.go
Original file line number Diff line number Diff line change
Expand Up @@ -19,6 +19,7 @@ import (
"github.com/coder/coder/v2/agent/agentscripts"
"github.com/coder/coder/v2/agent/agentssh"
"github.com/coder/coder/v2/agent/agenttest"
"github.com/coder/coder/v2/agent/unit"
"github.com/coder/coder/v2/codersdk"
"github.com/coder/coder/v2/codersdk/agentsdk"
"github.com/coder/coder/v2/testutil"
Expand Down Expand Up @@ -355,3 +356,126 @@ func (*fakeScriptLogger) Flush(context.Context) error {
func newFakeScriptLogger() *fakeScriptLogger {
return &fakeScriptLogger{make(chan agentsdk.Log, 100)}
}

func setupWithUnitManager(t *testing.T, getScriptLogger func(uuid.UUID) agentscripts.ScriptLogger) (*agentscripts.Runner, *unit.Manager) {
t.Helper()
logger := testutil.Logger(t)
fs := afero.NewOsFs()
s, err := agentssh.NewServer(context.Background(), logger, prometheus.NewRegistry(), fs, agentexec.DefaultExecer, nil)
require.NoError(t, err)
t.Cleanup(func() {
_ = s.Close()
})
m := unit.NewManager()
r := agentscripts.New(agentscripts.Options{
LogDir: t.TempDir(),
DataDirBase: t.TempDir(),
Logger: logger,
SSHServer: s,
Filesystem: fs,
GetScriptLogger: getScriptLogger,
UnitManager: m,
})
return r, m
}

func TestScriptUnitsRegistered(t *testing.T) {
t.Parallel()

runner, mgr := setupWithUnitManager(t, func(_ uuid.UUID) agentscripts.ScriptLogger {
return noopScriptLogger{}
})
defer runner.Close()

scripts := []codersdk.WorkspaceAgentScript{
{
ID: uuid.New(),
LogSourceID: uuid.New(),
DisplayName: "Install Tools",
ResourceAddress: "coder_script.install_tools",
Script: "echo install",
RunOnStart: true,
},
{
ID: uuid.New(),
LogSourceID: uuid.New(),
DisplayName: "Configure Environment",
ResourceAddress: "module.dev.coder_script.configure",
Script: "echo configure",
RunOnStart: true,
},
{
ID: uuid.New(),
LogSourceID: uuid.New(),
DisplayName: "", // No display name or resource address; should be skipped.
Script: "echo nameless",
RunOnStart: true,
},
{
ID: uuid.New(),
LogSourceID: uuid.New(),
DisplayName: "Fallback Only", // No resource address; falls back to display name.
Script: "echo fallback",
RunOnStart: true,
},
}

aAPI := agenttest.NewFakeAgentAPI(t, testutil.Logger(t), nil, nil)
err := runner.Init(scripts, aAPI.ScriptCompleted)
require.NoError(t, err)

// Three scripts should be registered (two with resource address, one with display name only).
units := mgr.ListUnits()
require.Len(t, units, 3)

unitsByName := make(map[unit.ID]unit.Unit)
for _, u := range units {
unitsByName[u.ID()] = u
}

// ResourceAddress takes precedence over DisplayName.
for _, name := range []string{"coder_script.install_tools", "module.dev.coder_script.configure", "Fallback Only"} {
u, ok := unitsByName[unit.ID(name)]
require.True(t, ok, "expected unit %q to be registered", name)
require.Equal(t, unit.StatusPending, u.Status(), "unit %q should be pending", name)
}
}

func TestScriptUnitsLifecycle(t *testing.T) {
t.Parallel()

runner, mgr := setupWithUnitManager(t, func(_ uuid.UUID) agentscripts.ScriptLogger {
return noopScriptLogger{}
})
defer runner.Close()

scriptName := "coder_script.my_script"
scripts := []codersdk.WorkspaceAgentScript{
{
ID: uuid.New(),
LogSourceID: uuid.New(),
DisplayName: "My Script",
ResourceAddress: scriptName,
Script: "echo lifecycle",
RunOnStart: true,
},
}

aAPI := agenttest.NewFakeAgentAPI(t, testutil.Logger(t), nil, nil)
err := runner.Init(scripts, aAPI.ScriptCompleted)
require.NoError(t, err)

// Before execution: pending.
u, err := mgr.Unit(unit.ID(scriptName))
require.NoError(t, err)
require.Equal(t, unit.StatusPending, u.Status())

// Execute the script.
err = runner.Execute(context.Background(), agentscripts.ExecuteStartScripts)
require.NoError(t, err)

// After execution: completed.
u, err = mgr.Unit(unit.ID(scriptName))
require.NoError(t, err)
require.Equal(t, unit.StatusComplete, u.Status())
}
Loading
Loading