26 commits
c7774fe
improvement(helm): production-ready chart with security, ESO, and doc…
waleedlatif1 May 12, 2026
80bfaf8
fix(helm): correct resource names in README (sim-sim-* → sim-*)
waleedlatif1 May 12, 2026
7644ba4
improvement(helm): split app/realtime env into Secret-bound + inline …
waleedlatif1 May 12, 2026
c6a4478
fix(helm): address PR review — cronjob validation, ESO apiVersion, se…
waleedlatif1 May 12, 2026
794e418
fix(helm): require critical secrets to be mapped when ESO is enabled
waleedlatif1 May 12, 2026
3340872
fix(helm): auto-enable PDB when HPA minReplicas > 1
waleedlatif1 May 12, 2026
c0bf587
fix(helm): prevent realtime envDefaults from masking app.env Secret v…
waleedlatif1 May 12, 2026
d1a5394
feat(helm): add Claude Skill for chart deployment
waleedlatif1 May 12, 2026
2fa9259
docs(helm): add CRON_SECRET to TL;DR, dry-run, and example install he…
waleedlatif1 May 12, 2026
9519a8f
fix(helm): require INTERNAL_API_SECRET in inline secret mode
waleedlatif1 May 12, 2026
2919ad6
docs(helm): surface INTERNAL_API_SECRET upgrade requirement in NOTES.txt
waleedlatif1 May 12, 2026
9a6b68b
fix(helm): NetworkPolicy egress to OTEL collector + external-db examp…
waleedlatif1 May 12, 2026
5138b09
fix(helm): NOTES.txt no longer prints false secret warning for ESO users
waleedlatif1 May 12, 2026
e05a8af
fix(helm): existingSecret mode no longer drops app.env / realtime.env…
waleedlatif1 May 12, 2026
bc50116
fix(helm): correct realtime env overlay + filter chart-computed keys …
waleedlatif1 May 12, 2026
17632aa
fix(helm): skip envDefaults in existingSecret mode + document egress …
waleedlatif1 May 12, 2026
468dad1
fix(helm): copy-pasteable install commands in copilot + ESO examples
waleedlatif1 May 12, 2026
33a45a0
polish(helm): configurable NetworkPolicy ingress peers + clearer API_…
waleedlatif1 May 12, 2026
716a677
test(helm): add helm-unittest suites + CI workflow + ci values matrix
waleedlatif1 May 12, 2026
4dc7966
test(helm): add helm test hook + kind apiserver dry-run in CI
waleedlatif1 May 12, 2026
0de97f4
chore(helm): remove pre-1.0.0 upgrade fluff + tighten .helmignore
waleedlatif1 May 12, 2026
34b1b6e
chore(helm): drop CI workflow + ci/ fixtures + CONTRIBUTING.md
waleedlatif1 May 12, 2026
51cbb9e
feat(helm): pod rollout on Secret change + topologySpreadConstraints
waleedlatif1 May 12, 2026
570e5f0
fix(helm): drop empty-string shadowing in app/realtime env merge
waleedlatif1 May 12, 2026
a4837de
fix(helm): make topologySpreadConstraints per-component to match docs…
waleedlatif1 May 12, 2026
b9ceff9
fix(helm): allow cron pods through app NetworkPolicy
waleedlatif1 May 12, 2026
154 changes: 154 additions & 0 deletions helm/sim/.claude/skills/sim-helm/SKILL.md
@@ -0,0 +1,154 @@
---
name: sim-helm
description: Install, upgrade, and operate the Sim Helm chart on Kubernetes. Covers install path selection (inline / existingSecret / External Secrets Operator), required secret generation, the values.yaml mental model (env vs envDefaults vs Secret), and common failure triage. Invoke when a user asks about deploying Sim to a cluster, authoring a Sim values.yaml, debugging a Sim pod that won't start, upgrading a Sim release, or wiring Sim into a secret manager.
license: Apache-2.0
---

# Sim Helm Chart — Operations Skill

This skill helps an agent deploy and operate the **Sim** Helm chart at `helm/sim/` in the [simstudioai/sim](https://github.com/simstudioai/sim) repository. Use it when the user is installing, upgrading, troubleshooting, or authoring values for the Sim chart.

The skill is **diagnostic-first**: capture context, classify the situation, load only the references that apply, then act. Do not dump the README at the user. Do not invent values that are not in their current state.

---

## Workflow — follow in order

### 1. Capture context

Before recommending anything, ask (or infer from the conversation) all of these. **Never skip this step.** A wrong assumption here corrupts every downstream step.

| Question | Why it matters |
|---|---|
| Cluster: EKS / GKE / AKS / OpenShift / kind / other? | Storage class, ingress class, identity provider differ |
| Secret strategy: inline `--set`, pre-existing K8s Secret, or External Secrets Operator (ESO)? | The chart has three distinct code paths |
| Postgres: chart-bundled, or external (RDS / Cloud SQL / Azure DB)? | Different value blocks (`postgresql.*` vs `externalDatabase.*`) |
| Public-facing? Ingress class? TLS? | `ingress.enabled`, `ingress.className`, cert-manager wiring |
| HA? (target replicas) | Drives `autoscaling.enabled`, `app.replicaCount`, PDB activation |
| Existing values.yaml the user is editing? | Always read it before proposing a diff — never write blind |

If the user has a `values.yaml`, read it. If they don't, ask before writing one.

### 2. Diagnose

Map the user's request to one of these categories and load the matching reference(s):

| Situation | Reference |
|---|---|
| User wants to install for the first time | `references/install-paths.md` then `references/secrets.md` |
| User needs to generate the required secrets | `references/secrets.md` |
| User asks "what does this value do" / wants to author values.yaml | `references/values-model.md` |
| Pod won't start, error message, `CrashLoopBackOff`, image pull error, ingress not routing | `references/troubleshooting.md` |
| User asks about ESO / Vault / AWS Secrets Manager / Azure Key Vault / GCP Secret Manager | `references/install-paths.md` (ESO section) |
| User asks "is X production-ready" / autoscaling / network policy / security context | Read the README's "Production checklist" section directly — no separate reference |

Load **only** what the situation requires. Loading every reference burns tokens and produces vague answers.

### 3. Propose

When proposing values changes:

- Show the **minimal diff** against the user's current values.yaml. Don't rewrite the file.
- Name the **risk** (e.g., "this puts the secret in `helm get values` output — fine for dev, not for prod").
- Name the **rollback** (e.g., "if this breaks, `helm rollback sim` reverts to the previous revision").
- Cite the canonical source (`helm/sim/values.yaml` line numbers, README section, or this skill's reference file).

### 4. Validate before applying

Always run these before telling the user to `helm install` / `helm upgrade`:

```bash
# Schema + value validation
helm lint helm/sim --values <user-values>.yaml

# Render full manifest set to catch template errors
helm template sim helm/sim --values <user-values>.yaml > /tmp/render.yaml

# For upgrades, render against the live release first
helm upgrade --dry-run sim helm/sim --values <user-values>.yaml
```

If lint or template fails, fix the values — do not work around chart validation. The chart's `fail` statements exist to catch misconfigurations that would otherwise surface as `CrashLoopBackOff` at runtime.
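The lint-then-render order above could be wrapped in a small gate function so that a failed lint never proceeds to rendering. This is a minimal sketch, not part of the chart: the function name `validate_values`, its argument order, and the render output location are assumptions made for illustration.

```bash
# Run lint, then render; stop at the first failure so a bad values file
# never reaches the cluster.
# Arguments: <values-file> [chart-dir] [release-name]  (hypothetical interface)
validate_values() {
  values_file="$1"
  chart_dir="${2:-helm/sim}"
  release="${3:-sim}"
  render_out="${TMPDIR:-/tmp}/render.yaml"

  helm lint "$chart_dir" --values "$values_file" || {
    echo "lint failed: fix the values, do not patch the chart" >&2
    return 1
  }
  helm template "$release" "$chart_dir" --values "$values_file" > "$render_out" || {
    echo "template rendering failed" >&2
    return 1
  }
  echo "rendered manifests written to $render_out"
}
```

A call such as `validate_values my-values.yaml` returns non-zero on the first failing step, which makes it usable as a precondition in front of `helm install` or `helm upgrade`.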

### 5. Deliver

Every recommendation should include:

- The exact command(s) to run
- A one-line summary of what will change
- The success signal (e.g., "`kubectl rollout status deploy/sim-app` reports `successfully rolled out`")
- The rollback command if something breaks

---

## Quick reference — the three secret modes

| Mode | When | Code path |
|---|---|---|
| **Inline (`--set`)** | Dev / kind / dry-run only. Values leak into `helm get values`. | `app.env.<KEY>: "..."` |
| **Pre-existing Secret** | GitOps with Sealed Secrets / SOPS, or hand-managed Secrets. Chart references a Secret you create. | `app.secrets.existingSecret.enabled: true` + `.name` |
| **External Secrets Operator (recommended for prod)** | Vault, AWS SM, Azure KV, GCP SM. Chart renders an `ExternalSecret` that ESO syncs. | `externalSecrets.enabled: true` + `secretStoreRef` + `remoteRefs.app.<KEY>` |

These modes are **mutually exclusive** for the app Secret. ESO takes precedence over inline. `existingSecret` takes precedence over inline. The chart **fails template rendering** when ESO is enabled and a required key (`BETTER_AUTH_SECRET`, `ENCRYPTION_KEY`, `INTERNAL_API_SECRET`, plus `CRON_SECRET` when cronjobs are enabled) is neither in `app.env` nor mapped in `remoteRefs.app` — see `references/install-paths.md`.

---

## Quick reference — the four required secrets

| Key | Generate with | Notes |
|---|---|---|
| `BETTER_AUTH_SECRET` | `openssl rand -hex 32` | Session signing |
| `ENCRYPTION_KEY` | `openssl rand -hex 32` | App-level encryption |
| `INTERNAL_API_SECRET` | `openssl rand -hex 32` | Service-to-service auth (app ↔ realtime) |
| `CRON_SECRET` | `openssl rand -hex 32` | Required iff `cronjobs.enabled=true` (default true) |

Optional but commonly needed:

| Key | Generate with | Notes |
|---|---|---|
| `API_ENCRYPTION_KEY` | `openssl rand -hex 32` | Must be **exactly 64 hex chars**. Required to encrypt user API keys at rest. |
| `postgresql.auth.password` | `openssl rand -base64 24 \| tr -d '/+='` | Only if using chart-bundled Postgres. Must match `^[a-zA-Z0-9._-]+$` for DATABASE_URL compatibility. |

See `references/secrets.md` for storage patterns and rotation guidance.
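The format constraints in the tables above (64 hex characters, and the `^[a-zA-Z0-9._-]+$` password pattern) can be checked locally before the values go anywhere near a cluster. The sketch below assumes only `openssl`, `tr`, and `grep`; the helper names are hypothetical and exist only for this example.

```bash
# Generate a 64-hex-character secret (32 random bytes).
gen_hex_secret() { openssl rand -hex 32; }

# Generate a Postgres password, stripping the base64 characters
# that are unsafe inside DATABASE_URL.
gen_pg_password() { openssl rand -base64 24 | tr -d '/+='; }

# Succeed only when the argument is exactly 64 hex characters.
is_64_hex() { printf '%s' "$1" | grep -Eq '^[0-9a-fA-F]{64}$'; }

# Succeed only when the argument matches the chart's password pattern.
is_url_safe_password() { printf '%s' "$1" | grep -Eq '^[a-zA-Z0-9._-]+$'; }

API_ENCRYPTION_KEY="$(gen_hex_secret)"
POSTGRES_PASSWORD="$(gen_pg_password)"

if is_64_hex "$API_ENCRYPTION_KEY" && is_url_safe_password "$POSTGRES_PASSWORD"; then
  echo "generated values match the documented formats"
else
  echo "generated values do not match the documented formats" >&2
fi
```

The same two predicates can be pointed at values pulled from a secret store, which is where a wrong length or a stray `/` is more likely to appear than in freshly generated output.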

---

## Rules of engagement

These are non-negotiable. Violating any of these has burned users in the past.

1. **Never recommend `--set` for production secrets.** They land in `helm get values` and Helm release history. Direct users to `existingSecret` or ESO.
2. **Never set `image.tag: latest`.** The chart defaults to `Chart.AppVersion` for a reason — reproducible rollouts. If the user pinned `latest`, push back.
3. **Never edit chart templates to work around a `fail` statement.** The validation exists because a misconfiguration would otherwise surface as a runtime CrashLoopBackOff with cryptic env errors.
4. **Never drop `automountServiceAccountToken: false`** unless the workload genuinely needs in-cluster API access (Sim's app/realtime/postgres pods do not).
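5. **Never `kubectl delete sts` without `--cascade=orphan`** on a live Postgres. A plain delete also removes the pods and takes the database offline. PVCs created from `volumeClaimTemplates` are retained by default, but not when the StatefulSet sets `persistentVolumeClaimRetentionPolicy.whenDeleted: Delete`.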
6. **Never tell a user "the chart works on your cluster" without `helm lint` + `helm template` against their values.** Static reading is not validation.
7. **Always confirm before `helm uninstall` in a shared namespace.** PVCs survive but other namespace resources may not.

---

## When the user is stuck and you can't diagnose

Collect state and logs from every component in one pass. This single block answers ~80% of "it's broken" questions:

```bash
kubectl --namespace <ns> get pods
kubectl --namespace <ns> get events --sort-by='.lastTimestamp'
kubectl --namespace <ns> logs deploy/sim-app --tail=200
kubectl --namespace <ns> logs deploy/sim-realtime --tail=200
kubectl --namespace <ns> logs sts/sim-postgresql --tail=200
kubectl --namespace <ns> logs job/sim-migrations --tail=200 2>/dev/null
kubectl --namespace <ns> describe pod -l app.kubernetes.io/name=sim
```

Then map the symptom to `references/troubleshooting.md`.

---

## What this skill does **not** cover

- Sim application configuration beyond env vars (provider keys, knowledge base setup, etc.) — that's the Sim app docs at https://docs.sim.ai
- Kubernetes cluster setup (creating an EKS cluster, installing ingress-nginx, etc.) — that's cloud-provider docs
- Authoring new chart templates — that's `helm/sim/templates/_helpers.tpl` and the chart's own contributor docs
- Running Sim outside Kubernetes (Docker Compose, bare-metal) — see the root `README.md`

If the user's question falls outside this scope, say so and point them at the right doc.
192 changes: 192 additions & 0 deletions helm/sim/.claude/skills/sim-helm/references/install-paths.md
@@ -0,0 +1,192 @@
# Install Path Selection

Three mutually-exclusive paths for the app Secret. Pick exactly one. The chart enforces this at template time.

## Decision tree

```
Is this a production install?
├── No (dev / kind / minikube / dry-run)
│     → Inline `--set` is fine. Skip to "Path A".
└── Yes
      Do you already manage secrets with Vault / AWS Secrets Manager /
      Azure Key Vault / GCP Secret Manager / 1Password Connect?
      ├── Yes → External Secrets Operator. Path C.
      └── No
            Do you use GitOps with Sealed Secrets, SOPS, or
            hand-managed Kubernetes Secrets?
            ├── Yes → Pre-existing Secret. Path B.
            └── No  → Install ESO and go to Path C.
                      (Don't skip to inline `--set` for prod:
                       secrets land in `helm get values` and release history.)
```
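The decision tree reduces to three yes/no answers. As a rough sketch, with a hypothetical function name and argument order chosen only for this example, the same logic reads:

```bash
# Map the three decision-tree answers to an install path letter.
# Arguments: <production: yes|no> <uses-secret-manager: yes|no> <gitops-or-hand-managed: yes|no>
pick_install_path() {
  # Non-production installs may use inline --set (Path A).
  if [ "$1" != "yes" ]; then echo "A"; return; fi
  # An existing secret manager means External Secrets Operator (Path C).
  if [ "$2" = "yes" ]; then echo "C"; return; fi
  # Sealed Secrets / SOPS / hand-managed Secrets use Path B;
  # otherwise install ESO and use Path C. Path A is never returned for prod.
  if [ "$3" = "yes" ]; then echo "B"; else echo "C"; fi
}

pick_install_path yes no yes
```

Note that no combination of answers returns Path A for a production install, which mirrors the warning in the last branch of the tree.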

---

## Path A — Inline `--set` (dev only)

```bash
helm install sim ./helm/sim \
  --namespace sim --create-namespace \
  --set app.env.BETTER_AUTH_SECRET=$(openssl rand -hex 32) \
  --set app.env.ENCRYPTION_KEY=$(openssl rand -hex 32) \
  --set app.env.INTERNAL_API_SECRET=$(openssl rand -hex 32) \
  --set app.env.CRON_SECRET=$(openssl rand -hex 32) \
  --set postgresql.auth.password=$(openssl rand -base64 24 | tr -d '/+=')
```

The chart generates a `Secret` named `<release>-app-secrets` containing every non-empty key from `app.env` + `realtime.env`. Both `app` and `realtime` Deployments mount it via `envFrom`.

**Risks:**
- Secrets are visible in `helm get values <release>` and `helm history <release>`.
- Anyone with read access to the Helm release Secret (`sh.helm.release.v1.<release>.v<N>`) can recover the values. Helm 3 stores release data in Secrets by default, and the payload is only gzip-compressed and base64-encoded, not encrypted.
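To make the second risk concrete: Helm 3's default storage driver keeps each revision in a Kubernetes Secret whose `release` field is gzip-compressed JSON, base64-encoded by Helm and then base64-encoded again by Kubernetes. The sketch below decodes such a payload; the helper name `decode_release` and the sample JSON are made up for illustration, and the live-cluster command assumes a release named `sim` at revision 1 in namespace `sim`.

```bash
# Decode a Helm 3 release payload read from stdin:
# base64 (Kubernetes) -> base64 (Helm) -> gzip -> JSON.
decode_release() {
  base64 -d | base64 -d | gzip -dc
}

# Against a live cluster (assumed names):
#   kubectl get secret sh.helm.release.v1.sim.v1 -n sim \
#     -o jsonpath='{.data.release}' | decode_release

# Local round trip with a made-up payload, showing that the encoding
# is reversible without any key.
sample='{"config":{"app":{"env":{"BETTER_AUTH_SECRET":"not-a-real-secret"}}}}'
encoded="$(printf '%s' "$sample" | gzip -c | base64 | base64)"
decoded="$(printf '%s' "$encoded" | decode_release)"
echo "$decoded"
```

Because no key is involved, RBAC read access to Secrets in the release namespace is enough to recover anything passed with `--set`.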

---

## Path B — Pre-existing Kubernetes Secret

Create the Secret first, then point the chart at it.

```bash
kubectl create namespace sim
kubectl create secret generic sim-app-secrets --namespace sim \
  --from-literal=BETTER_AUTH_SECRET=$(openssl rand -hex 32) \
  --from-literal=ENCRYPTION_KEY=$(openssl rand -hex 32) \
  --from-literal=INTERNAL_API_SECRET=$(openssl rand -hex 32) \
  --from-literal=CRON_SECRET=$(openssl rand -hex 32)

kubectl create secret generic sim-postgres-secret --namespace sim \
  --from-literal=POSTGRES_PASSWORD=$(openssl rand -base64 24 | tr -d '/+=')
```

```yaml
# values.yaml
app:
  secrets:
    existingSecret:
      enabled: true
      name: sim-app-secrets

postgresql:
  auth:
    existingSecret:
      enabled: true
      name: sim-postgres-secret
      passwordKey: POSTGRES_PASSWORD
```

**The chart cannot introspect your Secret.** If you forget a required key, the pod will fail at runtime with `CreateContainerConfigError: secret key "X" not found`. The required keys are: `BETTER_AUTH_SECRET`, `ENCRYPTION_KEY`, `INTERNAL_API_SECRET`, plus `CRON_SECRET` when cronjobs are enabled.

For GitOps (Sealed Secrets / SOPS), seal/encrypt the Secret YAML before committing — never commit a plain `kubectl create secret` output.
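Since the chart cannot inspect a pre-existing Secret, the required-key list can be checked by hand before installing. The following is a sketch under stated assumptions: the function name `missing_required_keys` is hypothetical, and it works on a plain list of key names (one per line) so that it can be tried without a cluster.

```bash
# Print each required key that is absent from the key list on stdin.
# Argument: <cronjobs-enabled: true|false> (defaults to true, as in the chart)
missing_required_keys() {
  cron_enabled="${1:-true}"
  required="BETTER_AUTH_SECRET ENCRYPTION_KEY INTERNAL_API_SECRET"
  if [ "$cron_enabled" = "true" ]; then
    required="$required CRON_SECRET"
  fi
  present="$(cat)"
  for key in $required; do
    # -x matches whole lines only, -F treats the key as a literal string.
    printf '%s\n' "$present" | grep -qxF "$key" || printf '%s\n' "$key"
  done
}

# Against a live cluster (Secret name and namespace from the example above):
#   kubectl get secret sim-app-secrets -n sim \
#     -o go-template='{{range $k, $v := .data}}{{$k}}{{"\n"}}{{end}}' \
#     | missing_required_keys true

# Local example: a key list that lacks CRON_SECRET.
missing="$(printf 'BETTER_AUTH_SECRET\nENCRYPTION_KEY\nINTERNAL_API_SECRET\n' | missing_required_keys true)"
echo "missing: ${missing:-none}"
```

An empty result means every required key is present; anything printed should be added to the Secret before running `helm install`.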

---

## Path C — External Secrets Operator (production recommended)

ESO syncs from your existing secret store (Vault, AWS SM, Azure KV, GCP SM, etc.) into a Kubernetes Secret on a refresh interval. The chart renders the `ExternalSecret` resource; ESO does the syncing.

### Prerequisites

1. Install ESO once per cluster:
   ```bash
   helm repo add external-secrets https://charts.external-secrets.io
   helm install external-secrets external-secrets/external-secrets \
     -n external-secrets --create-namespace
   ```
2. Create a `ClusterSecretStore` (or namespace-scoped `SecretStore`) that points at your secret manager. ESO's docs cover the auth wiring for each provider.

### Values

```yaml
externalSecrets:
  enabled: true
  apiVersion: v1beta1   # v1beta1 works on ESO >= 0.7. Bump to v1 only on ESO >= 0.17.
  refreshInterval: 1h
  secretStoreRef:
    name: my-cluster-secret-store
    kind: ClusterSecretStore   # or SecretStore for namespace-scoped
  remoteRefs:
    app:
      BETTER_AUTH_SECRET: sim/app/better-auth-secret
      ENCRYPTION_KEY: sim/app/encryption-key
      INTERNAL_API_SECRET: sim/app/internal-api-secret
      CRON_SECRET: sim/app/cron-secret   # required iff cronjobs.enabled
      # Optional but commonly mapped:
      API_ENCRYPTION_KEY: sim/app/api-encryption-key
      OPENAI_API_KEY: sim/providers/openai
    postgresql:
      password: sim/postgresql/password   # required if postgresql.enabled
    externalDatabase:
      password: sim/postgresql/password   # required if externalDatabase.enabled

# Leave app.env empty (or only set non-secret values like NEXT_PUBLIC_APP_URL).
app:
  env: {}
```

### Fail-fast behavior

The chart will refuse to render if:

- `externalSecrets.enabled=true` and any of `BETTER_AUTH_SECRET`, `ENCRYPTION_KEY`, `INTERNAL_API_SECRET` (or `CRON_SECRET` when cronjobs are enabled) is **neither** set in `app.env` **nor** mapped in `remoteRefs.app`. Error message names the missing key.
- A key is set in `app.env` with a non-empty value but not mapped in `remoteRefs.app` (would be silently dropped from the rendered Secret).

These checks catch the "renders cleanly, CrashLoopBackOffs at runtime" failure mode that plagued earlier chart versions.

### Remote ref shapes

Each `remoteRefs.app.<KEY>` value can be either:

```yaml
# Shorthand — just the path/key in the store
BETTER_AUTH_SECRET: sim/app/better-auth-secret
```

```yaml
# Full form — pass any field ESO supports
BETTER_AUTH_SECRET:
  key: sim/app/better-auth-secret
  property: value              # for stores that return JSON
  version: "v3"                # pin a specific version
  decodingStrategy: Base64     # for base64-stored values
```

---

## Cross-cutting: things that are NOT secrets

Operational tunables (rate limits, timeouts, IVM pool size, branding) live in `app.envDefaults` and `realtime.envDefaults`. They're rendered as **inline `env:`** on the Deployment, not written to the Secret. See `values-model.md` for the full mental model.

Don't try to push these into ESO — they're not sensitive, they'd just bloat the secret store.

---

## Verifying your choice

Before `helm install`, render the chart and check which path is active:

```bash
# What Secret will the pods mount?
helm template sim helm/sim -f my-values.yaml | grep -A2 "envFrom:"

# For ESO: did the ExternalSecret render?
helm template sim helm/sim -f my-values.yaml | grep -B1 -A10 "kind: ExternalSecret"

# For existingSecret: is your pre-created Secret referenced?
helm template sim helm/sim -f my-values.yaml | grep -E "name: .*-app-secrets"
```

For ESO, after `helm install`, verify the sync:

```bash
kubectl get externalsecret -n sim
kubectl describe externalsecret <release>-app-secrets -n sim
# `kubectl get` should show STATUS 'SecretSynced' and READY 'True'
```
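For scripted installs, the manual check above can be replaced by blocking on the `Ready` condition that ESO sets on an `ExternalSecret`. This is a sketch only: the function name and the defaults for namespace and timeout are assumptions, and the resource name follows the `<release>-app-secrets` pattern used above.

```bash
# Block until the ExternalSecret reports Ready, or fail after the timeout.
# Arguments: <release> [namespace] [timeout]  (hypothetical interface)
wait_for_external_secret() {
  release="$1"
  namespace="${2:-sim}"
  timeout="${3:-120s}"
  kubectl wait "externalsecret/${release}-app-secrets" \
    --namespace "$namespace" \
    --for=condition=Ready \
    --timeout="$timeout"
}
```

Calling `wait_for_external_secret sim` returns non-zero if the sync has not completed within the timeout, at which point `kubectl describe externalsecret` usually names the store or key that failed.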