Skip to content

feat(guardrails): run PII via Presidio sidecars + TS recognizer registry#5174

Open
TheodoreSpeaks wants to merge 6 commits into
stagingfrom
fix/pii-presidio-sidecar
Open

feat(guardrails): run PII via Presidio sidecars + TS recognizer registry#5174
TheodoreSpeaks wants to merge 6 commits into
stagingfrom
fix/pii-presidio-sidecar

Conversation

@TheodoreSpeaks

Copy link
Copy Markdown
Collaborator

Summary

  • Replace the per-call python3 subprocess (which cold-loaded the ~1GB spaCy model on every call) with two long-lived Presidio sidecar containers (analyzer + anonymizer) reached over localhost. The model loads once at container start and is reused. The app image no longer carries Python/Presidio/venv.
  • Works for async runs too: the log-redaction persist path runs in the trigger.dev runtime (no sidecar), so it HTTP-hops to the app container's /api/guardrails/mask-batch route, which calls the sidecars. Gated by the existing pii-redaction feature flag.
  • Move VIN out of Python into a TS recognizer (ISO-3779 check-digit) behind a CUSTOM_RECOGNIZERS registry — adding a future detector Presidio lacks is one entry + a catalog line. Masking is uniform: the anonymizer replaces any span by its entity_type, so custom entities mask for free.
  • Drive the guardrails block's PII-type picker from the shared pii-entities catalog (adds VIN, fixes pre-existing drift) so the block and Data Retention settings never diverge.
  • New env: PRESIDIO_ANALYZER_URL / PRESIDIO_ANONYMIZER_URL. Deletes validate_pii.py, requirements.txt, setup.sh, and the Dockerfile venv step.

Deploy order: infra first. The two sidecars are added to the app ECS task in the infra repo (separate change) and must be deployed + healthy before this merges.

Type of Change

  • Bug fix / improvement (PII redaction was non-functional in deployed runtimes)

Testing

  • Unit: vin.test.ts, validate_pii.test.ts, plus existing mask-client / pii-redaction / mask-batch route tests (29 passing)
  • Verified request/response shapes live against the official mcr.microsoft.com/presidio-analyzer + presidio-anonymizer images (analyze spans, default replace<ENTITY_TYPE>, custom VIN entity masked, /health)
  • bun run lint clean, bun run check:api-validation:strict passes, full sim build + TypeScript clean

Checklist

  • Code follows project style guidelines
  • Self-reviewed my changes
  • Tests added/updated and passing
  • No new warnings introduced
  • I confirm that I have read and agree to the terms outlined in the Contributor License Agreement (CLA)

- resolve the guardrails venv via candidate paths and fail fast instead of
  silently falling back to system python3 (the misleading "Presidio not
  installed" that broke redaction and the guardrails block in deployed runtimes)
- install the en_core_web_lg spaCy model in setup.sh and app.Dockerfile
- route log redaction through an internal /api/guardrails/mask-batch endpoint
  so Presidio always runs in the app container, including async executions that
  persist inside the trigger.dev runtime
- chunk maskPIIBatchViaHttp by count (2000) and bytes (256KB) so large
  executions split across requests and never hit the contract's 100k cap
- add AbortSignal.timeout(45s) per request so a slow/unreachable app container
  aborts and the caller scrubs, instead of hanging the trigger.dev job
- catch maskPIIBatch failures in the route: log and return a structured 500
  (broken venv fails loudly server-side; caller still scrubs, no leak)
- add mask-client tests (order across chunks, count split, non-2xx, empty)
A single token (5min TTL) could expire mid-batch when a large execution
fans out into many sequential chunk requests; mint one per request instead.
- replace the per-call python3 subprocess (cold spaCy load every call) with
  two long-lived Presidio sidecars (analyzer + anonymizer) reached over HTTP;
  the app image no longer carries Python/Presidio/venv
- add PRESIDIO_ANALYZER_URL / PRESIDIO_ANONYMIZER_URL
- move VIN out of Python into a TS recognizer (check-digit validated) behind a
  CUSTOM_RECOGNIZERS registry so new custom detectors are one entry; masking is
  handled uniformly by the anonymizer
- drive the guardrails block's PII type picker from the shared pii-entities
  catalog (adds VIN, fixes drift) so block + Data Retention never diverge
- delete validate_pii.py, requirements.txt, setup.sh and the Dockerfile venv step
@vercel

vercel Bot commented Jun 23, 2026

Copy link
Copy Markdown

The latest updates on your projects. Learn more about Vercel for GitHub.

1 Skipped Deployment
Project Deployment Actions Updated (UTC)
docs Skipped Skipped Jun 23, 2026 12:38am

Request Review

@cursor

cursor Bot commented Jun 23, 2026

Copy link
Copy Markdown

PR Summary

High Risk
PII handling and log persistence paths change architecture-wide; PII redaction depends on sidecars being deployed and reachable before this ships, or masking fails closed to scrub markers.

Overview
Replaces per-call Python subprocess PII detection with HTTP calls to long-lived Presidio analyzer/anonymizer sidecars (PRESIDIO_ANALYZER_URL / PRESIDIO_ANONYMIZER_URL), and drops Python, venv, and Presidio from the app Docker image.

Adds an internal JWT-protected POST /api/guardrails/mask-batch route plus maskPIIBatchViaHttp (chunked requests, timeouts). Log redaction in the trigger.dev runtime now masks via that HTTP hop instead of in-process Python, still scrubbing to [REDACTION_FAILED] on failure.

VIN moves to a TypeScript CUSTOM_RECOGNIZERS registry (vin.ts); spans merge with Presidio before anonymization. The guardrails block’s PII picker is driven from shared PII_ENTITY_GROUPS (adds VIN, aligns with Data Retention).

Reviewed by Cursor Bugbot for commit 91ce2d1. Bugbot is set up for automated code reviews on this repo. Configure here.

Comment thread apps/sim/lib/guardrails/validate_pii.ts
@greptile-apps

greptile-apps Bot commented Jun 23, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

Replaces the per-call Python subprocess (which cold-loaded the ~1 GB spaCy model on every invocation) with two long-lived Presidio sidecar containers reached over HTTP, and ports VIN detection into a TS recognizer with ISO 3779 check-digit validation. The log-redaction persist path (which also runs in the trigger.dev runtime) now routes masking through an internal /api/guardrails/mask-batch endpoint so Presidio stays centralized in the app container.

  • Sidecar architecture: validate_pii.ts is rewritten to call PRESIDIO_ANALYZER_URL/analyze and PRESIDIO_ANONYMIZER_URL/anonymize directly; mask-client.ts adds the byte/count-chunked HTTP client used by the pii-redaction path, minting per-request internal JWTs to handle long multi-chunk batches.
  • Custom recognizer registry: recognizers.ts introduces CUSTOM_RECOGNIZERS so VIN (and future custom entities) are detected in TS, stripped from the Presidio analyzer request, and their spans forwarded to the anonymizer sidecar for uniform <ENTITY_TYPE> replacement.
  • Shared catalog: guardrails.ts block options and Data Retention settings now both derive from PII_ENTITY_GROUPS in pii-entities.ts, eliminating the previous drift.

Confidence Score: 5/5

Safe to merge once the infra-side sidecar deploy is in place; all failure paths scrub rather than leak PII.

The rewrite is well-structured: sidecar calls are individually timeout-bounded, the fail-safe scrub path is preserved end-to-end, internal auth is enforced on the new route, and the VIN check-digit implementation is correct. The only finding is a style-level contract note on mapWithConcurrency usage with no runtime impact.

No files require special attention beyond confirming the infra-side Presidio sidecar containers are deployed and healthy before this merges (as noted in the PR description).

Important Files Changed

Filename Overview
apps/sim/lib/guardrails/validate_pii.ts Full rewrite from Python subprocess to direct Presidio sidecar HTTP calls; well-structured with correct error handling, span merging, and concurrency
apps/sim/lib/guardrails/mask-client.ts New HTTP client for the trigger.dev→app-container path; correctly chunks by bytes/count, mints per-request tokens, and propagates failures so callers scrub
apps/sim/lib/guardrails/vin.ts ISO 3779 check-digit-validated VIN recognizer; transliteration table and weight array match the standard, global regex usage with matchAll/replace is correct
apps/sim/lib/guardrails/recognizers.ts Clean registry pattern for custom TS-side recognizers; CUSTOM_ENTITY_TYPES set correctly excludes VIN from Presidio analyzer forwarding
apps/sim/app/api/guardrails/mask-batch/route.ts New internal route for centralizing Presidio access; internal JWT auth, contract-validated body, and fail-loud error path that prevents PII leakage
apps/sim/lib/logs/execution/pii-redaction.ts Switches from direct maskPIIBatch import to maskPIIBatchViaHttp; fail-safe scrubbing path unchanged and correct
apps/sim/blocks/blocks/guardrails.ts PII type picker now driven by the shared PII_ENTITY_GROUPS catalog, adding VIN and eliminating drift between block UI and Data Retention settings
docker/app.Dockerfile Removes Python/venv/Presidio install steps; app image no longer carries the ~1 GB spaCy model

Sequence Diagram

%%{init: {'theme': 'neutral'}}%%
sequenceDiagram
    participant GBlock as Guardrails Block
    participant VPII as validate_pii.ts
    participant TSVR as TS Recognizers (VIN)
    participant PA as Presidio Analyzer sidecar
    participant PAnon as Presidio Anonymizer sidecar
    participant TrigDev as trigger.dev runtime
    participant PiiRed as pii-redaction.ts
    participant MaskCl as mask-client.ts
    participant MaskRoute as /api/guardrails/mask-batch
    Note over GBlock,PAnon: Real-time guardrails path (in-process)
    GBlock->>VPII: validatePII / maskPIIBatch
    VPII->>TSVR: detect(text) → VIN spans
    VPII->>PA: POST /analyze (non-custom entities)
    PA-->>VPII: Presidio spans
    VPII->>PAnon: POST /anonymize (all spans)
    PAnon-->>VPII: masked text
    VPII-->>GBlock: PIIValidationResult
    Note over TrigDev,PAnon: Log-redaction persist path (trigger.dev → app container)
    TrigDev->>PiiRed: redactPIIFromExecution
    PiiRed->>MaskCl: maskPIIBatchViaHttp(texts, entityTypes)
    MaskCl->>MaskRoute: POST /api/guardrails/mask-batch (internal JWT, chunked)
    MaskRoute->>VPII: maskPIIBatch(texts)
    VPII->>TSVR: detect per text
    VPII->>PA: "POST /analyze per text (concurrency=8)"
    PA-->>VPII: spans
    VPII->>PAnon: POST /anonymize per text
    PAnon-->>VPII: masked text
    VPII-->>MaskRoute: masked[]
    MaskRoute-->>MaskCl: "{masked:[…]}"
    MaskCl-->>PiiRed: masked texts
    PiiRed-->>TrigDev: redacted payload
Loading
%%{init: {'theme': 'base', 'themeVariables': {"darkMode": true, "background": "#0d1117", "primaryColor": "#21262d", "primaryTextColor": "#e6edf3", "primaryBorderColor": "#8b949e", "lineColor": "#8b949e", "textColor": "#e6edf3", "edgeLabelBackground": "#161b22", "actorBkg": "#21262d", "actorBorder": "#8b949e", "actorTextColor": "#e6edf3", "actorLineColor": "#8b949e", "signalColor": "#8b949e", "signalTextColor": "#e6edf3", "noteBkgColor": "#373320", "noteBorderColor": "#d4a72c", "noteTextColor": "#f0e6c0", "labelBoxBkgColor": "#21262d", "labelBoxBorderColor": "#8b949e", "labelTextColor": "#e6edf3", "loopTextColor": "#e6edf3", "activationBkgColor": "#30363d", "activationBorderColor": "#8b949e"}}}%%
sequenceDiagram
    participant GBlock as Guardrails Block
    participant VPII as validate_pii.ts
    participant TSVR as TS Recognizers (VIN)
    participant PA as Presidio Analyzer sidecar
    participant PAnon as Presidio Anonymizer sidecar
    participant TrigDev as trigger.dev runtime
    participant PiiRed as pii-redaction.ts
    participant MaskCl as mask-client.ts
    participant MaskRoute as /api/guardrails/mask-batch
    Note over GBlock,PAnon: Real-time guardrails path (in-process)
    GBlock->>VPII: validatePII / maskPIIBatch
    VPII->>TSVR: detect(text) → VIN spans
    VPII->>PA: POST /analyze (non-custom entities)
    PA-->>VPII: Presidio spans
    VPII->>PAnon: POST /anonymize (all spans)
    PAnon-->>VPII: masked text
    VPII-->>GBlock: PIIValidationResult
    Note over TrigDev,PAnon: Log-redaction persist path (trigger.dev → app container)
    TrigDev->>PiiRed: redactPIIFromExecution
    PiiRed->>MaskCl: maskPIIBatchViaHttp(texts, entityTypes)
    MaskCl->>MaskRoute: POST /api/guardrails/mask-batch (internal JWT, chunked)
    MaskRoute->>VPII: maskPIIBatch(texts)
    VPII->>TSVR: detect per text
    VPII->>PA: "POST /analyze per text (concurrency=8)"
    PA-->>VPII: spans
    VPII->>PAnon: POST /anonymize per text
    PAnon-->>VPII: masked text
    VPII-->>MaskRoute: masked[]
    MaskRoute-->>MaskCl: "{masked:[…]}"
    MaskCl-->>PiiRed: masked texts
    PiiRed-->>TrigDev: redacted payload
Loading

Reviews (3): Last reviewed commit: "fix(guardrails): bound-parallelize mask ..." | Re-trigger Greptile

Comment thread apps/sim/app/api/guardrails/mask-batch/route.ts Outdated
Comment thread apps/sim/lib/guardrails/mask-client.ts Outdated
@TheodoreSpeaks

Copy link
Copy Markdown
Collaborator Author

@greptile review

@TheodoreSpeaks

Copy link
Copy Markdown
Collaborator Author

@BugBot review

Comment thread apps/sim/lib/guardrails/mask-client.ts
- maskPIIBatch runs per-string sidecar calls with bounded concurrency (8) via
  mapWithConcurrency, so a chunk of many small leaves finishes within the 45s
  request timeout instead of aborting and scrubbing; order + fail-on-error kept
- drop stale comments referencing the deleted Python venv / 30s subprocess timeout
@TheodoreSpeaks

Copy link
Copy Markdown
Collaborator Author

@greptile review

@TheodoreSpeaks

Copy link
Copy Markdown
Collaborator Author

@BugBot review

@cursor cursor Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ Bugbot reviewed your changes and found no new issues!

Comment @cursor review or bugbot run to trigger another review on this PR

Reviewed by Cursor Bugbot for commit 91ce2d1. Configure here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant