feat(guardrails): run PII via Presidio sidecars + TS recognizer registry #5174
TheodoreSpeaks wants to merge 6 commits into
- resolve the guardrails venv via candidate paths and fail fast instead of silently falling back to system python3 (the misleading "Presidio not installed" that broke redaction and the guardrails block in deployed runtimes)
- install the en_core_web_lg spaCy model in setup.sh and app.Dockerfile
- route log redaction through an internal /api/guardrails/mask-batch endpoint so Presidio always runs in the app container, including async executions that persist inside the trigger.dev runtime
- chunk maskPIIBatchViaHttp by count (2000) and bytes (256KB) so large executions split across requests and never hit the contract's 100k cap
- add AbortSignal.timeout(45s) per request so a slow/unreachable app container aborts and the caller scrubs, instead of hanging the trigger.dev job
- catch maskPIIBatch failures in the route: log and return a structured 500 (broken venv fails loudly server-side; caller still scrubs, no leak)
- add mask-client tests (order across chunks, count split, non-2xx, empty)
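The chunking and per-request timeout described above could look roughly like the following. This is a minimal sketch, not the PR's code: `chunkByCountAndBytes`, `maskInChunks`, and the `sendChunk` callback are hypothetical names, and only the limits (2000 items, 256KB, 45s) come from the commit message.

```typescript
// Limits taken from the commit message; the constant names are assumptions.
const MAX_ITEMS_PER_CHUNK = 2000;
const MAX_BYTES_PER_CHUNK = 256 * 1024;

// Split strings into chunks bounded by item count and UTF-8 byte size.
// A single string larger than maxBytes still gets a chunk of its own.
function chunkByCountAndBytes(
  items: string[],
  maxItems: number = MAX_ITEMS_PER_CHUNK,
  maxBytes: number = MAX_BYTES_PER_CHUNK,
): string[][] {
  const encoder = new TextEncoder();
  const chunks: string[][] = [];
  let current: string[] = [];
  let currentBytes = 0;

  for (const item of items) {
    const size = encoder.encode(item).length;
    const wouldOverflow =
      current.length >= maxItems || currentBytes + size > maxBytes;
    if (current.length > 0 && wouldOverflow) {
      chunks.push(current);
      current = [];
      currentBytes = 0;
    }
    current.push(item);
    currentBytes += size;
  }
  if (current.length > 0) chunks.push(current);
  return chunks;
}

// Send chunks one after another so masked results keep input order.
// `sendChunk` is a hypothetical stand-in for the real HTTP call.
async function maskInChunks(
  items: string[],
  sendChunk: (chunk: string[], signal: AbortSignal) => Promise<string[]>,
  timeoutMs: number = 45000,
): Promise<string[]> {
  const out: string[] = [];
  for (const chunk of chunkByCountAndBytes(items)) {
    // A fresh timeout per request: one slow request aborts on its own
    // and rejects, leaving the caller free to fall back to scrubbing.
    const masked = await sendChunk(chunk, AbortSignal.timeout(timeoutMs));
    out.push(...masked);
  }
  return out;
}
```

Sending chunks sequentially keeps ordering trivial at the cost of latency; the byte limit is checked before the item is added, so a chunk only exceeds it when a single string is itself larger than the limit.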
A single token (5min TTL) could expire mid-batch when a large execution fans out into many sequential chunk requests; mint one per request instead.
- replace the per-call python3 subprocess (cold spaCy load every call) with two long-lived Presidio sidecars (analyzer + anonymizer) reached over HTTP; the app image no longer carries Python/Presidio/venv
- add PRESIDIO_ANALYZER_URL / PRESIDIO_ANONYMIZER_URL
- move VIN out of Python into a TS recognizer (check-digit validated) behind a CUSTOM_RECOGNIZERS registry so new custom detectors are one entry; masking is handled uniformly by the anonymizer
- drive the guardrails block's PII type picker from the shared pii-entities catalog (adds VIN, fixes drift) so block + Data Retention never diverge
- delete validate_pii.py, requirements.txt, setup.sh and the Dockerfile venv step
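For readers unfamiliar with VIN check digits, a check-digit-validated recognizer behind a registry might be sketched as below. The weights and letter values are the standard North American VIN scheme; the `Span` and `Recognizer` types, the `"VIN"` entity name, and the exact shape of `CUSTOM_RECOGNIZERS` are assumptions for illustration and may differ from the PR.

```typescript
// Span shape mirrors what Presidio's analyzer returns; assumed here.
interface Span {
  entity_type: string;
  start: number;
  end: number;
  score: number;
}
type Recognizer = (text: string) => Span[];

// Position weights and letter transliteration for the VIN check digit.
const VIN_WEIGHTS = [8, 7, 6, 5, 4, 3, 2, 10, 0, 9, 8, 7, 6, 5, 4, 3, 2];
const VIN_LETTER_VALUES: Record<string, number> = {
  A: 1, B: 2, C: 3, D: 4, E: 5, F: 6, G: 7, H: 8,
  J: 1, K: 2, L: 3, M: 4, N: 5, P: 7, R: 9,
  S: 2, T: 3, U: 4, V: 5, W: 6, X: 7, Y: 8, Z: 9,
};

// True when the candidate is 17 valid characters (no I, O, Q) and the
// 9th character equals the weighted sum modulo 11 (10 is written "X").
function isValidVin(vin: string): boolean {
  if (!/^[A-HJ-NPR-Z0-9]{17}$/.test(vin)) return false;
  let sum = 0;
  for (let i = 0; i < 17; i++) {
    const ch = vin[i];
    const value = ch >= "0" && ch <= "9" ? Number(ch) : VIN_LETTER_VALUES[ch];
    sum += value * VIN_WEIGHTS[i];
  }
  const remainder = sum % 11;
  const expected = remainder === 10 ? "X" : String(remainder);
  return vin[8] === expected;
}

// Find 17-character candidates, keep only those passing the check digit.
const recognizeVin: Recognizer = (text) => {
  const spans: Span[] = [];
  const pattern = /\b[A-HJ-NPR-Z0-9]{17}\b/g;
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(text)) !== null) {
    if (isValidVin(match[0])) {
      spans.push({
        entity_type: "VIN", // assumed entity name
        start: match.index,
        end: match.index + match[0].length,
        score: 1.0,
      });
    }
  }
  return spans;
};

// Hypothetical registry shape: one entry per custom detector.
const CUSTOM_RECOGNIZERS: Record<string, Recognizer> = {
  VIN: recognizeVin,
};
```

Because the recognizer only emits spans, a new detector in this shape needs no masking logic of its own; the spans can be handed to the anonymizer alongside Presidio's built-in results.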
PR Summary
High Risk
Overview
Adds an internal JWT-protected ... VIN moves to a TypeScript ...
Reviewed by Cursor Bugbot for commit 91ce2d1.
@greptile review
@BugBot review
- maskPIIBatch runs per-string sidecar calls with bounded concurrency (8) via mapWithConcurrency, so a chunk of many small leaves finishes within the 45s request timeout instead of aborting and scrubbing; order + fail-on-error kept
- drop stale comments referencing the deleted Python venv / 30s subprocess timeout
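A bounded-concurrency map with the properties named above (fixed worker limit, results in input order, rejection on first error) is commonly written as a small worker pool. The sketch below is one such version under that assumption; the PR's actual `mapWithConcurrency` may have a different signature.

```typescript
// Run `fn` over `items` with at most `limit` calls in flight.
// Results land at their input index, so output order matches input order.
// The returned promise rejects on the first error; workers stop picking up
// new items once any call has failed.
async function mapWithConcurrency<T, R>(
  items: readonly T[],
  limit: number,
  fn: (item: T, index: number) => Promise<R>,
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;
  let failed = false;

  async function worker(): Promise<void> {
    while (next < items.length && !failed) {
      const index = next++; // safe: JS runs this synchronously per worker
      try {
        results[index] = await fn(items[index], index);
      } catch (err) {
        failed = true;
        throw err;
      }
    }
  }

  const workerCount = Math.min(Math.max(1, limit), items.length);
  const workers: Promise<void>[] = [];
  for (let i = 0; i < workerCount; i++) workers.push(worker());
  await Promise.all(workers);
  return results;
}
```

With a limit of 8, a chunk of N strings takes roughly N/8 sidecar round trips instead of N, which is the difference that lets a chunk of many small strings finish inside a fixed request timeout.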
@greptile review
@BugBot review
✅ Bugbot reviewed your changes and found no new issues!
Summary
- Replaces the per-call `python3` subprocess (which cold-loaded the ~1GB spaCy model on every call) with two long-lived Presidio sidecar containers (analyzer + anonymizer) reached over `localhost`. The model loads once at container start and is reused. The app image no longer carries Python/Presidio/venv.
- Log redaction goes through the internal `/api/guardrails/mask-batch` route, which calls the sidecars. Gated by the existing `pii-redaction` feature flag.
- VIN detection moves out of Python into a check-digit-validated TS recognizer behind a `CUSTOM_RECOGNIZERS` registry, so adding a future detector Presidio lacks is one entry + a catalog line. Masking is uniform: the anonymizer replaces any span by its `entity_type`, so custom entities mask for free.
- The guardrails block's PII type picker is driven from the shared `pii-entities` catalog (adds VIN, fixes pre-existing drift) so the block and Data Retention settings never diverge.
- Adds `PRESIDIO_ANALYZER_URL` / `PRESIDIO_ANONYMIZER_URL`. Deletes `validate_pii.py`, `requirements.txt`, `setup.sh`, and the Dockerfile venv step.

Deploy order: infra first. The two sidecars are added to the app ECS task in the infra repo (separate change) and must be deployed + healthy before this merges.
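The analyze-then-anonymize flow in the summary can be sketched against Presidio's REST endpoints (`POST /analyze` and `POST /anonymize`). This is an illustration only: `maskWithSidecars`, `postJson`, and the injected `post` parameter are hypothetical names, the language is assumed to be `"en"`, and error handling is reduced to a thrown error.

```typescript
// Span shape returned by the analyzer and accepted by the anonymizer.
interface Span {
  entity_type: string;
  start: number;
  end: number;
  score: number;
}
type PostJson = (url: string, body: unknown) => Promise<unknown>;

// Default transport: plain fetch with a JSON body (hypothetical helper).
const postJson: PostJson = async (url, body) => {
  const res = await fetch(url, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(body),
  });
  if (!res.ok) throw new Error("Presidio request failed: " + res.status);
  return res.json();
};

// Detect spans with the analyzer sidecar, append spans from custom
// recognizers, then let the anonymizer sidecar mask all of them.
// With no explicit operators, Presidio's default replaces each span
// with its entity type in angle brackets, e.g. <PERSON>.
async function maskWithSidecars(
  text: string,
  analyzerUrl: string, // e.g. the value of PRESIDIO_ANALYZER_URL
  anonymizerUrl: string, // e.g. the value of PRESIDIO_ANONYMIZER_URL
  customSpans: Span[] = [],
  post: PostJson = postJson,
): Promise<string> {
  const detected = (await post(analyzerUrl + "/analyze", {
    text,
    language: "en", // assumed language
  })) as Span[];

  const analyzerResults = [...detected, ...customSpans];
  if (analyzerResults.length === 0) return text; // nothing to mask

  const result = (await post(anonymizerUrl + "/anonymize", {
    text,
    analyzer_results: analyzerResults,
  })) as { text: string };
  return result.text;
}
```

Passing custom spans in `analyzer_results` is what makes masking uniform in this sketch: the anonymizer does not need to know which recognizer produced a span.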
Type of Change
Testing
- `vin.test.ts`, `validate_pii.test.ts`, plus existing `mask-client` / `pii-redaction` / mask-batch route tests (29 passing)
- `mcr.microsoft.com/presidio-analyzer` + `presidio-anonymizer` images (analyze spans, default `replace` → `<ENTITY_TYPE>`, custom VIN entity masked, `/health`)
- `bun run lint` clean, `bun run check:api-validation:strict` passes, full `sim` build + TypeScript clean

Checklist