Skip to content

feat(pii): build & own combined PII (analyzer + anonymizer) image#5176

Open
TheodoreSpeaks wants to merge 5 commits into
stagingfrom
feat/presidio-image
Open

feat(pii): build & own combined PII (analyzer + anonymizer) image#5176
TheodoreSpeaks wants to merge 5 commits into
stagingfrom
feat/presidio-image

Conversation

@TheodoreSpeaks

@TheodoreSpeaks TheodoreSpeaks commented Jun 23, 2026

Copy link
Copy Markdown
Collaborator

Summary

  • Build and own a single Presidio image (docker/pii.Dockerfile) replacing the stock mcr.microsoft.com/presidio-analyzer + presidio-anonymizer sidecars. A thin FastAPI service (docker/pii/server.py) builds one AnalyzerEngine + one AnonymizerEngine at startup and serves both on one port (3000): /health, /supportedentities, /analyze, /anonymize (stock-compatible shapes) — so the app needs one PRESIDIO_URL.
  • Multi-language (en/es/it/pl/fi) via a NlpEngineProvider with five pinned spaCy lg models: en_core_web_lg + es/it/pl/fi_core_news_lg, all 3.8.0. Pinned presidio==2.2.362, spacy==3.8.14, fastapi/uvicorn on python:3.12-slim-bookworm, non-root. Models fetched with curl retry/resume (direct pip wheel URLs truncate).
  • Native VIN recognizer (ISO-3779 check digit) ported from validate_pii.py, registered under all five languages (it's language-agnostic). Plus explicit registration of recognizers Presidio ships but does NOT load by default (confirmed: they don't auto-load even with the NLP model present):
    • en locale ids: UK_NINO, AU_ABN/ACN/TFN/MEDICARE, IN_PAN/AADHAAR/VEHICLE_REGISTRATION/VOTER/PASSPORT, SG_NRIC_FIN, SG_UEN
    • national ids: ES_NIF/NIE (es), IT_FISCAL_CODE/DRIVER_LICENSE/VAT_CODE/PASSPORT/IDENTITY_CARD (it), PL_PESEL (pl), FI_PERSONAL_IDENTITY_CODE (fi)
  • Wire CI: ci.yml (build-dev, build-amd64, ghcr-arm64, manifests + both ECR ternaries) and images.yml; new secret ECR_PII, GHCR ghcr.io/simstudioai/pii. AMD64 for staging/prod, ARM64+manifest on main — mirrors app/db/realtime.

Verification (full 5-model image built + run locally)

  • Image: 6.07 GB. Cold start: ~9s (5 lg models load at import). Warm RAM: ~3.3 GiB idle.
  • VIN: detected under en AND es (the multi-language fix); invalid check digit rejected.
  • National ids fire in-language: ES_NIF, IT_FISCAL_CODE, PL_PESEL (checksum-validated), FI_PERSONAL_IDENTITY_CODE all detect with score 1.0; absent before explicit registration.
  • Per-language /supportedentities returns the expected sets (en includes VIN + UK/AU/IN/SG; es/it/pl/fi include VIN + their national ids; global pattern recognizers — email/phone/IP/IBAN/etc. — present in every language automatically). Note: NER-derived labels (PERSON/LOCATION/ORGANIZATION/AGE/ID) vary per model.
  • Contract: missing text → 422 (not 500); /anonymize default → <EMAIL_ADDRESS>.
  • bun run lint + bun run check:api-validation:strict pass; both workflow YAMLs parse.

Review feedback addressed (Bugbot + Greptile)

  • VIN registered per-language (was en-only → missed under non-English routing).
  • HEALTHCHECK --start-period 40s → 180s (measured cold start ~9s locally, but generous for slower ECS CPUs).
  • Dropped --no-cache-dir so the pip cache mount works.
  • Pydantic request models → missing text returns 422; operator type defaults to replace instead of KeyError→500.

Required infra companion changes (NOT in this PR — infra repo)

  • Create ECR repo sim/pii; add GitHub secret ECR_PII=sim/pii (dev/staging/prod accounts).
  • deployment-templates/app/taskdef.json: collapse the two presidio-* sidecars into one container pii on the new ECR image, single containerPort: 3000, health /health; drop 5001/5002. Sizing: image is ~6 GB and idles at ~3.3 GiB RAM with five lg models — set container memory accordingly (≥4 GiB reservation, headroom for load) and bump ephemeral storage for the image.
  • App container env: replace PRESIDIO_ANALYZER_URL/PRESIDIO_ANONYMIZER_URL with PRESIDIO_URL=http://localhost:3000.

App follow-ups (separate PR — out of scope here)

  • Current app talks to Presidio via a Python subprocess (validate_pii.tsvalidate_pii.py), not HTTP. Rewire to call PRESIDIO_URL over HTTP; add PRESIDIO_URL to env.ts; drop the guardrails venv install from app.Dockerfile.
  • Remove the TS VIN path (vin.ts, recognizers.ts) — VIN is now native (and multi-language) in the image.
  • Workflow-log redaction language: logger.ts:623 calls redactPIIFromExecution(payload, { entityTypes }) with no language, so it's hardcoded to 'en' (the default in pii-redaction.ts). Add a language to the Data Retention config (PiiRedactionConfig, default 'en') and thread it through logger.ts → redactPIIFromExecution → maskPIIBatch. This also fixes a latent consistency gap: an org can select a national-id entity (e.g. ES_NIF) for redaction, but under 'en' its recognizer never loads, so it's silently never redacted. (The guardrails block already passes piiLanguage per call — leave it as-is.)
  • pii-entities.ts / piiLanguage align with the image; drop orphaned entities with no selectable language: KR_RRN (ko) and TH_TNIN (th).

Type of Change

  • New feature (build infra / CI)

Testing

Tested manually — full 5-model image built and run locally (see "Verification").

Checklist

  • Code follows project style guidelines
  • Self-reviewed my changes
  • Tests added/updated and passing
  • No new warnings introduced
  • I confirm that I have read and agree to the terms outlined in the Contributor License Agreement (CLA)

Replace the stock mcr.microsoft.com/presidio-* sidecar images with a single
image we build and push to ECR/GHCR. A thin FastAPI service constructs one
AnalyzerEngine + one AnonymizerEngine at startup and serves both on port 3000
(/health, /supportedentities, /analyze, /anonymize) so the app needs one
PRESIDIO_URL. English only; pinned presidio 2.2.362 + en_core_web_lg 3.8.0.

Bakes in the native check-digit VIN recognizer and registers 12 English
recognizers Presidio ships but does not load by default (UK_NINO, AU_*, IN_*,
SG_*), taking the supported English set from 19 to 32.
@vercel

vercel Bot commented Jun 23, 2026

Copy link
Copy Markdown

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Actions Updated (UTC)
docs Building Building Preview, Comment Jun 23, 2026 7:26am

Request Review

@cursor

cursor Bot commented Jun 23, 2026

Copy link
Copy Markdown

PR Summary

Medium Risk
New security-adjacent PII detection path and large (~6GB) ECS sidecar with heavy memory/cold-start requirements; app/infra wiring to use it is still out of band for this PR.

Overview
Introduces a single owned PII container (docker/pii.Dockerfile + apps/pii/) that replaces separate Microsoft Presidio analyzer/anonymizer sidecars. A FastAPI app warms one AnalyzerEngine and one AnonymizerEngine at startup and exposes stock-style /health, /supportedentities, /analyze, and /anonymize on port 3000, so callers can use one PRESIDIO_URL.

The analyzer is configured for en/es/it/pl/fi with five pinned spaCy lg models in the image, plus explicit registration of locale/national-id recognizers Presidio does not load by default and a VIN recognizer with ISO 3779 check-digit validation registered per language. Requests use Pydantic models (e.g. missing text → 422; anonymize operator type defaults to replace).

CI adds the PII image to dev/staging/main build matrices in ci.yml and images.yml, resolves ECR_PII, and pushes ECR-only (no GHCR when ghcr_image is omitted). GHCR tagging is gated on a non-empty ghcr_image so ECR-only matrix entries do not get invalid tags. Workspace metadata adds @sim/pii in package.json / bun.lock without changing the JS turbo build.

Reviewed by Cursor Bugbot for commit 2b3bce1. Bugbot is set up for automated code reviews on this repo. Configure here.

@TheodoreSpeaks

Copy link
Copy Markdown
Collaborator Author

@greptile review

@greptile-apps

greptile-apps Bot commented Jun 23, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

This PR replaces the two-sidecar Presidio setup (separate analyzer + anonymizer containers) with a single combined image built around a thin FastAPI service. The service loads five spaCy lg models at startup and serves stock-compatible /analyze and /anonymize endpoints on one port, with VIN check-digit validation and explicit registration of locale-specific recognizers (UK, AU, IN, SG, ES, IT, PL, FI) that Presidio ships but does not load by default.

  • docker/presidio/server.py: Pydantic request models handle missing-field validation correctly; pop(\"type\", \"replace\") already supplies a safe default; the VIN ISO-3779 check-digit algorithm is correctly implemented. The entities or None expression silently converts an empty list to None, causing all entities to be analyzed instead of none.
  • docker/presidio.Dockerfile: Single-stage build; pip cache mounts are used correctly; five spaCy models are downloaded with curl retry/resume; build-essential is installed for native deps but never removed, leaving compiler toolchain in the runtime image.
  • CI (ci.yml / images.yml): Presidio is wired into every matrix job; the ECR ternary chain in ci.yml is extended; images.yml resolves secrets via secrets[matrix.ecr_repo_secret] and requires no ternary update.

Confidence Score: 5/5

Safe to merge; new infra files only, no changes to existing app logic.

The change adds an entirely new Docker image and CI matrix entries without touching any existing code paths. The FastAPI server is well-structured with Pydantic validation, the VIN check-digit implementation is correct, and the CI wiring mirrors the established pattern for other images.

No files require special attention; all changed files are new additions to the build infrastructure.

Important Files Changed

Filename Overview
docker/presidio/server.py New combined FastAPI service; Pydantic models handle validation correctly; VIN check-digit algorithm is sound; operator pop default already present.
docker/presidio.Dockerfile Single-stage build; pip cache mounts correct; build-essential installed but never uninstalled, remaining in the production image layer.
docker/presidio/requirements.txt All deps pinned to specific versions: presidio 2.2.362, spacy 3.8.14, fastapi 0.138.0, uvicorn 0.49.0.
.github/workflows/ci.yml Presidio added to all four matrix jobs; ECR ternary chain extended to include ECR_PRESIDIO.
.github/workflows/images.yml Presidio added to all three matrices; uses secrets[matrix.ecr_repo_secret] dynamic syntax so no ternary update needed.

Reviews (3): Last reviewed commit: "improvement(presidio): address review fe..." | Re-trigger Greptile

Comment thread docker/presidio.Dockerfile Outdated
Comment thread docker/presidio.Dockerfile Outdated
Comment thread docker/presidio/server.py Outdated
Comment thread docker/presidio/server.py Outdated
@greptile-apps

greptile-apps Bot commented Jun 23, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

This PR introduces a single combined Presidio image (docker/presidio.Dockerfile) that replaces two separate upstream sidecars with one FastAPI service on port 3000, reducing the required env config to a single PRESIDIO_URL. CI and image workflows are wired consistently for dev/staging/prod across AMD64 and ARM64 targets.

  • New server.py: one warm AnalyzerEngine (VIN check-digit recognizer + 12 extra English recognizers) and one AnonymizerEngine at startup; exposes /health, /supportedentities, /analyze, and /anonymize with stock-compatible payload shapes.
  • presidio.Dockerfile: python:3.12-slim-bookworm, pinned deps, en_core_web_lg wheel fetched with curl --retry/resume to handle the ~400 MB download, non-root user at runtime.
  • CI changes: matrix entries and ECR ternary expressions updated in all four relevant jobs in ci.yml; images.yml already uses dynamic secret lookup so only matrix entries needed extending.

Confidence Score: 4/5

Safe to merge; the new Presidio sidecar is additive infrastructure with no app-side wiring in this PR

The VIN check-digit algorithm and all CI matrix updates are correct. The three findings are confined to the new FastAPI server: the pip cache mount is silently wasted (not a runtime failure), and two endpoint handlers will return 500 instead of 422 for malformed payloads — unlikely in practice since callers are internal, but worth fixing before the follow-up HTTP-wiring PR lands.

docker/presidio/server.py — input validation gaps in /analyze and /anonymize; docker/presidio.Dockerfile — contradictory pip cache flags

Important Files Changed

Filename Overview
docker/presidio/server.py New FastAPI service combining analyzer + anonymizer; VIN check-digit logic is correct per FMVSS 115; two endpoints lack input validation and will raise unhandled KeyErrors on malformed requests
docker/presidio.Dockerfile New Dockerfile for combined Presidio image; non-root user, curl retry/resume for large model wheel, and apt cache mounts are all correct; pip cache mount is negated by --no-cache-dir flag making it a no-op
docker/presidio/requirements.txt All deps fully pinned; spaCy model download deliberately handled outside pip via curl in the Dockerfile; no issues
.github/workflows/ci.yml All four matrix jobs correctly extended for presidio; ECR ternary updated in both places that need it
.github/workflows/images.yml Uses dynamic secret lookup via secrets[matrix.ecr_repo_secret] so no ternary update was needed; all three relevant matrices extended correctly

Sequence Diagram

%%{init: {'theme': 'neutral'}}%%
sequenceDiagram
    participant App as App Container
    participant P as Presidio Sidecar (localhost:3000)
    participant AE as AnalyzerEngine
    participant AnE as AnonymizerEngine

    App->>P: GET /health
    P-->>App: "{status: ok}"

    App->>P: "GET /supportedentities?language=en"
    P->>AE: get_supported_entities(en)
    AE-->>P: [32 entity types]
    P-->>App: [PERSON, EMAIL_ADDRESS, VIN, ...]

    App->>P: "POST /analyze {text, language, entities}"
    P->>AE: analyze(text, language, entities)
    AE-->>P: [RecognizerResult...]
    P-->>App: "[{entity_type, start, end, score}...]"

    App->>P: "POST /anonymize {text, analyzer_results, anonymizers}"
    P->>AnE: anonymize(text, analyzer_results, operators)
    AnE-->>P: EngineResult
    P-->>App: "{text: redacted, items: [...]}"
Loading
%%{init: {'theme': 'base', 'themeVariables': {"darkMode": true, "background": "#0d1117", "primaryColor": "#21262d", "primaryTextColor": "#e6edf3", "primaryBorderColor": "#8b949e", "lineColor": "#8b949e", "textColor": "#e6edf3", "edgeLabelBackground": "#161b22", "actorBkg": "#21262d", "actorBorder": "#8b949e", "actorTextColor": "#e6edf3", "actorLineColor": "#8b949e", "signalColor": "#8b949e", "signalTextColor": "#e6edf3", "noteBkgColor": "#373320", "noteBorderColor": "#d4a72c", "noteTextColor": "#f0e6c0", "labelBoxBkgColor": "#21262d", "labelBoxBorderColor": "#8b949e", "labelTextColor": "#e6edf3", "loopTextColor": "#e6edf3", "activationBkgColor": "#30363d", "activationBorderColor": "#8b949e"}}}%%
sequenceDiagram
    participant App as App Container
    participant P as Presidio Sidecar (localhost:3000)
    participant AE as AnalyzerEngine
    participant AnE as AnonymizerEngine

    App->>P: GET /health
    P-->>App: "{status: ok}"

    App->>P: "GET /supportedentities?language=en"
    P->>AE: get_supported_entities(en)
    AE-->>P: [32 entity types]
    P-->>App: [PERSON, EMAIL_ADDRESS, VIN, ...]

    App->>P: "POST /analyze {text, language, entities}"
    P->>AE: analyze(text, language, entities)
    AE-->>P: [RecognizerResult...]
    P-->>App: "[{entity_type, start, end, score}...]"

    App->>P: "POST /anonymize {text, analyzer_results, anonymizers}"
    P->>AnE: anonymize(text, analyzer_results, operators)
    AnE-->>P: EngineResult
    P-->>App: "{text: redacted, items: [...]}"
Loading

Reviews (2): Last reviewed commit: "feat(presidio): build & own combined ana..." | Re-trigger Greptile

Comment thread docker/presidio.Dockerfile Outdated
Comment thread docker/presidio/server.py Outdated
Comment thread docker/presidio/server.py Outdated
Configure a multi-language spaCy NLP engine (en/es/it/pl/fi lg models) and
explicitly register the national-id recognizers Presidio ships but does not
load by default: ES_NIF/NIE, IT_FISCAL_CODE/DRIVER_LICENSE/VAT_CODE/PASSPORT/
IDENTITY_CARD, PL_PESEL, FI_PERSONAL_IDENTITY_CODE. Verified the NLP-engine +
explicit-registration path detects in-language (Finnish id, score 1.0).

@cursor cursor Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 2 potential issues.

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit 815d875. Configure here.

Comment thread apps/pii/server.py
Comment thread docker/pii.Dockerfile
- Register VIN under all served languages, not just en (Bugbot: VIN missed for
  non-English language routing).
- Bump HEALTHCHECK start-period to 180s — five lg models load at import (Bugbot).
- Drop --no-cache-dir so the pip cache mount actually works (Greptile).
- Pydantic request models for /analyze + /anonymize so missing 'text' returns
  422 not 500; default operator 'type' to 'replace' instead of KeyError->500
  (Greptile).
@TheodoreSpeaks

Copy link
Copy Markdown
Collaborator Author

@greptile review

Rename the image/repo/secret/files from 'presidio' to 'pii' for clarity — the
service does PII detection + anonymization (and backs the guardrails block's
block/mask), not just redaction, and 'pii' matches existing pii-* naming.

docker/presidio.Dockerfile -> docker/pii.Dockerfile
docker/presidio/ -> docker/pii/
ghcr.io/simstudioai/presidio -> .../pii
ECR_PRESIDIO secret -> ECR_PII (infra side already renamed)
No behavior change — paths/identifiers only.
@TheodoreSpeaks TheodoreSpeaks changed the title feat(presidio): build & own combined analyzer+anonymizer image feat(pii): build & own combined PII (analyzer + anonymizer) image Jun 23, 2026
- Move server.py + requirements.txt from docker/pii/ to apps/pii/ (source belongs
  under apps/, matching app/realtime; Dockerfile stays in docker/). Add a minimal
  @sim/pii package.json so the apps/* bun workspace glob accepts the Python service.
- Repoint docker/pii.Dockerfile COPY paths to apps/pii/; rename the container user
  presidio -> pii.
- Drop GHCR for pii — it's a private ECS sidecar pulled from ECR, never published.
  Removed it from the arm64/manifest (GHCR-only) jobs and guarded the build-amd64
  tag step to skip GHCR when no ghcr_image is set.
@TheodoreSpeaks TheodoreSpeaks requested a review from a team as a code owner June 23, 2026 07:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant