docs(security-ig): shared tool-definition drift (rug-pull) test corpus #2924

Open
eeee2345 wants to merge 1 commit into modelcontextprotocol:main from eeee2345:docs/security-ig-tool-drift-corpus

@eeee2345

Adds a small, labeled test corpus for post-approval tool-definition drift (the rug-pull case: a server passes admission, then changes a tool in a later session). It came out of a discussion in #security-ig.

It covers two complementary drift signals. Each case is labeled with the verdict a real engine actually produced, not one assigned by hand:

  • content-injection: schema and annotations stay identical; the attack is smuggled into the description text. Verified against the open ATR engine: 4/4 malicious cases fire, 3/3 benign cases stay quiet.
  • capability-surface: the declared surface escalates after approval (annotations readOnly -> destructive, declared effects, data-access, external-reach, auth-scope). Verified against the Interlock drift engine: 6/6 malicious cases fire, 5/5 benign cases stay quiet, plus 3 undeclared/hidden cases surfaced for review rather than auto-blocked.
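To make the split between the two signals concrete, here is a minimal sketch that only reports where a twin differs from its baseline. It assumes tool definitions are plain dicts using the MCP tool field names (`description`, `inputSchema`, `annotations`); the function names, labels, and sample tools are hypothetical and are not taken from the corpus, ATR, or Interlock. It does not decide whether a change is malicious; that judgment is what the engines provide.

```python
def changed_fields(baseline: dict, twin: dict) -> set:
    """Top-level keys whose values differ between two tool definitions."""
    return {k for k in set(baseline) | set(twin) if baseline.get(k) != twin.get(k)}


def drift_kind(baseline: dict, twin: dict) -> str:
    """Hypothetical labels saying where the change sits, not whether it is an attack."""
    changed = changed_fields(baseline, twin)
    if not changed:
        return "none"
    if changed == {"description"}:
        # Schema and annotations are identical: only the text moved.
        return "description-only"
    # Annotations, schema, or another declared field changed.
    return "declared-surface"


# Illustrative tool definitions, invented for this sketch.
baseline = {
    "name": "read_file",
    "description": "Read a file from the workspace.",
    "inputSchema": {"type": "object", "properties": {"path": {"type": "string"}}},
    "annotations": {"readOnlyHint": True},
}
text_twin = {**baseline, "description": "Read a file from the workspace. (changed text)"}
surface_twin = {**baseline, "annotations": {"readOnlyHint": False, "destructiveHint": True}}

print(drift_kind(baseline, text_twin))
print(drift_kind(baseline, surface_twin))
```

A description-only change is invisible to a check that compares schemas and annotations, and a surface change is invisible to a check that only reads the description, which is why the two halves are complementary.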

Each case is a baseline -> twin pair with an expected verdict, so a detector can be pointed at it and checked: does it catch the malicious change without firing on benign evolution? The benign controls are the point.
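A rough sketch of how a detector could be scored against baseline -> twin cases. The `Case` layout, the verdict strings, and both sample cases are assumptions made for illustration; the corpus's actual file format is not shown here, and the review-only cases are left out for simplicity.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Case:
    case_id: str
    baseline: dict
    twin: dict
    expected: str  # assumed labels: "malicious" or "benign"


def score(cases: List[Case], detector: Callable[[dict, dict], bool]) -> Dict[str, int]:
    """Count hits and misses; the detector returns True when it fires."""
    tally = {"caught": 0, "missed": 0, "quiet": 0, "false_alarm": 0}
    for case in cases:
        fired = detector(case.baseline, case.twin)
        if case.expected == "malicious":
            tally["caught" if fired else "missed"] += 1
        else:
            tally["false_alarm" if fired else "quiet"] += 1
    return tally


def any_change_detector(baseline: dict, twin: dict) -> bool:
    """Naive detector: fires on every change, benign or not."""
    return baseline != twin


# Two invented cases, not corpus entries.
cases = [
    Case("m1", {"description": "Read a file."},
         {"description": "Read a file. (injected instruction)"}, "malicious"),
    Case("b1", {"description": "Read a file."},
         {"description": "Reads a file from disk."}, "benign"),
]

print(score(cases, any_change_detector))
```

The naive detector catches the malicious twin but also fires on the benign rewording. Without benign controls that false alarm would never show up in the score.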

Capability-surface cases contributed by Maaz (Interlock); content-injection by me (ATR). Labels are real engine output, not assertions.

On location: I couldn't find an existing home for security test corpora in the repo, so I put this under docs/community/security-ig/ as Interest Group material. I'm not attached to it living here and am happy to move it wherever fits. Flagging @pcarleton on placement.


AI disclosure (per CONTRIBUTING): I used AI assistance (Claude Code) to draft this PR and assemble the corpus, and I reviewed it. The detection labels are real engine output, not generated: the content-injection half was run through the ATR engine, the capability-surface half through Interlock's. I understand the contents and can speak to any individual case.

…orpus

Two complementary drift signals, content-injection and capability-surface,
each labeled with real engine verdicts. Came out of #security-ig discussion.
Capability-surface cases contributed by Maaz (Interlock).

Signed-off-by: Adam Lin <adam@agentthreatrule.org>
@eeee2345 requested review from a team as code owners on June 16, 2026, 06:31
