docs(security-ig): shared tool-definition drift (rug-pull) test corpus#2924
Open
eeee2345 wants to merge 1 commit into
Open
docs(security-ig): shared tool-definition drift (rug-pull) test corpus#2924eeee2345 wants to merge 1 commit into
eeee2345 wants to merge 1 commit into
Conversation
…orpus Two complementary drift signals, content-injection and capability-surface, each labeled with real engine verdicts. Came out of #security-ig discussion. Capability-surface cases contributed by Maaz (Interlock). Signed-off-by: Adam Lin <adam@agentthreatrule.org>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Adds a small, labeled test corpus for post-approval tool-definition drift (the rug-pull case: a server passes admission, then changes a tool on a later session). It came out of a discussion in #security-ig.
It covers two complementary drift signals, each labeled with verdicts from a real engine (not asserted):
Each case is baseline -> twin with an expected verdict, so a detector can be pointed at it and checked: does it catch the malicious change without firing on benign evolution? The benign controls are the point.
Capability-surface cases contributed by Maaz (Interlock); content-injection by me (ATR). Labels are real engine output, not assertions.
On location: I couldn't find an existing home for security test corpora in the repo, so I put this under docs/community/security-ig/ as Interest Group material. Not attached to it living here, happy to move it wherever fits, flagging @pcarleton on placement.
AI disclosure (per CONTRIBUTING): I used AI assistance (Claude Code) to draft this PR and assemble the corpus, and I reviewed it. The detection labels are real engine output, not generated: the content-injection half was run through the ATR engine, the capability-surface half through Interlock's. I understand the contents and can speak to any individual case.