Shared: Add DataFlow::DeduplicatePathGraph #14350
Open
Adds a shared parameterised module, `DataFlow::DeduplicatePathGraph`, for post-processing a `PathGraph` so that it doesn't result in duplicate alerts or alerts with multiple identical paths.

This issue usually arises from using `FlowState`, which is embedded in the `PathNode` but not rendered as part of its string value. This can result in paths that have different intermediate flow states but appear identical to the end user.

The issue with multiple alerts, i.e. seemingly identical rows in the `#select` clause, is particularly bad for tools that attempt to diff results (such as DCA) but do not perform their own deduplication in advance.

The module works by projecting each `PathNode` down to its `(node, toString)` value, which is closer to what the end user actually sees. Viewing the path graph as an NFA that accepts input symbols of type `(node, toString)`, we try to minimise this NFA by merging states.

This is needed by the JavaScript data-flow migration, but I've put this in its own PR so it can be reviewed separately. I've used the library in a Ruby query that had some very ad-hoc alert deduplication logic. Note that the expected output diff is mainly due to reordering of result sets in the output.
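
To illustrate the projection idea described above, here is a minimal Python sketch. It is not the QL implementation in this PR: the `PathNode` tuple, the `project` and `deduplicate` helpers, and the sample locations are all hypothetical, and the merge shown is the most naive one possible (collapse every path node that shares a `(node, toString)` value). A real state-merging step would presumably need to be more careful, since collapsing states this way can introduce paths that did not exist in the original graph.

```python
from typing import NamedTuple


class PathNode(NamedTuple):
    """Hypothetical stand-in for a path node: what the user sees plus a hidden state."""
    node: str   # underlying data-flow node, e.g. a source location
    text: str   # the rendered toString value
    state: str  # flow state, not shown to the end user


def project(p: PathNode) -> tuple[str, str]:
    """Drop the flow state, keeping only the (node, toString) pair."""
    return (p.node, p.text)


def deduplicate(edges, results):
    """Naively merge path nodes that share a projection.

    Returns the projected edge set and the projected (source, sink) result set.
    """
    merged_edges = {(project(a), project(b)) for a, b in edges}
    merged_results = {(project(src), project(sink)) for src, sink in results}
    return merged_edges, merged_results


# Two paths that differ only in their intermediate and final flow states.
src = PathNode("a.rb:1", "params", "tainted")
mid1 = PathNode("a.rb:2", "x", "tainted")
mid2 = PathNode("a.rb:2", "x", "escaped")
sink1 = PathNode("a.rb:3", "call", "tainted")
sink2 = PathNode("a.rb:3", "call", "escaped")

edges = [(src, mid1), (src, mid2), (mid1, sink1), (mid2, sink2)]
results = [(src, sink1), (src, sink2)]

merged_edges, merged_results = deduplicate(edges, results)
```

In this example the four original edges collapse to two and the two seemingly identical alerts collapse to one, which is the effect the module aims for on the `#select` rows.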