python: Container summaries, part 2#13209
Conversation
6683fad to
8b65937
Compare
55dbddf to
b898338
Compare
copy, pop, get, popitem
copy, pop, get, getitem, setdefault
Also add read steps to taint tracking.
Reading from a tainted collection can be done in two situations:
1. There is an acces path
In this case a read step (possibly from a flow summary)
gives rise to a taint step.
2. There is no access path
In this case an explicit taint step (possibly via a flow
summary) should exist.
9cb83fc to
8d4f944
Compare
hvitved
left a comment
There was a problem hiding this comment.
A few drive-by comments.
| class CopySummary extends SummarizedCallable { | ||
| CopySummary() { this = "collection.copy" } | ||
|
|
||
| override DataFlow::CallCfgNode getACall() { |
There was a problem hiding this comment.
Have you considered having both getACallSimple and getACall, as in Ruby? Where the former may not depend on API graphs, and hence can be used in the call graph construction/type tracking?
There was a problem hiding this comment.
Yes, this is added in #13178 where our type trackers learn about flow summaries. After that, we should reformulate these summaries to use getACallSimple.
|
DCA revealed that adding all read steps is too permissive: It interacts with reading from the dict-splat parameter, losing precision in our parameter routing. I have converted to draft until this is resolved. |
|
Simply not adding extra taint steps derived from read steps solves the issue. The new DCA run has much fewer differences. |
How is that possible? Taint steps are only permitted when the access path is empty, and I would assume that dict-splat parameters are modeled using store-steps at the call sites (i.e., yielding a non-empty access path)? |
Well...we turn those store steps into taint steps... |
I think it would be better to remove those instead (perhaps follow-up). Other languages only lift read-collection-steps to taint steps, and not store-collection-steps, as that yields false-positive flow. |
Indeed, I have added this explicitly to https://github.com/github/codeql-python-team/issues/728. |
RasmusWL
left a comment
There was a problem hiding this comment.
Minor typo + one thing I would like to understand better. Otherwise, nice 👍
One possibly worry is that while the steps are recovered, they no longer admit type tracking. An experiment has been started to evaluate any consequences of this. We should not merge this PR unless the outcome is satisfying.
Also, I did not see a conclusion of this point you raised initially. Please let me know 😊
getLocalSource
I see that for get we changed
-<get-call>.getArg(0).asExpr().(StrConst).getText()
+<get-call>.getArg(0).getALocalSource().asExpr().(StrConst).getText()which made us get a few more results... is there any tests that captures this semantic change? (I could not see any)
PS. Let's also do moreDictStoreSteps at some point 👍
| input = "Argument[self]" and | ||
| output = "ReturnValue" and | ||
| preservesValue = false |
There was a problem hiding this comment.
Why does this part not preserve value?
For the sake of the argument, if you used https://docs.python.org/3.11/library/multiprocessing.html#multiprocessing.sharedctypes.copy I think that would very much preserve values -- but let me know your thoughts on this 😊
There was a problem hiding this comment.
My thought were that the operation does not preserve the pointer, which also seems to be what your link says.
Our reasoning is generally on pointers, but if it helps us to deviate here we should certainly consider that.
There was a problem hiding this comment.
I realize that my link is a little stupid, since it's not a method... however, it will be recognized as a MethodCallNode in our analysis still 🤔
Our reasoning is generally on pointers
I'm not sure. I don't think I've ever been part of a discussion where we talked about this aspect of our analysis.
I guess the more philosophical question is, if you copy the contents of a string to a new string located in different part of memory, is that data-flow or taint-flow? C++ has strcpy as data-flow, so that's one datapoint.
I think it's hard to discuss this aspect, since it doesn't seem like we need it for anything? (or maybe we do still need it until we've sorted out our taint handling?)
I would still propose we make this perserve value, but also realize that we shouldn't spend forever discussing this small detail, so if you really don't want to do it, I'll probably live happily with this only being a taint-step.
There was a problem hiding this comment.
I should have put a reference, but could not immediately find it. @taus and I thought it through for type tracking and LocalSourceNodes at some point. But since this step is quite likely to be type preserving (also when we remove time), it is probably fine to make this a dataflow step..
Co-authored-by: Rasmus Wriedt Larsen <rasmuswriedtlarsen@gmail.com>
We did not lose a single alert in the DCA runs, so I concluded that this is not a problem for now.
Well spotted, I stole this pattern from Ruby and it does indeed capture also string constants which are sufficiently nearby.
Absolutely :-) |
RasmusWL
left a comment
There was a problem hiding this comment.
I did not add any tests for this specifically.
Could we have one then? (otherwise we might not notice if we break this in the future)
Sure, it is an easy change :-) |
| d = {SOURCE: "val"} | ||
| i = iter(d) | ||
| n = next(i) | ||
| SINK(n) #$ MISSING: flow="SOURCE, l:-3 -> n" No newline at end of file |
There was a problem hiding this comment.
Curious...looks like that setting never made it to my new machine...
This PR replaces explicit modelling of
copy,pop,get,popitem, andsetdefaultwith flow summaries.As these, especially
get, are very common vectors for receiving taint through web requests, the consequences are potentially major. In the first commit can be seen the consequences of removing the explicit taint steps, and in the second of further removing the explicit data flow steps. The final commit restores all steps by means of flow summaries and observing the changes from all commits simultaneously shows that while some edges have now been split into two and sometimes the middle node has had to be added, the select clauses and all inline expectations are unchanged (except for the recent data flow test documenting these methods, found intest_builtins.py).I added the commits one by one and started CI to document that all tests are honest. Unfortunately, I forgot to fix up the expectations for to of the experimental queries, so the
Commitstab does not show three green dots: In the first two commits, there are small discrepancies intest/experimental/query-tests/Security/CWE-176/UnicodeBypassValidationandtest/experimental/query-tests/Security/CWE-1236/CsvInjection. However, the third commit is all green, showing that all is well in the end.One possibly worry is that while the steps are recovered, they no longer admit type tracking. An experiment has been started to evaluate any consequences of this. We should not merge this PR unless the outcome is satisfying.
Synopsis: