Skip to content

python: Container summaries, part 2#13209

Merged
yoff merged 8 commits into
github:mainfrom
yoff:python/container-summaries-2
Jun 13, 2023
Merged

python: Container summaries, part 2#13209
yoff merged 8 commits into
github:mainfrom
yoff:python/container-summaries-2

Conversation

@yoff
Copy link
Copy Markdown
Contributor

@yoff yoff commented May 17, 2023

This PR replaces explicit modelling of copy, pop, get, popitem, and setdefault with flow summaries.
As these, especially get, are very common vectors for receiving taint through web requests, the consequences are potentially major. In the first commit can be seen the consequences of removing the explicit taint steps, and in the second of further removing the explicit data flow steps. The final commit restores all steps by means of flow summaries and observing the changes from all commits simultaneously shows that while some edges have now been split into two and sometimes the middle node has had to be added, the select clauses and all inline expectations are unchanged (except for the recent data flow test documenting these methods, found in test_builtins.py).

I added the commits one by one and started CI to document that all tests are honest. Unfortunately, I forgot to fix up the expectations for to of the experimental queries, so the Commits tab does not show three green dots: In the first two commits, there are small discrepancies in test/experimental/query-tests/Security/CWE-176/UnicodeBypassValidation and test/experimental/query-tests/Security/CWE-1236/CsvInjection. However, the third commit is all green, showing that all is well in the end.

One possibly worry is that while the steps are recovered, they no longer admit type tracking. An experiment has been started to evaluate any consequences of this. We should not merge this PR unless the outcome is satisfying.

Synopsis:

  • First commit removes explicit steps from taint tracking
  • Second commit removes explicit steps from data flow
  • Third commit adds corresponding summaries, thereby restoring (and improving) the data- and taint flow relations.
  • Fourth commit removes the new extra taint steps derived from read steps. This addition will be treated separately when we handle https://github.com/github/codeql-python-team/issues/728.

@yoff yoff closed this May 24, 2023
@yoff yoff force-pushed the python/container-summaries-2 branch from 6683fad to 8b65937 Compare May 24, 2023 09:15
@github-actions github-actions Bot removed the Python label May 24, 2023
@erik-krogh erik-krogh reopened this May 24, 2023
@yoff yoff force-pushed the python/container-summaries-2 branch 6 times, most recently from 55dbddf to b898338 Compare May 25, 2023 08:57
yoff added 3 commits May 26, 2023 13:22
copy, pop, get, popitem
copy, pop, get, getitem, setdefault

Also add read steps to taint tracking.

Reading from a tainted collection can be done in two situations:
1. There is an acces path
    In this case a read step (possibly from a flow summary)
    gives rise to a taint step.
2. There is no access path
    In this case an explicit taint step (possibly via a flow
    summary) should exist.
@yoff yoff force-pushed the python/container-summaries-2 branch 2 times, most recently from 9cb83fc to 8d4f944 Compare May 26, 2023 12:05
@yoff yoff added the Awaiting evaluation Do not merge yet, this PR is waiting for an evaluation to finish label May 30, 2023
@yoff yoff marked this pull request as ready for review May 30, 2023 10:01
@yoff yoff requested a review from a team as a code owner May 30, 2023 10:01
@yoff yoff added the no-change-note-required This PR does not need a change note label May 30, 2023
Copy link
Copy Markdown
Contributor

@hvitved hvitved left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few drive-by comments.

class CopySummary extends SummarizedCallable {
CopySummary() { this = "collection.copy" }

override DataFlow::CallCfgNode getACall() {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have you considered having both getACallSimple and getACall, as in Ruby? Where the former may not depend on API graphs, and hence can be used in the call graph construction/type tracking?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this is added in #13178 where our type trackers learn about flow summaries. After that, we should reformulate these summaries to use getACallSimple.

Comment thread python/ql/lib/semmle/python/frameworks/Stdlib.qll
@yoff yoff marked this pull request as draft June 1, 2023 06:04
@yoff
Copy link
Copy Markdown
Contributor Author

yoff commented Jun 1, 2023

DCA revealed that adding all read steps is too permissive: It interacts with reading from the dict-splat parameter, losing precision in our parameter routing. I have converted to draft until this is resolved.

@yoff yoff marked this pull request as ready for review June 1, 2023 15:51
@yoff yoff removed the Awaiting evaluation Do not merge yet, this PR is waiting for an evaluation to finish label Jun 2, 2023
@yoff
Copy link
Copy Markdown
Contributor Author

yoff commented Jun 2, 2023

Simply not adding extra taint steps derived from read steps solves the issue. The new DCA run has much fewer differences.
I previously convinced myself that these extra taint steps were necessary, but now the ones we need are simply part of the summaries.

@hvitved
Copy link
Copy Markdown
Contributor

hvitved commented Jun 6, 2023

DCA revealed that adding all read steps is too permissive: It interacts with reading from the dict-splat parameter, losing precision in our parameter routing. I have converted to draft until this is resolved.

How is that possible? Taint steps are only permitted when the access path is empty, and I would assume that dict-splat parameters are modeled using store-steps at the call sites (i.e., yielding a non-empty access path)?

@calumgrant calumgrant requested a review from RasmusWL June 6, 2023 09:29
@yoff
Copy link
Copy Markdown
Contributor Author

yoff commented Jun 6, 2023

How is that possible? Taint steps are only permitted when the access path is empty, and I would assume that dict-splat parameters are modeled using store-steps at the call sites (i.e., yielding a non-empty access path)?

Well...we turn those store steps into taint steps...

@hvitved
Copy link
Copy Markdown
Contributor

hvitved commented Jun 7, 2023

Well...we turn those store steps into taint steps...

I think it would be better to remove those instead (perhaps follow-up). Other languages only lift read-collection-steps to taint steps, and not store-collection-steps, as that yields false-positive flow.

@yoff
Copy link
Copy Markdown
Contributor Author

yoff commented Jun 7, 2023

I think it would be better to remove those instead (perhaps follow-up). Other languages only lift read-collection-steps to taint steps, and not store-collection-steps, as that yields false-positive flow.

Indeed, I have added this explicitly to https://github.com/github/codeql-python-team/issues/728.

Copy link
Copy Markdown
Member

@RasmusWL RasmusWL left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor typo + one thing I would like to understand better. Otherwise, nice 👍

One possibly worry is that while the steps are recovered, they no longer admit type tracking. An experiment has been started to evaluate any consequences of this. We should not merge this PR unless the outcome is satisfying.

Also, I did not see a conclusion of this point you raised initially. Please let me know 😊

getLocalSource

I see that for get we changed

-<get-call>.getArg(0).asExpr().(StrConst).getText()
+<get-call>.getArg(0).getALocalSource().asExpr().(StrConst).getText()

which made us get a few more results... is there any tests that captures this semantic change? (I could not see any)


PS. Let's also do moreDictStoreSteps at some point 👍

Comment on lines +3974 to +3976
input = "Argument[self]" and
output = "ReturnValue" and
preservesValue = false
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why does this part not preserve value?

For the sake of the argument, if you used https://docs.python.org/3.11/library/multiprocessing.html#multiprocessing.sharedctypes.copy I think that would very much preserve values -- but let me know your thoughts on this 😊

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My thought were that the operation does not preserve the pointer, which also seems to be what your link says.
Our reasoning is generally on pointers, but if it helps us to deviate here we should certainly consider that.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I realize that my link is a little stupid, since it's not a method... however, it will be recognized as a MethodCallNode in our analysis still 🤔

Our reasoning is generally on pointers

I'm not sure. I don't think I've ever been part of a discussion where we talked about this aspect of our analysis.

I guess the more philosophical question is, if you copy the contents of a string to a new string located in different part of memory, is that data-flow or taint-flow? C++ has strcpy as data-flow, so that's one datapoint.

I think it's hard to discuss this aspect, since it doesn't seem like we need it for anything? (or maybe we do still need it until we've sorted out our taint handling?)

I would still propose we make this perserve value, but also realize that we shouldn't spend forever discussing this small detail, so if you really don't want to do it, I'll probably live happily with this only being a taint-step.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I should have put a reference, but could not immediately find it. @taus and I thought it through for type tracking and LocalSourceNodes at some point. But since this step is quite likely to be type preserving (also when we remove time), it is probably fine to make this a dataflow step..

Comment thread python/ql/lib/semmle/python/frameworks/Stdlib.qll Outdated
Co-authored-by: Rasmus Wriedt Larsen <rasmuswriedtlarsen@gmail.com>
@yoff
Copy link
Copy Markdown
Contributor Author

yoff commented Jun 13, 2023

One possibly worry is that while the steps are recovered, they no longer admit type tracking. An experiment has been started to evaluate any consequences of this. We should not merge this PR unless the outcome is satisfying.

Also, I did not see a conclusion of this point you raised initially. Please let me know 😊

We did not lose a single alert in the DCA runs, so I concluded that this is not a problem for now.

getLocalSource

I see that for get we changed

-<get-call>.getArg(0).asExpr().(StrConst).getText()
+<get-call>.getArg(0).getALocalSource().asExpr().(StrConst).getText()

which made us get a few more results... is there any tests that captures this semantic change? (I could not see any)

Well spotted, I stole this pattern from Ruby and it does indeed capture also string constants which are sufficiently nearby.
I did not add any tests for this specifically.
Where did you see that it got us more results? I think we got more results by modelling the optional default value.

PS. Let's also do moreDictStoreSteps at some point 👍

Absolutely :-)

@yoff yoff requested a review from RasmusWL June 13, 2023 08:17
Copy link
Copy Markdown
Member

@RasmusWL RasmusWL left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did not add any tests for this specifically.

Could we have one then? (otherwise we might not notice if we break this in the future)

@yoff
Copy link
Copy Markdown
Contributor Author

yoff commented Jun 13, 2023

I did not add any tests for this specifically.

Could we have one then? (otherwise we might not notice if we break this in the future)

Sure, it is an easy change :-)

@yoff yoff requested a review from RasmusWL June 13, 2023 09:21
d = {SOURCE: "val"}
i = iter(d)
n = next(i)
SINK(n) #$ MISSING: flow="SOURCE, l:-3 -> n" No newline at end of file
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Curious...looks like that setting never made it to my new machine...

@yoff yoff merged commit 1d65284 into github:main Jun 13, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

no-change-note-required This PR does not need a change note Python

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants