Skip to content

Ruby: add more Array/Enumerable flow summaries#7614

Merged
nickrolfe merged 12 commits intomainfrom
nickrolfe/array_flow_summaries
Feb 9, 2022
Merged

Ruby: add more Array/Enumerable flow summaries#7614
nickrolfe merged 12 commits intomainfrom
nickrolfe/array_flow_summaries

Conversation

@nickrolfe
Copy link
Copy Markdown
Contributor

@nickrolfe nickrolfe commented Jan 17, 2022

Sorry, this is really big, and it might be too tedious to review all the changes.

@github-actions github-actions bot added the Ruby label Jan 17, 2022
@nickrolfe nickrolfe force-pushed the nickrolfe/array_flow_summaries branch 2 times, most recently from 57878ae to 8310a51 Compare January 19, 2022 14:45
@nickrolfe nickrolfe force-pushed the nickrolfe/array_flow_summaries branch from 525a60d to dbcc700 Compare January 27, 2022 12:48
@nickrolfe
Copy link
Copy Markdown
Contributor Author

I've added a commit to enable taint-flow checks in the array-flow test, but it fails with this unexpected result, that I am unable to explain:

| array_flow.rb:4:10:4:13 | ...[...] | Unexpected result: hasTaintFlow=0.1 |

@nickrolfe nickrolfe marked this pull request as ready for review January 27, 2022 13:12
@nickrolfe nickrolfe requested a review from a team as a code owner January 27, 2022 13:12
@nickrolfe nickrolfe force-pushed the nickrolfe/array_flow_summaries branch from 5ffc9a0 to 8248a94 Compare January 28, 2022 11:34
Copy link
Copy Markdown
Contributor

@hmac hmac left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a mammoth piece of work @nickrolfe! I'm impressed and also a bit terrified 😅

I've reviewed most of the Array summaries. I'll take a look at Enumerable tomorrow.

I spent a bit of time trying to find the source of that spurious taint result, but got nowhere I'm afraid.

a.cycle(2) do |x|
sink x # $ hasValueFlow=29
a.combination(1) do |x|
sink(x[0]) # $ hasValueFlow=29
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

combination returns an array of combinations, so it would be good to test that too:

b = a.combination(1)
sink(b[i][j]) # $ hasValueFlow=29

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you sure? It looks like it returns either an enumerator or self:

[14] pry(main)> a
=> [0, 10, 20, 30]
[15] pry(main)> a.combination(1)
=> #<Enumerator: ...>
[16] pry(main)> a.combination(1) { |x| }
=> [0, 10, 20, 30]

I've updated the test to test the return value according to that behaviour.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I was sloppy with my description here. Without a block it returns an enumerator, but the enumerator's elements are combinations, i.e.

pry(main)> [1,2,3].combination(2).to_a
=> [[1, 2], [1, 3], [2, 3]]

So it's the same deal as with the others, where we might want to pretend the enumerator is an array. A quick search indicates it's common to do things like foo.combination(n).to_a or foo.combination(n).each, so we'd need ArrayElement of Receiver -> ArrayElement[?] of ArrayElement[?] of ReturnValue to track flow through these expressions.

or
exists(ArrayIndex elementIndex, int argIndex |
argIndex in [0 .. mc.getNumberOfArguments() - 1] and
elementIndex = mc.getArgument(argIndex).getConstantValue().getInt()
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Out of interest, how does this line differ from this alternative?

elementIndex = getKnownArrayElementContent(mc.getArgument(argIndex))

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not actually sure. I noticed Tom's use of getKnownArrayElementContent at some point, and started using it, but didn't go back to this one. @hvitved is there any difference between them?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, I think that should be the same

Copy link
Copy Markdown
Contributor

@hmac hmac left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pretty much reviewed everything now. I think there's a few missing flow steps and some other super minor things but otherwise it looks 👌

input = "ArrayElement of Receiver" and
output = "Parameter[0] of BlockArgument" and
preservesValue = true
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since

[:a,:taint1,:b].chunk{|x|x == :a ? :taint2 : x}.to_a
=> [[:taint2, [:a]], [:taint1, [:taint1]], [:b, [:b]]]

I think we also want this flow:

  • ArrayElement of Receiver -> ArrayElement[?] of ArrayElement[1] of ArrayElement[?] of ReturnValue
  • ReturnValue of BlockArgument -> ArrayElement[0] of ArrayElement[?] of ReturnValue

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

chunk returns an Enumerator rather than an array, but I think we still want to preserve knowledge of its contents as if it were an array, right?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Like I mentioned in another comment, I didn't implement flow into enumerator return values for any of these methods. I effectively ignored the versions of methods that return enumerators. It's not great, but I'm also not convinced that treating the enumerator as if it were an array is right, either. 🤷

override predicate propagatesFlowExt(string input, string output, boolean preservesValue) {
input = "ArrayElement of Receiver" and
output = ["Parameter[0] of BlockArgument", "Parameter[1] of BlockArgument"] and
preservesValue = true
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similarly here, I think we want
ArrayElement of Receiver -> ArrayElement[?] of ArrayElement[?] of ReturnValue

or
input = "ReturnValue of BlockArgument" and
output = "ArrayElement[?] of ReturnValue" and
preservesValue = true
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When no block is given to map it behaves like itself except wraps the array in an Enumerator. So we could have a summary for CollectWithoutBlockArgument which has flow from Receiver to ReturnValue. This is a real edge case though, so maybe not worth bothering with.

override predicate propagatesFlowExt(string input, string output, boolean preservesValue) {
input = "ArrayElement of Receiver" and
output = "ArrayElement[?] of Parameter[0] of BlockArgument" and
preservesValue = true
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When a block argument is given, each_cons returns the receiver

[1,2,3].each_cons(2) { |x| 9 } == [1,2,3]

When no block is given, each_cons returns an enumerator whose elements are the successive n-tuples

[1,2,3].each_cons(2).to_a == [[1,2], [2,3]]

So we may want this extra flow:

  • ArrayElement of Receiver -> ArrayElement[?] of ReturnValue
  • ArrayElement of Receiver -> ArrayElement[?] of ArrayElement[?] of ReturnValue

@hvitved
Copy link
Copy Markdown
Contributor

hvitved commented Feb 4, 2022

I've added a commit to enable taint-flow checks in the array-flow test, but it fails with this unexpected result, that I am unable to explain:

This is expected, and it happens because * has a direct flow step from its argument to the return value (x may already be an array in *x). And all array read steps are also lifted to taint-steps (to account for the case where a taint flow source is an array), so we should simply tag this result as expected.

Copy link
Copy Markdown
Contributor

@hvitved hvitved left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very impressive and thorough work Nick. I have only spot-checked a few summaries, and they looked fine.

Copy link
Copy Markdown
Contributor Author

@nickrolfe nickrolfe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've addressed a lot of the feedback. Thanks @hmac and @hvitved!

To summarise the remaining issues:

  1. So far, the summaries just kind of ignore the methods (or variants) that return enumerators. @hmac is effectively proposing that we treat them as if they were arrays. I'm not convinced.
    • Side note: a lot of the methods return self if given a block, and an enumerator otherwise. We're currently ignoring the latter variants, but I think the summaries we do have for the block variants should be restricted with exists(mc.getBlock()).
  2. I couldn't make @hvitved's suggested changes to use SimpleSummarizedCallable without introducing non-monotonic recursion errors. Is there a better way to avoid those?

a.cycle(2) do |x|
sink x # $ hasValueFlow=29
a.combination(1) do |x|
sink(x[0]) # $ hasValueFlow=29
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you sure? It looks like it returns either an enumerator or self:

[14] pry(main)> a
=> [0, 10, 20, 30]
[15] pry(main)> a.combination(1)
=> #<Enumerator: ...>
[16] pry(main)> a.combination(1) { |x| }
=> [0, 10, 20, 30]

I've updated the test to test the return value according to that behaviour.

a = [0, 1, source(34)]
a.cycle(2) do |x|
sink x # $ hasValueFlow=34
end
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When called without a block, it returns an enumerator. Your suggested addition wouldn't work:

[26] pry(main)> b = a.cycle(2)
=> #<Enumerator: ...>
[27] pry(main)> b[0]
NoMethodError: undefined method `[]' for #<Enumerator: [0, 10, 20, 30]:cycle(2)>

As far as I know, we're not tracking flow into enumerator return values for any of these methods. So, if we made it b = a.cycle(2).to_a, I don't think we are able to track that flow.

or
exists(ArrayIndex elementIndex, int argIndex |
argIndex in [0 .. mc.getNumberOfArguments() - 1] and
elementIndex = mc.getArgument(argIndex).getConstantValue().getInt()
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not actually sure. I noticed Tom's use of getKnownArrayElementContent at some point, and started using it, but didn't go back to this one. @hvitved is there any difference between them?

input = "ArrayElement of Receiver" and
output = "Parameter[0] of BlockArgument" and
preservesValue = true
}
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Like I mentioned in another comment, I didn't implement flow into enumerator return values for any of these methods. I effectively ignored the versions of methods that return enumerators. It's not great, but I'm also not convinced that treating the enumerator as if it were an array is right, either. 🤷

exists(ArrayIndex i |
input = "ArrayElement[" + i + "] of Receiver" and
output = "ArrayElement[" + i + "] of ReturnValue"
)
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@hvitved – you wrote this one, originally. Any reason not to make the simplification @hmac is suggesting?

@nickrolfe
Copy link
Copy Markdown
Contributor Author

nickrolfe commented Feb 4, 2022

Oh, I see I missed some of Tom's comments, because they were made on an earlier commit, before I moved everything to a new file.

Edit: now addressed.

@hmac
Copy link
Copy Markdown
Contributor

hmac commented Feb 7, 2022

I think the only remaining question mark is what we do about the methods that return enumerators. My argument is we should track flow through them as if they were arrays, so that expressions like arr.each_cons(2).map { |x, y| sink y } will work. Here are some examples of this kind of enumerator chaining in real code.

@asgerf
Copy link
Copy Markdown
Contributor

asgerf commented Feb 8, 2022

+1 for treating enumerables as arrays, or more specifically: treating ArrayElement to mean the content of any kind of enumerable (maybe it could then be renamed to something more appropriate).

We could also use an access path token Element to mean any kind of enumerable content, but then initially implement that as an alias for ArrayElement. That way, the summary rows are accurate, and there's a single place to update if it is later deemed important to distinguish arrays from other enumerables.

@asgerf
Copy link
Copy Markdown
Contributor

asgerf commented Feb 8, 2022

Does the enumerable stuff need to be in this PR though? I'd would simplify the access path rewrite if this PR gets merged ASAP.

@nickrolfe
Copy link
Copy Markdown
Contributor Author

I think the changes here are good enough that we could merge, to let you do your rewrite, and I could handle the rest in a followup PR after that.

However, there were two questions for @hvitved that I would not like to be missed:

  1. Is there any meaningful difference berween elementIndex = mc.getArgument(argIndex).getConstantValue().getInt() and elementIndex = getKnownArrayElementContent(mc.getArgument(argIndex))?
  2. Is there any reason not to make @hmac's suggested simplification of EachSliceSummary?

@hvitved
Copy link
Copy Markdown
Contributor

hvitved commented Feb 8, 2022

maybe it could then be renamed to something more appropriate

It will be renamed as part of my work in implementing flow through hashes.

@hvitved
Copy link
Copy Markdown
Contributor

hvitved commented Feb 8, 2022

However, there were two questions for @hvitved that I would not like to be missed:

D'oh, my comments were still pending.

  1. Is there any meaningful difference berween elementIndex = mc.getArgument(argIndex).getConstantValue().getInt() and elementIndex = getKnownArrayElementContent(mc.getArgument(argIndex))?

No.

Is there any reason not to make @hmac's suggested simplification of EachSliceSummary?

That would allow flow in this case, which we don't want:

x = "tainted"
y = x.each_slice
sink(x)

Copy link
Copy Markdown
Contributor

@hmac hmac left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome work 👍

@nickrolfe nickrolfe merged commit 1eba827 into main Feb 9, 2022
@nickrolfe nickrolfe deleted the nickrolfe/array_flow_summaries branch February 9, 2022 09:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants