Skip to content

Conversation

@tausbn
Copy link
Contributor

@tausbn tausbn commented May 22, 2023

This adds application mode extraction queries, using the legacy implementation as a guideline. We expect the behaviour to be mostly equal, but with some differences

  • this includes an UninterestingToModelCharacteristic implementation that excludes some well-known frameworks. These would otherwise make up a large fraction of the candidates and we suggest removing them, as we now have framework mode.
  • this limits the number of negative examples to at most 100 per class, as the negative examples could otherwise easily outweigh the candidates by a factor of 15-20x — this seemed overkill, given that only a handful ever make it into our prompts.

When reviewing this, feel free to run the extraction on a DB locally and tell us if you spot anything that should/shouldn't be a candidate, or +/- example in your opinion.

We've split this up in a large number of commits that are mostly rather small, but a review would be easiest to do on the whole code. YMMV.

@github-actions github-actions bot added the Java label May 22, 2023
@jhelie
Copy link
Contributor

jhelie commented May 24, 2023

Drive by comment but I think CallContext should lift the wider [context][region] rather than the call itself?

@tausbn
Copy link
Contributor Author

tausbn commented May 24, 2023

Drive by comment but I think CallContext should lift the wider [context][region] rather than the call itself?

If I understand correctly, then this is something @kaeluka and I discussed, and were planning on doing (but didn't get around to implementing yet). 👍

Thanks for the reminder! 🙂

@kaeluka kaeluka force-pushed the tausbn/automodel-application-mode branch from 0e6a404 to 9a04124 Compare May 25, 2023 12:16
@kaeluka kaeluka force-pushed the tausbn/automodel-application-mode branch from 3641a76 to a89378d Compare May 26, 2023 11:20
@kaeluka
Copy link

kaeluka commented May 30, 2023

@adityasharad are you able to give this a review?

@kaeluka kaeluka marked this pull request as ready for review May 30, 2023 09:03
@kaeluka kaeluka requested a review from a team as a code owner May 30, 2023 09:03
@kaeluka
Copy link

kaeluka commented Jun 7, 2023

Following up on my above message, and based on the work in the sister PR github/codeml-automodel#117, here is more data on the changes in candidates for the application mode (assessed using the same android source suite):

candidates_duplicates.sarif.txt candidates_additional.sarif.txt candidates_missing.sarif.txt

@kaeluka @tausbn as discussed I think it's worth having a look at these (especially the additional candidates) before merging this PR - to sanity check everything is as we'd expect it.

The data that @jhelie sent us in this PR is the next thing we'll look into. Then, we still need to resolve @adityasharad's comment above. Then this should be mergeable.

@kaeluka
Copy link

kaeluka commented Jun 8, 2023

I started looking into the additional sinks in this PR, compared to the ones in the old branch; I found the majority of additional sinks is models where input = Argument[this]. The old branch skipped those, I think. This branch is right, IMO, to not skip them.

The rest of the additional sinks I've looked at all seem good and valid candidates, it's difficult to say why the old branch would've omitted them.

I'll send @jhelie a gist of my findings, but won't post it here in public.

@kaeluka
Copy link

kaeluka commented Jun 9, 2023

@tausbn and I have now also looked at the sinks we were missing.

Out of the 65 missing sinks:

  • 62 of them were dropped by the characteristic we have already removed from this PR in favour of revisiting later.
  • 3 of them were dropped due to a bug (nice catch, @jhelie 🥳) that we have fixed in the latest commit (b38bc52). The bug fix, for that DB, does only additionally produce those exact three candidates, and doesn't change the results otherwise.

With that, the investigation of the diffs that Jean has sent us is done in our opinion.

Additionally, I'll send Jean a complete gist of the differences as we have investigated — just like with the additional sinks last time.

@jhelie
Copy link
Contributor

jhelie commented Jun 9, 2023

Thanks @kaeluka, good news 👍 And what about the additional ones? Are they accounted for by #13372 ?

@kaeluka
Copy link

kaeluka commented Jun 12, 2023

Thanks @kaeluka, good news 👍 And what about the additional ones? Are they accounted for by #13372 ?

There may be some, but we didn't find any in the sample we spot-checked.

@jhelie
Copy link
Contributor

jhelie commented Jun 12, 2023

Oh apologies, I had missed your earlier message! #13239 (comment)

@jhelie
Copy link
Contributor

jhelie commented Jun 12, 2023

One last question on the topic then: there are not many duplicate candidates (193) but could they be indicative of some small issue with the way we identify/characterise the location of candidates?

@kaeluka
Copy link

kaeluka commented Jun 12, 2023

One last question on the topic then: there are not many duplicate candidates (193) but could they be indicative of some small issue with the way we identify/characterise the location of candidates?

Oh!! Those fell under the table. Apologies. Give me until early tomorrow — shouldn't take longer than that for a look 👀

@kaeluka
Copy link

kaeluka commented Jun 13, 2023

I've looked over these now and it appears that there aren't any duplicates in there. Some of the 'duplicates' are different sink candidates in similar locations, while others are similar candidates in different locations. If you disagree, please slack us a counterexample 👍

@kaeluka
Copy link

kaeluka commented Jun 13, 2023

I think this is now ready to merge, can we get an approve, @jhelie?

@jhelie
Copy link
Contributor

jhelie commented Jun 13, 2023

I've looked over these now and it appears that there aren't any duplicates in there. Some of the 'duplicates' are different sink candidates in similar locations, while others are similar candidates in different locations. If you disagree, please slack us a counterexample 👍

Thanks for having a look. Just to make sure I completely understand here are 2 examples:

"296e18418bb32f4c:1_0_1": {
        "ruleId": "java/ml/extract-automodel-application-candidates",
            "rule": {
            "id": "java/ml/extract-automodel-application-candidates",
            "index": 0,
            "toolComponent": {
                "index": 0
            }
        },
        "message": {
            "text": "command-injection, sql, ssrf, tainted-path\nrelated locations: [CallContext](1).\nmetadata: [package](2), [type](3), [subtypes](4), [name](5), [signature](6), [input](7)."
        },
        "locations": [
            {
                "physicalLocation": {
                    "artifactLocation": {
                        "uri": "app/src/main/java/com/vuldroid/application/SplashScreen.java",
                        "uriBaseId": "%SRCROOT%",
                        "index": 89
                    },
                    "region": {
                        "startLine": 58,
                        "startColumn": 9,
                        "endColumn": 91
                    },
                    "contextRegion": {
                        "startLine": 56,
                        "endLine": 60,
                        "snippet": {
                            "text": "\n    public void chekPermission(){\n        Dexter.withContext(this).withPermission(Manifest.permission.READ_EXTERNAL_STORAGE).withListener(new PermissionListener() {\n            @Override\n            public void onPermissionGranted(PermissionGrantedResponse permissionGrantedResponse) {\n"
                        }
                    }
                }
            }
        ],
        "partialFingerprints": {
            "primaryLocationLineHash": "296e18418bb32f4c:1",
            "primaryLocationStartColumnFingerprint": "0"
        },

and

 "296e18418bb32f4c:1_0_2": {
        "ruleId": "java/ml/extract-automodel-application-candidates",
            "rule": {
            "id": "java/ml/extract-automodel-application-candidates",
            "index": 0,
            "toolComponent": {
                "index": 0
            }
        },
        "message": {
            "text": "command-injection, sql, ssrf, tainted-path\nrelated locations: [CallContext](1).\nmetadata: [package](2), [type](3), [subtypes](4), [name](5), [signature](6), [input](7)."
        },
        "locations": [
            {
                "physicalLocation": {
                    "artifactLocation": {
                        "uri": "app/src/main/java/com/vuldroid/application/SplashScreen.java",
                        "uriBaseId": "%SRCROOT%",
                        "index": 89
                    },
                    "region": {
                        "startLine": 58,
                        "startColumn": 9,
                        "endColumn": 33
                    },
                    "contextRegion": {
                        "startLine": 56,
                        "endLine": 60,
                        "snippet": {
                            "text": "\n    public void chekPermission(){\n        Dexter.withContext(this).withPermission(Manifest.permission.READ_EXTERNAL_STORAGE).withListener(new PermissionListener() {\n            @Override\n            public void onPermissionGranted(PermissionGrantedResponse permissionGrantedResponse) {\n"
                        }
                    }
                }
            }
        ],
        "partialFingerprints": {
            "primaryLocationLineHash": "296e18418bb32f4c:1",
            "primaryLocationStartColumnFingerprint": "0"
        },

Initially I thought it was the same line copied but the location (including the file) is the same - the endColumn is the only thing changing.

The contextRegion for both starts with:

  public void chekPermission(){
        Dexter.withContext(this).withPermission(Manifest.permission.READ_EXTERNAL_STORAGE).withListener(new PermissionListener() {
            @Override
            public void onPermissionGranted(PermissionGrantedResponse permissionGrantedResponse) {

could you just clarify for me what are the 2 candidates that the region entries refer to?

edit: based on the name entry further in the sarif they are withPermission and withListener but I don't get why the "startColumn": 9 of withListener is the same as that of withPermission ?

@jhelie
Copy link
Contributor

jhelie commented Jun 13, 2023

I think this is now ready to merge, can we get an approve, @jhelie?

I haven't followed the latest rounds of discussion with @adityasharad so I'll let him do the ✅

@kaeluka
Copy link

kaeluka commented Jun 13, 2023

edit: based on the name entry further in the sarif they are ... and ... but I don't get why the "startColumn": 9 of withListener is the same as that of withPermission ?

If you look at the snippets, you'll see that the expressions literally start at the same location

@jhelie
Copy link
Contributor

jhelie commented Jun 13, 2023

If you look at the snippets, you'll see that the expressions literally start at the same location

I don't understand sorry: they are on the same line but not the same column?

@adityasharad
Copy link
Collaborator

adityasharad commented Jun 13, 2023

This is a weird example because of the chained calls (x.y(...).z(...)) and how such locations are described.

The first call is:

Dexter.withContext(this).withPermission(Manifest.permission.READ_EXTERNAL_STORAGE)

The second call is:

Dexter.withContext(this).withPermission(Manifest.permission.READ_EXTERNAL_STORAGE).withListener(new PermissionListener() {...})

The way the locations are constructed for these call expressions, both call expressions start on the same line and column, but end on different columns. The start column of each call to withX is (sensibly but somewhat counterintuitively) not the column just before the w, but the column just before the Dexter. because the qualifiers are included.

* A class representing nodes that are arguments to calls.
*/
private class ArgumentNode extends DataFlow::Node {
ArgumentNode() { this.asExpr() = [any(Call c).getAnArgument(), any(Call c).getQualifier()] }
Copy link
Contributor

@atorralba atorralba Jun 14, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've been thinking about this. This isn't taking into account variadic arguments, which means that, for the following:

public void test(String a, String... b) {}
public void test2() { test("a", "b", "c"); }

The arguments of test will have the following positions and generate the respective MaD "input" strings, if I understand the current CodeQL implementation correctly:

Expr idx input
test(...)  -1 Argument[this]
"a" 0 Argument[0]
"b" 1 Argument[1]
"c" 2 Argument[2]

However, the signature of test is test(String,String[]), which means that Argument[2] won't match anything. In this case, both "b" and "c" should generate Argument[1] since they are part of the same variadic argument.

Note that the kind of DataFlow::Node that handles this isn't an ExprNode but rather an ImplicitVarargsArray. That's why using Node::asExpr won't work well in those cases.

Happy to handle this in a different PR if you want this merged ASAP, as long as you're aware of this shortcoming in the current implementation.

Copy link
Contributor

@jhelie jhelie Jun 14, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @atorralba ! I've added your comment to this list so that we don't forget about it but my understanding is that it shouldn't prevent us from merging this for now.

Copy link
Contributor

@atorralba atorralba left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM modulo the caveat explained above.

@jhelie
Copy link
Contributor

jhelie commented Jun 14, 2023

:shipit: 🚀

@jhelie jhelie merged commit 209f3e2 into main Jun 14, 2023
@jhelie jhelie deleted the tausbn/automodel-application-mode branch June 14, 2023 09:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Java no-change-note-required This PR does not need a change note

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants