ATM: Extract training data #11263

tiferet · 2022-11-14T22:58:10Z

Implement the new query that selects data for training.

For now we include clauses that implement logic that is identical to the old queries, so that the final dataset is identical, but we also note as TODO items some constraints we may want to experiment with removing. We include a temporary wrapper query that converts the resulting data into the format expected by the endpoint pipeline.

This PR moves the couple of small pieces of ExtractEndpointData that are still needed into ExtractEndpointDataTraining.qll, and deletes ExtractEndpointDataTraining.ql, ExtractEndpointDataTraining.qll, and the associated test files.

Timing checks:

✅ KPI timing experiment: github/codeql-dca-main#8563
☑️ The local runtime of endpoint_large_scale/ExtractEndpointDataTraining is slightly impacted: Increased from about 4s to about 5s.

Closes github/ml-ql-adaptive-threat-modeling#2098

Implement the new query that selects data for training. For now we include clauses that implement logic that is identical to the old queries. Include a temporary wrapper query that converts the resulting data into the format expected by the endpoint pipeline. Move the small pieces of `ExtractEndpointData` that are still needed into `ExtractEndpointDataTraining.qll`.

Also remove the associated test files.

javascript/ql/experimental/adaptivethreatmodeling/modelbuilding/DebugResultInclusion.ql

tiferet · 2022-11-15T01:23:42Z

If the tests pass (they've been running for almost an hour now) and the KPI experiment is OK, then this PR will be ready for review when you get to work on Tuesday.

tiferet · 2022-11-15T01:26:12Z

Nice -- the checks finally passed ✅

tiferet · 2022-11-15T03:01:24Z

@kaeluka The KPI timing experiment is OK, so this PR is now ready for review 🏓

kaeluka

👍 this looks good to me. I've left a few nitpicks (in other words: not necessary to address before merging — feel free to address in your next PR).

I've tried to use the suggestions feature where possible to make this as effortless as possible on your part.

The one choice you should make before merging is what to do about my suggestion to delete the logic backing up the hasFlowFromSource value. If you accept the suggestion, you'd need to also fix the .expected file. You may ignore that suggestion, merge and I'll send a PR tomorrow that removes the flag after you merge this PR.

.../experimental/adaptivethreatmodeling/modelbuilding/extraction/ExtractEndpointDataTraining.ql

...experimental/adaptivethreatmodeling/modelbuilding/extraction/ExtractEndpointDataTraining.qll

kaeluka · 2022-11-15T15:11:23Z

...experimental/adaptivethreatmodeling/modelbuilding/extraction/ExtractEndpointDataTraining.qll

+  exists(endpoint.getFile().getRelativePath()) and
+  // Only select endpoints that can be part of a tainted flow: Constant expressions always evaluate to a constant
+  // primitive value. Therefore they can't ever appear in an alert, making them less interesting training examples.
+  // TODO: Experiment with removing this requirement.


Suggested change

// TODO: Experiment with removing this requirement.

// TODO: Turn this requirement into a characteristic.

(nitpick)

..right?

Hmm, I suppose we could. Would that characteristic have any (positive or negative) implications for any class? Or are you suggesting adding it as a characteristic with no implications, simply so that the modeling code could use it in type balancing?

Either way it would need to be done as an experiment, because it could impact ATM metrics, so we'd have to check whether it helps or hurts them.

kaeluka · 2022-11-15T15:11:23Z

...experimental/adaptivethreatmodeling/modelbuilding/extraction/ExtractEndpointDataTraining.qll

+  // TODO: Experiment with removing this requirement.
+  not endpoint.asExpr() instanceof ConstantExpr and
+  // Do not select endpoints filtered out by end-to-end evaluation.
+  // TODO: Experiment with removing this requirement.


Suggested change

// TODO: Experiment with removing this requirement.

// TODO: Turn this requirement into a characteristic.

(nitpick)

Same question as above #11263 (comment)

kaeluka · 2022-11-15T15:11:23Z

...experimental/adaptivethreatmodeling/modelbuilding/extraction/ExtractEndpointDataTraining.qll

+    endpointClass instanceof NegativeType and
+    exists(EndpointCharacteristic c |
+      c.getEndpoints(endpoint) and
+      c instanceof LikelyNotASinkCharacteristic


Meta comment, this is not criticism of the PR, just me thinking out loud;

I'm a bit unhappy about naming individual characteristic classes in here; I hope we'll be able to replace all of that stuff by only relying on the individual EndpointCharacteristic implementations abstractly in later PRs. For instance, by using the getImplication method (like exists(EndpointCharacteristic c | ... c.getImplication(...))).

It's totally expected that this sort of code is accumulating here for now, and you already left plenty of TODOs that indicate we want to move in the same direction here.

I agree, and in fact I'm pretty sure this entire clause needs to be removed. It doesn't make logical sense to me: If an endpoint has both a high-confidence reason to think it's not a sink and a medium-confidence reason to think it's not as sink, why would we exclude it from the training set? (See the last item here). I've put it in this PR because we want to reproduce the old data for now. Changing this logic needs to be tested end-to-end, because it might affect our metrics. After this PR is merged, we should start running experiments getting rid of the clauses we don't like and seeing if metrics improve (or at least stay the same).

kaeluka · 2022-11-15T15:11:23Z

...experimental/adaptivethreatmodeling/modelbuilding/extraction/ExtractEndpointDataTraining.qll

+  query instanceof XssQuery and result instanceof XssAtm::Configuration
+}
+
+// TODO: Delete this once we are no longer surfacing `hasFlowFromSource`.


My understanding is that the data pipeline does not rely on the value of hasFlowFromSource at all. With this background, could we simply delete all this logic and hard-code a false and a comment saying it's deprecated above in lines 142-144..?

I understand this would break some of our unit tests, but that'd be a good reason to update the .expected files.

I just saw that someone has been saying the same thing in a NOTE above :-D Why don't we? I've double-checked with @TomBolton and he also agrees that hard-coding it should be no problem.

Also: this is likely to speed up the pipeline quite a bit <3

The endpoint pipeline passes this value on to the modeling code. IIRC, in the modeling code I experimented with using it to weigh samples, but concluded that this added no benefit. I'd have to dig into the modeling code and make sure it doesn't use hasFlowFromSource anywhere, and then I could do that.

I just saw that someone has been saying the same thing in a NOTE above

That was me saying this in the note 😄 The reason I didn't do this in the current PR is that it will require changes in the modeling code, which currently uses the concept of having flow from a source. I would rather check more carefully that it has no effect on training before making this change.

Incidentally, there's already an issue for this deprecation in the epic.

Meta comment: In all the cases above, my goal in this PR was to reproduce the current data exactly, so that we won't need to do end-to-end testing before merging this PR. Once this basic framework is merged, the next step would be to open a PR that adjusts the logic in all the ways we think it should be adjusted, then run end-to-end testing. If metrics aren't hurt we could merge that PR and run a partial update process. Until the orchestrator is fully functional, that's a non-trivial process, so I'd rather not do it many times unnecessarily. That's why I prefer a strict separation between PRs that change the framework but leave the data identical, and can be verified with PR checks, followed by a small number of targeted PRs that improve the underlying logic/data but require end-to-end testing.

I just opened an issue to track the followup experiments so we don't forget about them.

...experimental/adaptivethreatmodeling/modelbuilding/extraction/ExtractEndpointDataTraining.qll

kaeluka · 2022-11-15T15:11:23Z

...experimental/adaptivethreatmodeling/modelbuilding/extraction/ExtractEndpointDataTraining.qll

+
+/**
+ * Gets the ATM data flow configuration for the specified query.
+ * TODO: Delete this once we are no longer surfacing `hasFlowFromSource`.
+ */
+DataFlow::Configuration getDataFlowCfg(Query query) {
+  query instanceof NosqlInjectionQuery and result instanceof NosqlInjectionAtm::Configuration
+  or
+  query instanceof SqlInjectionQuery and result instanceof SqlInjectionAtm::Configuration
+  or
+  query instanceof TaintedPathQuery and result instanceof TaintedPathAtm::Configuration
+  or
+  query instanceof XssQuery and result instanceof XssAtm::Configuration
+}
+
+// TODO: Delete this once we are no longer surfacing `hasFlowFromSource`.
+module FlowFromSource {
+  predicate hasFlowFromSource(DataFlow::Node endpoint, Query q) {
+    exists(Configuration cfg | cfg.getQuery() = q | cfg.hasFlow(_, endpoint))
+  }
+
+  /**
+   * A data flow configuration that replicates the data flow configuration for a specific query, but
+   * replaces the set of sinks with the set of endpoints we're extracting.
+   *
+   * We use this to find out when there is flow to a particular endpoint from a known source.
+   *
+   * This configuration behaves in a very similar way to the `ForwardExploringConfiguration` class
+   * from the CodeQL standard libraries for JavaScript.
+   */
+  private class Configuration extends DataFlow::Configuration {
+    Query q;
+
+    Configuration() { this = getDataFlowCfg(q) }
+
+    Query getQuery() { result = q }
+
+    /** Holds if `sink` is an endpoint we're extracting. */
+    override predicate isSink(DataFlow::Node sink) { any() }
+
+    /** Holds if `sink` is an endpoint we're extracting. */
+    override predicate isSink(DataFlow::Node sink, DataFlow::FlowLabel lbl) { exists(lbl) }
+  }
+}


Suggested change

/**

* Gets the ATM data flow configuration for the specified query.

* TODO: Delete this once we are no longer surfacing `hasFlowFromSource`.

*/

DataFlow::Configuration getDataFlowCfg(Query query) {

query instanceof NosqlInjectionQuery and result instanceof NosqlInjectionAtm::Configuration

or

query instanceof SqlInjectionQuery and result instanceof SqlInjectionAtm::Configuration

or

query instanceof TaintedPathQuery and result instanceof TaintedPathAtm::Configuration

or

query instanceof XssQuery and result instanceof XssAtm::Configuration

}

// TODO: Delete this once we are no longer surfacing `hasFlowFromSource`.

module FlowFromSource {

predicate hasFlowFromSource(DataFlow::Node endpoint, Query q) {

exists(Configuration cfg | cfg.getQuery() = q | cfg.hasFlow(_, endpoint))

}

/**

* A data flow configuration that replicates the data flow configuration for a specific query, but

* replaces the set of sinks with the set of endpoints we're extracting.

*

* We use this to find out when there is flow to a particular endpoint from a known source.

*

* This configuration behaves in a very similar way to the `ForwardExploringConfiguration` class

* from the CodeQL standard libraries for JavaScript.

*/

private class Configuration extends DataFlow::Configuration {

Query q;

Configuration() { this = getDataFlowCfg(q) }

Query getQuery() { result = q }

/** Holds if `sink` is an endpoint we're extracting. */

override predicate isSink(DataFlow::Node sink) { any() }

/** Holds if `sink` is an endpoint we're extracting. */

override predicate isSink(DataFlow::Node sink, DataFlow::FlowLabel lbl) { exists(lbl) }

}

}

suggestion: delete this, and hard-code above (see other suggestion)

kaeluka · 2022-11-15T15:11:23Z

...experimental/adaptivethreatmodeling/modelbuilding/extraction/ExtractEndpointDataTraining.qll

+      // NOTE: We don't use hasFlowFromSource in training, so we could just hardcode it to be false.
+      key = "hasFlowFromSource" and
+      (
+        if FlowFromSource::hasFlowFromSource(endpoint, query)
+        then value = "true"
+        else value = "false"
+      ) and
+      valueType = "boolean"


Suggested change

// NOTE: We don't use hasFlowFromSource in training, so we could just hardcode it to be false.

key = "hasFlowFromSource" and

(

if FlowFromSource::hasFlowFromSource(endpoint, query)

then value = "true"

else value = "false"

) and

valueType = "boolean"

key = "hasFlowFromSource" and value = "false" and valueType = "boolean"

When accepting this, also make sure to see the last suggestion in the file (deleting the module) 👍

tiferet · 2022-11-15T17:01:39Z

Meta comment: In all the cases above, my goal in this PR was to reproduce the current data exactly, so that we won't need to do end-to-end testing before merging this PR. Once this basic framework is merged, the next step would be to open a PR that adjusts the logic in all the ways we think it should be adjusted, then run end-to-end testing. If metrics aren't hurt we could merge that PR and run a partial update process. Until the orchestrator is fully functional, that's a non-trivial process, so I'd rather not do it many times unnecessarily. That's why I prefer a strict separation between PRs that change the framework but leave the data identical, and can be verified with PR checks, followed by a small number of targeted PRs that improve the underlying logic/data but require end-to-end testing.

Co-authored-by: Stephan Brandauer <kaeluka@github.com>

kaeluka

Still lgtm 👍

tiferet · 2022-11-15T19:55:26Z

Still lgtm 👍

Thank you! I had some followup questions about some of your comments (e.g. #11263 (comment)), so let's keep discussing them in this PR even after it's merged. When we reach a conclusion I'll implement it in a subsequent PR.

tiferet added 2 commits Nov 14, 2022

Delete ExtractEndpointData.

b47723d

Also remove the associated test files.

github-actions bot added the ATM label Nov 14, 2022

github-code-scanning bot found potential problems Nov 14, 2022

View changes

javascript/ql/experimental/adaptivethreatmodeling/modelbuilding/DebugResultInclusion.ql Fixed Show fixed Hide fixed

tiferet added 2 commits Nov 14, 2022

Fix import errors in DebugResultInclusion.ql

6b7612f

Fix non-ascii character in docs

9ecff07

tiferet marked this pull request as ready for review Nov 15, 2022

tiferet requested a review from a team as a code owner Nov 15, 2022

tiferet requested review from kaeluka and removed request for a team Nov 15, 2022

owen-mc changed the title ~~Extract training data~~ ATM: Extract training data Nov 15, 2022

kaeluka previously approved these changes Nov 15, 2022

View changes

Apply suggestions from code review

092e019

Co-authored-by: Stephan Brandauer <kaeluka@github.com>

tiferet dismissed kaeluka’s stale review via 092e019 Nov 15, 2022

tiferet requested a review from kaeluka Nov 15, 2022

Apply suggestion from code review

fc078a4

kaeluka approved these changes Nov 15, 2022

View changes

tiferet merged commit 710b215 into main Nov 15, 2022
11 checks passed

tiferet deleted the tiferet/extract-training-data branch Nov 15, 2022

ATM: Extract training data #11263

ATM: Extract training data #11263

tiferet commented Nov 14, 2022 •

edited

tiferet commented Nov 15, 2022

tiferet commented Nov 15, 2022 •

edited

tiferet commented Nov 15, 2022

kaeluka left a comment

kaeluka Nov 15, 2022

tiferet Nov 15, 2022

kaeluka Nov 15, 2022

tiferet Nov 15, 2022

kaeluka Nov 15, 2022

tiferet Nov 15, 2022

kaeluka Nov 15, 2022

tiferet Nov 15, 2022

tiferet Nov 15, 2022

tiferet Nov 15, 2022

tiferet Nov 15, 2022

kaeluka Nov 15, 2022

kaeluka Nov 15, 2022

tiferet commented Nov 15, 2022 •

edited

kaeluka left a comment

tiferet commented Nov 15, 2022

	// TODO: Experiment with removing this requirement.
	// TODO: Turn this requirement into a characteristic.

ATM: Extract training data #11263

ATM: Extract training data #11263

Conversation

tiferet commented Nov 14, 2022 • edited

tiferet commented Nov 15, 2022

tiferet commented Nov 15, 2022 • edited

tiferet commented Nov 15, 2022

kaeluka left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

tiferet commented Nov 15, 2022 • edited

kaeluka left a comment

tiferet commented Nov 15, 2022

tiferet commented Nov 14, 2022 •

edited

tiferet commented Nov 15, 2022 •

edited

tiferet commented Nov 15, 2022 •

edited