JS: extract regexp literals for string concatenations by erik-krogh · Pull Request #6756 · github/codeql

erik-krogh · 2021-09-24T18:13:43Z

Parses every string-concatenation of constant strings into regular expressions.

CVE-2020-17480: TP/TN (when #6561 is merged).

esbena · 2021-11-02T13:10:57Z

> Parses every string-concatenation of constant strings into regular expressions.

Even the subtrees?
Do we miss out on anything if we only do one parse of the following concatenation:

let pattern = 
  // 'abc'
  "abc" +
  // followed by any number of 'd's
  "d*" + 
  // 'e'
  "e";

(If anything, I would expect subtree parses to be prone to FPs.)

erik-krogh · 2021-11-02T13:23:17Z

> Even the subtrees?

No, just the roots (at least that's what we try, but the syntax matching doesn't handle parentheses).

And for the example you show we shouldn't miss anything (and also shouldn't get any FPs).
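For illustration, folding the root of such a concatenation could look roughly like the sketch below. The AST types (`Expr`, `StrLit`, `Ident`, `Concat`) and `ConstantFolder` are hypothetical stand-ins, not the extractor's real classes. The sketch only shows that the three leaves from the example collapse into one string that is parsed once, and that a non-constant leaf makes the fold give up.

```java
/** Hypothetical, minimal AST used only for this sketch. */
interface Expr {}

final class StrLit implements Expr {
  final String value;
  StrLit(String value) { this.value = value; }
}

final class Ident implements Expr {
  final String name;
  Ident(String name) { this.name = name; }
}

final class Concat implements Expr {
  final Expr left, right;
  Concat(Expr left, Expr right) { this.left = left; this.right = right; }
}

final class ConstantFolder {
  /** Returns the folded string, or null if any leaf is not a string literal. */
  static String fold(Expr e) {
    if (e instanceof StrLit) {
      return ((StrLit) e).value;
    }
    if (e instanceof Concat) {
      Concat c = (Concat) e;
      String left = fold(c.left);
      if (left == null) return null;
      String right = fold(c.right);
      return right == null ? null : left + right;
    }
    return null; // identifiers, calls, etc. are not constant
  }
}
```

With `"abc" + "d*" + "e"` represented as `new Concat(new Concat(new StrLit("abc"), new StrLit("d*")), new StrLit("e"))`, `ConstantFolder.fold` yields `abcd*e`, which is the single string a root-only parse would see.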

esbena

LGTM, but I would prefer some more documentation (see comments)

esbena · 2021-11-02T13:42:43Z

      return '0' <= ch && ch <= '7';
    }

+    private String getStringConcatResult(Expression exp) {


Nit:

Suggested change

-    private String getStringConcatResult(Expression exp) {
+    /**
+     * Constant-folds simple string concatenations in `exp`.
+     */
+    private String getStringConcatResult(Expression exp) {

esbena · 2021-11-02T14:00:00Z

+      if (extractedAsRegexp.contains(nd)) {
+        return key;
+      }
+      String rawString = getStringConcatResult(nd);


Nit: is this really a "raw" string? Isn't it more like a "folded string"?

asgerf

AFAICT we extract both the leaves and the root. It would be nice if we could ensure that a given piece of text is regexp-extracted at most once (and possibly that leaves are still extracted if the outermost BinaryExpression contains non-constant parts, although I'm not sure if this is even desirable).

asgerf · 2021-10-28T14:00:19Z

 import com.semmle.util.trap.TrapWriter;
 import com.semmle.util.trap.TrapWriter.Label;

+import com.semmle.util.files.FileLineOffsetCache;


Unused import

asgerf · 2021-11-02T13:30:44Z

+      return null;
+    }
+
+    private OffsetTranslation computeStringConcatOffset(Expression exp) {


This method should be merged with getStringConcatResult. Use Pair or a new class to return both results.

asgerf · 2021-11-02T13:31:06Z

+          return null;
+        }
+        int delta = be.getRight().getLoc().getStart().getOffset() - be.getLeft().getLoc().getStart().getOffset();
+        int offset = getStringConcatResult(be.getLeft()).length();


This recursive call to getStringConcatResult can be eliminated after merging the two methods, removing an N^2 trap.
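A rough sketch of the merged shape, using a new result class rather than Pair. All names here (`Node`, `Leaf`, `Plus`, `FoldResult`, `SinglePassFolder`) are hypothetical, and `sourceStart` is assumed to be the source offset of a literal's first content character. Escape sequences are ignored. Each leaf is visited exactly once and its offset in the folded text is read from the running `StringBuilder` length, so no left operand is re-folded just to measure it.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical AST for the sketch; not the extractor's real classes. */
interface Node {}

final class Leaf implements Node {
  final String value;
  final int sourceStart; // assumed: source offset of the first content character
  Leaf(String value, int sourceStart) { this.value = value; this.sourceStart = sourceStart; }
}

final class Plus implements Node {
  final Node left, right;
  Plus(Node left, Node right) { this.left = left; this.right = right; }
}

/** Carries both results: the folded text and the offset anchors. */
final class FoldResult {
  final StringBuilder text = new StringBuilder();
  /** Each entry is {offset in folded text, offset in source}. */
  final List<int[]> anchors = new ArrayList<>();
}

final class SinglePassFolder {
  /** Returns null if the tree contains anything other than literals and concatenations. */
  static FoldResult fold(Node root) {
    FoldResult result = new FoldResult();
    return visit(root, result) ? result : null;
  }

  private static boolean visit(Node n, FoldResult acc) {
    if (n instanceof Leaf) {
      Leaf leaf = (Leaf) n;
      // The folded offset is the current length, so nothing needs to be recomputed.
      acc.anchors.add(new int[] {acc.text.length(), leaf.sourceStart});
      acc.text.append(leaf.value);
      return true;
    }
    if (n instanceof Plus) {
      Plus p = (Plus) n;
      return visit(p.left, acc) && visit(p.right, acc);
    }
    return false;
  }
}
```

An early exit for overly long strings could be added by checking `acc.text.length()` in `visit`; that is left out here to keep the sketch short.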

asgerf · 2021-11-02T14:00:51Z

+      extractedAsRegexp.add(nd.getRight());
      visit(nd.getLeft(), key, 0);
      visit(nd.getRight(), key, 1);
+      if (extractedAsRegexp.contains(nd)) {


Could you factor the RegExp-extraction part into an appropriately named method and call into that when it should be extracted as a regexp? This code applies to all BinaryExpression instances and this bailout-style looks a bit out of place here.

asgerf · 2021-11-02T14:00:56Z

    }

+    // set to determine which BinaryExpression has been extracted as regexp
+    private Set<Expression> extractedAsRegexp = new HashSet<>();


This seems a bit heavy-handed for detecting the root BinaryExpression. Ideally this should be part of the Context class.

If for some reason you'd rather use the set, I'd suggest:

- Use a name like shouldNotExtractAsRegExp or parentWillExtractAsRegExp to make it clear how it is used.
- Remove nodes again when they have been visited.

erik-krogh · 2021-11-03T13:20:53Z

> AFAICT we extract both the leaves and the root.

That is correct.
In the talk about sub-trees above I was talking about binary-expressions that weren't the root.

For now I've kept extraction of all the leaves.
It's simpler, and I think we at most risk some duplicate alerts.

asgerf · 2021-11-09T12:55:41Z

> For now I've kept extraction of all the leaves.
> It's simpler, and I think we at most risk some duplicate alerts.

It may seem simpler here and now, but having multiple AST nodes with the same location/toString value sounds like a nightmare to debug against.

The fix should be a one-liner:

    @Override
    public Label visit(Literal nd, Context c) {
      Label key = super.visit(nd, c);
      String source = nd.getLoc().getSource();
      String valueString = nd.getStringValue();

      trapwriter.addTuple("literals", valueString, source, key);
      if (nd.isRegExp()) {
        OffsetTranslation offsets = new OffsetTranslation();
        offsets.set(0, 1); // skip the initial '/'
        regexpExtractor.extract(source.substring(1, source.lastIndexOf('/')), offsets, nd, false);
-     } else if (nd.isStringLiteral() && !c.isInsideType() && nd.getRaw().length() < 1000) {
+     } else if (nd.isStringLiteral() && !c.isInsideType() && nd.getRaw().length() < 1000 && !c.isBinopOperand()) {
        regexpExtractor.extract(valueString, makeStringLiteralOffsets(nd.getRaw()), nd, true);

This means we also won't extract regexps that are concatenated with an unknown string, but it was already FP-risky to analyze partially unknown regexps, so I'd be fine with that.
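To make the effect concrete, here is a small sketch under simplified assumptions. The types `Ast`, `Str`, `Var`, `Add` and the class `RegExpCandidates` are hypothetical, and the boolean `binopOperand` parameter stands in for `c.isBinopOperand()`. A string is recorded as a regexp candidate only when it is not an operand of a binary expression, and a concatenation is folded only at its root.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical AST for the sketch; not the extractor's real classes. */
interface Ast {}

final class Str implements Ast {
  final String value;
  Str(String value) { this.value = value; }
}

final class Var implements Ast {
  final String name;
  Var(String name) { this.name = name; }
}

final class Add implements Ast {
  final Ast left, right;
  Add(Ast left, Ast right) { this.left = left; this.right = right; }
}

final class RegExpCandidates {
  /** Strings that would be handed to the regexp extractor. */
  final List<String> extracted = new ArrayList<>();

  /** binopOperand plays the role of c.isBinopOperand() in the diff above. */
  void visit(Ast n, boolean binopOperand) {
    if (n instanceof Str) {
      if (!binopOperand) extracted.add(((Str) n).value);
    } else if (n instanceof Add) {
      Add a = (Add) n;
      if (!binopOperand) { // only the root of a concatenation is folded
        String folded = fold(a);
        if (folded != null) extracted.add(folded);
      }
      visit(a.left, true);
      visit(a.right, true);
    }
  }

  private static String fold(Ast n) {
    if (n instanceof Str) return ((Str) n).value;
    if (n instanceof Add) {
      String l = fold(((Add) n).left);
      String r = fold(((Add) n).right);
      return (l == null || r == null) ? null : l + r;
    }
    return null;
  }
}
```

In this sketch `"abc" + "d*" + "e"` produces the single candidate `abcd*e`, a standalone `"x"` is still extracted, and `"a" + y` produces nothing, which is the trade-off described in the comment above.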

erik-krogh · 2021-11-10T13:11:20Z

> This means we also won't extract regexps that are concatenated with an unknown string, but it was already FP-risky to analyze partially unknown regexps, so I'd be fine with that.

I'll try it out, and run an evaluation on it.
But I think we might get some FNs.

erik-krogh · 2021-11-11T10:57:51Z

> I'll try it out, and run an evaluation on it.
> But I think we might get some FNs.

I was wrong.
The evaluation looks good.
And the two missing results were FPs that are now fixed.

esbena · 2021-11-12T07:45:35Z

+    int sl = sourceMap.getStart(term.getLoc().getStart().getColumn()).getLine();
+    int sc = sourceMap.getStart(term.getLoc().getStart().getColumn()).getColumn() + 1; // convert to 1-based
+    int el = sourceMap.getEnd(term.getLoc().getEnd().getColumn()).getLine();
+    int ec = sourceMap.getEnd(term.getLoc().getEnd().getColumn()).getColumn() - 1; // convert to inclusive


Is this right? The ec modifications used to be a noop:

    ec += 1; // convert to 1-based
    ec -= 1; // convert to inclusive

I can't quite see if the use of sourceMap makes it correct.

It wasn't right; there were some off-by-one errors.

I'm not sure why that is, but there is some conversion between 0-based and 1-based columns, so that seems to be part of it.

I've fiddled around with it, and now I have something that works (but I'm not quite sure why it works).

I've manually checked all the locations emitted for RegExpTerms in multipart.js.
(I did that by running a test-query in VSCode, clicking results, and checking that the highlights were correct).
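For reference, the conventional arithmetic behind the two comments in the old code is sketched below. `ColumnConversion` and its methods are hypothetical names. The sketch assumes the input is a 0-based column range with an exclusive end. It only shows why the old `+= 1` / `-= 1` pair on `ec` cancelled out; it does not explain why the sourceMap-based version needs the `- 1`.

```java
/** Hypothetical helper illustrating 0-based/exclusive to 1-based/inclusive conversion. */
final class ColumnConversion {
  /** 0-based start column to 1-based start column. */
  static int startColumn(int zeroBasedStart) {
    return zeroBasedStart + 1;
  }

  /** 0-based exclusive end to 1-based inclusive end: the +1 and -1 cancel. */
  static int endColumn(int zeroBasedExclusiveEnd) {
    int ec = zeroBasedExclusiveEnd;
    ec += 1; // convert to 1-based
    ec -= 1; // convert to inclusive
    return ec;
  }
}
```

For the term `d*` in `abcd*e`, the 0-based exclusive range is [3, 5), and the sketch maps it to 1-based inclusive columns 4 through 5.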

esbena · 2021-11-16T12:04:31Z

My comments have been addressed. @asgerf WDYT?

asgerf

LGTM 👍

erik-krogh added the "depends on internal PR" (This PR should only be merged in sync with an internal Semmle PR) and "Awaiting evaluation" (Do not merge yet, this PR is waiting for an evaluation to finish) labels Sep 24, 2021

github-actions Bot added the JS label Sep 24, 2021

erik-krogh force-pushed the extractBigReg branch from 892fca6 to 36e6c5c September 27, 2021 09:14

erik-krogh force-pushed the extractBigReg branch from 0f9f523 to c34d338 October 28, 2021 07:43

erik-krogh removed the "depends on internal PR" (This PR should only be merged in sync with an internal Semmle PR) label Oct 28, 2021

extract regexp literals from string concatenations

12305aa

erik-krogh force-pushed the extractBigReg branch from c34d338 to 12305aa October 28, 2021 08:44

erik-krogh removed the "Awaiting evaluation" (Do not merge yet, this PR is waiting for an evaluation to finish) label Oct 28, 2021

erik-krogh marked this pull request as ready for review October 28, 2021 13:58

erik-krogh requested a review from a team as a code owner October 28, 2021 13:58

esbena previously approved these changes Nov 2, 2021

asgerf reviewed Nov 2, 2021

erik-krogh added 5 commits November 3, 2021 13:08

Merge branch 'main' into extractBigReg

9cf34f1

remove unused import

be46c1f

compute concatenated string and offset at the same time

1ba6f44

early exit if string becomes too big

737c747

use the context to determine whether or not a node is an operand of a…

7b0ebd3

… binop

erik-krogh dismissed esbena’s stale review via 7b0ebd3 November 3, 2021 13:13

add a docstring, and rename rawString -> foldedString

f01ee59

erik-krogh added 2 commits November 10, 2021 14:11

dont extract regular expressions from strings that are leaves in a st…

98da532

…ring concat

update expected output

9a11c13

esbena reviewed Nov 12, 2021

Merge branch 'main' into extractBigReg

80919e3

erik-krogh added 2 commits November 15, 2021 13:43

fix location off-by-ones with regexp parsing

2163648

update expected output

0023b88

asgerf approved these changes Nov 16, 2021

erik-krogh merged commit a7cd097 into github:main Nov 16, 2021
