Skip to content

JS: extract regexp literals for string concatenations#6756

Merged
erik-krogh merged 12 commits into
github:mainfrom
erik-krogh:extractBigReg
Nov 16, 2021
Merged

JS: extract regexp literals for string concatenations#6756
erik-krogh merged 12 commits into
github:mainfrom
erik-krogh:extractBigReg

Conversation

@erik-krogh
Copy link
Copy Markdown
Contributor

@erik-krogh erik-krogh commented Sep 24, 2021

Parses every string-concatenation of constant strings into regular expressions.

CVE-2020-17480: TP/TN (when #6561 is merged).

Evaluation looks good.

@erik-krogh erik-krogh added depends on internal PR This PR should only be merged in sync with an internal Semmle PR Awaiting evaluation Do not merge yet, this PR is waiting for an evaluation to finish labels Sep 24, 2021
@github-actions github-actions Bot added the JS label Sep 24, 2021
@erik-krogh erik-krogh removed the depends on internal PR This PR should only be merged in sync with an internal Semmle PR label Oct 28, 2021
@erik-krogh erik-krogh removed the Awaiting evaluation Do not merge yet, this PR is waiting for an evaluation to finish label Oct 28, 2021
@erik-krogh erik-krogh marked this pull request as ready for review October 28, 2021 13:58
@erik-krogh erik-krogh requested a review from a team as a code owner October 28, 2021 13:58
@esbena
Copy link
Copy Markdown
Contributor

esbena commented Nov 2, 2021

Parses every string-concatenation of constant strings into regular expressions.

Even the subtrees?
Do we miss out on anything if we only do one parse of the following concatenation:

let pattern = 
  // 'abc'
  "abc" +
  // followed by any number of 'd's
  "d*" + 
  // 'e'
  "e";

(if anything, I would expect sub tree parses to be prone to FPs)

@erik-krogh
Copy link
Copy Markdown
Contributor Author

Even the subtrees?

No, just the roots (at least that's what we try, but the syntax matching doesn't handle parentheses).

And for the example you show we shouldn't miss anything (and also shouldn't get any FPs).

esbena
esbena previously approved these changes Nov 2, 2021
Copy link
Copy Markdown
Contributor

@esbena esbena left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, but I would prefer some more documentation (see comments)

return '0' <= ch && ch <= '7';
}

private String getStringConcatResult(Expression exp) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit:

Suggested change
private String getStringConcatResult(Expression exp) {
/**
* Constant-folds simple string concatenations in `exp`.
*/
private String getStringConcatResult(Expression exp) {

Comment thread javascript/extractor/src/com/semmle/js/extractor/ASTExtractor.java Outdated
if (extractedAsRegexp.contains(nd)) {
return key;
}
String rawString = getStringConcatResult(nd);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: is this really a "raw" string? Isn't it more like a "folded string"?

Copy link
Copy Markdown
Contributor

@asgerf asgerf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAICT we extract both the leaves and the root. It would be nice if we could ensure that a given piece of text is regexp-extracted at most once (and possibly that leaves are still extracted if the outermost BinaryExpression contains non-constant parts, although I'm not sure if this is even desirable).

import com.semmle.util.trap.TrapWriter;
import com.semmle.util.trap.TrapWriter.Label;

import com.semmle.util.files.FileLineOffsetCache;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unused import

return null;
}

private OffsetTranslation computeStringConcatOffset(Expression exp) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method should be merged with getStringConcatResult. Use Pair or a new class to return both results.

return null;
}
int delta = be.getRight().getLoc().getStart().getOffset() - be.getLeft().getLoc().getStart().getOffset();
int offset = getStringConcatResult(be.getLeft()).length();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This recursive call to getStringConcatResult can be eliminated after merging the two methods, removing an N^2 trap.

extractedAsRegexp.add(nd.getRight());
visit(nd.getLeft(), key, 0);
visit(nd.getRight(), key, 1);
if (extractedAsRegexp.contains(nd)) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you factor the RegExp-extraction part into an appropriately named method and call into that when it should be extracted as a regexp? This code applies to all BinaryExpression instances and this bailout-style looks a bit out of place here.

}

// set to determine which BinaryExpression has been extracted as regexp
private Set<Expression> extractedAsRegexp = new HashSet<>();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems a bit heavy-handed for detecting the root BinaryExpression. Ideally this should be part of the Context class.

If for some reason you'd rather use the set, I'd suggest

  • Use a name like shouldNotExtractAsRegExp or parentWillExtractAsRegExp to make it clear how it is used.
  • Remove nodes again when they have been visited.

@erik-krogh
Copy link
Copy Markdown
Contributor Author

AFAICT we extract both the leaves and the root.

That is correct.
In the talk about sub-trees above I was talking about binary-expressions that weren't the root.

For now I've kept extraction of all the leaves.
It's simpler, and I think we at most risk some duplicate alerts.

@asgerf
Copy link
Copy Markdown
Contributor

asgerf commented Nov 9, 2021

For now I've kept extraction of all the leaves.
It's simpler, and I think we at most risk some duplicate alerts.

It may seem simpler here and now, but having multiple AST nodes with the same location/toString value sounds like a nightmare to debug against.

The fix should be a one-liner:

    @Override
    public Label visit(Literal nd, Context c) {
      Label key = super.visit(nd, c);
      String source = nd.getLoc().getSource();
      String valueString = nd.getStringValue();

      trapwriter.addTuple("literals", valueString, source, key);
      if (nd.isRegExp()) {
        OffsetTranslation offsets = new OffsetTranslation();
        offsets.set(0, 1); // skip the initial '/'
        regexpExtractor.extract(source.substring(1, source.lastIndexOf('/')), offsets, nd, false);
-     } else if (nd.isStringLiteral() && !c.isInsideType() && nd.getRaw().length() < 1000) {
+     } else if (nd.isStringLiteral() && !c.isInsideType() && nd.getRaw().length() < 1000 && !c.isBinopOperand()) {
        regexpExtractor.extract(valueString, makeStringLiteralOffsets(nd.getRaw()), nd, true);

This means we also won't extract regexps that are concatenated with an unknown string, but it was already FP-risky to analyze partially unknown regexps, so I'd be fine with that.

@erik-krogh
Copy link
Copy Markdown
Contributor Author

This means we also won't extract regexps that are concatenated with an unknown string, but it was already FP-risky to analyze partially unknown regexps, so I'd be fine with that.

I'll try it out, and run an evaluation on it.
But I think we might get some FNs.

@erik-krogh
Copy link
Copy Markdown
Contributor Author

I'll try it out, and run an evaluation on it.
But I think we might get some FNs.

I was wrong.
The evaluation looks good.
And the two missing results were FPs that are now fixed.

int sl = sourceMap.getStart(term.getLoc().getStart().getColumn()).getLine();
int sc = sourceMap.getStart(term.getLoc().getStart().getColumn()).getColumn() + 1; // convert to 1-based
int el = sourceMap.getEnd(term.getLoc().getEnd().getColumn()).getLine();
int ec = sourceMap.getEnd(term.getLoc().getEnd().getColumn()).getColumn() - 1; // convert to inclusive
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this right? The ec modifications used to be a noop:

    ec += 1; // convert to 1-based
    ec -= 1; // convert to inclusive

I can't quite see if the use of sourceMap makes it correct.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It wasn't right, there was some off-by-one errors.

I'm not sure why that is, but there is some conversion between 0-based and 1-based columns, so that seems to be some of it.

I've fiddled around with it, and now I got something that works (but I'm not quite sure why it works).

I've manually checked all the locations emitted for RegExpTerms in multipart.js.
(I did that by running a test-query in VSCode, clicking results, and checking that the highlights were correct).

@esbena
Copy link
Copy Markdown
Contributor

esbena commented Nov 16, 2021

My comments have been addressed. @asgerf WDYT?

Copy link
Copy Markdown
Contributor

@asgerf asgerf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 👍

@erik-krogh erik-krogh merged commit a7cd097 into github:main Nov 16, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants