JS: extract regexp literals for string concatenations#6756
Conversation
892fca6 to
36e6c5c
Compare
0f9f523 to
c34d338
Compare
c34d338 to
12305aa
Compare
Even the subtrees? let pattern =
// 'abc'
"abc" +
// followed by any number of 'd's
"d*" +
// 'e'
"e";(if anything, I would expect sub tree parses to be prone to FPs) |
No, just the roots (at least that's what we try, but the syntax matching doesn't handle parentheses). And for the example you show we shouldn't miss anything (and also shouldn't get any FPs). |
esbena
left a comment
There was a problem hiding this comment.
LGTM, but I would prefer some more documentation (see comments)
| return '0' <= ch && ch <= '7'; | ||
| } | ||
|
|
||
| private String getStringConcatResult(Expression exp) { |
There was a problem hiding this comment.
Nit:
| private String getStringConcatResult(Expression exp) { | |
| /** | |
| * Constant-folds simple string concatenations in `exp`. | |
| */ | |
| private String getStringConcatResult(Expression exp) { |
| if (extractedAsRegexp.contains(nd)) { | ||
| return key; | ||
| } | ||
| String rawString = getStringConcatResult(nd); |
There was a problem hiding this comment.
Nit: is this really a "raw" string? Isn't it more like a "folded string"?
asgerf
left a comment
There was a problem hiding this comment.
AFAICT we extract both the leaves and the root. It would be nice if we could ensure that a given piece of text is regexp-extracted at most once (and possibly that leaves are still extracted if the outermost BinaryExpression contains non-constant parts, although I'm not sure if this is even desirable).
| import com.semmle.util.trap.TrapWriter; | ||
| import com.semmle.util.trap.TrapWriter.Label; | ||
|
|
||
| import com.semmle.util.files.FileLineOffsetCache; |
| return null; | ||
| } | ||
|
|
||
| private OffsetTranslation computeStringConcatOffset(Expression exp) { |
There was a problem hiding this comment.
This method should be merged with getStringConcatResult. Use Pair or a new class to return both results.
| return null; | ||
| } | ||
| int delta = be.getRight().getLoc().getStart().getOffset() - be.getLeft().getLoc().getStart().getOffset(); | ||
| int offset = getStringConcatResult(be.getLeft()).length(); |
There was a problem hiding this comment.
This recursive call to getStringConcatResult can be eliminated after merging the two methods, removing an N^2 trap.
| extractedAsRegexp.add(nd.getRight()); | ||
| visit(nd.getLeft(), key, 0); | ||
| visit(nd.getRight(), key, 1); | ||
| if (extractedAsRegexp.contains(nd)) { |
There was a problem hiding this comment.
Could you factor the RegExp-extraction part into an appropriately named method and call into that when it should be extracted as a regexp? This code applies to all BinaryExpression instances and this bailout-style looks a bit out of place here.
| } | ||
|
|
||
| // set to determine which BinaryExpression has been extracted as regexp | ||
| private Set<Expression> extractedAsRegexp = new HashSet<>(); |
There was a problem hiding this comment.
This seems a bit heavy-handed for detecting the root BinaryExpression. Ideally this should be part of the Context class.
If for some reason you'd rather use the set, I'd suggest
- Use a name like
shouldNotExtractAsRegExporparentWillExtractAsRegExpto make it clear how it is used. - Remove nodes again when they have been visited.
That is correct. For now I've kept extraction of all the leaves. |
It may seem simpler here and now, but having multiple AST nodes with the same location/toString value sounds like a nightmare to debug against. The fix should be a one-liner: @Override
public Label visit(Literal nd, Context c) {
Label key = super.visit(nd, c);
String source = nd.getLoc().getSource();
String valueString = nd.getStringValue();
trapwriter.addTuple("literals", valueString, source, key);
if (nd.isRegExp()) {
OffsetTranslation offsets = new OffsetTranslation();
offsets.set(0, 1); // skip the initial '/'
regexpExtractor.extract(source.substring(1, source.lastIndexOf('/')), offsets, nd, false);
- } else if (nd.isStringLiteral() && !c.isInsideType() && nd.getRaw().length() < 1000) {
+ } else if (nd.isStringLiteral() && !c.isInsideType() && nd.getRaw().length() < 1000 && !c.isBinopOperand()) {
regexpExtractor.extract(valueString, makeStringLiteralOffsets(nd.getRaw()), nd, true);This means we also won't extract regexps that are concatenated with an unknown string, but it was already FP-risky to analyze partially unknown regexps, so I'd be fine with that. |
I'll try it out, and run an evaluation on it. |
I was wrong. |
| int sl = sourceMap.getStart(term.getLoc().getStart().getColumn()).getLine(); | ||
| int sc = sourceMap.getStart(term.getLoc().getStart().getColumn()).getColumn() + 1; // convert to 1-based | ||
| int el = sourceMap.getEnd(term.getLoc().getEnd().getColumn()).getLine(); | ||
| int ec = sourceMap.getEnd(term.getLoc().getEnd().getColumn()).getColumn() - 1; // convert to inclusive |
There was a problem hiding this comment.
Is this right? The ec modifications used to be a noop:
ec += 1; // convert to 1-based
ec -= 1; // convert to inclusive
I can't quite see if the use of sourceMap makes it correct.
There was a problem hiding this comment.
It wasn't right, there was some off-by-one errors.
I'm not sure why that is, but there is some conversion between 0-based and 1-based columns, so that seems to be some of it.
I've fiddled around with it, and now I got something that works (but I'm not quite sure why it works).
I've manually checked all the locations emitted for RegExpTerms in multipart.js.
(I did that by running a test-query in VSCode, clicking results, and checking that the highlights were correct).
|
My comments have been addressed. @asgerf WDYT? |
Parses every string-concatenation of constant strings into regular expressions.
CVE-2020-17480: TP/TN (when #6561 is merged).
Evaluation looks good.