Skip to content

Conversation

@geoffw0
Copy link
Contributor

@geoffw0 geoffw0 commented Jun 15, 2023

Adds a regular expressions library for Swift, consisting of:

  • Regex.qll, providing a RegexEval class for recognizing places where regular expression evaluation takes place, and an extension to the RegExp class for recognizing string literals that are used as regular expressions (via dataflow to the RegexEvals).
  • RegexTreeView.qll and internal.ParseRegex ported from Ruby (crudely, for now).
  • uses the shared regex libraries as well.
  • numerous / detailed tests (thanks Java / Ruby / others!).

Things I'd ideally like to address before merging (@erik-krogh I'd really appreciate some advice here):

  • which regex's are vulnerable sometimes depends on how they are evaluated (prefixMatch, firstMatch, wholeMatch...). How is this addressed in other languages?
  • we're missing some results that are caught in other languages. Is something in my work obviously wrong or missing?
  • there's one test case where QL evaluation times out (at 5 minutes for a test).
  • there are some TODO comments and commented out code to deal with.
  • the PR needs a change note.

Future work for future PRs (we now have issues tracking all of these):

  • add a ReDoS query using this library (I already have a branch for this).
    • other regexp queries are planned as well.
  • add support for Swift regular expression literals (/.*/).
    • this one's important.
    • but quite a lot more work + we can't currently test them as the feature is fairly new to Swift. :(
  • add support for the Swift regular expression builder RegexBuilder.
  • add support for functions that only evaluate a regular expression if options: .regularExpression is specified as an argument (i.e. model methods of StringProtocol and NSString + RegexUseConfig flow through the NSString constructor).
  • model more of the regexp features, escape sequences etc. described in https://developer.apple.com/documentation/foundation/nsregularexpression
  • parse the mode prefix on string regular expressions (see Java's getModeFromPrefix and the Swift test case beginning (?s)).
  • handle different types of matching (prefixMatch, firstMatch, wholeMatch) properly - see matchesFullString in the Java implementation.
  • review regex library test cases MISSING, SPURIOUS and hasParseFailure results.
  • from RegexTreeView.qll: "TODO: Handle named escapes".
  • from RegexTreeView.qll: "TODO: expand to cover more properties".
  • does Swift have a "free spacing flag" we should support (see Ruby hasFreeSpacingFlag)?

@geoffw0 geoffw0 added the Swift label Jun 15, 2023
@erik-krogh
Copy link
Contributor

  • Regex.qll, providing a RegexEval class for recognizing places where regular expression evaluation takes place, and an extension to the RegExp class for recognizing string literals that are used as regular expressions (via dataflow to the RegexEvals).

Try to check out the RegExpTracking.qll file in Ruby / Python. The python variant is way simpler, but Ruby is closer to what you'll need when you support regex literals.

which regex's are vulnerable sometimes depends on how they are evaluated (prefixMatch, firstMatch, wholeMatch...). How is this addressed in other languages?

See matchesFullString in the Java implementation.

we're missing some results that are caught in other languages. Is something in my work obviously wrong or missing?

Concrete (and small) example?
Different dialects of regex are slightly different, and the tests focus on the nasty parts that are different.
E.g. if you've copied a regex with [^] from JS, then that won't work in other languages.
You unfortunately need to figure out what Swift does in all the small corner cases of its regex dialect.

there's one test case where QL evaluation times out (at 5 minutes for a test).

Hard to know 🤷
In the pasts there has been a bunch of timeouts from various regex parsers producing an ambiguous parse of the regex.
Or maybe some incorrect/ambiguous parsing of unicode-escapes... (Also seen before)

@geoffw0
Copy link
Contributor Author

geoffw0 commented Jun 15, 2023

Thanks for the feedback and for sharing your experiences. I have several areas to investigate now...

@geoffw0
Copy link
Contributor Author

geoffw0 commented Jun 21, 2023

I'm wondering if \w / \d / \s are parsed correctly?

Could you try to reduce it down to a minimal example that still produces the same error, and give me a database containing only that regex?

I minified it as far as #"(\w*foobarfoobarfoobarfoobarfoobarfoobarfoobarfoobar)+"#. I'll investigate the parsing of \w.

@geoffw0
Copy link
Contributor Author

geoffw0 commented Jun 21, 2023

I'm also marking this PR 'ready for review' as I'd like to start getting reviews from the Swift team. The REDOS query itself can't be PR'd until we have at least a basic version of this library in place.

@geoffw0 geoffw0 marked this pull request as ready for review June 21, 2023 17:35
@geoffw0 geoffw0 requested a review from a team as a code owner June 21, 2023 17:35
@geoffw0
Copy link
Contributor Author

geoffw0 commented Jun 21, 2023

The above commit "Swift: Do regex locations more like Ruby does them." unexpectedly fixes the timeout issue. I wasn't aware hasLocationInfo would affect anything except the output of the parse.ql test. 🤷 🎉

@erik-krogh
Copy link
Contributor

I wasn't aware hasLocationInfo would affect anything except the output of the parse.ql test. 🤷 🎉

This code is why that matters:

/**
* Gets a string for the full location of `t`.
*/
bindingset[t]
pragma[inline_late]
string getTermLocationString(RegExpTerm t) {
exists(string file, int startLine, int startColumn, int endLine, int endColumn |
t.hasLocationInfo(file, startLine, startColumn, endLine, endColumn) and
result = file + ":" + startLine + ":" + startColumn + "-" + endLine + ":" + endColumn
)
}
/**
* Holds if `term` is the chosen canonical representative for all terms with string representation `str`.
* The string representation includes which flags are used with the regular expression.
*
* Using canonical representatives gives a huge performance boost when working with tuples containing multiple `InputSymbol`s.
* The number of `InputSymbol`s is decreased by 3 orders of magnitude or more in some larger benchmarks.
*/
private predicate isCanonicalTerm(RelevantRegExpTerm term, string str) {
term =
min(RelevantRegExpTerm t |
str = getCanonicalizationString(t)
|
t order by getTermLocationString(t), t.toString()
)
}

It selects a canonical representative based on locations (and toString()).
Because a pair of toString() and hasLocationInfo() should always be unique for a given term.
So if different terms have the same location, then my code can't distinguish the two, which makes things go bad.

@geoffw0
Copy link
Contributor Author

geoffw0 commented Jun 21, 2023

Thanks for the explanation.

Copy link
Contributor

@MathiasVP MathiasVP left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few comments (mostly for my own understanding). If @erik-krogh is happy with the code then so am I 👍.

* Regex("(a|b).*").firstMatch(in: myString)
* ```
*/
abstract class RegexEval extends CallExpr {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Out of curiosity: Is this meant to be extended by users if they want to model custom regex evaluations, or should we only expose a final view of this so that users can't extend it outside this file?

I suppose we shouldn't extend this class in our own queries since this would cause a bunch of re-evaluation (because I assume a bunch of things in the regex library is cached?)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question. It isn't intended, but perhaps it should be. Do you think anything should change about the design (or perhaps just the documentation) to reflect this?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we want to prevent users (and future us) from accidentally extending the set of regex evaluators we could replace

abstract class RegexEval extends CallExpr { ... }

/* ... */

private class AlwaysRegexEval extends RegexEval { ... }

with something like:

// This is now private to prevent users from accidentally extending the domain of the class.
private abstract class RegexEvalDomain extends CallExpr { ... }
// ... and we now expose only a final view of the class.
// This means that users writing `class MyRegexEval extends RegexEval { ... }` don't extend
// the domain of the class, but instead define a subclass (like they most likely intended).
final class RegexEval = RegexEvalDomain;

/* ... */

// And actual classes that need to extend the domain of the RegexEval class can do so by extending `RegexEvalDomain` (which you can only do in this file since the `RegexEvalDomain` class is private).
private class AlwaysRegexEval extends RegexEvalDomain { ... }

Note that "final extensions" is a very new QL feature so it won't be available until the next release. So we may want to wait with this change in order to not delay this PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's interesting, but to be honest I don't yet see much reason to prevent users from extending the class if they want to.

Perhaps the two variables shouldn't be exposed in the public interface though:

  Expr regexInput;
  Expr stringInput;

as that does lock us in to that design somewhat?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's interesting, but to be honest I don't yet see much reason to prevent users from extending the class if they want to.

Yeah, it's certainly nice that it can be extended. We just need to ensure that none of our own queries in any suite extends this class as it invalidates (at the very least) the dataflow analysis that depends on the set of regex evaluators so that it'll be re-evaluated in the offending query.

Perhaps the two variables shouldn't be exposed in the public interface though:

 Expr regexInput;
 Expr stringInput;

as that does lock us in to that design somewhat?

That's true. Since the class is already abstract we could just require the presence of two abstract predicates getRegexInput() and getStringInput().

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Comment on lines +135 to +137
not exists(int x, int y |
this.posixStyleNamedCharacterProperty(x, y, _) and e >= x and e < y
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like something that would be more efficiently expressed using rank, but if this doesn't perform poorly for other languages it's probably fine 👍.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good spot. I'm reluctant to change this because (1) we don't currently have Swift test coverage for this edge case (you can remove the whole not exists(...) without any tests failing) and (2) I'm not sure what a clean solution with rank would look like.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've just added some new test cases around this stuff. There are a couple of spurious parse failures around posix named character properties, but none of them are affected by this particular edge case (I'm struggling to think what would be). The output of parse.ql LGTM.

@erik-krogh
Copy link
Contributor

If @erik-krogh is happy with the code then so am I 👍.

I haven't really looked at the implementation, and I'm not that much into regex parsing.
I'm happy as long as all the tests pass and the implementation doesn't diverge too much from language it was copied from.
I trust Geoffrey on that last part.

Co-authored-by: Mathias Vorreiter Pedersen <mathiasvp@github.com>
@geoffw0
Copy link
Contributor Author

geoffw0 commented Jun 22, 2023

I'm happy as long as all the tests pass

Well most of the tests pass, there are still a few MISSING, SPURIOUS and hasParseFailure tags to be found. Notably the tests are adapted from Java (which had a more complete test set) rather than Ruby, which explains why this has been a challenge. We have follow-up work planned where test results are still less than ideal.

and the implementation doesn't diverge too much from language it was copied from.

RegexTreeView.qll and ParseRegex.qll were copied from Ruby and were not changed a lot, apart from removing regex literal support for now. We will want to add this back in at some point and will use the Ruby implementation as a guide when we do.

@geoffw0
Copy link
Contributor Author

geoffw0 commented Jun 22, 2023

Just merged in main and fixed an utterly trivial merge conflict in swift/ql/lib/qlpack.yml.

@geoffw0
Copy link
Contributor Author

geoffw0 commented Jun 22, 2023

Fixed the check failure (missing QLDoc in shared code I didn't actually touch...).

Copy link
Contributor

@MathiasVP MathiasVP left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

Copy link
Contributor

@erik-krogh erik-krogh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The big picture looks OK.

But I haven't looked into anything in detail.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants