Swift: Regular expressions library. #13470

geoffw0 · 2023-06-15T12:30:28Z

Adds a regular expressions library for Swift, consisting of:

- Regex.qll, providing a RegexEval class for recognizing places where regular expression evaluation takes place, and an extension to the RegExp class for recognizing string literals that are used as regular expressions (via dataflow to the RegexEvals).
- RegexTreeView.qll and internal.ParseRegex ported from Ruby (crudely, for now).
- uses the shared regex libraries as well.
- numerous / detailed tests (thanks Java / Ruby / others!).

Things I'd ideally like to address before merging (@erik-krogh I'd really appreciate some advice here):

- which regexes are vulnerable sometimes depends on how they are evaluated (prefixMatch, firstMatch, wholeMatch...). How is this addressed in other languages?
- we're missing some results that are caught in other languages. Is something in my work obviously wrong or missing?
- there's one test case where QL evaluation times out (at 5 minutes for a test).
- there are some TODO comments and commented out code to deal with.
- the PR needs a change note.

Future work for future PRs (we now have issues tracking all of these):

- add a ReDoS query using this library (I already have a branch for this).
  - other regexp queries are planned as well.
- add support for Swift regular expression literals (/.*/).
  - this one's important.
  - but quite a lot more work + we can't currently test them as the feature is fairly new to Swift. :(
- add support for the Swift regular expression builder RegexBuilder.
- add support for functions that only evaluate a regular expression if options: .regularExpression is specified as an argument (i.e. model methods of StringProtocol and NSString + RegexUseConfig flow through the NSString constructor).
- model more of the regexp features, escape sequences etc. described in https://developer.apple.com/documentation/foundation/nsregularexpression
- parse the mode prefix on string regular expressions (see Java's getModeFromPrefix and the Swift test case beginning (?s)).
- handle different types of matching (prefixMatch, firstMatch, wholeMatch) properly - see matchesFullString in the Java implementation.
- review regex library test cases MISSING, SPURIOUS and hasParseFailure results.
- from RegexTreeView.qll: "TODO: Handle named escapes".
- from RegexTreeView.qll: "TODO: expand to cover more properties".
- does Swift have a "free spacing flag" we should support (see Ruby hasFreeSpacingFlag)?
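
To make the `options: .regularExpression` item concrete, here is a minimal sketch of why those calls need special modelling: the same Foundation method treats its argument as literal text or as a regular expression depending on the option. The input string and variable names are made up for illustration; only `range(of:options:)` and `NSString(string:)` are Foundation APIs.

```swift
import Foundation

// Hypothetical input and pattern, chosen only for illustration.
let input = "order 12345 shipped"
let pattern = "[0-9]+"

// Without the option, `pattern` is searched for as literal text.
let literalHit = input.range(of: pattern)

// With `.regularExpression`, the same string is evaluated as a regex.
let regexHit = input.range(of: pattern, options: .regularExpression)

// NSString exposes the same switch; the string flows through its constructor.
let nsHit = NSString(string: input).range(of: pattern, options: .regularExpression)

print(literalHit == nil)                     // no literal "[0-9]+" in the input
print(regexHit.map { String(input[$0]) } ?? "no match")
print(nsHit.location, nsHit.length)
```

A library model would therefore need to inspect the options argument before treating the string as a regular expression.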


swift/ql/lib/codeql/swift/regex/internal/ParseRegex.qll

swift/ql/test/library-tests/regex/regex.ql

swift/ql/test/library-tests/regex/redos_variants.swift

erik-krogh · 2023-06-15T13:01:43Z

Regex.qll, providing a RegexEval class for recognizing places where regular expression evaluation takes place, and an extension to the RegExp class for recognizing string literals that are used as regular expressions (via dataflow to the RegexEvals).

Try to check out the RegExpTracking.qll file in Ruby / Python. The Python variant is way simpler, but Ruby is closer to what you'll need when you support regex literals.

which regexes are vulnerable sometimes depends on how they are evaluated (prefixMatch, firstMatch, wholeMatch...). How is this addressed in other languages?

See matchesFullString in the Java implementation.
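
For readers unfamiliar with the three Swift evaluation modes under discussion, a small sketch of how they differ on the same input follows. The pattern and input are invented for illustration, and the code assumes Swift 5.7 or later (macOS 13+ on Apple platforms), where `Regex` is available.

```swift
// Hypothetical pattern and input, chosen only to separate the three modes.
let regex = try Regex("(a|b)+c")
let input = "xxabcyy"

// firstMatch may start anywhere and need not consume the whole input.
let first = try regex.firstMatch(in: input)

// prefixMatch must start at the beginning of the input.
let prefix = try regex.prefixMatch(in: input)

// wholeMatch must cover the entire input.
let whole = try regex.wholeMatch(in: input)

print(first != nil, prefix != nil, whole != nil)  // true false false
```

Because `wholeMatch` behaves as if the pattern were anchored at both ends, trailing input that does not fit can force backtracking even when the pattern itself has no rejecting suffix, which is presumably why the evaluation mode matters to the analysis.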

we're missing some results that are caught in other languages. Is something in my work obviously wrong or missing?

Concrete (and small) example?
Different dialects of regex are slightly different, and the tests focus on the nasty parts that are different.
E.g. if you've copied a regex with [^] from JS, then that won't work in other languages.
You unfortunately need to figure out what Swift does in all the small corner cases of its regex dialect.

there's one test case where QL evaluation times out (at 5 minutes for a test).

Hard to know 🤷
In the past there have been a bunch of timeouts from various regex parsers producing an ambiguous parse of the regex.
Or maybe some incorrect/ambiguous parsing of Unicode escapes... (Also seen before)

geoffw0 · 2023-06-15T13:35:24Z

Thanks for the feedback and for sharing your experiences. I have several areas to investigate now...


geoffw0 · 2023-06-21T17:32:04Z

I'm wondering if \w / \d / \s are parsed correctly?

Could you try to reduce it down to a minimal example that still produces the same error, and give me a database containing only that regex?

I minified it as far as #"(\w*foobarfoobarfoobarfoobarfoobarfoobarfoobarfoobar)+"#. I'll investigate the parsing of \w.
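
As a side note on the `#"..."#` syntax used in the minimal example: it is a Swift raw string literal, in which backslashes are not escape characters, so `\w` reaches the regex engine unchanged. The sketch below uses a shortened, made-up pattern (not the one from the test) to show the equivalence with a conventionally escaped literal.

```swift
// Shortened illustrative pattern; not the exact regex from the test suite.
let raw = #"(\w*foobar)+"#       // raw string: backslash is kept as-is
let escaped = "(\\w*foobar)+"    // ordinary string: backslash must be doubled

print(raw == escaped)  // true
print(raw)             // (\w*foobar)+
```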

geoffw0 · 2023-06-21T17:35:25Z

I'm also marking this PR 'ready for review' as I'd like to start getting reviews from the Swift team. The REDOS query itself can't be PR'd until we have at least a basic version of this library in place.


geoffw0 · 2023-06-21T18:01:56Z

The above commit "Swift: Do regex locations more like Ruby does them." unexpectedly fixes the timeout issue. I wasn't aware hasLocationInfo would affect anything except the output of the parse.ql test. 🤷 🎉

erik-krogh · 2023-06-21T18:08:16Z

I wasn't aware hasLocationInfo would affect anything except the output of the parse.ql test. 🤷 🎉

This code is why that matters:

codeql/shared/regex/codeql/regex/nfa/NfaUtils.qll

Lines 147 to 173 in 5afdaf8

    
    /**
     * Gets a string for the full location of `t`.
     */
    bindingset[t]
    pragma[inline_late]
    string getTermLocationString(RegExpTerm t) {
      exists(string file, int startLine, int startColumn, int endLine, int endColumn |
        t.hasLocationInfo(file, startLine, startColumn, endLine, endColumn) and
        result = file + ":" + startLine + ":" + startColumn + "-" + endLine + ":" + endColumn
      )
    }

    /**
     * Holds if `term` is the chosen canonical representative for all terms with string representation `str`.
     * The string representation includes which flags are used with the regular expression.
     *
     * Using canonical representatives gives a huge performance boost when working with tuples containing multiple `InputSymbol`s.
     * The number of `InputSymbol`s is decreased by 3 orders of magnitude or more in some larger benchmarks.
     */
    private predicate isCanonicalTerm(RelevantRegExpTerm term, string str) {
      term =
        min(RelevantRegExpTerm t |
          str = getCanonicalizationString(t)
        |
          t order by getTermLocationString(t), t.toString()
        )
    }

It selects a canonical representative based on locations (and toString()), because the pair of toString() and hasLocationInfo() should always be unique for a given term.
So if different terms have the same location, then my code can't distinguish the two, which makes things go bad.
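
One way to picture the problem is a toy model in Swift (not the QL implementation): terms are ordered by a (location, text) key and the minimum is taken as the representative. The `Term` type, `canonicalCandidates` function, and all location strings below are hypothetical; the point is only that distinct terms sharing both keys tie for the minimum, so no single representative is chosen.

```swift
// Illustrative model only; the type and function names are hypothetical.
struct Term {
    let id: Int           // identity of the term in this toy model
    let text: String      // stands in for toString()
    let location: String  // stands in for getTermLocationString(t)
}

// Returns every term that ties for the minimum (location, text) key.
func canonicalCandidates(_ terms: [Term]) -> [Term] {
    let keys = terms.map { ($0.location, $0.text) }
    guard let best = keys.min(by: { $0 < $1 }) else { return [] }
    return terms.filter { ($0.location, $0.text) == best }
}

// Two terms with the same text but different locations: one clear winner.
let distinct = [
    Term(id: 1, text: #"\w"#, location: "f.swift:1:5-1:6"),
    Term(id: 2, text: #"\w"#, location: "f.swift:2:5-2:6"),
]

// Two terms that share text and location: the ordering cannot separate them.
let colliding = [
    Term(id: 1, text: #"\w"#, location: "f.swift:1:1-1:1"),
    Term(id: 2, text: #"\w"#, location: "f.swift:1:1-1:1"),
]

print(canonicalCandidates(distinct).map(\.id))   // [1]
print(canonicalCandidates(colliding).map(\.id))  // [1, 2]
```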

geoffw0 · 2023-06-21T20:08:59Z

Thanks for the explanation.

MathiasVP

A few comments (mostly for my own understanding). If @erik-krogh is happy with the code then so am I 👍.

MathiasVP · 2023-06-22T09:04:59Z

swift/ql/lib/codeql/swift/regex/Regex.qll

+ * Regex("(a|b).*").firstMatch(in: myString)
+ * ```
+ */
+abstract class RegexEval extends CallExpr {


Out of curiosity: Is this meant to be extended by users if they want to model custom regex evaluations, or should we only expose a final view of this so that users can't extend it outside this file?

I suppose we shouldn't extend this class in our own queries since this would cause a bunch of re-evaluation (because I assume a bunch of things in the regex library are cached?)

Good question. It isn't intended, but perhaps it should be. Do you think anything should change about the design (or perhaps just the documentation) to reflect this?

If we want to prevent users (and future us) from accidentally extending the set of regex evaluators we could replace

    abstract class RegexEval extends CallExpr { ... }
    /* ... */
    private class AlwaysRegexEval extends RegexEval { ... }

with something like:

    // This is now private to prevent users from accidentally extending the domain of the class.
    private abstract class RegexEvalDomain extends CallExpr { ... }

    // ... and we now expose only a final view of the class.
    // This means that users writing `class MyRegexEval extends RegexEval { ... }` don't extend
    // the domain of the class, but instead define a subclass (like they most likely intended).
    final class RegexEval = RegexEvalDomain;

    /* ... */

    // And actual classes that need to extend the domain of the RegexEval class can do so by extending
    // `RegexEvalDomain` (which you can only do in this file since the `RegexEvalDomain` class is private).
    private class AlwaysRegexEval extends RegexEvalDomain { ... }

Note that "final extensions" is a very new QL feature so it won't be available until the next release. So we may want to wait with this change in order to not delay this PR.

That's interesting, but to be honest I don't yet see much reason to prevent users from extending the class if they want to.

Perhaps the two variables shouldn't be exposed in the public interface though:

    Expr regexInput;
    Expr stringInput;

as that does lock us in to that design somewhat?

That's interesting, but to be honest I don't yet see much reason to prevent users from extending the class if they want to.

Yeah, it's certainly nice that it can be extended. We just need to ensure that none of our own queries in any suite extends this class as it invalidates (at the very least) the dataflow analysis that depends on the set of regex evaluators so that it'll be re-evaluated in the offending query.

Perhaps the two variables shouldn't be exposed in the public interface though:

    Expr regexInput;
    Expr stringInput;

as that does lock us in to that design somewhat?

That's true. Since the class is already abstract we could just require the presence of two abstract predicates getRegexInput() and getStringInput().

MathiasVP · 2023-06-22T10:12:05Z

swift/ql/lib/codeql/swift/regex/internal/ParseRegex.qll

+          not exists(int x, int y |
+            this.posixStyleNamedCharacterProperty(x, y, _) and e >= x and e < y
+          )


This looks like something that would be more efficiently expressed using rank, but if this doesn't perform poorly for other languages it's probably fine 👍.

Good spot. I'm reluctant to change this because (1) we don't currently have Swift test coverage for this edge case (you can remove the whole not exists(...) without any tests failing) and (2) I'm not sure what a clean solution with rank would look like.

I've just added some new test cases around this stuff. There are a couple of spurious parse failures around posix named character properties, but none of them are affected by this particular edge case (I'm struggling to think what would be). The output of parse.ql LGTM.
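
For reference, the `not exists(...)` clause under discussion is a containment test: it rules out positions `e` that fall inside some half-open span `[x, y)`. A minimal Swift sketch of that test is below; the helper name and the span values are made up, and reading the spans as the extent of a `[:alpha:]`-style property is an assumption based on the predicate's name.

```swift
// Hypothetical helper: true when `e` lies in any half-open interval [start, end).
func isInsideAny(_ e: Int, _ intervals: [(start: Int, end: Int)]) -> Bool {
    return intervals.contains { e >= $0.start && e < $0.end }
}

// Made-up span, e.g. where a POSIX-style named property might sit in a pattern.
let propertySpans = [(start: 1, end: 10)]

print(isInsideAny(4, propertySpans))   // true
print(isInsideAny(10, propertySpans))  // false: the end bound is exclusive
```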

swift/ql/lib/codeql/swift/regex/internal/ParseRegex.qll

erik-krogh · 2023-06-22T10:45:20Z

If @erik-krogh is happy with the code then so am I 👍.

I haven't really looked at the implementation, and I'm not that much into regex parsing.
I'm happy as long as all the tests pass and the implementation doesn't diverge too much from the language it was copied from.
I trust Geoffrey on that last part.

Co-authored-by: Mathias Vorreiter Pedersen <mathiasvp@github.com>

geoffw0 · 2023-06-22T11:16:14Z

I'm happy as long as all the tests pass

Well, most of the tests pass; there are still a few MISSING, SPURIOUS and hasParseFailure tags to be found. Notably the tests are adapted from Java (which had a more complete test set) rather than Ruby, which explains why this has been a challenge. We have follow-up work planned where test results are still less than ideal.

and the implementation doesn't diverge too much from the language it was copied from.

RegexTreeView.qll and ParseRegex.qll were copied from Ruby and were not changed a lot, apart from removing regex literal support for now. We will want to add this back in at some point and will use the Ruby implementation as a guide when we do.

geoffw0 · 2023-06-22T11:22:55Z

Just merged in main and fixed an utterly trivial merge conflict in swift/ql/lib/qlpack.yml.

geoffw0 · 2023-06-22T16:20:52Z

Fixed the check failure (missing QLDoc in shared code I didn't actually touch...).

MathiasVP

LGTM!

erik-krogh

The big picture looks OK.

But I haven't looked into anything in detail.

geoffw0 added 20 commits June 5, 2023 23:55

Swift: Create test cases for a regular expression library.

c994b4b

Swift: Add a simple Regex library.

e04f6bf

Swift: Test the library.

053bf9a

Swift: Add regular expressions to SummaryStats.ql.

f7860a3

Swift: Create library test cases for REDOS vulnerable regexs.

9601134

Swift: Copy some library files from Ruby (as advised).

8ec3779

Swift: Trivial changes to get it compiling.

5f85b74

Swift: Include the shared regex pack in Swift.

d4c3e9e

Swift: Add REDOS analysis to the library test.

1e290b4

Swift: Add regex sources to the library.

7e9d73b

Swift: Add the cases from the (Ruby) qhelp to the library tests.

712c3cc

Swift: Identify strings that are used in regular expressions properly.

2ccbdbd

Swift: Add real world test cases.

c540568

Swift: Import more test cases from other languages (this highlights s…

44eb7bf

…ome issues).

Swift: Flag parse failures in the test.

63ab478

Swift: Escape the test cases in a better way (so escape characters do…

f93bf6a

…n't obscure what's going on).

Swift: Annotate tests based on real execution findings. Add some

8e8a9c8

relevant variants, remove some duplicates, add the testing script also.

Swift: Lots of small fixes / cleanup.

91b2de2

Swift: Autoformat + fix test indentation.

4a06394

Swift: Add another test case.

9e9ef42

geoffw0 added the Swift label Jun 15, 2023

github-advanced-security bot found potential problems Jun 15, 2023

erik-krogh reviewed Jun 15, 2023

swift/ql/test/library-tests/regex/redos_variants.swift

erik-krogh reviewed Jun 15, 2023

swift/ql/test/library-tests/regex/redos_variants.swift

Swift: Fix QL-for-QL warnings.

9b9b4a1

geoffw0 added 3 commits June 15, 2023 20:51

Swift: Add a test case for \Uhhhhhhhh character escapes.

05939bd

Swift: Add support for \Uhhhhhhhh escaped characters in regular expre…

49dfe5d

…ssions.

Swift: Add support for \u{hhhhhh} escaped characters in regular expre…

355793f

…ssions.

geoffw0 added 4 commits June 21, 2023 18:11

Swift: Test some edge cases for locations.

e127030

Swift: We don't need the location components logic in RegExpTerm, at l…

5a99007

…east, not yet.

Swift: Do regex locations more like Ruby does them.

5449bdc

Swift: Remove another bit of code that doesn't currently make sense i…

925477e

…n Swift.

geoffw0 marked this pull request as ready for review June 21, 2023 17:35

geoffw0 requested a review from a team as a code owner June 21, 2023 17:35

Swift: The perf. issue is fixed by above commit "Do regex locations m…

d3af8c5

…ore like Ruby does them."

MathiasVP reviewed Jun 22, 2023

Update swift/ql/lib/codeql/swift/regex/internal/ParseRegex.qll

90499c0

Co-authored-by: Mathias Vorreiter Pedersen <mathiasvp@github.com>

Merge branch 'main' into swiftregex

e6695e3

geoffw0 added 2 commits June 22, 2023 13:59

Swift: Correct QLDoc error.

c17de99

Shared: QLDoc NfaUtils::Make::State::hasLocationInfo.

a8aa335

geoffw0 added 3 commits June 23, 2023 14:11

Swift: Add some test cases aimed at regex parsing correctness.

8f69b2a

Swift: Fix typo in a comment.

987ca61

Swift: Make the RegexEval interface cleaner.

5cffa59

MathiasVP approved these changes Jun 23, 2023

erik-krogh approved these changes Jun 23, 2023

geoffw0 merged commit ca71d48 into github:main Jun 23, 2023

This was referenced Jun 23, 2023

Swift: Query for REDOS (Regular Expression Denial Of Service) #13548

Merged

Swift: Query for bad HTML filtering regexps #13549

Merged

github deleted a comment from Sugarpie86 Jun 26, 2023
