Simplify or optimize regexes with polynomial time worst cases#44197
Conversation
|
@typescript-bot perf test this |
|
Heya @weswigham, I've started to run the perf test suite on this PR at bbb2dee. You can monitor the build here. Update: The results are in! |
|
@weswigham Here they are:Comparison Report - master..44197
System
Hosts
Scenarios
Developer Information: |
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Looks like perf is either the same or ever so slightly positive, kinda as expected. |
amcasey
left a comment
There was a problem hiding this comment.
LGTM. Are there instructions somewhere for checking whether a regex is bad?
Not yet! We'll probably have to wait until the research is published for a formal tool. (I have a prerelease build, but the output is pretty researchy). There are some other, existing tools tied to other research papers, but they involving checking against a DB of known-bad regexes online, or identifying known-bad substructures, which doesn't really help as much when making novel regexes. @robmcl4 may have more concrete plans for the software, I can't really speak for him there. |
Orrrr, you could rewrite it in Rust today and be instantly famous. |
Co-authored-by: David Michon <dmichon-msft@users.noreply.github.com>
A friend of mine, @robmcl4, is a PhD student as UCSB and is doing some security research involving regex performance. In the course of that, he's developed a kind of fuzzer for v8's regex engine that's pretty good at identifying regexes that have polynomial or worse worst-case time in specifically v8's regex implementation. In doing said research, he happened to run it over the regexes in our codebase (along with many other codebases).
In this PR, I fix all the regexes said fuzzer indicates may be problematic so they are more performant. (I checked any replacements don't also do very poorly using said fuzzer).
Most of the time, fixing the perf just means using less regexes, however one particular example stands out! A regex we use a ton in multiple places,
/\s+$/g, actually has terrible performance when used to trim whitespace from a string with a lot of leading whitespace (check the benchmark here). Lines having leading whitespace is pretty common, so I'm hoping these fixes might actually have some small but measurable real-world perf gains.