First: Thanks to the maintainers and community for all the hard work on Pygments!
An HTML file with a large, unclosed `<script>` block on a single line causes very poor lexing performance, only to emit a stream of Error tokens. This isn't too bad for short lines (less than a few KB), but for long lines (dozens of KB or more) lexing becomes extremely slow: I believe it's O(n^2) in the line length.
For example, this snippet:
is pretty fast when running `cat example | pygmentize -l html`,
but an example like
`<script>hihihihi...` (continuing for 30KB)
is extremely slow. I think there's an O(n^2) thing going on:
```
# little test rig: cat spits out <script> followed by "hi" for some number of repetitions
$ time cat <(echo -n "<script>") <(for x in $(seq 1 2000); do echo -n hi; done) | pygmentize -l html > /dev/null
real	0m0.302s
user	0m0.286s
sys	0m0.040s
$ time cat <(echo -n "<script>") <(for x in $(seq 1 4000); do echo -n hi; done) | pygmentize -l html > /dev/null
real	0m0.709s
user	0m0.683s
sys	0m0.052s
$ time cat <(echo -n "<script>") <(for x in $(seq 1 8000); do echo -n hi; done) | pygmentize -l html > /dev/null
real	0m2.368s
user	0m2.365s
sys	0m0.037s
$ time cat <(echo -n "<script>") <(for x in $(seq 1 16000); do echo -n hi; done) | pygmentize -l html > /dev/null
real	0m8.922s
user	0m8.934s
sys	0m0.042s
$ time cat <(echo -n "<script>") <(for x in $(seq 1 32000); do echo -n hi; done) | pygmentize -l html > /dev/null
real	0m36.664s
user	0m36.661s
sys	0m0.077s
```
Notice that each run doubles the input length, but the runtime goes up by a factor of nearly 4. That's the signature of quadratic behavior: if t ≈ c·n^2, then doubling n quadruples t.
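For illustration, here's a hypothetical sketch of the kind of failure mode I suspect (this is a standalone model, not Pygments' actual code): a rule that lazily scans ahead for the closing tag. When `</script>` never appears, each scan fails after O(n) work, the lexer emits one Error token, advances a single character, and retries, so n scans of O(n) each gives O(n^2) total:

```python
import re
import time

# Assumed model of the failure mode, NOT Pygments internals: a lazy
# lookahead for the closing tag that is retried at every position.
CLOSE = re.compile(r'(?s).*?</script>')

def lex_unclosed(text):
    """Count the Error tokens emitted for input with no </script>."""
    pos, errors = 0, 0
    while pos < len(text):
        m = CLOSE.match(text, pos)
        if m:
            pos = m.end()   # consume up to and including </script>
        else:
            errors += 1     # emit one Error token ...
            pos += 1        # ... and advance a single character
    return errors

if __name__ == "__main__":
    # Doubling the input roughly quadruples the runtime, matching the
    # pygmentize timings above.
    for n in (2000, 4000, 8000):
        payload = "hi" * n
        t0 = time.perf_counter()
        lex_unclosed(payload)
        print(f"{2 * n:>6} chars: {time.perf_counter() - t0:.3f}s")
```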
I think this is a legitimate issue that needs addressing because:
- it's not uncommon to have large, obfuscated javascript snippets on a single line
- it's quite possible for an incomplete snippet (with no closing `</script>`) to show up in, e.g., a diff hunk
In such cases I think we at least want to fail quickly (36 seconds of runtime for a ~64KB input seems way too long just to emit an Error token for each character), but ideally parse as much as we can as JavaScript.
I found this in a diff hunk with a ~400KB line with no closing </script>. My app timed out after 10 minutes.
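In the meantime, a possible application-side stopgap (an assumption about the consumer side, not a proposed Pygments patch) is to cap line length before handing text to the highlighter, so pathological single-line inputs can't blow up lexing time; a minimal sketch, with `MAX_LINE` an assumed limit you'd tune per application:

```python
# Hypothetical workaround in the calling application, not a Pygments fix:
# truncate pathologically long lines before highlighting. MAX_LINE is an
# assumed limit; tune it for your app. Note splitlines() drops a trailing
# newline, which is fine for a sketch.
MAX_LINE = 10_000

def truncate_long_lines(text: str, limit: int = MAX_LINE) -> str:
    return "\n".join(
        line if len(line) <= limit else line[:limit]
        for line in text.splitlines()
    )
```

This loses highlighting fidelity past the cutoff, but it bounds the worst case until the lexer itself fails fast.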
I'll follow up with some more thoughts on why this happens and ideas for a fix.