Skip to content

Optimize double-width safe_substr when all double-width.#114

Merged
schlessera merged 8 commits into
masterfrom
double_width_optimize
Aug 1, 2017
Merged

Optimize double-width safe_substr when all double-width.#114
schlessera merged 8 commits into
masterfrom
double_width_optimize

Conversation

@gitlost
Copy link
Copy Markdown
Contributor

@gitlost gitlost commented Jul 29, 2017

Related #111

Looking at the @miya0001's test table at #111 (review) it struck me that most of the time a column with double-width chars will have a number of entries with double-width content only, which suggests this simple optimization which checks for this using a preg_match_all() and just halves the length if so.

Crude benchmarking suggests a performance win if the percentage of such entries is above 10% or so, and only a small (2%) penalty if it's less than that, and a major win (100s of %) if it's anything above 50%, which I'm guessing is most likely to be the case - @miya0001 could you comment on whether this is likely in your experience for real data?

@miya0001
Copy link
Copy Markdown
Member

miya0001 commented Jul 29, 2017

@gitlost

Test table at #111 is using a theme unit testing data Japanese version.
https://www.evernote.com/l/ABW2q3EZHXVLw46aqapMXaQJpH8IcymzIZMB/image.png

Following is a result of my blog.

Almost all posts in my blog is writing about programming, so it has many single-width chars.
I think that general blogs are not so, but I think that numbers are often used.

Anyway I found a new problem, if string has a single-width chars with odd number of times, the table column will be slidden. 😓

@miya0001
Copy link
Copy Markdown
Member

miya0001 commented Jul 29, 2017

@gitlost
Copy link
Copy Markdown
Contributor Author

gitlost commented Jul 29, 2017

Another excellent catch. I think the problem with the current code is that it's rounding up instead of rounding down. I'm going to push (what I think is) a fix for that now, incorporating your test. If you could check it over and perhaps add more tests it would be great.

How do you think about this idea?

I think it works great, though doing it that way is going to be slow.

@miya0001
Copy link
Copy Markdown
Member

If you could check it over and perhaps add more tests it would be great.

Sure! I'll send PR.

@gitlost
Copy link
Copy Markdown
Contributor Author

gitlost commented Jul 29, 2017

Thanks @miya0001 for extra test (and for everything else!).

Based on your feedback I'm leaning towards not bothering with the "optimization" but just keeping the fix, particularly as I didn't realize that half-width characters such as were much used, but I'll let it stew for a day or two... Also if there's any more issues with the current solution then using yours would make sense...

@miya0001
Copy link
Copy Markdown
Member

@gitlost

"カ" is a Halfwidth Katakana with UTF-8.
I am not sure, but I guess that there are single-width multibyte chars in the World, Halfwidth Katakana is one of them.
I guess this patch is working fine for those chars for now.

I am worrying about the regex, it looks very complexity.
If we have more problem, I am thinking that we have to bring another solution like mine.

But it looks working fine, I think we can merge this PR. 👍

Thanks! 😄

@schlessera schlessera added this to the 0.11.6 milestone Aug 1, 2017
@schlessera schlessera merged commit 37cf6e1 into master Aug 1, 2017
@schlessera schlessera deleted the double_width_optimize branch August 1, 2017 15:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants