Skip to content

Optimize string comparisons#5018

Merged
ethomson merged 1 commit intolibgit2:masterfrom
romkatv:strings
Apr 4, 2019
Merged

Optimize string comparisons#5018
ethomson merged 1 commit intolibgit2:masterfrom
romkatv:strings

Conversation

@romkatv
Copy link
Copy Markdown
Contributor

@romkatv romkatv commented Mar 14, 2019

See #4230 (comment) for context. This is diff optimization PR 1/3. The smallest of the three both in terms of code delta and impact.

Three hand-rolled string comparison functions -- git__prefixcmp, git__strcmp and git__strncmp -- are showing up on the CPU profile when scanning working directory of chromium repository on Linux. This PR optimizes git__prefixcmp and replaces the other two functions with standard equivalents (unless GIT_WIN32 is defined).

Test:

cd ~/libgit2
mkdir build
cd build
cmake \
  -DUSE_ICONV=OFF           \
  -DBUILD_CLAR=ON           \
  -DUSE_SSH=OFF             \
  -DUSE_HTTPS=OFF           \
  -DTHREADSAFE=OFF          \
  -DBUILD_SHARED_LIBS=OFF   \
  -DUSE_EXT_HTTP_PARSER=OFF \
  -DUSE_BUNDLED_ZLIB=ON     \
  -DBUILD_EXAMPLES=ON       \
  -DCMAKE_BUILD_TYPE=Release ..
make -j 20
./libgit2_clar

Benchmark:

git clone https://github.com/chromium/chromium.git ~/chromium
cd ~/chromium
~/libgit2/build/examples/lg2 status >/dev/null &&
  time (repeat 100; do ~/libgit2/build/examples/lg2 status >/dev/null; done)
  • Baseline: 112.11s user 62.75s system 99% cpu 2:55.15 total
  • With this PR: 104.80s user 62.93s system 99% cpu 2:47.96 total

A modest 4% improvement. Working directory traversal increases more than that. The benchmark spends a lot of time doing other things, so the apparent impact of the PR is diluted.

Note:

  • Built and tested only on Ubuntu 18.04 with gcc 8.2. I didn't even attempt to compile on Windows.
  • Benchmarked only on chromium repository on a single machine running Ubuntu 18.04.

Comment thread src/util.h Outdated
#else
return strncmp(a, b, sz);
#endif
}
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd be surprised if the libc git_strcmp would actually be any faster. Also, our git__strncmp implementation is nearly the same as that of musl libc, so there's no performance gain to be expected compared to that. That being said, glibc uses an unrolled loop for strncmp while n > 4 to enable the compiler to vectorize code, which is probably why you do see performance differences.

So because of this I'm fine with your proposal in general. Personally, I'd vote for having defines though instead of inlined code. E.g.

#ifdef GIT_WIN32
GIT_INLINE(int) git__strcmp(const char *a, const char *b)
{
    ....
}
#else
# define git__strcmp strcmp
#endif

By the way, why can't we use strcmp and strncmp on Win32 platforms?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

According to @ethomson, strcmp does locale-sensitive comparision on windows: #4230 (comment)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I need to validate that this is still happening... this started in a prerelease of the 2015 (IIRC) C runtime. It's very possible that the change got reverted before it shipped and we've been unnecessarily shipping our own strcmp for years.

I haven't looked at the assembler output but yeah, I assume that there's vectorization opportunities that we've not taken. In any case, let me look into this in case we can ⚡️ it on win32 also. I'll 👀 this weekend (sorry, busy day today).

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd be surprised if the libc git_strcmp would actually be any faster.

It speeds up my benchmark by 0.85%.

So because of this I'm fine with your proposal in general. Personally, I'd vote for having defines though instead of inlined code.

I usually avoid having the same symbol being a function in one configuration and a macro in another. It can cause trouble when signatures diverge, or when there is hiding involved. And in general fewer macros is better.

However, I'll do anything you ask me to. Shall I change the code the way you suggested?

I'll eyes this weekend (sorry, busy day today).

@ethomson I'm waiting for you to come back. It's not urgent. Just so you know that the ball is in your court.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd be surprised if the libc git_strcmp would actually be any faster.
It speeds up my benchmark by 0.85%.

Huh, I wonder why that is. I guess you didn't take a lot at the generated assembly, right? No need to do so if you didn't, I'm simply curious.

However, I'll do anything you ask me to. Shall I change the code the way you suggested?

I'd say we should first wait for @ethomson. No need to make you perform additional work when the requirements aren't quite clear yet :)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should I remove git__strcmp and git__strncmp and replace all calls to them with the standard equivalents?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mhh. I don't think the code churn would be worth it. Instead, I'd just remove the self-rolled implementations and unconditionally #define git__strcmp strcmp

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. And after much trial and error I managed to merge upstream changes and squash my commits into one.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! @ethomson, any remarks?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the #defines are okay. IIRC, in the past we could not simply do this since we take function pointers to git__strcmp (in status.c for example) so that we can switch easily between git__strcmp and git__strcasecmp. The problem is that we run into calling convention problems, depending on how we're built vs how the MSVCRT runtime is built.

It looks like since we removed the ability to build as stdcall that we're 👍 here now.

@romkatv
Copy link
Copy Markdown
Contributor Author

romkatv commented Mar 19, 2019

I guess you didn't take a lot at the generated assembly, right?

Right, I haven't looked at the assembly. This PR was produced with the following methodology.

  1. Run time lg2 status on chromium repo.
  2. Generate CPU profile for lg2 status in chromium repo. I used perf + perf_data_converter + pprof for this.
  3. Take one of the hand-rolled string comparison functions from the top CPU consumers and replace it with a standard function.
  4. Run time lg2 status on chromium repo. If it's faster than before, keep the change. Otherwise revert it.
  5. Repeat.

@ethomson
Copy link
Copy Markdown
Member

ethomson commented Apr 4, 2019

Thanks for the fix @romkatv!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants