stash: avoid recomputing tree when committing worktree#5113
stash: avoid recomputing tree when committing worktree#5113ethomson merged 2 commits intolibgit2:masterfrom
Conversation
|
Fixes #3910 |
| goto cleanup; | ||
| if ((error = git_repository_index(&r_index, repo) < 0) || | ||
| (error = git_index_new(&i_index)) < 0 || | ||
| (error = git_index__fill(i_index, &r_index->entries) < 0) || |
There was a problem hiding this comment.
Hmm, I haven't done a deep analysis here, but is there not a behavior difference? Previously, we'd create an index from the commit tree, now we're using the existing index. So if there are staged changes, those will now be included in i_index while they were not before.
(Apologies if there's already a guard here to avoid that).
I think that we have a stat-preserving way to make these match if that's needed. I'm reasonably sure that git_index_read_tree will actually preserve the stat cache. So adding that back after filling i_index with stat information should keep i_index identical to i_tree with the correct stat information (for matching entries).
But maybe I'm off base here and this function only gets called when i_index == HEAD, in which case, there's no reason to worry about that. Indeed, I would expect unit tests to catch this case, but I wanted to mention it while it was top of mind for me.
I can do a more detailed 👀 soon.
There was a problem hiding this comment.
Any chance you're free to double check this soon 😄 @ethomson.
|
On Fri, Jun 14, 2019 at 06:07:36AM -0700, Edward Thomson wrote:
ethomson commented on this pull request.
parents[0] = b_commit;
parents[1] = i_commit;
parents[2] = u_commit;
- if ((error = git_commit_tree(&i_tree, i_commit)) < 0)
- goto cleanup;
-
- if ((error = git_index_new(&i_index)) < 0 ||
- (error = git_repository__cvar(&ignorecase, repo, GIT_CVAR_IGNORECASE)) < 0)
- goto cleanup;
+ if ((error = git_repository_index(&r_index, repo) < 0) ||
+ (error = git_index_new(&i_index)) < 0 ||
+ (error = git_index__fill(i_index, &r_index->entries) < 0) ||
Hmm, I haven't done a deep analysis here, but is there not a
behavior difference? Previously, we'd create an index from the
commit tree, now we're using the existing index. So if there
are staged changes, those will now be included in `i_index`
while they were not before.
(Apologies if there's already a guard here to avoid that).
I think that we have a stat-preserving way to make these match
if that's needed. I'm reasonably sure that
`git_index_read_tree` will actually preserve the stat cache.
So adding that back after filling `i_index` with stat
information should keep `i_index` identical to `i_tree` with
the correct stat information (for matching entries).
But maybe I'm off base here and this function only gets called
when `i_index` == HEAD, in which case, there's no reason to
worry about that. Indeed, I would expect unit tests to catch
this case, but I wanted to mention it while it was top of mind
for me.
I can do a more detailed 👀 soon.
I _think_ we're fine here, even though I initially wondered about
the exact same thing. Note: the `i_commit` passed in to
`commit_worktree` is actually created from the current repo's
index. Thus by using `i_tree` from `i_index`, we were essentially
already using the contents from the index. Now we just avoid
going via the tree, but directly copy the index including all its
stats caches.
|
|
Ping :) |
tiennou
left a comment
There was a problem hiding this comment.
Just a minor documentation nitpick, as I don't know my way around indexes anyways. Kudos for the nice improvement though.
| return 0; | ||
| } | ||
|
|
||
| int lg2_stash(git_repository *repo, int argc, char *argv[]) |
There was a problem hiding this comment.
Nothing standing out here codewise: I just want to point out that Docurium seemingly chokes on examples with no docblocks (add.c has that problem). But it could be something else — I haven't looked at the example processing system at all.
There was a problem hiding this comment.
"add.c" does have docblocks, though, so I'd expect it to be something else. Does this break Docurium in a way that needs to be fixed before we merge this or can we just proceed? I'm obviously asking out of sheer lazyness 🙄
There was a problem hiding this comment.
No worries, it's just a reminder (mostly to me), it's a Docurium bug in any case, and — on the offchance it is docblock-related — it doesn't cause anything else than a blank example page.
IIRC most examples have a "main" docblock at the top that describes "usage", that'd be fine by me. Or this can be merged, and I'll check when I get near Docurium again.
|
Too many branches, deleted the wrong one by accident 🙄 |
|
Rebased to fix conflicts |
Implement a new example that resembles the git-stash(1) command. Right now, it only provides the apply, list, save and pop subcommands without any options. This example is mostly used to test libgit2's stashing performance on big repositories.
When creating a new stash, we need to create there separate commits storing differences stored in the index, untracked changes as well as differences in the working directory. The first two will only be done conditionally if the equivalent options "git stash --keep-index --include-untracked" are being passed to `git_stash_save`, but even when only creating a stash of worktree changes we're much slower than git.git. Using our new stash example: $ time git stash Saved working directory and index state WIP on (no branch): 2f7d9d47575e Linux 5.1.7 real 0m0.528s user 0m0.309s sys 0m0.381s $ time lg2 stash real 0m27.165s user 0m13.645s sys 0m6.403s As can be seen, libgit2 is more than 50x slower than git.git! When creating the stash commit that includes all worktree changes, we create a completely new index to prepare for the new commit and populate it with the entries contained in the index' tree. Here comes the catch: by populating the index with a tree's contents, we do not have any stat caches in the index. This means that we have to re-validate every single file from the worktree and see whether it has changed. The issue can be fixed by populating the new index with the repo's existing index instead of with the tree. This retains all stat cache information, and thus we really only need to check files that have changed stat information. This is semantically equivalent to what we previously did: previously, we used the tree of the commit computed from the index. Now we're just using the index directly. And, in fact, the cache is doing wonders: time lg2 stash real 0m1.836s user 0m1.166s sys 0m0.663s We're now performing 15x faster than before and are only 3x slower than git.git now.
|
OK, I've read through our tests and they seem very thorough. And I've run this through manual testing, with a variety of changes in the index and working directory, and this matches the behavior of git in all of them. Thanks for doing this @pks-t ! |
|
This fix was awesome. Can --untracked-changes be optimized as well? |
They probably could be optimized, but that'd require a proper test case that shows that we perform much worse than e.g. the official git implementation. Please feel free to create an issue with a reproducer, this would make it much more likely for anybody to tackle the issue. :) |
When creating a new stash, we need to create there separate
commits storing differences stored in the index, untracked
changes as well as differences in the working directory. The
first two will only be done conditionally if the equivalent
options "git stash --keep-index --include-untracked" are being
passed to
git_stash_save, but even when only creating a stashof worktree changes we're much slower than git.git. Using our new
stash example:
As can be seen, libgit2 is more than 50x slower than git.git!
When creating the stash commit that includes all worktree
changes, we create a completely new index to prepare for the new
commit and populate it with the entries contained in the index'
tree. Here comes the catch: by populating the index with a tree's
contents, we do not have any stat caches in the index. This means
that we have to re-validate every single file from the worktree
and see whether it has changed.
The issue can be fixed by populating the new index with the
repo's existing index instead of with the tree. This retains all
stat cache information, and thus we really only need to check
files that have changed stat information. This is semantically
equivalent to what we previously did: previously, we used the
tree of the commit computed from the index. Now we're just using
the index directly.
And, in fact, the cache is doing wonders:
We're now performing 15x faster than before and are only 3x
slower than git.git now.
Note that this also contains a new stash example. I'd been too lazy to come up with a more complex setup, so I decided to just implement "lg2 stash" and use that to test performance.