stash: avoid recomputing tree when committing worktree by pks-t · Pull Request #5113 · libgit2/libgit2

pks-t · 2019-06-14T11:03:44Z

When creating a new stash, we need to create three separate
commits storing differences stored in the index, untracked
changes as well as differences in the working directory. The
first two will only be done conditionally if the equivalent
options "git stash --keep-index --include-untracked" are being
passed to git_stash_save, but even when only creating a stash
of worktree changes we're much slower than git.git. Using our new
stash example:

    $ time git stash
    Saved working directory and index state WIP on (no branch): 2f7d9d47575e Linux 5.1.7

    real    0m0.528s
    user    0m0.309s
    sys     0m0.381s

    $ time lg2 stash

    real    0m27.165s
    user    0m13.645s
    sys     0m6.403s

As can be seen, libgit2 is more than 50x slower than git.git!

When creating the stash commit that includes all worktree
changes, we create a completely new index to prepare for the new
commit and populate it with the entries contained in the index's
tree. Here comes the catch: by populating the index with a tree's
contents, we do not have any stat caches in the index. This means
that we have to re-validate every single file from the worktree
and see whether it has changed.

The issue can be fixed by populating the new index with the
repo's existing index instead of with the tree. This retains all
stat cache information, and thus we really only need to check
files that have changed stat information. This is semantically
equivalent to what we previously did: previously, we used the
tree of the commit computed from the index. Now we're just using
the index directly.
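
To make the effect concrete, here is a minimal, self-contained sketch in plain C. It is a toy model, not libgit2's implementation: `struct entry`, `struct file_stat`, `fill_from_tree`, `fill_from_index` and `count_rehashes` are hypothetical names invented for illustration, and an `mtime` of 0 is an assumed marker for "no stat cache". It only shows why an index built from a tree forces every file to be re-validated, while a copy of the existing index re-validates just the files whose stat data changed.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified index entry; not libgit2's git_index_entry. */
struct entry {
	const char *path;
	long mtime; /* cached modification time; assumed marker: 0 = no stat cache */
	long size;  /* cached file size */
};

/* Hypothetical snapshot of a file's current stat data in the worktree. */
struct file_stat {
	long mtime;
	long size;
};

/* A file must be re-read and re-hashed unless the cached stat data matches. */
static int needs_rehash(const struct entry *e, const struct file_stat *st)
{
	if (e->mtime == 0)
		return 1; /* nothing cached, e.g. the entry came from a tree */
	return e->mtime != st->mtime || e->size != st->size;
}

/* Count how many of the n entries would have to be re-validated. */
static size_t count_rehashes(const struct entry *entries,
	const struct file_stat *stats, size_t n)
{
	size_t i, count = 0;

	assert(entries != NULL && stats != NULL);
	for (i = 0; i < n; i++)
		count += (size_t)needs_rehash(&entries[i], &stats[i]);
	return count;
}

/* Model of populating an index from a tree: paths survive, stat data does not. */
static void fill_from_tree(struct entry *dst, const struct entry *src, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++) {
		dst[i].path = src[i].path;
		dst[i].mtime = 0;
		dst[i].size = 0;
	}
}

/* Model of copying the repository index: stat data is carried over as-is. */
static void fill_from_index(struct entry *dst, const struct entry *src, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++)
		dst[i] = src[i];
}
```

For example, take three entries cached as (mtime, size) = (100, 10), (200, 20), (300, 30) and a worktree where only the second file now stats as (250, 22). In this model, `count_rehashes` reports 3 for the copy made with `fill_from_tree` and 1 for the copy made with `fill_from_index`.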

And, in fact, the cache is doing wonders:

    $ time lg2 stash

    real    0m1.836s
    user    0m1.166s
    sys     0m0.663s

We're now roughly 15x faster than before and only about 3.5x
slower than git.git.
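
For reference, these are the ratios behind the rounded figures, computed from the `real` timings quoted above (27.165s before the change, 1.836s after it, 0.528s for git.git):

```latex
\frac{27.165}{1.836} \approx 14.8, \qquad
\frac{1.836}{0.528} \approx 3.5, \qquad
\frac{27.165}{0.528} \approx 51.4
```

These come from a single run on one repository, so they are indicative rather than precise.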

Note that this also contains a new stash example. I'd been too lazy to come up with a more complex setup, so I decided to just implement "lg2 stash" and use that to test performance.

pks-t · 2019-06-14T11:04:43Z

Fixes #3910

ethomson · 2019-06-14T13:07:19Z

-		goto cleanup;
+	if ((error = git_repository_index(&r_index, repo) < 0) ||
+	    (error = git_index_new(&i_index)) < 0 ||
+	    (error = git_index__fill(i_index, &r_index->entries) < 0) ||


Hmm, I haven't done a deep analysis here, but is there not a behavior difference? Previously, we'd create an index from the commit tree, now we're using the existing index. So if there are staged changes, those will now be included in i_index while they were not before.

(Apologies if there's already a guard here to avoid that).

I think that we have a stat-preserving way to make these match if that's needed. I'm reasonably sure that git_index_read_tree will actually preserve the stat cache. So adding that back after filling i_index with stat information should keep i_index identical to i_tree with the correct stat information (for matching entries).

But maybe I'm off base here and this function only gets called when i_index == HEAD, in which case, there's no reason to worry about that. Indeed, I would expect unit tests to catch this case, but I wanted to mention it while it was top of mind for me.

I can do a more detailed 👀 soon.

implausible · 2019-07-02

Any chance you're free to double check this soon 😄 @ethomson.

pks-t · 2019-06-14T13:56:47Z

I _think_ we're fine here, even though I initially wondered about the exact same thing. Note that the `i_commit` passed to `commit_worktree` is actually created from the current repo's index. Thus, by using the `i_tree` taken from `i_commit`, we were essentially already using the contents of the index. Now we just avoid going via the tree and instead copy the index directly, including all of its stat caches.
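
The argument can be sketched with a small, self-contained C model. All names here (`struct tree_item`, `struct idx_item`, `model_write_tree`, `model_read_tree`, `model_read_tree_keep_stat`, `same_content`) are hypothetical and are not libgit2 APIs. The sketch assumes that a tree records only a path and an object id per entry, and that two indexes have the same content when those two fields agree. Under those assumptions, going index → tree → index yields the same content as copying the index, minus the stat data. The last helper models the stat-preserving tree read suggested earlier in this thread, assuming it keeps stat data only for entries that still match.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for this sketch; these are not libgit2 types. */
struct tree_item {
	const char *path;
	unsigned id; /* stands in for an object id */
};

struct idx_item {
	const char *path;
	unsigned id;
	long mtime; /* assumed marker: 0 means "no stat cache" */
};

/* Writing a tree from an index keeps path and id and drops stat data. */
static void model_write_tree(struct tree_item *tree,
	const struct idx_item *index, size_t n)
{
	size_t i;

	assert(tree != NULL && index != NULL);
	for (i = 0; i < n; i++) {
		tree[i].path = index[i].path;
		tree[i].id = index[i].id;
	}
}

/* Old approach: a fresh index read from the tree starts without stat data. */
static void model_read_tree(struct idx_item *index,
	const struct tree_item *tree, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++) {
		index[i].path = tree[i].path;
		index[i].id = tree[i].id;
		index[i].mtime = 0;
	}
}

/*
 * Reading a tree into an already populated index: an entry keeps its stat
 * data only if path and id still match the tree. Both arrays are assumed
 * to hold the same number of entries in the same order.
 */
static void model_read_tree_keep_stat(struct idx_item *index,
	const struct tree_item *tree, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++) {
		if (strcmp(index[i].path, tree[i].path) != 0 ||
		    index[i].id != tree[i].id) {
			index[i].path = tree[i].path;
			index[i].id = tree[i].id;
			index[i].mtime = 0;
		}
	}
}

/* Content equality ignores stat data: only paths and ids are compared. */
static int same_content(const struct idx_item *a,
	const struct idx_item *b, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (strcmp(a[i].path, b[i].path) != 0 || a[i].id != b[i].id)
			return 0;
	return 1;
}
```

In this model, an index rebuilt via `model_write_tree` followed by `model_read_tree` compares equal under `same_content` to a plain copy of the original index, but only the plain copy retains non-zero `mtime` values.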

pks-t · 2019-06-24T16:00:11Z

Ping :)

tiennou

Just a minor documentation nitpick, as I don't know my way around indexes anyways. Kudos for the nice improvement though.

tiennou · 2019-06-26T12:43:55Z

+	return 0;
+}
+
+int lg2_stash(git_repository *repo, int argc, char *argv[])


Nothing standing out here code-wise: I just want to point out that Docurium seemingly chokes on examples with no docblocks (add.c has that problem). But it could be something else; I haven't looked at the example processing system at all.

pks-t · 2019-06-27

"add.c" does have docblocks, though, so I'd expect it to be something else. Does this break Docurium in a way that needs to be fixed before we merge this, or can we just proceed? I'm obviously asking out of sheer laziness 🙄

tiennou · 2019-06-27

No worries, it's just a reminder (mostly to me). It's a Docurium bug in any case, and on the off chance it is docblock-related, it doesn't cause anything other than a blank example page.

IIRC most examples have a "main" docblock at the top that describes "usage", that'd be fine by me. Or this can be merged, and I'll check when I get near Docurium again.

pks-t · 2019-07-11T19:10:32Z

Too many branches, deleted the wrong one by accident 🙄

pks-t · 2019-07-20T16:47:34Z

Rebased to fix conflicts

ethomson · 2019-08-11T22:42:35Z

OK, I've read through our tests and they seem very thorough. And I've run this through manual testing, with a variety of changes in the index and working directory, and this matches the behavior of git in all of them.

Thanks for doing this @pks-t !

gjcampbell · 2019-09-26T14:22:54Z

This fix was awesome. Can --untracked-changes be optimized as well?

pks-t · 2019-09-27T10:22:11Z

> This fix was awesome. Can --untracked-changes be optimized as well?

They probably could be optimized, but that'd require a proper test case showing that we perform much worse than e.g. the official git implementation. Please feel free to create an issue with a reproducer; that would make it much more likely for somebody to tackle the issue. :)

pks-t mentioned this pull request Jun 14, 2019

git_stash_save really slow on large repositories #3910

Closed

pks-t force-pushed the pks/stash-perf branch from 23c6dbb to 40b4989 Compare June 14, 2019 11:31

ethomson reviewed Jun 14, 2019

tiennou reviewed Jun 26, 2019

pks-t closed this Jul 11, 2019

pks-t deleted the pks/stash-perf branch July 11, 2019 19:06

pks-t restored the pks/stash-perf branch July 11, 2019 19:10

pks-t reopened this Jul 11, 2019

pks-t force-pushed the pks/stash-perf branch from 40b4989 to 2071f81 Compare July 20, 2019 16:47

pks-t added 2 commits July 20, 2019 19:10

examples: implement git-stash example

88731e3

Implement a new example that resembles the git-stash(1) command. Right now, it only provides the apply, list, save and pop subcommands without any options. This example is mostly used to test libgit2's stashing performance on big repositories.

pks-t force-pushed the pks/stash-perf branch from 2071f81 to a7d32d6 Compare July 20, 2019 17:11

ethomson merged commit 5774b2b into libgit2:master Aug 11, 2019

pks-t deleted the pks/stash-perf branch August 23, 2019 10:40
