Skip to content

stash: avoid recomputing tree when committing worktree#5113

Merged
ethomson merged 2 commits intolibgit2:masterfrom
pks-t:pks/stash-perf
Aug 11, 2019
Merged

stash: avoid recomputing tree when committing worktree#5113
ethomson merged 2 commits intolibgit2:masterfrom
pks-t:pks/stash-perf

Conversation

@pks-t
Copy link
Copy Markdown
Member

@pks-t pks-t commented Jun 14, 2019

When creating a new stash, we need to create there separate
commits storing differences stored in the index, untracked
changes as well as differences in the working directory. The
first two will only be done conditionally if the equivalent
options "git stash --keep-index --include-untracked" are being
passed to git_stash_save, but even when only creating a stash
of worktree changes we're much slower than git.git. Using our new
stash example:

    $ time git stash
    Saved working directory and index state WIP on (no branch): 2f7d9d47575e Linux 5.1.7

    real    0m0.528s
    user    0m0.309s
    sys     0m0.381s

    $ time lg2 stash

    real    0m27.165s
    user    0m13.645s
    sys     0m6.403s

As can be seen, libgit2 is more than 50x slower than git.git!

When creating the stash commit that includes all worktree
changes, we create a completely new index to prepare for the new
commit and populate it with the entries contained in the index'
tree. Here comes the catch: by populating the index with a tree's
contents, we do not have any stat caches in the index. This means
that we have to re-validate every single file from the worktree
and see whether it has changed.

The issue can be fixed by populating the new index with the
repo's existing index instead of with the tree. This retains all
stat cache information, and thus we really only need to check
files that have changed stat information. This is semantically
equivalent to what we previously did: previously, we used the
tree of the commit computed from the index. Now we're just using
the index directly.

And, in fact, the cache is doing wonders:

    time lg2 stash

    real    0m1.836s
    user    0m1.166s
    sys     0m0.663s

We're now performing 15x faster than before and are only 3x
slower than git.git now.


Note that this also contains a new stash example. I'd been too lazy to come up with a more complex setup, so I decided to just implement "lg2 stash" and use that to test performance.

@pks-t
Copy link
Copy Markdown
Member Author

pks-t commented Jun 14, 2019

Fixes #3910

Comment thread src/stash.c
goto cleanup;
if ((error = git_repository_index(&r_index, repo) < 0) ||
(error = git_index_new(&i_index)) < 0 ||
(error = git_index__fill(i_index, &r_index->entries) < 0) ||
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I haven't done a deep analysis here, but is there not a behavior difference? Previously, we'd create an index from the commit tree, now we're using the existing index. So if there are staged changes, those will now be included in i_index while they were not before.

(Apologies if there's already a guard here to avoid that).

I think that we have a stat-preserving way to make these match if that's needed. I'm reasonably sure that git_index_read_tree will actually preserve the stat cache. So adding that back after filling i_index with stat information should keep i_index identical to i_tree with the correct stat information (for matching entries).

But maybe I'm off base here and this function only gets called when i_index == HEAD, in which case, there's no reason to worry about that. Indeed, I would expect unit tests to catch this case, but I wanted to mention it while it was top of mind for me.

I can do a more detailed 👀 soon.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any chance you're free to double check this soon 😄 @ethomson.

@pks-t
Copy link
Copy Markdown
Member Author

pks-t commented Jun 14, 2019 via email

@pks-t
Copy link
Copy Markdown
Member Author

pks-t commented Jun 24, 2019

Ping :)

Copy link
Copy Markdown
Contributor

@tiennou tiennou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a minor documentation nitpick, as I don't know my way around indexes anyways. Kudos for the nice improvement though.

Comment thread examples/stash.c
return 0;
}

int lg2_stash(git_repository *repo, int argc, char *argv[])
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nothing standing out here codewise: I just want to point out that Docurium seemingly chokes on examples with no docblocks (add.c has that problem). But it could be something else — I haven't looked at the example processing system at all.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"add.c" does have docblocks, though, so I'd expect it to be something else. Does this break Docurium in a way that needs to be fixed before we merge this or can we just proceed? I'm obviously asking out of sheer lazyness 🙄

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No worries, it's just a reminder (mostly to me), it's a Docurium bug in any case, and — on the offchance it is docblock-related — it doesn't cause anything else than a blank example page.

IIRC most examples have a "main" docblock at the top that describes "usage", that'd be fine by me. Or this can be merged, and I'll check when I get near Docurium again.

@pks-t pks-t closed this Jul 11, 2019
@pks-t pks-t deleted the pks/stash-perf branch July 11, 2019 19:06
@pks-t pks-t restored the pks/stash-perf branch July 11, 2019 19:10
@pks-t pks-t reopened this Jul 11, 2019
@pks-t
Copy link
Copy Markdown
Member Author

pks-t commented Jul 11, 2019

Too many branches, deleted the wrong one by accident 🙄

@pks-t
Copy link
Copy Markdown
Member Author

pks-t commented Jul 20, 2019

Rebased to fix conflicts

pks-t added 2 commits July 20, 2019 19:10
Implement a new example that resembles the git-stash(1) command.
Right now, it only provides the apply, list, save and pop
subcommands without any options.

This example is mostly used to test libgit2's stashing
performance on big repositories.
When creating a new stash, we need to create there separate
commits storing differences stored in the index, untracked
changes as well as differences in the working directory. The
first two will only be done conditionally if the equivalent
options "git stash --keep-index --include-untracked" are being
passed to `git_stash_save`, but even when only creating a stash
of worktree changes we're much slower than git.git. Using our new
stash example:

	$ time git stash
	Saved working directory and index state WIP on (no branch): 2f7d9d47575e Linux 5.1.7

	real    0m0.528s
	user    0m0.309s
	sys     0m0.381s

	$ time lg2 stash

	real    0m27.165s
	user    0m13.645s
	sys     0m6.403s

As can be seen, libgit2 is more than 50x slower than git.git!

When creating the stash commit that includes all worktree
changes, we create a completely new index to prepare for the new
commit and populate it with the entries contained in the index'
tree. Here comes the catch: by populating the index with a tree's
contents, we do not have any stat caches in the index. This means
that we have to re-validate every single file from the worktree
and see whether it has changed.

The issue can be fixed by populating the new index with the
repo's existing index instead of with the tree. This retains all
stat cache information, and thus we really only need to check
files that have changed stat information. This is semantically
equivalent to what we previously did: previously, we used the
tree of the commit computed from the index. Now we're just using
the index directly.

And, in fact, the cache is doing wonders:

	time lg2 stash

	real    0m1.836s
	user    0m1.166s
	sys     0m0.663s

We're now performing 15x faster than before and are only 3x
slower than git.git now.
@ethomson
Copy link
Copy Markdown
Member

OK, I've read through our tests and they seem very thorough. And I've run this through manual testing, with a variety of changes in the index and working directory, and this matches the behavior of git in all of them.

Thanks for doing this @pks-t !

@ethomson ethomson merged commit 5774b2b into libgit2:master Aug 11, 2019
@pks-t pks-t deleted the pks/stash-perf branch August 23, 2019 10:40
@gjcampbell
Copy link
Copy Markdown

This fix was awesome. Can --untracked-changes be optimized as well?

@pks-t
Copy link
Copy Markdown
Member Author

pks-t commented Sep 27, 2019

This fix was awesome. Can --untracked-changes be optimized as well?

They probably could be optimized, but that'd require a proper test case that shows that we perform much worse than e.g. the official git implementation. Please feel free to create an issue with a reproducer, this would make it much more likely for anybody to tackle the issue. :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants