revwalk: avoid walking the entire history when output is unsorted#4606
revwalk: avoid walking the entire history when output is unsorted#4606
Conversation
As part of reducing our divergence from git, its code for revwalk was ported into our codebase. A detail about when to limit the list was lost and we ended up always calling that code. Limiting the list means performing the walk and creating the final list of commits to be output during the preparation stage. This is unavoidable when sorting and when there are negative refs. We did this even when asked for unsorted output with no negative refs, which you might do to retrieve something like the "last 10 commits on HEAD" for a nominally unsorted meaning of "last". This commit adds and sets a flag indicating when we do need to limit the list, letting us avoid doing so when we can. The previously mentioned query thus no longer loads the entire history of the project during the prepare stage, but loads it iteratively during the walk.
pks-t
left a comment
There was a problem hiding this comment.
Cool! Out of curiosity: have you done any benchmarks by how much this actually speeds up listings?
| } | ||
| } | ||
|
|
||
| static int get_revision(git_commit_list_node **out, git_revwalk *walk, git_commit_list **list) |
There was a problem hiding this comment.
There's not really a difference to get_one_revision right now, is there?
There was a problem hiding this comment.
Right now it just calls the lower-level function but extra features would get built here.
There was a problem hiding this comment.
I'd say we should avoid having two equivalent function calls for now, though. It can still easily be extended later when implementing said extra features
| int error; | ||
| git_commit_list_node *commit; | ||
|
|
||
| while(true) { |
There was a problem hiding this comment.
This loop is useless right now -- we always return during the first loop
There was a problem hiding this comment.
Yeah, part of the porting. Here is where simplification handling happens in git. It's probably worth removing it to reduce confusion.
| } | ||
|
|
||
| if (sort_mode != GIT_SORT_NONE) | ||
| walk->limited = 1; |
There was a problem hiding this comment.
This has the limitation that we will stay limited when somebody is switching sorting to GIT_SORT_NONE afterwards, again. But handling that case correctly would need another flag, so I guess that's okay
There was a problem hiding this comment.
Handling that would need something like else if (!walk->did_hide) which is not too bad, but would be quite an edge case where you're double-guessing what sorting you want for a single walk.
| return error; | ||
| } | ||
|
|
||
| static int revwalk_next_reverse(git_commit_list_node **object_out, git_revwalk *walk) |
There was a problem hiding this comment.
Wouldn't revwalk_next_reverse need the get_revision treatment, as well?
There was a problem hiding this comment.
The reverse iterator does the full walk itself in prepare_walk by iterating via whatever other iterator we have set up and inserts them in this iterator.
We could potentially convert it into the same kind of loop but like with the pure time sort, we already have everything in the list.
That depends on how big your history is. With this an "unsorted" walk without negative refs will only read however many commits you walk down (plus some overhead for the parents that end up in there) instead of reading the whole list. My benchmark was to load up the github repository and call |
We don't currently need to have anything that's different between `get_revision` and `get_one_revision` so let's just remove the inner function and make the code more straightforward.
|
I just pushed up the simplification removing the extra layer of function call that doesn't do anything. I think it'll be fine to skip the trying to guess whether we can disable limiting when changing the sorting method back to |
| int error; | ||
| git_commit_list_node *commit; | ||
|
|
||
| commit = git_commit_list_pop(list); |
There was a problem hiding this comment.
Could you clean up the formatting in this function?
|
I have a really big repository (many GB) and repository.Commits.Take(100) takes around 200 seconds to run. Compare to git log -n 100 which takes about 1 second. The culprit is this line in libgit2sharp: On the same repository, Diff.Compare(oldTree, newTree) takes over 200 seconds in libgit2sharp, but only 1 second from the commandline. |
|
@nebosite: What was the version of libgit2 you have been testing with? Did it already include these changes from this PR here? If so, then you should probably create another issue with a simple reproducer that shows the problem |
|
I was using a same-day clone of master. Not sure I can reproduce this without a really large repo. I'm guessing the behavior might be exponential. |
As part of reducing our divergence from git, its code for revwalk was ported
into our codebase. A detail about when to limit the list was lost and we ended
up always calling that code.
Limiting the list means performing the walk and creating the final list of
commits to be output during the preparation stage. This is unavoidable when
sorting and when there are negative refs.
We did this even when asked for unsorted output with no negative refs, which you
might do to retrieve something like the "last 10 commits on HEAD" for a
nominally unsorted meaning of "last".
This commit adds and sets a flag indicating when we do need to limit the list,
letting us avoid doing so when we can. The previously mentioned query thus no
longer loads the entire history of the project during the prepare stage, but
loads it iteratively during the walk.