Skip to content

revwalk: avoid walking the entire history when output is unsorted#4606

Merged
ethomson merged 3 commits intomasterfrom
cmn/revwalk-iteration
Jun 18, 2018
Merged

revwalk: avoid walking the entire history when output is unsorted#4606
ethomson merged 3 commits intomasterfrom
cmn/revwalk-iteration

Conversation

@carlosmn
Copy link
Copy Markdown
Member

@carlosmn carlosmn commented Apr 1, 2018

As part of reducing our divergence from git, its code for revwalk was ported
into our codebase. A detail about when to limit the list was lost and we ended
up always calling that code.

Limiting the list means performing the walk and creating the final list of
commits to be output during the preparation stage. This is unavoidable when
sorting and when there are negative refs.

We did this even when asked for unsorted output with no negative refs, which you
might do to retrieve something like the "last 10 commits on HEAD" for a
nominally unsorted meaning of "last".

This commit adds and sets a flag indicating when we do need to limit the list,
letting us avoid doing so when we can. The previously mentioned query thus no
longer loads the entire history of the project during the prepare stage, but
loads it iteratively during the walk.

As part of reducing our divergence from git, its code for revwalk was ported
into our codebase. A detail about when to limit the list was lost and we ended
up always calling that code.

Limiting the list means performing the walk and creating the final list of
commits to be output during the preparation stage. This is unavoidable when
sorting and when there are negative refs.

We did this even when asked for unsorted output with no negative refs, which you
might do to retrieve something like the "last 10 commits on HEAD" for a
nominally unsorted meaning of "last".

This commit adds and sets a flag indicating when we do need to limit the list,
letting us avoid doing so when we can. The previously mentioned query thus no
longer loads the entire history of the project during the prepare stage, but
loads it iteratively during the walk.
Copy link
Copy Markdown
Member

@pks-t pks-t left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool! Out of curiosity: have you done any benchmarks by how much this actually speeds up listings?

Comment thread src/revwalk.c Outdated
}
}

static int get_revision(git_commit_list_node **out, git_revwalk *walk, git_commit_list **list)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's not really a difference to get_one_revision right now, is there?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right now it just calls the lower-level function but extra features would get built here.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd say we should avoid having two equivalent function calls for now, though. It can still easily be extended later when implementing said extra features

Comment thread src/revwalk.c Outdated
int error;
git_commit_list_node *commit;

while(true) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This loop is useless right now -- we always return during the first loop

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, part of the porting. Here is where simplification handling happens in git. It's probably worth removing it to reduce confusion.

Comment thread src/revwalk.c
}

if (sort_mode != GIT_SORT_NONE)
walk->limited = 1;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This has the limitation that we will stay limited when somebody is switching sorting to GIT_SORT_NONE afterwards, again. But handling that case correctly would need another flag, so I guess that's okay

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Handling that would need something like else if (!walk->did_hide) which is not too bad, but would be quite an edge case where you're double-guessing what sorting you want for a single walk.

Comment thread src/revwalk.c
return error;
}

static int revwalk_next_reverse(git_commit_list_node **object_out, git_revwalk *walk)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't revwalk_next_reverse need the get_revision treatment, as well?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reverse iterator does the full walk itself in prepare_walk by iterating via whatever other iterator we have set up and inserts them in this iterator.

We could potentially convert it into the same kind of loop but like with the pure time sort, we already have everything in the list.

@carlosmn
Copy link
Copy Markdown
Member Author

carlosmn commented Apr 3, 2018

have you done any benchmarks by how much this actually speeds up listings?

That depends on how big your history is. With this an "unsorted" walk without negative refs will only read however many commits you walk down (plus some overhead for the parents that end up in there) instead of reading the whole list.

My benchmark was to load up the github repository and call git_revwalk_next 11 times. Without the patch it takes about 7s and with it it's about 20ms or so.

We don't currently need to have anything that's different between `get_revision`
and `get_one_revision` so let's just remove the inner function and make the code
more straightforward.
@carlosmn
Copy link
Copy Markdown
Member Author

I just pushed up the simplification removing the extra layer of function call that doesn't do anything. I think it'll be fine to skip the trying to guess whether we can disable limiting when changing the sorting method back to NONE.

Comment thread src/revwalk.c Outdated
int error;
git_commit_list_node *commit;

commit = git_commit_list_pop(list);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you clean up the formatting in this function?

@ethomson
Copy link
Copy Markdown
Member

Since @carlosmn updated this to address @pks-t's concerns, I went ahead and fixed up the formatting problems myself.

@nebosite
Copy link
Copy Markdown

nebosite commented Dec 18, 2018

I have a really big repository (many GB) and repository.Commits.Take(100) takes around 200 seconds to run. Compare to git log -n 100 which takes about 1 second. The culprit is this line in libgit2sharp:

        int res = NativeMethods.git_revwalk_next(out ret, walker);

On the same repository, Diff.Compare(oldTree, newTree) takes over 200 seconds in libgit2sharp, but only 1 second from the commandline.

@pks-t
Copy link
Copy Markdown
Member

pks-t commented Dec 19, 2018

@nebosite: What was the version of libgit2 you have been testing with? Did it already include these changes from this PR here? If so, then you should probably create another issue with a simple reproducer that shows the problem

@nebosite
Copy link
Copy Markdown

I was using a same-day clone of master. Not sure I can reproduce this without a really large repo. I'm guessing the behavior might be exponential.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants