Skip to content

Fix path computations for compressed index entries#4236

Merged
ethomson merged 14 commits intolibgit2:masterfrom
pks-t:pks/index-v4-fixes
Jun 7, 2017
Merged

Fix path computations for compressed index entries#4236
ethomson merged 14 commits intolibgit2:masterfrom
pks-t:pks/index-v4-fixes

Conversation

@pks-t
Copy link
Copy Markdown
Member

@pks-t pks-t commented May 10, 2017

This fixes the test case from #4234. We got the path calculation for compressed index entries completely wrong, where we ended up with a wrong path and with a wrong index entry size. This actually hints at a bigger problem: we obviously cannot have any tests at all regarding index v4, as that should've immediately shown that we do not really support this.

I'm off for today, but will amend tests later on.

@pks-t pks-t mentioned this pull request May 10, 2017
@pks-t pks-t force-pushed the pks/index-v4-fixes branch 2 times, most recently from a573890 to df3f464 Compare May 12, 2017 08:06
@pks-t
Copy link
Copy Markdown
Member Author

pks-t commented May 12, 2017

Found and fixed more issues with computing index paths, as we did not update the index entry against which we were compressing paths against. Also added a test to verify we correctly read an index of version 4.

I somehow doubt that index::version::can_write_v4 really checks what we think it does. It did read an index from disk and then used git_index_write, subsequently reading entries by path from the re-opened index. Seeing we weren't able to read these paths at all, I guess we simply wrote uncompressed paths into the index, which by definition is not an index v4. So our write support has to be broken, as well.

@pks-t pks-t force-pushed the pks/index-v4-fixes branch from df3f464 to 9dc59ed Compare May 19, 2017 12:26
@pks-t
Copy link
Copy Markdown
Member Author

pks-t commented May 19, 2017

Fixed a lot of issues with writing index v4 entries. I think there's still one issue left with us overallocating memory for compressed index entries when writing, causing us to write additional NULs into the index. While this works for us as we're able to read those entries, right-padding with NULs is not allowed in index v4. So there's a bug on both the reading and writing sides of our code.

@pks-t pks-t mentioned this pull request May 19, 2017
8 tasks
@pks-t pks-t force-pushed the pks/index-v4-fixes branch from 9dc59ed to bc267fa Compare May 22, 2017 11:58
@pks-t
Copy link
Copy Markdown
Member Author

pks-t commented May 22, 2017

I fixed two final errors.

First, the git_varint_encode function miscalculated required space, causing us to underallocate for big numbers and overallocate for small numbers.

Second, we were not NUL-terminating the path string for compressed index entries, it only worked by accident because the first field of the next entry frequently starts with '\x00'. I've added a few asserts to verify current behavior, and they're not being triggered anymore. I'm not aware of any additional bugs in compressed index entries currently.

@pks-t
Copy link
Copy Markdown
Member Author

pks-t commented May 22, 2017

The failing tests are due to my shitty test to verify we are actually writing index v4 with path compression. I simply check the size of the resulting written index, which seems to be non-portable. I've got no real other idea how to verify that we're actually writing an index v4 with path compression. though.

Comment thread src/index.c Outdated
GITERR_CHECK_ALLOC_ADD(&tmp_path_len, shared, len + 1);
tmp_path = git__malloc(tmp_path_len);
GITERR_CHECK_ALLOC_ADD(&path_len, prefix_len, suffix_len);
GITERR_CHECK_ALLOC_ADD(&path_len, prefix_len, 1);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs to be GITERR_CHECK_ALLOC_ADD(&path_len, path_len, 1);. As is this is truncating the size of path_len to prefix_len + 1 instead of prefix_len + suffix_len + 1.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, obviously. Thank?!

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, obviously. Thanks!

Comment thread tests/index/version.c
for (i = 'a'; i <= 'z'; i++) {
for (j = 'a'; j < 'z'; j++) {
path[ARRAY_SIZE(path) - 3] = i;
path[ARRAY_SIZE(path) - 2] = j;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

path is not null-terminated here. path[ARRAY_SIZE(path) - 1] = '\0'; ?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also correct, will fix.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

pks-t added 14 commits June 6, 2017 09:33
When encoding varints to a buffer, we want to remain sure that the
required buffer space does not exceed what is actually available. Our
current check does not do the right thing, though, in that it does not
honor that our `pos` variable counts the position down instead of up. As
such, we will require too much memory for small varints and not enough
memory for big varints.

Fix the issue by correctly calculating the required size as
`(sizeof(varint) - pos)`. Add a test which failed before.
The index version 4 introduced compressed path names for the entries.
From the git.git index-format documentation:

    At the beginning of an entry, an integer N in the variable width
    encoding [...] is stored, followed by a NUL-terminated string S.
    Removing N bytes from the end of the path name for the previous
    entry, and replacing it with the string S yields the path name for
    this entry.

But instead of stripping N bytes from the previous path's string and
using the remaining prefix, we were instead simply concatenating the
previous path with the current entry path, which is obviously wrong.

Fix the issue by correctly copying the first N bytes of the previous
entry only and concatenating the result with our current entry's path.
To calculate the path of a compressed index entry, we need to know the
preceding entry's path. While we do actually set the first predecessor
correctly to "", we fail to update this while reading the entries.

Fix the issue by updating `last` inside of the loop. Previously, we've
been passing a double-pointer to `read_entry`, which it didn't update.
As it is more obvious to update the pointer inside the loop itself,
though, we can simply convert it to a normal pointer.
The last written disk entry is currently being written inside of the
function `write_disk_entry`. Make behavior a bit more obviously by
instead setting it inside of `write_entries` while iterating all
entries.
Create a new function `index_entry_size` which encapsulates the logic to
calculate how much space is needed for an index entry, whether it is
simple/extended or compressed/uncompressed. This can later be re-used by
our code writing index entries.
Our code to write index entries to disk does not check whether the
entry that is to be written should use prefix compression for the path.
As such, we were overallocating memory and added bogus right-padding
into the resulting index entries. As there is no padding allowed in the
index version 4 format, this should actually result in an invalid index.

Fix this by re-using the newly extracted `index_entry_size` function.
All index entry size computations are now performed in
`index_entry_size`. As such, we do not need the file-scope macros for
computing these sizes anymore. Remove them and move the `entry_size`
macro into the `index_entry_size` function.
We have a check in place whether the index has enough data left for the
required footer after reading an index entry, but this was only used for
uncompressed entries. Move the check down a bit so that it is executed
for both compressed and uncompressed index entries.
When using compressed index entries, each entry's path is preceded by a
varint encoding how long the shared prefix with the previous index entry
actually is. We currently encode a length of `(path_len - same_len)`,
which is doubly wrong. First, `path_len` is already set to `path_len -
same_len` previously. Second, we want to encode the shared prefix rather
than the un-shared suffix length.

Fix this by using `same_len` as the varint value instead.
In our code writing index entries, we carry around a `disk_size`
representing how much memory we have in total and pass this value to
`git_encode_varint` to do bounds checks. This does not make much sense,
as at the time when passing on this variable it is already out of date.
Fix this by subtracting used memory from `disk_size` as we go along.
Furthermore, assert we've actually got enough space left to do the final
path memcpy.
The init and cleanup functions for test suites are usually prepended to
our actual tests. The index::version test suite does not adhere to this
stile. Fix this.
While we have a simple test to determine whether we can write an index
of version 4, we never verified that we are able to read this kind of
index (and in fact, we were not able to do so). Add a new repository
which has an index of version 4. This repository is then read from a new
test.
While we do have a test which checks whether a written index of version
4 has the correct version set, we do not check whether this actually
enables path compression for index entries. This commit adds a new test
by adding a number of index entries with equal path prefixes to the
index and subsequently flushing that to disk. With suffix compression
enabled by index version 4, only the last few bytes of these paths will
actually have to be written to the index, saving a lot of disk space.
For the test, differences are about an order of magnitude, allowing us
to easily verify without taking a deeper look at actual on-disk
contents.
The current write test does not trigger some edge-cases in the index
version 4 path compression code. Rewrite the test to start off the an
empty standard repository, creating index entries with interesting paths
itself. This allows for more fine-grained control over checked paths.
Furthermore, we now also verify that entry paths are actually
reconstructed correctly.
@pks-t pks-t force-pushed the pks/index-v4-fixes branch from 45d6861 to 92d3ea4 Compare June 6, 2017 07:38
@ethomson ethomson merged commit 3bc95cf into libgit2:master Jun 7, 2017
@pks-t pks-t deleted the pks/index-v4-fixes branch June 12, 2017 06:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants