Fix path computations for compressed index entries#4236
Fix path computations for compressed index entries#4236ethomson merged 14 commits intolibgit2:masterfrom
Conversation
a573890 to
df3f464
Compare
|
Found and fixed more issues with computing index paths, as we did not update the index entry against which we were compressing paths against. Also added a test to verify we correctly read an index of version 4. I somehow doubt that |
|
Fixed a lot of issues with writing index v4 entries. I think there's still one issue left with us overallocating memory for compressed index entries when writing, causing us to write additional NULs into the index. While this works for us as we're able to read those entries, right-padding with NULs is not allowed in index v4. So there's a bug on both the reading and writing sides of our code. |
|
I fixed two final errors. First, the Second, we were not NUL-terminating the path string for compressed index entries, it only worked by accident because the first field of the next entry frequently starts with '\x00'. I've added a few asserts to verify current behavior, and they're not being triggered anymore. I'm not aware of any additional bugs in compressed index entries currently. |
|
The failing tests are due to my shitty test to verify we are actually writing index v4 with path compression. I simply check the size of the resulting written index, which seems to be non-portable. I've got no real other idea how to verify that we're actually writing an index v4 with path compression. though. |
| GITERR_CHECK_ALLOC_ADD(&tmp_path_len, shared, len + 1); | ||
| tmp_path = git__malloc(tmp_path_len); | ||
| GITERR_CHECK_ALLOC_ADD(&path_len, prefix_len, suffix_len); | ||
| GITERR_CHECK_ALLOC_ADD(&path_len, prefix_len, 1); |
There was a problem hiding this comment.
This needs to be GITERR_CHECK_ALLOC_ADD(&path_len, path_len, 1);. As is this is truncating the size of path_len to prefix_len + 1 instead of prefix_len + suffix_len + 1.
| for (i = 'a'; i <= 'z'; i++) { | ||
| for (j = 'a'; j < 'z'; j++) { | ||
| path[ARRAY_SIZE(path) - 3] = i; | ||
| path[ARRAY_SIZE(path) - 2] = j; |
There was a problem hiding this comment.
path is not null-terminated here. path[ARRAY_SIZE(path) - 1] = '\0'; ?
When encoding varints to a buffer, we want to remain sure that the required buffer space does not exceed what is actually available. Our current check does not do the right thing, though, in that it does not honor that our `pos` variable counts the position down instead of up. As such, we will require too much memory for small varints and not enough memory for big varints. Fix the issue by correctly calculating the required size as `(sizeof(varint) - pos)`. Add a test which failed before.
The index version 4 introduced compressed path names for the entries.
From the git.git index-format documentation:
At the beginning of an entry, an integer N in the variable width
encoding [...] is stored, followed by a NUL-terminated string S.
Removing N bytes from the end of the path name for the previous
entry, and replacing it with the string S yields the path name for
this entry.
But instead of stripping N bytes from the previous path's string and
using the remaining prefix, we were instead simply concatenating the
previous path with the current entry path, which is obviously wrong.
Fix the issue by correctly copying the first N bytes of the previous
entry only and concatenating the result with our current entry's path.
To calculate the path of a compressed index entry, we need to know the preceding entry's path. While we do actually set the first predecessor correctly to "", we fail to update this while reading the entries. Fix the issue by updating `last` inside of the loop. Previously, we've been passing a double-pointer to `read_entry`, which it didn't update. As it is more obvious to update the pointer inside the loop itself, though, we can simply convert it to a normal pointer.
The last written disk entry is currently being written inside of the function `write_disk_entry`. Make behavior a bit more obviously by instead setting it inside of `write_entries` while iterating all entries.
Create a new function `index_entry_size` which encapsulates the logic to calculate how much space is needed for an index entry, whether it is simple/extended or compressed/uncompressed. This can later be re-used by our code writing index entries.
Our code to write index entries to disk does not check whether the entry that is to be written should use prefix compression for the path. As such, we were overallocating memory and added bogus right-padding into the resulting index entries. As there is no padding allowed in the index version 4 format, this should actually result in an invalid index. Fix this by re-using the newly extracted `index_entry_size` function.
All index entry size computations are now performed in `index_entry_size`. As such, we do not need the file-scope macros for computing these sizes anymore. Remove them and move the `entry_size` macro into the `index_entry_size` function.
We have a check in place whether the index has enough data left for the required footer after reading an index entry, but this was only used for uncompressed entries. Move the check down a bit so that it is executed for both compressed and uncompressed index entries.
When using compressed index entries, each entry's path is preceded by a varint encoding how long the shared prefix with the previous index entry actually is. We currently encode a length of `(path_len - same_len)`, which is doubly wrong. First, `path_len` is already set to `path_len - same_len` previously. Second, we want to encode the shared prefix rather than the un-shared suffix length. Fix this by using `same_len` as the varint value instead.
In our code writing index entries, we carry around a `disk_size` representing how much memory we have in total and pass this value to `git_encode_varint` to do bounds checks. This does not make much sense, as at the time when passing on this variable it is already out of date. Fix this by subtracting used memory from `disk_size` as we go along. Furthermore, assert we've actually got enough space left to do the final path memcpy.
The init and cleanup functions for test suites are usually prepended to our actual tests. The index::version test suite does not adhere to this stile. Fix this.
While we have a simple test to determine whether we can write an index of version 4, we never verified that we are able to read this kind of index (and in fact, we were not able to do so). Add a new repository which has an index of version 4. This repository is then read from a new test.
While we do have a test which checks whether a written index of version 4 has the correct version set, we do not check whether this actually enables path compression for index entries. This commit adds a new test by adding a number of index entries with equal path prefixes to the index and subsequently flushing that to disk. With suffix compression enabled by index version 4, only the last few bytes of these paths will actually have to be written to the index, saving a lot of disk space. For the test, differences are about an order of magnitude, allowing us to easily verify without taking a deeper look at actual on-disk contents.
The current write test does not trigger some edge-cases in the index version 4 path compression code. Rewrite the test to start off the an empty standard repository, creating index entries with interesting paths itself. This allows for more fine-grained control over checked paths. Furthermore, we now also verify that entry paths are actually reconstructed correctly.
This fixes the test case from #4234. We got the path calculation for compressed index entries completely wrong, where we ended up with a wrong path and with a wrong index entry size. This actually hints at a bigger problem: we obviously cannot have any tests at all regarding index v4, as that should've immediately shown that we do not really support this.
I'm off for today, but will amend tests later on.