Conversation
|
Oooooooh, I'm excited to review this. 😀 |
34fe828 to
42d2667
Compare
de0a156 to
c709b97
Compare
|
The commits have been rewritten to have proper commit messages now. CI passes now, too, so this is ready for review |
|
By the way, short hint: please have a special look at the commit "cache: use iteration interface for cache eviction". It changes the way we behave with regards to cache eviction, where we would previously evict starting at a random offset and now we evict starting at the first entry. |
tiennou
left a comment
There was a problem hiding this comment.
Only minor comments from me, as it's a welcome cleanup overall anyways 😉.
| int git_idxmap_new(git_idxmap **out) | ||
| { | ||
| if ((*map = kh_init(idx)) == NULL) { | ||
| if ((*out = kh_init(idx)) == NULL) { |
There was a problem hiding this comment.
Maybe
*out = kh_init(idx);
GITERR_CHECK_ALLOC(*out);
return 0;as it removes the need for a if/oom dance ?
I'd slap on an assert(out) on top as well…
| * @parameter map map containing the elements | ||
| * @return number of elements in the map | ||
| */ | ||
| size_t git_offmap_size(git_offmap *map); |
There was a problem hiding this comment.
Nitpicky: my preference goes to count (or length) over size, as I usually tend to think of size as a number-of-bytes vs. a number-of-elements.
There was a problem hiding this comment.
Yeah, we have precedent for both, so I was kind of on the edge. Arrays have git_array_size, vectors have git_vector_length, pqueues have git_pqueue_size. @ethomson, care to decide this one?
There was a problem hiding this comment.
And buffer has both size and asize but we ask for its len. 😢
I feel like we most commonly use size. I don't love that - it always feels to me like the allocated size, not the filled / used size of a struct - but I think that consistency is good here.
There was a problem hiding this comment.
Yeah, had the same feeling. Thanks for chiming in!
| return kh_size(map); | ||
| } | ||
|
|
||
| void *git_strmap_get(git_strmap *map, const char *key) |
| kh_clear(idxicase, map); | ||
| } | ||
|
|
||
| int git_idxmap_resize(git_idxmap *map, size_t size) |
There was a problem hiding this comment.
size => count nitpick: _grow ?
(the commit message is missing a "due" on the 2nd line).
There was a problem hiding this comment.
In this example I disagree. Interestingly enough, kh_resize lets you resize maps to a smaller size. Nobody should be doing that, but this is why I don't want to name it grow ;)
| * GIT_ITEROVER if no entries are left. A negative error | ||
| * code otherwise. | ||
| */ | ||
| int git_offmap_iterate(git_offmap *map, size_t *iter, git_off_t *key, void **value); |
There was a problem hiding this comment.
This does not "conform" to the style of having "out" parameters at the beginning.
(git_off_t *key, void **value, size_t *iter, git_offmap *map) ?
| * Iterate over entries of the map. | ||
| * | ||
| * This functions allows to iterate over all key-value entries of | ||
| * the map. The current item is stored in the `iter` variable and |
| for (__i = git_oidmap_begin(h); __i != git_oidmap_end(h); ++__i) { \ | ||
| if (!git_oidmap_has_data(h,__i)) continue; \ | ||
| (vvar) = git_oidmap_value_at(h,__i); \ | ||
| #define git_oidmap_foreach_value(h, vvar, code) { size_t __i = 0; \ |
There was a problem hiding this comment.
That's unrelated, but I dislike code being an argument… (à la git_array_foreach).
There was a problem hiding this comment.
I totally agree. We could just transform the while-loop into a for loop, initializing __i in the loop initializer. But let's do this after this PR got merged, it already has a lot of code churn.
| } | ||
|
|
||
| void test_core_strmap__1(void) | ||
| void test_core_strmap__inserted_strings_can_be_retrieved(void) |
| { | ||
| uint32_t seed = rand(); | ||
| size_t evict_count = 8; | ||
| size_t evict_count = 8, i; |
There was a problem hiding this comment.
commitmsg: s/completley/completely/.
| return; | ||
| } | ||
|
|
||
| i = 0; |
There was a problem hiding this comment.
I might be completely wrong, but what about initializing this to random(0, git_oidmap_size(cache->map) - evict_count) ?
There was a problem hiding this comment.
Yeah, we sure could do this, but only because we do know the map's implementation details. We'd also need to wrap-around when reaching git_oidmap_size. But it sure keeps the original intention a lot better than my refactored code.
It just feels like the wrong data structure for the task at hand, at least in this specific cache eviction scenario. I'm not familiar with the code though, so I cannot say anything more to it. @ethomson, you've got any comment?
There was a problem hiding this comment.
Hrmf. No? If I'm reading this correctly, we walk through the cache evicting items in-order, the difference is the starting point. It feels like if we're going to go to any lengths to change our eviction strategy that we should actually have a strategy based on age or something and not the ordering of the oid. Starting at a random place in the list seems like it might be slightly better than always evicting starting at the beginning of the map (I can imagine the latter enabling some pathological cases) but they both sort of suck anyway.
(This might be what you were referring to @pks-t with the "wrong data structure" comment.)
There was a problem hiding this comment.
It feels like if we're going to go to any lengths to change our eviction strategy that we should actually have a strategy based on age or something and not the ordering of the oid.
That's exactly my thought here, too. An LRU cache might make a lot more sense.It would require us to keep track of the cached objects twice, though: once in the LRU queue for cache eviction and once in the map for quick lookup via OID.
So the question is how to proceed right now. I don't feel like introducing the random(0, git_oidmap_size(cache->map) - evict_count) thing, as it again ties us to the map implementation details. I know it's unlikely that it'll change anyway soon, but it just doesn't feel right. If you say that'd be required to get this merged, then fine, so be it and I'll just use @tiennou's proposal. But in that case, we should definitely revisit and fix it soon.
There was a problem hiding this comment.
No, I don't think that this is significantly worse (in terms of cache eviction) than what's being replaced. 0, random, it's pretty mediocre either way. I'll file an issue but let's merge this now.
c709b97 to
b9c6d04
Compare
|
Rebased on latest master to fix conflicts and address most of @tiennou's comments. Thanks for reviewing! |
b9c6d04 to
ceed892
Compare
|
Rebased to fix havoc caused by #4917 :) |
|
One place that we get a boatloat of warnings on Win32 is in the khash users. Right now we very often cast between I'm a bit surprised that there's no warnings on gcc/clang with the implicit cast, but alas. I noticed that we're still using |
|
On Sat, Jan 26, 2019 at 01:18:43PM -0800, Edward Thomson wrote:
One place that we get a boatloat of warnings on Win32 is in the khash users. Right now we very often cast between `size_t` and `khiter_t`. But `khiter_t` is actually defined as 32 bits.
I'm a bit surprised that there's no warnings on gcc/clang with the implicit cast, but alas. I noticed that we're still using `size_t` for indices here; it would be nice if we started using `khiter_t` consistently to avoid that.
I explicitly chose not to expose any implementation details of
the khash stuff so that we can completley avoid including
"khash.h" in our own header files. I'll try to come up with an
alternative solution, though.
|
|
True. This is definitely not worse than what we have now in the khiter_t/size_t differences, and more encapsulated. It’s not necessarily something we need to deal with before merging.
I’m crazy busy this week but I’ll try to find some time to take a look at this more closely.
… On Jan 28, 2019, at 08:44, Patrick Steinhardt ***@***.***> wrote:
On Sat, Jan 26, 2019 at 01:18:43PM -0800, Edward Thomson wrote:
> One place that we get a boatloat of warnings on Win32 is in the khash users. Right now we very often cast between `size_t` and `khiter_t`. But `khiter_t` is actually defined as 32 bits.
>
> I'm a bit surprised that there's no warnings on gcc/clang with the implicit cast, but alas. I noticed that we're still using `size_t` for indices here; it would be nice if we started using `khiter_t` consistently to avoid that.
I explicitly chose not to expose any implementation details of
the khash stuff so that we can completley avoid including
"khash.h" in our own header files. I'll try to come up with an
alternative solution, though.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub, or mute the thread.
|
|
On Mon, Jan 28, 2019 at 12:51:37AM -0800, Edward Thomson wrote:
I’m crazy busy this week but I’ll try to find some time to take a look at this more closely.
No worries. I'll also be busy due to the upcoming Git Merge, so
there's no need to hurry :)
|
Currently, the lifecycle functions for maps (allocation, deallocation, resize) are not named in a uniform way and do not have a uniform function signature. Rename the functions to fix that, and stick to libgit2's naming scheme of saying `git_foo_new`. This results in the following new interface for allocation: - `int git_<t>map_new(git_<t>map **out)` to allocate a new map, returning an error code if we ran out of memory - `void git_<t>map_free(git_<t>map *map)` to free a map - `void git_<t>map_clear(git<t>map *map)` to remove all entries from a map This commit also fixes all existing callers.
There currently exist two different function names for getting the entry count of maps, where offmaps offset and string maps use `num_entries` and OID maps use `size`. In most programming languages with built-in map types, this is simply called `size`, which is also shorter to type. Thus, this commit renames the other two functions `num_entries` to match the common way and adjusts all callers.
The current way of looking up an entry from a map is tightly coupled with the map implementation, as one first has to look up the index of the key and then retrieve the associated value by using the index. As a caller, you usually do not care about any indices at all, though, so this is more complicated than really necessary. Furthermore, it invites for errors to happen if the correct error checking sequence is not being followed. Introduce a new high-level function `git_strmap_get` that takes a map and a key and returns a pointer to the associated value if such a key exists. Otherwise, a `NULL` pointer is returned. Adjust all callers that can trivially be converted.
Currently, one would use the function `git_strmap_insert` to insert key/value pairs into a map. This function has historically been a macro, which is why its syntax is kind of weird: instead of returning an error code directly, it instead has to be passed a pointer to where the return value shall be stored. This does not match libgit2's common idiom of directly returning error codes. Introduce a new function `git_strmap_set`, which takes as parameters the map, key and value and directly returns an error code. Convert all callers of `git_strmap_insert` to make use of it.
The current way of looking up an entry from a map is tightly coupled with the map implementation, as one first has to look up the index of the key and then retrieve the associated value by using the index. As a caller, you usually do not care about any indices at all, though, so this is more complicated than really necessary. Furthermore, it invites for errors to happen if the correct error checking sequence is not being followed. Introduce a new high-level function `git_oidmap_get` that takes a map and a key and returns a pointer to the associated value if such a key exists. Otherwise, a `NULL` pointer is returned. Adjust all callers that can trivially be converted.
Currently, one would use either `git_oidmap_insert` to insert key/value pairs into a map or `git_oidmap_put` to insert a key only. These function have historically been macros, which is why their syntax is kind of weird: instead of returning an error code directly, they instead have to be passed a pointer to where the return value shall be stored. This does not match libgit2's common idiom of directly returning error codes.Furthermore, `git_oidmap_put` is tightly coupled with implementation details of the map as it exposes the index of inserted entries. Introduce a new function `git_oidmap_set`, which takes as parameters the map, key and value and directly returns an error code. Convert all trivial callers of `git_oidmap_insert` and `git_oidmap_put` to make use of it.
The current way of looking up an entry from a map is tightly coupled with the map implementation, as one first has to look up the index of the key and then retrieve the associated value by using the index. As a caller, you usually do not care about any indices at all, though, so this is more complicated than really necessary. Furthermore, it invites for errors to happen if the correct error checking sequence is not being followed. Introduce a new high-level function `git_offmap_get` that takes a map and a key and returns a pointer to the associated value if such a key exists. Otherwise, a `NULL` pointer is returned. Adjust all callers that can trivially be converted.
Currently, there is only one caller that adds entries into an offset map, and this caller first uses `git_offmap_put` to add a key and then set the value at the returned index by using `git_offmap_set_value_at`. This is just too tighlty coupled with implementation details of the map as it exposes the index of inserted entries, which we really do not care about at all. Introduce a new function `git_offmap_set`, which takes as parameters the map, key and value and directly returns an error code. Convert the caller to make use of it instead.
The current way of looking up an entry from a map is tightly coupled with the map implementation, as one first has to look up the index of the key and then retrieve the associated value by using the index. As a caller, you usually do not care about any indices at all, though, so this is more complicated than really necessary. Furthermore, it invites for errors to happen if the correct error checking sequence is not being followed. Introduce new high-level functions `git_idxmap_get` and `git_idxmap_icase_get` that take a map and a key and return a pointer to the associated value if such a key exists. Otherwise, a `NULL` pointer is returned. Adjust all callers that can trivially be converted.
Currently, one would use the function `git_idxmap_insert` to insert key/value pairs into a map. This function has historically been a macro, which is why its syntax is kind of weird: instead of returning an error code directly, it instead has to be passed a pointer to where the return value shall be stored. This does not match libgit2's common idiom of directly returning error codes. Introduce a new function `git_idxmap_set`, which takes as parameters the map, key and value and directly returns an error code. Convert all callers of `git_idxmap_insert` to make use of it.
The currently existing function `git_idxmap_resize` and `git_idxmap_icase_resize` do not return any error codes at all due to their previous implementation making use of a macro. Due to that, it is impossible to see whether the resize operation might have failed due to an out-of-memory situation. Fix this by providing a proper error code. Adjust callers to make use of it.
Currently, the delete functions of maps do not provide a return value. Like this, it is impossible to tell whether the entry has really been deleted or not. Change the implementation to provide either a return value of zero if the entry has been successfully deleted or `GIT_ENOTFOUND` if the key could not be found. Convert callers to the `delete_at` functions to instead use this higher-level interface.
Some callers were still using the tightly-coupled pattern of `lookup_index` and `valid_index` to verify that an entry exists in a map. Instead, use the more high-level `exists` functions to decouple map users from its implementation.
Currently, our headers need to leak some implementation details of maps due to their direct use of indices in the implementation of their foreach macros. This makes it impossible to completely hide the map structures away, and also makes it impossible to include the khash implementation header in the C files of the respective map only. This is now being fixed by providing a high-level iteration interface `map_iterate`, which takes as inputs the map that shall be iterated over, an iterator as well as the locations where keys and values shall be put into. For simplicity's sake, the iterator is a simple `size_t` that shall initialized to `0` on the first call. All existing foreach macros are then adjusted to make use of this new function.
To compute whether there are objects missing in a packfile, the indexer keeps around a map of OIDs that it still expects to see. This map does not store any values at all, but in fact the keys are owned by the map itself. Right now, we free these keys by iterating over the map and freeing the key itself, which is kind of awkward as keys are expected to be constant. We can make this a bit prettier by inserting the OID as value, too. As we already store the `NULL` pointer either way, this does not increase memory usage, but makes the code a tad more clear. Furthermore, we convert the previously existing map iteration via indices to make use of an iterator, instead.
To relieve us from memory pressure, we may regularly call `cache_evict_entries` to remove some entries from it. Unfortunately, our cache does not support a least-recently-used mode or something similar, which is why we evict entries completeley at random right now. Thing is, this is only possible due to the map interfaces exposing the entry indices, and we intend to completely remove those to decouple map users from map implementations. As soon as that is done, we are unable to do this random eviction anymore. Convert this to make use of an iterator for now. Obviously, there is no random eviction possible like that anymore, but we'll always start by evicting from the beginning of the map. Due to hashing, one may hope that the selected buckets will be evicted at least in some way unpredictably. But more likely than not, this will not be the case. But let's see what happens and if any users complain about degraded performance. If so, we might come up with a different scheme than random removal, e.g. by using an LRU cache.
Remove the low-level interface that was exposing implementation details of `git_strmap` to callers. From now on, only the high-level functions shall be used to retrieve or modify values of a map. Adjust remaining existing callers.
Remove the low-level interface that was exposing implementation details of `git_offmap` to callers. From now on, only the high-level functions shall be used to retrieve or modify values of a map. Adjust remaining existing callers.
Remove the low-level interface that was exposing implementation details of `git_oidmap` to callers. From now on, only the high-level functions shall be used to retrieve or modify values of a map. Adjust remaining existing callers.
Remove the low-level interface that was exposing implementation details of `git_idxmap` to callers. From now on, only the high-level functions shall be used to retrieve or modify values of a map. Adjust remaining existing callers.
ceed892 to
df42f36
Compare
|
Rebased to verify that no new map users of the old API have been introduced in the meantime. |
|
@ethomson: you got some time soonish to get this reviewed? Compiling with MSVC again reminded me of all the warnings that our current use of khash generates :/ |
|
Huge improvement. Thanks! 🎉 |
This is a complete rewrite of our map APIs. Previosuly, most places were using operations based on indices of the map entries, which is error prone, too tightly coupled with implementation details of the map and hard to read. This PR rewrites the API of all maps (str, oid, off, idx/idxicase) and introduces a set of high-level functions that act on those maps:
The endgame of this PR goes on to delete the old API based on indices completely.
Note: please do not merge yet. While it builds and passes all tests, I still need to go through the commits and fix up the commit messages.