Skip to content

High-level map APIs#4901

Merged
ethomson merged 20 commits intolibgit2:masterfrom
pks-t:pks/uniform-map-api
Feb 22, 2019
Merged

High-level map APIs#4901
ethomson merged 20 commits intolibgit2:masterfrom
pks-t:pks/uniform-map-api

Conversation

@pks-t
Copy link
Copy Markdown
Member

@pks-t pks-t commented Dec 1, 2018

This is a complete rewrite of our map APIs. Previosuly, most places were using operations based on indices of the map entries, which is error prone, too tightly coupled with implementation details of the map and hard to read. This PR rewrites the API of all maps (str, oid, off, idx/idxicase) and introduces a set of high-level functions that act on those maps:

  • new/free/realloc/clear to handle memory and lifecycle
  • get/set to retrieve entries by key or to insert/update the value associated with a key
  • delete to remove a key/value pair by key
  • exists to check whether a key exists
  • iterate to iterate over all key/value pairs

The endgame of this PR goes on to delete the old API based on indices completely.

Note: please do not merge yet. While it builds and passes all tests, I still need to go through the commits and fix up the commit messages.

@ethomson
Copy link
Copy Markdown
Member

ethomson commented Dec 1, 2018

Oooooooh, I'm excited to review this. 😀

@pks-t pks-t force-pushed the pks/uniform-map-api branch from 34fe828 to 42d2667 Compare December 17, 2018 08:59
@pks-t pks-t changed the title [WIP] High-level map APIs High-level map APIs Dec 17, 2018
@pks-t pks-t force-pushed the pks/uniform-map-api branch 5 times, most recently from de0a156 to c709b97 Compare December 17, 2018 10:49
@pks-t
Copy link
Copy Markdown
Member Author

pks-t commented Dec 17, 2018

The commits have been rewritten to have proper commit messages now. CI passes now, too, so this is ready for review

@pks-t
Copy link
Copy Markdown
Member Author

pks-t commented Dec 17, 2018

By the way, short hint: please have a special look at the commit "cache: use iteration interface for cache eviction". It changes the way we behave with regards to cache eviction, where we would previously evict starting at a random offset and now we evict starting at the first entry.

Copy link
Copy Markdown
Contributor

@tiennou tiennou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only minor comments from me, as it's a welcome cleanup overall anyways 😉.

Comment thread src/idxmap.c Outdated
int git_idxmap_new(git_idxmap **out)
{
if ((*map = kh_init(idx)) == NULL) {
if ((*out = kh_init(idx)) == NULL) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe

*out = kh_init(idx);
GITERR_CHECK_ALLOC(*out);

return 0;

as it removes the need for a if/oom dance ?

I'd slap on an assert(out) on top as well…

Comment thread src/offmap.h
* @parameter map map containing the elements
* @return number of elements in the map
*/
size_t git_offmap_size(git_offmap *map);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nitpicky: my preference goes to count (or length) over size, as I usually tend to think of size as a number-of-bytes vs. a number-of-elements.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, we have precedent for both, so I was kind of on the edge. Arrays have git_array_size, vectors have git_vector_length, pqueues have git_pqueue_size. @ethomson, care to decide this one?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And buffer has both size and asize but we ask for its len. 😢

I feel like we most commonly use size. I don't love that - it always feels to me like the allocated size, not the filled / used size of a struct - but I think that consistency is good here.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, had the same feeling. Thanks for chiming in!

Comment thread src/strmap.c
return kh_size(map);
}

void *git_strmap_get(git_strmap *map, const char *key)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❤️

Comment thread src/idxmap.c
kh_clear(idxicase, map);
}

int git_idxmap_resize(git_idxmap *map, size_t size)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

size => count nitpick: _grow ?

(the commit message is missing a "due" on the 2nd line).

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this example I disagree. Interestingly enough, kh_resize lets you resize maps to a smaller size. Nobody should be doing that, but this is why I don't want to name it grow ;)

Comment thread src/offmap.h Outdated
* GIT_ITEROVER if no entries are left. A negative error
* code otherwise.
*/
int git_offmap_iterate(git_offmap *map, size_t *iter, git_off_t *key, void **value);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This does not "conform" to the style of having "out" parameters at the beginning.

(git_off_t *key, void **value, size_t *iter, git_offmap *map) ?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed

Comment thread src/offmap.h Outdated
* Iterate over entries of the map.
*
* This functions allows to iterate over all key-value entries of
* the map. The current item is stored in the `iter` variable and
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/item/position/ ?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed

Comment thread src/oidmap.h
for (__i = git_oidmap_begin(h); __i != git_oidmap_end(h); ++__i) { \
if (!git_oidmap_has_data(h,__i)) continue; \
(vvar) = git_oidmap_value_at(h,__i); \
#define git_oidmap_foreach_value(h, vvar, code) { size_t __i = 0; \
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's unrelated, but I dislike code being an argument… (à la git_array_foreach).

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I totally agree. We could just transform the while-loop into a for loop, initializing __i in the loop initializer. But let's do this after this PR got merged, it already has a lot of code churn.

Comment thread tests/core/strmap.c
}

void test_core_strmap__1(void)
void test_core_strmap__inserted_strings_can_be_retrieved(void)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👏

Comment thread src/cache.c
{
uint32_t seed = rand();
size_t evict_count = 8;
size_t evict_count = 8, i;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

commitmsg: s/completley/completely/.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

Comment thread src/cache.c
return;
}

i = 0;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I might be completely wrong, but what about initializing this to random(0, git_oidmap_size(cache->map) - evict_count) ?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, we sure could do this, but only because we do know the map's implementation details. We'd also need to wrap-around when reaching git_oidmap_size. But it sure keeps the original intention a lot better than my refactored code.

It just feels like the wrong data structure for the task at hand, at least in this specific cache eviction scenario. I'm not familiar with the code though, so I cannot say anything more to it. @ethomson, you've got any comment?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hrmf. No? If I'm reading this correctly, we walk through the cache evicting items in-order, the difference is the starting point. It feels like if we're going to go to any lengths to change our eviction strategy that we should actually have a strategy based on age or something and not the ordering of the oid. Starting at a random place in the list seems like it might be slightly better than always evicting starting at the beginning of the map (I can imagine the latter enabling some pathological cases) but they both sort of suck anyway.

(This might be what you were referring to @pks-t with the "wrong data structure" comment.)

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It feels like if we're going to go to any lengths to change our eviction strategy that we should actually have a strategy based on age or something and not the ordering of the oid.

That's exactly my thought here, too. An LRU cache might make a lot more sense.It would require us to keep track of the cached objects twice, though: once in the LRU queue for cache eviction and once in the map for quick lookup via OID.

So the question is how to proceed right now. I don't feel like introducing the random(0, git_oidmap_size(cache->map) - evict_count) thing, as it again ties us to the map implementation details. I know it's unlikely that it'll change anyway soon, but it just doesn't feel right. If you say that'd be required to get this merged, then fine, so be it and I'll just use @tiennou's proposal. But in that case, we should definitely revisit and fix it soon.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, I don't think that this is significantly worse (in terms of cache eviction) than what's being replaced. 0, random, it's pretty mediocre either way. I'll file an issue but let's merge this now.

@pks-t pks-t force-pushed the pks/uniform-map-api branch from c709b97 to b9c6d04 Compare January 21, 2019 12:50
@pks-t
Copy link
Copy Markdown
Member Author

pks-t commented Jan 21, 2019

Rebased on latest master to fix conflicts and address most of @tiennou's comments. Thanks for reviewing!

@pks-t pks-t force-pushed the pks/uniform-map-api branch from b9c6d04 to ceed892 Compare January 23, 2019 09:52
@pks-t
Copy link
Copy Markdown
Member Author

pks-t commented Jan 23, 2019

Rebased to fix havoc caused by #4917 :)

@ethomson
Copy link
Copy Markdown
Member

One place that we get a boatloat of warnings on Win32 is in the khash users. Right now we very often cast between size_t and khiter_t. But khiter_t is actually defined as 32 bits.

I'm a bit surprised that there's no warnings on gcc/clang with the implicit cast, but alas. I noticed that we're still using size_t for indices here; it would be nice if we started using khiter_t consistently to avoid that.

@pks-t
Copy link
Copy Markdown
Member Author

pks-t commented Jan 28, 2019 via email

@ethomson
Copy link
Copy Markdown
Member

ethomson commented Jan 28, 2019 via email

@pks-t
Copy link
Copy Markdown
Member Author

pks-t commented Jan 28, 2019 via email

pks-t added 11 commits February 15, 2019 13:16
Currently, the lifecycle functions for maps (allocation, deallocation, resize)
are not named in a uniform way and do not have a uniform function signature.
Rename the functions to fix that, and stick to libgit2's naming scheme of saying
`git_foo_new`. This results in the following new interface for allocation:

- `int git_<t>map_new(git_<t>map **out)` to allocate a new map, returning an
  error code if we ran out of memory

- `void git_<t>map_free(git_<t>map *map)` to free a map

- `void git_<t>map_clear(git<t>map *map)` to remove all entries from a map

This commit also fixes all existing callers.
There currently exist two different function names for getting the entry count
of maps, where offmaps offset and string maps use `num_entries` and OID maps use
`size`. In most programming languages with built-in map types, this is simply
called `size`, which is also shorter to type. Thus, this commit renames the
other two functions `num_entries` to match the common way and adjusts all
callers.
The current way of looking up an entry from a map is tightly coupled with the
map implementation, as one first has to look up the index of the key and then
retrieve the associated value by using the index. As a caller, you usually do
not care about any indices at all, though, so this is more complicated than
really necessary. Furthermore, it invites for errors to happen if the correct
error checking sequence is not being followed.

Introduce a new high-level function `git_strmap_get` that takes a map and a key
and returns a pointer to the associated value if such a key exists. Otherwise,
a `NULL` pointer is returned. Adjust all callers that can trivially be
converted.
Currently, one would use the function `git_strmap_insert` to insert key/value
pairs into a map. This function has historically been a macro, which is why its
syntax is kind of weird: instead of returning an error code directly, it instead
has to be passed a pointer to where the return value shall be stored. This does
not match libgit2's common idiom of directly returning error codes.

Introduce a new function `git_strmap_set`, which takes as parameters the map,
key and value and directly returns an error code. Convert all callers of
`git_strmap_insert` to make use of it.
The current way of looking up an entry from a map is tightly coupled with the
map implementation, as one first has to look up the index of the key and then
retrieve the associated value by using the index. As a caller, you usually do
not care about any indices at all, though, so this is more complicated than
really necessary. Furthermore, it invites for errors to happen if the correct
error checking sequence is not being followed.

Introduce a new high-level function `git_oidmap_get` that takes a map and a key
and returns a pointer to the associated value if such a key exists. Otherwise,
a `NULL` pointer is returned. Adjust all callers that can trivially be
converted.
Currently, one would use either `git_oidmap_insert` to insert key/value pairs
into a map or `git_oidmap_put` to insert a key only. These function have
historically been macros, which is why their syntax is kind of weird: instead of
returning an error code directly, they instead have to be passed a pointer to
where the return value shall be stored. This does not match libgit2's common
idiom of directly returning error codes.Furthermore, `git_oidmap_put` is tightly
coupled with implementation details of the map as it exposes the index of
inserted entries.

Introduce a new function `git_oidmap_set`, which takes as parameters the map,
key and value and directly returns an error code. Convert all trivial callers of
`git_oidmap_insert` and `git_oidmap_put` to make use of it.
The current way of looking up an entry from a map is tightly coupled with the
map implementation, as one first has to look up the index of the key and then
retrieve the associated value by using the index. As a caller, you usually do
not care about any indices at all, though, so this is more complicated than
really necessary. Furthermore, it invites for errors to happen if the correct
error checking sequence is not being followed.

Introduce a new high-level function `git_offmap_get` that takes a map and a key
and returns a pointer to the associated value if such a key exists. Otherwise,
a `NULL` pointer is returned. Adjust all callers that can trivially be
converted.
Currently, there is only one caller that adds entries into an offset map, and
this caller first uses `git_offmap_put` to add a key and then set the value at
the returned index by using `git_offmap_set_value_at`. This is just too tighlty
coupled with implementation details of the map as it exposes the index of
inserted entries, which we really do not care about at all.

Introduce a new function `git_offmap_set`, which takes as parameters the map,
key and value and directly returns an error code. Convert the caller to make use
of it instead.
The current way of looking up an entry from a map is tightly coupled with the
map implementation, as one first has to look up the index of the key and then
retrieve the associated value by using the index. As a caller, you usually do
not care about any indices at all, though, so this is more complicated than
really necessary. Furthermore, it invites for errors to happen if the correct
error checking sequence is not being followed.

Introduce new high-level functions `git_idxmap_get` and `git_idxmap_icase_get`
that take a map and a key and return a pointer to the associated value if such a
key exists. Otherwise, a `NULL` pointer is returned. Adjust all callers that can
trivially be converted.
Currently, one would use the function `git_idxmap_insert` to insert key/value
pairs into a map. This function has historically been a macro, which is why its
syntax is kind of weird: instead of returning an error code directly, it instead
has to be passed a pointer to where the return value shall be stored. This does
not match libgit2's common idiom of directly returning error codes.

Introduce a new function `git_idxmap_set`, which takes as parameters the map,
key and value and directly returns an error code. Convert all callers of
`git_idxmap_insert` to make use of it.
The currently existing function `git_idxmap_resize` and
`git_idxmap_icase_resize` do not return any error codes at all due to their
previous implementation making use of a macro. Due to that, it is impossible to
see whether the resize operation might have failed due to an out-of-memory
situation.

Fix this by providing a proper error code. Adjust callers to make use of it.
Currently, the delete functions of maps do not provide a return value. Like
this, it is impossible to tell whether the entry has really been deleted or not.
Change the implementation to provide either a return value of zero if the entry
has been successfully deleted or `GIT_ENOTFOUND` if the key could not be found.

Convert callers to the `delete_at` functions to instead use this higher-level
interface.
Some callers were still using the tightly-coupled pattern of `lookup_index` and
`valid_index` to verify that an entry exists in a map. Instead, use the more
high-level `exists` functions to decouple map users from its implementation.
Currently, our headers need to leak some implementation details of maps due to
their direct use of indices in the implementation of their foreach macros. This
makes it impossible to completely hide the map structures away, and also makes
it impossible to include the khash implementation header in the C files of the
respective map only.

This is now being fixed by providing a high-level iteration interface
`map_iterate`, which takes as inputs the map that shall be iterated over, an
iterator as well as the locations where keys and values shall be put into. For
simplicity's sake, the iterator is a simple `size_t` that shall initialized to
`0` on the first call. All existing foreach macros are then adjusted to make use
of this new function.
To compute whether there are objects missing in a packfile, the indexer keeps
around a map of OIDs that it still expects to see. This map does not store any
values at all, but in fact the keys are owned by the map itself. Right now, we
free these keys by iterating over the map and freeing the key itself, which is
kind of awkward as keys are expected to be constant.

We can make this a bit prettier by inserting the OID as value, too. As we
already store the `NULL` pointer either way, this does not increase memory
usage, but makes the code a tad more clear. Furthermore, we convert the
previously existing map iteration via indices to make use of an iterator,
instead.
To relieve us from memory pressure, we may regularly call `cache_evict_entries`
to remove some entries from it. Unfortunately, our cache does not support a
least-recently-used mode or something similar, which is why we evict entries
completeley at random right now. Thing is, this is only possible due to the map
interfaces exposing the entry indices, and we intend to completely remove those
to decouple map users from map implementations. As soon as that is done, we are
unable to do this random eviction anymore.

Convert this to make use of an iterator for now. Obviously, there is no random
eviction possible like that anymore, but we'll always start by evicting from the
beginning of the map. Due to hashing, one may hope that the selected buckets
will be evicted at least in some way unpredictably. But more likely than not,
this will not be the case. But let's see what happens and if any users complain
about degraded performance. If so, we might come up with a different scheme than
random removal, e.g. by using an LRU cache.
Remove the low-level interface that was exposing implementation details of
`git_strmap` to callers. From now on, only the high-level functions shall be
used to retrieve or modify values of a map. Adjust remaining existing callers.
Remove the low-level interface that was exposing implementation details of
`git_offmap` to callers. From now on, only the high-level functions shall be
used to retrieve or modify values of a map. Adjust remaining existing callers.
Remove the low-level interface that was exposing implementation details of
`git_oidmap` to callers. From now on, only the high-level functions shall be
used to retrieve or modify values of a map. Adjust remaining existing callers.
Remove the low-level interface that was exposing implementation details of
`git_idxmap` to callers. From now on, only the high-level functions shall be
used to retrieve or modify values of a map. Adjust remaining existing callers.
@pks-t pks-t force-pushed the pks/uniform-map-api branch from ceed892 to df42f36 Compare February 15, 2019 12:18
@pks-t
Copy link
Copy Markdown
Member Author

pks-t commented Feb 15, 2019

Rebased to verify that no new map users of the old API have been introduced in the meantime.

@pks-t
Copy link
Copy Markdown
Member Author

pks-t commented Feb 21, 2019

@ethomson: you got some time soonish to get this reviewed? Compiling with MSVC again reminded me of all the warnings that our current use of khash generates :/

@ethomson ethomson merged commit 4069f92 into libgit2:master Feb 22, 2019
@ethomson
Copy link
Copy Markdown
Member

Huge improvement. Thanks! 🎉

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants