High-level map APIs by pks-t · Pull Request #4901 · libgit2/libgit2

pks-t · 2018-12-01T10:02:31Z

This is a complete rewrite of our map APIs. Previosuly, most places were using operations based on indices of the map entries, which is error prone, too tightly coupled with implementation details of the map and hard to read. This PR rewrites the API of all maps (str, oid, off, idx/idxicase) and introduces a set of high-level functions that act on those maps:

new/free/realloc/clear to handle memory and lifecycle
get/set to retrieve entries by key or to insert/update the value associated with a key
delete to remove a key/value pair by key
exists to check whether a key exists
iterate to iterate over all key/value pairs

The endgame of this PR goes on to delete the old API based on indices completely.

Note: please do not merge yet. While it builds and passes all tests, I still need to go through the commits and fix up the commit messages.

ethomson · 2018-12-01T10:04:45Z

Oooooooh, I'm excited to review this. 😀

pks-t · 2018-12-17T11:11:24Z

The commits have been rewritten to have proper commit messages now. CI passes now, too, so this is ready for review

pks-t · 2018-12-17T11:12:58Z

By the way, short hint: please have a special look at the commit "cache: use iteration interface for cache eviction". It changes the way we behave with regards to cache eviction, where we would previously evict starting at a random offset and now we evict starting at the first entry.

tiennou

Only minor comments from me, as it's a welcome cleanup overall anyways 😉.

tiennou · 2019-01-09T14:33:13Z

+int git_idxmap_new(git_idxmap **out)
 {
-	if ((*map = kh_init(idx)) == NULL) {
+	if ((*out = kh_init(idx)) == NULL) {


Maybe

*out = kh_init(idx); GITERR_CHECK_ALLOC(*out); return 0;

as it removes the need for a if/oom dance ?

I'd slap on an assert(out) on top as well…

tiennou · 2019-01-09T14:38:43Z

+ * @parameter map map containing the elements
+ * @return number of elements in the map
+ */
+size_t git_offmap_size(git_offmap *map);


Nitpicky: my preference goes to count (or length) over size, as I usually tend to think of size as a number-of-bytes vs. a number-of-elements.

Yeah, we have precedent for both, so I was kind of on the edge. Arrays have git_array_size, vectors have git_vector_length, pqueues have git_pqueue_size. @ethomson, care to decide this one?

And buffer has both size and asize but we ask for its len. 😢

I feel like we most commonly use size. I don't love that - it always feels to me like the allocated size, not the filled / used size of a struct - but I think that consistency is good here.

Yeah, had the same feeling. Thanks for chiming in!

tiennou · 2019-01-09T14:39:50Z

 	return kh_size(map);
 }

+void *git_strmap_get(git_strmap *map, const char *key)


tiennou · 2019-01-09T14:50:07Z

 	kh_clear(idxicase, map);
 }

+int git_idxmap_resize(git_idxmap *map, size_t size)


size => count nitpick: _grow ?

(the commit message is missing a "due" on the 2nd line).

In this example I disagree. Interestingly enough, kh_resize lets you resize maps to a smaller size. Nobody should be doing that, but this is why I don't want to name it grow ;)

tiennou · 2019-01-09T14:59:11Z

+ *        GIT_ITEROVER if no entries are left. A negative error
+ *        code otherwise.
+ */
+int git_offmap_iterate(git_offmap *map, size_t *iter, git_off_t *key, void **value);


This does not "conform" to the style of having "out" parameters at the beginning.

(git_off_t *key, void **value, size_t *iter, git_offmap *map) ?

tiennou · 2019-01-09T14:59:36Z

+ * Iterate over entries of the map.
+ *
+ * This functions allows to iterate over all key-value entries of
+ * the map. The current item is stored in the `iter` variable and


s/item/position/ ?

tiennou · 2019-01-09T15:02:12Z

-	for (__i = git_oidmap_begin(h); __i != git_oidmap_end(h); ++__i) {	\
-		if (!git_oidmap_has_data(h,__i)) continue;			\
-		(vvar) = git_oidmap_value_at(h,__i);				\
+#define git_oidmap_foreach_value(h, vvar, code) { size_t __i = 0;		\


That's unrelated, but I dislike code being an argument… (à la git_array_foreach).

I totally agree. We could just transform the while-loop into a for loop, initializing __i in the loop initializer. But let's do this after this PR got merged, it already has a lot of code churn.

tiennou · 2019-01-09T15:03:09Z

 }

-void test_core_strmap__1(void)
+void test_core_strmap__inserted_strings_can_be_retrieved(void)


tiennou · 2019-01-09T15:05:20Z

 {
-	uint32_t seed = rand();
-	size_t evict_count = 8;
+	size_t evict_count = 8, i;


commitmsg: s/completley/completely/.

tiennou · 2019-01-09T15:16:26Z

 		return;
 	}

+	i = 0;


I might be completely wrong, but what about initializing this to random(0, git_oidmap_size(cache->map) - evict_count) ?

Yeah, we sure could do this, but only because we do know the map's implementation details. We'd also need to wrap-around when reaching git_oidmap_size. But it sure keeps the original intention a lot better than my refactored code.

It just feels like the wrong data structure for the task at hand, at least in this specific cache eviction scenario. I'm not familiar with the code though, so I cannot say anything more to it. @ethomson, you've got any comment?

Hrmf. No? If I'm reading this correctly, we walk through the cache evicting items in-order, the difference is the starting point. It feels like if we're going to go to any lengths to change our eviction strategy that we should actually have a strategy based on age or something and not the ordering of the oid. Starting at a random place in the list seems like it might be slightly better than always evicting starting at the beginning of the map (I can imagine the latter enabling some pathological cases) but they both sort of suck anyway.

(This might be what you were referring to @pks-t with the "wrong data structure" comment.)

It feels like if we're going to go to any lengths to change our eviction strategy that we should actually have a strategy based on age or something and not the ordering of the oid.

That's exactly my thought here, too. An LRU cache might make a lot more sense.It would require us to keep track of the cached objects twice, though: once in the LRU queue for cache eviction and once in the map for quick lookup via OID.

So the question is how to proceed right now. I don't feel like introducing the random(0, git_oidmap_size(cache->map) - evict_count) thing, as it again ties us to the map implementation details. I know it's unlikely that it'll change anyway soon, but it just doesn't feel right. If you say that'd be required to get this merged, then fine, so be it and I'll just use @tiennou's proposal. But in that case, we should definitely revisit and fix it soon.

No, I don't think that this is significantly worse (in terms of cache eviction) than what's being replaced. 0, random, it's pretty mediocre either way. I'll file an issue but let's merge this now.

pks-t · 2019-01-21T12:51:40Z

Rebased on latest master to fix conflicts and address most of @tiennou's comments. Thanks for reviewing!

pks-t · 2019-01-23T09:54:22Z

Rebased to fix havoc caused by #4917 :)

ethomson · 2019-01-26T21:18:39Z

One place that we get a boatloat of warnings on Win32 is in the khash users. Right now we very often cast between size_t and khiter_t. But khiter_t is actually defined as 32 bits.

I'm a bit surprised that there's no warnings on gcc/clang with the implicit cast, but alas. I noticed that we're still using size_t for indices here; it would be nice if we started using khiter_t consistently to avoid that.

pks-t · 2019-01-28T08:44:05Z

On Sat, Jan 26, 2019 at 01:18:43PM -0800, Edward Thomson wrote: One place that we get a boatloat of warnings on Win32 is in the khash users. Right now we very often cast between `size_t` and `khiter_t`. But `khiter_t` is actually defined as 32 bits. I'm a bit surprised that there's no warnings on gcc/clang with the implicit cast, but alas. I noticed that we're still using `size_t` for indices here; it would be nice if we started using `khiter_t` consistently to avoid that.

I explicitly chose not to expose any implementation details of the khash stuff so that we can completley avoid including "khash.h" in our own header files. I'll try to come up with an alternative solution, though.

ethomson · 2019-01-28T08:51:32Z

True. This is definitely not worse than what we have now in the khiter_t/size_t differences, and more encapsulated. It’s not necessarily something we need to deal with before merging. I’m crazy busy this week but I’ll try to find some time to take a look at this more closely.

…

On Jan 28, 2019, at 08:44, Patrick Steinhardt ***@***.***> wrote: On Sat, Jan 26, 2019 at 01:18:43PM -0800, Edward Thomson wrote: > One place that we get a boatloat of warnings on Win32 is in the khash users. Right now we very often cast between `size_t` and `khiter_t`. But `khiter_t` is actually defined as 32 bits. > > I'm a bit surprised that there's no warnings on gcc/clang with the implicit cast, but alas. I noticed that we're still using `size_t` for indices here; it would be nice if we started using `khiter_t` consistently to avoid that. I explicitly chose not to expose any implementation details of the khash stuff so that we can completley avoid including "khash.h" in our own header files. I'll try to come up with an alternative solution, though. — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or mute the thread.

pks-t · 2019-01-28T08:54:37Z

On Mon, Jan 28, 2019 at 12:51:37AM -0800, Edward Thomson wrote: I’m crazy busy this week but I’ll try to find some time to take a look at this more closely.

No worries. I'll also be busy due to the upcoming Git Merge, so there's no need to hurry :)

Currently, the lifecycle functions for maps (allocation, deallocation, resize) are not named in a uniform way and do not have a uniform function signature. Rename the functions to fix that, and stick to libgit2's naming scheme of saying `git_foo_new`. This results in the following new interface for allocation: - `int git_<t>map_new(git_<t>map **out)` to allocate a new map, returning an error code if we ran out of memory - `void git_<t>map_free(git_<t>map *map)` to free a map - `void git_<t>map_clear(git<t>map *map)` to remove all entries from a map This commit also fixes all existing callers.

There currently exist two different function names for getting the entry count of maps, where offmaps offset and string maps use `num_entries` and OID maps use `size`. In most programming languages with built-in map types, this is simply called `size`, which is also shorter to type. Thus, this commit renames the other two functions `num_entries` to match the common way and adjusts all callers.

The current way of looking up an entry from a map is tightly coupled with the map implementation, as one first has to look up the index of the key and then retrieve the associated value by using the index. As a caller, you usually do not care about any indices at all, though, so this is more complicated than really necessary. Furthermore, it invites for errors to happen if the correct error checking sequence is not being followed. Introduce a new high-level function `git_strmap_get` that takes a map and a key and returns a pointer to the associated value if such a key exists. Otherwise, a `NULL` pointer is returned. Adjust all callers that can trivially be converted.

Currently, one would use the function `git_strmap_insert` to insert key/value pairs into a map. This function has historically been a macro, which is why its syntax is kind of weird: instead of returning an error code directly, it instead has to be passed a pointer to where the return value shall be stored. This does not match libgit2's common idiom of directly returning error codes. Introduce a new function `git_strmap_set`, which takes as parameters the map, key and value and directly returns an error code. Convert all callers of `git_strmap_insert` to make use of it.

The current way of looking up an entry from a map is tightly coupled with the map implementation, as one first has to look up the index of the key and then retrieve the associated value by using the index. As a caller, you usually do not care about any indices at all, though, so this is more complicated than really necessary. Furthermore, it invites for errors to happen if the correct error checking sequence is not being followed. Introduce a new high-level function `git_oidmap_get` that takes a map and a key and returns a pointer to the associated value if such a key exists. Otherwise, a `NULL` pointer is returned. Adjust all callers that can trivially be converted.

Currently, one would use either `git_oidmap_insert` to insert key/value pairs into a map or `git_oidmap_put` to insert a key only. These function have historically been macros, which is why their syntax is kind of weird: instead of returning an error code directly, they instead have to be passed a pointer to where the return value shall be stored. This does not match libgit2's common idiom of directly returning error codes.Furthermore, `git_oidmap_put` is tightly coupled with implementation details of the map as it exposes the index of inserted entries. Introduce a new function `git_oidmap_set`, which takes as parameters the map, key and value and directly returns an error code. Convert all trivial callers of `git_oidmap_insert` and `git_oidmap_put` to make use of it.

The current way of looking up an entry from a map is tightly coupled with the map implementation, as one first has to look up the index of the key and then retrieve the associated value by using the index. As a caller, you usually do not care about any indices at all, though, so this is more complicated than really necessary. Furthermore, it invites for errors to happen if the correct error checking sequence is not being followed. Introduce a new high-level function `git_offmap_get` that takes a map and a key and returns a pointer to the associated value if such a key exists. Otherwise, a `NULL` pointer is returned. Adjust all callers that can trivially be converted.

Currently, there is only one caller that adds entries into an offset map, and this caller first uses `git_offmap_put` to add a key and then set the value at the returned index by using `git_offmap_set_value_at`. This is just too tighlty coupled with implementation details of the map as it exposes the index of inserted entries, which we really do not care about at all. Introduce a new function `git_offmap_set`, which takes as parameters the map, key and value and directly returns an error code. Convert the caller to make use of it instead.

The current way of looking up an entry from a map is tightly coupled with the map implementation, as one first has to look up the index of the key and then retrieve the associated value by using the index. As a caller, you usually do not care about any indices at all, though, so this is more complicated than really necessary. Furthermore, it invites for errors to happen if the correct error checking sequence is not being followed. Introduce new high-level functions `git_idxmap_get` and `git_idxmap_icase_get` that take a map and a key and return a pointer to the associated value if such a key exists. Otherwise, a `NULL` pointer is returned. Adjust all callers that can trivially be converted.

Currently, one would use the function `git_idxmap_insert` to insert key/value pairs into a map. This function has historically been a macro, which is why its syntax is kind of weird: instead of returning an error code directly, it instead has to be passed a pointer to where the return value shall be stored. This does not match libgit2's common idiom of directly returning error codes. Introduce a new function `git_idxmap_set`, which takes as parameters the map, key and value and directly returns an error code. Convert all callers of `git_idxmap_insert` to make use of it.

The currently existing function `git_idxmap_resize` and `git_idxmap_icase_resize` do not return any error codes at all due to their previous implementation making use of a macro. Due to that, it is impossible to see whether the resize operation might have failed due to an out-of-memory situation. Fix this by providing a proper error code. Adjust callers to make use of it.

Currently, the delete functions of maps do not provide a return value. Like this, it is impossible to tell whether the entry has really been deleted or not. Change the implementation to provide either a return value of zero if the entry has been successfully deleted or `GIT_ENOTFOUND` if the key could not be found. Convert callers to the `delete_at` functions to instead use this higher-level interface.

Some callers were still using the tightly-coupled pattern of `lookup_index` and `valid_index` to verify that an entry exists in a map. Instead, use the more high-level `exists` functions to decouple map users from its implementation.

Currently, our headers need to leak some implementation details of maps due to their direct use of indices in the implementation of their foreach macros. This makes it impossible to completely hide the map structures away, and also makes it impossible to include the khash implementation header in the C files of the respective map only. This is now being fixed by providing a high-level iteration interface `map_iterate`, which takes as inputs the map that shall be iterated over, an iterator as well as the locations where keys and values shall be put into. For simplicity's sake, the iterator is a simple `size_t` that shall initialized to `0` on the first call. All existing foreach macros are then adjusted to make use of this new function.

To compute whether there are objects missing in a packfile, the indexer keeps around a map of OIDs that it still expects to see. This map does not store any values at all, but in fact the keys are owned by the map itself. Right now, we free these keys by iterating over the map and freeing the key itself, which is kind of awkward as keys are expected to be constant. We can make this a bit prettier by inserting the OID as value, too. As we already store the `NULL` pointer either way, this does not increase memory usage, but makes the code a tad more clear. Furthermore, we convert the previously existing map iteration via indices to make use of an iterator, instead.

To relieve us from memory pressure, we may regularly call `cache_evict_entries` to remove some entries from it. Unfortunately, our cache does not support a least-recently-used mode or something similar, which is why we evict entries completeley at random right now. Thing is, this is only possible due to the map interfaces exposing the entry indices, and we intend to completely remove those to decouple map users from map implementations. As soon as that is done, we are unable to do this random eviction anymore. Convert this to make use of an iterator for now. Obviously, there is no random eviction possible like that anymore, but we'll always start by evicting from the beginning of the map. Due to hashing, one may hope that the selected buckets will be evicted at least in some way unpredictably. But more likely than not, this will not be the case. But let's see what happens and if any users complain about degraded performance. If so, we might come up with a different scheme than random removal, e.g. by using an LRU cache.

Remove the low-level interface that was exposing implementation details of `git_strmap` to callers. From now on, only the high-level functions shall be used to retrieve or modify values of a map. Adjust remaining existing callers.

Remove the low-level interface that was exposing implementation details of `git_offmap` to callers. From now on, only the high-level functions shall be used to retrieve or modify values of a map. Adjust remaining existing callers.

Remove the low-level interface that was exposing implementation details of `git_oidmap` to callers. From now on, only the high-level functions shall be used to retrieve or modify values of a map. Adjust remaining existing callers.

Remove the low-level interface that was exposing implementation details of `git_idxmap` to callers. From now on, only the high-level functions shall be used to retrieve or modify values of a map. Adjust remaining existing callers.

pks-t · 2019-02-15T12:18:09Z

Rebased to verify that no new map users of the old API have been introduced in the meantime.

pks-t · 2019-02-21T12:40:20Z

@ethomson: you got some time soonish to get this reviewed? Compiling with MSVC again reminded me of all the warnings that our current use of khash generates :/

ethomson · 2019-02-22T10:56:22Z

Huge improvement. Thanks! 🎉

pks-t force-pushed the pks/uniform-map-api branch from 34fe828 to 42d2667 Compare December 17, 2018 08:59

pks-t changed the title ~~[WIP] High-level map APIs~~ High-level map APIs Dec 17, 2018

pks-t force-pushed the pks/uniform-map-api branch 5 times, most recently from de0a156 to c709b97 Compare December 17, 2018 10:49

tiennou approved these changes Jan 9, 2019

View reviewed changes

pks-t force-pushed the pks/uniform-map-api branch from c709b97 to b9c6d04 Compare January 21, 2019 12:50

pks-t force-pushed the pks/uniform-map-api branch from b9c6d04 to ceed892 Compare January 23, 2019 09:52

pks-t added 11 commits February 15, 2019 13:16

pks-t added 9 commits February 15, 2019 13:16

strmap: remove legacy low-level interface

fdfabdc

Remove the low-level interface that was exposing implementation details of `git_strmap` to callers. From now on, only the high-level functions shall be used to retrieve or modify values of a map. Adjust remaining existing callers.

offmap: remove legacy low-level interface

4713e7c

Remove the low-level interface that was exposing implementation details of `git_offmap` to callers. From now on, only the high-level functions shall be used to retrieve or modify values of a map. Adjust remaining existing callers.

oidmap: remove legacy low-level interface

bd66925

Remove the low-level interface that was exposing implementation details of `git_oidmap` to callers. From now on, only the high-level functions shall be used to retrieve or modify values of a map. Adjust remaining existing callers.

idxmap: remove legacy low-level interface

df42f36

Remove the low-level interface that was exposing implementation details of `git_idxmap` to callers. From now on, only the high-level functions shall be used to retrieve or modify values of a map. Adjust remaining existing callers.

pks-t force-pushed the pks/uniform-map-api branch from ceed892 to df42f36 Compare February 15, 2019 12:18

pks-t mentioned this pull request Feb 15, 2019

[RFC] Khash refactorings (pt. 2) #4417

Closed

ethomson merged commit 4069f92 into libgit2:master Feb 22, 2019

pks-t deleted the pks/uniform-map-api branch February 22, 2019 11:03

snyk-bot mentioned this pull request Feb 23, 2020

[Snyk] Upgrade nodegit from 0.4.1 to 0.26.4 saurabharch/Breezeblocks#1

Open

snyk-bot mentioned this pull request Apr 22, 2020

[Snyk] Upgrade nodegit from 0.24.3 to 0.26.5 aminatakonate000/Graviton-App#4

Open

snyk-bot mentioned this pull request May 5, 2020

[Snyk] Upgrade nodegit from 0.24.3 to 0.26.5 Barnstorm-Online/ngp-openapi-generator#1

Open

               		return;
               	}
+              	i = 0;

Conversation

pks-t commented Dec 1, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ethomson commented Dec 1, 2018

Uh oh!

pks-t commented Dec 17, 2018

Uh oh!

pks-t commented Dec 17, 2018

Uh oh!

tiennou left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

pks-t commented Jan 21, 2019

Uh oh!

pks-t commented Jan 23, 2019

Uh oh!

ethomson commented Jan 26, 2019

Uh oh!

pks-t commented Jan 28, 2019 via email

Uh oh!

ethomson commented Jan 28, 2019 via email

Uh oh!

pks-t commented Jan 28, 2019 via email

Uh oh!

pks-t commented Feb 15, 2019

Uh oh!

pks-t commented Feb 21, 2019

Uh oh!

ethomson commented Feb 22, 2019

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

pks-t commented Dec 1, 2018 •

edited

Loading