gh-149807: Fix hash(frozendict): compute (key, value) pair hash #149841
vstinner wants to merge 3 commits
Conversation
It's good to avoid the frozenset hash code. It's not a good hash function; you can check this by constructing suitably chosen subsets. This showed up when trying to construct "bad cases" for the xxHash-based tuple hashing. Raymond was made aware of it, but never got around to "doing something" about it. No idea how the Boost-inspired scheme would work. Its scrambler does do some high-to-low propagation (via right shifts), but xxHash's rotate is best of all (and we took care to ensure that all major compilers emit a "rotate" instruction instead of the longer-winded portable C spelling we use). Long story short: properly validating a compound hash function, in the context of how it plays with CPython's hash results for primitive types (which, apart from string hashes, make no attempt at creating "random-looking" results), can require weeks of work. I can't make time for that, and have less than no interest in doing it again anyway ;-) I do have confidence in the tuple hashing approach, which was hard won.
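For reference, the "longer-winded portable C spelling" of the rotate is the standard two-shift idiom; a minimal sketch of that pattern (the name rotl31 is illustrative, not CPython's), which the major compilers recognize and collapse into a single rotate instruction:

```c
#include <stdint.h>

/* Portable spelling of "rotate left by 31" on a 64-bit value, as used by
 * the xxHash-based tuple hash on 64-bit builds. The (x >> 33) half carries
 * high-order bits back down; good compilers emit one ROL for the whole
 * expression. */
static inline uint64_t
rotl31(uint64_t x)
{
    return (x << 31) | (x >> (64 - 31));
}
```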
What about changing the starting hash to avoid the collision?
I don't expect it matters. The context is unlikely, and it wouldn't make much difference if it crops up. Collisions are actually pretty cheap on their own! Comparing objects of very different types for equality typically returns False. What does matter is "pileup": the number of distinct objects that all have the same hash code. That leads to long collision chains, which kill hash-based performance. In the absence of that, no number of "just pairs" that collide can slow things down much. OTOH, I have no objection either to starting with different seeds.
This change basically implements the hash specified by PEP 814.
I reused the xxHash-based tuple hash constants from pycore_tuple. If someone proposes a better hash function for frozendict, it can be adopted later.
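For context, a rough standalone model of the approach (a sketch, not the PR's code): hash each (key, value) pair with the xxHash-based tuple algorithm, then combine the pair hashes order-independently. The combine is shown here as a plain XOR, whereas the real code matches frozenset's combining, which does more scrambling. The numeric constants are the xxHash64 primes behind the _PyTuple_HASH_XXPRIME_* names; everything else is illustrative:

```c
#include <stdint.h>
#include <stdio.h>

#define XXPRIME_1 11400714785074694791ULL  /* xxHash64 PRIME64_1 */
#define XXPRIME_2 14029467366897019727ULL  /* xxHash64 PRIME64_2 */
#define XXPRIME_5  2870177450012600261ULL  /* xxHash64 PRIME64_5 */
#define XXROTATE(x) (((x) << 31) | ((x) >> 33))

/* xxHash-style tuple hash of the pair (key_hash, value_hash), len == 2. */
static uint64_t
pair_hash(uint64_t key_hash, uint64_t value_hash)
{
    uint64_t acc = XXPRIME_5;
    acc += key_hash * XXPRIME_2;
    acc = XXROTATE(acc);
    acc *= XXPRIME_1;
    acc += value_hash * XXPRIME_2;
    acc = XXROTATE(acc);
    acc *= XXPRIME_1;
    return acc + (2 ^ (XXPRIME_5 ^ 3527539UL));  /* length tail, len == 2 */
}

int main(void)
{
    /* Order-independent combine over the pairs (simplified frozenset-style). */
    uint64_t h = 0;
    h ^= pair_hash(1, 100);
    h ^= pair_hash(2, 200);
    printf("%016llx\n", (unsigned long long)h);
    return 0;
}
```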
cc @corona10 |
Ya, but the code is getting ever more cryptic and mysterious as bit-fiddling tricks got copied from one module to another. The cardinal sin is in the frozenset hash: comments in the original correctly point out that it's aimed at propagating low-order bit changes to higher-order bits, but it's blind to the fact that propagation in the other direction is also important. In the frozendict context, that doesn't matter, because xxHash already does a good job of propagating changes in both directions.
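To see the two-way propagation concretely, a tiny standalone demo (same xxHash64 primes as above): one round over two inputs that differ only in the lowest bit, printing the XOR of the results, which spreads across the whole 64-bit word:

```c
#include <stdint.h>
#include <stdio.h>

/* One xxHash-style round: add, multiply by an odd prime, rotate, multiply. */
static uint64_t
round1(uint64_t x)
{
    uint64_t acc = 2870177450012600261ULL + x * 14029467366897019727ULL;
    acc = (acc << 31) | (acc >> 33);
    return acc * 11400714785074694791ULL;
}

int main(void)
{
    uint64_t a = round1(42), b = round1(43);  /* inputs differ in bit 0 only */
    printf("xor of results: %016llx\n", (unsigned long long)(a ^ b));
    return 0;
}
```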
MojoVampire left a comment:
The include of pycore_tuple should have its comments updated to specify the additional constants you're borrowing from it (since it's not just _PyTuple_Recycle anymore).
Inline comments on performance and spec compliance.
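Concretely, the updated include comment might look something like this (hypothetical wording; the borrowed names are the ones quoted elsewhere in this review):

```c
/* pycore_tuple.h: _PyTuple_Recycle(), plus the constants borrowed by
   frozendict_pair_hash(): _PyTuple_HASH_XXPRIME_{1,2,5} and
   _PyTuple_HASH_XXROTATE(). */
#include "pycore_tuple.h"
```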
static Py_hash_t
frozendict_pair_hash(PyObject *key, PyObject *value)
{
    Py_ssize_t len = 2;
Perhaps make this static const, since it's not intended to be mutated; it's just a constant. I expect most optimizing compilers to realize they can inline len regardless (being a local that they can see is never changed), but why not make sure of it, so that acc += len ^ (_PyTuple_HASH_XXPRIME_5 ^ 3527539UL); is definitely just an addition, not a couple of XORs first?
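A sketch of the suggested tweak, as a fragment in the function's context (elided parts marked with comments):

```c
static Py_hash_t
frozendict_pair_hash(PyObject *key, PyObject *value)
{
    /* Never mutated; making that explicit guarantees the tail below
       folds to a single addition at compile time. */
    static const Py_ssize_t len = 2;
    /* ... xxHash rounds over the key and value hashes ... */
    acc += len ^ (_PyTuple_HASH_XXPRIME_5 ^ 3527539UL);
    /* ... */
}
```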
@@ -8244,17 +8278,11 @@ frozendict_hash(PyObject *op)
    PyObject *key, *value; // borrowed refs
    Py_ssize_t pos = 0;
    while (PyDict_Next(op, &pos, &key, &value)) {
Is there any reason not to do _PyDict_Next(op, &pos, NULL, &value, &keyhash), avoiding retrieving the key entirely in favor of solely retrieving its hash? (_PyDict_Next allows passing NULL for the key pointer, value pointer, or hash pointer; PyDict_Next itself just calls it with NULL for the hash pointer.) This then simplifies frozendict_pair_hash by letting it take the key hash rather than the key, saving a PyObject_Hash call and the associated error checking, and letting you directly initialize Py_uhash_t acc = _PyTuple_HASH_XXROTATE(_PyTuple_HASH_XXPRIME_5 + keyhash * _PyTuple_HASH_XXPRIME_2) * _PyTuple_HASH_XXPRIME_1; (maybe split up a bit, but you'd have everything you need for all those steps up front, and could reduce eight lines to just one to three). Sure, some types cache their hash internally so PyObject_Hash is low cost for them, but dict's internal iteration guarantees zero cost, so why not take advantage of it?
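A sketch of that simplification, under the suggestion above (the name frozendict_pair_hash_from and the -1 fixup constant mirroring the tuple hash are illustrative, not taken from the PR):

```c
/* Pair hash taking the key hash directly, so PyObject_Hash() is only
   ever called for the value. */
static Py_hash_t
frozendict_pair_hash_from(Py_hash_t keyhash, PyObject *value)
{
    Py_uhash_t acc = _PyTuple_HASH_XXROTATE(
        _PyTuple_HASH_XXPRIME_5
        + (Py_uhash_t)keyhash * _PyTuple_HASH_XXPRIME_2)
        * _PyTuple_HASH_XXPRIME_1;

    Py_hash_t vhash = PyObject_Hash(value);
    if (vhash == -1) {
        return -1;
    }
    acc += (Py_uhash_t)vhash * _PyTuple_HASH_XXPRIME_2;
    acc = _PyTuple_HASH_XXROTATE(acc);
    acc *= _PyTuple_HASH_XXPRIME_1;

    acc += 2 ^ (_PyTuple_HASH_XXPRIME_5 ^ 3527539UL);  /* length tail */
    if (acc == (Py_uhash_t)-1) {
        acc = 1546275796;  /* same -1 fixup the tuple hash uses */
    }
    return (Py_hash_t)acc;
}
```

The loop header in frozendict_hash then becomes while (_PyDict_Next(op, &pos, NULL, &value, &keyhash)), with Py_hash_t keyhash declared alongside value.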
    frozendict({"a": "b", False: True, True: "c"}),
]
hashes = {hash(fd) for fd in cases}
self.assertEqual(len(hashes), len(cases))
Perhaps simpler: just verify the PEP 814 guarantee that hash(fd) == hash(frozenset(fd.items()))? There's no strict rule that says none of these hashes can collide, especially with the vagaries of the seeded string hashes, but we can verify that the behavior matches the PEP 814 spec and trust that the frozenset and tuple hash algorithms are adequate. This also provides a good test to catch the case where someone later updates the tuple or frozenset hashing in a way that would break compatibility.
frozendict_hash doesn't match the PEP and might have too many collisions #149807