
gh-149807: Fix hash(frozendict): compute (key, value) pair hash#149841

Open
vstinner wants to merge 3 commits into
python:mainfrom
vstinner:frozendict_pair_hash

Conversation

@vstinner (Member) commented May 14, 2026

@tim-one (Member) commented May 14, 2026

It's good to avoid the frozenset hash code. It's not a good hash function. You can check this by constructing subsets of [i/2**n for i in range(2**n)]. The hashes of those elements vary in only the high-order bits, and the frozenset hash function is poor at "avalanching" high-order changes to lower-order bits. It's good in the other direction, though. For frozensets of low-precision floats, collisions are far too common.
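The structure of those float hashes is easy to verify directly (a sketch, assuming a 64-bit CPython build, where sys.hash_info.modulus is the Mersenne prime 2**61 - 1):

```python
import sys

# On 64-bit builds, hash(i / 2**n) == i * 2**(61 - n) (mod 2**61 - 1).
# For n = 8 and 0 < i < 2**8 the product never wraps around the modulus,
# so the low 53 bits of every hash are zero: all the variation sits in
# the high-order bits, exactly where frozenset's combining step is weakest.
n = 8
vals = [i / 2**n for i in range(1, 2**n)]

if sys.hash_info.modulus == 2**61 - 1:
    assert all(hash(v) % 2**(61 - n) == 0 for v in vals)
```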

This showed up when trying to construct "bad cases" for the xxHash-based tuple hashing. Raymond was made aware of it, but never got around to "doing something" about it.

No idea how the Boost-inspired scheme would work. Its scrambler does do some high-to-low propagation (via right shifts), but xxHash's rotate is best-of-all (and we took care to ensure that all major compilers did emit a "rotate" instruction instead of the longer-winded portable C spelling we use).
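The rotate-versus-shift point can be sketched in a few lines. The multiplier below is the xxHash64 prime used by CPython's tuple hash; any odd constant behaves the same way:

```python
MASK = (1 << 64) - 1

def rotl(x, r):
    # Portable spelling of a 64-bit rotate; C compilers emit a single
    # rotate instruction for the equivalent C expression.
    return ((x << r) | (x >> (64 - r))) & MASK

x = 0x0123456789ABCDEF
y = x ^ (1 << 63)             # flip only the top bit

# Multiplication (mod 2**64) by any odd constant cannot move that change
# rightward: the two products differ in bit 63 alone.
PRIME = 11400714785074694791  # xxHash64 prime, odd
assert ((x * PRIME) & MASK) ^ ((y * PRIME) & MASK) == 1 << 63

# A rotate does move it: after rotl(..., 31) the difference sits in bit 30,
# from where subsequent multiplies can spread it across the low half.
assert rotl(x, 31) ^ rotl(y, 31) == 1 << 30
```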

Long story short: properly validating a compound hash function in the context of how it plays with CPython's hash results for primitive types (which, apart from string hashes, make no attempt at creating "random-looking" results) can require weeks of work.

I can't make time for that, and have less than no interest in doing it again anyway ;-) I do have confidence in the tuple hashing approach - which was hard won.

@lambda-abstraction commented
What about changing the starting hash to avoid the collision with frozenset(frozendict.items())? I'm not sure it's realistic for a single set/dict to use both frozensets and frozendicts as keys, but I don't think making the hash of a frozendict exactly identical to that of the frozenset of its entries is ideal.

@tim-one (Member) commented May 15, 2026

I don't expect it matters. The context is unlikely, and it wouldn't make much difference if it crops up. Collisions are actually pretty cheap on their own! Comparing objects of very different types for equality typically returns False at once.

What does matter is "pileup": the number of distinct objects that all have the same hashcode. That leads to long collision chains, which kill hash-based performance. In the absence of that, no number of "just pairs" that collide can slow things down much.

OTOH, I have no objection either to starting with different seeds.
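The scenario under discussion is easy to model with a toy stand-in (a hypothetical pure-Python FrozenDict that hashes like frozenset(self.items()), as PEP 814 specifies; this is not the C implementation under review):

```python
class FrozenDict(dict):
    """Toy stand-in: hashes like frozenset(self.items())."""
    def __hash__(self):
        return hash(frozenset(self.items()))

fd = FrozenDict(a=1)
fs = frozenset(fd.items())

# The collision lambda-abstraction describes:
assert hash(fd) == hash(fs)

# But dict.__eq__ against a frozenset is False immediately, so both keys
# coexist in one set; the collision costs a single cheap equality check.
s = {fd, fs}
assert len(s) == 2
```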

@vstinner (Member, Author) commented

This change basically implements hash(frozendict) as hash(frozenset(frozendict.items())) without having to create a concrete frozenset object. It reuses the hash(tuple) and hash(frozenset) code; it doesn't invent a new hashing function.

hash(frozenset(frozendict.items())) is used by other frozendict implementations such as https://pypi.org/project/frozendict/: see its __hash__() method.

It's good to avoid the frozenset hash code. It's not a good hash function.

I reused the hash(frozenset) code since it fits frozendict's needs well: hashing an unordered set of items.

If someone proposes a better hash function for frozenset, frozendict can be updated to reuse it. For me, that's out of scope for the gh-149807 bug report.

@vstinner (Member, Author) commented

cc @corona10

@tim-one (Member) commented May 15, 2026

Ya, but the code is getting ever more cryptic and mysterious as bit-fiddling tricks get copied from one module to another.

The cardinal sin of frozenset's hashing, copied here into frozendict, is its feeble _shuffle_bits function. Multiplication and left shift can only propagate bit changes "to the left": high-order bit changes never propagate "to the right", which is why it's a disaster for sets of simple floats, whose hashes differ only in the high-order bits.

Comments in the original correctly point out that it's aimed at propagating low-order bit changes to higher-order bits, but are blind to the fact that propagation in the other direction is also important.

In the frozendict context, that doesn't matter, because xxHash already does a good job of propagating changes in both directions. Indeed, calling _shuffle_bits on top of that is almost certainly a waste of cycles.
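For reference, the combining step under discussion, ported to Python from CPython's frozenset_hash() in Objects/setobject.c (a sketch assuming a 64-bit build; constants as in current CPython):

```python
MASK = (1 << 64) - 1  # assumes a 64-bit build

def _shuffle_bits(h):
    # XOR-with-left-shift plus multiply: every operation here moves
    # changes leftward only; a flipped high bit never reaches the low bits.
    return ((h ^ 89869747) ^ ((h << 16) & MASK)) * 3644798167 & MASK

def frozenset_hash(elems):
    h = 0
    for e in elems:                    # XOR makes the result order-independent
        h ^= _shuffle_bits(hash(e) & MASK)
    h ^= (len(elems) + 1) * 1927868237 & MASK
    h ^= (h >> 11) ^ (h >> 25)         # the only right-shifts in the pipeline
    h = (h * 69069 + 907133923) & MASK
    if h == MASK:                      # -1 is reserved as an error code
        h = 590923713
    return h - (1 << 64) if h >= (1 << 63) else h

assert frozenset_hash({1, 2, 3}) == hash(frozenset({1, 2, 3}))
```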

@MojoVampire (Contributor) left a review

The include of pycore_tuple should have its comments updated to specify the additional constants you're borrowing from it (since it's not just _PyTuple_Recycle anymore).

Inline comments on performance and spec compliance.

Comment thread Objects/dictobject.c
static Py_hash_t
frozendict_pair_hash(PyObject *key, PyObject *value)
{
Py_ssize_t len = 2;
Contributor comment:

Perhaps make this a static const, since it's not intended to be mutated; it's just a constant. I expect most optimizing compilers to realize they can inline len regardless (being a local that they can see is never changed), but why not make sure of it, so that acc += len ^ (_PyTuple_HASH_XXPRIME_5 ^ 3527539UL); is definitely just an addition, not a couple of XORs first?

Comment thread Objects/dictobject.c
@@ -8244,17 +8278,11 @@ frozendict_hash(PyObject *op)
PyObject *key, *value; // borrowed refs
Py_ssize_t pos = 0;
while (PyDict_Next(op, &pos, &key, &value)) {
Contributor comment:

Is there any reason not to do _PyDict_Next(op, &pos, NULL, &value, &keyhash), avoiding retrieving the key entirely in favor of solely retrieving its hash? (_PyDict_Next allows passing NULL for the key pointer, value pointer, or hash pointer; PyDict_Next itself just calls it with NULL for the hash pointer.) That would simplify frozendict_pair_hash by letting it take the key hash rather than the key, saving a PyObject_Hash call and the associated error checking, and letting you directly initialize Py_uhash_t acc = _PyTuple_HASH_XXROTATE(_PyTuple_HASH_XXPRIME_5 + keyhash * _PyTuple_HASH_XXPRIME_2) * _PyTuple_HASH_XXPRIME_1; (maybe split up a bit, but you'd have everything you need up front, and could reduce eight lines to one to three). Sure, some types cache their hash internally so PyObject_Hash is cheap for them, but dict's internal iteration guarantees zero cost, so why not take advantage of it?
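For context, the per-pair computation being simplified here hashes each (key, value) pair with CPython's xxHash-based tuple algorithm, so it produces the same value as hash((key, value)). A pure-Python port makes the steps concrete (a sketch from Objects/tupleobject.c, assuming a 64-bit build):

```python
MASK = (1 << 64) - 1  # assumes a 64-bit build
XXPRIME_1 = 11400714785074694791
XXPRIME_2 = 14029467366897019727
XXPRIME_5 = 2870177450012600261

def frozendict_pair_hash(key, value):
    # Same value as hash((key, value)): two lanes fed through the
    # xxHash-style round, with len == 2 folded into the finalizer.
    acc = XXPRIME_5
    for lane in (hash(key) & MASK, hash(value) & MASK):
        acc = (acc + lane * XXPRIME_2) & MASK
        acc = (((acc << 31) | (acc >> 33)) & MASK) * XXPRIME_1 & MASK
    acc = (acc + (2 ^ (XXPRIME_5 ^ 3527539))) & MASK
    if acc == MASK:            # -1 is reserved as an error code
        acc = 1546275796
    return acc - (1 << 64) if acc >= (1 << 63) else acc

assert frozendict_pair_hash("k", 1) == hash(("k", 1))
```

Note how the first loop iteration reduces to exactly the one-expression initialization suggested above: ROTATE(XXPRIME_5 + keyhash * XXPRIME_2) * XXPRIME_1.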

Comment thread Lib/test/test_dict.py
frozendict({"a": "b", False: True, True: "c"}),
]
hashes = {hash(fd) for fd in cases}
self.assertEqual(len(hashes), len(cases))
Contributor comment:

Perhaps simpler: just verify the PEP 814 guarantee that hash(fd) == hash(frozenset(fd.items()))? There's no strict rule that says none of these hashes can collide, especially with the vagaries of the seeded string hashes, but we can verify that the behavior matches the PEP 814 spec and trust that the frozenset and tuple hash algorithms are adequate. This also provides a good test to catch someone updating the tuple or frozenset hashing in a way that would break compatibility.


5 participants