gh-149807: Fix hash(frozendict): compute (key, value) pair hash #149841
vstinner wants to merge 3 commits
Conversation
It's good to avoid the frozenset hash code. It's not a good hash function; you can check this by constructing suitably chosen subsets. This showed up when trying to construct "bad cases" for the xxHash-based tuple hashing. Raymond was made aware of it, but never got around to "doing something" about it. No idea how the Boost-inspired scheme would work. Its scrambler does do some high-to-low propagation (via right shifts), but xxHash's rotate is best of all (and we took care to ensure that all major compilers emit a "rotate" instruction instead of the longer-winded portable C spelling we use). Long story short: properly validating a compound hash function, in the context of how it plays with CPython's hash results for primitive types (which, apart from string hashes, make no attempt at creating "random-looking" results), can require weeks of work. I can't make time for that, and have less than no interest in doing it again anyway ;-) I do have confidence in the tuple hashing approach, which was hard won.
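For reference, the "longer-winded portable C spelling" of the rotate is the standard two-shift idiom; a minimal sketch of that pattern (the name rotl31 is illustrative, not CPython's), which the major compilers recognize and collapse into a single rotate instruction:

```c
#include <stdint.h>

/* Portable spelling of "rotate left by 31" on a 64-bit value, as used by
 * the xxHash-based tuple hash on 64-bit builds. The (x >> 33) half carries
 * high-order bits back down; good compilers emit one ROL for the whole
 * expression. */
static inline uint64_t
rotl31(uint64_t x)
{
    return (x << 31) | (x >> (64 - 31));
}
```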
What about changing the starting hash to avoid the collision?
I don't expect it matters. The context is unlikely, and it wouldn't make much difference if it crops up. Collisions are actually pretty cheap on their own! Comparing objects of very different types for equality typically returns False. What does matter is "pileup": the number of distinct objects that all have the same hash code. That leads to long collision chains, which kill hash-based performance. In the absence of that, no number of "just pairs" that collide can slow things down much. OTOH, I have no objection either to starting with different seeds.
This change basically implements the hash specified by PEP 814.
I reused the xxHash-based tuple hash constants from pycore_tuple. If someone proposes a better hash function for frozendict, it can be adopted later.
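For context, a rough standalone model of the approach (a sketch, not the PR's code): hash each (key, value) pair with the xxHash-based tuple algorithm, then combine the pair hashes order-independently. The combine is shown here as a plain XOR, whereas the real code matches frozenset's combining, which does more scrambling. The numeric constants are the xxHash64 primes behind the _PyTuple_HASH_XXPRIME_* names; everything else is illustrative:

```c
#include <stdint.h>
#include <stdio.h>

#define XXPRIME_1 11400714785074694791ULL  /* xxHash64 PRIME64_1 */
#define XXPRIME_2 14029467366897019727ULL  /* xxHash64 PRIME64_2 */
#define XXPRIME_5  2870177450012600261ULL  /* xxHash64 PRIME64_5 */
#define XXROTATE(x) (((x) << 31) | ((x) >> 33))

/* xxHash-style tuple hash of the pair (key_hash, value_hash), len == 2. */
static uint64_t
pair_hash(uint64_t key_hash, uint64_t value_hash)
{
    uint64_t acc = XXPRIME_5;
    acc += key_hash * XXPRIME_2;
    acc = XXROTATE(acc);
    acc *= XXPRIME_1;
    acc += value_hash * XXPRIME_2;
    acc = XXROTATE(acc);
    acc *= XXPRIME_1;
    return acc + (2 ^ (XXPRIME_5 ^ 3527539UL));  /* length tail, len == 2 */
}

int main(void)
{
    /* Order-independent combine over the pairs (simplified frozenset-style). */
    uint64_t h = 0;
    h ^= pair_hash(1, 100);
    h ^= pair_hash(2, 200);
    printf("%016llx\n", (unsigned long long)h);
    return 0;
}
```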
cc @corona10 |
Ya, but the code is getting ever more cryptic and mysterious as bit-fiddling tricks got copied from one module to another. The cardinal sin is in the frozenset hash: comments in the original correctly point out that it's aimed at propagating low-order bit changes to higher-order bits, but it's blind to the fact that propagation in the other direction is also important. In the frozendict context, that doesn't matter, because xxHash already does a good job of propagating changes in both directions.
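To see the two-way propagation concretely, a tiny standalone demo (same xxHash64 primes as above): one round over two inputs that differ only in the lowest bit, printing the XOR of the results, which spreads across the whole 64-bit word:

```c
#include <stdint.h>
#include <stdio.h>

/* One xxHash-style round: add, multiply by an odd prime, rotate, multiply. */
static uint64_t
round1(uint64_t x)
{
    uint64_t acc = 2870177450012600261ULL + x * 14029467366897019727ULL;
    acc = (acc << 31) | (acc >> 33);
    return acc * 11400714785074694791ULL;
}

int main(void)
{
    uint64_t a = round1(42), b = round1(43);  /* inputs differ in bit 0 only */
    printf("xor of results: %016llx\n", (unsigned long long)(a ^ b));
    return 0;
}
```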
MojoVampire left a comment:
The include of pycore_tuple should have its comments updated to specify the additional constants you're borrowing from it (since it's not just _PyTuple_Recycle anymore).
Inline comments on performance and spec compliance.
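Concretely, the updated include comment might look something like this (hypothetical wording; the borrowed names are the ones quoted elsewhere in this review):

```c
/* pycore_tuple.h: _PyTuple_Recycle(), plus the constants borrowed by
   frozendict_pair_hash(): _PyTuple_HASH_XXPRIME_{1,2,5} and
   _PyTuple_HASH_XXROTATE(). */
#include "pycore_tuple.h"
```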
static Py_hash_t
frozendict_pair_hash(PyObject *key, PyObject *value)
{
    Py_ssize_t len = 2;
Perhaps make this static const, since it's not intended to be mutated; it's just a constant. I expect most optimizing compilers to realize they can inline len regardless (being a local that they can see is never changed), but why not make sure of it, so that acc += len ^ (_PyTuple_HASH_XXPRIME_5 ^ 3527539UL); is definitely just an addition, not a couple of XORs first?
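A sketch of the suggested tweak, as a fragment in the function's context (elided parts marked with comments):

```c
static Py_hash_t
frozendict_pair_hash(PyObject *key, PyObject *value)
{
    /* Never mutated; making that explicit guarantees the tail below
       folds to a single addition at compile time. */
    static const Py_ssize_t len = 2;
    /* ... xxHash rounds over the key and value hashes ... */
    acc += len ^ (_PyTuple_HASH_XXPRIME_5 ^ 3527539UL);
    /* ... */
}
```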
@@ -8244,17 +8278,11 @@ frozendict_hash(PyObject *op)
    PyObject *key, *value; // borrowed refs
    Py_ssize_t pos = 0;
    while (PyDict_Next(op, &pos, &key, &value)) {
Is there any reason not to do _PyDict_Next(op, &pos, NULL, &value, &keyhash), avoiding retrieving the key entirely in favor of solely retrieving its hash? (_PyDict_Next allows passing NULL for the key pointer, value pointer, or hash pointer; PyDict_Next itself just calls it with NULL for the hash pointer.) This then simplifies frozendict_pair_hash by letting it take the key hash rather than the key, saving a PyObject_Hash call and the associated error checking, and letting you directly initialize Py_uhash_t acc = _PyTuple_HASH_XXROTATE(_PyTuple_HASH_XXPRIME_5 + keyhash * _PyTuple_HASH_XXPRIME_2) * _PyTuple_HASH_XXPRIME_1; (maybe split up a bit, but you'd have everything you need for all those steps up front, and could reduce eight lines to just one to three). Sure, some types cache their hash internally so PyObject_Hash is low cost for them, but dict's internal iteration guarantees zero cost, so why not take advantage of it?
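A sketch of that simplification, under the suggestion above (the name frozendict_pair_hash_from and the -1 fixup constant mirroring the tuple hash are illustrative, not taken from the PR):

```c
/* Pair hash taking the key hash directly, so PyObject_Hash() is only
   ever called for the value. */
static Py_hash_t
frozendict_pair_hash_from(Py_hash_t keyhash, PyObject *value)
{
    Py_uhash_t acc = _PyTuple_HASH_XXROTATE(
        _PyTuple_HASH_XXPRIME_5
        + (Py_uhash_t)keyhash * _PyTuple_HASH_XXPRIME_2)
        * _PyTuple_HASH_XXPRIME_1;

    Py_hash_t vhash = PyObject_Hash(value);
    if (vhash == -1) {
        return -1;
    }
    acc += (Py_uhash_t)vhash * _PyTuple_HASH_XXPRIME_2;
    acc = _PyTuple_HASH_XXROTATE(acc);
    acc *= _PyTuple_HASH_XXPRIME_1;

    acc += 2 ^ (_PyTuple_HASH_XXPRIME_5 ^ 3527539UL);  /* length tail */
    if (acc == (Py_uhash_t)-1) {
        acc = 1546275796;  /* same -1 fixup the tuple hash uses */
    }
    return (Py_hash_t)acc;
}
```

The loop header in frozendict_hash then becomes while (_PyDict_Next(op, &pos, NULL, &value, &keyhash)), with Py_hash_t keyhash declared alongside value.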
    frozendict({"a": "b", False: True, True: "c"}),
]
hashes = {hash(fd) for fd in cases}
self.assertEqual(len(hashes), len(cases))
Perhaps simpler: just verify the PEP 814 guarantee that hash(fd) == hash(frozenset(fd.items()))? There's no strict rule that says none of these hashes can collide, especially with the vagaries of the seeded string hashes, but we can verify that the behavior matches the PEP 814 spec and trust that the frozenset and tuple hash algorithms are adequate. This also provides a good test to catch the case where someone later updates the tuple or frozenset hashing in a way that would break compatibility.
frozendict_hash doesn't match the PEP and might have too many collisions #149807