Equirecursive type canonicalization#3717
Conversation
Use Hopcroft's DFA minimization algorithm to properly canonicalize and deduplicate recursive types. Type canonicalization has two stages: 1. Structural canonicalization - The top-level structure of HeapTypes is used to split the declared HeapTypes into their initial partitions. - Hopcroft's algorithm refines the partitions such that all pairs of distinguishable types end up in different partitions. - A fresh HeapTypeInfo is created for each final partition. Each new HeapTypeInfo is linked to other new HeapTypeInfos to create a minimal type definition graph that defines the same types as the original graph. 2. Global canonicalization - Each new minimal HeapTypeInfo that does not have the same finite structure as an existing globally canonical HeapTypeInfo is moved to the global heap type store to become the new canonical HeapTypeInfo. - Each temporary Type referenced by the newly canonical HeapTypeInfos is replaced in-place with the equivalent canonical Type to avoid leaking temporary Types into the global stores.
|
Great to see this! Before getting to the code here: Does this pass the fuzzer? If so then I think it may be worth landing first and reviewing later. I've investigated the fuzz bugs fallout from the LUB removal and not all of them are about subtyping (e.g. #3715 - it's hard to find how many more remain). That blocks an emscripten release and might affect |
|
Hmm, it's a little hard to tell about the fuzzer. When I run it, I get a bunch of errors about "Could not ensure supertype," which is fallout from LUB removal that is probably not related to this PR. I haven't seen any other errors, though, so at least this is more stable than the LUB removal. For example, my last run lasted 240 iterations before hitting the no-LUB bug. Does the fuzzer actually generate interesting (i.e. duplicate, recursive) type definitions, though? How about we proceed as normal with review while I re-implement LUBbing on top of this, and when I finish that we can decide whether it makes sense to accelerate the landing of this PR and the LUB PR based on what the fuzzer thinks of the new LUBbing? |
|
Sounds good @tlively . I incorrectly thought this PR included LUB calculations. The fuzzer won't generate complex types, correct. |
| bool isTemp = false; | ||
| bool isSelfReferential = false; | ||
| // If `isFinalized`, then hashing and equality are performed on the finite | ||
| // structure of the type definition tree rooted at the HeapTypeInfo. |
There was a problem hiding this comment.
is "finalized" after global canonicalization basically? (from the comment i'm not sure if it's another stage)
There was a problem hiding this comment.
No, it's after the user has stopped adding new types to the TypeBuilder.
There was a problem hiding this comment.
I see - that wasn't obvious to me then. Perhaps stillBuilding or stillAdding would be clearer?
| // Helper for hashing the structures of TypeInfos and HeapTypeInfos. Uses de | ||
| // Bruijn indices to represent back edges so that infos referring to different | ||
| // type IDs but sharing a finite structure will compare and hash the same. | ||
| struct TypeStructureHasher { |
There was a problem hiding this comment.
how about calling these finite hasher/comparators FiniteStructureHasher etc. so the finiteness is more explicit in the name?
Also how about Structure => Shape as we have GC structs. So FiniteShapeHasher? (I think that's clear enough without a Type in the name, as it's the shape of a Type, and it's in wasm-type.cpp.)
| return rtt == other.rtt; | ||
| } | ||
| WASM_UNREACHABLE("unexpected kind"); | ||
| return TypeStructureEquator().eq(*this, other); |
There was a problem hiding this comment.
Is the idea that it is ok to only compare the finite shape because this is after global canonicalization? (if so then a comment maybe could help)
There was a problem hiding this comment.
Not after global canonicalization, but after shape canonicalization. Will add a comment.
| } | ||
|
|
||
| return printChild(type, [&]() { | ||
| if (isTemp(type)) { |
There was a problem hiding this comment.
with these uses of isTemp I think we can remove the hacks on lines 278-281 and 287-290.
There was a problem hiding this comment.
Oh hm, I thought I had already done that. Thanks for the catch!
| // type IDs but sharing a finite structure will compare and hash the same. | ||
| struct TypeStructureHasher { | ||
| bool topLevelOnly; | ||
| size_t currDepth = 0; |
There was a problem hiding this comment.
It looks like currDepth is in the de Bruijn sense, and not in the tree sense, is that right? As there is no decrement for it. If so a comment could help as the same name is used in the printer where it is in the tree sense. Or maybe a different name? https://en.wikipedia.org/wiki/De_Bruijn_index doesn't use "depth", so perhaps currIndex?
(The erase where a decrement would appear (line 1378) also implies this index in a particular subtree, is that right? edit: oh, it has to be, nevermind this last part)
There was a problem hiding this comment.
Yeah, that was a bug I think. But this should be fixed by the latest change.
| } | ||
|
|
||
| size_t TypeStructureHasher::hash(const Rtt& rtt) { | ||
| size_t digest = rtt.depth; |
There was a problem hiding this comment.
| size_t digest = rtt.depth; | |
| size_t digest = wasm::hash(rtt.depth); |
| // The new, minimal type definition graph. | ||
| std::vector<std::unique_ptr<HeapTypeInfo>> infos; | ||
|
|
||
| // Maps each input HeapTypes to the index of its partition in `partitions`, |
There was a problem hiding this comment.
| // Maps each input HeapTypes to the index of its partition in `partitions`, | |
| // Maps each input HeapType to the index of its partition in `partitions`, |
|
|
||
| // The Hopcroft's algorithm's list of partitions that may still be | ||
| // distinguishing partitions. Starts out containing all partitions. | ||
| std::unordered_set<size_t> distinguishers; |
There was a problem hiding this comment.
Being unordered makes the following algorithm nondeterministic. That shouldn't matter in the output, but it could be confusing when debugging or when profiling (e.g. a user may see slower behavior on some runs).
| std::unordered_set<size_t> distinguishers; | |
| std::set<size_t> distinguishers; |
| if (difference.empty()) { | ||
| continue; | ||
| } | ||
| // We can split the partition! Replace it with the intersection and add |
There was a problem hiding this comment.
this ! sounds too happy to me... we just found more work we have to reluctantly do! 😉
There was a problem hiding this comment.
Hahaha should I make it // We must bid this partition farewell 😢
| infos.back()->isTemp = true; | ||
| } | ||
| for (auto& info : infos) { | ||
| for (auto* child : getChildren(asHeapType(info))) { |
There was a problem hiding this comment.
I don't understand this loop. Each input HeapType is in one of the partitions, represented by one of the infos (at the same index). Can't we traverse all the HeapTypes by traversing the partitions and then the partition's contents? Traversing the children I suppose will cover the same things in the end, but it seems like more work?
(then getChildren would only be used in the predecessor calculation, which is what I'd expect is the main place that actually looks at the shape details)
There was a problem hiding this comment.
We don't just need the HeapTypes; we need the HeapType locations where they are reachable from the other HeapTypes so we can replace their uses in-place with uses of canonical HeapTypes.
There was a problem hiding this comment.
I see, thanks. Makes sense. I guess that's already explained in the comment above but I missed it.
| // contain references to temporary Types owned by the TypeBuilder, so we must | ||
| // subsequently replace those references with references to canonical Types. | ||
| // Canonicalize non-tuple types before tuple types to avoid canonicalizing a | ||
| // tuple that still contains non-canonical Types. |
There was a problem hiding this comment.
Why are tuples special? Structs also contain a list of children just like a Tuple so the difficulty seems the same?
There was a problem hiding this comment.
Tuples are the only types that directly contain other types. Structs are HeapTypes that contain types, but we've already canonicalized HeapTypes.
There was a problem hiding this comment.
Oh - that is rather surprising that such a difference causes a need for special handling here, but it seems unavoidable.
There was a problem hiding this comment.
This used to be handled by performing a topological sort on all the types and heap types, but now that none of the heap types need to participate in that topological sort, the sort seemed like overkill compared to just deferring the canonicalization of tuples.
kripken
left a comment
There was a problem hiding this comment.
lgtm. I have no confidence in my ability to find all possible bugs in this code 😄 but everything makes sense - nice work! - and more testing over time should iron out issues.
In addition to other review comments I'd suggest checking this doesn't at least regress us on fuzzability before landing.
| // type tree, but rather tests for equality of the finite shape of the graph. If | ||
| // FiniteShapeEquator reports that two type shapes are equal, FiniteShapeHasher | ||
| // should produce the same hash for them. | ||
| struct FiniteShapeEquator { |
There was a problem hiding this comment.
We have Comparer and Comparator in other places, but not sure those are better than Equator. But it would be good to standardize on the best at some point.
There was a problem hiding this comment.
Yeah, makes sense. In this file we also have a TypeComparator, but that implements < rather than ==, so I thought a different name would be good for making that difference clear.
|
Thanks for the review! My fuzzing last night didn't turn up any issues beyond the LUB removal fallout, so I'll land this now. The new LUB implementation should be ready soon. |
Use Hopcroft's DFA minimization algorithm to properly canonicalize and
deduplicate recursive types. Type canonicalization has two stages:
Structural canonicalization
The top-level structure of HeapTypes is used to split the declared HeapTypes
into their initial partitions.
Hopcroft's algorithm refines the partitions such that all pairs of
distinguishable types end up in different partitions.
A fresh HeapTypeInfo is created for each final partition. Each new
HeapTypeInfo is linked to other new HeapTypeInfos to create a minimal type
definition graph that defines the same types as the original graph.
Global canonicalization
Each new minimal HeapTypeInfo that does not have the same finite
structure as an existing globally canonical HeapTypeInfo is moved to the
global heap type store to become the new canonical HeapTypeInfo.
Each temporary Type referenced by the newly canonical HeapTypeInfos is
replaced in-place with the equivalent canonical Type to avoid leaking
temporary Types into the global stores.