Skip to content

Equirecursive type canonicalization#3717

Merged
tlively merged 5 commits into
WebAssembly:mainfrom
tlively:hopcrofts
Mar 24, 2021
Merged

Equirecursive type canonicalization#3717
tlively merged 5 commits into
WebAssembly:mainfrom
tlively:hopcrofts

Conversation

@tlively

@tlively tlively commented Mar 23, 2021

Copy link
Copy Markdown
Member

Use Hopcroft's DFA minimization algorithm to properly canonicalize and
deduplicate recursive types. Type canonicalization has two stages:

  1. Structural canonicalization

    • The top-level structure of HeapTypes is used to split the declared HeapTypes
      into their initial partitions.

    • Hopcroft's algorithm refines the partitions such that all pairs of
      distinguishable types end up in different partitions.

    • A fresh HeapTypeInfo is created for each final partition. Each new
      HeapTypeInfo is linked to other new HeapTypeInfos to create a minimal type
      definition graph that defines the same types as the original graph.

  2. Global canonicalization

    • Each new minimal HeapTypeInfo that does not have the same finite
      structure as an existing globally canonical HeapTypeInfo is moved to the
      global heap type store to become the new canonical HeapTypeInfo.

    • Each temporary Type referenced by the newly canonical HeapTypeInfos is
      replaced in-place with the equivalent canonical Type to avoid leaking
      temporary Types into the global stores.

Use Hopcroft's DFA minimization algorithm to properly canonicalize and
deduplicate recursive types. Type canonicalization has two stages:

1. Structural canonicalization

 - The top-level structure of HeapTypes is used to split the declared HeapTypes
   into their initial partitions.

 - Hopcroft's algorithm refines the partitions such that all pairs of
   distinguishable types end up in different partitions.

 - A fresh HeapTypeInfo is created for each final partition. Each new
   HeapTypeInfo is linked to other new HeapTypeInfos to create a minimal type
   definition graph that defines the same types as the original graph.

2. Global canonicalization

 - Each new minimal HeapTypeInfo that does not have the same finite
   structure as an existing globally canonical HeapTypeInfo is moved to the
   global heap type store to become the new canonical HeapTypeInfo.

 - Each temporary Type referenced by the newly canonical HeapTypeInfos is
   replaced in-place with the equivalent canonical Type to avoid leaking
   temporary Types into the global stores.
@tlively tlively requested review from aheejin and kripken March 23, 2021 03:01
@kripken

kripken commented Mar 23, 2021

Copy link
Copy Markdown
Member

Great to see this!

Before getting to the code here: Does this pass the fuzzer? If so then I think it may be worth landing first and reviewing later. I've investigated the fuzz bugs fallout from the LUB removal and not all of them are about subtyping (e.g. #3715 - it's hard to find how many more remain). That blocks an emscripten release and might affect tot users. So the current state of main is a little uncertain, and if landing this quickly helps there I think we should do that.

@tlively

tlively commented Mar 23, 2021

Copy link
Copy Markdown
Member Author

Hmm, it's a little hard to tell about the fuzzer. When I run it, I get a bunch of errors about "Could not ensure supertype," which is fallout from LUB removal that is probably not related to this PR. I haven't seen any other errors, though, so at least this is more stable than the LUB removal. For example, my last run lasted 240 iterations before hitting the no-LUB bug. Does the fuzzer actually generate interesting (i.e. duplicate, recursive) type definitions, though?

How about we proceed as normal with review while I re-implement LUBbing on top of this, and when I finish that we can decide whether it makes sense to accelerate the landing of this PR and the LUB PR based on what the fuzzer thinks of the new LUBbing?

@kripken

kripken commented Mar 23, 2021

Copy link
Copy Markdown
Member

Sounds good @tlively . I incorrectly thought this PR included LUB calculations.

The fuzzer won't generate complex types, correct.

@kripken kripken left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(comments up to line 1644)

Comment thread src/wasm/wasm-type.cpp Outdated
bool isTemp = false;
bool isSelfReferential = false;
// If `isFinalized`, then hashing and equality are performed on the finite
// structure of the type definition tree rooted at the HeapTypeInfo.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is "finalized" after global canonicalization basically? (from the comment i'm not sure if it's another stage)

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, it's after the user has stopped adding new types to the TypeBuilder.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see - that wasn't obvious to me then. Perhaps stillBuilding or stillAdding would be clearer?

Comment thread src/wasm/wasm-type.cpp Outdated
// Helper for hashing the structures of TypeInfos and HeapTypeInfos. Uses de
// Bruijn indices to represent back edges so that infos referring to different
// type IDs but sharing a finite structure will compare and hash the same.
struct TypeStructureHasher {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how about calling these finite hasher/comparators FiniteStructureHasher etc. so the finiteness is more explicit in the name?

Also how about Structure => Shape as we have GC structs. So FiniteShapeHasher? (I think that's clear enough without a Type in the name, as it's the shape of a Type, and it's in wasm-type.cpp.)

Comment thread src/wasm/wasm-type.cpp Outdated
return rtt == other.rtt;
}
WASM_UNREACHABLE("unexpected kind");
return TypeStructureEquator().eq(*this, other);

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the idea that it is ok to only compare the finite shape because this is after global canonicalization? (if so then a comment maybe could help)

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not after global canonicalization, but after shape canonicalization. Will add a comment.

Comment thread src/wasm/wasm-type.cpp
}

return printChild(type, [&]() {
if (isTemp(type)) {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

with these uses of isTemp I think we can remove the hacks on lines 278-281 and 287-290.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh hm, I thought I had already done that. Thanks for the catch!

Comment thread src/wasm/wasm-type.cpp
// type IDs but sharing a finite structure will compare and hash the same.
struct TypeStructureHasher {
bool topLevelOnly;
size_t currDepth = 0;

@kripken kripken Mar 23, 2021

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like currDepth is in the de Bruijn sense, and not in the tree sense, is that right? As there is no decrement for it. If so a comment could help as the same name is used in the printer where it is in the tree sense. Or maybe a different name? https://en.wikipedia.org/wiki/De_Bruijn_index doesn't use "depth", so perhaps currIndex?

(The erase where a decrement would appear (line 1378) also implies this index in a particular subtree, is that right? edit: oh, it has to be, nevermind this last part)

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that was a bug I think. But this should be fixed by the latest change.

Comment thread src/wasm/wasm-type.cpp Outdated
}

size_t TypeStructureHasher::hash(const Rtt& rtt) {
size_t digest = rtt.depth;

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
size_t digest = rtt.depth;
size_t digest = wasm::hash(rtt.depth);

Comment thread src/wasm/wasm-type.cpp Outdated
// The new, minimal type definition graph.
std::vector<std::unique_ptr<HeapTypeInfo>> infos;

// Maps each input HeapTypes to the index of its partition in `partitions`,

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// Maps each input HeapTypes to the index of its partition in `partitions`,
// Maps each input HeapType to the index of its partition in `partitions`,

Comment thread src/wasm/wasm-type.cpp Outdated

// The Hopcroft's algorithm's list of partitions that may still be
// distinguishing partitions. Starts out containing all partitions.
std::unordered_set<size_t> distinguishers;

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Being unordered makes the following algorithm nondeterministic. That shouldn't matter in the output, but it could be confusing when debugging or when profiling (e.g. a user may see slower behavior on some runs).

Suggested change
std::unordered_set<size_t> distinguishers;
std::set<size_t> distinguishers;

Comment thread src/wasm/wasm-type.cpp
if (difference.empty()) {
continue;
}
// We can split the partition! Replace it with the intersection and add

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this ! sounds too happy to me... we just found more work we have to reluctantly do! 😉

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hahaha should I make it // We must bid this partition farewell 😢

Comment thread src/wasm/wasm-type.cpp
infos.back()->isTemp = true;
}
for (auto& info : infos) {
for (auto* child : getChildren(asHeapType(info))) {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand this loop. Each input HeapType is in one of the partitions, represented by one of the infos (at the same index). Can't we traverse all the HeapTypes by traversing the partitions and then the partition's contents? Traversing the children I suppose will cover the same things in the end, but it seems like more work?

(then getChildren would only be used in the predecessor calculation, which is what I'd expect is the main place that actually looks at the shape details)

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't just need the HeapTypes; we need the HeapType locations where they are reachable from the other HeapTypes so we can replace their uses in-place with uses of canonical HeapTypes.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, thanks. Makes sense. I guess that's already explained in the comment above but I missed it.

Comment thread src/wasm/wasm-type.cpp Outdated
// contain references to temporary Types owned by the TypeBuilder, so we must
// subsequently replace those references with references to canonical Types.
// Canonicalize non-tuple types before tuple types to avoid canonicalizing a
// tuple that still contains non-canonical Types.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are tuples special? Structs also contain a list of children just like a Tuple so the difficulty seems the same?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tuples are the only types that directly contain other types. Structs are HeapTypes that contain types, but we've already canonicalized HeapTypes.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh - that is rather surprising that such a difference causes a need for special handling here, but it seems unavoidable.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This used to be handled by performing a topological sort on all the types and heap types, but now that none of the heap types need to participate in that topological sort, the sort seemed like overkill compared to just deferring the canonicalization of tuples.

@kripken kripken left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm. I have no confidence in my ability to find all possible bugs in this code 😄 but everything makes sense - nice work! - and more testing over time should iron out issues.

In addition to other review comments I'd suggest checking this doesn't at least regress us on fuzzability before landing.

Comment thread src/wasm/wasm-type.cpp
// type tree, but rather tests for equality of the finite shape of the graph. If
// FiniteShapeEquator reports that two type shapes are equal, FiniteShapeHasher
// should produce the same hash for them.
struct FiniteShapeEquator {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have Comparer and Comparator in other places, but not sure those are better than Equator. But it would be good to standardize on the best at some point.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, makes sense. In this file we also have a TypeComparator, but that implements < rather than ==, so I thought a different name would be good for making that difference clear.

@tlively

tlively commented Mar 24, 2021

Copy link
Copy Markdown
Member Author

Thanks for the review! My fuzzing last night didn't turn up any issues beyond the LUB removal fallout, so I'll land this now. The new LUB implementation should be ready soon.

@tlively tlively merged commit 6aa04fa into WebAssembly:main Mar 24, 2021
@tlively tlively deleted the hopcrofts branch March 24, 2021 03:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants