Equirecursive type canonicalization by tlively · Pull Request #3717 · WebAssembly/binaryen

tlively · 2021-03-23T03:01:08Z

Use Hopcroft's DFA minimization algorithm to properly canonicalize and
deduplicate recursive types. Type canonicalization has two stages:

Structural canonicalization
- The top-level structure of HeapTypes is used to split the declared HeapTypes
  into their initial partitions.
- Hopcroft's algorithm refines the partitions such that all pairs of
  distinguishable types end up in different partitions.
- A fresh HeapTypeInfo is created for each final partition. Each new
  HeapTypeInfo is linked to other new HeapTypeInfos to create a minimal type
  definition graph that defines the same types as the original graph.
Global canonicalization
- Each new minimal HeapTypeInfo that does not have the same finite
  structure as an existing globally canonical HeapTypeInfo is moved to the
  global heap type store to become the new canonical HeapTypeInfo.
- Each temporary Type referenced by the newly canonical HeapTypeInfos is
  replaced in-place with the equivalent canonical Type to avoid leaking
  temporary Types into the global stores.

Use Hopcroft's DFA minimization algorithm to properly canonicalize and deduplicate recursive types. Type canonicalization has two stages: 1. Structural canonicalization - The top-level structure of HeapTypes is used to split the declared HeapTypes into their initial partitions. - Hopcroft's algorithm refines the partitions such that all pairs of distinguishable types end up in different partitions. - A fresh HeapTypeInfo is created for each final partition. Each new HeapTypeInfo is linked to other new HeapTypeInfos to create a minimal type definition graph that defines the same types as the original graph. 2. Global canonicalization - Each new minimal HeapTypeInfo that does not have the same finite structure as an existing globally canonical HeapTypeInfo is moved to the global heap type store to become the new canonical HeapTypeInfo. - Each temporary Type referenced by the newly canonical HeapTypeInfos is replaced in-place with the equivalent canonical Type to avoid leaking temporary Types into the global stores.

kripken · 2021-03-23T03:13:05Z

Great to see this!

Before getting to the code here: Does this pass the fuzzer? If so then I think it may be worth landing first and reviewing later. I've investigated the fuzz bugs fallout from the LUB removal and not all of them are about subtyping (e.g. #3715 - it's hard to find how many more remain). That blocks an emscripten release and might affect tot users. So the current state of main is a little uncertain, and if landing this quickly helps there I think we should do that.

tlively · 2021-03-23T05:31:52Z

Hmm, it's a little hard to tell about the fuzzer. When I run it, I get a bunch of errors about "Could not ensure supertype," which is fallout from LUB removal that is probably not related to this PR. I haven't seen any other errors, though, so at least this is more stable than the LUB removal. For example, my last run lasted 240 iterations before hitting the no-LUB bug. Does the fuzzer actually generate interesting (i.e. duplicate, recursive) type definitions, though?

How about we proceed as normal with review while I re-implement LUBbing on top of this, and when I finish that we can decide whether it makes sense to accelerate the landing of this PR and the LUB PR based on what the fuzzer thinks of the new LUBbing?

kripken · 2021-03-23T14:17:44Z

Sounds good @tlively . I incorrectly thought this PR included LUB calculations.

The fuzzer won't generate complex types, correct.

kripken

(comments up to line 1644)

kripken · 2021-03-23T15:53:34Z

  bool isTemp = false;
-  bool isSelfReferential = false;
+  // If `isFinalized`, then hashing and equality are performed on the finite
+  // structure of the type definition tree rooted at the HeapTypeInfo.


is "finalized" after global canonicalization basically? (from the comment i'm not sure if it's another stage)

No, it's after the user has stopped adding new types to the TypeBuilder.

I see - that wasn't obvious to me then. Perhaps stillBuilding or stillAdding would be clearer?

kripken · 2021-03-23T15:58:24Z

+// Helper for hashing the structures of TypeInfos and HeapTypeInfos. Uses de
+// Bruijn indices to represent back edges so that infos referring to different
+// type IDs but sharing a finite structure will compare and hash the same.
+struct TypeStructureHasher {


how about calling these finite hasher/comparators FiniteStructureHasher etc. so the finiteness is more explicit in the name?

Also how about Structure => Shape as we have GC structs. So FiniteShapeHasher? (I think that's clear enough without a Type in the name, as it's the shape of a Type, and it's in wasm-type.cpp.)

kripken · 2021-03-23T16:02:56Z

-      return rtt == other.rtt;
-  }
-  WASM_UNREACHABLE("unexpected kind");
+  return TypeStructureEquator().eq(*this, other);


Is the idea that it is ok to only compare the finite shape because this is after global canonicalization? (if so then a comment maybe could help)

Not after global canonicalization, but after shape canonicalization. Will add a comment.

kripken · 2021-03-23T16:07:14Z

  }

  return printChild(type, [&]() {
+    if (isTemp(type)) {


with these uses of isTemp I think we can remove the hacks on lines 278-281 and 287-290.

Oh hm, I thought I had already done that. Thanks for the catch!

kripken · 2021-03-23T16:12:36Z

+// type IDs but sharing a finite structure will compare and hash the same.
+struct TypeStructureHasher {
+  bool topLevelOnly;
+  size_t currDepth = 0;


It looks like currDepth is in the de Bruijn sense, and not in the tree sense, is that right? As there is no decrement for it. If so a comment could help as the same name is used in the printer where it is in the tree sense. Or maybe a different name? https://en.wikipedia.org/wiki/De_Bruijn_index doesn't use "depth", so perhaps currIndex?

(The erase where a decrement would appear (line 1378) also implies this index in a particular subtree, is that right? edit: oh, it has to be, nevermind this last part)

Yeah, that was a bug I think. But this should be fixed by the latest change.

kripken · 2021-03-23T16:20:14Z

+}
+
+size_t TypeStructureHasher::hash(const Rtt& rtt) {
+  size_t digest = rtt.depth;


Suggested change

size_t digest = rtt.depth;

size_t digest = wasm::hash(rtt.depth);

kripken · 2021-03-23T19:47:35Z

+  // The new, minimal type definition graph.
+  std::vector<std::unique_ptr<HeapTypeInfo>> infos;
+
+  // Maps each input HeapTypes to the index of its partition in `partitions`,


Suggested change

// Maps each input HeapTypes to the index of its partition in `partitions`,

// Maps each input HeapType to the index of its partition in `partitions`,

kripken · 2021-03-23T19:54:52Z

+
+  // The Hopcroft's algorithm's list of partitions that may still be
+  // distinguishing partitions. Starts out containing all partitions.
+  std::unordered_set<size_t> distinguishers;


Being unordered makes the following algorithm nondeterministic. That shouldn't matter in the output, but it could be confusing when debugging or when profiling (e.g. a user may see slower behavior on some runs).

Suggested change

std::unordered_set<size_t> distinguishers;

std::set<size_t> distinguishers;

kripken · 2021-03-23T20:01:15Z

+        if (difference.empty()) {
+          continue;
+        }
+        // We can split the partition! Replace it with the intersection and add


this ! sounds too happy to me... we just found more work we have to reluctantly do! 😉

Hahaha should I make it // We must bid this partition farewell 😢

kripken · 2021-03-23T20:15:11Z

+    infos.back()->isTemp = true;
+  }
+  for (auto& info : infos) {
+    for (auto* child : getChildren(asHeapType(info))) {


I don't understand this loop. Each input HeapType is in one of the partitions, represented by one of the infos (at the same index). Can't we traverse all the HeapTypes by traversing the partitions and then the partition's contents? Traversing the children I suppose will cover the same things in the end, but it seems like more work?

(then getChildren would only be used in the predecessor calculation, which is what I'd expect is the main place that actually looks at the shape details)

We don't just need the HeapTypes; we need the HeapType locations where they are reachable from the other HeapTypes so we can replace their uses in-place with uses of canonical HeapTypes.

I see, thanks. Makes sense. I guess that's already explained in the comment above but I missed it.

kripken · 2021-03-23T20:21:11Z

+  // contain references to temporary Types owned by the TypeBuilder, so we must
+  // subsequently replace those references with references to canonical Types.
+  // Canonicalize non-tuple types before tuple types to avoid canonicalizing a
+  // tuple that still contains non-canonical Types.


Why are tuples special? Structs also contain a list of children just like a Tuple so the difficulty seems the same?

Tuples are the only types that directly contain other types. Structs are HeapTypes that contain types, but we've already canonicalized HeapTypes.

Oh - that is rather surprising that such a difference causes a need for special handling here, but it seems unavoidable.

This used to be handled by performing a topological sort on all the types and heap types, but now that none of the heap types need to participate in that topological sort, the sort seemed like overkill compared to just deferring the canonicalization of tuples.

kripken

lgtm. I have no confidence in my ability to find all possible bugs in this code 😄 but everything makes sense - nice work! - and more testing over time should iron out issues.

In addition to other review comments I'd suggest checking this doesn't at least regress us on fuzzability before landing.

kripken · 2021-03-23T21:44:53Z

+// type tree, but rather tests for equality of the finite shape of the graph. If
+// FiniteShapeEquator reports that two type shapes are equal, FiniteShapeHasher
+// should produce the same hash for them.
+struct FiniteShapeEquator {


We have Comparer and Comparator in other places, but not sure those are better than Equator. But it would be good to standardize on the best at some point.

Yeah, makes sense. In this file we also have a TypeComparator, but that implements < rather than ==, so I thought a different name would be good for making that difference clear.

tlively · 2021-03-24T03:43:07Z

Thanks for the review! My fuzzing last night didn't turn up any issues beyond the LUB removal fallout, so I'll land this now. The new LUB implementation should be ready soon.

tlively requested review from aheejin and kripken March 23, 2021 03:01

Format

76edf89

kripken reviewed Mar 23, 2021

View reviewed changes

tlively added 2 commits March 23, 2021 09:38

Linear instead of exponential hashing and equating

13ff034

Address initial comments

7c4cd26

kripken reviewed Mar 23, 2021

View reviewed changes

Address further comments

3ca8342

kripken approved these changes Mar 23, 2021

View reviewed changes

tlively merged commit 6aa04fa into WebAssembly:main Mar 24, 2021

tlively deleted the hopcrofts branch March 24, 2021 03:44

kripken mentioned this pull request Mar 25, 2021

Regression from equirecursive type canonicalization #3736

Closed

	size_t digest = rtt.depth;
	size_t digest = wasm::hash(rtt.depth);

	// Maps each input HeapTypes to the index of its partition in `partitions`,
	// Maps each input HeapType to the index of its partition in `partitions`,

	std::unordered_set<size_t> distinguishers;
	std::set<size_t> distinguishers;

Conversation

tlively commented Mar 23, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

kripken commented Mar 23, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

tlively commented Mar 23, 2021

Uh oh!

kripken commented Mar 23, 2021

Uh oh!

kripken left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

kripken Mar 23, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

kripken left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

tlively commented Mar 24, 2021

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

tlively commented Mar 23, 2021 •

edited

Loading

kripken commented Mar 23, 2021 •

edited

Loading

kripken Mar 23, 2021 •

edited

Loading