[SQL] Propagate join hints through compiler stages#6162
Conversation
| switch (hint.hintName) { | ||
| case "broadcast": | ||
| strategies.add(new JoinStrategy(JoinStrategy.Strategy.Broadcast)); | ||
| break; | ||
| case "balance": | ||
| strategies.add(new JoinStrategy(JoinStrategy.Strategy.Balance)); | ||
| break; | ||
| case "shard": | ||
| strategies.add(new JoinStrategy(JoinStrategy.Strategy.Shard)); | ||
| break; | ||
| default: | ||
| break; |
There was a problem hiding this comment.
The hint names "broadcast", "balance", "shard" are duplicated as raw strings here and again in SqlToRelCompiler#HintStrategyTable. Two sources of truth for the same registry — easy to drift if one name is renamed or a new strategy added. Consider deriving them from JoinStrategy.Strategy (e.g. Strategy.Broadcast.name().toLowerCase(Locale.ROOT)) and registering them in one place so adding a new strategy is a single-file change.
Also: the default: arm silently drops anything that is not one of the three. Calcite's HintStrategyTable filters unknown names upstream, but a typo on a registered name (e.g. "broadcaast") would silently no-op here. An assertion or a switch keyed on Strategy.values() would catch that.
| HepProgram getProgram(RelNode node, int level) { | ||
| // only call this if there are no Correlates | ||
| if (!containsCorrelate(node)) { | ||
| if (!containsCorrelate(node) && !containsHints(node) && (joinCount(node) > 1)) { |
There was a problem hiding this comment.
Two distinct behavioral changes bundled into one condition:
-
!containsHints(node)is wider than it needs to be. The comment two lines down says HYPER_GRAPH_OPTIMIZE "loses hints" — that is specifically about join hints. Any hint anywhere in the tree (a query-level/*+ ... */on the SELECT, a table hint that has nothing to do with joins) now disables hypergraph reordering for the entire query. Tighten this to "contains a hint on aJoin" so unrelated hints do not silently kill an optimization. -
joinCount(node) > 1is unrelated to the hint-propagation goal of this PR. Whether or not it is a no-op for single-join queries in practice, the gate change should land with its own rationale and a test that pins the new behavior. Right now nothing in the test suite would notice if this condition were flipped back.
| static boolean containsHints(RelNode node) { | ||
| if (node instanceof Hintable ha) | ||
| return !ha.getHints().isEmpty(); | ||
| for (RelNode input : node.getInputs()) { | ||
| if (containsHints(input)) | ||
| return true; | ||
| } | ||
| return false; | ||
| } | ||
|
|
||
| static int joinCount(RelNode node) { | ||
| int recurse = 0; | ||
| for (RelNode input : node.getInputs()) { | ||
| recurse += joinCount(input); | ||
| } | ||
| if (node instanceof Join) | ||
| recurse += 1; |
There was a problem hiding this comment.
Each of containsCorrelate, containsHints, and joinCount walks the entire RelNode tree independently. On every Hypergraph step invocation that is three full traversals where a single visitor could compute all three predicates at once. Soft — only worth folding together if profiling shows it on a hot path.
| public enum Strategy { | ||
| Broadcast, | ||
| Shard, | ||
| Balance, |
There was a problem hiding this comment.
Broadcast and Shard are well-known terms in distributed-join literature; Balance is not. A short Javadoc on each variant — what it means operationally, even informally — would save the next reader (or a user reading docs generated from this) a hunt. Especially important since these names are user-facing via /*+ ... */ syntax.
| Assert.assertTrue(this.joinFound); | ||
| } | ||
| }); | ||
| } |
There was a problem hiding this comment.
Only broadcast is covered. shard and balance go through the same switch but a typo in one of those case labels would slip past every test in this PR. Parameterizing across JoinStrategy.Strategy.values() is cheap and keeps coverage automatic when new strategies are added.
Also missing: a regression test for the new hypergraph gate — a 3-way join with a join hint, asserting the hint survives end-to-end (i.e. that the gate added in CalciteOptimizer actually does what it says).
| FROM | ||
| tableName /*+ hint3(5, 'x') */ | ||
| JOIN | ||
| tableName /*+ hint4(c=id), hint5 */ |
There was a problem hiding this comment.
hint4(c=id) uses an identifier (id) as the value, but the grammar a few lines below declares optionVal : stringLiteral. Either the example should be hint4(c='id') or the grammar should list simpleIdentifier as an alternative for optionVal.
| Support for hints is under development. | ||
|
|
||
| ## Creating indexes | ||
|
|
There was a problem hiding this comment.
This section is empty other than "Support for hints is under development." Since the parser now accepts broadcast, shard, and balance without errors (they are registered in HintStrategyTable), users will discover them and try them. A one-line table listing the three names with "recognized but currently no-op" is more honest than the current placeholder and pre-empts the bug report "my hint did nothing."
| public static final String HINT_BROADCAST_LEFT = "broadcast_left"; | ||
| public static final String HINT_BROADCAST_RIGHT = "broadcast_right"; | ||
| public static final String HINT_BALANCE = "balance"; | ||
| public static final String HINT_SHARD = "shart"; |
There was a problem hiding this comment.
Typo: "shart" should be "shard". As written, the Shard strategy is registered under the SQL hint name shart, so users have to write /*+ shart */ to get sharded joins, and /*+ shard */ falls through to the default arm in visitJoin and is silently ignored. No test catches this because only broadcast_left is exercised end-to-end.
| HepProgram getProgram(RelNode node, int level) { | ||
| // only call this if there are no Correlates | ||
| if (!containsCorrelate(node)) { | ||
| if (!containsCorrelate(node) && !containsHints(node) && (joinCount(node) > 1)) { |
There was a problem hiding this comment.
The gate now combines three conditions, but two of them are conceptually unrelated:
!containsHints(node)— disabling the hyper-graph rule when any hint is present anywhere in the subtree is much coarser than necessary. A non-join hint elsewhere in the plan (or any future scan/aggregate hint) will silently disable hyper-graph optimization for the whole query. Skipping should be conditional on a hint the rule would actually drop, e.g. aJoinStrategyannotation on aJoin.joinCount(node) > 1— this is an independent behavioral change with no rationale comment and no test. If it's meant as a perf guard (don't run hyper on trivial single-join plans), please say so in a comment and add coverage; otherwise it shouldn't be folded into this PR.
Also, three independent recursive walks (containsCorrelate + containsHints + joinCount) over the same tree on every optimizer entry — fine for now, but worth a single visitor if this grows.
| public void testHints() { | ||
| var ccs = this.getCCS(""" | ||
| CREATE TABLE T(x INT); | ||
| CREATE VIEW V AS SELECT /*+ broadcast_left */ * FROM T JOIN T AS S USING (x);"""); |
There was a problem hiding this comment.
Only broadcast_left is exercised end-to-end. broadcast_right, balance, and shard go through the same switch in visitJoin, but each maps to a distinct JoinStrategy.Strategy enum value — easy to typo a case arm (cf. "shart" above). Please parameterize this test or add three more variants asserting the correct annotation per hint.
|
|
||
| ### Supported hints and their impact on query implementation | ||
|
|
||
| Support for hints is under development. |
There was a problem hiding this comment.
This PR introduces four user-visible join hints (broadcast_left, broadcast_right, balance, shard) and registers them in HintStrategyTable, but the "Supported hints" section still just says "Support for hints is under development." Either document the four hints here (name, semantics, where they may appear) or add a follow-up issue and link it from this section.
7c8b9e7 to
ce3b5cb
Compare
| a hash-join strategy by sharding the input with alias *table* | ||
| - `balanced(`*table*`)`: Indicates that the following `JOIN` should be implemented using | ||
| a balanced strategy by hashing on all fields the input with alias *table* | ||
|
|
There was a problem hiding this comment.
Doc/code mismatch: this lists the hint as balanced(table), but the registered hint name is balance — HINT_BALANCE = "balance" in SqlToRelCompiler, and the new testThreeWayJoinHints writes /*+ ... balance(U) */. A user copying the docs verbatim with /*+ balanced(U) */ will get an unknown hint that silently no-ops. Either rename the constant to balanced or fix the docs to say balance(table).
de11941 to
53126b5
Compare
ryzhyk
left a comment
There was a problem hiding this comment.
Approving modulo Rust warning @mihaibudiu is fixing.
Compiler warning are currently false negatives. This is ok as a preview, but we really need to fix this and turn them, into errors for people to be able to use hints.
71c7019 to
761fe64
Compare
8738bff to
318cfa0
Compare
mythical-fred
left a comment
There was a problem hiding this comment.
New commits since my last approval are CI/test fixes (Java test verbosity, multi-crate CI fix, Python integration test fix). No substantive design changes. Prior approval stands.
Signed-off-by: Mihai Budiu <mbudiu@feldera.com>
Signed-off-by: Mihai Budiu <mbudiu@feldera.com>
…own because of earthly bugs) Signed-off-by: Mihai Budiu <mbudiu@feldera.com>
Signed-off-by: Mihai Budiu <mbudiu@feldera.com>
Signed-off-by: Mihai Budiu <mbudiu@feldera.com>

Fixes #6160
This includes some test that optimizer hints are pushed (at least for simple programs) through all optimizations.
Currently these are not used in the generated code, since there is no API to pass join hints to the circuit; that needs some DBSP support.
I created 3 hints for joins (broadcast, shard, balance), but I didn't document them yet, since they have no effect.