Skip to content

Investigate EmitTo::First usage in partial hash aggregate output #23249

Description

@hhhizzz

Thanks, the problem exist in partial path now actually, using emit::First is usually a bad idea unless the sorted input case.
I think at least we should rename the emit::First and add comments to tell that.

Originally posted by @Rachelint in #23182 (comment)

#23182 fixed final hash aggregate terminal output by materializing once and slicing.
However, the partial hash aggregate path still uses emit_to_for_batch_size(...)
and therefore EmitTo::First(batch_size) when group_count > batch_size.

EmitTo::First is not just "return first n rows"; it also shifts remaining
group indexes and maintains GroupValues lookup state. For terminal drain of a
hash table, this may repeat work that is no longer needed.

Need to verify:

  • whether high-cardinality partial aggregate output shows similar emitting_time overhead
  • whether materialize-once-then-slice is acceptable for partial output
  • whether EmitTo::First should be renamed/commented to make destructive renumbering explicit

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Fields

    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions