Allow using pandas as a table join engine #19860

Draft
taldcroft wants to merge 1 commit into astropy:main from taldcroft:table-join-engine-pandas

@taldcroft
Member

Description

Pandas has highly optimized support for table joins, using a dict-like (hash) mapping implemented in C/Cython. Joining large tables with pandas is up to 20 times faster than with astropy, which uses a fairly naive implementation based on numpy sorting.
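To illustrate the difference between the two strategies, here is a minimal pure-Python sketch of an inner join done with a dict lookup versus one done by sorting and merging. It is only a toy model of the two ideas: the function names are made up for this example, and neither function reflects the actual pandas or astropy code.

```python
from collections import defaultdict


def hash_join(left, right):
    """Inner join of (key, value) pairs using a dict as a lookup table."""
    index = defaultdict(list)
    for key, value in right:
        index[key].append(value)
    # One dict lookup per left row; no sorting required.
    return [(key, lval, rval) for key, lval in left for rval in index.get(key, [])]


def sort_merge_join(left, right):
    """Inner join of (key, value) pairs by sorting both sides and merging."""
    left, right = sorted(left), sorted(right)
    out, j = [], 0
    for key, lval in left:
        # Skip right rows whose key is too small to match any remaining left row.
        while j < len(right) and right[j][0] < key:
            j += 1
        # Emit every right row that shares this key (j stays put for duplicate left keys).
        k = j
        while k < len(right) and right[k][0] == key:
            out.append((key, lval, right[k][1]))
            k += 1
    return out


left = [(3, "c"), (1, "a"), (2, "b"), (2, "bb")]
right = [(2, "x"), (3, "y"), (2, "z"), (4, "w")]

hashed = hash_join(left, right)
merged = sort_merge_join(left, right)
```

Both functions return the same set of matched rows, but the hash version keeps the left-hand row order while the sort-based version returns rows ordered by key.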

This PR allows using pandas as a join engine, resulting in astropy table join performance that is nearly as fast as pandas (about 10-20% slower).

Fixes #

  • By checking this box, the PR author has requested that maintainers do NOT use the "Squash and Merge" button. Maintainers should respect this when possible; however, the final decision is at the discretion of the maintainer that merges the PR.

@github-actions github-actions Bot added the table label Jun 3, 2026
@github-actions
Contributor

github-actions Bot commented Jun 3, 2026

Thank you for your contribution to Astropy! 🌌 This checklist is meant to remind the package maintainers who will review this pull request of some common things to look for.

  • Do the proposed changes actually accomplish desired goals?
  • Do the proposed changes follow the Astropy coding guidelines?
  • Are tests added/updated as required? If so, do they follow the Astropy testing guidelines?
  • Are docs added/updated as required? If so, do they follow the Astropy documentation guidelines?
  • Is rebase and/or squash necessary? If so, please provide the author with appropriate instructions. Also see instructions for rebase and squash.
  • Did the CI pass? If no, are the failures related? If you need to run daily and weekly cron jobs as part of the PR, please apply the "Extra CI" label. Codestyle issues can be fixed by the bot.
  • Is a change log needed? If yes, did the change log check pass? If no, add the "no-changelog-entry-needed" label. If this is a manual backport, use the "skip-changelog-checks" label unless special changelog handling is necessary.
  • Is this a big PR that makes a "What's new?" entry worthwhile and if so, is (1) a "what's new" entry included in this PR and (2) the "whatsnew-needed" label applied?
  • At the time of adding the milestone, if the milestone set requires a backport to release branch(es), apply the appropriate "backport-X.Y.x" label(s) before merge.

@hamogu
Member

hamogu commented Jun 3, 2026

How about using "pandas" as the default when it's installed and only falling back to "astropy" if pandas is not available? In fact, you might not add a keyword at all and just take what's available.

@neutrinoceros
Contributor

you might not add a keyword at all and just take what's available.

that's the approach we've been following with bottleneck-powered accelerations but I would recommend we avoid it in the future, as it also creates confusion when the "faster" implementation is silently selected but returns incorrect results (or causes a crash).

@taldcroft
Member Author

taldcroft commented Jun 3, 2026

How about using "pandas" as the default when it's installed and only falling back to "astropy" if pandas is not available? In fact, you might not add a keyword at all and just take what's available.

I'm pretty keen on maintaining explicit control. It might be that astropy is faster for small tables, or the user wants the sort order from astropy to maintain stability in their pipeline products. So I would propose an option engine="auto" to automatically select the engine based on some heuristics.

According to Gemini, Polars joins are even faster than pandas, by factors of 4-40 for medium to large datasets. So we want both explicit control and some logic to decide on the fastest join engine.
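As a strawman for discussion, engine resolution with explicit control plus an "auto" option could look roughly like the sketch below. The function name `resolve_join_engine`, the `AUTO_MIN_ROWS` cutoff, and the row-count heuristic are all assumptions made up for illustration; the real crossover point would need benchmarking. An explicit request for an unavailable engine raises instead of silently falling back.

```python
import importlib.util

# Hypothetical cutoff: below this many rows, assume the conversion overhead
# outweighs any speedup. The real value would have to come from benchmarks.
AUTO_MIN_ROWS = 10_000


def resolve_join_engine(engine="astropy", n_rows=0):
    """Return the name of the join engine to use (illustrative sketch only)."""
    have_pandas = importlib.util.find_spec("pandas") is not None
    if engine == "astropy":
        return "astropy"
    if engine == "pandas":
        # An explicit request never falls back silently.
        if not have_pandas:
            raise ImportError("engine='pandas' was requested but pandas is not installed")
        return "pandas"
    if engine == "auto":
        # Only pick pandas when it is installed and the table is large enough.
        return "pandas" if have_pandas and n_rows >= AUTO_MIN_ROWS else "astropy"
    raise ValueError(f"unknown join engine: {engine!r}")
```

With this shape, `engine="astropy"` keeps the current sort order for pipelines that depend on it, and further engines could be added as extra branches without changing the default.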

3 participants