Allow using pandas as a table join engine #19860
Conversation
Thank you for your contribution to Astropy! 🌌 This checklist is meant to remind the package maintainers who will review this pull request of some common things to look for.
How about using "pandas" as the default when it's installed and falling back to "astropy" only if pandas is not available? In fact, you might not add a keyword at all and just use whatever is available. |
that's the approach we've been following with |
I'm pretty keen on maintaining explicit control. It might be that astropy is faster for small tables, or the user wants the sort order from astropy to maintain stability in their pipeline products. So I would propose an option According to gemini, Polars join is even faster than pandas by factors of 4-40x for medium to large datasets. So we want both explicit control and some logic to decide on the fastest join engine. |
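To make the two positions above concrete, here is a rough sketch of how explicit control and an automatic fallback could coexist. The function name `resolve_join_engine`, the `"auto"` value, and the `pandas_available` override are hypothetical names chosen for illustration; they are not taken from this PR or from the astropy API.

```python
from importlib.util import find_spec

# Hypothetical option values, used only for this sketch.
KNOWN_ENGINES = ("auto", "astropy", "pandas")


def resolve_join_engine(engine="auto", pandas_available=None):
    """Return the name of the join engine to use.

    engine: "auto" picks pandas when it is importable, else astropy;
        "astropy" or "pandas" forces that engine explicitly.
    pandas_available: optional override of the import check (handy in tests).
    """
    if engine not in KNOWN_ENGINES:
        raise ValueError(f"unknown join engine: {engine!r}")
    if pandas_available is None:
        # find_spec only checks that the package can be located; it does not import it.
        pandas_available = find_spec("pandas") is not None
    if engine == "auto":
        return "pandas" if pandas_available else "astropy"
    if engine == "pandas" and not pandas_available:
        # An explicit request should fail loudly rather than fall back silently.
        raise ImportError("join engine 'pandas' was requested but pandas is not installed")
    return engine
```

With this shape, a user who needs the astropy sort order pins `engine="astropy"`, while everyone else gets the faster path when it is available. A size-based heuristic for small tables could be added to the `"auto"` branch, but any threshold would need benchmarking first.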
Description
Pandas has extremely efficient and optimized support for table joins using a dict-like mapping and C/Cython code. Joining a large table using pandas is up to 20 times faster than astropy, which uses a fairly naive implementation using numpy sorting.
This PR allows using pandas as a join engine, resulting in astropy table join performance that is nearly as fast as pandas (about 10-20% slower).
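For readers unfamiliar with the idea, the "dict-like mapping" approach can be illustrated with a minimal pure-Python hash join. This is only a sketch of the general technique, not the pandas implementation and not the code in this PR; the function name `hash_inner_join` and the row-dict representation are assumptions made for the example.

```python
from collections import defaultdict


def hash_inner_join(left, right, key):
    """Inner-join two lists of row dicts on `key` using a hash table.

    Builds a dict from key value to matching right-hand rows, then probes it
    once per left-hand row, so no sorting of either table is needed.
    """
    index = defaultdict(list)
    for row in right:
        index[row[key]].append(row)

    joined = []
    for lrow in left:
        # .get() avoids creating empty entries for keys that have no match.
        for rrow in index.get(lrow[key], ()):
            # On a column-name clash, the left-hand value wins in this sketch.
            joined.append({**rrow, **lrow})
    return joined


left = [{"id": 1, "a": "x"}, {"id": 2, "a": "y"}, {"id": 3, "a": "z"}]
right = [{"id": 2, "b": 10}, {"id": 3, "b": 20}, {"id": 3, "b": 30}, {"id": 4, "b": 40}]
result = hash_inner_join(left, right, "id")
```

Note that the output follows the row order of the left table rather than being sorted by key, which is one reason the result order of a hash-based engine can differ from a sort-based one.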
Fixes #