Support writing and reading table indices for formats supporting column metadata#18703
Conversation
|
Thank you for your contribution to Astropy! 🌌 This checklist is meant to remind the package maintainers who will review this pull request of some common things to look for.
|
|
👋 Thank you for your draft pull request! Do you know that you can use |
|
I only had time for a brief look, and think the overall idea is good. However, I wonder if we cannot stick closer to the existing idea of serializing columns, i.e., treat this as a representation of a specific class. I feel that might avoid introducing specific new arguments like For reference, here's a simple table with a SkyCoord and a masked column (note that the mask in the sky column ignores I guess an overall difficulty, as always, is naming... This is partially related to whether one should think of the indices as belonging to the column or to the table. In the former case, maybe it should be something like |
@mhvk - I did think about trying to fit this into the serialized columns infrastructure and I agree that re-using this would be preferable if possible. There were a couple of issues that I could not resolve: I don't think it will be possible to entirely eliminate all the table-level sandwich construct/represent code. To create an index for a multi-column index on Similarly, the machinery is expecting that the attribute data for a column is unique to that column. Consider the multi-column index above. Then column So overall I am unable to see a clear path to actually using the serialized columns machinery here. That doesn't mean it isn't possible, just that I probably can't do it without help.
One lingering doubt in all of this is that I do not like the current indexing implementation of the indices living on the columns instead of the table. I was supervising the GSOC student that did this but I honestly don't remember why we went that direction, but I have in my head a project to refactor this. If you look at the earlier commits in this PR, the indices were being serialized at the table level which actually feels somewhat more natural and you avoid duplication for multi-column indices. |
|
About putting the indices into Everything is possible but I fear this would be messy. |
|
EDIT: that is conceptually nicer but has an implementation problem. |
|
@taldcroft - I agree, let's think of those indices not as part of any column! Given that, I agree with the basic idea of your notation -- though perhaps it could still be something like This, however, arguably makes those indices more like a column, not less so! So, I continue to wonder if it would not be possible to use the Anyway, I'll try to think a bit more too, with the idea that they live on the table, not on columns, in mind. |
|
@mhvk - for some reason when writing about using a table-level This highlights that the current solution (meta on columns) allows keeping the code for this feature confined to |
As described more fully previously, I do not see how this is possible. The serialization mechanism with |
If, in the end, we are putting the serialized index information somewhere other than |
|
@taldcroft - I can see it making some sense, in that, effectively, these are computed columns, so different, and it would seem correct to not have them inside So, agreed to think of it as part of the table meta data and call it Another question would be why one would store the whole index rather than just have it be recomputed? (Or that at least be the default option?) But to answer that at least in part myself: I think it is great to have the possibility to write out the index column since it provides information about the table that now is explicitly exposed, and can thus be read by humans or regular .csv readers, etc. And whatever is in the header needs of course to include what is needed to reconstruct the index. |
|
More to the nitty gritty, looking at the output as you have it, First, smaller details, easy to fix:
A slightly larger suggestion would still be to go with a format more like So, not very different from what you have, except that now |
Hmm, only now saw your counterpoint to what we seemed to be heading towards... I do think our conclusion was the right one, though, so perhaps we should bite the bullet and start treating tables more like columns everywhere. Maybe this is not so difficult if we start making |
For me at least, the promise of serializing the indices comes with actually storing the index like a real database. I.e. prefering to use disk space instead of CPU and processing time. Just having a light shim to store the keys and recompute the indices seems like a bit of a sham. |
I'm not sure if you are being loose with words or I am not getting the message across. The index information needs to be stored on the COLUMN meta, not on the table meta. That is, unless we are willing to add significant new infrastructure to serializing meta for FITS. Currently the FITS writer assumes that all keywords in the table meta can be expressed directly in FITS, i.e. the values should be simple types like int, float, string. In tl;dr: column meta gets serialized for free in FITS (and ECSV, HDF5). Serializing table meta in FITS would require special-casing some keyword like |
|
About the suggestion of formatting like below: I also don't like the repetition of tl;dr: the indices must be represented as a list, not a mapping.
Sadly the indexing code is not so simple. The |
|
@mhvk - I've pushed a few more commits to start making this real, with the understanding that the output format may still be open for update. Right now, here is what it looks like, where I implemented your suggestions about |
ca99162 to
991b078
Compare
|
Ping @mhvk. The feature freeze is coming very soon and this would be good to get in if we can agree on the format and location for the new metadata. Thinking about how the ECSV looks is maybe the easiest way to visualize. The internal implementation does not matter too much since we can change that later. The failing tests are all from a trivial doctest issue that I will fix. |
OMG is it this week? 😱 |
|
The release calendar says Friday Oct 24, but I have not seen any notice to astropy-dev. Usually there should be a week since the final warning and the feature freeze. |
e8a4b10 to
4d97ef3
Compare
|
Passing tests! |
|
Quick note that I couldn't actually get to review the tests this morning, because the task I tackled first (looking for the origin of a segfault in devdeps) took me much longer than I anticipated. I'll get around to this next time I'm at a keyboard, which will be about 2 hours from now. Sorry about this delay. |
|
@mhvk @neutrinoceros - I looked again at the FITS code and I think I see a way to use a table-level meta. It requires some special-casing and also smushing a new key into the |
mhvk
left a comment
There was a problem hiding this comment.
@taldcroft - I think that FITS comment section would be very self-explanatory, so definitely good as far as I'm concerned. And happy to take (correctly) any blame if we have to convince the maintainers that this can go in a little late...
But let me also approve this now -- it looks all good! I won't merge, though, since @neutrinoceros seemed to intend to come back to this soon (aside, thanks very much @neutrinoceros for taking over the more careful review!).
|
Thanks @mhvk. I only regret now calling that section About the follow-up PR, it might not be a stretch to call it a bug fix that could be applied to the RC. 😄 |
neutrinoceros
left a comment
There was a problem hiding this comment.
got a couple simple questions about these tests
| lines = [ | ||
| "# %ECSV 1.0", | ||
| "# ---", | ||
| "# datatype:", | ||
| "# - name: a", | ||
| "# datatype: int64", | ||
| "# meta: !!omap", | ||
| "# - __indices__:", | ||
| "# - colnames: [a]", | ||
| "# engine: Foo", | ||
| "# index_colname: __index__", | ||
| "# unique: false", | ||
| "# - {name: __index__, datatype: int64}", | ||
| "# schema: astropy-2.0", | ||
| "a __index__", | ||
| "1 0", | ||
| "3 2", | ||
| "2 1", | ||
| ] | ||
| text = "\n".join(lines) |
There was a problem hiding this comment.
I think this is stable under ruff format
| lines = [ | |
| "# %ECSV 1.0", | |
| "# ---", | |
| "# datatype:", | |
| "# - name: a", | |
| "# datatype: int64", | |
| "# meta: !!omap", | |
| "# - __indices__:", | |
| "# - colnames: [a]", | |
| "# engine: Foo", | |
| "# index_colname: __index__", | |
| "# unique: false", | |
| "# - {name: __index__, datatype: int64}", | |
| "# schema: astropy-2.0", | |
| "a __index__", | |
| "1 0", | |
| "3 2", | |
| "2 1", | |
| ] | |
| text = "\n".join(lines) | |
| text = "\n".join([ | |
| "# %ECSV 1.0", | |
| "# ---", | |
| "# datatype:", | |
| "# - name: a", | |
| "# datatype: int64", | |
| "# meta: !!omap", | |
| "# - __indices__:", | |
| "# - colnames: [a]", | |
| "# engine: Foo", | |
| "# index_colname: __index__", | |
| "# unique: false", | |
| "# - {name: __index__, datatype: int64}", | |
| "# schema: astropy-2.0", | |
| "a __index__", | |
| "1 0", | |
| "3 2", | |
| "2 1", | |
| ]) |
There was a problem hiding this comment.
I don't like stacking things like this, won't fix.
There was a problem hiding this comment.
What about using implicit str concatenation then ?
There was a problem hiding this comment.
(Basically, replace "\n".join([ with (, drop all trailing commas, and change ]) for ) at the end)
There was a problem hiding this comment.
I find the easiest to read for these kinds of situations is just a triple quote string with a dedent at the end!
| t.add_index("a") | ||
| out = io.StringIO() | ||
| t.write(out, format="ecsv", write_indices=True) | ||
| assert out.getvalue().splitlines() == [ |
There was a problem hiding this comment.
Is there a particular benefit to comparing lists of lines instead of a single string of text that I'm not seeing ?
There was a problem hiding this comment.
Because it formats more nicely. I'm always bothered by having fully left-justified lines in the code, it just looks ugly. I've settled on this idiom and it works well and is convenient for me.
There was a problem hiding this comment.
Implicit str concatenation looks very similar but even simpler.
There was a problem hiding this comment.
But the output using implicit str concatenation is not easy to generate, in contrast to the way I am doing it.
There was a problem hiding this comment.
What do mean "generate" ?
There was a problem hiding this comment.
For "regression test" type outputs like this, I normally run the test script with --pdb and if it fails, then I print out.getvalue().splitlines(), paste that into the file and reformat with ruff. Then I examine the diffs carefully to ensure that they are what I expect.
That process is reliable and efficient.
| t = QTable() | ||
| t["a"] = Time([1, 3, 2, 2], format="cxcsec") | ||
| t["b"] = [3, 2, 2, 1] | ||
| t["__index__"] = [3, 1, 4, 2] # Force a collision | ||
| indices_colnames = [ | ||
| ["a"], | ||
| ["b", "a"], | ||
| ["a", "b", "__index__"], | ||
| ["__index__"], | ||
| ] |
There was a problem hiding this comment.
since the setup never changes, I'd prefer storing this input in a module constant and reuse-it (without mutations) than skip test case that we currently think do not matter (but for how long ?).
There was a problem hiding this comment.
Won't fix, out of time.
| t.add_index(colnames, engine=engine) | ||
|
|
||
| path = tmp_path / f"out.{fmt}" | ||
| kwargs = {"serialize_meta": True, "path": "root"} if fmt == "hdf5" else {} |
There was a problem hiding this comment.
this would be clearer if expressed as part of the test's parametrization instead of here
There was a problem hiding this comment.
Is it okay if I do it ? Maybe as a follow up PR ?
There was a problem hiding this comment.
You can do a followup PR, but not before the 8.0 feature freeze. I'm currently working on a PR (intended for 8.0) to move the serialization info from columns into table meta. This will touch tests and I don't want to get into a merge conflict.
| kwargs = {"serialize_meta": True, "path": "root"} if fmt == "hdf5" else {} | ||
| t.write(path, format=fmt, write_indices=True, **kwargs) | ||
|
|
||
| kwargs = {"astropy_native": True} if fmt == "fits" else {} |
| for colnames in indices_colnames: | ||
| index = t.indices[colnames] | ||
| index2 = t2.indices[colnames] | ||
| assert index.id == index2.id |
There was a problem hiding this comment.
just a note in passing, I really wish we could use pytest 9's subtest fixture here. We can't, it's too new, but I'll still take the opportunity to raise awareness about this new and extremely convenient way to write large tests with many asserts
| assert np.all(index.data.sorted_data() == index2.data.sorted_data()) | ||
| # Check engine items as a list of pairs of the form | ||
| # [(key, [row 1, row 2, ...]), ...]. | ||
| assert index.data.items() == index2.data.items() |
There was a problem hiding this comment.
does this actually compare lists ? Deliberately not refreshing my memory here, but when I see an items() method I expect it to work as dict's, which produces a generator, which in turn doesn't really support comparisons other than identity.
There was a problem hiding this comment.
For these engines the protocol dictates that .items() returns list[tuple[Hashable, list[Integral]]]. So this is OK.
taldcroft
left a comment
There was a problem hiding this comment.
@neutrinoceros - thanks for the detailed review. I'm not going to address some of your change requests that don't directly improve test correctness or completeness but are intended to improve the code style. I am out of time. I'd like to get this PR merged and if you want to improve the test code style later that's fine.
| t = QTable() | ||
| t["a"] = Time([1, 3, 2, 2], format="cxcsec") | ||
| t["b"] = [3, 2, 2, 1] | ||
| t["__index__"] = [3, 1, 4, 2] # Force a collision | ||
| indices_colnames = [ | ||
| ["a"], | ||
| ["b", "a"], | ||
| ["a", "b", "__index__"], | ||
| ["__index__"], | ||
| ] |
There was a problem hiding this comment.
Won't fix, out of time.
| t.add_index(colnames, engine=engine) | ||
|
|
||
| path = tmp_path / f"out.{fmt}" | ||
| kwargs = {"serialize_meta": True, "path": "root"} if fmt == "hdf5" else {} |
| kwargs = {"serialize_meta": True, "path": "root"} if fmt == "hdf5" else {} | ||
| t.write(path, format=fmt, write_indices=True, **kwargs) | ||
|
|
||
| kwargs = {"astropy_native": True} if fmt == "fits" else {} |
| assert np.all(index.data.sorted_data() == index2.data.sorted_data()) | ||
| # Check engine items as a list of pairs of the form | ||
| # [(key, [row 1, row 2, ...]), ...]. | ||
| assert index.data.items() == index2.data.items() |
There was a problem hiding this comment.
For these engines the protocol dictates that .items() returns list[tuple[Hashable, list[Integral]]]. So this is OK.
|
Understandable. Will you be available tomorrow at all, or should I just assume that any correction has to be postponed to after the freeze ? |
|
I won't be able to look at it until my follow-up or is merged. Your cleanup PR needs to come after my follow-up PR because they are likely to conflict. I'm only working for a couple of hours tomorrow morning. |
|
argh. Forgot to squash. |
|
@astrofrog, any way we could replay this as a squash-merge ? (basically what I'm asking amounts to "could you force push my merge commit away very soon ?") |
|
Why do we need to squash merge? |
|
NBD if we can't, as far as I'm concerned. I just wish I could ease my instant regrets. |
|
I would have preferred a squash merge, but I agree at this point that we should not do any shenanigans with main. There are no huge files or other particular problems in the commits. |
Description
This has been a long-standing goal since 2017 (#6925 and #16836 from 2024).
See this gist for a quick demo of this functionality.
Fixes #16836