Skip to content

Support writing and reading table indices for formats supporting column metadata#18703

Merged
neutrinoceros merged 29 commits into
astropy:mainfrom
taldcroft:table-serialize-indices-on-read-write
May 1, 2026
Merged

Support writing and reading table indices for formats supporting column metadata#18703
neutrinoceros merged 29 commits into
astropy:mainfrom
taldcroft:table-serialize-indices-on-read-write

Conversation

@taldcroft
Copy link
Copy Markdown
Member

@taldcroft taldcroft commented Oct 9, 2025

Description

This has been a long-standing goal since 2017 (#6925 and #16836 from 2024).

See this gist for a quick demo of this functionality.

Fixes #16836

  • By checking this box, the PR author has requested that maintainers do NOT use the "Squash and Merge" button. Maintainers should respect this when possible; however, the final decision is at the discretion of the maintainer that merges the PR.

@taldcroft taldcroft added this to the v7.2.0 milestone Oct 9, 2025
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Oct 9, 2025

Thank you for your contribution to Astropy! 🌌 This checklist is meant to remind the package maintainers who will review this pull request of some common things to look for.

  • Do the proposed changes actually accomplish desired goals?
  • Do the proposed changes follow the Astropy coding guidelines?
  • Are tests added/updated as required? If so, do they follow the Astropy testing guidelines?
  • Are docs added/updated as required? If so, do they follow the Astropy documentation guidelines?
  • Is rebase and/or squash necessary? If so, please provide the author with appropriate instructions. Also see instructions for rebase and squash.
  • Did the CI pass? If no, are the failures related? If you need to run daily and weekly cron jobs as part of the PR, please apply the "Extra CI" label. Codestyle issues can be fixed by the bot.
  • Is a change log needed? If yes, did the change log check pass? If no, add the "no-changelog-entry-needed" label. If this is a manual backport, use the "skip-changelog-checks" label unless special changelog handling is necessary.
  • Is this a big PR that makes a "What's new?" entry worthwhile and if so, is (1) a "what's new" entry included in this PR and (2) the "whatsnew-needed" label applied?
  • At the time of adding the milestone, if the milestone set requires a backport to release branch(es), apply the appropriate "backport-X.Y.x" label(s) before merge.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Oct 9, 2025

👋 Thank you for your draft pull request! Do you know that you can use [ci skip] or [skip ci] in your commit messages to skip running continuous integration tests until you are ready?

Comment thread astropy/table/connect.py Outdated
@mhvk
Copy link
Copy Markdown
Contributor

mhvk commented Oct 9, 2025

I only had time for a brief look, and think the overall idea is good. However, I wonder if we cannot stick closer to the existing idea of serializing columns, i.e., treat this as a representation of a specific class. I feel that might avoid introducing specific new arguments like engine or wonder about how to represent that.

For reference, here's a simple table with a SkyCoord and a masked column (note that the mask in the sky column ignores serialize_method="data_mask" -- #18704):

# %ECSV 1.0
# ---
# datatype:
# - {name: sc.ra, unit: deg, datatype: float64}
# - {name: sc.dec, unit: deg, datatype: float64}
# - {name: q, unit: km, datatype: float64}
# - {name: q.mask, datatype: bool}
# meta: !!omap
# - __serialized_columns__:
#     q:
#       __class__: astropy.units.quantity.Quantity
#       unit: !astropy.units.Unit {unit: km}
#       value: !astropy.table.SerializedColumn
#         __class__: astropy.utils.masked.core.MaskedNDArray
#         data: !astropy.table.SerializedColumn {name: q}
#         mask: !astropy.table.SerializedColumn {name: q.mask}
#     sc:
#       __class__: astropy.coordinates.sky_coordinate.SkyCoord
#       dec: !astropy.table.SerializedColumn
#         __class__: astropy.coordinates.angles.core.Latitude
#         unit: &id001 !astropy.units.Unit {unit: deg}
#         value: !astropy.table.SerializedColumn
#           __class__: astropy.utils.masked.core.MaskedNDArray
#           data: !astropy.table.SerializedColumn {name: sc.dec}
#       frame: icrs
#       ra: !astropy.table.SerializedColumn
#         __class__: astropy.coordinates.angles.core.Longitude
#         unit: *id001
#         value: !astropy.table.SerializedColumn
#           __class__: astropy.utils.masked.core.MaskedNDArray
#           data: !astropy.table.SerializedColumn {name: sc.ra}
#         wrap_angle: !astropy.coordinates.Angle
#           unit: *id001
#           value: 360.0
#       representation_type: spherical
# schema: astropy-2.0
sc.ra sc.dec q q.mask
"" "" 0.0 False
29.999999999999996 6.0 8.0 True
44.99999999999999 7.0 9.0 False

I guess an overall difficulty, as always, is naming... This is partially related to whether one should think of the indices as belonging to the column or to the table. In the former case, maybe it should be something like col._index_. But multiple columns is possible and the index cannot be accessed that way. Still should it be _index_col and _index_col1_col2 rather than a number?

@taldcroft
Copy link
Copy Markdown
Member Author

taldcroft commented Oct 11, 2025

However, I wonder if we cannot stick closer to the existing idea of serializing columns, i.e., treat this as a representation of a specific class.

@mhvk - I did think about trying to fit this into the serialized columns infrastructure and I agree that re-using this would be preferable if possible. There were a couple of issues that I could not resolve:

I don't think it will be possible to entirely eliminate all the table-level sandwich construct/represent code. To create an index for a multi-column index on ["a", "b"], the code needs the data for both columns a and b. The serialized column machinery is explicitly running per-column and column a knows nothing about the column b data.

Similarly, the machinery is expecting that the attribute data for a column is unique to that column. Consider the multi-column index above. Then column a represents the row index data with a.indices.row_index, but this will be repeated for column b as b.indices.row_index. It isn't obvious how to get around this. This also presumes some new class to allow constructing an index from some attributes and data (basically doing what get_index does but in a class, though again since this requires the entire table I'm not even sure this can work).

So overall I am unable to see a clear path to actually using the serialized columns machinery here. That doesn't mean it isn't possible, just that I probably can't do it without help.

What might be reasonable is moving the representation tbl["a"]["meta"]["__indices__"] data into the meta["__serialized_columns__"] structure. That would at least avoid making a new namespace. Edit: see below.

One lingering doubt in all of this is that I do not like the current indexing implementation of the indices living on the columns instead of the table. I was supervising the GSOC student that did this but I honestly don't remember why we went that direction, but I have in my head a project to refactor this. If you look at the earlier commits in this PR, the indices were being serialized at the table level which actually feels somewhat more natural and you avoid duplication for multi-column indices.

@taldcroft
Copy link
Copy Markdown
Member Author

About putting the indices into __serialized_columns__, I realize now that this would not be easy. The way things are now the code wants to consume all the keywords and put them into the class constructor. So:

>>> text = """
# %ECSV 1.0
# ---
# datatype:
# - {name: col0, unit: m, datatype: float64}
# meta: !!omap
# - __serialized_columns__:
#     col0:
#       __class__: astropy.units.quantity.Quantity
#       unit: !astropy.units.Unit {unit: m}
#       value: !astropy.table.SerializedColumn {name: col0}
#       indices: stuff
# schema: astropy-2.0
col0
1.0
2.0
"""
>>> QTable.read(text, format="ecsv")
TypeError: Quantity.__new__() got an unexpected keyword argument 'indices'

Everything is possible but I fear this would be messy.

@taldcroft
Copy link
Copy Markdown
Member Author

taldcroft commented Oct 11, 2025

I'm somewhat leaning to going back to the original implementation of a new __table_indices__ key on the table meta, alongside __serialized_columns__. That eliminates duplication and matches the more intuitive notion that the indices are an attribute of the table not the columns.

EDIT: that is conceptually nicer but has an implementation problem.

@mhvk
Copy link
Copy Markdown
Contributor

mhvk commented Oct 12, 2025

@taldcroft - I agree, let's think of those indices not as part of any column! Given that, I agree with the basic idea of your notation -- though perhaps it could still be something like f"_index_{'_'.join(relevant_colnames)}". I do worry that might get quite long, so perhaps just a number is fine. After all, in general there will be only one index (in which case I'd suggest it to be just _index_ or _index or so).

This, however, arguably makes those indices more like a column, not less so! So, I continue to wonder if it would not be possible to use the SerializedColumn machinery. If I understand the indexing correctly (which I'm really not sure I do...), the actual index holders are SlicedIndex. I guess conceptually the question is whether one can imagine it having a sensible .info so that it looks enough like a column that it can be serialized. Of course, one then still has to special-case that it is not stored in TableColumns, but rather in Table.meta. But it seems OK that where the indices are stored on the table is an implementation detail.

Anyway, I'll try to think a bit more too, with the idea that they live on the table, not on columns, in mind.

@taldcroft
Copy link
Copy Markdown
Member Author

@mhvk - for some reason when writing about using a table-level __table_indices__, I forgot the reason I switched to columns. The FITS writer handles meta on columns transparently, but for table level meta it defaults to those items being normal FITS keywords, with special handling for __serialized_columns__.

This highlights that the current solution (meta on columns) allows keeping the code for this feature confined to table/connect.py, while ideas at the table level will require changes in other modules.

@taldcroft
Copy link
Copy Markdown
Member Author

the question is whether one can imagine it having a sensible .info so that it looks enough like a column that it can be serialized.

As described more fully previously, I do not see how this is possible. The serialization mechanism with .info is predicated on the object owning its attributes and having a constructor that takes an attribute column (e.g. ra for SkyCoord) as a keyword argument. If that previous comment does not make sense I can elaborate.

@taldcroft
Copy link
Copy Markdown
Member Author

Of course, one then still has to special-case that it is not stored in TableColumns, but rather in Table.meta.

If, in the end, we are putting the serialized index information somewhere other than __serialized_columns__, then I don't see any reason to use the SerializedColumn machinery to do the serialization since that amounts to an implementation detail. The current implementation is almost certainly simpler and more confined than what we could do via SerializedColumn and .info since that is not a great fit and requires making more special cases and other scaffolding to work.

@mhvk
Copy link
Copy Markdown
Contributor

mhvk commented Oct 12, 2025

@taldcroft - I can see it making some sense, in that, effectively, these are computed columns, so different, and it would seem correct to not have them inside __serializaed_columns__. (Which doesn't mean one couldn't use the machinery more -- I think whatever would be in .info is not all that different from what you currently coded up in get_index -- but let's agree to make it an implementation detail.)

So, agreed to think of it as part of the table meta data and call it __indices__.

Another question would be why one would store the whole index rather than just have it be recomputed? (Or that at least be the default option?) But to answer that at least in part myself: I think it is great to have the possibility to write out the index column since it provides information about the table that now is explicitly exposed, and can thus be read by humans or regular .csv readers, etc. And whatever is in the header needs of course to include what is needed to reconstruct the index.

@mhvk
Copy link
Copy Markdown
Contributor

mhvk commented Oct 12, 2025

More to the nitty gritty, looking at the output as you have it,

      #   - __indices__:
      #     - colnames: !!python/tuple [a]
      #       engine: SortedArray
      #       primary: true
      #       row_index_colname: __index__0
      #       unique: false

First, smaller details, easy to fix:

  1. I'd just let colnames be a yaml list - that it is a tuple is irrelevant for the information, and since we are creating a new serialization mechanism for .escv, we don't need to include that.
  2. primary would seem a property not of an individual index, but of the whole group, so I'd move that out under __indices__ -- though not sure one wants it to refer to [a], as it does in the table, or rather to __index__0.

A slightly larger suggestion would still be to go with a format more like SerialiazedColumn, to keep implementation simpler and more consistent (also for other people writing .ecsv readers!). I.e., I'd do,

- __indices__
  - primary: __index__0  # or primary_key: [a], might want to omit if there is only one index
  - __index__0  # still don't like that name much!, do omit zero if there is only one.
    - __class__: Index  # not SlicedIndex, that seems irrelevant
    - columns: [a]
    - engine: SortedArray
    - unique: False

So, not very different from what you have, except that now __index__0 is basically formatted like a serialized column, so initialization is trivial: just __class__(**kwargs), with machinery that is in place.

@mhvk
Copy link
Copy Markdown
Contributor

mhvk commented Oct 12, 2025

This highlights that the current solution (meta on columns) allows keeping the code for this feature confined to table/connect.py, while ideas at the table level will require changes in other modules.

Hmm, only now saw your counterpoint to what we seemed to be heading towards... I do think our conclusion was the right one, though, so perhaps we should bite the bullet and start treating tables more like columns everywhere. Maybe this is not so difficult if we start making table.info more compatible with column.info? (Not suggesting we again try to make a table be able to be a column in another one, but just to increase the amount of shared machinery...)

@taldcroft
Copy link
Copy Markdown
Member Author

Another question would be why one would store the whole index rather than just have it be recomputed?

For me at least, the promise of serializing the indices comes with actually storing the index like a real database. I.e. prefering to use disk space instead of CPU and processing time. Just having a light shim to store the keys and recompute the indices seems like a bit of a sham.

@taldcroft
Copy link
Copy Markdown
Member Author

So, agreed to think of it as part of the table meta data and call it __indices__.

I'm not sure if you are being loose with words or I am not getting the message across. The index information needs to be stored on the COLUMN meta, not on the table meta. That is, unless we are willing to add significant new infrastructure to serializing meta for FITS.

Currently the FITS writer assumes that all keywords in the table meta can be expressed directly in FITS, i.e. the values should be simple types like int, float, string. In _encode_mixins(), the code checks if there is any column meta, and if so triggers the subsequent code to serialize that arbitrary column meta using FITS COMMENT line starting with "--BEGIN-ASTROPY-SERIALIZED-COLUMNS--". It is not elegant but that's what we use.

tl;dr: column meta gets serialized for free in FITS (and ECSV, HDF5).

Serializing table meta in FITS would require special-casing some keyword like __indices__ and then making a new section of FITS header comments like "--BEGIN-ASTROPY-SERIALIZED-INDICES--". This is feasible, but I'm not very keen on doubling-down a formalism that I don't like to begin with (despite having invented it!).

@taldcroft
Copy link
Copy Markdown
Member Author

About the suggestion of formatting like below:

- __indices__
  - primary: __index__0  # or primary_key: [a], might want to omit if there is only one index
  - __index__0  # still don't like that name much!, do omit zero if there is only one.
    - __class__: Index  # not SlicedIndex, that seems irrelevant
    - columns: [a]
    - engine: SortedArray
    - unique: False

I also don't like the repetition of primary, but with your suggestion there is a problem that YAML mappings do not preserve the order, so managing them by key does not guarantee preserving the index order. This matters because tbl.indices is fundamentally a list and code can certainly do index = tbl.indices[1].

tl;dr: the indices must be represented as a list, not a mapping.

So, not very different from what you have, except that now __index__0 is basically formatted like a serialized column, so initialization is trivial: just __class__(**kwargs), with machinery that is in place.

Sadly the indexing code is not so simple. The Index class is actually created as an object and then passed into the SlicedIndex initializer. The new get_index code somewhat approximates what you have in your mind, but you can see that the entire table is an input to this constructor. Your __class__(**kwargs) does not have sufficient information to create a SlicedIndex.

@taldcroft
Copy link
Copy Markdown
Member Author

@mhvk - I've pushed a few more commits to start making this real, with the understanding that the output format may still be open for update.

Right now, here is what it looks like, where I implemented your suggestions about colnames and starting with __index__. Another key update here is not repeating index_info, so each unique index info appears only once. I agree this is not perfect, but I do think it is better than the alternative I outlined for FITS.

>>> t = QTable()
>>> t["a"] = [2, 3, 1, 1]
>>> t["b"] = [3, 5, 4, 6]
>>> t.add_index("a", engine=SCEngine)
>>> t.add_index(["b", "a"])
>>> t.write(sys.stdout, format="ecsv", write_indices=True)
# %ECSV 1.0
# ---
# datatype:
# - name: a
#   datatype: int64
#   meta: !!omap
#   - __indices__:
#     - colnames: [a]
#       engine: SCEngine
#       primary: true
#       row_index_colname: __index__
#       unique: false
# - name: b
#   datatype: int64
#   meta: !!omap
#   - __indices__:
#     - colnames: [b, a]
#       engine: SortedArray
#       primary: false
#       row_index_colname: __index__1
#       unique: false
# - {name: __index__, datatype: int64}
# - {name: __index__1, datatype: int64}
# schema: astropy-2.0
a b __index__ __index__1
2 3 2 0
3 5 3 2
1 4 0 1
1 6 1 3

@taldcroft taldcroft force-pushed the table-serialize-indices-on-read-write branch from ca99162 to 991b078 Compare October 14, 2025 18:53
@taldcroft
Copy link
Copy Markdown
Member Author

taldcroft commented Oct 19, 2025

Ping @mhvk.

The feature freeze is coming very soon and this would be good to get in if we can agree on the format and location for the new metadata. Thinking about how the ECSV looks is maybe the easiest way to visualize. The internal implementation does not matter too much since we can change that later.

The failing tests are all from a trivial doctest issue that I will fix.

@taldcroft taldcroft marked this pull request as ready for review October 19, 2025 12:23
@pllim
Copy link
Copy Markdown
Member

pllim commented Oct 20, 2025

The feature freeze is coming very soon

OMG is it this week? 😱

@taldcroft
Copy link
Copy Markdown
Member Author

The release calendar says Friday Oct 24, but I have not seen any notice to astropy-dev. Usually there should be a week since the final warning and the feature freeze.

@taldcroft taldcroft force-pushed the table-serialize-indices-on-read-write branch from e8a4b10 to 4d97ef3 Compare April 30, 2026 10:34
@taldcroft taldcroft requested review from mhvk and neutrinoceros April 30, 2026 13:00
@taldcroft
Copy link
Copy Markdown
Member Author

Passing tests!

@neutrinoceros
Copy link
Copy Markdown
Contributor

Quick note that I couldn't actually get to review the tests this morning, because the task I tackled first (looking for the origin of a segfault in devdeps) took me much longer than I anticipated. I'll get around to this next time I'm at a keyboard, which will be about 2 hours from now. Sorry about this delay.

@taldcroft
Copy link
Copy Markdown
Member Author

@mhvk @neutrinoceros - I looked again at the FITS code and I think I see a way to use a table-level meta. It requires some special-casing and also smushing a new key into the --BEGIN-ASTROPY-SERIALIZED-COLUMNS-- comments section. It would look like below. I'm going to play around with this today as a follow-up PR to this one, but let me know now if you have objections.

COMMENT --BEGIN-ASTROPY-SERIALIZED-COLUMNS--
COMMENT datatype:
COMMENT - {name: a, unit: m, datatype: float64}
COMMENT - {name: b, datatype: int64}
COMMENT meta: !!omap
COMMENT - __serialized_columns__:
COMMENT     a:
COMMENT       __class__: astropy.units.quantity.Quantity
COMMENT       unit: !astropy.units.Unit {unit: m}
COMMENT       value: !astropy.table.SerializedColumn {name: a}
COMMENT - __table_indices__:
COMMENT     indices:
COMMENT     - colnames: [a, __index__1]
COMMENT       row_index_colname: __index__
COMMENT     - colnames: [a]
COMMENT       row_index_colname: __index__2
COMMENT     primary_key: [a, __index__1]
COMMENT --END-ASTROPY-SERIALIZED-COLUMNS--

Copy link
Copy Markdown
Contributor

@mhvk mhvk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@taldcroft - I think that FITS comment section would be very self-explanatory, so definitely good as far as I'm concerned. And happy to take (correctly) any blame if we have to convince the maintainers that this can go in a little late...

But let me also approve this now -- it looks all good! I won't merge, though, since @neutrinoceros seemed to intend to come back to this soon (aside, thanks very much @neutrinoceros for taking over the more careful review!).

@taldcroft
Copy link
Copy Markdown
Member Author

Thanks @mhvk. I only regret now calling that section ASTROPY-SERIALIZED-COLUMNS instead of something like ASTROPY-TABLE-META. Sigh.

About the follow-up PR, it might not be a stretch to call it a bug fix that could be applied to the RC. 😄

Copy link
Copy Markdown
Contributor

@neutrinoceros neutrinoceros left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

got a couple simple questions about these tests

Comment thread astropy/table/tests/test_index.py Outdated
Comment on lines +760 to +779
lines = [
"# %ECSV 1.0",
"# ---",
"# datatype:",
"# - name: a",
"# datatype: int64",
"# meta: !!omap",
"# - __indices__:",
"# - colnames: [a]",
"# engine: Foo",
"# index_colname: __index__",
"# unique: false",
"# - {name: __index__, datatype: int64}",
"# schema: astropy-2.0",
"a __index__",
"1 0",
"3 2",
"2 1",
]
text = "\n".join(lines)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is stable under ruff format

Suggested change
lines = [
"# %ECSV 1.0",
"# ---",
"# datatype:",
"# - name: a",
"# datatype: int64",
"# meta: !!omap",
"# - __indices__:",
"# - colnames: [a]",
"# engine: Foo",
"# index_colname: __index__",
"# unique: false",
"# - {name: __index__, datatype: int64}",
"# schema: astropy-2.0",
"a __index__",
"1 0",
"3 2",
"2 1",
]
text = "\n".join(lines)
text = "\n".join([
"# %ECSV 1.0",
"# ---",
"# datatype:",
"# - name: a",
"# datatype: int64",
"# meta: !!omap",
"# - __indices__:",
"# - colnames: [a]",
"# engine: Foo",
"# index_colname: __index__",
"# unique: false",
"# - {name: __index__, datatype: int64}",
"# schema: astropy-2.0",
"a __index__",
"1 0",
"3 2",
"2 1",
])

Copy link
Copy Markdown
Member Author

@taldcroft taldcroft Apr 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't like stacking things like this, won't fix.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about using implicit str concatenation then ?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Basically, replace "\n".join([ with (, drop all trailing commas, and change ]) for ) at the end)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find the easiest to read for these kinds of situations is just a triple quote string with a dedent at the end!

Comment thread astropy/table/tests/test_index.py Outdated
Comment thread astropy/table/tests/test_index.py
t.add_index("a")
out = io.StringIO()
t.write(out, format="ecsv", write_indices=True)
assert out.getvalue().splitlines() == [
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a particular benefit to comparing lists of lines instead of a single string of text that I'm not seeing ?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because it formats more nicely. I'm always bothered by having fully left-justified lines in the code, it just looks ugly. I've settled on this idiom and it works well and is convenient for me.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Implicit str concatenation looks very similar but even simpler.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But the output using implicit str concatenation is not easy to generate, in contrast to the way I am doing it.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do mean "generate" ?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For "regression test" type outputs like this, I normally run the test script with --pdb and if it fails, then I print out.getvalue().splitlines(), paste that into the file and reformat with ruff. Then I examine the diffs carefully to ensure that they are what I expect.

That process is reliable and efficient.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fair enough

Comment on lines +901 to +910
t = QTable()
t["a"] = Time([1, 3, 2, 2], format="cxcsec")
t["b"] = [3, 2, 2, 1]
t["__index__"] = [3, 1, 4, 2] # Force a collision
indices_colnames = [
["a"],
["b", "a"],
["a", "b", "__index__"],
["__index__"],
]
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since the setup never changes, I'd prefer storing this input in a module constant and reuse-it (without mutations) than skip test case that we currently think do not matter (but for how long ?).

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Won't fix, out of time.

t.add_index(colnames, engine=engine)

path = tmp_path / f"out.{fmt}"
kwargs = {"serialize_meta": True, "path": "root"} if fmt == "hdf5" else {}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this would be clearer if expressed as part of the test's parametrization instead of here

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Won't fix.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it okay if I do it ? Maybe as a follow up PR ?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can do a followup PR, but not before the 8.0 feature freeze. I'm currently working on a PR (intended for 8.0) to move the serialization info from columns into table meta. This will touch tests and I don't want to get into a merge conflict.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gotcha

kwargs = {"serialize_meta": True, "path": "root"} if fmt == "hdf5" else {}
t.write(path, format=fmt, write_indices=True, **kwargs)

kwargs = {"astropy_native": True} if fmt == "fits" else {}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Won't fix.

for colnames in indices_colnames:
index = t.indices[colnames]
index2 = t2.indices[colnames]
assert index.id == index2.id
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just a note in passing, I really wish we could use pytest 9's subtest fixture here. We can't, it's too new, but I'll still take the opportunity to raise awareness about this new and extremely convenient way to write large tests with many asserts

assert np.all(index.data.sorted_data() == index2.data.sorted_data())
# Check engine items as a list of pairs of the form
# [(key, [row 1, row 2, ...]), ...].
assert index.data.items() == index2.data.items()
Copy link
Copy Markdown
Contributor

@neutrinoceros neutrinoceros Apr 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does this actually compare lists ? Deliberately not refreshing my memory here, but when I see an items() method I expect it to work as dict's, which produces a generator, which in turn doesn't really support comparisons other than identity.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For these engines the protocol dictates that .items() returns list[tuple[Hashable, list[Integral]]]. So this is OK.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok

Copy link
Copy Markdown
Member Author

@taldcroft taldcroft left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@neutrinoceros - thanks for the detailed review. I'm not going to address some of your change requests that don't directly improve test correctness or completeness but are intended to improve the code style. I am out of time. I'd like to get this PR merged and if you want to improve the test code style later that's fine.

Comment on lines +901 to +910
t = QTable()
t["a"] = Time([1, 3, 2, 2], format="cxcsec")
t["b"] = [3, 2, 2, 1]
t["__index__"] = [3, 1, 4, 2] # Force a collision
indices_colnames = [
["a"],
["b", "a"],
["a", "b", "__index__"],
["__index__"],
]
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Won't fix, out of time.

t.add_index(colnames, engine=engine)

path = tmp_path / f"out.{fmt}"
kwargs = {"serialize_meta": True, "path": "root"} if fmt == "hdf5" else {}
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Won't fix.

kwargs = {"serialize_meta": True, "path": "root"} if fmt == "hdf5" else {}
t.write(path, format=fmt, write_indices=True, **kwargs)

kwargs = {"astropy_native": True} if fmt == "fits" else {}
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Won't fix.

assert np.all(index.data.sorted_data() == index2.data.sorted_data())
# Check engine items as a list of pairs of the form
# [(key, [row 1, row 2, ...]), ...].
assert index.data.items() == index2.data.items()
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For these engines the protocol dictates that .items() returns list[tuple[Hashable, list[Integral]]]. So this is OK.

@neutrinoceros
Copy link
Copy Markdown
Contributor

Understandable. Will you be available tomorrow at all, or should I just assume that any correction has to be postponed to after the freeze ?

@taldcroft
Copy link
Copy Markdown
Member Author

I won't be able to look at it until my follow-up or is merged. Your cleanup PR needs to come after my follow-up PR because they are likely to conflict. I'm only working for a couple of hours tomorrow morning.

@neutrinoceros neutrinoceros merged commit d7e0ad4 into astropy:main May 1, 2026
37 of 38 checks passed
@neutrinoceros
Copy link
Copy Markdown
Contributor

argh. Forgot to squash.

@neutrinoceros
Copy link
Copy Markdown
Contributor

@astrofrog, any way we could replay this as a squash-merge ? (basically what I'm asking amounts to "could you force push my merge commit away very soon ?")

@astrofrog
Copy link
Copy Markdown
Member

Why do we need to squash merge?

@neutrinoceros
Copy link
Copy Markdown
Contributor

NBD if we can't, as far as I'm concerned. I just wish I could ease my instant regrets.

@taldcroft
Copy link
Copy Markdown
Member Author

I would have preferred a squash merge, but I agree at this point that we should not do any shenanigans with main. There are no huge files or other particular problems in the commits.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

index column not preserved on table write/read

5 participants