Skip to content

Fix character array handling in VOTable parsing and serialization (issue #17098)#19974

Open
cosmoJFH wants to merge 2 commits into
astropy:mainfrom
cosmoJFH:Include_char_array
Open

Fix character array handling in VOTable parsing and serialization (issue #17098)#19974
cosmoJFH wants to merge 2 commits into
astropy:mainfrom
cosmoJFH:Include_char_array

Conversation

@cosmoJFH

@cosmoJFH cosmoJFH commented Jun 23, 2026

Copy link
Copy Markdown

Fix handling of character array fields, fix #17098

The current implementation does not correctly handle certain character array cases during VOTable parsing and serialization, leading to inconsistent behavior and failures in the scenarios described in the issue. This PR updates the character array handling to correctly process these cases

The same underlying issue has also been observed downstream in astropy/astroquery#3599, (Gaia and Euclid).

VOTables produced for the Gaia and Euclid missions are generated using the STILTS library, which supports the creation of VOTables containing string arrays.

The following is an example that illustrates the issue:

  1. Consider the following votable, where col2 defines an array of four elements, each consisting of 10 characters:
<?xml version="1.0" encoding="UTF-8"?>
<VOTABLE version="1.4" xmlns="http://www.ivoa.net/xml/VOTable/v1.3">
    <RESOURCE name="myFavouriteGalaxies">
        <COOSYS ID="sys" equinox="J2000" epoch="J2000" system="eq_FK5"/>
        <TABLE name="results">
            <DESCRIPTION>Velocities and Distance estimations</DESCRIPTION>
            <FIELD name="col1" ID="col1" datatype="char" arraysize="10*" unit="Mpc">
            </FIELD>
            <FIELD name="col2" ID="col2" datatype="char" arraysize="10x4">
            </FIELD>
            <DATA>
                <TABLEDATA>
                    <TR>
                        <TD>  cdefgh  </TD>
                        <TD>1234567890qwertyuiop0987654321abcdefghij</TD>
                    </TR>
                </TABLEDATA>
            </DATA>
        </TABLE>
    </RESOURCE>
</VOTABLE>
  1. When we try to open it, the following error is raised:
import astropy
from astropy.io.votable import parse

print(astropy.__version__)
6.1.7

votable = parse("votable_test.vot")


Traceback (most recent call last):
  File "/home/jfernandez/.local/lib/python3.10/site-packages/astropy/io/votable/converters.py", line 328, in __init__
    self.arraysize = int(field.arraysize)
ValueError: invalid literal for int() with base 10: '10x4'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/jfernandez/.local/lib/python3.10/site-packages/astropy/io/votable/table.py", line 164, in parse
    return tree.VOTableFile(config=config, pos=(1, 1)).parse(iterator, config)
  File "/home/jfernandez/.local/lib/python3.10/site-packages/astropy/io/votable/tree.py", line 4202, in parse
    tag_mapping.get(tag, self._add_unknown_tag)(
  File "/home/jfernandez/.local/lib/python3.10/site-packages/astropy/io/votable/tree.py", line 4082, in _add_resource
    resource.parse(self, iterator, config)
  File "/home/jfernandez/.local/lib/python3.10/site-packages/astropy/io/votable/tree.py", line 3871, in parse
    tag_mapping.get(tag, self._add_unknown_tag)(
  File "/home/jfernandez/.local/lib/python3.10/site-packages/astropy/io/votable/tree.py", line 3813, in _add_table
    table.parse(iterator, config)
  File "/home/jfernandez/.local/lib/python3.10/site-packages/astropy/io/votable/tree.py", line 2707, in parse
    tag_mapping.get(tag, self._add_unknown_tag)(
  File "/home/jfernandez/.local/lib/python3.10/site-packages/astropy/io/votable/tree.py", line 2623, in _add_field
    field = Field(self._votable, config=config, pos=pos, **data)
  File "/home/jfernandez/.local/lib/python3.10/site-packages/astropy/io/votable/tree.py", line 1386, in __init__
    self._setup(config, pos)
  File "/home/jfernandez/.local/lib/python3.10/site-packages/astropy/io/votable/tree.py", line 1429, in _setup
    self.converter = converters.get_converter(self, config, pos)
  File "/home/jfernandez/.local/lib/python3.10/site-packages/astropy/io/votable/converters.py", line 1316, in get_converter
    converter = cls(field, config, pos)
  File "/home/jfernandez/.local/lib/python3.10/site-packages/astropy/io/votable/converters.py", line 330, in __init__
    vo_raise(E01, (field.arraysize, "char", field.ID), config)
  File "/home/jfernandez/.local/lib/python3.10/site-packages/astropy/io/votable/exceptions.py", line 124, in vo_raise
    raise exception_class(args, config, pos)


astropy.io.votable.exceptions.E01: votable_test.vot:?:?: E01: Invalid size specifier '10x4' for a char field (in field 'col2')

Changes

We have included the following changes:

  • Introduced/refactored CharArray and CharArrayVarArray, along with their Unicode counterparts (UnicodeCharArray, UnicodeCharArrayVarArray), to unify and simplify character array handling.
  • Ensured compatibility with the existing public API.
  • Preserved existing binary and text serialization formats.

We have added regression tests covering the cases described in #17098. We have also included new tests:

  • Reproduce the failure on the current main branch.
  • Validate the corrected parsing and serialization behavior.
  • Cover both fixed-length and variable-length character array configurations where applicable.

This is our first contribution to Astropy, so we would welcome any feedback on the implementation approach or on the test coverage.

@github-actions

Copy link
Copy Markdown
Contributor

Thank you for your contribution to Astropy! 🌌 This checklist is meant to remind the package maintainers who will review this pull request of some common things to look for.

  • Do the proposed changes actually accomplish desired goals?
  • Do the proposed changes follow the Astropy coding guidelines?
  • Are tests added/updated as required? If so, do they follow the Astropy testing guidelines?
  • Are docs added/updated as required? If so, do they follow the Astropy documentation guidelines?
  • Is rebase and/or squash necessary? If so, please provide the author with appropriate instructions. Also see instructions for rebase and squash.
  • Did the CI pass? If no, are the failures related? If you need to run daily and weekly cron jobs as part of the PR, please apply the "Extra CI" label. Codestyle issues can be fixed by the bot.
  • Is a change log needed? If yes, did the change log check pass? If no, add the "no-changelog-entry-needed" label. If this is a manual backport, use the "skip-changelog-checks" label unless special changelog handling is necessary.
  • Is this a big PR that makes a "What's new?" entry worthwhile and if so, is (1) a "what's new" entry included in this PR and (2) the "whatsnew-needed" label applied?
  • At the time of adding the milestone, if the milestone set requires a backport to release branch(es), apply the appropriate "backport-X.Y.x" label(s) before merge.

@cosmoJFH cosmoJFH force-pushed the Include_char_array branch 2 times, most recently from 5a36756 to c4be280 Compare June 23, 2026 12:52
@cosmoJFH cosmoJFH changed the title Include char arrayFix character array handling in VOTable parsing and serialization (issue #17098) Fix character array handling in VOTable parsing and serialization (issue #17098) Jun 23, 2026
@pllim pllim added this to the v8.1.0 milestone Jun 23, 2026
@cosmoJFH cosmoJFH force-pushed the Include_char_array branch 2 times, most recently from 7db2f89 to d26fee6 Compare June 23, 2026 19:29
pllim

This comment was marked as resolved.

@pllim

This comment was marked as resolved.

@cosmoJFH

Copy link
Copy Markdown
Author

Does this completely fix #17098 ?

Hi @pllim,

Yes, our intention is to fix #17098, as it is affecting the astroquery library, specifically the Euclid and Gaia modules (see astropy/astroquery#3599).

Please note that this is my first contribution to Astropy, so I would greatly appreciate any feedback.

Cheers,

Jorge.

@cosmoJFH cosmoJFH force-pushed the Include_char_array branch 6 times, most recently from 27f5090 to 131bf83 Compare June 24, 2026 10:22
@pllim

This comment was marked as resolved.

@cosmoJFH cosmoJFH force-pushed the Include_char_array branch 7 times, most recently from ee2e746 to bafd64e Compare June 25, 2026 09:06
@cosmoJFH

cosmoJFH commented Jun 25, 2026

Copy link
Copy Markdown
Author

Hi @pllim,

Please find my point-by-point responses below.

  • Please revert unnecessary style changes across the board; e.g., why remove blank lines, add weird indentations, and so on? There are A LOT of these.

The style changes were introduced automatically by PyCharm, which applies PEP 8 formatting rules. We can revert them if you would prefer to keep the existing style.

  • Do we really need 10 new data files? Can they be combined or tested in other ways, especially since you also added tests that do not need them.

We believe having separate files makes it easier to test and validate the different scenarios.

We developed new tests that read and parse VOTables in the BINARY2 format for the "unicodeChar" and "char data types.

The following files correspond to fixed-length arrays (10x4):

binary2_fix_length_array_char.xml
binary2_fix_length_array_unicodeChar.xml

The following files correspond to variable-length arrays (10x*):

binary2_variable_length_array_char.xml
binary2_variable_length_array_unicodeChar.xml

The following files cover higher-dimensional arrays, including fixed-length arrays (10x4x2) and variable-length arrays (10x4x*):

binary2_fix_length_multidimension_array_char.xml
binary2_fix_length_multidimension_array_unicodeChar.xml
binary2_variable_length_multidimension_array_char.xml
binary2_variable_length_multidimension_array_unicodeChar.xml

The following files are used by the new tests that read and parse regular VOTables (non-BINARY2 format) containing 1D string arrays:

valid_votable_with_char_array.xml
valid_votable_with_char_array2.xml
valid_votable_with_unicodeChar_array.xml
valid_votable_with_unicodeChar_array2.xml

The final two VOTables contain 2D string arrays:

valid_votable_with_char_multidimensional_array.xml
valid_votable_with_unicodeChar_multidimensional_array.xml

We also added the parameterized test test_table_write_char_arrays_verify, which exercises table.write() using different formats and the VOTables listed above.

Our testing shows that table.write() works correctly for fixed-length arrays of type "char". However, this is not the case for "unicodeChar" or multidimensional arrays. We are unsure whether support for those cases should also be included within the scope of this PR.

  • Need 2 change logs, one for io.ascii and one for io.votable

Could you give us more details about this?

Cheers,

Jorge

@cosmoJFH cosmoJFH force-pushed the Include_char_array branch from d15df54 to 3e32d1f Compare June 25, 2026 11:35
@cosmoJFH cosmoJFH force-pushed the Include_char_array branch 4 times, most recently from e3238b0 to 4f4256e Compare June 25, 2026 16:46
@cosmoJFH

cosmoJFH commented Jun 25, 2026

Copy link
Copy Markdown
Author

We have included 2 new files in the Changelog directory and format the code using ruff (+ pre-commit).

Comment thread docs/changes/io.ascii/19974.feature.rst Outdated

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This entry should be modified to explain this diff. This is not just VOTable. What are you fixing here?

-                if col.mask[idx]:
+                flat_mask = col.mask.ravel()
+                if flat_mask[idx]:

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We also added the parameterized test test_table_write_char_arrays_verify, which tests table.write() with different output formats. We made these changes to fix an error that occurs when using

table.write(filename, format="html", overwrite=True)

where filename refers to a VOTable containing a 1D character array.

We are unsure whether this is outside the scope of this PR. If you would prefer to keep the original implementation, we can revert these changes and address this issue, along with similar issues involving character arrays in other output formats, in a separate PRs.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking more like rewording to something like "fix a bug where 2D mask cannot be used" or something more accurate. The current verbiage is not very informative from io.ascii POV. Hope this is not too confusing.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cosmoJFH - can you provide a minimal example that demonstrates the bug being fixed here?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My inclination would be to make this into a separate PR since this seems out of scope. I almost just ignored the review request entirely since I don't do VOtable. 😄.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cosmoJFH - can you provide a minimal example that demonstrates the bug being fixed here?

Hi @taldcroft, we have include new tests in this PR:

  1. astropy/io/votable/tests/test_converter.py‎
  2. ‎astropy/io/votable/tests/test_table.py
  3. ‎astropy/io/votable/tests/test_vo.py

These tests are based on the new votables:

  1. fixed-length arrays (10x4):

binary2_fix_length_array_char.xml
binary2_fix_length_array_unicodeChar.xml

  1. variable-length arrays (10x*):

binary2_variable_length_array_char.xml
binary2_variable_length_array_unicodeChar.xml

  1. higher-dimensional arrays, including fixed-length arrays (10x4x2) and variable-length arrays (10x4x*):

binary2_fix_length_multidimension_array_char.xml
binary2_fix_length_multidimension_array_unicodeChar.xml
binary2_variable_length_multidimension_array_char.xml
binary2_variable_length_multidimension_array_unicodeChar.xml

  1. VOTables (non-BINARY2 format) containing 1D string arrays:

valid_votable_with_char_array.xml
valid_votable_with_char_array2.xml
valid_votable_with_unicodeChar_array.xml
valid_votable_with_unicodeChar_array2.xml

  1. VOTables contain 2D string arrays:

valid_votable_with_char_multidimensional_array.xml
valid_votable_with_unicodeChar_multidimensional_array.xml

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cosmoJFH - I meant a minimal reproducible example that defines a table (without going through VOtable at all) which shows the problem with HTML writing. You should be able to provide this example as a few lines in a GitHub comment.

@cosmoJFH cosmoJFH Jun 29, 2026

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cosmoJFH - I meant a minimal reproducible example that defines a table (without going through VOtable at all) which shows the problem with HTML writing. You should be able to provide this example as a few lines in a GitHub comment.

@taldcroft, yes you are right, we have added an example to the GitHub comment that illustrates the issue.

Regarding the problem with HTML writing. We are not sure if this is a real problem (and we should revert the chages). Here is an example

<?xml version="1.0" encoding="UTF-8"?>
<VOTABLE version="1.4" xmlns="http://www.ivoa.net/xml/VOTable/v1.3">
    <RESOURCE name="myFavouriteGalaxies">
        <COOSYS ID="sys" equinox="J2000" epoch="J2000" system="eq_FK5"/>
        <TABLE name="results">
            <DESCRIPTION>Velocities and Distance estimations</DESCRIPTION>
            <FIELD name="col1" ID="col1" datatype="char" arraysize="10*" unit="Mpc">
            </FIELD>
            <FIELD name="col2" ID="col2" datatype="int" arraysize="2x2">
            </FIELD>
            <DATA>
                <TABLEDATA>
                    <TR>
                        <TD>  cdefgh  </TD>
                        <TD>1 2 3 4</TD>
                    </TR>
                </TABLEDATA>
            </DATA>
        </TABLE>
    </RESOURCE>
</VOTABLE>

and then

from astropy.table import Table
table = Table.read("votable_test.vot")

table.write("my_table.html", format="ascii.html", overwrite=True)

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/jfernandez/.local/lib/python3.10/site-packages/astropy/table/connect.py", line 130, in __call__
    self.registry.write(instance, *args, **kwargs)
  File "/home/jfernandez/.local/lib/python3.10/site-packages/astropy/io/registry/core.py", line 386, in write
    return writer(data, *args, **kwargs)
  File "/home/jfernandez/.local/lib/python3.10/site-packages/astropy/io/ascii/connect.py", line 28, in io_write
    return write(table, filename, **kwargs)
  File "/home/jfernandez/.local/lib/python3.10/site-packages/astropy/utils/decorators.py", line 603, in wrapper
    return function(*args, **kwargs)
  File "/home/jfernandez/.local/lib/python3.10/site-packages/astropy/io/ascii/ui.py", line 1035, in write
    lines = writer.write(table)
  File "/home/jfernandez/.local/lib/python3.10/site-packages/astropy/io/ascii/html.py", line 354, in write
    self._check_multidim_table(table)
  File "/home/jfernandez/.local/lib/python3.10/site-packages/astropy/io/ascii/core.py", line 1378, in _check_multidim_table
    _check_multidim_table(table, self.max_ndim)
  File "/home/jfernandez/.local/lib/python3.10/site-packages/astropy/io/ascii/core.py", line 54, in _check_multidim_table
    raise ValueError(


ValueError: column(s) with dimension > 2 cannot be be written with this format, try using 'ecsv' (Enhanced CSV) format

But for a 1D char array (that is represented as a 2D array) the same error is raised.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The changed code in html.py does not fix this example. From the ascii package perspective, adding support for writing multidimensional columns is entirely out of scope for this PR, please remove that change.

@cosmoJFH cosmoJFH Jul 1, 2026

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The changed code in html.py does not fix this example. From the ascii package perspective, adding support for writing multidimensional columns is entirely out of scope for this PR, please remove that change.

We have reverted the changes in html.py . Also, we commented out the test that makes use of table.write(filename, format="html") , otherwise the following error is raised

        for idx, col_str in enumerate(col_str_iters):
            if is_masked_column and has_fill_values:
>               if col.mask[idx]:
                   ^^^^^^^^^^^^^
E               ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

@pllim pllim requested review from olebole and tomdonaldson June 25, 2026 18:36
@taldcroft taldcroft requested review from taldcroft and removed request for taldcroft June 26, 2026 15:52
@cosmoJFH cosmoJFH force-pushed the Include_char_array branch 16 times, most recently from 8aba966 to c879666 Compare July 3, 2026 12:46
@cosmoJFH cosmoJFH force-pushed the Include_char_array branch from c879666 to 79eaad2 Compare July 3, 2026 12:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

VOTable reader does not accept array-like PARAMs from Topcat

3 participants