Fix character array handling in VOTable parsing and serialization (issue #17098)#19974
Fix character array handling in VOTable parsing and serialization (issue #17098)#19974cosmoJFH wants to merge 2 commits into
Conversation
|
Thank you for your contribution to Astropy! 🌌 This checklist is meant to remind the package maintainers who will review this pull request of some common things to look for.
|
5a36756 to
c4be280
Compare
7db2f89 to
d26fee6
Compare
This comment was marked as resolved.
This comment was marked as resolved.
Hi @pllim, Yes, our intention is to fix #17098, as it is affecting the astroquery library, specifically the Euclid and Gaia modules (see astropy/astroquery#3599). Please note that this is my first contribution to Astropy, so I would greatly appreciate any feedback. Cheers, Jorge. |
27f5090 to
131bf83
Compare
This comment was marked as resolved.
This comment was marked as resolved.
ee2e746 to
bafd64e
Compare
|
Hi @pllim, Please find my point-by-point responses below.
The style changes were introduced automatically by PyCharm, which applies PEP 8 formatting rules. We can revert them if you would prefer to keep the existing style.
We believe having separate files makes it easier to test and validate the different scenarios. We developed new tests that read and parse VOTables in the BINARY2 format for the "unicodeChar" and "char data types. The following files correspond to fixed-length arrays (10x4): The following files correspond to variable-length arrays (10x*): The following files cover higher-dimensional arrays, including fixed-length arrays (10x4x2) and variable-length arrays (10x4x*): The following files are used by the new tests that read and parse regular VOTables (non-BINARY2 format) containing 1D string arrays: The final two VOTables contain 2D string arrays: We also added the parameterized test Our testing shows that t
Could you give us more details about this? Cheers, Jorge |
d15df54 to
3e32d1f
Compare
e3238b0 to
4f4256e
Compare
|
We have included 2 new files in the Changelog directory and format the code using ruff (+ pre-commit). |
There was a problem hiding this comment.
This entry should be modified to explain this diff. This is not just VOTable. What are you fixing here?
- if col.mask[idx]:
+ flat_mask = col.mask.ravel()
+ if flat_mask[idx]:There was a problem hiding this comment.
We also added the parameterized test test_table_write_char_arrays_verify, which tests table.write() with different output formats. We made these changes to fix an error that occurs when using
table.write(filename, format="html", overwrite=True)
where filename refers to a VOTable containing a 1D character array.
We are unsure whether this is outside the scope of this PR. If you would prefer to keep the original implementation, we can revert these changes and address this issue, along with similar issues involving character arrays in other output formats, in a separate PRs.
There was a problem hiding this comment.
I was thinking more like rewording to something like "fix a bug where 2D mask cannot be used" or something more accurate. The current verbiage is not very informative from io.ascii POV. Hope this is not too confusing.
There was a problem hiding this comment.
@cosmoJFH - can you provide a minimal example that demonstrates the bug being fixed here?
There was a problem hiding this comment.
My inclination would be to make this into a separate PR since this seems out of scope. I almost just ignored the review request entirely since I don't do VOtable. 😄.
There was a problem hiding this comment.
@cosmoJFH - can you provide a minimal example that demonstrates the bug being fixed here?
Hi @taldcroft, we have include new tests in this PR:
- astropy/io/votable/tests/test_converter.py
- astropy/io/votable/tests/test_table.py
- astropy/io/votable/tests/test_vo.py
These tests are based on the new votables:
- fixed-length arrays (10x4):
binary2_fix_length_array_char.xml
binary2_fix_length_array_unicodeChar.xml
- variable-length arrays (10x*):
binary2_variable_length_array_char.xml
binary2_variable_length_array_unicodeChar.xml
- higher-dimensional arrays, including fixed-length arrays (10x4x2) and variable-length arrays (10x4x*):
binary2_fix_length_multidimension_array_char.xml
binary2_fix_length_multidimension_array_unicodeChar.xml
binary2_variable_length_multidimension_array_char.xml
binary2_variable_length_multidimension_array_unicodeChar.xml
- VOTables (non-BINARY2 format) containing 1D string arrays:
valid_votable_with_char_array.xml
valid_votable_with_char_array2.xml
valid_votable_with_unicodeChar_array.xml
valid_votable_with_unicodeChar_array2.xml
- VOTables contain 2D string arrays:
valid_votable_with_char_multidimensional_array.xml
valid_votable_with_unicodeChar_multidimensional_array.xml
There was a problem hiding this comment.
@cosmoJFH - I meant a minimal reproducible example that defines a table (without going through VOtable at all) which shows the problem with HTML writing. You should be able to provide this example as a few lines in a GitHub comment.
There was a problem hiding this comment.
@cosmoJFH - I meant a minimal reproducible example that defines a table (without going through VOtable at all) which shows the problem with HTML writing. You should be able to provide this example as a few lines in a GitHub comment.
@taldcroft, yes you are right, we have added an example to the GitHub comment that illustrates the issue.
Regarding the problem with HTML writing. We are not sure if this is a real problem (and we should revert the chages). Here is an example
<?xml version="1.0" encoding="UTF-8"?>
<VOTABLE version="1.4" xmlns="http://www.ivoa.net/xml/VOTable/v1.3">
<RESOURCE name="myFavouriteGalaxies">
<COOSYS ID="sys" equinox="J2000" epoch="J2000" system="eq_FK5"/>
<TABLE name="results">
<DESCRIPTION>Velocities and Distance estimations</DESCRIPTION>
<FIELD name="col1" ID="col1" datatype="char" arraysize="10*" unit="Mpc">
</FIELD>
<FIELD name="col2" ID="col2" datatype="int" arraysize="2x2">
</FIELD>
<DATA>
<TABLEDATA>
<TR>
<TD> cdefgh </TD>
<TD>1 2 3 4</TD>
</TR>
</TABLEDATA>
</DATA>
</TABLE>
</RESOURCE>
</VOTABLE>
and then
from astropy.table import Table
table = Table.read("votable_test.vot")
table.write("my_table.html", format="ascii.html", overwrite=True)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/jfernandez/.local/lib/python3.10/site-packages/astropy/table/connect.py", line 130, in __call__
self.registry.write(instance, *args, **kwargs)
File "/home/jfernandez/.local/lib/python3.10/site-packages/astropy/io/registry/core.py", line 386, in write
return writer(data, *args, **kwargs)
File "/home/jfernandez/.local/lib/python3.10/site-packages/astropy/io/ascii/connect.py", line 28, in io_write
return write(table, filename, **kwargs)
File "/home/jfernandez/.local/lib/python3.10/site-packages/astropy/utils/decorators.py", line 603, in wrapper
return function(*args, **kwargs)
File "/home/jfernandez/.local/lib/python3.10/site-packages/astropy/io/ascii/ui.py", line 1035, in write
lines = writer.write(table)
File "/home/jfernandez/.local/lib/python3.10/site-packages/astropy/io/ascii/html.py", line 354, in write
self._check_multidim_table(table)
File "/home/jfernandez/.local/lib/python3.10/site-packages/astropy/io/ascii/core.py", line 1378, in _check_multidim_table
_check_multidim_table(table, self.max_ndim)
File "/home/jfernandez/.local/lib/python3.10/site-packages/astropy/io/ascii/core.py", line 54, in _check_multidim_table
raise ValueError(
ValueError: column(s) with dimension > 2 cannot be be written with this format, try using 'ecsv' (Enhanced CSV) format
But for a 1D char array (that is represented as a 2D array) the same error is raised.
There was a problem hiding this comment.
The changed code in html.py does not fix this example. From the ascii package perspective, adding support for writing multidimensional columns is entirely out of scope for this PR, please remove that change.
There was a problem hiding this comment.
The changed code in
html.pydoes not fix this example. From theasciipackage perspective, adding support for writing multidimensional columns is entirely out of scope for this PR, please remove that change.
We have reverted the changes in html.py . Also, we commented out the test that makes use of table.write(filename, format="html") , otherwise the following error is raised
for idx, col_str in enumerate(col_str_iters):
if is_masked_column and has_fill_values:
> if col.mask[idx]:
^^^^^^^^^^^^^
E ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
8aba966 to
c879666
Compare
c879666 to
79eaad2
Compare
Fix handling of character array fields, fix #17098
The current implementation does not correctly handle certain character array cases during VOTable parsing and serialization, leading to inconsistent behavior and failures in the scenarios described in the issue. This PR updates the character array handling to correctly process these cases
The same underlying issue has also been observed downstream in astropy/astroquery#3599, (Gaia and Euclid).
VOTables produced for the Gaia and Euclid missions are generated using the STILTS library, which supports the creation of VOTables containing string arrays.
The following is an example that illustrates the issue:
Changes
We have included the following changes:
CharArrayandCharArrayVarArray, along with their Unicode counterparts (UnicodeCharArray,UnicodeCharArrayVarArray), to unify and simplify character array handling.We have added regression tests covering the cases described in #17098. We have also included new tests:
This is our first contribution to Astropy, so we would welcome any feedback on the implementation approach or on the test coverage.