
ENH: allow larger than C int sized structured dtypes #31332

Open
tylerjereddy wants to merge 20 commits into numpy:main from tylerjereddy:treddy_issue_31308

Conversation

@tylerjereddy
Contributor

@tylerjereddy tylerjereddy commented Apr 26, 2026

I spent a few hours debugging on this one--I'll add a few initial inline self-review comments to start...

[skip azp] [skip cirrus] [skip actions]

AI Disclosure

No AI tools used

Member

@seberg seberg left a comment


Just some quick comments. Would be good to check whether there might be other (int) casts or definitions (I could even see asking some LLM to search the code-base).
I guess the pickling path is already fixed, so there likely isn't much, but I am not sure (some string paths, or maybe the hashing).

```python
assert_equal(x, y)
assert_equal(x[0], y[0])

@pytest.mark.slow()
```
Contributor Author


`test_pickling_large` (and the rest of the full testsuite) now passes locally on this branch, thanks to a small type fix in `arraydescr_reduce`.

A few notes here:

  • on an ARM M3 Max Mac laptop, that test takes a long time to pass--more than 6 minutes: 1 passed in 402.71s (0:06:42). I think that means the slow() marker doesn't cut it, and I guess NumPy doesn't have xslow? What do you want me to do here, then? Do we really need to iterate over 2 "> C int" cases x 5 pickle protocol versions? If so, how should we selectively run this only once in a while? Even a single protocol version takes over a minute to run.
  • should we also stack in a `requires_memory` decoration here? I believe these arrays are pretty massive, like in the other test I added that materializes such arrays.

Member

@seberg seberg Apr 28, 2026


Maybe just skip the "dtype is functional" test, or make the array empty: `np.zeros(0, dtype=dtype)`.
I am not 100% sure that covers everything, but it seems fair to me to do that.

There isn't really much to check anyway; the only thing I can think of is whether `dtype.flags == pickled.flags`, because `dtype == pickled` should check almost everything else (metadata is already checked).
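The empty-array idea above can be sketched like this. The sizes are deliberately tiny stand-ins so the sketch runs anywhere; the PR's actual tests use field shapes whose itemsize exceeds 2**31, and `check_pickling` is NumPy's internal test helper, not reproduced here.

```python
import pickle
import numpy as np

# Tiny stand-in dtype; the PR's tests use shapes past the C int range.
dtype = np.dtype([("a", "i4", (1000,))])

# A zero-length array exercises dtype handling without allocating any
# element storage, which is the point of np.zeros(0, dtype=dtype).
arr = np.zeros(0, dtype=dtype)
assert arr.nbytes == 0

# Round-trip the dtype itself and compare the cheap-to-check fields.
pickled = pickle.loads(pickle.dumps(dtype))
assert pickled == dtype
assert pickled.itemsize == dtype.itemsize
assert pickled.flags == dtype.flags
```

The `itemsize` comparison is the check that ends up mattering later in this thread, since `==` alone did not catch the reconstruction bug.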

Contributor Author

@tylerjereddy tylerjereddy Apr 30, 2026


I left the initial materialization in, `y = np.zeros(3, dtype=pickled)`, but gave `check_pickling` a default-on argument that optionally runs the actual assertions on the materialized arrays, which I've turned off for the `test_pickling_large` calls.

I believe this is probably robust, but there may be some risk if some operating system did actually fully allocate the RAM on a zeros call--I believe that's rare, though? CI may tell us--I've let CI run this time.

With that change, the test would no longer fail without the source patches here, though, so I added an extra check, `assert_equal(pickled.itemsize, dtype.itemsize)`, which does fail without the work in this PR. I was slightly surprised that `assert_equal(pickled.descr, dtype.descr)` wasn't sufficient to enforce proper reconstitution of `itemsize`, since that reads as "they have the same descriptors," but apparently it isn't...

This also allowed me to remove the slow marker.

Contributor Author


Alright, CI is failing on the first try; `IS_64BIT` may not be a sufficient guard for some Windows cases?

`test_gh_31308` needs a 64-bit guard. Some of the CI failures are unrelated--a bit messy.

Member


There is one MKL job generally failing right now, just ignore that one.

Contributor Author


The latest commit suppresses CI again for now. There is one remaining failure that I'm going to need to debug directly on a Windows box, below the fold. The new `itemsize` check that I added to `check_pickling` seems to show `dtype = np.dtype(f"({2**31},)i")` getting set to `dtype(('<i4', (-2147483648,)))`, independent of serialization. The `intp` size on that Windows CI runner should be just fine, since the test is guarded by `IS_64BIT`, which comes from `np.dtype(np.intp).itemsize`. `git grep -E -i "int elsize"` still has several hits on this branch, but this will be way easier to debug directly on a Windows machine, so I'll have to circle back.

Details

```
>           assert_equal(pickled.itemsize, dtype.itemsize)
E           AssertionError:
E           Items are not equal:
E            ACTUAL: 0
E            DESIRED: 8589934592

arr_assert = False
buf        = b'cnumpy\ndtype\np0\n(VV0\np1\nI00\nI01\ntp2\nRp3\n(I3\nV|\np4\n(g0\n(Vi4\np5\nI00\nI01\ntp6\nRp7\n(I3\nV<\np8\nNNNI-1\nI-1\nI0\ntp9\nb(I-2147483648\ntp10\ntp11\nNNI0\nI4\nI0\ntp12\nb.'
dtype      = dtype(('<i4', (-2147483648,)))
pickled    = dtype(('<i4', (-2147483648,)))
```
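The -2147483648 in that repr is consistent with a signed 32-bit wraparound of the shape value 2**31. A quick illustration, using `ctypes` only to model a C `int` slot (this is not NumPy's actual code path):

```python
import ctypes

# Model a C `int` receiving the shape value 2**31: a signed 32-bit
# quantity wraps to -2**31, matching the dtype repr in the failure above.
shape = 2**31
wrapped = ctypes.c_int(shape).value
print(wrapped)       # -2147483648

# The itemsize the test expected for an i4 field of that shape:
print(shape * 4)     # 8589934592, the DESIRED value in the assertion
```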

Member

@seberg seberg May 1, 2026


You can probably get away without trying a Windows box. If this diverges on Windows, you can be sure there is some use of `long` along the way--either a C `long` or a `PyLong_FromLong` or similar call somewhere along the branch.
And just grepping sees things like:

```c
PyObject *size_obj = PyLong_FromLong((long) totalsize);
```

There are a lot more in that file... Got to just replace them all with `PyLong_FromSsize_t` (or whatever it is). And most likely that will be enough.

(There might be a corresponding `PyLong_AsLong` somewhere, but from a grep those seem more likely to be fixed than the reverse.)
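The Windows angle here comes from the LLP64 data model, where a C `long` stays 32-bit even in 64-bit processes. The relevant widths can be checked from Python via `ctypes`:

```python
import ctypes

# On Windows (LLP64), C `long` is 4 bytes even in 64-bit builds, so
# PyLong_FromLong((long) totalsize) truncates values >= 2**31 there.
# Py_ssize_t (PyLong_FromSsize_t) is pointer-sized on every platform.
print(ctypes.sizeof(ctypes.c_long))     # 4 on Windows, 8 on most 64-bit Unix
print(ctypes.sizeof(ctypes.c_ssize_t))  # matches the pointer width

assert ctypes.sizeof(ctypes.c_ssize_t) == ctypes.sizeof(ctypes.c_void_p)
```

This is why the failure only reproduced on the Windows CI runners even with the `IS_64BIT` guard in place.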

Contributor Author


Ok, the CI is passing apart from two clearly unrelated failures.

I did end up having to use a Windows box, because debugging through my fork's CI was too painful. I also added an additional assertion to one of my new tests (`test_gh_31308()`), since it should have picked up the `descr` problem that was causing the pickle test to fail (it now does).

@tylerjereddy
Contributor Author

Alright, most of the comments have been addressed now I think. Beyond some of the still-open discussions above, I wonder if I'm missing some important routes to structured dtype construction--my testing here mostly focused on the "tuple" specification route, but perhaps there are others that need testing/shims.

I guess that's related to my self-review requirement for more thorough tests.

I'll have to circle back another day/time for that though.

@seberg
Member

seberg commented Apr 28, 2026

for more thorough tests.

I guess creation-wise, it is probably worth checking whether the 1000000i,100000i style also has overflow checks (I am not even sure that takes a different path, though).
Otherwise, I guess the biggest thing would be some actual usage tests, such as a cast/promotion. Maybe one that explodes the size: `result_type("<large_num>?,<large_num>", "<large_num>c,<large_num>?")`, which goes to `<large_num>c,<large_num>c` (larger than either input).
It may be interesting to actually try one of those for a real cast, but of course that needs a lot of memory and is very slow.
(Just throwing it out there; not sure we need quite all of these.)
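A small-scale analogue of that promotion, with scalar fields small enough to run anywhere (whether the huge-subarray variants follow the same path is exactly what the suggested test would probe):

```python
import numpy as np

# Each field pair (i4, f4) promotes to f8, so the promoted dtype is
# larger than either input -- the same "size explosion" described above,
# just without the > C int field counts used in the PR.
a = np.dtype("i4,f4")          # itemsize 8
b = np.dtype("f4,i4")          # itemsize 8
promoted = np.result_type(a, b)
print(promoted.itemsize)       # 16: both fields became f8
```

Structured-dtype promotion in `np.result_type` requires matching field names (here the defaults `f0`/`f1`) and is only available in reasonably recent NumPy.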

@tylerjereddy
Contributor Author

it is probably worth checking whether the 1000000i,100000i style also has overflow checks

I parametrized test_gh_31308() to include "2147483648i,2147483648i" construction, which produces an itemsize of 0, so that test case does indeed fail for now. I'll need to investigate.

I thought I saw a dictionary code path too.. but anyway one thing at a time.

@seberg
Member

seberg commented Apr 29, 2026

I thought I saw a dictionary code path too.. but anyway one thing at a time.

Ah, yeah... `dict(names=["a", ...], formats=["20000f"])`--and that can also include `offsets=[2**31-100]`, another path that could overflow, I guess.
(On the up-side, you are seriously cleaning up these code paths...)
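For reference, a runnable sketch of that dict-based specification path with tiny stand-in sizes (the PR's cases push the formats past 2**31):

```python
import numpy as np

# Dict-based structured dtype specification: a subarray field plus a
# scalar field. Sizes here are small stand-ins for the > C int values
# exercised in the PR.
dtype = np.dtype({
    "names": ["a", "b"],
    "formats": ["(1000,)f4", "i8"],
})
print(dtype.itemsize)        # 4000 + 8 = 4008
print(dtype.fields["b"][1])  # offset of "b": 4000
```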

@tylerjereddy
Contributor Author

"2147483648i,2147483648i"

That test case is now passing as well, after latest commit. Let me check the dictionary case next.

@tylerjereddy
Contributor Author

dict(names=["a", ...], formats=["20000f"])

Yes, that approach also overflows on main with large enough input--I've pushed up a new test case and small patch that allows it to pass.

* Related to numpygh-30315 and numpygh-31308, but very much
a work in progress.

* Although the original failing regression test does pass here, it is not even close to safe to do this yet, even though the full testsuite passes locally with `test_pickling_large` re-enabled.

[skip azp] [skip cirrus] [skip actions]
* `test_gh_31308` has been improved to verify that
`itemsize` is actually correctly populated on the
newly-supported `dtype` construction.

* Minor source changes have been made to allow
the above regression test to pass.
`test_gh_31308_materialized()` is now passing, so it has
been adjusted to be allowed to run if sufficient memory
is available. The test was also improved to add a basic
assertion about the recarray size that results.

[skip azp] [skip cirrus] [skip actions]
* `test_pickling_large` now passes thanks to a small type specification
change in `arraydescr_reduce`. Note that the test now takes ~6 minutes
to run locally on ARM Mac, so the newly-added `slow` marker probably
isn't sufficient.

[skip azp] [skip cirrus] [skip actions]
* Simplified the error handling in `_convert_from_tuple()` function
based on reviewer feedback.

[skip azp] [skip cirrus] [skip actions]
* Simplify the error checking in `_convert_from_tuple()` and
change a variable type in that function, based on reviewer
feedback.

[skip azp] [skip cirrus] [skip actions]
* `test_shape_invalid()` now has two "later overflow" test
cases restored, based on reviewer feedback.

[skip azp] [skip cirrus] [skip actions]
* `test_gh_31308_materialized()` has been adjusted to also
have a 64-bit machine guard, since that is required for this test.

[skip azp] [skip cirrus] [skip actions]
* `_convert_from_tuple()` now uses a more appropriate C function,
`npy_mul_sizes_with_overflow()`, to check for overflow, based
on reviewer feedback.

[skip azp] [skip cirrus] [skip actions]
* Remove an extraneous typecast in `_convert_from_tuple()` function.

[skip azp] [skip cirrus] [skip actions]
* Parametrize `test_gh_31308()` to include an `"i, i"` style
structured dtype construction. That test case currently fails
so will need to be repaired to avoid overflow.

[skip azp] [skip cirrus] [skip actions]
* Adjust a variable's type in the `_convert_from_list()` function
to allow this structured dtype specification to be properly
processed: `"2147483648i,2147483648i"`.

[skip azp] [skip cirrus] [skip actions]
* `test_gh_31308()` has been augmented to include a new
test case for "larger than C int" structured dtype
specification via a dictionary.

* `_convert_from_dict()` has had a variable type specification
improved to support the above test case.

[skip azp] [skip cirrus] [skip actions]
* More dictionary-based structured dtype specification cases
have been added to `test_gh_31308()`. A small typing patch
for the `offset` variable in `_convert_from_dict()` has
been added that allows the new test cases to pass.

[skip azp] [skip cirrus] [skip actions]
@tylerjereddy
Contributor Author

offsets=[2**31-100]

An offset smaller than a C int was "ok," even if the field element size was already larger than a C int, but an offset larger than a C int would error out. Test cases and a small patch to fix/allow that have been pushed up.

I also had to resolve some merge conflicts; I guess there's been some activity in the descriptor source.

I'll continue to skip the CI for now.
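The offsets path described above, scaled down (explicit offsets past a field's end are allowed, and the total itemsize is inferred from the field ends; the PR's cases push the offsets themselves near and beyond 2**31):

```python
import numpy as np

# Explicit offsets: "x" is placed at byte 64, well past the 1-byte
# "pad" field, and itemsize is inferred as the end of the last field.
dtype = np.dtype({
    "names": ["pad", "x"],
    "formats": ["u1", "(512,)f8"],
    "offsets": [0, 64],
})
print(dtype.fields["x"][1])  # 64
print(dtype.itemsize)        # 64 + 512 * 8 = 4160
```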

* `test_gh_31308()` has been improved with several new/more
complex structured dtype test cases (they all pass, as expected).

[skip azp] [skip cirrus] [skip actions]
* The `check_pickling` testing utility function has been
adjusted to allow skipping the comparison of materialized
arrays, because this can take several minutes for newly-supported
large structured dtypes. To compensate for this, `check_pickling`
has been augmented to additionally verify reconstitution of `itemsize`
for serialized dtypes, which is a check that fails without the source
patches in the above PR.

* As a result, it is no longer necessary to mark `test_pickling_large`
with `slow()`.
* `check_pickling` required too much memory for large structured
dtypes, so the materializations of the arrays have been moved
under the new `arr_assert` guard.

* `test_gh_31308` was missing an `IS_64BIT` guard.

[skip azp] [skip cirrus] [skip actions]
* Several more size/elsize related typing fixes in the descriptor
source to support the above PR, and to fix a Windows
test failure observed there (confirmed locally on Windows
box).

* Augment `test_gh_31308()` with an additional assertion
that is sensitive to the need for some of these source changes.
@tylerjereddy tylerjereddy changed the title WIP, ENH: allow larger than C int sized structured dtypes ENH: allow larger than C int sized structured dtypes May 1, 2026
* Add a release note for the above PR.
@seberg
Member

seberg commented May 4, 2026

Thanks, this all looks good now. I dunno, TBH, an agent might be the simplest way to find and maybe also just fix these... although a grep for `elsize` and `ITEMSIZE` may be sufficient (also field iteration/names).
While for some paths it's totally fine (because we know we have simple numeric types), there are still a bunch of paths that are wrong, e.g. (but there are more):

  • Use of `Oi` tuple unpacking in `descriptor.c` for the offset.
  • Use of integers for field iteration/names in `VOID_setitem`/`getitem`/`copyswap`...
  • A few more `int.*elsize` definitions, many of which are reachable by void dtypes.
    (Yes, a few of these are already bugs, since user DTypes were already allowed to be larger in practice.)

@tylerjereddy
Contributor Author

I don't mind working on those. I wish we had an option for, e.g., the preferred LLM to just complain about these directly in code review, and for, e.g., you to curate those complaints down to the actual items I need to address, if we're going to lean on agent checks. Obviously, I noted a few times above that I could git grep and find problems that still existed, although it is far more useful for me to identify actual reproducers/test cases for problems, rather than assuming that every git grep blemish needs to be addressed.

@tylerjereddy
Contributor Author

Some downstream simplifications enabled by this branch are being explored by the scientists at mcdc-project/mcdc#419.

* Fixed an incorrectly typed `elsize` in `array_fromfile_binary()`.

* Added a matching regression test, though note that
`test_recarray_fromfile_massive()` already passes without the type
change above.

[skip azp] [skip cirrus] [skip actions]
@tylerjereddy
Contributor Author

I pushed a commit to test/fix the tofile/fromfile control flow, which had one mistyped `elsize`. I don't really like this one, though--the regression test didn't fail before the source patch. It looks like my other fixes allow the typecasting to work out in the end, so it feels more like a "should fix" than something that was actually needed.

The regression test is also extremely slow, though it will basically never run with `@requires_memory(free_bytes=2e9)`.

I'll keep chipping away at these, though I don't like the ones that don't have an explicit reproducer that fails even if we "know" we need to change a type.

```python
with tmpdir.as_cwd():
    rec_arr.tofile("f.data")
    actual = np.fromfile("f.data", dtype=kind_dtype)
    assert actual.itemsize == 2 ** 28 * 8
```
Member


Ohh, fun. OTOH, the error here would be that we are not reading everything (i.e. the last bit of the result not being -1). But I think there is a fun little thing happening here:

  • This is a signed int overflow: 2**31 -> -2**31.
  • Cast to `size_t` for reading, it'll actually just go back to 2**31, as that is unsigned.

So, no problem until we reach 2**32, at which point the error would be reading nothing at all.

Possibly we could employ a funny trick here: just try to read 1 element of a 2**32+1 sized dtype from a short file (not empty, with the +1).
With `fromfile(..., count=1)` that should try to read one element, fail, and return an empty array. But with the bug I think it'll actually return an array of length 1.
(That said, clearly that is more of a regression test designed for the specific error in this code...)
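A scaled-down sketch of that trick, using a dtype small enough to run on any NumPy (the real test would use a 2**32+1 sized dtype so the wraparound makes the element size 0):

```python
import os
import tempfile
import warnings

import numpy as np

# One element of this dtype needs 1024 bytes; the file holds only 100,
# so count=1 cannot be satisfied and the correct result is an empty
# array. With the wraparound bug and a 2**32-sized dtype, the element
# size would become 0 and a length-1 array could come back instead.
dtype = np.dtype([("a", "u1", (1024,))])

fd, path = tempfile.mkstemp()
try:
    os.write(fd, b"\x00" * 100)
    os.close(fd)
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")   # NumPy warns on partial reads
        arr = np.fromfile(path, dtype=dtype, count=1)
finally:
    os.unlink(path)

print(arr.size)  # 0: not enough bytes for even a single element
```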


Successfully merging this pull request may close these issues.

  • ENH: extremely large single field size dtype support
  • ENH,BUG: DType creation doesn't allow to create larger than integer sized dtypes