
ENH: allow larger than C int sized structured dtypes #31332

Open
tylerjereddy wants to merge 20 commits into numpy:main from tylerjereddy:treddy_issue_31308

Conversation

@tylerjereddy
Contributor

@tylerjereddy tylerjereddy commented Apr 26, 2026

I spent a few hours debugging on this one--I'll add a few initial inline self-review comments to start...

[skip azp] [skip cirrus] [skip actions]

AI Disclosure

No AI tools used

Member

@seberg seberg left a comment


Just some quick comments. Would be good to check whether there might be other (int) casts or definitions (I could even see asking some LLM to search the code-base).
I guess the pickling path is already fixed, so there likely isn't much, but I am not sure (some string paths, or maybe the hashing).

```python
assert_equal(x, y)
assert_equal(x[0], y[0])

@pytest.mark.slow()
```
Contributor Author


`test_pickling_large` (and the rest of the full testsuite) now passes locally on this branch, thanks to a small type fix in `arraydescr_reduce`.

A few notes here:

  • on an ARM M3 Max Mac laptop, that test takes a long time to pass--more than 6 minutes: 1 passed in 402.71s (0:06:42). I think that means the slow() marker doesn't cut it, and I guess NumPy doesn't have xslow? What do you want me to do here, then? Do we really need to iterate over 2 "> C int" cases x 5 pickle protocol versions? If so, how should we selectively run this only once in a while? Even a single protocol version takes over a minute to run.
  • should we also stack in a `requires_memory` decoration here? I believe these arrays are pretty massive, like in the other test I added that materializes such arrays.

Member

@seberg seberg Apr 28, 2026


Maybe just skip the "dtype is functional" test, or make the array empty: `np.zeros(0, dtype=dtype)`.
I am not 100% sure that covers everything, but it seems fair to me to do that.

There isn't really much to check anyway; the only thing I can think of is whether `dtype.flags == pickled.flags`, because `dtype == pickled` should check almost everything else (metadata is already checked).
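The empty-array idea above can be sketched like this. The sizes are deliberately tiny stand-ins so the sketch runs anywhere; the PR's actual tests use field shapes whose itemsize exceeds 2**31, and `check_pickling` is NumPy's internal test helper, not reproduced here.

```python
import pickle
import numpy as np

# Tiny stand-in dtype; the PR's tests use shapes past the C int range.
dtype = np.dtype([("a", "i4", (1000,))])

# A zero-length array exercises dtype handling without allocating any
# element storage, which is the point of np.zeros(0, dtype=dtype).
arr = np.zeros(0, dtype=dtype)
assert arr.nbytes == 0

# Round-trip the dtype itself and compare the cheap-to-check fields.
pickled = pickle.loads(pickle.dumps(dtype))
assert pickled == dtype
assert pickled.itemsize == dtype.itemsize
assert pickled.flags == dtype.flags
```

The `itemsize` comparison is the check that ends up mattering later in this thread, since `==` alone did not catch the reconstruction bug.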

Contributor Author

@tylerjereddy tylerjereddy Apr 30, 2026


I left the initial materialization in, `y = np.zeros(3, dtype=pickled)`, but gave `check_pickling` a default-on argument that optionally runs the actual assertions on the materialized arrays, which I've turned off for the `test_pickling_large` calls.

I believe this is probably robust, but there may be some risk if some operating system did actually fully allocate the RAM on a zeros call--I believe that's rare, though? CI may tell us--I've let CI run this time.

With that change, the test would no longer fail without the source patches here, though, so I added an extra check, `assert_equal(pickled.itemsize, dtype.itemsize)`, which does fail without the work in this PR. I was slightly surprised that `assert_equal(pickled.descr, dtype.descr)` wasn't sufficient to enforce proper reconstitution of `itemsize`, since that reads as "they have the same descriptors," but apparently it isn't...

This also allowed me to remove the slow marker.

Contributor Author


Alright, CI is failing on the first try; `IS_64BIT` may not be a sufficient guard for some Windows cases?

`test_gh_31308` needs a 64-bit guard. Some of the CI failures are unrelated--a bit messy.

Member


There is one MKL job generally failing right now, just ignore that one.

Contributor Author


The latest commit suppresses CI again for now. There is one remaining failure that I'm going to need to debug directly on a Windows box, below the fold. The new `itemsize` check that I added to `check_pickling` seems to show `dtype = np.dtype(f"({2**31},)i")` getting set to `dtype(('<i4', (-2147483648,)))`, independent of serialization. The `intp` size on that Windows CI runner should be just fine, since the test is guarded by `IS_64BIT`, which comes from `np.dtype(np.intp).itemsize`. `git grep -E -i "int elsize"` still has several hits on this branch, but this will be way easier to debug directly on a Windows machine, so I'll have to circle back.

Details

```
>           assert_equal(pickled.itemsize, dtype.itemsize)
E           AssertionError:
E           Items are not equal:
E            ACTUAL: 0
E            DESIRED: 8589934592

arr_assert = False
buf        = b'cnumpy\ndtype\np0\n(VV0\np1\nI00\nI01\ntp2\nRp3\n(I3\nV|\np4\n(g0\n(Vi4\np5\nI00\nI01\ntp6\nRp7\n(I3\nV<\np8\nNNNI-1\nI-1\nI0\ntp9\nb(I-2147483648\ntp10\ntp11\nNNI0\nI4\nI0\ntp12\nb.'
dtype      = dtype(('<i4', (-2147483648,)))
pickled    = dtype(('<i4', (-2147483648,)))
```
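The -2147483648 in that repr is consistent with a signed 32-bit wraparound of the shape value 2**31. A quick illustration, using `ctypes` only to model a C `int` slot (this is not NumPy's actual code path):

```python
import ctypes

# Model a C `int` receiving the shape value 2**31: a signed 32-bit
# quantity wraps to -2**31, matching the dtype repr in the failure above.
shape = 2**31
wrapped = ctypes.c_int(shape).value
print(wrapped)       # -2147483648

# The itemsize the test expected for an i4 field of that shape:
print(shape * 4)     # 8589934592, the DESIRED value in the assertion
```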

Member

@seberg seberg May 1, 2026


You can probably get away without trying a Windows box. If this diverges on Windows, you can be sure there is some use of `long` along the way--either a C `long` or a `PyLong_FromLong` or similar call somewhere along the branch.
And just grepping sees things like:

```c
PyObject *size_obj = PyLong_FromLong((long) totalsize);
```

There are a lot more in that file... Got to just replace them all with `PyLong_FromSsize_t` (or whatever it is). And most likely that will be enough.

(There might be a corresponding `PyLong_AsLong` somewhere, but from a grep those seem more likely to be fixed than the reverse.)
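The Windows angle here comes from the LLP64 data model, where a C `long` stays 32-bit even in 64-bit processes. The relevant widths can be checked from Python via `ctypes`:

```python
import ctypes

# On Windows (LLP64), C `long` is 4 bytes even in 64-bit builds, so
# PyLong_FromLong((long) totalsize) truncates values >= 2**31 there.
# Py_ssize_t (PyLong_FromSsize_t) is pointer-sized on every platform.
print(ctypes.sizeof(ctypes.c_long))     # 4 on Windows, 8 on most 64-bit Unix
print(ctypes.sizeof(ctypes.c_ssize_t))  # matches the pointer width

assert ctypes.sizeof(ctypes.c_ssize_t) == ctypes.sizeof(ctypes.c_void_p)
```

This is why the failure only reproduced on the Windows CI runners even with the `IS_64BIT` guard in place.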

Contributor Author


Ok, the CI is passing apart from two clearly unrelated failures.

I did end up having to use a Windows box, because debugging through my fork's CI was too painful. I also added an additional assertion to one of my new tests (`test_gh_31308()`), since it should have picked up the `descr` problem that was causing the pickle test to fail (it now does).

@tylerjereddy
Contributor Author

Alright, most of the comments have been addressed now I think. Beyond some of the still-open discussions above, I wonder if I'm missing some important routes to structured dtype construction--my testing here mostly focused on the "tuple" specification route, but perhaps there are others that need testing/shims.

I guess that's related to my self-review requirement for more thorough tests.

I'll have to circle back another day/time for that though.

@seberg
Member

seberg commented Apr 28, 2026

for more thorough tests.

I guess creation-wise, it is probably worth checking whether the 1000000i,100000i style also has overflow checks (I am not even sure that takes a different path, though).
Otherwise, I guess the biggest thing would be some actual usage tests, such as a cast/promotion. Maybe one that explodes the size: `result_type("<large_num>?,<large_num>", "<large_num>c,<large_num>?")`, which goes to `<large_num>c,<large_num>c` (larger than either input).
It may be interesting to actually try one of those for a real cast, but of course that needs a lot of memory and is very slow.
(Just throwing it out there; not sure we need quite all of these.)
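A small-scale analogue of that promotion, with scalar fields small enough to run anywhere (whether the huge-subarray variants follow the same path is exactly what the suggested test would probe):

```python
import numpy as np

# Each field pair (i4, f4) promotes to f8, so the promoted dtype is
# larger than either input -- the same "size explosion" described above,
# just without the > C int field counts used in the PR.
a = np.dtype("i4,f4")          # itemsize 8
b = np.dtype("f4,i4")          # itemsize 8
promoted = np.result_type(a, b)
print(promoted.itemsize)       # 16: both fields became f8
```

Structured-dtype promotion in `np.result_type` requires matching field names (here the defaults `f0`/`f1`) and is only available in reasonably recent NumPy.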

@tylerjereddy
Contributor Author

it is probably worth checking whether the 1000000i,100000i style also has overflow checks

I parametrized test_gh_31308() to include "2147483648i,2147483648i" construction, which produces an itemsize of 0, so that test case does indeed fail for now. I'll need to investigate.

I thought I saw a dictionary code path too.. but anyway one thing at a time.

@seberg
Member

seberg commented Apr 29, 2026

I thought I saw a dictionary code path too.. but anyway one thing at a time.

Ah, yeah... `dict(names=["a", ...], formats=["20000f"])`--and that can also include `offsets=[2**31-100]`, another path that could overflow, I guess.
(On the up-side, you are seriously cleaning up these code paths...)
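For reference, a runnable sketch of that dict-based specification path with tiny stand-in sizes (the PR's cases push the formats past 2**31):

```python
import numpy as np

# Dict-based structured dtype specification: a subarray field plus a
# scalar field. Sizes here are small stand-ins for the > C int values
# exercised in the PR.
dtype = np.dtype({
    "names": ["a", "b"],
    "formats": ["(1000,)f4", "i8"],
})
print(dtype.itemsize)        # 4000 + 8 = 4008
print(dtype.fields["b"][1])  # offset of "b": 4000
```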

@tylerjereddy
Contributor Author

"2147483648i,2147483648i"

That test case is now passing as well, after latest commit. Let me check the dictionary case next.

@tylerjereddy
Contributor Author

dict(names=["a", ...], formats=["20000f"])

Yes, that approach also overflows on main with large enough input--I've pushed up a new test case and small patch that allows it to pass.

* Related to numpygh-30315 and numpygh-31308, but very much
a work in progress.

* Although the original failing regression test does pass here, it is not even close to safe to do this yet, even though the full testsuite passes locally with `test_pickling_large` re-enabled.

[skip azp] [skip cirrus] [skip actions]
* `test_gh_31308` has been improved to verify that
`itemsize` is actually correctly populated on the
newly-supported `dtype` construction.

* Minor source changes have been made to allow
the above regression test to pass.
`test_gh_31308_materialized()` is now passing, so it has
been adjusted to be allowed to run if sufficient memory
is available. The test was also improved to add a basic
assertion about the recarray size that results.

[skip azp] [skip cirrus] [skip actions]
* `test_pickling_large` now passes thanks to a small type specification
change in `arraydescr_reduce`. Note that the test now takes ~6 minutes
to run locally on ARM Mac, so the newly-added `slow` marker probably
isn't sufficient.

[skip azp] [skip cirrus] [skip actions]
* Simplified the error handling in `_convert_from_tuple()` function
based on reviewer feedback.

[skip azp] [skip cirrus] [skip actions]
* Simplify the error checking in `_convert_from_tuple()` and
change a variable type in that function, based on reviewer
feedback.

[skip azp] [skip cirrus] [skip actions]
* `test_shape_invalid()` now has two "later overflow" test
cases restored, based on reviewer feedback.

[skip azp] [skip cirrus] [skip actions]
* `test_gh_31308_materialized()` has been adjusted to also
have a 64-bit machine guard, since that is required for this test.

[skip azp] [skip cirrus] [skip actions]
* `_convert_from_tuple()` now uses a more appropriate C function,
`npy_mul_sizes_with_overflow()`, to check for overflow, based
on reviewer feedback.

[skip azp] [skip cirrus] [skip actions]
* Remove an extraneous typecast in `_convert_from_tuple()` function.

[skip azp] [skip cirrus] [skip actions]
* Parametrize `test_gh_31308()` to include an `"i, i"` style
structured dtype construction. That test case currently fails
so will need to be repaired to avoid overflow.

[skip azp] [skip cirrus] [skip actions]
* Adjust a variable's type in the `_convert_from_list()` function
to allow this structured dtype specification to be properly
processed: `"2147483648i,2147483648i"`.

[skip azp] [skip cirrus] [skip actions]
* `test_gh_31308()` has been augmented to include a new
test case for "larger than C int" structured dtype
specification via a dictionary.

* `_convert_from_dict()` has had a variable type specification
improved to support the above test case.

[skip azp] [skip cirrus] [skip actions]
* More dictionary-based structured dtype specification cases
have been added to `test_gh_31308()`. A small typing patch
for the `offset` variable in `_convert_from_dict()` has
been added that allows the new test cases to pass.

[skip azp] [skip cirrus] [skip actions]
@tylerjereddy
Contributor Author

offsets=[2**31-100]

An offset smaller than a C int was "ok," even if the field element size was already larger than a C int, but an offset larger than a C int would error out. Test cases and a small patch to fix/allow that have been pushed up.

I also had to resolve some merge conflicts; I guess there's been some activity in the descriptor source.

I'll continue to skip the CI for now.
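The offsets path described above, scaled down (explicit offsets past a field's end are allowed, and the total itemsize is inferred from the field ends; the PR's cases push the offsets themselves near and beyond 2**31):

```python
import numpy as np

# Explicit offsets: "x" is placed at byte 64, well past the 1-byte
# "pad" field, and itemsize is inferred as the end of the last field.
dtype = np.dtype({
    "names": ["pad", "x"],
    "formats": ["u1", "(512,)f8"],
    "offsets": [0, 64],
})
print(dtype.fields["x"][1])  # 64
print(dtype.itemsize)        # 64 + 512 * 8 = 4160
```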

* `test_gh_31308()` has been improved with several new/more
complex structured dtype test cases (they all pass, as expected).

[skip azp] [skip cirrus] [skip actions]
* The `check_pickling` testing utility function has been
adjusted to allow skipping the comparison of materialized
arrays, because this can take several minutes for newly-supported
large structured dtypes. To compensate for this, `check_pickling`
has been augmented to additionally verify reconstitution of `itemsize`
for serialized dtypes, which is a check that fails without the source
patches in the above PR.

* As a result, it is no longer necessary to mark `test_pickling_large`
with `slow()`.
* `check_pickling` required too much memory for large structured
dtypes, so the materializations of the arrays have been moved
under the new `arr_assert` guard.

* `test_gh_31308` was missing an `IS_64BIT` guard.

[skip azp] [skip cirrus] [skip actions]
* Several more size/elsize related typing fixes in the descriptor
source to support the above PR, and to fix a Windows
test failure observed there (confirmed locally on Windows
box).

* Augment `test_gh_31308()` with an additional assertion
that is sensitive to the need for some of these source changes.
@tylerjereddy tylerjereddy changed the title WIP, ENH: allow larger than C int sized structured dtypes ENH: allow larger than C int sized structured dtypes May 1, 2026
* Add a release note for the above PR.
@seberg
Member

seberg commented May 4, 2026

Thanks, this all looks good now. I dunno, TBH, an agent might be the simplest way to find and maybe also just fix these... although a grep for `elsize` and `ITEMSIZE` may be sufficient (also field iteration/names).
While for some paths it's totally fine (because we know we have simple numeric types), there are still a bunch of paths that are wrong, e.g. (but there are more):

  • Use of `Oi` tuple unpacking in `descriptor.c` for the offset.
  • Use of integers for field iteration/names in `VOID_setitem`/`getitem`/`copyswap`...
  • A few more `int.*elsize` definitions, many of which are reachable by void dtypes.
    (Yes, a few of these are already bugs, since user DTypes were already allowed to be larger in practice.)

@tylerjereddy
Contributor Author

I don't mind working on those. I wish we had an option for, e.g., the preferred LLM to just complain about these directly in code review, and for, e.g., you to curate those complaints down to the actual items I need to address, if we're going to lean on agent checks. Obviously, I noted a few times above that I could git grep and find problems that still existed, although it is far more useful for me to identify actual reproducers/test cases for problems, rather than assuming that every git grep blemish needs to be addressed.

@tylerjereddy
Contributor Author

Some downstream simplifications enabled by this branch are being explored by the scientists at mcdc-project/mcdc#419.

* Fixed an incorrectly typed `elsize` in `array_fromfile_binary()`.

* Added a matching regression test, though note that
`test_recarray_fromfile_massive()` already passes without the type
change above.

[skip azp] [skip cirrus] [skip actions]
@tylerjereddy
Contributor Author

I pushed a commit to test/fix the tofile/fromfile control flow, which had one mistyped `elsize`. I don't really like this one, though--the regression test didn't fail before the source patch. It looks like my other fixes allow the typecasting to work out in the end, so it feels more like a "should fix" than something that was actually needed.

The regression test is also extremely slow, though it will basically never run with `@requires_memory(free_bytes=2e9)`.

I'll keep chipping away at these, though I don't like the ones that don't have an explicit reproducer that fails even if we "know" we need to change a type.

```python
with tmpdir.as_cwd():
    rec_arr.tofile("f.data")
    actual = np.fromfile("f.data", dtype=kind_dtype)
    assert actual.itemsize == 2 ** 28 * 8
```
Member


Ohh, fun. OTOH, the error here would be that we are not reading everything (i.e. the last bit of the result not being -1). But I think there is a fun little thing happening here:

  • This is a signed int overflow: 2**31 -> -2**31.
  • Cast to `size_t` for reading, it'll actually just go back to 2**31, as that is unsigned.

So, no problem until we reach 2**32, at which point the error would be reading nothing at all.

Possibly we could employ a funny trick here: just try to read 1 element of a 2**32+1 sized dtype from a short file (not empty, with the +1).
With `fromfile(..., count=1)` that should try to read one element, fail, and return an empty array. But with the bug I think it'll actually return an array of length 1.
(That said, clearly that is more of a regression test designed for the specific error in this code...)
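A scaled-down sketch of that trick, using a dtype small enough to run on any NumPy (the real test would use a 2**32+1 sized dtype so the wraparound makes the element size 0):

```python
import os
import tempfile
import warnings

import numpy as np

# One element of this dtype needs 1024 bytes; the file holds only 100,
# so count=1 cannot be satisfied and the correct result is an empty
# array. With the wraparound bug and a 2**32-sized dtype, the element
# size would become 0 and a length-1 array could come back instead.
dtype = np.dtype([("a", "u1", (1024,))])

fd, path = tempfile.mkstemp()
try:
    os.write(fd, b"\x00" * 100)
    os.close(fd)
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")   # NumPy warns on partial reads
        arr = np.fromfile(path, dtype=dtype, count=1)
finally:
    os.unlink(path)

print(arr.size)  # 0: not enough bytes for even a single element
```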


Successfully merging this pull request may close these issues.

  • ENH: extremely large single field size dtype support
  • ENH,BUG: DType creation doesn't allow to create larger than integer sized dtypes