ENH: allow larger than C int sized structured dtypes #31332
tylerjereddy wants to merge 20 commits into
Conversation
seberg left a comment
Just some quick comments. Would be good to check whether there might be other `(int)` casts or definitions (I could even see asking some LLM to search the code-base).
I guess the pickling path is already fixed, so there likely isn't much, but I am not sure (some string paths or maybe the hashing).
```python
assert_equal(x, y)
assert_equal(x[0], y[0])


@pytest.mark.slow()
```
`test_pickling_large` (and the rest of the full testsuite) now pass locally on this branch, thanks to a small type fix in `arraydescr_reduce`.
A few notes here:
- On an ARM M3 Max Mac laptop, that test takes a long time to pass, more than 6 minutes: `1 passed in 402.71s (0:06:42)`. I think that means the `slow()` marker doesn't cut it, and I guess NumPy doesn't have `xslow`? What do you want me to do here then? Do we really need to iterate over 2 "> C int" cases x 5 pickle protocol versions? If so, how should we selectively run this only once in a while? Even a single protocol version takes over a minute to run.
- Should we also stack a `requires_memory` decoration on here? I believe these arrays are pretty massive, like in the other test I added that materializes such arrays.
Maybe just skip the "dtype is functional" test, or make the array empty: `np.zeros(0, dtype=dtype)`.
I am not 100% sure if that covers everything, but it seems fair to me to do that.
There isn't really much to check anyway; the only thing I can think of is whether `dtype.flags == pickled.flags`, because `dtype == pickled` should check almost everything else (metadata is already checked).
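A minimal sketch of what that reduced check could look like (the exact dtype and names are illustrative, not from the PR, and constructing the dtype assumes this branch):

```python
import pickle

import numpy as np

# Round-trip the dtype itself and compare metadata, instead of
# materializing a huge array.
dtype = np.dtype(("i4", (2**31,)))   # > C int sized sub-array
pickled = pickle.loads(pickle.dumps(dtype))
assert pickled == dtype              # checks almost everything, incl. metadata
assert pickled.flags == dtype.flags  # the one thing `==` does not cover
y = np.zeros(0, dtype=pickled)       # "empty array" variant: no big allocation
```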
I left the initial materialization in, `y = np.zeros(3, dtype=pickled)`, but gave `check_pickling` a default-on argument that optionally runs the actual assertion on the materialized arrays, which I've turned off for the `test_pickling_large` calls.
I believe this is probably robust, but there may be some risk if some operating system did actually fully allocate the RAM on a `zeros` call; I believe that's rare? CI may tell us--I just flushed CI this time.
With that change, the test would no longer fail without the source patches here, though, so I added an extra check, `assert_equal(pickled.itemsize, dtype.itemsize)`, which does fail without the work in this PR. I was slightly surprised that `assert_equal(pickled.descr, dtype.descr)` wasn't sufficient to enforce proper reconstitution of `itemsize`, since that reads as "they have the same descriptors," but apparently it doesn't...
This also allowed me to remove the `slow` marker.
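For illustration, a hedged reconstruction of what the adjusted helper might look like (the real `check_pickling` lives in NumPy's test suite and differs in detail; `arr_assert` is the only name taken from the discussion above):

```python
import pickle

import numpy as np

def check_pickling(dtype, arr_assert=True):
    for proto in range(2, pickle.HIGHEST_PROTOCOL + 1):
        pickled = pickle.loads(pickle.dumps(dtype, proto))
        assert pickled == dtype
        # catches a truncated itemsize, which `==` alone did not
        assert pickled.itemsize == dtype.itemsize
        if arr_assert:  # turned off for huge dtypes to avoid materializing them
            x = np.zeros(3, dtype=dtype)
            y = np.zeros(3, dtype=pickled)
            np.testing.assert_equal(x, y)
```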
Alright, CI is failing on the first try; `IS_64BIT` may not be a sufficient guard for some Windows cases?
`test_gh_31308` needs a 64-bit guard. Some of the CI failures are not related--bit messy.
There is one MKL job generally failing right now, just ignore that one.
The latest commit suppresses CI again for now. There is one remaining failure that I'm going to need to debug directly on a Windows box, below the fold. The new `itemsize` check that I added to `check_pickling` seems to show `dtype = np.dtype(f"({2**31},)i")` getting set to `dtype(('<i4', (-2147483648,)))`, independent of serialization. The `intp` size on that Windows CI runner should be just fine, since the test is guarded by `IS_64BIT`, which comes from `np.dtype(np.intp).itemsize`. `git grep -E -i "int elsize"` still has several hits on this branch, but this will be way easier to debug directly on a Windows machine, so I'll have to circle back.
Details
```
>       assert_equal(pickled.itemsize, dtype.itemsize)
E       AssertionError:
E       Items are not equal:
E        ACTUAL: 0
E        DESIRED: 8589934592

arr_assert = False
buf        = b'cnumpy\ndtype\np0\n(VV0\np1\nI00\nI01\ntp2\nRp3\n(I3\nV|\np4\n(g0\n(Vi4\np5\nI00\nI01\ntp6\nRp7\n(I3\nV<\np8\nNNNI-1\nI-1\nI0\ntp9\nb(I-2147483648\ntp10\ntp11\nNNI0\nI4\nI0\ntp12\nb.'
dtype      = dtype(('<i4', (-2147483648,)))
pickled    = dtype(('<i4', (-2147483648,)))
```
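The numbers in that traceback are consistent with 32-bit truncation; a small illustration (assuming wraparound through a 32-bit C integer, as with `long` on Windows):

```python
import ctypes

# The shape 2**31 wraps to -2147483648 in a 32-bit signed integer, and the
# itemsize 2**31 * 4 = 8589934592 wraps to 0, matching "ACTUAL: 0" above.
print(ctypes.c_int32(2**31).value)      # -2147483648, the shape in the repr
print(ctypes.c_int32(2**31 * 4).value)  # 0
```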
You can probably get away without trying a Windows box. If this diverges on Windows, you can be sure there is some use of `long` along the way: either a C `long` or a `PyLong_FromLong` (or similar) call somewhere along the branch.
And just grepping sees things like:
```c
PyObject *size_obj = PyLong_FromLong((long) totalsize);
```
There are a lot more in that file... Got to just replace them all with `PyLong_FromSsize_t` (or whatever it is). And most likely that will be enough.
(There might be corresponding `PyLong_AsLong` calls somewhere, but from a grep those seem more likely to be fixed than the reverse.)
Ok, the CI is passing apart from two clearly unrelated failures.
I did end up having to use a Windows box, because debugging via fork CI was too painful. I also added an additional assertion to one of my new tests (`test_gh_31308()`), since it should have picked up the `descr` problem that was causing the pickle test to fail (it now does).
Alright, most of the comments have been addressed now, I think. Beyond some of the still-open discussions above, I wonder if I'm missing some important routes to structured dtype construction--my testing here mostly focused on the "tuple" specification route, but perhaps there are others that need testing/shims. I guess that's related to my self-review requirement for more thorough tests. I'll have to circle back another day/time for that though.
I guess creation wise, it is probably if ...
I parametrized ... I thought I saw a dictionary code path too... but anyway, one thing at a time.
Ah, yeah...
That test case is now passing as well, after the latest commit. Let me check the dictionary case next.
Yes, that approach also overflows on ...
* Related to numpy gh-30315 and numpy gh-31308, but very much a work in progress. * Although the original failing regression test does pass here, it is not even close to safe to do this yet, even though the full testsuite passes locally with `test_pickling_large` re-enabled. [skip azp] [skip cirrus] [skip actions]
* `test_gh_31308` has been improved to verify that `itemsize` is actually correctly populated on the newly-supported `dtype` construction. * Minor source changes have been made to allow the above regression test to pass.
`test_gh_31308_materialized()` is now passing, so it has been adjusted to be allowed to run if sufficient memory is available. The test was also improved to add a basic assertion about the recarray size that results. [skip azp] [skip cirrus] [skip actions]
* `test_pickling_large` now passes thanks to a small type specification change in `arraydescr_reduce`. Note that the test now takes ~6 minutes to run locally on ARM Mac, so the newly-added `slow` marker probably isn't sufficient. [skip azp] [skip cirrus] [skip actions]
* Simplified the error handling in `_convert_from_tuple()` function based on reviewer feedback. [skip azp] [skip cirrus] [skip actions]
* Simplify the error checking in `_convert_from_tuple()` and change a variable type in that function, based on reviewer feedback. [skip azp] [skip cirrus] [skip actions]
* `test_shape_invalid()` now has two "later overflow" test cases restored, based on reviewer feedback. [skip azp] [skip cirrus] [skip actions]
* `test_gh_31308_materialized()` has been adjusted to also have a 64-bit machine guard, since that is required for this test. [skip azp] [skip cirrus] [skip actions]
* `_convert_from_tuple()` now uses a more appropriate C function, `npy_mul_sizes_with_overflow()`, to check for overflow, based on reviewer feedback. [skip azp] [skip cirrus] [skip actions]
* Remove an extraneous typecast in `_convert_from_tuple()` function. [skip azp] [skip cirrus] [skip actions]
* Parametrize `test_gh_31308()` to include an `"i, i"` style structured dtype construction. That test case currently fails so will need to be repaired to avoid overflow. [skip azp] [skip cirrus] [skip actions]
* Adjust a variable typing in `_convert_from_list()` function to allow this structured dtype specification to be properly processed: `"2147483648i,2147483648i"`. [skip azp] [skip cirrus] [skip actions]
* `test_gh_31308()` has been augmented to include a new test case for "larger than C int" structured dtype specification via a dictionary. * `_convert_from_dict()` has had a variable type specification improved to support the above test case. [skip azp] [skip cirrus] [skip actions]
* More dictionary-based structured dtype specification cases have been added to `test_gh_31308()`. A small typing patch for the `offset` variable in `_convert_from_dict()` has been added that allows the new test cases to pass. [skip azp] [skip cirrus] [skip actions]
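Taken together, the commits above exercise three specification routes; a rough sketch of all three (assumes a 64-bit platform and a build that includes this PR):

```python
import numpy as np

tuple_spec = np.dtype(("i4", (2**31,)))                    # tuple route
string_spec = np.dtype("2147483648i,2147483648i")          # comma-string/list route
dict_spec = np.dtype({"names": ["a"],
                      "formats": [f"({2**31},)i4"]})       # dict route
assert tuple_spec.itemsize == 2**31 * 4                    # > C INT_MAX bytes
```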
Force-pushed from d415e07 to 3892f1d.
An offset smaller than a C int was "ok," even if the field element size was already larger than a C int, but an offset larger than a C int would error out. Test cases and a small patch to fix/allow that have been pushed up. I also had to resolve some merge conflicts; I guess there is some activity in the descriptor source. I'll continue to skip the CI for now.
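For illustration, a dict-based spec matching that description (field names are made up; constructing this assumes this branch):

```python
import numpy as np

# Second field's offset exceeds C INT_MAX even though that field itself is tiny.
dt = np.dtype({"names": ["big", "tail"],
               "formats": [f"({2**31},)u1", "i4"],
               "offsets": [0, 2**31]})
assert dt.fields["tail"][1] == 2**31  # fields maps to (dtype, offset); offset kept
```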
* `test_gh_31308()` has been improved with several new/more complex structured dtype test cases (they all pass, as expected). [skip azp] [skip cirrus] [skip actions]
* The `check_pickling` testing utility function has been adjusted to allow skipping the comparison of materialized arrays, because this can take several minutes for newly-supported large structured dtypes. To compensate for this, `check_pickling` has been augmented to additionally verify reconstitution of `itemsize` for serialized dtypes, which is a check that fails without the source patches in the above PR. * As a result, it is no longer necessary to mark `test_pickling_large` with `slow()`.
* `check_pickling` required too much memory for large structured dtypes, so the materializations of the arrays have been moved under the new `arr_assert` guard. * `test_gh_31308` was missing an `IS_64BIT` guard. [skip azp] [skip cirrus] [skip actions]
* Several more size/elsize related typing fixes in the descriptor source to support the above PR, and to fix a Windows test failure observed there (confirmed locally on Windows box). * Augment `test_gh_31308()` with an additional assertion that is sensitive to the need for some of these source changes.
* Add a release note for the above PR.
Thanks, this looks all good now. I dunno, TBH, an agent might be simplest to find and maybe also just fix these... although a grep for ...
I don't mind working on those. I wish we had an option for, e.g., the preferred LLM to just complain about it directly in code review, and for, e.g., you to curate those complaints down to the actual items I need to address, if we're going to lean on agent checks. Obviously I noted a few times above that I could ...
Some downstream simplifications enabled by this branch are being explored by the scientists at mcdc-project/mcdc#419.
* Fixed an incorrectly typed `elsize` in `array_fromfile_binary()`. * Added a matching regression test, though note that `test_recarray_fromfile_massive()` already passes without the type change above. [skip azp] [skip cirrus] [skip actions]
I pushed in a commit to test/fix the ... The regression test is also extremely slow, though it will basically never run with ... I'll keep chipping away at these, though I don't like the ones that don't have an explicit reproducer that fails even if we "know" we need to change a type.
```python
with tmpdir.as_cwd():
    rec_arr.tofile("f.data")
    actual = np.fromfile("f.data", dtype=kind_dtype)
    assert actual.itemsize == 2 ** 28 * 8
```
Ohh, fun. OTOH, the error here would be that we are not reading everything (i.e. the last bit of the result not being -1). But I think there is a fun little thing happening here:
- This is a signed int overflow: `2**31 -> -2**31`.
- Cast to `size_t` for reading, it'll actually just go back to `2**31`, as that is unsigned.

So, no problem until we reach `2**32`, at which point the error would be reading nothing at all.
Possibly we could employ a funny trick here: just try to read 1 element of a `2**32 + 1` sized dtype from a short file (not empty, with the +1).
With `fromfile(..., count=1)`, that should try to read one element, fail, and return an empty array. But with the bug, I think it'll actually return an array of length 1.
(That said, clearly that is more of a regression test designed for the specific error in this code...)
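A sketch of that trick (file handling is illustrative, the expected behavior is taken from the comment above, and constructing the huge dtype assumes this branch):

```python
import os
import tempfile

import numpy as np

# The dtype is 2**32 + 1 bytes per element; with the size_t wraparound bug that
# becomes 1 byte, so reading from a short-but-nonempty file would wrongly
# "succeed" and return one element.
big = np.dtype(f"({2**32 + 1},)i1")
with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "short.data")
    np.array([-1], dtype="i4").tofile(path)  # 4 bytes: nonempty, far too short
    out = np.fromfile(path, dtype=big, count=1)
    assert out.size == 0  # expected: empty; per the comment, buggy code gives 1
```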
Fixes ENH,BUG: DType creation doesn't allow to create larger than integer sized dtypes #30315
Fixes ENH: extremely large single field size dtype support #31308
Although the original failing regression test does pass here, it is not even close to safe to do this yet, even though the full testsuite passes locally with `test_pickling_large` re-enabled.
I spent a few hours debugging on this one--I'll add a few initial inline self-review comments to start...
[skip azp] [skip cirrus] [skip actions]
AI Disclosure
No AI tools used