ENH, PERF: allocate memory as part of the array object for scalars #31092

Open
mhvk wants to merge 2 commits into numpy:main from mhvk:array-allocate-scalar-or-strides-via-object-flags

Conversation

@mhvk
Contributor

@mhvk mhvk commented Mar 29, 2026

@seberg, @eendebakpt: this is an updated (and greatly improved) version of #29878, making use of the fact that in previous PRs I removed the explicit dependence on how data were allocated.

Details

Allocate memory for the dimensions and strides, as well as small amounts of data, as part of the object, to avoid the overhead of multiple allocations (with the allocation of tracked memory for data especially large).

Note that the above is done only for standard arrays, though I think in principle it could be extended to subtypes with a suitable check.

Also, allocating space for data is done only for the standard allocator.
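The single-allocation layout described above can be sketched in plain C. This is a toy struct with invented names, not NumPy's actual `PyArrayObject_fields` layout: the fixed-size header and a variable-size tail holding dims, strides, and small data all come from one malloc, so a single free releases everything.

```c
/* Toy sketch of the single-allocation idea (invented names, not
 * NumPy's actual PyArrayObject_fields layout). */
#include <stdint.h>
#include <stdlib.h>

typedef struct {
    int ndim;
    intptr_t *dims;     /* points into the tail of this same block */
    intptr_t *strides;  /* likewise */
    char *data;         /* small data also lives in the tail */
} toy_array;

static toy_array *
toy_array_new(int ndim, size_t databytes)
{
    /* one allocation: header + 2*ndim intptr_t + data */
    size_t tail = 2 * (size_t)ndim * sizeof(intptr_t) + databytes;
    toy_array *a = malloc(sizeof(toy_array) + tail);
    if (a == NULL) {
        return NULL;
    }
    a->ndim = ndim;
    a->dims = (intptr_t *)(a + 1);      /* first bytes past the header */
    a->strides = a->dims + ndim;
    a->data = (char *)(a->strides + ndim);
    return a;
}
```

A subclass with a larger `tp_basicsize` would break the fixed-offset assumption here, which is one reason the PR restricts this to `subtype == &PyArray_Type`.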

Timings
```
%timeit np.empty(())
133->68.2 ns

a = 1.6
%timeit np.array(a)
151->83 ns

%timeit np.add(a, a)
581->423 ns

b = np.array(a)
%timeit np.add(b, b)
293->225 ns
```

No AI was used.

@mhvk mhvk requested review from eendebakpt and seberg March 29, 2026 12:54
@mhvk mhvk force-pushed the array-allocate-scalar-or-strides-via-object-flags branch 2 times, most recently from 87e035f to 0f8ba39 on March 29, 2026 13:02
npy_free_cache_dim(PyArray_DIMS(arr), PyArray_NDIM(arr));
/* deallocate if not part of the array instance */
if ((PyArray_DIMS(arr) != NULL)
&& (PyArray_DIMS(arr) !=
Contributor Author

Possibly, there should be another flag for whether the strides and dimensions are stored on the array, but it seemed overkill.

Member

NPY_ARRAY_DIMSONINSTANCE seems clearer to me than the pointer math that happens here but I agree it's not a big deal
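For illustration, the pointer comparison being discussed, checking whether the dims pointer coincides with the address just past the fixed-size header rather than consulting a flag, might look like this toy C sketch (the struct and helper names are hypothetical, standing in for `PyArrayObject_fields` and `PyArray_DIMS`):

```c
/* Hypothetical sketch of the flagless check: dims are "on the
 * instance" exactly when the pointer equals the address just past
 * the fixed-size header. */
#include <stdint.h>
#include <stdlib.h>

typedef struct {
    int ndim;
    intptr_t *dims;
} toy_header;

static int
dims_on_instance(const toy_header *h)
{
    /* cf. comparing PyArray_DIMS(arr) against the end of the
     * PyArrayObject_fields struct */
    return h->dims == (const intptr_t *)(h + 1);
}
```

A dedicated flag would say the same thing more explicitly; the pointer test avoids spending a flag bit at the cost of this bit of address arithmetic.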

@mhvk mhvk force-pushed the array-allocate-scalar-or-strides-via-object-flags branch 3 times, most recently from c5f5dfd to 4ee2623 on March 29, 2026 14:18
Contributor

@eendebakpt eendebakpt left a comment

Some comments and small things, but overall I am +1 on this

There are a couple of cases (pickling, reshapes) where we could make more use of the inline data, dims and strides, but since these are deprecated or not performance critical I suggest we do not add them here.

* This flag may be tested for in PyArray_FLAGS(arr).
* (MAYBE: allow it to be requested as well?)
*/
#define NPY_ARRAY_DATAONINSTANCE 0x0008
Contributor

This is not strictly needed (where the data is allocated is an implementation detail, in any case for the default memory handler). But I am fine with having this in the interface. If we keep this, can we add a test for it?

Contributor Author

Indeed, this should be an implementation detail, and in principle one could also tell that the data is on the instance from the fact that mem_handler=NULL -- unfortunately, elsewhere in the code that is taken to mean that someone messed around with the instance and just set data (it is free'd in the deallocator).

That said, I agree this does not need to be public; I just need to have some way to know the data are on the instance internally.

I actually wondered whether one could adjust the default memory handler to somehow keep track of this, but am not sure how... (it doesn't have access to the instance, after all).

Comment thread numpy/_core/src/multiarray/ctors.c Outdated
Comment thread numpy/_core/src/multiarray/ctors.c
Comment thread numpy/_core/src/multiarray/ctors.c Outdated
Allocate memory for the dimensions and strides as part of the object,
to avoid the overhead of an extra allocation.  This will only become
useless memory if the array shape is set (now deprecated) or the array
is resized in-place (frowned upon).

Note that the above is done only for standard arrays, though I think
in principle it could be extended to subtypes with a suitable check.

Also, allocating space for data is done only for the standard
allocator.

Timings
```
%timeit np.empty(())
133->127 ns

a = 1.6
%timeit np.array(a)
151->137 ns

%timeit np.add(a, a)
581->557 ns

b = np.array(a)
%timeit np.add(b, b)
293->279 ns
```
@mhvk mhvk force-pushed the array-allocate-scalar-or-strides-via-object-flags branch from 4ee2623 to 50923bb on March 30, 2026 13:40

fail:
NPY_traverse_info_xfree(&fill_zero_info);
Py_XDECREF(fa->mem_handler);
Contributor Author

This was a mistake in the failure path - mem_handler gets decref'd in the deallocator too.

Allocate memory for small amounts of data as part of the object, to
avoid the overhead of the allocation of tracked memory.

Like for the dimensions and strides, this is done only for standard
arrays.  Furthermore, allocating space for data is done only for the
standard allocator.

Timings relative to having dimensions, strides and data not on the
object:
```
%timeit np.empty(())
133->68.2 ns

a = 1.6
%timeit np.array(a)
151->83 ns

%timeit np.add(a, a)
581->423 ns

b = np.array(a)
%timeit np.add(b, b)
293->225 ns
```
@mhvk mhvk force-pushed the array-allocate-scalar-or-strides-via-object-flags branch from 50923bb to 13df5f5 on March 30, 2026 14:18
@mhvk
Contributor Author

mhvk commented Mar 30, 2026

OK, now split into 2 commits, one for storing the dimensions and strides on the object (which has only a modest benefit for performance, see commit message), and the second for also storing small amounts of data.

I'm still not quite happy with having to keep track of whether or not the data is on the instance via a flag; suggestions for alternatives appreciated! Ideally it would somehow be possible to let the default allocator do some of the work, by registering the address of the data-on-instance as memory that does not need freeing (so renew would just allocate new memory).
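The idea of letting the allocator treat the on-instance address as memory that must not be freed could look roughly like this standalone C sketch. This is not the actual `PyDataMem_Handler` API; all names here are invented for illustration: an allocator that remembers one "on-instance" address, makes free() a no-op for it, and satisfies realloc() with fresh memory instead.

```c
/* Rough sketch of a "registered address" allocator (invented names,
 * not the PyDataMem_Handler interface). */
#include <stdlib.h>
#include <string.h>

typedef struct {
    void *no_free;   /* address that must never reach free() */
} tracked_allocator;

static void
tracked_free(tracked_allocator *h, void *p)
{
    if (p != h->no_free) {
        free(p);
    }
}

static void *
tracked_realloc(tracked_allocator *h, void *p, size_t oldsize, size_t newsize)
{
    if (p != h->no_free) {
        return realloc(p, newsize);
    }
    /* grow off-instance: copy out, leave the on-instance bytes alone */
    void *q = malloc(newsize);
    if (q != NULL) {
        memcpy(q, p, oldsize < newsize ? oldsize : newsize);
    }
    return q;
}
```

The catch the comment hints at is state: a real memory handler is shared and stateless with respect to individual arrays, so it has no obvious place to store the per-instance address.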

Member

@ngoldbaum ngoldbaum left a comment

I left some minor comments below. I asked Claude Code to do a review pass and most of these are generated from the report it made. All the wording is from me and I tried to validate everything I commented about.

return PyErr_NoMemory();
}
/* PyType_GenericAlloc zeroes the extra bytes, but we don't need to. */
fa = (PyArrayObject_fields *)PyObject_Init((PyObject *)alloc, subtype);
Member

this can fail due to memory exhaustion, might as well add a NULL check just in case

Contributor Author

This cannot fail - the allocation has already been made above. Python also does not check:
https://github.com/python/cpython/blob/9e5b8383724211d14165a32c0e7682e56e13843a/Objects/object.c#L531-L540

Py_DECREF(descr);
return PyErr_NoMemory();
}
/* PyType_GenericAlloc zeroes the extra bytes, but we don't need to. */
Member

Suggested change
/* PyType_GenericAlloc zeroes the extra bytes, but we don't need to. */
/* PyType_GenericAlloc zeroes the extra bytes, but we don't need to
because all fields are explicitly initialized below */

for future readers, in case new fields ever get added?

Contributor Author

Agreed. I've done this in the other PR, though, since I think that ended up the better approach.

* TODO: make this a plain else; see comment in array_dealloc.
*/
if (newnbytes <= oldnbytes) {
new_data = PyArray_DATA(self);
Member

@ngoldbaum ngoldbaum Mar 30, 2026

If we get here and PyArray_NDIM(self) != new_nd, then we might get into a state where the dimensions are stored outside the array but the data are still stored on-instance. Maybe there should be a PyArray_NDIM(self) == new_nd check here?

/*
* Create the array instance.
*
* For standard arrays, we allocate extra memory for the dimensions and
Member

I would say "for ndarrays (but not subtypes)" instead of 'standard arrays'.

Py_DECREF(descr);
return NULL;
if (subtype == &PyArray_Type) {
size_t size = subtype->tp_basicsize;
Member

On 64-bit architectures, this is 96 bytes, which is enough space for twelve 64-bit data elements. So this optimization applies for small arrays as well as scalars. Unless I'm missing something that restricts it to just scalars.

The PR description and title talk about scalars, so maybe that behavior is unintentional.

Contributor Author

Sorry, originally I only did scalars, but now it is for small arrays as well, indeed.
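As a back-of-envelope check of the reviewer's arithmetic (assuming `tp_basicsize` is 96 bytes on a 64-bit build, as stated above; the helper name is invented):

```c
/* 96 bytes of reserved tail / 8 bytes per element = 12 elements,
 * so small arrays, not just scalars, fit inline. */
#include <stddef.h>

static size_t
inline_capacity(size_t availbytes, size_t itemsize)
{
    return itemsize != 0 ? availbytes / itemsize : 0;
}
```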

@mhvk
Contributor Author

mhvk commented Mar 31, 2026

See gh-31108 for an alternative implementation that uses a "hacked" allocator to deal with possible resize, etc., but avoids having a flag. It has the advantage that it needs far less change elsewhere in the code base, thus giving hope that if people use mem_handler outside of numpy, things will just continue to work.
