BUG: Use dict to store ufunc loops (fixes thread-safety)#31184
BUG: Use dict to store ufunc loops (fixes thread-safety)#31184ngoldbaum merged 5 commits intonumpy:mainfrom
Conversation
This could even be a global mutex, since adding new loops in a deallocator is not reasonable. Of course, there are other complexities around promotion and the loop cache that are indirectly related. And it may thus be that we'll want to use this (or a similar) mutex more widely. I.e. this does _not_ enable simple thread-safe dynamic addition of user loops (the user would have to lock themselves and track if they added the loop dynamically already at the time being). Closes numpygh-31112
|
While a mutex/critical section should be OK here (although I forgot to replace the The downside is that the best solution for the actual promotion is probably just to call |
| int nin = ufunc->nin, nargs = ufunc->nargs; | ||
| Py_ssize_t size = PySequence_Length(ufunc->_loops); | ||
| int ret = -1; | ||
| /* PyDict_Values returns a snapshot, safe against concurrent additions. */ |
There was a problem hiding this comment.
Yeah, I used cursor/claude (although had to tell it to modify almost everything to my liking... just because you set out to make something thread-safe, doesn't mean it quite remembered that part, hehe), had to tell it off to reduce it's comment load, but this seems pretty OK ;).
One note: I decided to keep the tuple -> (tuple, meth/promoter) for now. Just like that one // borrow-ref OK comment this is currently not designed to allow deletion and it would just add churn here...
(I am not even sure if we want to ever delete things, rather than just put it somewhere else... even if we allow replacing, it shouldn't happen often and immortalizing may be pragmatic.)
EDIT: As a note, I suspect all is good. But I'll go once more myself tomorrow morning to see if there aren't ref-counting issues or so. (done)
There was a problem hiding this comment.
So, the value of the dict effectively contains a tuple (key, real-value)? Or can the tuple of dtypes possibly be different? If not, then my sense would be to remove the key from the value, to help the logic: a dict keyed by the dtypes that returns a promotor or method seems much clearer conceptually.
That said, I think it makes sense to not make that change here if it is more intrusive, to keep this PR focussed on thread safety.
There was a problem hiding this comment.
Right, the thing is that info currently ends up in the cache (where the key isn't identical anymore) as borrowed.
So it looks awkward, but changing it seems really awkward too.
The practical refactor might be different: Right now there is an awkward distinction between "BoundArrayMethod" and the "ArrayMethod", if we remove it the need for the tuple should just go away (of course the DTypes will still be duplicated, but it shouldn't feel as awkward ;)).
(The backstory is to accept that DTypes are immortal -- with a plausible exception for some future dynamically created subclasses of a superclass DType which would have no loops registered for the subclass. At the time that felt, and was pointed out, like we should keep it plausible to make DTypes mortal in practice and that requires the split. But it was never practical -- Python doesn't even have the necessary technology to do it nicely/correctly!)
There was a problem hiding this comment.
OK, that makes sense -- and getting rid of bound vs unbound (a difference I don't fully understand...) sounds like a plan!
There was a problem hiding this comment.
Too much information, so feel free to ignore :).
and getting rid of bound vs unbound (a difference I don't fully understand...)
Like a bound method: A bound method keeps self alive, a BoundArrayMethod keeps the DTypes (i.e. the multiple "self") alive.
With that distinction it was plausible to at least clean up many/most cases (still annoying as the key needs to be something like a weakref).
Without... Maybe it is still plausible if maybe more awkward; but all cases would require technology Python doesn't have (i.e. ephemerons).
There was a problem hiding this comment.
In a variation, too much ignorance, so feel free to ignore :)
But why do array methods need to keep dtypes alive at all? Isn't it only the other way around, an otherwise unused array method should just disappear if all dtypes that use it have disappeared? Or, trying not to be completely ignorant, is it because the array method has internal links to the dtypes it works for? (I really should re-read the relevant NEPs and documentation and try to make improve/summarize myself, so people like me who have an incomplete idea can understand how it all works...)
There was a problem hiding this comment.
Since I started to try to explain it, I'll continue. But to be clear: The only thing that matters is if there is a real reason to think that cleanup matters...
I am not 100% sure, but to me it seems rather implausible (it still works for dynamically created DTypes if you only register their loops via a superclass DTypes -- although at least casts don't work for that pattern right now).
I doubt the NEPs help this is just unwieldy to think about but a pretty small detail in a sense...
We need the DTypes when executing any ArrayMethod at some point (we also pass it in to the resolve_descriptors, etc. but we need it for more things before also).
So at that time you need them guaranteed alive in some form, and those DTypes aren't the ones passed in by the user:
- Because promotion might have happened.
- If output DTypes differ from inputs, then those need to be kept somewhere and at least in principle such an
ArrayMethod(or ufunc loop) should keep the output DType alive as long as it is callable.
Now, I think at the time I mostly said that 2 is the thing that just can't work (except maybe for casts).
The thing you can possibly make work, is if the ArrayMethods can become "invalid". I don't think I really thought about it at the time (as it breaks 2).
Having a BoundArrayMethod solve the need for having to "invalidate" an ArrayMethod (whether we can be sure to clean it up or not), while allowing us to have an object that represents the full thing.
Now, the only truly correct path is if the ufunc loop (effectively) keeps it's DTypes alive but is cleaned up when it becomes inaccessible. If the ArrayMethod doesn't own the DTypes this can work for many cases (e.g. if all input and output DTypes are identical and maybe casts -- if all are the same, a weakref can be used).
But for all other cases you need an Ephemeron to spell something like: As long as the input DType is alive, this ArrayMethod/key is alive and keeps all the (output) DTypes alive.
(If Python had an ephemeron aware GC, the cyclic GC could then figure out if a DType becomes truly inaccessible.)
But as you can (hopefully) see, you do need a split if you want cleanup in the generic case. Right now it is a (barely used/existing) BoundArrayMethod vs. ArrayMethod. If the ArrayMethod knew it's DTypes it but didn't keep it alive, then it would mean that it needs to have an "invalid" state.
And technically, if you expose it to Python you still kinda need the BoundArrayMethod as at least in that case it should guarantee staying valid!
(I suppose I could also store the BoundArrayMethod as the info for now, but it is also a step in the "we'll never implement cleanup anyway" direction ;).)
Anyway, this is too much and I dunno if it can be understood :). In the end, the question is really if there actually is some reason to care about plausible generic cleanup... My gut feeling is "no" as it is just really impractical, but if there is a sensible reason, then keeping the current split may be better to not move in a direction that would make it even harder.
There was a problem hiding this comment.
Thanks, yes, the output dtype is a tricky one. Conceptually, I think it doesn't strictly need to be on the array method:
- For dtype casts, the casts themselves could be a dict of methods keyed by output dtype, so that the cast machinery rather than the array method becomes responsible for keeping the references.
- For ufuncs, I guess the method would need to be the value in a dict keyed by both input and output dtype, and anything that is incomplete has to be handled by a promoter.
But, really, there is a lot to be said for a arraymethod specification to be self-describing, which does mean it is logical to have references to the dtypes.
And I think you are right that throw-away dtypes that also have their own loops are an unlikely prospect, so in practice there really should not be a problem.
mhvk
left a comment
There was a problem hiding this comment.
Overall, this looks nice! Indeed, the dict seems a more logical solution to the problem.
But it does seem to me that, perhaps in follow-up, we should move to have a dict that just has the promoter or array method as its value.
| int nin = ufunc->nin, nargs = ufunc->nargs; | ||
| Py_ssize_t size = PySequence_Length(ufunc->_loops); | ||
| int ret = -1; | ||
| /* PyDict_Values returns a snapshot, safe against concurrent additions. */ |
There was a problem hiding this comment.
So, the value of the dict effectively contains a tuple (key, real-value)? Or can the tuple of dtypes possibly be different? If not, then my sense would be to remove the key from the value, to help the logic: a dict keyed by the dtypes that returns a promotor or method seems much clearer conceptually.
That said, I think it makes sense to not make that change here if it is more intrusive, to keep this PR focussed on thread safety.
_loops (fixes thread-safety)
_loops (fixes thread-safety)
ngoldbaum
left a comment
There was a problem hiding this comment.
Spotted a couple minor issues (with the help of Claude). Also ping @kumaraditya303 to give this a once-over.
Also can you update the PR description so this gets merged with a more relevant commit message?
| } | ||
| if (existing_info != NULL) { | ||
| PyObject *existing_meth = PyTuple_GET_ITEM(existing_info, 1); | ||
| Py_DECREF(existing_info); |
There was a problem hiding this comment.
I think it's possible this decref could cause the last reference to existing_meth to go away. Since you have a borrowed reference, that means existing_meth is possibly invalid after here. There might be larger architectural reasons why that's not possible, so a comment explaining might be nice. You could also move the existing_inf decref to line 272 or below, to ensure existing_meth stays alive until its last use.
There was a problem hiding this comment.
The dict holds a strong ref to existing_info so this is safe, but it would still be nice to move the decref.
| } | ||
| if (existing_item != NULL) { | ||
| PyObject *registered = PyTuple_GET_ITEM(existing_item, 1); | ||
| Py_DECREF(existing_item); |
There was a problem hiding this comment.
same pattern here, either leave a comment explaining why this is safe or move the decref below the last use of registered.
There was a problem hiding this comment.
Yeah, nicer to reorganize, claude is right, that claude didn't do this perfectly ;) (even if fine as the old code also relied on borrowing being OK)
There was a problem hiding this comment.
I think there's still an issue - the DECREF has to be after the last use of registered on line 5259 below.
|
I tried to trigger the crash on my Mac for about 5 minutes running the command in #31112 (comment) and didn't see any failures. Hard to prove a negative but I think we can conclude this avoids the race that causes the crash. |
| /* New private fields related to dispatching */ | ||
| void *_dispatch_cache; | ||
| /* A PyListObject of `(tuple of DTypes, ArrayMethod/Promoter)` */ | ||
| /* Ordered dict `tuple of DTypes -> (tuple of DTypes, ArrayMethod/Promoter)` */ |
There was a problem hiding this comment.
I find it little confusing because there is a separate collections.OrderdDict, dicts are insertion ordered by default so maybe just remove the "ordered"?
There was a problem hiding this comment.
True, but I want to remind whoever is about to change it that the order may matter.
ngoldbaum
left a comment
There was a problem hiding this comment.
One last comment, I think there's still a refcounting issue in one of the code paths I commented on. Also the PR description still needs a rewrite before merging. Approving though since these are just touchup-level changes.
| } | ||
| if (existing_item != NULL) { | ||
| PyObject *registered = PyTuple_GET_ITEM(existing_item, 1); | ||
| Py_DECREF(existing_item); |
There was a problem hiding this comment.
I think there's still an issue - the DECREF has to be after the last use of registered on line 5259 below.
kumaraditya303
left a comment
There was a problem hiding this comment.
LGTM, thanks for fixing this
|
Thanks Sebastian! |
Change
_loopsto use a dict instead to be able to usesetdefaultfor adding new loops.A dict is also ordered, although we need to use
.items()to iterate it, otherwise things get nicer over-all (although the key is repeated for now, to simplify passing aroundinfo).Closes gh-31112