Hypothesis strategy for chunking arrays#9374
Conversation
|
Can one of the admins verify this patch? Admins can comment |
|
ok to test |
|
Q: Should this utility expose |
|
For some reason the new strategy is not showing up on the API docs page |
|
which I would think is hypothesis' fault? |
|
Thanks for opening this @TomNicholas! I have a few thoughts:
|
|
Thanks for your comments @jsignell !
I noticed there are recent changes to the code for that decorator in hypothesis so it might be that requiring a more recent version would fix this typing error? Looks like this PR was intended to make typing work with
This depends on whether you think the
This is really an intuition question for dask devs about which they think is more important for finding bugs.
I kind of don't really want to do that because they imply different strategies implementations under the hood, so allowing both is twice as much work 😅
I'm putting these here so that I can import them in xarray (/xGCM/xhistogram...)! So I am happy to maintain them, but I also want the |
Ah ok. Thanks for explaining. I took a look and realized that
|
Co-authored-by: Tom Augspurger <tom.augspurger88@gmail.com>
|
Coming back to this after a year 😅 I've become interested in this problem again, and would like to merge this, but I'm just fighting with the CI to get everything to pass. |
|
I've been adding hypothesis to the test envs but I guess I could just add an |
Zac-HD
left a comment
There was a problem hiding this comment.
👋 exciting to see this active again!
My best guess is that the mypy error is because you're pinned to mypy == 1.1 - the current version is 1.7, and has a lot of bugfixes.
| chunks = data.draw(strategies.chunks(shape, axes=axes)) | ||
|
|
||
| # assert that chunks add up to array lengths along all axes | ||
| lengths = tuple(sum(list(chunks_along_axis)) for chunks_along_axis in chunks) |
There was a problem hiding this comment.
| lengths = tuple(sum(list(chunks_along_axis)) for chunks_along_axis in chunks) | |
| lengths = tuple(sum(chunks_along_axis) for chunks_along_axis in chunks) |
| if not isinstance(shape, tuple): | ||
| raise ValueError("shape argument must be a tuple of ints") | ||
|
|
||
| if min_chunk_length < 1 or not isinstance(min_chunk_length, int): | ||
| raise ValueError("min_chunk_length must be an integer >= 1") | ||
|
|
||
| if max_chunk_length is not None: | ||
| if max_chunk_length < 1 or not isinstance(min_chunk_length, int): | ||
| raise ValueError("max_chunk_length must be an integer >= 1") |
There was a problem hiding this comment.
I'd use hypothesis.errors.InvalidArgument, as the convention for "you passed an invalid argument to a strategy function" (it inherits from TypeError).
| if min_chunk_length > max_chunk_length_remaining: | ||
| # if we are at the end of the array we have no choice but to use a smaller chunk | ||
| chunk = remaining_length | ||
| else: | ||
| chunk = draw( | ||
| st.integers( | ||
| min_value=min_chunk_length, max_value=max_chunk_length_remaining | ||
| ) | ||
| ) |
There was a problem hiding this comment.
Shrinking will tend to work better if the underlying elements are drawn from the same domain. In this case I'd implement the strategy by:
- calculating the range of valid chunk sizes and range of number-of-chunks (excluding the final chunk)
- drawing from
st.lists(st.integers(...), ...) - discard trailing elements which would take you over
ax_lengthand manually insert a final chunk of sizeax_length - sum(chunk_sizes)
You could even implement this as a non-@st.composite function which did some validation and precomputation, and then return st.lists(...).map(_turn_into_valid_chunks) to better amortize the setup cost.

Example of using this PR to create examples of arrays with a specific shape but arbitrary chunk structure:
pre-commit run --all-files