Hypothesis strategy for chunking arrays by TomNicholas · Pull Request #9374 · dask/dask

TomNicholas · 2022-08-11T04:14:43Z

Example of using this PR to create examples of arrays with a specific shape but arbitrary chunk structure:

In [9]: arr
Out[9]: 
array([[1, 2, 3],
       [4, 5, 6]])

In [10]: from dask.array.strategies import chunks

In [11]: dask.array.from_array(arr, chunks=chunks(arr.shape).example())
Out[11]: dask.array<array, shape=(2, 3), dtype=int64, chunksize=(1, 3), chunktype=numpy.ndarray>

In [12]: dask.array.from_array(arr, chunks=chunks(arr.shape).example())
Out[12]: dask.array<array, shape=(2, 3), dtype=int64, chunksize=(2, 2), chunktype=numpy.ndarray>

Closes Hypothesis strategy for chunking arrays in different patterns #9373
Tests added / passed
Passes pre-commit run --all-files

GPUtester · 2022-08-11T04:14:45Z

Can one of the admins verify this patch?

Admins can comment ok to test to allow this one PR to run or add to allowlist to allow all future PRs from the same author to run.

pavithraes · 2022-08-11T11:28:26Z

ok to test

TomNicholas · 2022-08-11T19:17:39Z

Q: Should this utility expose max_chunk_lengths or be changed to expose max_num_chunks instead?

TomNicholas · 2022-08-11T19:59:14Z

For some reason the new strategy is not showing up on the API docs page

TomNicholas · 2022-08-11T20:03:21Z

mypy is failing with

dask/array/strategies.py:11: error: Untyped decorator makes function "block_lengths" untyped  [misc]
dask/array/strategies.py:45: error: Untyped decorator makes function "chunks" untyped  [misc]
Found 2 errors in 1 file (checked 244 source files)

which I would think is hypothesis' fault?

jsignell · 2022-08-17T14:57:26Z

Thanks for opening this @TomNicholas! I have a few thoughts:

The mypy failure definitely feels like a hypothesis issue. It seems like it would not be too hard to pass the type through the decorator, though (I'm looking at https://stackoverflow.com/questions/65621789/mypy-untyped-decorator-makes-function-my-method-untyped), so they would likely welcome that change upstream.
I am not convinced that this belongs in the dask/array part of the codebase. This is something that people would only use in tests, right? So it should probably live in dask/array/tests/strategies.py.
I don't have an opinion on whether to expose max_num_chunks or max_chunk_length. You could allow them both and error if they are both specified. Regardless of that choice, these methods should be private, at least to start.
Is this something that you are interested in maintaining? If so then great - I feel fine merging it into Dask if we can just ping you if something breaks :)

TomNicholas · 2022-08-17T17:32:54Z

Thanks for your comments @jsignell !

The mypy failure definitely feels like a hypothesis issue.

I noticed there are recent changes to the code for that decorator in hypothesis so it might be that requiring a more recent version would fix this typing error? Looks like this PR was intended to make typing work with @st.composite.

I am not convinced that this belongs in the dask/array part of the codebase. This is something that people would only use in tests, right? So it should probably live in dask/array/tests/strategies.py.

This depends on whether you think the tests namespace should be entirely a private library internal or not (I do). In xarray at least (and in numpy) we use tests to refer to internal checks that are run by pytest, and testing to refer to any test-related public API that someone else might want to import. So for example xarray.tests imports assert_equal from xarray.testing. (And yes, we have an xarray/tests/test_testing.py file 😁) However, if dask has a different policy then I'm happy to defer; I just think this separation is neat.

I don't have an opinion on whether to expose max_num_chunks or max_chunk_length

This is really an intuition question for dask devs about which they think is more important for finding bugs. max_num_chunks would create uniform-length chunks by default (which we might want the strategy to arbitrarily deviate a bit from). max_chunk_length would create completely non-uniform chunks, but that might be overkill for finding bugs and/or mean an inefficient strategy.
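To make the trade-off concrete, here is a minimal sketch of the two behaviours along a single axis. The function names (`uniform_chunks`, `bounded_random_chunks`) are hypothetical and this is not the code in the PR; it assumes `1 <= num_chunks <= length` and `max_chunk_length >= 1`, and uses `random` rather than Hypothesis purely to keep the illustration short.

```python
import random


def uniform_chunks(length: int, num_chunks: int) -> tuple:
    """Split `length` into `num_chunks` near-equal chunks (larger ones first)."""
    base, extra = divmod(length, num_chunks)
    return tuple(base + 1 if i < extra else base for i in range(num_chunks))


def bounded_random_chunks(length: int, max_chunk_length: int, rng: random.Random) -> tuple:
    """Split `length` into independently sized chunks, each in [1, max_chunk_length]."""
    chunks = []
    remaining = length
    while remaining > 0:
        # Never draw more than what is left on the axis.
        size = rng.randint(1, min(max_chunk_length, remaining))
        chunks.append(size)
        remaining -= size
    return tuple(chunks)


print(uniform_chunks(10, 3))  # (4, 3, 3)
print(bounded_random_chunks(10, 4, random.Random(0)))  # irregular sizes summing to 10
```

The first form only ever produces regular layouts unless extra jitter is added; the second covers irregular layouts but has a much larger space of possible outputs to explore.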

You could allow them both and error if they are both specified.

I kind of don't really want to do that because they imply different strategy implementations under the hood, so allowing both is twice as much work 😅

Regardless of that choice these methods should be private at least to start.

Is this something that you are interested in maintaining? If so then great - I feel fine merging it into Dask if we can just ping you if something breaks :)

I'm putting these here so that I can import them in xarray (/xGCM/xhistogram...)! So I am happy to maintain them, but I also want the chunks strategy to be public so that I'm not relying on private API.

jsignell · 2022-08-22T13:55:30Z

I am not convinced that this belongs in the dask/array part of the codebase. This is something that people would only use in tests, right? So it should probably live in dask/array/tests/strategies.py.

This depends on whether you think the tests namespace should be entirely a private library internal or not (I do). In xarray at least (and in numpy) we use tests to refer to internal checks that are run by pytest, and testing to refer to any test-related public API that someone else might want to import. So for example xarray.tests imports assert_equal from xarray.testing. (And yes, we have an xarray/tests/test_testing.py file 😁) However, if dask has a different policy then I'm happy to defer; I just think this separation is neat.

Ah ok. Thanks for explaining. I took a look and realized that assert_eq is defined in dask/array/utils.py so I guess there is precedent. Given that, I am ok with the proposed placement and with it being public.

This is really an intuition question for dask devs about which they think is more important for finding bugs. max_num_chunks would create uniform-length chunks by default (which we might want the strategy to arbitrarily deviate a bit from). max_chunk_length would create completely non-uniform chunks, but that might be overkill for finding bugs and/or mean an inefficient strategy.

max_chunk_length feels slightly more dasky to me.

TomNicholas · 2023-11-02T21:23:35Z

Coming back to this after a year 😅

I've become interested in this problem again, and would like to merge this, but I'm just fighting with the CI to get everything to pass.

TomNicholas · 2023-11-02T21:31:56Z

I've been adding hypothesis to the test envs but I guess I could just add an importorskip instead - that might be preferred rather than adding to the GPU test env?

Zac-HD

👋 exciting to see this active again!

My best guess is that the mypy error is because you're pinned to mypy == 1.1 - the current version is 1.7, and has a lot of bugfixes.

Zac-HD · 2023-11-13T00:57:38Z

+        chunks = data.draw(strategies.chunks(shape, axes=axes))
+
+        # assert that chunks add up to array lengths along all axes
+        lengths = tuple(sum(list(chunks_along_axis)) for chunks_along_axis in chunks)


Suggested change

lengths = tuple(sum(list(chunks_along_axis)) for chunks_along_axis in chunks)

lengths = tuple(sum(chunks_along_axis) for chunks_along_axis in chunks)
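
The simplification works because `sum` accepts any iterable, so the intermediate `list(...)` is unnecessary. A small check, using a hand-written chunking of a `(2, 3)` array as assumed example data:

```python
shape = (2, 3)
chunks = ((1, 1), (2, 1))  # one valid chunking: two chunks along each axis

# Chunk lengths along each axis must add up to the array length on that axis.
lengths = tuple(sum(chunks_along_axis) for chunks_along_axis in chunks)
assert lengths == shape
```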

Zac-HD · 2023-11-13T00:59:11Z

+    if not isinstance(shape, tuple):
+        raise ValueError("shape argument must be a tuple of ints")
+
+    if min_chunk_length < 1 or not isinstance(min_chunk_length, int):
+        raise ValueError("min_chunk_length must be an integer >= 1")
+
+    if max_chunk_length is not None:
+        if max_chunk_length < 1 or not isinstance(min_chunk_length, int):
+            raise ValueError("max_chunk_length must be an integer >= 1")


I'd use hypothesis.errors.InvalidArgument, as the convention for "you passed an invalid argument to a strategy function" (it inherits from TypeError).
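
A sketch of what that validation could look like with `InvalidArgument`. The helper name `validate_chunk_args` is hypothetical, not part of the PR. Note that the quoted diff tests `min_chunk_length` inside the `max_chunk_length` branch, which looks like a copy-paste slip; the sketch checks `max_chunk_length` there, and it tests the type before comparing so that a non-integer never reaches the `<` comparison.

```python
from hypothesis.errors import InvalidArgument


def validate_chunk_args(shape, min_chunk_length=1, max_chunk_length=None):
    """Hypothetical helper mirroring the argument checks in the diff above."""
    if not isinstance(shape, tuple):
        raise InvalidArgument("shape argument must be a tuple of ints")

    # Type check first, then the range check.
    if not isinstance(min_chunk_length, int) or min_chunk_length < 1:
        raise InvalidArgument("min_chunk_length must be an integer >= 1")

    if max_chunk_length is not None:
        if not isinstance(max_chunk_length, int) or max_chunk_length < 1:
            raise InvalidArgument("max_chunk_length must be an integer >= 1")
```

Because `InvalidArgument` inherits from `TypeError`, tests for invalid arguments would need to expect `InvalidArgument` or `TypeError` instead of `ValueError`.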

Zac-HD · 2023-11-13T01:08:43Z

+        if min_chunk_length > max_chunk_length_remaining:
+            # if we are at the end of the array we have no choice but to use a smaller chunk
+            chunk = remaining_length
+        else:
+            chunk = draw(
+                st.integers(
+                    min_value=min_chunk_length, max_value=max_chunk_length_remaining
+                )
+            )


Shrinking will tend to work better if the underlying elements are drawn from the same domain. In this case I'd implement the strategy by:

calculating the range of valid chunk sizes and range of number-of-chunks (excluding the final chunk)

drawing from st.lists(st.integers(...), ...)

discard trailing elements which would take you over ax_length and manually insert a final chunk of size ax_length - sum(chunk_sizes)

You could even implement this as a non-@st.composite function which did some validation and precomputation, and then return st.lists(...).map(_turn_into_valid_chunks) to better amortize the setup cost.

TomNicholas added 2 commits August 11, 2022 00:05

implemented chunks strategy

65f13da

remove cheeky print statements

58b4664

github-actions Bot added the array label Aug 11, 2022

TomNicholas added 2 commits August 11, 2022 00:39

black formatting and typing

302af9c

add to array docs

f8d4d98

github-actions Bot added the documentation Improve or add to documentation label Aug 11, 2022

test that chunks add up to lengths

ca45143

TomNicholas mentioned this pull request Aug 11, 2022

Hypothesis testing of different chunking patterns cubed-dev/cubed#86

Open

TomNicholas added 2 commits August 11, 2022 15:00

tests for invalid args to strategy

266290d

fix ommission in docstring

84a004c

TomNicholas added 3 commits August 11, 2022 15:22

neater error if importing strategy without hypothesis installed

5932486

add hypothesis to test dependencies

7ea02b0

add hypothesis to other CI envs

cdbafdf

This was referenced Aug 11, 2022

Strategy for chunking arrays HypothesisWorks/hypothesis#3433

Closed

Public hypothesis strategies for generating xarray data pydata/xarray#6911

Open

TomAugspurger reviewed Aug 22, 2022

Comment thread dask/array/strategies.py Outdated

TomNicholas mentioned this pull request Jul 18, 2023

WIP: Dynamic rechunking option for StoreToZarr pangeo-forge/pangeo-forge-recipes#546

Closed

3 tasks

TomNicholas and others added 5 commits November 2, 2023 15:45

Merge branch 'main' into hypothesis-chunking-strategy

3371f4f

add hypothesis back into optional deps

86f1f12

added hypothesis to CI environments

98fa6ab

correct import path in error message

da3401a

Co-authored-by: Tom Augspurger <tom.augspurger88@gmail.com>

don't run doctests on examples

2f61eff

TomNicholas added 4 commits November 2, 2023 16:27

Merge branch 'hypothesis-chunking-strategy' of https://github.com/Tom…

c7402fd

…Nicholas/dask into hypothesis-chunking-strategy

don't use example in tests

e01bd86

linting

6065937

add hypothesis to docs requirements so that API link works

a1c7b25

TomNicholas added 2 commits November 10, 2023 14:45

add hypothesis to final env

b781b7e

remove unused typing

7520e6d

TomNicholas mentioned this pull request Nov 10, 2023

Untyped decorator makes function untyped for strategies.composite decorator HypothesisWorks/hypothesis#3786

Closed

TomNicholas added 2 commits November 11, 2023 13:27

fix incorrect typing of return value of decorated function

10f8ea5

Merge branch 'main' into hypothesis-chunking-strategy

9d4e7c5

Zac-HD reviewed Nov 13, 2023

github-actions Bot added the needs attention It's been a while since this was pushed on. Needs attention from the owner or a maintainer. label Jun 10, 2024

TomNicholas mentioned this pull request Aug 22, 2024

Cubed xarray tests xarray-contrib/xarray-array-testing#4

Open
