A common issue is that a user does not know in advance how many chunks a collection needs to have. The typical workaround is to generate as many chunks as the collection could possibly have and then leave some (most) of them either empty or oversplit.
The same problem applies to:
- da.Array objects with an unknown number of dimensions
- df.DataFrame objects with unknown columns
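As an illustration of how this surfaces in plain dask.array today (nothing proposed yet): boolean indexing keeps one output chunk per input chunk, each with unknown (nan) size:

```python
import dask.array as da

x = da.random.random(100_000, chunks=10_000)
y = x[x > 0.5]     # number of surviving elements is unknown at graph time
print(y.chunks)    # ((nan, nan, nan, nan, nan, nan, nan, nan, nan, nan),)
print(y.shape)     # (nan,)
```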
An elegant solution, which would require #5671 as a prerequisite, is the following:
Proposed core design
- Enhance Client.publish_dataset to accept nameless collection(s). When it receives one, the function returns a single future and the collection is not listed by Client.list_datasets.
- Calling result() on the future returns the persisted collection or tuple of collections.
- When the future goes out of scope, the published collection is automatically forgotten.
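Sample usage

A sketch of how this might look. The nameless form of publish_dataset and the returned single future are the proposal, not current behaviour:

```python
import dask.array as da
from distributed import Client

client = Client()

def build_collection():
    x = da.random.random(100_000, chunks=10_000)
    x = x[x > 0.5]                   # chunk sizes unknown at graph definition time
    # Proposed: a nameless collection returns a single Future and is not
    # listed by client.list_datasets()
    return client.publish_dataset(x.persist())

fut = build_collection()
x = fut.result()    # the persisted collection (or tuple of collections)
del fut             # proposed: descoping the future forgets the dataset
```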
Proposed extension: compatibility with pure dask
- In dask/dask, add dask.publish(*collections) and the shorthand collection.publish() (see the sketch after this list).
- If scheduler == "distributed", call Client.publish_dataset under the hood and return a delayed wrapping its output distributed.Future.
- If scheduler in ("threads", "synchronous", "processes"), call persist() and then return a single dummy delayed object pointing to the persisted collection(s).
Proposed extension: rechunk() nan chunks
Array.rechunk can't be applied to an array where one or more chunks have size nan, since it would create an unknown number of output chunks. To fix this, it could gain an optional bool parameter delay, as sketched below.
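For example (the delay parameter is the proposal, not an existing keyword; the proposed lines are commented out so the snippet runs as-is):

```python
import dask.array as da

x = da.random.random(100_000, chunks=10_000)
x = x[x > 0.5]                # chunk sizes are now nan

# x.rechunk(20_000)           # today: raises, output chunk count is unknown

# Proposed: return a delayed collection that resolves its chunk structure
# only once the input chunk sizes are known at runtime.
# y = x.rechunk(20_000, delay=True)
```

Similar treatment could be done for all functions in dask.array and dask.dataframe that currently don't work with nan chunks.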
Coming back to this, I think the high-level wrapper in the example, rechunk, is quite a lazy design.
A much better approach would be to write a DelayedArray class, which replicates the API of Array and retains all possible metadata.
The same applies to dask.dataframe.repartition(partition_size=...): it should return a DelayedDataframe that knows (and pretty-prints) the column headers, but doesn't know the number of partitions. A rough sketch of the idea follows.
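All names below are hypothetical; the point is that only the metadata that is actually known gets retained, while the chunk structure stays deferred:

```python
import numpy as np

class DelayedArray:
    """Placeholder for a da.Array whose chunk structure is not known yet.

    Replicates the metadata-related parts of the Array API; this is a
    sketch of the idea, not dask code.
    """

    def __init__(self, future, dtype, ndim):
        self._future = future              # resolves to a concrete da.Array
        self.dtype = np.dtype(dtype)
        self.ndim = ndim
        self.shape = (np.nan,) * ndim      # sizes unknown until runtime

    def compute(self):
        # Block until the real collection exists, then compute it
        return self._future.result().compute()
```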