Preserve original Dask partitions by default in Dataset.to_parquet
#254
Conversation
@rjzamora thanks for this PR. I have 2 questions:
Currently, when we run multi-GPU training we expect that each parquet file has a number of partitions that is divisible by the number of GPUs. If we do multi-GPU training on 2 GPUs, then we expect the parquet files to have 2, 4, 6, 8, … partitions. If they don't, we do …
Well, we might recommend …
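(For illustration only, a minimal sketch of the kind of repartitioning step being described, using plain `dask.dataframe`; the path and GPU count are placeholders and not taken from this thread:)

```python
import dask.dataframe as dd

NUM_GPUS = 2  # hypothetical number of training workers

ddf = dd.read_parquet("train_data/*.parquet")

# If the partition count does not divide evenly across the GPUs,
# repartition to a multiple of the GPU count (here, the GPU count itself).
if ddf.npartitions % NUM_GPUS != 0:
    ddf = ddf.repartition(npartitions=NUM_GPUS)

ddf.to_parquet("train_data_repartitioned/")
```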
This question is tricky to answer, because the behavior depends on the options used in both the original … With that said, you are correct that the number of files written by the … There are already multiple ways to specify the number of files you want written out in … No matter the number of files written out by …

Altogether, there is currently no argument on the read side that has the explicit purpose of specifying the desired number of partitions, and the changes in this PR do not affect this limitation in any way. With that said, if the user wants a perfect file-to-partition mapping at read time, we could certainly tweak the …
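(As an illustration of the write-side options being discussed, a hedged sketch follows. It assumes the `merlin.io.Dataset` API (older releases expose the same class as `nvtabular.Dataset`); only `out_files_per_proc` is taken from this thread, and the paths are placeholders:)

```python
from merlin.io import Dataset  # or: from nvtabular import Dataset

ds = Dataset("input_data/*.parquet", engine="parquet")

# Default behavior after this PR: one output file per Dask partition,
# so partition boundaries are preserved on disk.
ds.to_parquet("output_default/")

# Explicitly requesting a fixed number of files per worker process;
# this path trades deterministic partition sizes for shuffling.
ds.to_parquet("output_per_proc/", out_files_per_proc=2)
```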
This PR makes it so that merlin/NVT will not export one large parquet file unless the user specifically sets …
Modifies the default behavior of `Dataset.to_parquet` to be more consistent with `dask.dataframe.DataFrame.to_parquet`, in the sense that each output file will correspond to a distinct Dask partition.

This may cause breakage for some workflows that expect a different number of output files (one related to the number of worker processes). With that in mind, I feel pretty strongly that we need this change.
@rnyak - I'm hoping this will begin to clear up the confusion around what users need to do to control/preserve partition sizes. In general, I'd advise against anyone using the `out_files_per_proc` argument unless their primary goal is effective shuffling. The historical reason that we have not preserved Dask partitions is that we have prioritized shuffling over deterministic partition sizes. However, when `out_files_per_proc` is not specified, I think we can assume that partitioning can/should be much simpler. Does this make sense to you?
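(A hedged sketch of the behavior this change targets, i.e. one output file per Dask partition when `out_files_per_proc` is not used. The `Dataset` class and `to_ddf()` method are assumed from the merlin/NVT API discussed above; paths and the glob-based file count are illustrative only:)

```python
import glob
from merlin.io import Dataset

ds = Dataset("raw_data/*.parquet", engine="parquet")
n_parts = ds.to_ddf().npartitions  # number of Dask partitions in the dataset

# With the new default, each Dask partition maps to a distinct output file.
ds.to_parquet("exported/")

n_files = len(glob.glob("exported/*.parquet"))
print(n_parts, n_files)  # expected to match under the new default
```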