Description
This is probably a long shot, but I thought I would ask about it here anyway.
Saving a DataFrame to Redshift can be seen as a 2-phase process:
- Writing files to S3
- COPYing data from S3 to a Redshift table
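
For context, a typical blocking write through spark-redshift looks something like the sketch below (the connection string, table, and bucket names are illustrative, not from my actual setup):

```scala
import org.apache.spark.sql.{DataFrame, SaveMode}

// Both phases run inside save(): phase 1 writes files to tempdir on S3,
// phase 2 issues a COPY from S3 into the Redshift table. save() only
// returns once the COPY has finished.
def saveToRedshift(df: DataFrame): Unit = {
  df.write
    .format("com.databricks.spark.redshift")
    .option("url", "jdbc:redshift://host:5439/db?user=u&password=p") // illustrative
    .option("dbtable", "my_table")                                   // illustrative
    .option("tempdir", "s3n://my-bucket/tmp")                        // illustrative
    .mode(SaveMode.Append)
    .save()
}
```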
Of course the driver program is blocked on DataFrameWriter.save() for the duration of both phases.
During phase 2 no Spark jobs are running because the work happens outside the cluster, i.e. between S3 and Redshift.
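
Conceptually, phase 2 boils down to a single blocking JDBC statement executed from the driver, which is why the cluster sits idle; a rough sketch (the connection details and COPY options are illustrative, not the library's exact internals):

```scala
import java.sql.DriverManager

// One blocking statement: the driver waits while Redshift ingests the
// files that phase 1 left on S3. No Spark jobs run during this time.
val conn = DriverManager.getConnection(
  "jdbc:redshift://host:5439/db", "user", "password")
try {
  conn.createStatement().execute(
    """COPY my_table
      |FROM 's3://my-bucket/tmp/manifest'
      |CREDENTIALS 'aws_iam_role=arn:aws:iam::123456789012:role/my-role'
      |FORMAT AS AVRO 'auto' MANIFEST""".stripMargin)
} finally {
  conn.close()
}
```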
In my setup, phase 2 takes much longer than phase 1 and accounts for a substantial amount of the total save time.
Perhaps this is uncommon, but I don't actually need to wait for the data to materialize in Redshift before proceeding with the next jobs in my driver program.
I am wondering whether it would be a compelling feature to allow phase 2 to happen asynchronously. In such a scenario I would imagine that, for example, a Future representing phase 2 could be handed back to client code; however, I don't think the Data Sources API allows for something like that.
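
In the meantime, a client-side stopgap is to push the whole save onto a Future so the driver thread is free for other work. This does not split phase 2 off from phase 1, but it is a minimal sketch of the behaviour I am after, given a DataFrame df and the hypothetical saveToRedshift helper from the first sketch:

```scala
import java.util.concurrent.Executors
import scala.concurrent.{ExecutionContext, Future}

// A dedicated single-thread pool so the blocking COPY does not tie up
// the global execution context.
implicit val saveEc: ExecutionContext =
  ExecutionContext.fromExecutor(Executors.newSingleThreadExecutor())

// Both phases run in the background; the driver can schedule the next
// Spark jobs immediately. Note this wraps the entire save, not phase 2 alone.
val pendingSave: Future[Unit] = Future {
  saveToRedshift(df)
}

pendingSave.failed.foreach(e => println(s"Redshift save failed: $e"))
```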
If such a feature would be of interest, and if we can find a sane way to do it, I would be happy to implement it.