Allow asynchronous COPY #197

@jacek-rzrz

Description

This is probably a long shot, but I thought I'd ask about it here anyway.

Saving a DataFrame to Redshift can be seen as a 2-phase process:

  1. Writing files to S3
  2. COPYing data from S3 to a Redshift table

Of course, the driver program is blocked on DataFrameWriter.save() for the duration of both phases.
During phase 2, no Spark jobs are running because everything happens outside the cluster, i.e. between S3 and Redshift.
In my setup, phase 2 takes much longer than phase 1 and amounts to a substantial amount of time.
Perhaps this is uncommon, but I don't actually need to wait for the data to materialize in Redshift before proceeding with the next jobs in my driver program.
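For context, here is a minimal sketch of what the blocking write looks like today. The URL, table name, and tempdir are placeholders, and `saveToRedshift` is just a name I made up for the example; the option names follow the spark-redshift README:

```scala
import org.apache.spark.sql.{DataFrame, SaveMode}

// save() returns only after BOTH phases finish: the S3 write (phase 1)
// and the Redshift COPY (phase 2).
def saveToRedshift(df: DataFrame): Unit = {
  df.write
    .format("com.databricks.spark.redshift")
    .option("url", "jdbc:redshift://host:5439/db?user=u&password=p") // placeholder URL
    .option("dbtable", "my_table")                                   // placeholder target table
    .option("tempdir", "s3n://my-bucket/tmp")                        // phase-1 staging directory
    .mode(SaveMode.Append)
    .save() // blocks through the COPY; no Spark jobs run during phase 2
}
```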

I am wondering: would it be a compelling feature to allow phase 2 to happen asynchronously? In such a scenario I would imagine that e.g. a Future representing phase 2 could somehow be handed back to client code; however, I don't think the Data Sources API allows for something like that.
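To make the idea concrete: the closest approximation available from client code today is to detach the whole save() on the driver, as in the rough sketch below. This is not quite what I'm after, since it makes phase 1 asynchronous too and ties up a driver thread for the full duration, but it shows the kind of handle the API could return. `saveToRedshiftAsync` and `redshiftEc` are hypothetical names for illustration:

```scala
import java.util.concurrent.Executors
import scala.concurrent.{ExecutionContext, Future}
import org.apache.spark.sql.DataFrame

// A dedicated single-thread pool, so the blocking save() does not tie up
// the global execution context.
implicit val redshiftEc: ExecutionContext =
  ExecutionContext.fromExecutorService(Executors.newSingleThreadExecutor())

// Client-side approximation: detach the ENTIRE write, phase 1 included.
// A proper feature would run phase 1 synchronously and return a Future
// that completes only when the COPY (phase 2) finishes.
def saveToRedshiftAsync(df: DataFrame): Future[Unit] =
  Future {
    saveToRedshift(df) // the blocking call from the sketch above
  }
```

With a Future like this in hand, the driver could submit further Spark jobs while the COPY runs and attach a callback for completion or failure.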

If such a feature would be of interest, and if we found a sane way to do it, I would be happy to implement it.
