Allow asynchronous COPY #197

@jacek-rzrz

Description

This is probably a long shot, but I thought I'd ask about it here anyway.

Saving a DataFrame to Redshift can be seen as a 2-phase process:

  1. Writing files to S3
  2. COPYing data from S3 to a Redshift table

Of course, the driver program is blocked on DataFrameWriter.save() for the duration of both phases.
During phase 2, no Spark jobs are running because everything happens outside the cluster, i.e. between S3 and Redshift.
In my setup, phase 2 takes much longer than phase 1 and amounts to a substantial amount of time.
Perhaps this is uncommon, but I don't actually need to wait for the data to materialize in Redshift before proceeding with the next jobs in my driver program.
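For context, here is a minimal sketch of what the blocking write looks like today. The URL, table name, and tempdir are placeholders, and `saveToRedshift` is just a name I made up for the example; the option names follow the spark-redshift README:

```scala
import org.apache.spark.sql.{DataFrame, SaveMode}

// save() returns only after BOTH phases finish: the S3 write (phase 1)
// and the Redshift COPY (phase 2).
def saveToRedshift(df: DataFrame): Unit = {
  df.write
    .format("com.databricks.spark.redshift")
    .option("url", "jdbc:redshift://host:5439/db?user=u&password=p") // placeholder URL
    .option("dbtable", "my_table")                                   // placeholder target table
    .option("tempdir", "s3n://my-bucket/tmp")                        // phase-1 staging directory
    .mode(SaveMode.Append)
    .save() // blocks through the COPY; no Spark jobs run during phase 2
}
```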

I am wondering: would it be a compelling feature to allow phase 2 to happen asynchronously? In such a scenario I would imagine that e.g. a Future representing phase 2 could somehow be handed back to client code; however, I don't think the Data Sources API allows for something like that.
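To make the idea concrete: the closest approximation available from client code today is to detach the whole save() on the driver, as in the rough sketch below. This is not quite what I'm after, since it makes phase 1 asynchronous too and ties up a driver thread for the full duration, but it shows the kind of handle the API could return. `saveToRedshiftAsync` and `redshiftEc` are hypothetical names for illustration:

```scala
import java.util.concurrent.Executors
import scala.concurrent.{ExecutionContext, Future}
import org.apache.spark.sql.DataFrame

// A dedicated single-thread pool, so the blocking save() does not tie up
// the global execution context.
implicit val redshiftEc: ExecutionContext =
  ExecutionContext.fromExecutorService(Executors.newSingleThreadExecutor())

// Client-side approximation: detach the ENTIRE write, phase 1 included.
// A proper feature would run phase 1 synchronously and return a Future
// that completes only when the COPY (phase 2) finishes.
def saveToRedshiftAsync(df: DataFrame): Future[Unit] =
  Future {
    saveToRedshift(df) // the blocking call from the sketch above
  }
```

With a Future like this in hand, the driver could submit further Spark jobs while the COPY runs and attach a callback for completion or failure.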

If such a feature would be of interest, and if we found a sane way to do it, I would be happy to implement it.
