We can support a `-p N` parallelism flag that runs the transformation in `N` threads, hopefully cutting processing time drastically.
This should be relatively straightforward by:
Inspecting the dialect data and deriving from it what the end-of-line tokens are, etc.
Looking at the file length in bytes
Crudely splitting the file into `N` roughly equal byte ranges
Refining each split offset slightly by scanning forward from there to the next true end of line
Winding `N` streams to their respective offsets
Reading the header row and giving it to each thread
Passing each stream to one of the `N` threads
Having each thread output to a separate RDF file (appropriately numbered)
Optionally (if requested) supporting a concat flag that re-concatenates the output files.
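The offset-splitting steps above can be sketched roughly as follows. This is a minimal illustration in Python, not the tool's actual implementation: the function name `split_offsets` is hypothetical, and it naively treats any newline as a row boundary. A real implementation would use the CSV dialect data from the first step to recognise *true* ends of line, since a plain newline scan is fooled by newlines embedded in quoted fields.

```python
import os

def split_offsets(path, n):
    """Crudely split a file into n byte ranges, then refine each
    boundary forward to the next newline so no row is cut in half.
    NOTE: assumes no newlines inside quoted CSV fields; a real
    implementation must consult the dialect data here."""
    size = os.path.getsize(path)
    chunk = size // n
    offsets = [0]
    with open(path, "rb") as f:
        for i in range(1, n):
            f.seek(i * chunk)
            f.readline()              # scan forward to the next end of line
            pos = f.tell()
            # skip degenerate boundaries (tiny files, long final rows)
            if pos < size and pos > offsets[-1]:
                offsets.append(pos)
    offsets.append(size)
    # each (start, end) pair is one thread's byte range
    return list(zip(offsets, offsets[1:]))
```

Each thread would then seek its stream to its `start` offset, be handed the header row read separately, and parse only the bytes up to `end`.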
The key to making this fast is to avoid parsing the whole CSV into batches in the splitting step. Any final concat should likewise operate at the file level, without any parsing of RDF.
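For the optional concat flag, a file-level re-concatenation could look like the sketch below (the name `concat_outputs` is hypothetical). It assumes a line-oriented RDF serialisation such as N-Triples, where output files can be joined as raw bytes; a serialisation with a shared header or prefix block (e.g. Turtle) would need extra handling, which is another reason to keep the parallel outputs in a simple format.

```python
import shutil

def concat_outputs(part_paths, dest_path):
    """Re-concatenate the numbered per-thread RDF output files by raw
    byte copy -- no RDF parsing, since each part already contains only
    whole statements."""
    with open(dest_path, "wb") as dest:
        for part in part_paths:
            with open(part, "rb") as src:
                shutil.copyfileobj(src, dest)
```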