We can support a `-p N` parallelism flag that runs the transformation in `N` threads, hopefully cutting processing time drastically.
This should be relatively straightforward by:
Inspecting the dialect data and deriving from it what the end-of-line tokens are, etc.
Looking at the file length in bytes
Crudely splitting the file into `N` roughly equal byte ranges
Refining each split offset slightly by scanning forward from there to the next true end of line
Winding `N` streams to their respective offsets
Reading the header row and giving it to each thread
Passing each stream to one of the `N` threads
Having each thread output to a separate RDF file (appropriately numbered)
Optionally (if requested) supporting a concat flag that re-concatenates the output files.
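The offset-splitting steps above can be sketched roughly as follows. This is a minimal illustration in Python, not the tool's actual implementation: the function name `split_offsets` is hypothetical, and it naively treats any newline as a row boundary. A real implementation would use the CSV dialect data from the first step to recognise *true* ends of line, since a plain newline scan is fooled by newlines embedded in quoted fields.

```python
import os

def split_offsets(path, n):
    """Crudely split a file into n byte ranges, then refine each
    boundary forward to the next newline so no row is cut in half.
    NOTE: assumes no newlines inside quoted CSV fields; a real
    implementation must consult the dialect data here."""
    size = os.path.getsize(path)
    chunk = size // n
    offsets = [0]
    with open(path, "rb") as f:
        for i in range(1, n):
            f.seek(i * chunk)
            f.readline()              # scan forward to the next end of line
            pos = f.tell()
            # skip degenerate boundaries (tiny files, long final rows)
            if pos < size and pos > offsets[-1]:
                offsets.append(pos)
    offsets.append(size)
    # each (start, end) pair is one thread's byte range
    return list(zip(offsets, offsets[1:]))
```

Each thread would then seek its stream to its `start` offset, be handed the header row read separately, and parse only the bytes up to `end`.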
The key to making this fast is to avoid parsing the whole CSV into batches in the splitting step. Any final concat should likewise operate at the file level, without any parsing of RDF.
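For the optional concat flag, a file-level re-concatenation could look like the sketch below (the name `concat_outputs` is hypothetical). It assumes a line-oriented RDF serialisation such as N-Triples, where output files can be joined as raw bytes; a serialisation with a shared header or prefix block (e.g. Turtle) would need extra handling, which is another reason to keep the parallel outputs in a simple format.

```python
import shutil

def concat_outputs(part_paths, dest_path):
    """Re-concatenate the numbered per-thread RDF output files by raw
    byte copy -- no RDF parsing, since each part already contains only
    whole statements."""
    with open(dest_path, "wb") as dest:
        for part in part_paths:
            with open(part, "rb") as src:
                shutil.copyfileobj(src, dest)
```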