Order of record batches from "arrow file" format files (i.e. `Arrow.Table`) not preserved #295

complyue · 2022-03-04T08:43:01Z

Lines 276 to 293 in 614fce0

    
           tsk = Threads.@spawn begin 
        
               i = 1 
        
               for tsk in tsks 
        
                   cols = fetch(tsk) 
        
                   if i == 1 
        
                       foreach(x -> push!(columns(t), x), cols) 
        
                   elseif i == 2 
        
                       foreach(1:length(cols)) do i 
        
                           columns(t)[i] = ChainedVector([columns(t)[i], cols[i]]) 
        
                       end 
        
                   else 
        
                       foreach(1:length(cols)) do i 
        
                           append!(columns(t)[i], cols[i]) 
        
                       end 
        
                   end 
        
                   i += 1 
        
               end 
        
           end

I see from above that record batches will be parsed (esp. decompression could be rather intensive computation workload) in parallel if the Julia runtime has multithread enabled, which is great.

But according to the implementation, the original order of batches as they had been written will not be guaranteed as preserved, which I think is not ideal. I'm not sure how Arrow spec should say about this aspect, but I'm dealing with time series data recorded batch-by-batch where the order signifies a lot.

I'd like to draft a PR to preserve batch order with regard to this concern, and as I start tinkering with the codebase, I file this issue to ask your opinions about it.

(Btw, I'm also tinkering about a PR for #293, which is orthogonal wrt functionality, but seems closely related wrt implementation details. I'd think 2 separate PRs would make better clarity for review and release purpose, but if you can accept a single PR addressing the 2 things together, it could be a lot easier for me, given I'm not fluent in git rebasing and related skills.)

We already have a utility defined (`OrderedChannel`) that we use when writing record batches to ensure batches get _written_ in the same order they are provided; it makes sense to use the same utility when reading to ensure incoming record batches are _read_ in the appropriate order.

complyue mentioned this issue Mar 4, 2022

Solving #295 #296

Closed

quinnj closed this as completed in d239c34 Oct 6, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Order of record batches from "arrow file" format files (i.e. `Arrow.Table`) not preserved #295

Order of record batches from "arrow file" format files (i.e. `Arrow.Table`) not preserved #295

complyue commented Mar 4, 2022

Order of record batches from "arrow file" format files (i.e. Arrow.Table) not preserved #295

Order of record batches from "arrow file" format files (i.e. Arrow.Table) not preserved #295

Comments

complyue commented Mar 4, 2022

Order of record batches from "arrow file" format files (i.e. `Arrow.Table`) not preserved #295

Order of record batches from "arrow file" format files (i.e. `Arrow.Table`) not preserved #295