Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Order of record batches from "arrow file" format files (i.e. Arrow.Table) not preserved #295

Closed
complyue opened this issue Mar 4, 2022 · 0 comments

Comments

@complyue
Copy link

complyue commented Mar 4, 2022

arrow-julia/src/table.jl

Lines 276 to 293 in 614fce0

tsk = Threads.@spawn begin
i = 1
for tsk in tsks
cols = fetch(tsk)
if i == 1
foreach(x -> push!(columns(t), x), cols)
elseif i == 2
foreach(1:length(cols)) do i
columns(t)[i] = ChainedVector([columns(t)[i], cols[i]])
end
else
foreach(1:length(cols)) do i
append!(columns(t)[i], cols[i])
end
end
i += 1
end
end

I see from above that record batches will be parsed (esp. decompression could be rather intensive computation workload) in parallel if the Julia runtime has multithread enabled, which is great.

But according to the implementation, the original order of batches as they had been written will not be guaranteed as preserved, which I think is not ideal. I'm not sure how Arrow spec should say about this aspect, but I'm dealing with time series data recorded batch-by-batch where the order signifies a lot.

I'd like to draft a PR to preserve batch order with regard to this concern, and as I start tinkering with the codebase, I file this issue to ask your opinions about it.

(Btw, I'm also tinkering about a PR for #293, which is orthogonal wrt functionality, but seems closely related wrt implementation details. I'd think 2 separate PRs would make better clarity for review and release purpose, but if you can accept a single PR addressing the 2 things together, it could be a lot easier for me, given I'm not fluent in git rebasing and related skills.)

@complyue complyue mentioned this issue Mar 4, 2022
quinnj added a commit that referenced this issue Mar 6, 2022
We already have a utility defined (`OrderedChannel`) that we use when
writing record batches to ensure batches get _written_ in the same order
they are provided; it makes sense to use the same utility when reading
to ensure incoming record batches are _read_ in the appropriate order.
@quinnj quinnj closed this as completed in d239c34 Oct 6, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant