This repository has been archived by the owner on Feb 18, 2024. It is now read-only.

Improved performance of writing to CSV (20-25%) #382

Merged
merged 2 commits into from
Sep 5, 2021

Conversation

jorgecarleitao
Owner

@jorgecarleitao jorgecarleitao commented Sep 5, 2021

Currently, we write to CSV by using an Iterator<Item=Vec<u8>>. This requires a new allocation per non-null item.

This PR uses StreamingIterator to significantly reduce the number of allocations.

A StreamingIterator is an Iterator-like trait whose items may borrow from the iterator itself, &[u8] in this case. The (streaming) iterator owns an internal Vec<u8> buffer and re-uses it across all items within the array.
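
As a rough illustration of the idea (not the code in this PR), the sketch below hand-rolls a trait with the same shape as `StreamingIterator` so that it compiles without external crates. `I32Serializer` and `to_csv_column` are hypothetical names invented for this example; the PR's actual types may differ.

```rust
use std::io::Write;

/// Stand-in for the `StreamingIterator` trait (declared locally so the
/// sketch has no external dependencies).
pub trait StreamingIterator {
    type Item: ?Sized;

    /// Move to the next element, overwriting the internal state.
    fn advance(&mut self);

    /// Borrow the current element, or `None` once the iterator is exhausted.
    fn get(&self) -> Option<&Self::Item>;

    /// Advance, then borrow the current element.
    fn next(&mut self) -> Option<&Self::Item> {
        self.advance();
        self.get()
    }
}

/// Hypothetical serializer: formats each `i32` into one re-used byte buffer.
pub struct I32Serializer<'a> {
    values: std::slice::Iter<'a, i32>,
    buffer: Vec<u8>,
    is_valid: bool,
}

impl<'a> I32Serializer<'a> {
    pub fn new(values: &'a [i32]) -> Self {
        Self {
            values: values.iter(),
            buffer: Vec::new(),
            is_valid: false,
        }
    }
}

impl<'a> StreamingIterator for I32Serializer<'a> {
    type Item = [u8];

    fn advance(&mut self) {
        match self.values.next() {
            Some(value) => {
                self.is_valid = true;
                // `clear` resets the length but keeps the capacity, so the
                // next write re-uses the existing allocation when it fits.
                self.buffer.clear();
                write!(self.buffer, "{}", value).expect("writing to a Vec<u8> cannot fail");
            }
            None => self.is_valid = false,
        }
    }

    fn get(&self) -> Option<&[u8]> {
        if self.is_valid {
            Some(self.buffer.as_slice())
        } else {
            None
        }
    }
}

/// Hypothetical consumer: writes one value per line into `out`.
pub fn to_csv_column(values: &[i32]) -> Vec<u8> {
    let mut out = Vec::new();
    let mut iter = I32Serializer::new(values);
    // Each `bytes` borrows from `iter`, so it must be consumed before the
    // next call to `next`; a regular `Iterator` cannot express this.
    while let Some(bytes) = iter.next() {
        out.extend_from_slice(bytes);
        out.push(b'\n');
    }
    out
}
```

The borrow in `get` is what lets the serializer hand out `&[u8]` views of its own buffer instead of returning a freshly allocated `Vec<u8>` per item.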

In summary, this replaces (1 alloc + write bytes) with (at most 1 realloc + write bytes) per non-null item. For types that require a fixed number of bytes (e.g. all our primitive types), it results in a single allocation per array, as opposed to an allocation per non-null item.
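
To make the allocation argument concrete, here is a small sketch that counts how often a re-used buffer has to grow while serializing a column. The function `write_i32_column` and its `initial_capacity` parameter are assumptions made for this illustration and are not part of the PR. Growth is detected by comparing `Vec::capacity` before and after each write.

```rust
use std::io::Write;

/// Hypothetical helper: serializes a nullable `i32` column, one value per
/// line, re-using a single scratch buffer. Returns the output bytes and the
/// number of items during which the scratch buffer had to grow.
pub fn write_i32_column(values: &[Option<i32>], initial_capacity: usize) -> (Vec<u8>, usize) {
    let mut out = Vec::new();
    let mut buffer: Vec<u8> = Vec::with_capacity(initial_capacity);
    let mut buffer_growths = 0;

    for value in values {
        // Null items are written as empty fields and never touch the buffer.
        if let Some(v) = value {
            buffer.clear(); // keeps the existing capacity
            let capacity_before = buffer.capacity();
            write!(buffer, "{}", v).expect("writing to a Vec<u8> cannot fail");
            if buffer.capacity() != capacity_before {
                buffer_growths += 1;
            }
            out.extend_from_slice(&buffer);
        }
        out.push(b'\n');
    }
    (out, buffer_growths)
}
```

The longest decimal rendering of an `i32` is `-2147483648` (11 bytes), so calling this with `initial_capacity = 11` means the scratch buffer is allocated once up front and never grows afterwards. Starting from capacity 0 instead, it grows only on the items that do not fit the current capacity, which is the "at most 1 realloc per non-null item" case.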

csv write i32 2^18      time:   [15.828 ms 15.866 ms 15.915 ms]                               
                        change: [-26.127% -25.784% -25.480%] (p = 0.00 < 0.05)
csv write utf8 2^18     time:   [33.100 ms 33.188 ms 33.276 ms]                                
                        change: [-14.104% -13.811% -13.508%] (p = 0.00 < 0.05)
csv write f64 2^18      time:   [23.661 ms 23.702 ms 23.748 ms]                               
                        change: [-20.623% -20.448% -20.239%] (p = 0.00 < 0.05)

@jorgecarleitao jorgecarleitao added the enhancement An improvement to an existing feature label Sep 5, 2021
@codecov

codecov bot commented Sep 5, 2021

Codecov Report

Merging #382 (964ca91) into main (b16d6b9) will increase coverage by 0.02%.
The diff coverage is 61.90%.


@@            Coverage Diff             @@
##             main     #382      +/-   ##
==========================================
+ Coverage   81.11%   81.14%   +0.02%     
==========================================
  Files         330      331       +1     
  Lines       21770    21891     +121     
==========================================
+ Hits        17659    17763     +104     
- Misses       4111     4128      +17     
| Impacted Files | Coverage Δ | |
|---|---|---|
| src/io/csv/write/mod.rs | 76.00% <50.00%> (ø) | |
| src/io/csv/write/serialize.rs | 46.15% <57.14%> (+5.85%) | ⬆️ |
| src/io/csv/write/iterator.rs | 71.42% <71.42%> (ø) | |
| src/util/lexical.rs | 75.00% <80.00%> (+12.50%) | ⬆️ |
| src/array/boolean/ffi.rs | 0.00% <0.00%> (-5.89%) | ⬇️ |
| tests/it/array/primitive/mod.rs | 100.00% <0.00%> (ø) | |
| src/array/display.rs | 61.70% <0.00%> (+2.57%) | ⬆️ |
| src/temporal_conversions.rs | 85.00% <0.00%> (+5.00%) | ⬆️ |

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update b16d6b9...964ca91.

@jorgecarleitao jorgecarleitao merged commit 4f8d793 into main Sep 5, 2021
@jorgecarleitao jorgecarleitao deleted the csv_write_fast branch September 5, 2021 11:24