Buffer-based csv2tsv (performance improvement) #301

Merged
jondegenhardt merged 7 commits into eBay:master from csv2tsv-in-blocks-aug2020
Sep 6, 2020

@jondegenhardt jondegenhardt commented Sep 6, 2020

This PR changes the algorithm used by csv2tsv to work on a buffer of data at a time rather than a character at a time. The motivation for this change is performance.

The main change is that the new version writes longer blocks of characters to the output stream, whereas the original version wrote a single byte at a time. The output stream is itself buffered, but writing longer blocks to it is still faster. In addition, at some point a change in either the D library or the compiler resulted in less optimal code, and performance degraded. The exact cause is not clear.

The new algorithm still walks over the input CSV data one byte at a time. In this manner the algorithm is similar to the original. However, rather than immediately writing the byte to the output stream, the algorithm keeps track of the set of consecutive bytes that can be written unchanged. In addition, if the byte in the input data is simply being replaced by a different byte, the modification is done in place. The common case is the CSV field delimiters being replaced, e.g. a comma being replaced by a TAB. In this way longer sequences of bytes can be written to the output stream all at once. The current input region is written out whenever a sequence of consecutive bytes is interrupted. The common case is CSV fields surrounded by double quotes, which get removed when writing the TSV form.
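The approach described above can be sketched roughly as follows. This is a minimal illustration in Python, not the D code in this PR. The function name `csv_to_tsv`, the state names, and the parameters are all hypothetical. The sketch assumes the whole input fits in one buffer, LF line endings, single-byte replacement characters, and lenient handling of malformed quoting; the real tool reads its input in blocks and does more validation.

```python
import io

TAB, LF, QUOTE = 0x09, 0x0A, 0x22
FIELD_START, UNQUOTED, QUOTED, QUOTE_IN_QUOTED = range(4)


def csv_to_tsv(data, out, delim=b",", tab_repl=b" ", newline_repl=b" "):
    """Hypothetical sketch: convert CSV bytes to TSV, writing runs of bytes to `out`."""
    buf = bytearray(data)          # mutable copy so bytes can be replaced in place
    delim_byte = delim[0]
    tab_byte = tab_repl[0]         # assumption: single-byte replacements only
    nl_byte = newline_repl[0]
    state = FIELD_START
    start = 0                      # start of the pending run that can be written unchanged

    for i in range(len(buf)):
        b = buf[i]
        if state == QUOTED:
            if b == QUOTE:
                # Either the closing quote or the first half of an escaped quote.
                # Both are dropped, so the pending run is interrupted: flush it.
                out.write(buf[start:i])
                start = i + 1
                state = QUOTE_IN_QUOTED
            elif b == TAB:
                buf[i] = tab_byte  # TAB inside the data: replace in place
            elif b == LF:
                buf[i] = nl_byte   # newline inside the data: replace in place
        elif state == QUOTE_IN_QUOTED and b == QUOTE:
            # Second half of an escaped quote (""): keep this byte in the run.
            state = QUOTED
        elif state == FIELD_START and b == QUOTE:
            # Opening quote of a quoted field: flush the run and skip the quote.
            out.write(buf[start:i])
            start = i + 1
            state = QUOTED
        elif b == delim_byte:
            buf[i] = TAB           # field delimiter: replace in place, run continues
            state = FIELD_START
        elif b == LF:
            state = FIELD_START    # record separator is written unchanged
        else:
            if b == TAB:
                buf[i] = tab_byte
            state = UNQUOTED

    out.write(buf[start:])         # flush whatever is left


if __name__ == "__main__":
    sink = io.BytesIO()
    csv_to_tsv(b'a,"b,c",d\n', sink)
    print(sink.getvalue())         # b'a\tb,c\td\n'
```

On input with no quoted fields, this sketch issues a single write for the whole buffer, which is the case where block writes help the most. Each quote that has to be dropped costs one extra write.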

Performance tests indicate the new algorithm is considerably faster than the original algorithm. Testing was done on a Mac Mini (16GB RAM, SSD drives). Compared to the current csv2tsv version 2.0.0, the new version ran 40% faster on files with significant amounts of CSV escapes (double quotes on every field), and 60% faster on files with limited CSV escapes. Versus csv2tsv version 1.1.19 (the 2018 benchmark study version), the new version is 10% faster on files with significant CSV escapes and 40% faster on files with limited CSV escapes.

In short, performance is improved significantly over all previous versions. On "simple" CSV data that does not contain CSV escapes, the performance is now in the ballpark of Unix tr, where tr is only being used to convert commas to TAB characters. GNU tr is still about 20% faster, but this is a good indication the new version of csv2tsv has solid performance. GNU tr is of course not checking for CSV escapes and should have better overall performance.
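To make the comparison concrete, here is a rough Python equivalent of what a byte-for-byte comma-to-TAB translation does. It is only an illustration of why such a translation is a lower bound rather than a substitute; the helper name `tr_commas_to_tabs` is hypothetical and this is not how the benchmark was run.

```python
# Translation table mapping ',' to TAB, leaving every other byte unchanged.
COMMA_TO_TAB = bytes.maketrans(b",", b"\t")


def tr_commas_to_tabs(data: bytes) -> bytes:
    """Hypothetical helper: blind byte translation with no CSV awareness."""
    return data.translate(COMMA_TO_TAB)


if __name__ == "__main__":
    # Fine on "simple" CSV with no escapes.
    print(tr_commas_to_tabs(b"a,b,c\n"))      # b'a\tb\tc\n'
    # Wrong on quoted fields: the comma inside the quotes is also converted
    # and the quotes are left in place.
    print(tr_commas_to_tabs(b'"x,y",z\n'))    # b'"x\ty"\tz\n'
```

The second call shows the work a plain translation skips: it has no notion of quoted fields, so it can be faster but produces incorrect TSV whenever CSV escapes are present.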

This PR also adds support for using different replacement characters for TAB and newline characters found in the data. The previous version used the same replacement character for both. For now this is only an internal code change; it will be made available from the command line in a future PR.
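As a small illustration of separate replacements, the following Python sketch cleans a single field value. The function `sanitize_field` and its parameter names are hypothetical and are not the identifiers used in the D code; the sketch assumes LF newlines and that neither replacement string itself contains a TAB or newline.

```python
def sanitize_field(field: bytes,
                   tab_repl: bytes = b" ",
                   newline_repl: bytes = b" ") -> bytes:
    """Hypothetical helper: replace TABs and newlines in a field independently."""
    return field.replace(b"\t", tab_repl).replace(b"\n", newline_repl)


if __name__ == "__main__":
    # Same replacement for both, as in the previous version's behavior.
    print(sanitize_field(b"a\tb\nc"))                 # b'a b c'
    # Distinct replacements, as the internal code now allows.
    print(sanitize_field(b"a\tb\nc", b" ", b"; "))    # b'a b; c'
```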

A copy of the previous version was put in the directory csv2tsv/src_v1 so that the original version can be found more easily.

codecov-commenter commented Sep 6, 2020

Codecov Report

Merging #301 into master will increase coverage by 0.00%.
The diff coverage is 100.00%.

@@           Coverage Diff           @@
##           master     #301   +/-   ##
=======================================
  Coverage   99.34%   99.35%           
=======================================
  Files          18       18           
  Lines        6763     6792   +29     
=======================================
+ Hits         6719     6748   +29     
  Misses         44       44           
Impacted Files Coverage Δ
common/src/tsv_utils/common/utils.d 100.00% <ø> (ø)
csv2tsv/src/tsv_utils/csv2tsv.d 100.00% <100.00%> (ø)

@jondegenhardt jondegenhardt merged commit 3bec503 into eBay:master Sep 6, 2020
@jondegenhardt jondegenhardt deleted the csv2tsv-in-blocks-aug2020 branch September 6, 2020 09:43