Buffer-based csv2tsv (performance improvement) #301
Merged
This PR changes the algorithm used by `csv2tsv` to work on a buffer of data at a time rather than a character at a time. The motivation for this change is performance.

The main change is that the new version writes longer blocks of characters to the output stream, where the original version wrote a single byte at a time. The output stream itself uses buffering, but even so, writing longer blocks to it is faster, as there are fewer calls and less per-byte overhead. Also, at one point a change in either the D standard library or the compiler resulted in less optimal code and performance degraded; the exact cause is not clear.
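To make the distinction concrete, the two write patterns look roughly like this. This is an illustrative contrast only, not code from the PR; both variants go through the buffered stream underneath, but the first makes one library call per byte while the second makes one call per block.

```d
import std.stdio;

// Illustrative only: one rawWrite call per byte.
void byteAtATime(const char[] data, File output)
{
    foreach (c; data)
        output.rawWrite((&c)[0 .. 1]);   // One call per byte.
}

// Illustrative only: one rawWrite call for the whole block.
void blockAtATime(const char[] data, File output)
{
    output.rawWrite(data);               // One call per block.
}
```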
The new algorithm still walks over the input CSV data one byte at a time, so in this respect it is similar to the original. However, rather than immediately writing each byte to the output stream, the algorithm keeps track of the run of consecutive bytes that can be written out unchanged. In addition, if a byte in the input is simply being replaced by a different byte, the modification is done in place; the common case is the CSV field delimiter being replaced, e.g. a comma becoming a TAB. In this way longer sequences of bytes can be written to the output stream all at once. The current input region is written out whenever the run of consecutive bytes is interrupted; the common case is CSV fields surrounded by double quotes, which are removed when writing the TSV form. A sketch of this scheme is below.
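Here is a minimal sketch of the scheme in D. It is not the PR's actual code: the name `csvBlockToTsv` is hypothetical, and it omits details the real converter must handle, such as escaped quotes (`""`), replacement of TABs and newlines found in the data, and quoted fields that span buffer boundaries.

```d
import std.stdio;

/* Minimal sketch only, not the PR's actual code. Delimiter bytes are
 * replaced in place so the run continues; double quotes interrupt the
 * run and force a flush of the pending bytes. */
void csvBlockToTsv(char[] buffer, File output)
{
    size_t runStart = 0;        // First byte of the run writable as-is.
    bool inQuotedField = false;

    foreach (i, ref c; buffer)
    {
        if (inQuotedField)
        {
            if (c == '"')
            {
                // Closing quote: flush the field contents, skip the quote.
                output.rawWrite(buffer[runStart .. i]);
                inQuotedField = false;
                runStart = i + 1;
            }
            // Other bytes, including commas, pass through inside quotes.
        }
        else if (c == ',')
        {
            c = '\t';           // In-place replacement; the run continues.
        }
        else if (c == '"')
        {
            // Opening quote interrupts the run: flush pending bytes and
            // restart the run at the quoted contents.
            output.rawWrite(buffer[runStart .. i]);
            inQuotedField = true;
            runStart = i + 1;
        }
    }

    output.rawWrite(buffer[runStart .. $]);  // Flush the final run.
}
```

On an input line like `abc,"d,e",fg` this makes three `rawWrite` calls (`abc<TAB>`, `d,e`, and `<TAB>fg`) rather than one call per byte.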
Performance tests indicate the new algorithm is considerably faster than the original. Testing was done on a Mac Mini (16GB RAM, SSD drives). Compared to the current `csv2tsv` version 2.0.0, the new version ran 40% faster on files with significant amounts of CSV escapes (double quotes on every field), and 60% faster on files with limited CSV escapes. Versus `csv2tsv` version 1.1.19 (the 2018 benchmark study version), the new version is 10% faster on files with significant CSV escapes and 40% faster on files with limited CSV escapes.

In short, performance is improved significantly over all previous versions. On "simple" CSV data that does not contain CSV escapes, performance is now in the ballpark of Unix `tr`, where `tr` is used only to convert commas to TAB characters. GNU `tr` is still about 20% faster, but this is a good indication that the new version of `csv2tsv` has solid performance. GNU `tr` is of course not checking for CSV escapes, so it should have better overall performance.

This PR also adds an option to use different replacement characters for TABs and newlines found in the data; the previous version used the same replacement character for both. For now this is only a change to the internal code. It will be made available from the command line in a future PR.
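For context, the internal change could look something like the following; this is a hypothetical sketch, and the names are illustrative rather than the PR's actual identifiers.

```d
// Hypothetical sketch only; field names are illustrative.
struct Csv2TsvConfig
{
    char csvDelimIn = ',';          // Field delimiter in the CSV input.
    char tsvDelimOut = '\t';        // Field delimiter in the TSV output.
    char tabReplacement = ' ';      // Replaces TABs found in the data.
    char newlineReplacement = ' ';  // Replaces newlines found in the data.
    // Previously a single replacement character served both roles.
}
```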
A copy of the previous version was put in the directory `csv2tsv/src_v1` so that the original version can be found more easily.