Buffer-based csv2tsv (performance improvement) #301
Merged
This PR changes the algorithm used by `csv2tsv` to work on a buffer of data at a time rather than a character at a time. The motivation for this change is performance.

The main change is that the new version writes longer blocks of characters to the output stream, where the original version wrote a single byte at a time. The output stream itself uses buffering, but even so, writing longer blocks to it is faster, as there are fewer calls and less per-byte overhead. Also, at one point a change in either the D standard library or the compiler resulted in less optimal code and performance degraded; the exact cause is not clear.
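To make the distinction concrete, the two write patterns look roughly like this. This is an illustrative contrast only, not code from the PR; both variants go through the buffered stream underneath, but the first makes one library call per byte while the second makes one call per block.

```d
import std.stdio;

// Illustrative only: one rawWrite call per byte.
void byteAtATime(const char[] data, File output)
{
    foreach (c; data)
        output.rawWrite((&c)[0 .. 1]);   // One call per byte.
}

// Illustrative only: one rawWrite call for the whole block.
void blockAtATime(const char[] data, File output)
{
    output.rawWrite(data);               // One call per block.
}
```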
The new algorithm still walks over the input CSV data one byte at a time, so in this respect it is similar to the original. However, rather than immediately writing each byte to the output stream, the algorithm keeps track of the run of consecutive bytes that can be written out unchanged. In addition, if a byte in the input is simply being replaced by a different byte, the modification is done in place; the common case is the CSV field delimiter being replaced, e.g. a comma becoming a TAB. In this way longer sequences of bytes can be written to the output stream all at once. The current input region is written out whenever the run of consecutive bytes is interrupted; the common case is CSV fields surrounded by double quotes, which are removed when writing the TSV form. A sketch of this scheme is below.
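Here is a minimal sketch of the scheme in D. It is not the PR's actual code: the name `csvBlockToTsv` is hypothetical, and it omits details the real converter must handle, such as escaped quotes (`""`), replacement of TABs and newlines found in the data, and quoted fields that span buffer boundaries.

```d
import std.stdio;

/* Minimal sketch only, not the PR's actual code. Delimiter bytes are
 * replaced in place so the run continues; double quotes interrupt the
 * run and force a flush of the pending bytes. */
void csvBlockToTsv(char[] buffer, File output)
{
    size_t runStart = 0;        // First byte of the run writable as-is.
    bool inQuotedField = false;

    foreach (i, ref c; buffer)
    {
        if (inQuotedField)
        {
            if (c == '"')
            {
                // Closing quote: flush the field contents, skip the quote.
                output.rawWrite(buffer[runStart .. i]);
                inQuotedField = false;
                runStart = i + 1;
            }
            // Other bytes, including commas, pass through inside quotes.
        }
        else if (c == ',')
        {
            c = '\t';           // In-place replacement; the run continues.
        }
        else if (c == '"')
        {
            // Opening quote interrupts the run: flush pending bytes and
            // restart the run at the quoted contents.
            output.rawWrite(buffer[runStart .. i]);
            inQuotedField = true;
            runStart = i + 1;
        }
    }

    output.rawWrite(buffer[runStart .. $]);  // Flush the final run.
}
```

On an input line like `abc,"d,e",fg` this makes three `rawWrite` calls (`abc<TAB>`, `d,e`, and `<TAB>fg`) rather than one call per byte.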
Performance tests indicate the new algorithm is considerably faster than the original. Testing was done on a Mac Mini (16GB RAM, SSD drives). Compared to the current `csv2tsv` version 2.0.0, the new version ran 40% faster on files with significant amounts of CSV escapes (double quotes on every field), and 60% faster on files with limited CSV escapes. Versus `csv2tsv` version 1.1.19 (the 2018 benchmark study version), the new version is 10% faster on files with significant CSV escapes and 40% faster on files with limited CSV escapes.

In short, performance is improved significantly over all previous versions. On "simple" CSV data that does not contain CSV escapes, performance is now in the ballpark of Unix `tr`, where `tr` is used only to convert commas to TAB characters. GNU `tr` is still about 20% faster, but this is a good indication that the new version of `csv2tsv` has solid performance. GNU `tr` is of course not checking for CSV escapes, so it should have better overall performance.

This PR also adds an option to use different replacement characters for TABs and newlines found in the data; the previous version used the same replacement character for both. For now this is only a change to the internal code. It will be made available from the command line in a future PR.
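For context, the internal change could look something like the following; this is a hypothetical sketch, and the names are illustrative rather than the PR's actual identifiers.

```d
// Hypothetical sketch only; field names are illustrative.
struct Csv2TsvConfig
{
    char csvDelimIn = ',';          // Field delimiter in the CSV input.
    char tsvDelimOut = '\t';        // Field delimiter in the TSV output.
    char tabReplacement = ' ';      // Replaces TABs found in the data.
    char newlineReplacement = ' ';  // Replaces newlines found in the data.
    // Previously a single replacement character served both roles.
}
```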
A copy of the previous version was put in the directory `csv2tsv/src_v1` so that the original version can be found more easily.