tsv-filter --label option #338

Merged: 5 commits, Mar 11, 2021

jondegenhardt (Contributor)

Description

This PR adds a new feature to tsv-filter: marking each record as passing or failing the filter test, rather than dropping the records that fail.

Consider the following command, which identifies lines where the Color field is a primary color.

$ tsv-filter -H --or --str-eq Color:Red --str-eq Color:Yellow --str-eq Color:Blue data.tsv

The above filters out all records not satisfying the test. However, it is often desirable to keep all the records and instead mark them to indicate the matches. The following command does this, adding a new field, IsPrimaryColor, populated with the value 1 or 0 to indicate whether the record passed.

$ tsv-filter -H --label IsPrimaryColor --or --str-eq Color:Red --str-eq Color:Yellow --str-eq Color:Blue data.tsv

The label values can be customized using the --label-values option. To change the above to use true and false, run:

$ tsv-filter -H --label IsPrimaryColor --label-values true:false --or --str-eq Color:Red --str-eq Color:Yellow --str-eq Color:Blue data.tsv
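To make the labeling semantics concrete, here is a minimal C++ sketch of the behavior described above. It is not the tsv-filter implementation (tsv-utils is written in D); the function name `labelLines`, its parameters, and the predicate standing in for the filter tests are all hypothetical, chosen only for illustration.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Hypothetical helper, for illustration only: every record is kept, and a
// label field is appended in place of dropping records that fail the test.
std::vector<std::string> labelLines(
    const std::vector<std::string>& lines,  // lines[0] is the header line
    const std::string& labelHeader,         // e.g. "IsPrimaryColor"
    const std::function<bool(const std::string&)>& passes,
    const std::string& passValue = "1",     // defaults mirror the 1/0 labels
    const std::string& failValue = "0")
{
    std::vector<std::string> out;
    out.reserve(lines.size());
    for (std::size_t i = 0; i < lines.size(); ++i) {
        if (i == 0) {
            // The header line receives the label field name.
            out.push_back(lines[i] + '\t' + labelHeader);
        } else {
            // Data lines receive the pass or fail value.
            out.push_back(lines[i] + '\t' + (passes(lines[i]) ? passValue : failValue));
        }
    }
    return out;
}
```

Passing "true" and "false" as the last two arguments corresponds to the --label-values true:false case.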

Implementation

Adding the label field is straightforward. In the main loop, instead of choosing to output a line or not, an indicator is appended. However, the additional conditional tests in the loop caused a performance degradation. This was partly due to the recent addition of the --count option, which counts the number of records satisfying the criteria. The performance degradation was minor for wide files with long lines, but substantial for narrow files.

To regain performance, the code was templatized to reduce the number of tests in the main loop. In addition, BufferedOutputRange was streamlined; it had picked up some additional checks as part of the recent --line-buffered support. Between the two changes, all the original performance was regained, and possibly a bit more.
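The general idea of moving mode tests out of the main loop can be sketched in C++ (C++17, for `if constexpr`). This is an illustration of the technique only, not the code in this PR: the names `Mode`, `Result`, `processLines`, and `run` are hypothetical, and the three modes are a simplified stand-in for plain filtering, --count, and --label.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical output modes, loosely mirroring filter, --count, and --label.
enum class Mode { Filter, Count, Label };

struct Result {
    std::vector<std::string> lines;  // output lines (Filter and Label modes)
    std::size_t count = 0;           // matching records (Count mode)
};

// The mode is a template parameter, so `if constexpr` selects the branch at
// compile time and each instantiation's loop body contains no mode checks.
template <Mode mode, typename Pred>
Result processLines(const std::vector<std::string>& records, Pred passes)
{
    Result result;
    for (const auto& rec : records) {
        const bool ok = passes(rec);
        if constexpr (mode == Mode::Filter) {
            if (ok) result.lines.push_back(rec);
        } else if constexpr (mode == Mode::Count) {
            if (ok) ++result.count;
        } else {
            result.lines.push_back(rec + '\t' + (ok ? "1" : "0"));
        }
    }
    return result;
}

// The mode is tested once, outside the loop, to pick an instantiation.
template <typename Pred>
Result run(Mode mode, const std::vector<std::string>& records, Pred passes)
{
    switch (mode) {
        case Mode::Filter: return processLines<Mode::Filter>(records, passes);
        case Mode::Count:  return processLines<Mode::Count>(records, passes);
        default:           return processLines<Mode::Label>(records, passes);
    }
}
```

The trade-off is one copy of the loop per mode in the compiled binary, in exchange for a per-record path with fewer conditional tests, which matters most for narrow files where per-line overhead dominates.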

jondegenhardt merged commit 030993b into eBay:master on Mar 11, 2021
jondegenhardt deleted the tsv-filter-label branch on March 11, 2021 09:47