-
Notifications
You must be signed in to change notification settings - Fork 128
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix curate internal quotes take 2 #1565
Conversation
f71d105
to
ad029f6
Compare
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## master #1565 +/- ##
=======================================
Coverage 72.29% 72.29%
=======================================
Files 79 79
Lines 8276 8276
Branches 1691 1691
=======================================
Hits 5983 5983
Misses 2008 2008
Partials 285 285 ☔ View full report in Codecov by Sentry. |
Let's discuss this topic more first. I'll open an issue for it. |
Saving this here for later…
|
This construction reads a bit clearer and cleaner. It's also a good example of how to use `augur merge`. The limitation on non-seekable streams means the rule now uses additional transient disk space, but it typically shouldn't be an issue. The way Augur's slow start up time impacts `augur merge` also contributes to a longer rule execution time, but it should be negligible in the context of the larger workflow and presumably we'll fix the slow start up eventually.¹ The output is semantically identical but has some syntactic changes re: quoting. It's worth noting that the pre-existing TSV format was _not_ IANA TSV, despite it (still) being treated as such in a few places, but was (and remains) a CSV-like TSV with some quoted fields (and some mangled quotes², e.g. the "institution" column for accession KJ556895). We really need to sort out our TSV formats³, but that's for a larger project. ¹ <nextstrain/augur#1628> ² <nextstrain/augur#1565> ³ <nextstrain/augur#1566>
This construction reads a bit clearer and cleaner. It's also a good example of how to use `augur merge`. The limitation on non-seekable streams means the workflow now uses additional transient disk space, but it typically shouldn't be an issue. The way Augur's slow start up time impacts `augur merge` also contributes to a longer rule execution time, but it should be negligible in the context of the larger workflow and presumably we'll fix the slow start up eventually.¹ The output is semantically identical but has some syntactic changes re: quoting. It's worth noting that the pre-existing TSV format was _not_ IANA TSV, despite it (still) being treated as such in a few places, but was (and remains) a CSV-like TSV with some quoted fields (and some mangled quotes², e.g. the "institution" column for accession KJ556895). We really need to sort out our TSV formats³, but that's for a larger project. ¹ <nextstrain/augur#1628> ² <nextstrain/augur#1565> ³ <nextstrain/augur#1566>
I think this PR should be ready for review based on #1566 (comment). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 from me for the code. This will want a good changelog entry. We might also want to test its effects on a few existing ingest pipelines?
This construction reads a bit clearer and cleaner. It's also a good example of how to use `augur merge`. The limitation on non-seekable streams means the workflow now uses additional transient disk space, but it typically shouldn't be an issue. The way Augur's slow start up time impacts `augur merge` also contributes to a longer rule execution time, but it should be negligible in the context of the larger workflow and presumably we'll fix the slow start up eventually.¹ The output is semantically identical but has some syntactic changes re: quoting. It's worth noting that the pre-existing TSV format was _not_ IANA TSV, despite it (still) being treated as such in a few places, but was (and remains) a CSV-like TSV with some quoted fields. We really need to sort out our TSV formats³, but that's for a larger project. ¹ <nextstrain/augur#1628> ² <nextstrain/augur#1565> ³ <nextstrain/augur#1566> Ported-from: <nextstrain/measles@4d73b7f> Related-to: <nextstrain/measles#52> Related-to: <#65>
Reminder that this is approved and apparently just waiting for changelog and merge :) |
This reverts commit 915672e. Per discussion in <#1563 (comment)>, keep CSV-like TSV where quotes may be added or removed, but parsed values should be equivalent.
Resolves <#1312> We are expecting the CSV-like double quoting when there are internal quotes. If the field value is already correctly double quoted, then there should not be any additional quotes.
ad029f6
to
3f94f3e
Compare
Thanks for the ping @corneliusroemer! I've rebased onto master and added a Changelog entry. I'll merge once all the tests pass again. |
Description of proposed changes
Follow the general pattern of creating CSV-like TSVs as discussed in #1563 (comment).
We are expecting the CSV-like double quoting when there are internal quotes. If the field value is already correctly double quoted, then there should not be any additional quotes.
Related issue(s)
Resolves #1312
Checklist