augur curate I/O: validate records have same fields #1518
Merged
@@ -0,0 +1,37 @@
Setup

  $ source "$TESTDIR"/_setup.sh

Testing records are validated appropriately by augur curate.
Create NDJSON file for testing validation catches records with different fields.

  $ cat >records.ndjson <<~~
  > {"string": "string_1"}
  > {"string": "string_2"}
  > {"string": "string_3"}
  > {"string": "string_4", "number": 123}
  > ~~

This will always pass thru the records that pass validation but should raise an
error when it encounters the record with mismatched fields.

  $ cat records.ndjson | ${AUGUR} curate passthru
  ERROR: Records do not have the same fields! Please check your input data has the same fields.
  {"string": "string_1"}
  {"string": "string_2"}
  {"string": "string_3"}
  [2]

Passing the records through multiple augur curate commands should raise the
same error when it encounters the record with mismatched fields.

  $ set -o pipefail
  $ cat records.ndjson \
  >   | ${AUGUR} curate passthru \
  >   | ${AUGUR} curate passthru \
  >   | ${AUGUR} curate passthru
  ERROR: Records do not have the same fields! Please check your input data has the same fields.
  {"string": "string_1"}
  {"string": "string_2"}
  {"string": "string_3"}
  [2]
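
For readers who want to see the behaviour the test exercises, here is a minimal sketch of streaming field validation. It is not augur's implementation: `validate_same_fields` and `MismatchedFieldsError` are hypothetical names, and the sketch simply assumes that every record is compared against the fields of the first record and that valid records are passed along until the first mismatch.

```python
import json


class MismatchedFieldsError(ValueError):
    """Hypothetical error raised when a record's fields differ from the first record's."""


def validate_same_fields(records):
    """Yield records one by one, raising as soon as a record's fields differ."""
    expected_fields = None
    for record in records:
        fields = set(record)
        if expected_fields is None:
            # The first record defines the expected set of fields.
            expected_fields = fields
        elif fields != expected_fields:
            raise MismatchedFieldsError("Records do not have the same fields!")
        yield record


# Same four records as the test file above.
ndjson_lines = [
    '{"string": "string_1"}',
    '{"string": "string_2"}',
    '{"string": "string_3"}',
    '{"string": "string_4", "number": 123}',
]

passed, error = [], None
try:
    for record in validate_same_fields(json.loads(line) for line in ndjson_lines):
        passed.append(record)
except MismatchedFieldsError as exc:
    error = str(exc)

print(f"{len(passed)} records passed before the error: {error}")
```

Because the check is a generator, the records ahead of the mismatched one are already emitted by the time the error is raised, which matches the shape of the expected test output.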
[not a request for changes]

Do you have any good resources for understanding how pipes work in situations like this? I.e. `augur curate X` is printing lines one-by-one, so does an individual NDJSON line flow through all curate commands before the next one starts flowing through? That's what this output makes it seem like. But when Python flushes the print buffer must come into the equation, right? And Unix pipes presumably have some concept of backpressure / buffering?

From my understanding, the shell starts all the commands at the same time, with file descriptors arranged so that the STDOUT of the first command is the STDIN of the second, and so on down the line.

There is a buffer for the pipe, managed by the kernel: when it's full, writes block, and when it's empty, reads block. This SO answer may be helpful.

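
To make the kernel buffer concrete, here is a small sketch (Unix only) that fills a pipe nobody is reading from and counts how many bytes fit before a write would block. `measure_pipe_capacity` is a made-up helper, and the write end is switched to non-blocking mode only so the script can observe the "would block" condition instead of hanging.

```python
import os


def measure_pipe_capacity(chunk_size=4096):
    """Write to an unread pipe until the kernel buffer is full; return bytes accepted."""
    read_fd, write_fd = os.pipe()
    # Non-blocking mode turns "write would block" into BlockingIOError.
    os.set_blocking(write_fd, False)
    chunk = b"x" * chunk_size
    written = 0
    try:
        while True:
            written += os.write(write_fd, chunk)
    except BlockingIOError:
        # The buffer is full: a normal (blocking) writer would now wait for a reader.
        pass
    finally:
        os.close(read_fd)
        os.close(write_fd)
    return written


capacity = measure_pipe_capacity()
print(f"pipe accepted {capacity} bytes before a write would block")
```

The exact number is platform dependent, but it is the upper bound on how far one command in a pipeline can run ahead of the next.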
My understanding is as @genehack described ☝️ (I've only skimmed the pipe man page and have a limited understanding here).

This should depend on the buffer size, where multiple records can flow through a single command to fill up the buffer before being passed to the next command.

In the case where the first command runs into an error, it should close its "write end" so the subsequent commands will receive some end-of-file signal and terminate after writing their outputs as well. (Or exit immediately if `set -eo pipefail`.)

Thanks both! Reading those resources, this is my understanding of what's happening (using `c1` for the first curate command, etc.):

1. `c1` writes the first record to the appropriate fd, and there's no buffering on the Python side (since we are using `print()` with the default newline ending).
2. `c2` reads this more-or-less immediately and in parallel with `c1` continuing to run. It writes output to its fd, `c3` reads it, and so on.
3. `c1` reads the invalid record, prints to stderr, and exits with code 2. This is, AFAICT, seen by `c2` no differently to an "end of file", and so `c2` exits (code 0) once it's consumed all the data in its input buffer.
4. `pipefail` causes the entire pipeline to have exit code 2 because `c1` had exit code 2, but this is done after the pipeline has finished. I.e. it doesn't actually change any behaviour of the pipeline: after `c1` has exited, `c2` will continue to run while it has data to read in its input buffer, and so on.

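
The end-of-file and exit-code behaviour described above can be reproduced without augur. This is only a sketch: the two child scripts are stand-ins for the curate commands, and the last line imitates what `pipefail` reports (the status of the rightmost command that exited non-zero) rather than calling bash.

```python
import subprocess
import sys

# Stand-in for c1: emits three records, then fails like the validation error does.
producer_code = (
    "import sys\n"
    "for i in range(3):\n"
    "    print(f'record {i}')\n"
    "print('ERROR: mismatched fields', file=sys.stderr)\n"
    "sys.exit(2)\n"
)
# Stand-in for c2: copies stdin to stdout until it sees end-of-file.
consumer_code = (
    "import sys\n"
    "for line in sys.stdin:\n"
    "    sys.stdout.write(line)\n"
)

producer = subprocess.Popen(
    [sys.executable, "-c", producer_code],
    stdout=subprocess.PIPE,
    stderr=subprocess.PIPE,
)
consumer = subprocess.Popen(
    [sys.executable, "-c", consumer_code],
    stdin=producer.stdout,
    stdout=subprocess.PIPE,
)
# Drop the parent's copy of the pipe so only the two children hold it.
producer.stdout.close()

consumer_output, _ = consumer.communicate()
producer_stderr = producer.stderr.read()
producer.stderr.close()
producer_status = producer.wait()
consumer_status = consumer.returncode

# Imitation of pipefail: rightmost non-zero status, or 0 if every command succeeded.
statuses = [producer_status, consumer_status]
pipeline_status = next((s for s in reversed(statuses) if s != 0), 0)

print(consumer_output.decode(), end="")
print(f"producer={producer_status} consumer={consumer_status} pipeline={pipeline_status}")
```

The consumer never learns why its input ended; it just drains what is left in the pipe and exits normally, while the failure is only visible in the producer's exit status.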
James asked me to check his understanding, and generally yeah, it's about right AFAICT.

One small inaccuracy is that `sys.stdout` is block-buffered, not line-buffered, when not connected to a TTY. (And it can be unbuffered entirely if desired.)

If you wanted to empirically test this understanding, you could `strace` the pipeline (e.g. the `bash` process and its children) and inspect the order of read/write/exit operations for each process in the pipeline. (`strace` is a Linux tool; there are equivalents for macOS, but I'm not as adept with them.)

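
A lighter-weight check than `strace` for the buffering point is to ask a child Python process how its `sys.stdout` is configured when it is attached to a pipe. This is a sketch assuming Python 3.7 or later (where `write_through` is exposed on the text stream); `probe_stdout` is a hypothetical helper, not anything from augur.

```python
import subprocess
import sys

# The child reports how its own sys.stdout text layer is configured.
probe = "import sys; print(sys.stdout.line_buffering, sys.stdout.write_through)"


def probe_stdout(*interpreter_flags):
    """Run the probe in a child whose stdout is a pipe; return the two reported values."""
    result = subprocess.run(
        [sys.executable, *interpreter_flags, "-c", probe],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.split()


piped_default = probe_stdout()          # stdout is a pipe, default buffering
piped_unbuffered = probe_stdout("-u")   # stdout is a pipe, unbuffered via -u

print("default:", piped_default)
print("with -u:", piped_unbuffered)
```

With a piped stdout, line buffering should be reported as off, so `print()` output sits in the process's buffer until it fills or the process exits; with `-u`, writes are passed straight through instead.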
Additional things to consider, if you want to dig more into the order of 3 vs. 4, are how exactly stdout and stderr are interleaved by cram to match against the test file, and the buffering mode of Python's `sys.stderr` (which is version dependent).