Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

augur curate I/O: validate records have same fields #1518

Merged
merged 3 commits into from
Jul 3, 2024

Conversation

joverlee521
Copy link
Contributor

@joverlee521 joverlee521 commented Jul 2, 2024

Description of proposed changes

Validate records have the same fields before and after the subcommand modifies the records.

Includes a functional test to check for the input error, but there's no good way to check for the output error. Hopefully we just never see it!

Related issue(s)

Resolves #1510

Checklist

  • Checks pass
  • If making user-facing changes, add a message in CHANGES.md summarizing the changes in this PR

@joverlee521 joverlee521 requested a review from a team July 2, 2024 23:22
Copy link

codecov bot commented Jul 2, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 69.69%. Comparing base (9447c66) to head (6e35dd4).
Report is 224 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1518      +/-   ##
==========================================
+ Coverage   69.63%   69.69%   +0.05%     
==========================================
  Files          74       74              
  Lines        7812     7827      +15     
  Branches     1910     1914       +4     
==========================================
+ Hits         5440     5455      +15     
  Misses       2086     2086              
  Partials      286      286              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

[2]

Passing the records through multiple augur curate commands should raise the
same error when it encounters the record with mismatched fields.
Copy link
Member

@jameshadfield jameshadfield Jul 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[not a request for changes]

Do you have any good resources for understanding how pipes work in situations like this? I.e. the augur curate X is printing lines one-by-one, so does an individual NDJSON line flow through all curate commands before the next one starts flowing through? That's what this output makes it seem like. But when python flushes the print buffer must come into the equation right? And unix pipes presumably have some concept of backpressure / buffering?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From my understanding, the shell starts all the commands at the same time, with file descriptors arranged so that the STDOUT of the first command is the STDIN of the second, and so on down the line.

There is a buffer for the pipe, managed by the kernel, but if it's full, it blocks on the write side (and if it's empty, it blocks on the read side). This SO answer may be helpful.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My understanding is as @genehack described ☝️ (I've only skimmed the pipe man page and have a limited understanding here)

so does an individual NDJSON line flow through all curate commands before the next one starts flowing through?

This should depend on the buffer size, where multiple records can flow through a single command to fill up the buffer before being passed to the next command.

In the case where the first command runs into an error, it should close it's "write end" so the subsequent commands will receive some end-of-file signal and terminate after writing their outputs as well. (Or exit immediately if set -eo pipefail)

Copy link
Member

@jameshadfield jameshadfield Jul 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks both! Reading those resources this is my understanding of what's happening (using c1 for the first curate command, etc):

  1. c1 writes the first record to the appropriate fd, and there's no buffering on the python side (since we are using print() with the default new line ending).
  2. c2 reads this more-or-less immediately and in parallel with c1 continuing to run. It writes output to its fd, c3 reads it and so on.
  3. This happens for the first three (valid) records - they make it through all three curate commands and are written to stdout.
  4. At some point c1 reads the invalid record, prints to stderr, and exits code 2. This is, AFAICT, seen by c2 no differently to an "end of file" and so c2 exits (code 0) once it's consumed all the data in its input buffer.
  5. The pipefail causes the entire pipeline to have exit code 2 because c1 had exit code 2, but this is done after the pipeline has finished. I.e. it doesn't actually change any behaviour of the pipeline -- after c1 has exited c2 will continue to run while it has data to read on it's input buffer and so on.
  6. The order of steps (3) and (4) seems like it's a race-condition, but the fact that the pipeline's output has the error message (stderr) before the records (stdout) indicates that (4) comes before the first record has made it through the entire pipeline. Stdout seems to be line-buffered so I don't think that's important here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

James asked me to check his understanding, and generally yeah, it's about right AFAICT.

One small inaccuracy is that sys.stdout is block-buffered, not line-buffered, when not connected to a TTY. (And it can be unbuffered entirely if desired.)

If you wanted to empirically test this understanding, you could strace the pipeline (e.g. the bash process and its children) and inspect the order of read/write/exit operations for each process in the pipeline. (strace is a Linux tool; there are equivalents for macOS, but I'm not as adept with them.)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Additional things to consider if you want to dig into the order of 3 vs. 4 more is also how exactly stdout and stderr are interleaved by cram to match against the test file and the buffering mode of Python's sys.stderr (which is version dependent).

Validate records have the same fields before and after the subcommand
modifies the records.

Includes a functional test to check for the input error, but there's no
good way to check for the output error. Hopefully we just never see it!
@joverlee521 joverlee521 force-pushed the curate-io-validation branch from d84a0f4 to 9811491 Compare July 3, 2024 19:21
@joverlee521
Copy link
Contributor Author

Rebased on to master to ensure no merge conflicts in changelog and updated changelog.

@joverlee521 joverlee521 merged commit b69444a into master Jul 3, 2024
19 checks passed
@joverlee521 joverlee521 deleted the curate-io-validation branch July 3, 2024 19:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

augur curate I/O: verify NDJSON records have the same fields
4 participants