feat: Concurrent file uploads #53
Conversation
Co-authored-by: Matías Bellone <toote@users.noreply.github.com>
I would have sworn your tests would have failed, but apparently there is a bug in the stubbing library in which concurrent calls to the mock require a single stub for all calls 🤷

That said, I found that you were also missing 2 test cases:
- what happens when one of the curl calls fails
- waiting for uploads to finish (the iterations loop of the code you added)

I wanted to provide a rough implementation for you to look at, but the first one is complicated due to my first comment. And for the latter I stumbled on a bug: apparently, the `-p` option is not universal across all implementations of `ps`.
If I create a test case with 2 concurrent uploads and change the stub for `curl` to be `sleep 25; echo...`, the test still passes unless I also add an `assert_output --partial "Waiting for uploads to finish"`, in which case it fails and the output actually shows an error:

```
output (23 lines):
Uploading '3' files matching '**/*/junit-*.xml'
Uploading './tests/fixtures/junit-1.xml'...
Uploading './tests/fixtures/junit-2.xml'...
ps: unrecognized option: p
BusyBox v1.31.1 () multi-call binary.
```
tests/pre-exit-success.bats (Outdated)

```bash
assert_output --partial "curl success 1"
assert_output --partial "curl success 1"
assert_output --partial "curl success 1"
```
These three lines do not guarantee that the message appears 3 times; if `curl success 1` appears just once, they would all pass.

If you take my previous suggestion to tailor the output according to the file being uploaded, you can check for each upload independently, as in the sketch below.
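For example, a hedged sketch of what those independent checks could look like, assuming the `curl` stub is changed to echo the `data=@...` argument it received (the fixture paths follow the output shown elsewhere in this thread):

```bash
# Each assertion only matches the line for its own fixture file, so all three
# uploads must actually appear in the output for the test to pass.
assert_output --partial 'curl success data=@"./tests/fixtures/junit-1.xml"'
assert_output --partial 'curl success data=@"./tests/fixtures/junit-2.xml"'
assert_output --partial 'curl success data=@"./tests/fixtures/junit-3.xml"'
```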
Another alternative would be to actually count the lines:

```bash
assert_equal "$(echo "$output" | grep -c "curl success")" "3"
```
I implemented your alternative suggestion; I think that sufficiently covers the intent of the test.
To fix the cross-platform compatibility issues I've replaced `ps -p` with `kill -0`; it was the only check I could find that behaves consistently across platforms for testing that a pid exists. `kill -0` can have some surprising behaviour depending on the user's permissions on the pid... but given that all of the pids we are dealing with are children of the script, I think it should be fine.
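For context, a minimal sketch of the `kill -0` liveness check (not the plugin's actual code; the `upload` helper and variable name are illustrative):

```bash
upload "$file" &        # hypothetical upload started in the background
upload_pid="$!"

# kill -0 sends no signal; it only reports whether the pid can be signalled,
# which for our own child processes means "is it still running".
if kill -0 "$upload_pid" 2>/dev/null; then
  echo "Upload '${upload_pid}' is still running"
else
  echo "Upload '${upload_pid}' has finished"
fi
```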
I've been attempting to add the missing tests, but I just seem to keep running into this error:

```
2023-06-19 06:37:02 UTC  ✗ Concurrency waits when the queue is full
2023-06-19 06:37:04 UTC  (from function `unstub' in file /usr/lib/bats/bats-mock/stub.bash, line 70,
2023-06-19 06:37:04 UTC   in test file tests/pre-exit-success.bats, line 89)
2023-06-19 06:37:04 UTC  `unstub curl' failed
```

I'm not very familiar with bats; it seems as though I'm likely configuring it wrong, but bats-mock is hiding the underlying error?
Alrighty, I think I have a working implementation! Unfortunately it looks like I need to check for "error, continuing" rather than the timeout message specifically... but afaict that might be unavoidable as messages outside of the scope of the command group seem to get lost.
Regardless, it still tests that the first group is failing and that we return only one success, which might be close enough for our purposes here.
Actually, my tests suggest that nothing actually happens if the process is killed, so you cannot check for either output:

```
Uploading '3' files matching '**/*/junit-*.xml'
Uploading './tests/fixtures/junit-1.xml'...
Uploading './tests/fixtures/junit-2.xml'...
sleeping for data=@"./tests/fixtures/junit-1.xml"
sleeping for data=@"./tests/fixtures/junit-2.xml"
Upload '2678' has been running for more than '3' seconds, killing it
Upload '2679' has been running for more than '3' seconds, killing it
Uploading './tests/fixtures/junit-3.xml'...
curl success data=@"./tests/fixtures/junit-3.xml"
```

But I believe it is testing that the process does not break explosively :)
Huh, weird, because I was seeing:

```
Uploading '3' files matching '**/*/junit-*.xml'
Uploading './tests/fixtures/junit-1.xml'...
Uploading './tests/fixtures/junit-3.xml'...
sleeping for data=@"./tests/fixtures/junit-1.xml"
curl success data=@"./tests/fixtures/junit-3.xml"
/plugin/hooks/pre-exit: line 180: kill: (83) - No such process
Uploading './tests/fixtures/junit-2.xml'...
Error uploading, will continue
```

I wonder why we're seeing different output 🤔. Probably not a big deal, interesting though.

Edit: The test command I am using is `docker run --rm -ti -e CURL_STUB_DEBUG=/dev/tty -v $PWD:/plugin buildkite/plugin-tester:v4.1.0 bats tests/pre-exit-success.bats -f 'Concurrency gracefully handles command-group timeout'`
When I run it, the first batch is `1` and `2` and the second batch is just `3`, so the first two processes get killed. In your run, the first batch was `1` and `3`; only one of those gets killed, so the second one generates the `No such process` line (which points to the fact that the command needs stderr redirected to `/dev/null`). It may be related to how files are laid out in the filesystem, which affects the order in which `find` reports them back.

Most interestingly: the last one fails because the `curl` stub only has one line (in your scenario it would need two) and that causes the `Error uploading` line; otherwise you should be seeing `sleeping for` on the `2` file as well. Even if the test succeeded on the output, it would fail when unstubbing `curl` :(

Instead of trying to figure out exactly how to work around this non-determinism, I would suggest you add a `| sort` to the `FIND_CMD[@]` call in line 156 so that the order of files is always the same no matter what — see the sketch below.
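A rough sketch of that suggestion (the loop shape and the `upload` helper are illustrative, not the hook's actual code; the key change is piping the `find` output through `sort`):

```bash
# Sort the matched report files so their order is deterministic across
# filesystems before handing them to the upload loop.
while IFS= read -r file; do
  upload "$file"
done < <("${FIND_CMD[@]}" | sort)
```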
Yep, that fixed that issue :)
fix: Concurrency tests
Co-authored-by: Matías Bellone <toote@users.noreply.github.com>
fix: Cross-platform compatibility
* fix: Use timeout var instead of magic numbers for concurrency
* fix: Add missing concurrency tests
fix: Various test fixes and new timeout test
hooks/pre-exit (Outdated)

```bash
if [[ "$ANNOTATE" != "false" ]]; then
  save-report-url "${CURL_RESP_FILE}"
fi
```
I just realized there is a bug here: the file is shared among all `upload` calls, so there is a race condition as to which URL is saved for annotations and which are not. There are two alternatives to fix this:

1- make a note/check that annotation is not compatible with concurrent uploads

2- change the `upload` function to accept another parameter that defines the filename to save the response to, which you can then pass through to `save-report-url` (see the sketch below)

Whichever you prefer, I'll be OK with it.
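For illustration, a hedged sketch of alternative 2 (the function body, the `$UPLOAD_URL` variable, and the argument layout are placeholders, not the plugin's actual code):

```bash
# Each upload writes its curl response to its own file and passes that same
# file on to save-report-url, so concurrent uploads never race on one path.
upload() {
  local report_file="$1"
  local resp_file="$2"

  curl --silent --output "$resp_file" --form "data=@\"${report_file}\"" "$UPLOAD_URL"

  if [[ "$ANNOTATE" != "false" ]]; then
    save-report-url "$resp_file"
  fi
}

upload "./tests/fixtures/junit-1.xml" "$(mktemp)" &
```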
Ah, good catch. I'll investigate this further.
I think this is fixed now: I use `mktemp` to generate a temporary file in the command group if `CURL_RESP_FILE` is unset.
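Roughly, the idiom described here looks like the following (illustrative only, not the exact hook code):

```bash
# Inside the per-upload command group: fall back to a fresh temp file when
# CURL_RESP_FILE is not set, so concurrent uploads never share a response file.
resp_file="${CURL_RESP_FILE:-$(mktemp)}"
```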
> I think this is fixed now: I use `mktemp` to generate a temporary file in the command group if `CURL_RESP_FILE` is unset.

Hmmm it seems there is a failing test, but I'm unable to reproduce. The tests run successfully both locally for me and on our buildkite cluster.
I've been thinking about what can be done about it without actually committing some debugging information, and the best I could come up with would be to change that test's assertion (the one that is failing) to one that is more standard, so that more information is printed when it fails. In particular, do something like:

```bash
assert_line --regexp '^curl success .*junit-1.xml"'
assert_line --regexp '^curl success .*junit-2.xml"'
assert_line --regexp '^curl success .*junit-3.xml"'
```

While this may not cause the test to succeed when it wasn't, it will probably show more information (including the whole output) when it does fail.

Based on the explanation below, I believe we need to pick one of two bad options:

1- skip the failing test (with a message)

This would leave the tests ready to be run when the underlying bats-mock library is fixed (probably with some modifications), but it means that the functionality may break if the tests are never run.

2- ask someone to re-run the test until the race condition is not triggered

This takes advantage of all the hard work you've done and actually tests this feature. On the other hand, not only does it put additional burden on the staff when the test fails, it also reduces the reliability of the tests as an indicator that something is wrong with changes, for every PR and merge going forward.

@pzeballos: do you (or anyone else in the team) have any opinion on the matter?

Note: While writing the suggestion and making sure that it would work, I was able to duplicate the issue. There are actually two scenarios in which the test can fail, but both are due to a race condition in bats-mock not correctly supporting concurrent execution of stubs. In the first scenario (which is easier to explain), the 3rd call runs after the first and second ones run concurrently. Because there is a single stub for…

In the other scenario, which I believe is what happened in the run of your commit, the 3rd stub is executed after the first one finished but before the second one does.
😳 this is definitely a tricky case! bats 😅 And we'll see what we can do to bring the tests back (I manually retried the step until it passed). Thank you very much @js-murph for all your efforts and the amazing contribution you are making to the plugin 🤗 Do you think you can do that last change (add a skip with a message)? 🙏🏻 I believe that then we'll be ready to finally merge it 🤞🏻
Thank you both! I've pushed a skip on the …
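For reference, a minimal sketch of what skipping a test with a message looks like in bats (the test name and message below are placeholders, not necessarily the exact ones that were skipped):

```bash
@test "Concurrency waits when the queue is full" {
  # Skip with an explanatory message; the original body stays in place so the
  # test is ready to re-enable once the underlying bats-mock issue is fixed.
  skip "bats-mock does not reliably support concurrent stub invocations"

  # ... original test body ...
}
```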
Thank you very much @js-murph! 🤗
👏
What?

This adds the ability for the test-collector to upload multiple files in parallel.

Why?

For large projects we are seeing instances where performance is significantly slower than alternatives for displaying test-summary annotations, sometimes on the order of several minutes slower. This adds an option for configuring the concurrency of the uploads, which in my testing so far has helped halve the time spent performing this work. A rough sketch of the general pattern is below.
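A minimal sketch of the concurrency pattern described here (the option name, `upload` helper, and `report_files` array are illustrative placeholders, not the plugin's actual code):

```bash
MAX_CONCURRENCY="${UPLOAD_CONCURRENCY:-1}"  # option name is illustrative

for file in "${report_files[@]}"; do
  upload "$file" &  # each upload runs as a background job

  # If we've hit the concurrency limit, wait until a slot frees up.
  while [[ "$(jobs -rp | wc -l)" -ge "$MAX_CONCURRENCY" ]]; do
    echo "Waiting for uploads to finish"
    sleep 1
  done
done

wait  # block until any remaining uploads complete before exiting
```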