Some Taskcluster runs aren't uploaded to wpt.fyi successfully #604

Closed
foolip opened this issue Oct 1, 2018 · 6 comments
Labels
bug (Something isn't working)

Comments

foolip (Member) commented Oct 1, 2018

https://wpt.fyi/test-runs currently shows stable and beta runs from Taskcluster, but not the dev run.

https://api.github.com/repos/web-platform-tests/wpt/commits/ee2e69bfb1d44c4013a8ce94ca6932f86d63aa31/statuses does include a "TaskGroup: success" status pointing to https://tools.taskcluster.net/groups/ecpaEJHuRfmPunkTiCmsKQ, which appears to be in good condition.

This is the weekly run, which is why I noticed. It's possible that runs are being randomly dropped in a way that is less noticeable.

@Hexcles, can you investigate?

foolip added the bug (Something isn't working) label on Oct 1, 2018
Hexcles changed the title from "Experimental runs from Taskcluster for commit ee2e69bfb1 were not received" to "Some Taskcluster runs aren't uploaded to wpt.fyi successfully" on Oct 2, 2018
Hexcles (Member) commented Oct 2, 2018

The working theory is that my webhook handler takes too long to respond, sometimes close to 1 minute, which is well beyond GitHub's 10-second timeout for webhooks. The client (GitHub's server) then closes the connection, which sometimes causes App Engine to terminate the request-handling thread, leaving the processing incomplete.

We should make sure the webhook responds within 10 seconds, which I think is entirely feasible given that there are quite a few HTTP requests we can parallelize.
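
To make the parallelization concrete, here is a minimal sketch (not the actual wpt.fyi handler; the helper names are hypothetical) of fetching all artifacts concurrently so the total time is roughly the slowest single download rather than the sum of all of them:

```python
# Minimal sketch, assuming the handler has a list of artifact URLs to fetch.
# Not the actual wpt.fyi code; names here are illustrative only.
from concurrent.futures import ThreadPoolExecutor

import requests


def fetch_artifact(url):
    """Download one artifact and return its raw bytes."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return response.content


def fetch_all_artifacts(urls):
    """Fetch all artifacts in parallel instead of sequentially."""
    with ThreadPoolExecutor(max_workers=8) as pool:
        # map() preserves input order; worker exceptions surface when iterated.
        return list(pool.map(fetch_artifact, urls))
```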

foolip (Member, Author) commented Oct 3, 2018

@Hexcles would it work to download the results at the other end of the task queue, or is the contract that once the results receiver API has returned, the client no longer needs to keep the results around for some unknown number of minutes or hours?
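
For context, the hand-off being described here could look roughly like this (a sketch only, shown with App Engine's Python push task queue API for illustration; the worker path and parameter name are hypothetical, and this is not what wpt.fyi actually does):

```python
# Sketch of the task-queue alternative: the webhook enqueues a task and
# returns immediately; a separate worker downloads the results later.
# Hypothetical endpoint and parameter names, for illustration only.
from google.appengine.api import taskqueue


def handle_status_webhook(task_group_id):
    # Acknowledge GitHub right away; the heavy lifting happens in the
    # worker behind /tasks/process-results (a hypothetical handler).
    taskqueue.add(
        url='/tasks/process-results',
        params={'task_group_id': task_group_id},
        queue_name='default',
    )
```

The open question in the comment is whether the artifacts are guaranteed to stay available until such a deferred task runs.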

Hexcles (Member) commented Oct 3, 2018

@foolip yeah that would also work, but I think the download should be well within 10s once parallelized.

Hexcles (Member) commented Oct 3, 2018

I did a little more digging into the logs. In addition to the potential timeout issue, the Taskcluster API itself also fails occasionally: the endpoint for downloading test artifacts has failed a few times in the past week. I suspect we are hitting the endpoint too fast and/or too soon after a task finishes (the artifacts might not yet be available in their cloud storage). I'll add retries.
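
A sketch of what the retry could look like (a hypothetical helper, not the actual change), using exponential backoff to give the artifacts time to appear in cloud storage:

```python
# Hypothetical retry helper with exponential backoff, illustrating the idea
# of tolerating artifacts that are not yet available right after a task ends.
import time

import requests


def fetch_artifact_with_retry(url, attempts=5, base_delay=1.0):
    for attempt in range(attempts):
        try:
            response = requests.get(url, timeout=30)
            response.raise_for_status()
            return response.content
        except requests.RequestException:
            if attempt == attempts - 1:
                raise  # Give up after the final attempt.
            # Back off exponentially: 1s, 2s, 4s, 8s, ...
            time.sleep(base_delay * (2 ** attempt))
```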

Hexcles (Member) commented Oct 18, 2018

The latency of the effective webhook requests (the ones that actually upload results, not the no-op ones) has decreased from 85s to 5s, which is well within the GitHub timeout (10s). And with the retry mechanism built in, I believe this is largely solved.

Now we just need to wait for a prod release.

Hexcles (Member) commented Nov 13, 2018

Closing this unless it happens again.

Hexcles closed this as completed on Nov 13, 2018