Some Taskcluster runs aren't uploaded to wpt.fyi successfully #604

Closed
foolip opened this issue Oct 1, 2018 · 6 comments
Labels
bug (Something isn't working)

Comments

foolip (Member) commented Oct 1, 2018

https://wpt.fyi/test-runs currently shows stable and beta runs from Taskcluster, but not the dev run.

https://api.github.com/repos/web-platform-tests/wpt/commits/ee2e69bfb1d44c4013a8ce94ca6932f86d63aa31/statuses does include a "TaskGroup: success" status pointing to https://tools.taskcluster.net/groups/ecpaEJHuRfmPunkTiCmsKQ, which appears to be in good condition.

This is the weekly run, which is why I noticed. It's possible that runs are being randomly dropped in a way that is less noticeable.

@Hexcles, can you investigate?

foolip added the bug (Something isn't working) label on Oct 1, 2018
Hexcles changed the title from "Experimental runs from Taskcluster for commit ee2e69bfb1 were not received" to "Some Taskcluster runs aren't uploaded to wpt.fyi successfully" on Oct 2, 2018
Hexcles (Member) commented Oct 2, 2018

The working theory is that my webhook handler takes too long to respond, sometimes close to 1 minute, which is well beyond GitHub's 10-second timeout for webhooks. The client (GitHub's server) then closes the connection, which sometimes causes App Engine to terminate the request-handling thread, leaving the processing incomplete.

We should make sure the webhook responds within 10 seconds, which I think is entirely feasible given that there are quite a few HTTP requests we can parallelize.
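
To make the parallelization concrete, here is a minimal sketch (not the actual wpt.fyi handler; the helper names are hypothetical) of fetching all artifacts concurrently so the total time is roughly the slowest single download rather than the sum of all of them:

```python
# Minimal sketch, assuming the handler has a list of artifact URLs to fetch.
# Not the actual wpt.fyi code; names here are illustrative only.
from concurrent.futures import ThreadPoolExecutor

import requests


def fetch_artifact(url):
    """Download one artifact and return its raw bytes."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return response.content


def fetch_all_artifacts(urls):
    """Fetch all artifacts in parallel instead of sequentially."""
    with ThreadPoolExecutor(max_workers=8) as pool:
        # map() preserves input order; worker exceptions surface when iterated.
        return list(pool.map(fetch_artifact, urls))
```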

foolip (Member, Author) commented Oct 3, 2018

@Hexcles would it work to download the results at the other end of the task queue, or is the contract that once the results receiver API has returned, the client no longer needs to keep the results around for some unknown number of minutes or hours?
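
For context, the hand-off being described here could look roughly like this (a sketch only, shown with App Engine's Python push task queue API for illustration; the worker path and parameter name are hypothetical, and this is not what wpt.fyi actually does):

```python
# Sketch of the task-queue alternative: the webhook enqueues a task and
# returns immediately; a separate worker downloads the results later.
# Hypothetical endpoint and parameter names, for illustration only.
from google.appengine.api import taskqueue


def handle_status_webhook(task_group_id):
    # Acknowledge GitHub right away; the heavy lifting happens in the
    # worker behind /tasks/process-results (a hypothetical handler).
    taskqueue.add(
        url='/tasks/process-results',
        params={'task_group_id': task_group_id},
        queue_name='default',
    )
```

The open question in the comment is whether the artifacts are guaranteed to stay available until such a deferred task runs.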

Hexcles (Member) commented Oct 3, 2018

@foolip yeah that would also work, but I think the download should be well within 10s once parallelized.

Hexcles (Member) commented Oct 3, 2018

I did a little more digging into the logs. In addition to the potential timeout issue, the Taskcluster API itself also fails occasionally: the endpoint for downloading test artifacts has failed a few times in the past week. I suspect we are hitting the endpoint too fast and/or too soon after a task finishes (the artifacts might not yet be available in their cloud storage). I'll add retries.
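
A sketch of what the retry could look like (a hypothetical helper, not the actual change), using exponential backoff to give the artifacts time to appear in cloud storage:

```python
# Hypothetical retry helper with exponential backoff, illustrating the idea
# of tolerating artifacts that are not yet available right after a task ends.
import time

import requests


def fetch_artifact_with_retry(url, attempts=5, base_delay=1.0):
    for attempt in range(attempts):
        try:
            response = requests.get(url, timeout=30)
            response.raise_for_status()
            return response.content
        except requests.RequestException:
            if attempt == attempts - 1:
                raise  # Give up after the final attempt.
            # Back off exponentially: 1s, 2s, 4s, 8s, ...
            time.sleep(base_delay * (2 ** attempt))
```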

Hexcles (Member) commented Oct 18, 2018

The latency of the effective webhook requests (the ones that actually upload results, not the no-op ones) has decreased from 85s to 5s, which is well within the GitHub timeout (10s). And with the retry mechanism built in, I believe this is largely solved.

Now we just need to wait for a prod release.

Hexcles (Member) commented Nov 13, 2018

Closing this unless it happens again.

Hexcles closed this as completed on Nov 13, 2018