
What's wrong with the PR tests? #19373

Closed
Neutron3529 opened this issue Oct 18, 2020 · 8 comments
Comments

@Neutron3529
Contributor

Currently, many PRs are failing due to bad network connections.
This affects not only my own PR but many others as well.

```
[2020-10-17T00:02:07.148Z] 2020-10-17 00:02:05,525 - root - ERROR - ('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))
[2020-10-17T00:02:07.148Z] Traceback (most recent call last):
[2020-10-17T00:02:07.148Z]   File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 601, in urlopen
[2020-10-17T00:02:07.148Z]     chunked=chunked)
[2020-10-17T00:02:07.148Z]   File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 387, in _make_request
[2020-10-17T00:02:07.148Z]     six.raise_from(e, None)
[2020-10-17T00:02:07.148Z]   File "<string>", line 3, in raise_from
[2020-10-17T00:02:07.148Z]   File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 383, in _make_request
[2020-10-17T00:02:07.148Z]     httplib_response = conn.getresponse()
[2020-10-17T00:02:07.148Z]   File "/usr/lib/python3.6/http/client.py", line 1346, in getresponse
[2020-10-17T00:02:07.148Z]     response.begin()
[2020-10-17T00:02:07.148Z]   File "/usr/lib/python3.6/http/client.py", line 307, in begin
[2020-10-17T00:02:07.148Z]     version, status, reason = self._read_status()
[2020-10-17T00:02:07.148Z]   File "/usr/lib/python3.6/http/client.py", line 268, in _read_status
[2020-10-17T00:02:07.148Z]     line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
[2020-10-17T00:02:07.148Z]   File "/usr/lib/python3.6/socket.py", line 586, in readinto
[2020-10-17T00:02:07.148Z]     return self._sock.recv_into(b)
[2020-10-17T00:02:07.148Z] ConnectionResetError: [Errno 104] Connection reset by peer
```

Such an exception cannot be resolved by any PR.
I have re-run the tests at least 5 times, and most of the runs failed.

What's wrong?
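
(For context: transient resets like the one above are typically mitigated by retrying the download with backoff. Below is a minimal sketch using requests with urllib3's Retry; the URL, timeout, and retry parameters are illustrative assumptions, not the project's actual CI code.)

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# Retry up to 5 times with exponential backoff on connection errors
# and on common transient HTTP status codes.
retries = Retry(total=5, backoff_factor=1,
                status_forcelist=[429, 500, 502, 503, 504])

session = requests.Session()
session.mount("https://", HTTPAdapter(max_retries=retries))
session.mount("http://", HTTPAdapter(max_retries=retries))

# Illustrative URL; a connection reset here is retried instead of
# immediately failing the CI step.
response = session.get("https://example.com/some/artifact", timeout=30)
response.raise_for_status()
```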

@leezu
Contributor

leezu commented Oct 19, 2020

It's a CI issue and needs to be fixed. It's unrelated to your PR.

cc @sandeep-krishnamurthy @josephevans

@leezu added the CI label and removed the needs triage label on Oct 19, 2020
@josephevans
Contributor

josephevans commented Oct 19, 2020

I think there are two issues today. First, I found an expired GPG key that was preventing R packages from being installed on Ubuntu 16.04. I created PR #19377 for this and am waiting for it to pass before backporting.
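
(For reference, refreshing an expired apt signing key on Ubuntu 16.04 usually looks like the sketch below; the keyserver and key ID are illustrative assumptions, so substitute whatever key apt actually reports as expired. See PR #19377 for the real fix.)

```sh
# Re-import the repository signing key that apt reports as expired
# (the key ID shown is illustrative), then refresh the package lists.
apt-key adv --keyserver keyserver.ubuntu.com \
    --recv-keys E298A3A825C0D65DFD57CBB651716619E084DAB9
apt-get update
```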

Second, I see errors trying to uncompress the Packages file from https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/x86_64/

It looks like a new file was recently pushed, so this may already have been fixed by NVIDIA.

@leezu
Contributor

leezu commented Oct 19, 2020

@josephevans thank you for looking into the issue. But please note that the issues you mention are unrelated as they only affect the v1.x branch, whereas the issue described here affects the master branch.

@leezu
Contributor

leezu commented Oct 19, 2020

The NVIDIA issue may affect more than the Ubuntu 16.04 case mentioned above. NVIDIA/nvidia-docker#1402 contains some more info (though that issue was closed, as it wasn't directed at the correct owner).

@josephevans
Contributor

Ok, I believe I finally found the culprit. The AMIs used for our Jenkins slaves have auto-update turned on, and based on the logfiles of the slave instances, it looks like Docker was being auto-updated and restarted, which killed the log output of the containers (and therefore the Jenkins jobs).

I've created a new AMI for mxnetlinux_cpu hosts with updated software versions, which also adds an option to the docker config to hopefully prevent this in the future. See https://docs.docker.com/config/containers/live-restore/ - Thanks @leezu for the recommendation.
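
(For reference, the linked live-restore option is enabled in the Docker daemon configuration. Below is a minimal sketch of /etc/docker/daemon.json, assuming no other daemon options are set; with it enabled, containers keep running while the daemon restarts or upgrades.)

```json
{
  "live-restore": true
}
```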

@josephevans
Contributor

CI seems much more stable today with the new AMI. I released an updated AMI to address ARMv8 test failures caused by the qemu installation. We should no longer be seeing the Docker connection issues (or unexpected EOF errors), and 2 of the 3 PRs to fix the other CI issue (the expired GPG key) have been merged.

This issue can be closed.

@sandeep-krishnamurthy
Contributor

Thank you so much Joe.

@Neutron3529
Contributor Author

Seems the CI works now; closing this issue.
