[CI] Jenkin CI server (xgboost-ci.net) down for emergency maintenance #4921

Closed
hcho3 opened this issue Oct 9, 2019 · 6 comments
@hcho3
Collaborator

hcho3 commented Oct 9, 2019

All Jenkins jobs are now failing due to broken SSL connections, so both git clone and S3 uploads fail. I suspect this is caused by outdated software packages in the worker images we are using.

At any rate, I will need to re-generate the worker images (AMIs), so I'm shutting down the server for a few hours, possibly up to a day. I will do my best to bring the server back to normal.
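One way to confirm the symptom on a worker is to attempt a TLS handshake from the worker itself. The following is only a minimal sketch: `probe_tls` is a hypothetical helper name, not part of the CI setup, and which hosts to probe is an assumption.

```python
import socket
import ssl


def probe_tls(host, port=443, timeout=5.0):
    """Attempt a TLS handshake with host:port and return (ok, detail).

    On success, detail is the negotiated protocol version (e.g. "TLSv1.2").
    On failure, detail names the exception that was raised.
    """
    context = ssl.create_default_context()
    try:
        with socket.create_connection((host, port), timeout=timeout) as raw:
            with context.wrap_socket(raw, server_hostname=host) as tls:
                return True, tls.version()
    except OSError as exc:
        # ssl.SSLError, timeouts and refused connections are all OSError subclasses.
        return False, f"{type(exc).__name__}: {exc}"


if __name__ == "__main__":
    # The OpenSSL build Python is linked against; an old version here would
    # be consistent with the "outdated packages" suspicion.
    print("Local OpenSSL:", ssl.OPENSSL_VERSION)
```

Calling, for example, `probe_tls("github.com")` on an affected worker and on a healthy machine and comparing the results would show whether the handshake itself is what fails. Note that this exercises Python's OpenSSL, which may differ from the TLS library the system Git is linked against.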

@dmlc/xgboost-committer

@hcho3
Collaborator Author

hcho3 commented Oct 9, 2019

"Wouldn't Docker solve the issue of outdated packages?" At the Checkout stage, the CI worker uses the system Git to check out the XGBoost repository. The Docker image is not available at that point, since the Dockerfile is itself part of the repository. Hence, over time, we would still be affected by outdated system packages. Also, unfortunately, we cannot use Docker on Windows yet (NVIDIA/nvidia-docker#429).

@hcho3 hcho3 self-assigned this Oct 9, 2019
@hcho3 hcho3 changed the title from "[CI] Jenkin CI server (xgboost-ci.net) down for maintanence" to "[CI] Jenkin CI server (xgboost-ci.net) down for emergency maintenance" Oct 9, 2019
@trivialfis
Member

Thanks for looking into this.

@hcho3
Collaborator Author

hcho3 commented Oct 9, 2019

Update: git clone is now working correctly.

https://xgboost-ci.net/blue/organizations/jenkins/xgboost/detail/PR-4924/2/pipeline/12

@hcho3
Collaborator Author

hcho3 commented Oct 9, 2019

Note to self: software packages needed to set up the Linux workers.

  • System-wide update (apt-get upgrade)
  • Python 3.x + pip (python3, python3-pip)
  • awscli Python package
  • Git
  • Docker CE
  • OpenJDK 8
  • CUDA (for GPU only)
  • NVIDIA docker (for GPU only)
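A freshly built worker image could be sanity-checked against this list with a short script. This is a sketch under assumptions: the command names below are guesses mapped from the packages above (for instance `aws` for awscli, `nvidia-smi` as a proxy for the GPU stack, `nvidia-docker` for NVIDIA docker), and `missing_tools` is a hypothetical helper, not something the CI actually runs.

```python
import shutil

# Command names are assumptions derived from the package list above.
COMMON_TOOLS = ["python3", "pip3", "aws", "git", "docker", "java"]
GPU_TOOLS = ["nvidia-smi", "nvidia-docker"]  # GPU workers only


def missing_tools(gpu=False, which=shutil.which):
    """Return the expected commands that cannot be found on PATH.

    `which` is injectable so the check can be exercised without
    touching the real system.
    """
    expected = COMMON_TOOLS + (GPU_TOOLS if gpu else [])
    return [name for name in expected if which(name) is None]


if __name__ == "__main__":
    missing = missing_tools(gpu=False)
    print("missing:", ", ".join(missing) if missing else "none")
```

Running it with `gpu=True` on a GPU worker additionally checks the two GPU-only commands. It only verifies that the commands exist, not that their versions are recent enough.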

@hcho3
Collaborator Author

hcho3 commented Oct 9, 2019

@hcho3
Collaborator Author

hcho3 commented Oct 9, 2019

It looks like Jenkins CI is back to normal. However, I will keep a close eye on it today and tomorrow to make sure everything is all right.

Note: as a bonus, the Linux GPU workers are now using shiny new Turing GPUs (G4 instances).

@hcho3 hcho3 closed this as completed Oct 9, 2019
@lock lock bot locked as resolved and limited conversation to collaborators Jan 8, 2020