This repository holds the configuration as code for the CI of the upstream TVM project hosted on GitHub. Specifically, it currently handles the configuration of a public Jenkins instance, located at https://ci.tlcpack.ai, as well as a dedicated set of worker nodes.
The diagram below details the individual parts that interact in TVM's CI. For details on operations, see https://github.com/tlc-pack/ci.
graph TD
Commit --> GitHub
GitHub --> |`push` webhook| WebhookServer(Webhook Server)
JobExecutor(Job Executor)
WebhookServer --> JobExecutor
JobExecutor --> EC2Fleet(EC2 Fleet Plugin)
EC2Fleet --> |capacity request| EC2(EC2 Autoscaler)
JobExecutor --> WorkerEC2Instance
Docker --> |build cache, artifacts| S3
WorkerEC2Instance --> Docker
Docker --> |docker pull| G(Docker Hub)
Docker --> |docker push / pull| ECR
Docker --> |Execute jobs| CIScripts(CI Scripts)
RepoCITerraform(ci-terraform repo) --> |terraform| ECR
RepoCITerraform(ci-terraform repo) --> |terraform| EC2
RepoCITerraform(ci-terraform repo) --> |terraform| S3
RepoCI(ci repo) --> |configuration via Ansible| WorkerEC2Instance
RepoCIPacker(ci-packer) --> |AMIs| EC2
Monitoring_Scrapers(Jenkins Scraper) --> Monitoring_DB(Postgres)
Grafana --> Monitoring_DB
GitHub --> Windows
GitHub --> MacOS
Developers --> |check PR status|JenkinsUI(Jenkins Web UI)
Monitoring_Scrapers --> |fetch job data| JenkinsUI
Developers --> |git push| Commit
Developers --> |create PR| GitHub
subgraph Jenkins Head Node
WebhookServer
JobExecutor
EC2Fleet
JenkinsUI
end
subgraph GitHub Actions
Windows
MacOS
end
subgraph Configuration / Terraform
RepoCITerraform
RepoCI
RepoCIPacker
end
subgraph Monitoring
Monitoring_DB
Grafana
Monitoring_Scrapers
end
subgraph AWS
subgraph Jenkins Workers
WorkerEC2Instance(Worker EC2 Instance)
subgraph "Worker EC2 Instance"
Docker
CIScripts
end
end
EC2
ECR
S3
end
- jenkins: configuration for the Jenkins head node
- terraform: Terraform code to provision CI resources in AWS
- packer: Packer configurations for AWS AMIs
Restarting Jenkins is an occasional but necessary service interruption. To minimize developer impact when updating TVM's Jenkins, follow these steps:
- Notify users:
  - Message the TVM Discord with a couple of hours' notice, for example:
    PSA that we'll be restarting Jenkins soon to <insert reason> -- we will need to retrigger in-flight builds as part of this process, so expect CI slowdowns for the next few hours.
  - In Jenkins, under Manage Jenkins > Configure System > System Message, set the message to something like:
    <p style="text-align: center; padding: 10px; background-color: #dc5f5f; font-weight: bold; color: white; border-radius: 8px;">Jenkins will restart on 3/22/22 at 10 AM PDT (<a style="color: #c4e9ff" href="https://discuss.tvm.apache.org/t/ci-jenkins-restart-tuesday-3-21-22/12366/2">details</a>)</p>
- Save a list of in-flight jobs (e.g. by saving the webpage at ci.tlcpack.ai to disk)
- Ensure the latest Terraform definitions have been applied via the terraform_apply.yml workflow
- Pull the Terraform output to the head node by running prepare.yml
- Trigger a workflow_dispatch event for the deploy.yml workflow
- Wait for Jenkins to come up (5-ish minutes)
- Cancel any jobs that Jenkins re-queued (due to a Jenkins issue, old jobs may be re-scheduled). Restart any jobs that sent webhooks while Jenkins was down. These JavaScript snippets can help:

    // cancel all jobs from the main Jenkins page at ci.tlcpack.ai
    const cancel = (x) => {
      // the cancel/terminate icon is wrapped in a link to the stop URL
      let href = x.parentNode.href;
      console.log(href);
      // Ajax.Request is provided by the Prototype.js library bundled with the Jenkins page
      new Ajax.Request(href);
    };
    document.querySelectorAll("img[alt=\"cancel this build\"]").forEach(cancel);
    document.querySelectorAll("img[alt=\"terminate this build\"]").forEach(cancel);

    // list unique in-flight and queued job URLs from the saved webpage HTML
    // in-flight builds come from the executors panel
    const builds = Array.from(document.getElementById("executors").querySelectorAll("table[tooltip]")).map(x => x.previousSibling.href);
    // queued builds come from the build queue panel
    document.getElementById("buildQueue").querySelectorAll("a[tooltip]").forEach(a => {
      builds.push(a.href);
    });
    // de-duplicate before printing
    for (const b of new Set(builds)) {
      console.log(b);
    }
- Check the in-flight PRs from the saved list of jobs and restart the builds for those that didn't automatically restart. Some PRs may also have been submitted while Jenkins was down, in which case no Jenkins status will be reported to the PR. These PRs will need to be re-pushed by the author.
- Monitor CI for the next day to ensure that autoscaled nodes are being allocated / deallocated as necessary
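The step of comparing the saved list of in-flight jobs against what Jenkins is running after the restart can be sketched as a small helper. This is only an illustration, not part of this repository: the function names `toJobUrl` and `findJobsToRestart` are hypothetical. It assumes that build URLs end in a numeric build number (so that a retriggered build with a new number still maps to the same job), and that both URL lists were collected with a snippet like the one in the steps above.

```javascript
// Hypothetical helper, not part of this repository.
// Assumption: a build URL is the job URL followed by a numeric build number,
// so stripping that number yields a key that survives a retrigger.
function toJobUrl(buildUrl) {
  return buildUrl
    .trim()
    .replace(/\/+$/, "")   // drop trailing slashes
    .replace(/\/\d+$/, ""); // drop a trailing build number, if any
}

// Returns the jobs that were in flight before the restart but are neither
// running nor queued afterwards, i.e. the ones that need a manual restart.
function findJobsToRestart(savedUrls, currentUrls) {
  const current = new Set(currentUrls.map(toJobUrl));
  const saved = [...new Set(savedUrls.map(toJobUrl))];
  return saved.filter((job) => !current.has(job));
}
```

For example, pass the URLs printed from the saved page as `savedUrls` and the URLs printed from the live page after the restart as `currentUrls`; whatever comes back is the list of jobs to retrigger by hand.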
Dashboards of CI data can be found:
- within Jenkins at https://ci.tlcpack.ai/monitoring (HTTP / JVM stats)
- at https://monitoring.tlcpack.ai (job status, worker status)
- If the Build Queue section on https://ci.tlcpack.ai/ is filled with jobs, hover over them to see their time in queue. If it is longer than a few minutes, it is likely that the CI autoscalers are at capacity. Verify this by comparing the fleet size under EC2 Fleet Status at the bottom and the limits set in https://ci.tlcpack.ai/configureClouds/.
- If Jenkins is slow or unresponsive, SSH into the head node and restart Jenkins manually via:
    sudo systemctl restart jenkins
- If Jenkins nodes are running out of disk, SSH into the node and check what is taking up disk space. The limits in Terraform may need to be changed but should be kept as low as possible.
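The queue check in the first item above comes down to two comparisons: is the longest wait over a few minutes, and has the fleet reached its configured limit? A minimal sketch of that reasoning follows. The function name `likelyAtCapacity`, its field names, and the 5-minute default threshold are all assumptions made for illustration; the values themselves would be read by hand from the Build Queue tooltips, the EC2 Fleet Status panel, and the cloud configuration page.

```javascript
// Hypothetical helper, not part of this repository.
// Inputs are values read manually from the Jenkins UI:
//   queueWaitMinutes: time in queue for each queued job
//   fleetSize:        current size shown under EC2 Fleet Status
//   fleetLimit:       maximum size configured for that fleet
//   maxWaitMinutes:   assumed stand-in for "a few minutes"
function likelyAtCapacity({ queueWaitMinutes, fleetSize, fleetLimit, maxWaitMinutes = 5 }) {
  const longestWait = Math.max(0, ...queueWaitMinutes);
  const queueIsSlow = longestWait > maxWaitMinutes;
  const fleetIsFull = fleetSize >= fleetLimit;
  return {
    longestWait,
    queueIsSlow,
    fleetIsFull,
    // Long waits with a full fleet point at the autoscaler limit;
    // long waits with spare capacity point at something else.
    atCapacity: queueIsSlow && fleetIsFull,
  };
}
```

If `queueIsSlow` is true while `fleetIsFull` is false, the autoscaler limit is probably not the cause and the delay needs a different explanation.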