Can use of the "resume build" button be tracked? #1969

Closed
sam-github opened this issue Oct 18, 2019 · 3 comments

@sam-github
Contributor

Detecting flakiness of builds is currently left to human beings, because when a PR fails to build, the cause isn't necessarily flaky tests or flaky infrastructure; the PR itself could have a problem.

It occurs to me that there is a case where we can be pretty sure that the problem isn't the PR: when the build is "resumed" and the same SHA builds successfully. This could mean that a change introduced in the PR is actually flaky, because it only sometimes passes, but humans usually interpret this as "my PR is good, something else was flaky", which is also the interpretation of node-core-utils.

Is it possible to get from Jenkins a report on when builds were resumed, and what specifically passed on resume that had failed last time?

It strikes me it might be a gold mine for fixing flakiness in our CI.
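To make "what specifically passed on resume that had failed last time" concrete, here is a minimal sketch of the comparison. It assumes the per-sub-job results of both runs have already been collected into plain dictionaries; the function name, the sub-job names, and that data shape are all hypothetical, and nothing here talks to Jenkins.

```python
from typing import Dict, List


def fixed_on_resume(first_run: Dict[str, str], resumed_run: Dict[str, str]) -> List[str]:
    """Return the sub-jobs that did not succeed in the first run
    but succeeded when the same SHA was resumed."""
    return sorted(
        name
        for name, result in first_run.items()
        if result != "SUCCESS" and resumed_run.get(name) == "SUCCESS"
    )


if __name__ == "__main__":
    # Hypothetical sub-job names and results, for illustration only.
    first = {"sub-job-a": "FAILURE", "sub-job-b": "SUCCESS", "sub-job-c": "FAILURE"}
    resumed = {"sub-job-a": "SUCCESS", "sub-job-b": "SUCCESS", "sub-job-c": "FAILURE"}
    print(fixed_on_resume(first, resumed))  # prints ['sub-job-a']
```

Anything this returns is a candidate for "flaky test or flaky infrastructure", since the SHA did not change between the two runs.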

@rvagg
Member

rvagg commented Oct 20, 2019

I did some grepping on the CI machine and it looks like there is a signifier but it doesn't look like that shows up on the UI.

com.tikal.jenkins.plugins.multijob.MultiJobResumeControl only shows up on a small number of builds and may be what we're after.

For example, in node-test-commit-linux, the 30253 build has:

    <com.tikal.jenkins.plugins.multijob.MultiJobResumeControl plugin="jenkins-multijob-plugin@1.32">
      <run class="matrix-build" resolves-to="hudson.model.Run$Replacer" plugin="matrix-project@1.14">
        <id>node-test-commit-linux#30251</id>
      </run>
    </com.tikal.jenkins.plugins.multijob.MultiJobResumeControl>

If we look at https://ci.nodejs.org/job/node-test-commit-linux/30253/, it's not obvious that this is anything special, but flip to https://ci.nodejs.org/job/node-test-commit-linux/30251/, the one linked in the config, and we find the identical gitref and some failures. #30253 is green, #30251 is not, so these tests are flaky.
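As a rough sketch of how that link between the two builds could be read programmatically, the snippet below pulls the original build id out of the fragment quoted above using Python's standard `xml.etree.ElementTree`. The function name `resumed_from` is made up for illustration, and this has only been checked against the fragment shown here; a complete `build.xml` may need extra handling.

```python
import xml.etree.ElementTree as ET
from typing import Optional

RESUME_TAG = "com.tikal.jenkins.plugins.multijob.MultiJobResumeControl"

# The fragment quoted above, copied as-is.
FRAGMENT = """
<com.tikal.jenkins.plugins.multijob.MultiJobResumeControl plugin="jenkins-multijob-plugin@1.32">
  <run class="matrix-build" resolves-to="hudson.model.Run$Replacer" plugin="matrix-project@1.14">
    <id>node-test-commit-linux#30251</id>
  </run>
</com.tikal.jenkins.plugins.multijob.MultiJobResumeControl>
"""


def resumed_from(xml_text: str) -> Optional[str]:
    """Return the id of the build that was resumed, or None if there is no resume marker."""
    root = ET.fromstring(xml_text.strip())
    # iter() also yields the root itself, so this works for a bare fragment
    # as well as for a fragment nested inside a larger document.
    for control in root.iter(RESUME_TAG):
        run_id = control.findtext("run/id")
        if run_id:
            return run_id.strip()
    return None


if __name__ == "__main__":
    print(resumed_from(FRAGMENT))  # prints node-test-commit-linux#30251
```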

Complications:

  • We don't keep builds for long, I think we might be on a 7 or 5 day cycle, so any analysis would need to be done regularly(ish)
  • It's all in XML, yay
  • It's all locked away on ci.nodejs.org, which build/infra folks have access to - although it's not a super critical resource and we could discuss being slightly less restrictive if someone wants to spend time coming up with an analysis solution that can also exfiltrate the results to a usable place.

@rvagg
Member

rvagg commented Oct 20, 2019

FYI here's how they can be found, and the list of resumed builds in node-test-commit-linux in the last week:

$ grep -i '<com.tikal.jenkins.plugins.multijob.MultiJobResumeControl' /var/lib/jenkins/jobs/node-test-commit-linux/builds/*/build.xml | awk -F/ '{print $(NF-1)}'
30203
30216
30220
30253
30258
30286
30288
30312
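For anyone who would rather script this than grep, a rough Python equivalent of the search above might look like the following. It is only a sketch: `resumed_builds` is a made-up name, it does the same case-insensitive text match as the grep rather than parsing the XML, and it needs the same filesystem access to the CI machine.

```python
from pathlib import Path
from typing import List

MARKER = "<com.tikal.jenkins.plugins.multijob.MultiJobResumeControl"


def resumed_builds(builds_dir: str) -> List[str]:
    """Return the names of build directories whose build.xml contains the resume marker."""
    found = []
    for build_xml in Path(builds_dir).glob("*/build.xml"):
        text = build_xml.read_text(encoding="utf-8", errors="replace")
        # Case-insensitive match, like `grep -i`.
        if MARKER.lower() in text.lower():
            found.append(build_xml.parent.name)
    # Directory names are sorted as strings, not as numbers.
    return sorted(found)


if __name__ == "__main__":
    # Same path as in the grep command above; prints nothing if it does not exist.
    for build in resumed_builds("/var/lib/jenkins/jobs/node-test-commit-linux/builds"):
        print(build)
```

Unlike the grep pipeline, this lists each build once even if the marker appears on several lines of the same file.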

@github-actions

This issue is stale because it has been open many days with no activity. It will be closed soon unless the stale label is removed or a comment is made.
