Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix problem with running the DSL in multiple threads #751

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

marc-guenther
Copy link
Member

We use GPars eachParallel() to split up generation of jobs
into multiple threads. This significantly reduced our
run time from 25min to 5min.

Unfortunately, it leads to a race condition, which causes
random jobs to be lost (and deleted) on each run.

Ultimately, it all boils down to this non thread-safe statement
in JobParent.processItem():
referencedJobs << job

Making the 5 collections asSynchronized() fixed the problem for us.

We use GPars eachParallel() to split up generation of jobs
into multiple threads. This significantly reduced our
run time from 25min to 5min.

Unfortunately, it leads to a race condition, which causes
random jobs to be lost (and deleted) on each run.

Ultimately, it all boils down to this non thread-safe statement
in JobParent.processItem():
  referencedJobs << job

Making the 5 collections asSynchronized() fixed the problem for us.
@jenkinsadmin
Copy link
Member

Thank you for this pull request! Please check this document for how the Jenkins project handles pull requests.

@daspilker
Copy link
Member

Script execution time is becoming an issue as projects grow. The DSL was not designed to be thread-safe. Applying this change would probably fix the concurrency problems for you, but would not make the DSL thead-safe in general. And I think it would be a major effort to build a thread-safe DSL. I want to explore other means for speeding up execution time before addressing concurrency issues. Can you take a look at #782 and test if that change reduces script execution time for you?

@marc-guenther
Copy link
Member Author

Well, if I understand correctly, #782 only helps if you have multiple scripts, but we have only one script which iterates over >300 repositories, and creates >1000 jobs.
Regarding thread-safety, would you advise us to not use this approach? Cause we run it now like this for almost a month at least 50 times a day, and so far we have not noticed any obvious problems. If there is anything I could do to investigate, I would like to do so, just to be safe. From a quick glance at the code, I couldn't see any other shared global state except for these 5 variables.

@daspilker
Copy link
Member

Our seed job generates ~1000 in 90 seconds, so I guess that the time-consuming part is iterating over >300 repositories. I assume that you are using something like the GitHub API to list all repos and branches. And that's slow, not the DSL itself is slow. Can you confirm?

If your approach works for you, that's fine. But if we declare the DSL thread-safe, user's will come up with "interesting" ways of using it, e.g.

def startJob = job('start-all') {
  // ...
}

doInParallel { branch ->
  job("${branch}-job") {
    // ...
  }
  startJob.with {
    publishers {
      downstream("${branch}-job")
    }
  }
}

That will cause problems in PublisherContext. So other collections would need to be thread-safe as well.

And then there is the map containing the DslEnvironments in JenkinsJobManagement. It's used by the extension mechanism and would need to be thread-safe. And all extensions would need to be reentrant.

IMHO trying to fix all concurrency issues is a bottomless pit. Which I want to avoid if possible.

@daspilker
Copy link
Member

My pointer to PublisherContext was wrong, the example will cause concurrency problems with the list of configure blocks in Item.groovy#L6. All top-level methods are implemented as configure blocks.

@marc-guenther
Copy link
Member Author

90 secs, that's fast. I just checked, we create 2580 items (jobs, folders and views) in ~5min (runtime of our seed script) plus ~2min runtime of the job-dsl-plugin (after our script finished).

We use Github API to iterate over repositories, yes, but unfortunately that's not the only bottleneck. Before parallelizing anything, runtime was >25min, but if I parallelize only Github API calls and synchronize on DSL commands, runtime still increases to 10min (+2min).

Your "interesting" example is, yes, interesting :) Fortunately we don't do anything like that, we do not share jobs between threads, they stay in the thread where they are created. Do you see any concurrency problems if we limit ourselves to this? I guess JobManagement is still considered shared state between all Items?

@daspilker
Copy link
Member

Yes, JobManagement is shared between items. But if you do not use the "interesting" code, you should be safe. There may be plugins providing misbehaving DSL methods through the extension points. I did a quick code review of all extensions listed on the API Viewer homepage (https://jenkinsci.github.io/job-dsl-plugin/) and those are safe as of now.

It would be nice if we could improve the single-thread performance of Job DSL. I never optimized that and I have no idea what the hot spots are. Can you attach a profiler to Jenkins or take some thread dumps while your seed job is running?

@jiff-infrastructure
Copy link

Can one of the admins verify this patch?

@marc-guenther
Copy link
Member Author

Just for the record, we successfully use this patch now for 2 years (we have to manually apply it for every new release of the plugin). There never were any problems.

@CorevistCI
Copy link

Build finished.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants