
[JENKINS-65821] Introducing some synchronisation mechanisms to prevent some race condition #153

Merged

Conversation

@twasyl (Contributor) commented Jun 4, 2021

JENKINS-65821

On some big pipelines, the graph may take a long time to be generated because of possible concurrent updates to the Maps used as caches. This PR introduces some synchronisation mechanisms to prevent those race conditions.
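A minimal sketch of the kind of change being described, assuming the graph lookup view caches results in plain HashMaps guarded by a single lock; the class and method names are illustrative rather than the actual patch (only the field names blockStartToEnd and nearestEnclosingBlock come from the diff excerpt discussed later in this thread):

import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only: a cache guarded by a single lock so that
// concurrent readers and writers cannot corrupt the underlying HashMaps.
class GraphCacheSketch {
    private final Object lock = new Object();
    private final Map<String, String> blockStartToEnd = new HashMap<>();
    private final Map<String, String> nearestEnclosingBlock = new HashMap<>();

    void recordBlockEnd(String startId, String endId) {
        synchronized (lock) {
            blockStartToEnd.put(startId, endId);
        }
    }

    String findEnclosingBlock(String nodeId) {
        synchronized (lock) {
            return nearestEnclosingBlock.get(nodeId);
        }
    }
}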

  • Make sure you are opening from a topic/feature/bugfix branch (right side) and not your master branch!
  • Ensure that the pull request title represents the desired changelog entry
  • Please describe what you did
  • Link to relevant issues in GitHub or Jira
  • Link to relevant pull requests, esp. upstream and downstream changes
  • Ensure you have provided tests that demonstrate the feature works or the issue is fixed

@twasyl (Contributor, Author) commented Jun 4, 2021

@jglick FYI

@amuniz (Member) commented Jun 4, 2021

Wouldn't it be better to use a read/write lock so reads do not block each other?

@twasyl (Contributor, Author) commented Jun 4, 2021

Wouldn't it be better to use a read/write lock so reads do not block each other?

I also thought about that, but was not really sure because of some discussions that were happening. I could probably follow this path.
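For comparison, a minimal sketch of the read/write-lock variant suggested above, using the same illustrative cache as before; this is an assumption about what such a change could look like, not code from the PR:

import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Illustrative sketch only: readers share the read lock, so lookups from
// multiple UI threads do not block each other; writers take the write lock.
class RwLockCacheSketch {
    private final ReadWriteLock lock = new ReentrantReadWriteLock();
    private final Map<String, String> nearestEnclosingBlock = new HashMap<>();

    String findEnclosingBlock(String nodeId) {
        lock.readLock().lock();
        try {
            return nearestEnclosingBlock.get(nodeId);
        } finally {
            lock.readLock().unlock();
        }
    }

    void cacheEnclosingBlock(String nodeId, String enclosingId) {
        lock.writeLock().lock();
        try {
            nearestEnclosingBlock.put(nodeId, enclosingId);
        } finally {
            lock.writeLock().unlock();
        }
    }
}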

@dwnusbaum (Member) left a comment

Are you trying to improve performance, or is this intended to fix some concurrency issue you ran into? It's not clear to me how adding locks would increase performance. If you did see a concurrency issue, do you have stack traces or any more details about the issue?

(I don't see any specific problems with a patch like this, but depending on the details of the issue this might not be the best place to add synchronization. Also, since there are always fallback code paths here in case the caches are empty, and the computed values should always be the same for the same Pipeline graph, you might be able to just switch to ConcurrentHashMap to resolve any concurrency issues without needing to add explicit locks to this code, maybe with some minor updates to use methods like compute if you are seeing a lot of concurrent reads and want to make sure the cached values are used.)
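A minimal sketch of the lock-free alternative described here, assuming a missing entry can always be recomputed from the flow graph; the computeEnclosingBlock parameter is a hypothetical stand-in for that fallback path, not an existing method:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Illustrative sketch only: ConcurrentHashMap keeps individual reads and
// writes thread-safe without an explicit lock, and computeIfAbsent lets
// concurrent readers reuse the cached value instead of racing to
// recompute it. (Note: a null result is not cached by ConcurrentHashMap.)
class ConcurrentCacheSketch {
    private final Map<String, String> nearestEnclosingBlock = new ConcurrentHashMap<>();

    String findEnclosingBlock(String nodeId, Function<String, String> computeEnclosingBlock) {
        // computeEnclosingBlock is a hypothetical fallback that walks the
        // flow graph when the cache has no entry for this node.
        return nearestEnclosingBlock.computeIfAbsent(nodeId, computeEnclosingBlock);
    }
}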

@jglick (Member) commented Jun 4, 2021

I have seen the thread dumps and I suspect the bug was improper concurrent access to a HashMap leading to corruption and infinite loops. It would be best to file this in Jira and include (sanitized) dumps there.
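For context on that failure mode, a toy illustration (generic Java, not the plugin's code or the actual thread dumps): unsynchronized concurrent writes can corrupt a plain HashMap, and the symptom is often a thread spinning inside the map rather than a clean exception.

import java.util.HashMap;
import java.util.Map;

// Toy illustration only: two threads hammering the same unsynchronized
// HashMap. This is undefined behaviour; in practice it can throw, lose
// entries, or leave the map in a state where later operations never return.
public class HashMapRaceDemo {
    public static void main(String[] args) throws InterruptedException {
        Map<Integer, Integer> map = new HashMap<>();
        Runnable writer = () -> {
            for (int i = 0; i < 100_000; i++) {
                map.put(i, i); // unsynchronized concurrent writes
            }
        };
        Thread t1 = new Thread(writer);
        Thread t2 = new Thread(writer);
        t1.start();
        t2.start();
        t1.join();
        t2.join();
        // With correctly synchronized writes this prints 100000; under the
        // race the size may be wrong, or the program may hang before here.
        System.out.println(map.size());
    }
}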

@jglick (Member) left a comment

Looks right.

Comment on lines 52 to 53
blockStartToEnd.put(((BlockEndNode) newHead).getStartNode().getId(), newHead.getId());
String overallEnclosing = nearestEnclosingBlock.get(((BlockEndNode) newHead).getStartNode().getId());
A Member commented:
The sort of thing that makes me dubious you could use a lockless data structure for this purpose.

A Member replied:
@dwnusbaum suggests that you probably could, if you wanted to review the logic carefully.

@jglick (Member) commented Jun 4, 2021

leading to corruption and infinite loops

Meaning that the PR description here is incorrect. The problem seems to have been a race condition and a full-blown spin hang, not just less-than-ideal performance.

@twasyl twasyl changed the title Introducing some synchronisation mechanisms to try improving performance Introducing some synchronisation mechanisms to prevent some race condition Jun 7, 2021
@twasyl twasyl changed the title Introducing some synchronisation mechanisms to prevent some race condition [JENKINS-65821] Introducing some synchronisation mechanisms to prevent some race condition Jun 7, 2021
@twasyl twasyl requested a review from jglick June 7, 2021 13:34
@jtnord (Member) commented Jun 7, 2021

What is the impact of this change? AFAICT it will force multiple read threads (the UI) to be single-threaded.
Would a ConcurrentHashMap be better than this level of synchronization here? Or, if you do not need re-entrant locks, a read/write lock?

@jglick (Member) commented Jun 7, 2021

@jtnord yes, perhaps. Unclear if concurrent access is common enough to be worth the extra risk & complexity, though. These methods should all complete quickly enough that I doubt we want to bother optimizing more.

@bitwiseman bitwiseman merged commit 927249b into jenkinsci:master Jun 7, 2021
@twasyl twasyl deleted the synchronization-for-optimisation branch June 7, 2021 18:35
@twasyl twasyl mentioned this pull request Jun 7, 2021
bitwiseman added a commit that referenced this pull request Jun 7, 2021
Backport pull request #153 from twasyl/synchronization-for-optimisation
@timja (Member) commented Jun 11, 2021

FYI we're seeing performance problems on ci.jenkins.io and this is a likely culprit

https://www.irccloud.com/pastebin/JGLey7jD/

https://www.irccloud.com/pastebin/zWjk17uH/

Yesterday @daniel-beck noticed:

more than 100 threads are blocked at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution.getCurrentHeads for a generate-data build

ci.jenkins.io is down at the moment because its system log filled up with the stacktrace posted at the top.

@jglick (Member) commented Jun 14, 2021

this is a likely culprit

You mean, a regression from this PR, or the bug that this PR purports to fix?

@timja (Member) commented Jun 15, 2021

Regression from this PR; I'll see if ci.jenkins.io misbehaves again and get a thread dump.

@jglick (Member) commented Jun 15, 2021

I see. We can try switching to a ConcurrentHashMap.

@timja (Member) commented Jun 16, 2021

A few people are hitting this in Jira: https://issues.jenkins.io/browse/JENKINS-65885

@olamy (Member) commented Jun 16, 2021

Using a simple ConcurrentHashMap is definitely worth a try.

@jglick (Member) commented Jun 16, 2021

Oh, that is a deadlock, not just higher thread contention. In that case this must be reverted immediately and an emergency fix cut. @bitwiseman @twasyl @car-roll @dwnusbaum and whoever else may be listening.

@bitwiseman (Contributor) commented
I'll file as soon as I'm back at my desk.

@bitwiseman (Contributor) commented
FYI, StandardGraphLookupView implements GraphListener.Synchronous:

/**
* Listener which should be notified of events immediately as they occur.
* You must be very careful not to acquire locks or block.
* If you do not implement this marker interface, you will receive notifications in batched deliveries.
*/
interface Synchronous extends GraphListener {}

I'm not sure, but it seems like synchronized is not a good idea here.
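To illustrate the concern, a hypothetical sketch (an assumption about the hazard, not an analysis of the actual JENKINS-65885 deadlock): a Synchronous listener is invoked on the thread that appends nodes to the flow graph, so any lock taken in onNewHead is held on that critical thread and can interact badly with other threads taking the same lock.

import org.jenkinsci.plugins.workflow.flow.GraphListener;
import org.jenkinsci.plugins.workflow.graph.FlowNode;

// Hypothetical listener for illustration only, not the plugin's code:
// a Synchronous listener runs on the thread that is appending nodes to
// the flow graph, so any lock taken here is held on that critical thread.
class CachingListenerSketch implements GraphListener.Synchronous {

    private final Object lock = new Object();

    @Override
    public void onNewHead(FlowNode node) {
        synchronized (lock) {
            // Update caches here. If another thread (e.g. a UI request)
            // holds `lock` while waiting on the execution that is
            // delivering this event, neither side can make progress.
        }
    }

    public String lookup(String id) {
        synchronized (lock) {
            // A slow or blocking operation here stalls node creation,
            // which is exactly what the Synchronous Javadoc warns against.
            return null; // stub for illustration
        }
    }
}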

bitwiseman added a commit that referenced this pull request Jun 16, 2021
[JENKINS-65885] Revert "Merge pull request #153 from twasyl/synchronization-for-optimisation"
bitwiseman added a commit that referenced this pull request Jun 23, 2021
…isation"

This reverts commit 927249b, reversing
changes made to 620362f.