Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[fix][broker] timeout when broker registry hangs and monitor broker registry (ExtensibleLoadManagerImpl only) #23382

Merged
merged 6 commits into from
Oct 3, 2024

Conversation

heesung-sn
Copy link
Contributor

@heesung-sn heesung-sn commented Oct 1, 2024

Motivation

  • metadata store could fail the broker lock registration. In this case, we should retry and also monitor the registry periodically and try to register it if not registered yet.
  • also we need to verify if the broker is healthy or not before starting the orphan bundle cleanups, as the broker might receive "out-dated" notifications.

Modifications

  • add monitor logic to periodically check if the broker is registered and to try to register it.
  • add timeout logic when the broker re-registration hangs.
  • add retry logic when the broker re-registration fails.
  • add health check logic before starting the orphan bundle cleanup jobs and canceling the cleanup jobs.
  • add brokerId in broker health admin api

Verifying this change

  • Make sure that the change passes the CI checks.

Does this pull request potentially affect one of the following parts:

If the box was checked, please highlight the changes

  • Dependencies (add or upgrade a dependency)
  • The public API
  • The schema
  • The default values of configurations
  • The threading model
  • The binary protocol
  • The REST endpoints
  • The admin CLI options
  • The metrics
  • Anything that affects deployment

Documentation

  • doc
  • doc-required
  • doc-not-needed
  • doc-complete

Matching PR in forked repository

PR in forked repository: heesung-sn#82

@github-actions github-actions bot added the doc-not-needed Your PR changes do not impact docs label Oct 1, 2024
@heesung-sn heesung-sn changed the title [fix][broker] timeout when broker registry hangs and monitor broker registry [fix][broker] timeout when broker registry hangs and monitor broker registry (ExtensibleLoadManagerImpl only) Oct 1, 2024
@lhotari lhotari added this to the 4.0.0 milestone Oct 1, 2024
@lhotari
Copy link
Member

lhotari commented Oct 1, 2024

@heesung-sn please add the release labels that this applies to.

@heesung-sn heesung-sn marked this pull request as draft October 1, 2024 22:05
@heesung-sn
Copy link
Contributor Author

I realized we need to add more validation upon this broker delete/create notifications as these notification could be "out-dated".

@lhotari
Copy link
Member

lhotari commented Oct 2, 2024

@heesung-sn I ran into a hanging test recently. Is this something that is related?

ZkSessionExpireTest.cleanup hangs in ModularLoadManagerImpl.disableBroker

"main" #1 [2383] prio=5 os_prio=0 cpu=101566.56ms elapsed=3520.41s tid=0x00007f3618030a30 nid=2383 waiting on condition  [0x00007f3621f54000]
   java.lang.Thread.State: WAITING (parking)
	at jdk.internal.misc.Unsafe.park(java.base@21.0.4/Native Method)
	- parking to wait for  <0x0000100041604dc8> (a java.util.concurrent.CompletableFuture$Signaller)
	at java.util.concurrent.locks.LockSupport.park(java.base@21.0.4/LockSupport.java:221)
	at java.util.concurrent.CompletableFuture$Signaller.block(java.base@21.0.4/CompletableFuture.java:1864)
	at java.util.concurrent.ForkJoinPool.unmanagedBlock(java.base@21.0.4/ForkJoinPool.java:3780)
	at java.util.concurrent.ForkJoinPool.managedBlock(java.base@21.0.4/ForkJoinPool.java:3725)
	at java.util.concurrent.CompletableFuture.waitingGet(java.base@21.0.4/CompletableFuture.java:1898)
	at java.util.concurrent.CompletableFuture.join(java.base@21.0.4/CompletableFuture.java:2117)
	at org.apache.pulsar.broker.loadbalance.impl.ModularLoadManagerImpl.disableBroker(ModularLoadManagerImpl.java:603)
	at org.apache.pulsar.broker.loadbalance.impl.ModularLoadManagerWrapper.disableBroker(ModularLoadManagerWrapper.java:47)
	at org.apache.pulsar.broker.service.BrokerService.unloadNamespaceBundlesGracefully(BrokerService.java:949)
	at org.apache.pulsar.broker.service.BrokerService.unloadNamespaceBundlesGracefully(BrokerService.java:936)
	at org.apache.pulsar.broker.PulsarService.closeAsync(PulsarService.java:520)
	at org.apache.pulsar.broker.PulsarService.close(PulsarService.java:479)
	at org.apache.pulsar.broker.service.NetworkErrorTestBase.cleanup(NetworkErrorTestBase.java:215)
	at org.apache.pulsar.broker.service.ZkSessionExpireTest.cleanup(ZkSessionExpireTest.java:50)

Thread dump: https://gist.github.com/lhotari/68578ba1354d4833af489b9ac3658c8c
Analysed in jstack.review: https://jstack.review/?https://gist.github.com/lhotari/68578ba1354d4833af489b9ac3658c8c#tda_1_dump

I created #23388 and #23389 for ZkSessionExpireTest flakiness issues that appeared in subsequent builds.

@heesung-sn
Copy link
Contributor Author

@heesung-sn I ran into a hanging test recently. Is this something that is related?

Potentially, I think this could be related, regarding why the broker registry lock creation didn't go through. Unfortunately, we couldn't confirm this because we haven't put error log when creating the lock.

I see a potential deadlock report in NativeReferenceQueue$Lock. I wonder if java 21 has any bugs there.
Screen Shot 2024-10-02 at 4 19 30 AM

@codecov-commenter
Copy link

Codecov Report

Attention: Patch coverage is 89.89899% with 10 lines in your changes missing coverage. Please review.

Project coverage is 74.54%. Comparing base (bbc6224) to head (051295d).
Report is 625 commits behind head on master.

Files with missing lines Patch % Lines
...xtensions/channel/ServiceUnitStateChannelImpl.java 84.78% 5 Missing and 2 partials ⚠️
...ker/loadbalance/extensions/BrokerRegistryImpl.java 94.11% 1 Missing and 1 partial ⚠️
...ache/pulsar/client/admin/internal/BrokersImpl.java 85.71% 1 Missing ⚠️
Additional details and impacted files

Impacted file tree graph

@@             Coverage Diff              @@
##             master   #23382      +/-   ##
============================================
+ Coverage     73.57%   74.54%   +0.97%     
- Complexity    32624    34015    +1391     
============================================
  Files          1877     1936      +59     
  Lines        139502   145365    +5863     
  Branches      15299    15893     +594     
============================================
+ Hits         102638   108366    +5728     
+ Misses        28908    28681     -227     
- Partials       7956     8318     +362     
Flag Coverage Δ
inttests 27.49% <24.24%> (+2.91%) ⬆️
systests 24.48% <0.00%> (+0.16%) ⬆️
unittests 73.88% <89.89%> (+1.04%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
...g/apache/pulsar/broker/admin/impl/BrokersBase.java 73.02% <100.00%> (+0.57%) ⬆️
...dbalance/extensions/ExtensibleLoadManagerImpl.java 79.25% <100.00%> (-0.83%) ⬇️
...ache/pulsar/client/admin/internal/BrokersImpl.java 88.23% <85.71%> (+0.73%) ⬆️
...ker/loadbalance/extensions/BrokerRegistryImpl.java 84.21% <94.11%> (-0.56%) ⬇️
...xtensions/channel/ServiceUnitStateChannelImpl.java 85.35% <84.78%> (+0.04%) ⬆️

... and 601 files with indirect coverage changes

@BewareMyPower BewareMyPower dismissed their stale review October 3, 2024 06:41

Comments resolved

@heesung-sn heesung-sn merged commit eee9283 into apache:master Oct 3, 2024
51 checks passed
@lhotari
Copy link
Member

lhotari commented Oct 4, 2024

@heesung-sn Cherry-picking this to branch-3.0 doesn't get applied cleanly. Would you be able to handle cherry-picking the required changes to branch-3.0?

heesung-sn added a commit to heesung-sn/pulsar that referenced this pull request Oct 23, 2024
…egistry (ExtensibleLoadManagerImpl only) (apache#23382)

(cherry picked from commit eee9283)
heesung-sn added a commit to heesung-sn/pulsar that referenced this pull request Oct 23, 2024
…egistry (ExtensibleLoadManagerImpl only) (apache#23382)

(cherry picked from commit eee9283)
heesung-sn added a commit that referenced this pull request Oct 25, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants