Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Periodic cluster topology refresh blocks commands execution. #2908

Open
ashok-shukla opened this issue Jul 8, 2024 · 5 comments
Open

Periodic cluster topology refresh blocks commands execution. #2908

ashok-shukla opened this issue Jul 8, 2024 · 5 comments
Labels
for: team-attention An issue we need to discuss as a team to make progress status: mre-available Minimal Reproducible Example is available status: waiting-for-triage

Comments

@ashok-shukla
Copy link

ashok-shukla commented Jul 8, 2024

Bug Report

Current Behavior

Periodic cluster topology refresh blocks commands execution.

From JFR,

  1. topology refresh took 1.5 sec
    Start Time Duration End Time Event Thread sources
    7/5/24, 5:49:06.714 PM 1.500 s 7/5/24, 5:49:08.215 PM lettuce-nioEventLoop-5-2 xxxx,xxxx

  2. Thread is parked to execute get command during the topology refresh period

Start Time	Duration	End Time	Event Thread	Address of Object Parked	Class Parked On	Park Timeout	Park Until
7/5/24, 5:49:06.734 PM	1.380 s	7/5/24, 5:49:08.115 PM	io-executor-thread-9	0xE9789188	java.util.concurrent.CompletableFuture$Signaller	60.000 s	12/2/55, 4:47:04.192 PM

get command stack:
  Stack Trace	Count	Percentage
  void jdk.internal.misc.Unsafe.park(boolean, long):-1	1	100 %
  void java.util.concurrent.locks.LockSupport.parkNanos(Object, long):269	1	100 %
  boolean java.util.concurrent.CompletableFuture$Signaller.block():1866	1	100 %
  void java.util.concurrent.ForkJoinPool.unmanagedBlock(ForkJoinPool$ManagedBlocker):3780	1	100 %
  void java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool$ManagedBlocker):3725	1	100 %
  Object java.util.concurrent.CompletableFuture.timedGet(long):1939	1	100 %
  Object java.util.concurrent.CompletableFuture.get(long, TimeUnit):2095	1	100 %
  boolean io.lettuce.core.protocol.AsyncCommand.await(long, TimeUnit):83	1	100 %
  Object io.lettuce.core.internal.Futures.awaitOrCancel(RedisFuture, long, TimeUnit):244	1	100 %
  Object io.lettuce.core.cluster.ClusterFutureSyncInvocationHandler.handleInvocation(Object, Method, Object[]):131	1	100 %
  Object io.lettuce.core.internal.AbstractInvocationHandler.invoke(Object, Method, Object[]):80	1	100 %
  Object jdk.proxy2.$Proxy119.get(Object):-1	1	100 %
  byte[] oracle.oic.v3.mtms.infra.cache.impl.AbstractCacheImpl.lambda$get$8(byte[], String):280	1	100 %
  .....
  .....

Expected behavior/code

As for reference guide, from Lettuce 6.0, Cluster topology refresh is now non-blocking.
But from JFR, it shows as blocking.

Environment

  • Lettuce version(s): 6.2.6.RELEASE
  • Redis version: 6.2.6

Possible Solution

Additional context

We are executing commands using synchronous api. Our redis cluster consists of 9 nodes: 3 master and each master has 2 replicas.

@tishun
Copy link
Collaborator

tishun commented Jul 9, 2024

Hey @ashok-shukla ,

what lead you to conclude that the parked thread is handling topology refresh?
Is there some minimum reproducible example of the problem we can take a look at?

@tishun
Copy link
Collaborator

tishun commented Jul 9, 2024

(Asynchronous topology refresh is handled in #1107)

@tishun tishun added the status: waiting-for-feedback We need additional information before we can continue label Jul 9, 2024
@ashok-shukla
Copy link
Author

parked thread is not handling topology refresh. but waiting for topology refresh.

io-executor-thread-9: is parked from 7/5/24, 5:49:06.734 PM to 7/5/24, 5:49:08.115 PM.
lettuce-nioEventLoop-5-2: doing topology refresh from 7/5/24, 5:49:06.714 PM to 7/5/24, 5:49:08.215 PM.

I do not have a minimum reproducible example now. if possible, I will try to create one.

@tishun tishun added status: waiting-for-triage and removed status: waiting-for-feedback We need additional information before we can continue labels Jul 10, 2024
@ashok-shukla
Copy link
Author

reproducer:
cache.zip

  1. download cache.zip and unzip
  2. update REDIS_HOST and REDIS_PORT in CheckTopologyRefresh.java as per your redis cluster env. Our redis cluster nodes running in kubernetes and we just point to redis cluster kubernetes service as redis host.
  3. Update ssl options as per you setup.
  4. build the project: mvn clean install
  5. run the jar with following command:
    java -Xms1024m -Xmx1024m -XX:+UseG1GC -XX:+UnlockDiagnosticVMOptions -XX:GCLockerRetryAllocationCount=100 -XX:+FlightRecorder -XX:StartFlightRecording=duration=300s,filename=lettuce5min.jfr -jar cache-1.0-SNAPSHOT-jar-with-dependencies.jar 5 | tee test5min.log
  6. compare topology refresh time in lettuce5min.jfr with time in test5min.log. Check response time during that period.

@tishun
Copy link
Collaborator

tishun commented Jul 11, 2024

Awesome! Much appreciated!

@tishun tishun added the for: team-attention An issue we need to discuss as a team to make progress label Jul 17, 2024
@tishun tishun added the status: mre-available Minimal Reproducible Example is available label Nov 4, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
for: team-attention An issue we need to discuss as a team to make progress status: mre-available Minimal Reproducible Example is available status: waiting-for-triage
Projects
None yet
Development

No branches or pull requests

2 participants