Periodic cluster topology refresh blocks commands execution. #2908

ashok-shukla · 2024-07-08T14:37:40Z

Bug Report

Current Behavior

Periodic cluster topology refresh blocks commands execution.

From JFR,

topology refresh took 1.5 sec
Start Time Duration End Time Event Thread sources
7/5/24, 5:49:06.714 PM 1.500 s 7/5/24, 5:49:08.215 PM lettuce-nioEventLoop-5-2 xxxx,xxxx
Thread is parked to execute get command during the topology refresh period

Start Time	Duration	End Time	Event Thread	Address of Object Parked	Class Parked On	Park Timeout	Park Until
7/5/24, 5:49:06.734 PM	1.380 s	7/5/24, 5:49:08.115 PM	io-executor-thread-9	0xE9789188	java.util.concurrent.CompletableFuture$Signaller	60.000 s	12/2/55, 4:47:04.192 PM

get command stack:
  Stack Trace	Count	Percentage
  void jdk.internal.misc.Unsafe.park(boolean, long):-1	1	100 %
  void java.util.concurrent.locks.LockSupport.parkNanos(Object, long):269	1	100 %
  boolean java.util.concurrent.CompletableFuture$Signaller.block():1866	1	100 %
  void java.util.concurrent.ForkJoinPool.unmanagedBlock(ForkJoinPool$ManagedBlocker):3780	1	100 %
  void java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool$ManagedBlocker):3725	1	100 %
  Object java.util.concurrent.CompletableFuture.timedGet(long):1939	1	100 %
  Object java.util.concurrent.CompletableFuture.get(long, TimeUnit):2095	1	100 %
  boolean io.lettuce.core.protocol.AsyncCommand.await(long, TimeUnit):83	1	100 %
  Object io.lettuce.core.internal.Futures.awaitOrCancel(RedisFuture, long, TimeUnit):244	1	100 %
  Object io.lettuce.core.cluster.ClusterFutureSyncInvocationHandler.handleInvocation(Object, Method, Object[]):131	1	100 %
  Object io.lettuce.core.internal.AbstractInvocationHandler.invoke(Object, Method, Object[]):80	1	100 %
  Object jdk.proxy2.$Proxy119.get(Object):-1	1	100 %
  byte[] oracle.oic.v3.mtms.infra.cache.impl.AbstractCacheImpl.lambda$get$8(byte[], String):280	1	100 %
  .....
  .....

Expected behavior/code

As for reference guide, from Lettuce 6.0, Cluster topology refresh is now non-blocking.
But from JFR, it shows as blocking.

Environment

Lettuce version(s): 6.2.6.RELEASE
Redis version: 6.2.6

Possible Solution

Additional context

We are executing commands using synchronous api. Our redis cluster consists of 9 nodes: 3 master and each master has 2 replicas.

The text was updated successfully, but these errors were encountered:

tishun · 2024-07-09T11:47:46Z

Hey @ashok-shukla ,

what lead you to conclude that the parked thread is handling topology refresh?
Is there some minimum reproducible example of the problem we can take a look at?

tishun · 2024-07-09T11:52:39Z

(Asynchronous topology refresh is handled in #1107)

ashok-shukla · 2024-07-09T14:45:26Z

parked thread is not handling topology refresh. but waiting for topology refresh.

io-executor-thread-9: is parked from 7/5/24, 5:49:06.734 PM to 7/5/24, 5:49:08.115 PM.
lettuce-nioEventLoop-5-2: doing topology refresh from 7/5/24, 5:49:06.714 PM to 7/5/24, 5:49:08.215 PM.

I do not have a minimum reproducible example now. if possible, I will try to create one.

ashok-shukla · 2024-07-10T22:40:06Z

reproducer:
cache.zip

download cache.zip and unzip
update REDIS_HOST and REDIS_PORT in CheckTopologyRefresh.java as per your redis cluster env. Our redis cluster nodes running in kubernetes and we just point to redis cluster kubernetes service as redis host.
Update ssl options as per you setup.
build the project: mvn clean install
run the jar with following command:
java -Xms1024m -Xmx1024m -XX:+UseG1GC -XX:+UnlockDiagnosticVMOptions -XX:GCLockerRetryAllocationCount=100 -XX:+FlightRecorder -XX:StartFlightRecording=duration=300s,filename=lettuce5min.jfr -jar cache-1.0-SNAPSHOT-jar-with-dependencies.jar 5 | tee test5min.log
compare topology refresh time in lettuce5min.jfr with time in test5min.log. Check response time during that period.

tishun · 2024-07-11T08:45:35Z

Awesome! Much appreciated!

tishun added the status: waiting-for-feedback We need additional information before we can continue label Jul 9, 2024

tishun added status: waiting-for-triage and removed status: waiting-for-feedback We need additional information before we can continue labels Jul 10, 2024

tishun added the for: team-attention An issue we need to discuss as a team to make progress label Jul 17, 2024

tishun added the status: mre-available Minimal Reproducible Example is available label Nov 4, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Periodic cluster topology refresh blocks commands execution. #2908

Periodic cluster topology refresh blocks commands execution. #2908

ashok-shukla commented Jul 8, 2024 •

edited

Loading

tishun commented Jul 9, 2024

tishun commented Jul 9, 2024

ashok-shukla commented Jul 9, 2024

ashok-shukla commented Jul 10, 2024

tishun commented Jul 11, 2024

Periodic cluster topology refresh blocks commands execution. #2908

Periodic cluster topology refresh blocks commands execution. #2908

Comments

ashok-shukla commented Jul 8, 2024 • edited Loading

Bug Report

Current Behavior

Expected behavior/code

Environment

Possible Solution

Additional context

tishun commented Jul 9, 2024

tishun commented Jul 9, 2024

ashok-shukla commented Jul 9, 2024

ashok-shukla commented Jul 10, 2024

tishun commented Jul 11, 2024

ashok-shukla commented Jul 8, 2024 •

edited

Loading