
fix using unavailable worker cached in client-pool #18693

Open
yws-tracy opened this issue Sep 8, 2024 · 0 comments · May be fixed by #18715
Labels
type-bug This issue is about a bug


Alluxio Version:
2.9.4

Describe the bug

In our production environment, we saw many errors like:
"Unexpected error invoking REST endpoint: Failed to cache: Failed to connect to remote block worker: GrpcServerAddress{HostName=xxx-worker-20.xxx.com, SocketAddress=xxx-worker-20.xxx.com/13.26.4.81:29999

However, the worker xxx-worker-20.xxx.com was actually healthy and running normally. After deeper troubleshooting, we found the cause: in some cases there is a bug in how the ClientPoolKey is constructed.

When a BlockWorkerClient is needed, the client calls FileSystemContext#acquireBlockWorkerClientInternal.
[screenshots: the acquireBlockWorkerClientInternal call path and the ClientPoolKey lookup in mBlockWorkerClientPoolMap]

As the screenshots show, the ClientPoolKey used for mBlockWorkerClientPoolMap takes only the worker's address into account: as long as the IP is the same, the cached ClientPool is reused.

As a result, the client may get a worker connection from the client pool map that in some cases points at the wrong worker.
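
To make the keying concrete, here is a minimal sketch of the lookup pattern described above. The names ClientPoolKey and mBlockWorkerClientPoolMap follow the issue text, but the bodies are simplified stand-ins, not the actual Alluxio source:

```java
import java.net.InetSocketAddress;
import java.util.Objects;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative stand-in for the real pool type.
class BlockWorkerClientPool {}

final class ClientPoolKey {
  private final InetSocketAddress mAddress; // resolved worker IP + port

  ClientPoolKey(InetSocketAddress address) {
    mAddress = address;
  }

  // Equality depends only on the socket address, so two different worker
  // processes that end up with the same IP map to the same key.
  @Override
  public boolean equals(Object o) {
    return o instanceof ClientPoolKey
        && Objects.equals(mAddress, ((ClientPoolKey) o).mAddress);
  }

  @Override
  public int hashCode() {
    return Objects.hashCode(mAddress);
  }
}

class PoolMapSketch {
  private final ConcurrentHashMap<ClientPoolKey, BlockWorkerClientPool>
      mBlockWorkerClientPoolMap = new ConcurrentHashMap<>();

  BlockWorkerClientPool acquire(InetSocketAddress workerAddress) {
    // Returns the cached pool whenever the key matches -- even if that pool
    // holds clients for a dead worker that previously owned this IP.
    return mBlockWorkerClientPoolMap.computeIfAbsent(
        new ClientPoolKey(workerAddress), k -> new BlockWorkerClientPool());
  }
}
```

Because the key carries no notion of which worker process owns the address, a pool created for a dead worker keeps being returned for any new worker that inherits its IP.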

In our production Kubernetes network environment, IPs can be reused. For example, the pod worker1 may be deleted for some reason and its IP recycled; the next time we scale out, a new worker (e.g., worker20) may be assigned worker1's recycled IP.

When the client then tries to connect to worker20, it fails, because the client pool returns worker1's cached BlockWorkerClientPool, which is no longer valid: worker20 has the same IP as the dead worker1 whose pool is still cached.

To Reproduce
In an IP-reuse scenario it can be reproduced easily:

  1. delete worker1
  2. scale out a new worker, workerN, that reuses worker1's IP
  3. the client cannot establish a connection to workerN

Expected behavior
The newly scaled-out workerN serves normally, and the client can establish a correct connection to it.

Urgency
This only happens in the IP-reuse scenario, so it is not especially urgent.

Are you planning to fix it
Yes, I have already fixed it in our production environment in a simple way.
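
For illustration, one simple way to address this (a sketch only; not necessarily the approach taken in PR #18715) is to add a worker-identity field to the key, so that a new worker reusing an old IP no longer collides with the stale cache entry. The mWorkerId field here is hypothetical:

```java
import java.net.InetSocketAddress;
import java.util.Objects;

// Hypothetical variant of the key: adding a worker identity (for example,
// the id the master assigns when a worker registers) distinguishes a new
// worker from a dead one that happened to use the same IP.
final class ClientPoolKeyWithId {
  private final InetSocketAddress mAddress;
  private final long mWorkerId;

  ClientPoolKeyWithId(InetSocketAddress address, long workerId) {
    mAddress = address;
    mWorkerId = workerId;
  }

  @Override
  public boolean equals(Object o) {
    if (!(o instanceof ClientPoolKeyWithId)) {
      return false;
    }
    ClientPoolKeyWithId that = (ClientPoolKeyWithId) o;
    return mWorkerId == that.mWorkerId && Objects.equals(mAddress, that.mAddress);
  }

  @Override
  public int hashCode() {
    return Objects.hash(mAddress, mWorkerId);
  }
}
```

An alternative with the same effect would be to evict the cached pool entry when a connection attempt fails, so that the next acquire rebuilds a fresh pool.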
