
fix using unavailable worker cached in client-pool #18693

Open
yws-tracy opened this issue Sep 8, 2024 · 0 comments · May be fixed by #18715
Labels
type-bug This issue is about a bug


Alluxio Version:
2.9.4

Describe the bug

In our production environment, we saw many errors like:
"Unexpected error invoking REST endpoint: Failed to cache: Failed to connect to remote block worker: GrpcServerAddress{HostName=xxx-worker-20.xxx.com, SocketAddress=xxx-worker-20.xxx.com/13.26.4.81:29999

However, the worker xxx-worker-20.xxx.com was actually healthy and running normally. After deeper troubleshooting, we found the cause: in some cases there is a bug in how the ClientPoolKey is constructed.

When a BlockWorkerClient is needed, the client calls FileSystemContext#acquireBlockWorkerClientInternal.
[screenshots: the acquireBlockWorkerClientInternal call path and the ClientPoolKey lookup in mBlockWorkerClientPoolMap]

As the screenshots show, the ClientPoolKey used for mBlockWorkerClientPoolMap takes only the worker's address into account: as long as the IP is the same, the cached ClientPool is reused.

As a result, the client may get a worker connection from the client pool map that in some cases points at the wrong worker.
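
To make the keying concrete, here is a minimal sketch of the lookup pattern described above. The names ClientPoolKey and mBlockWorkerClientPoolMap follow the issue text, but the bodies are simplified stand-ins, not the actual Alluxio source:

```java
import java.net.InetSocketAddress;
import java.util.Objects;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative stand-in for the real pool type.
class BlockWorkerClientPool {}

final class ClientPoolKey {
  private final InetSocketAddress mAddress; // resolved worker IP + port

  ClientPoolKey(InetSocketAddress address) {
    mAddress = address;
  }

  // Equality depends only on the socket address, so two different worker
  // processes that end up with the same IP map to the same key.
  @Override
  public boolean equals(Object o) {
    return o instanceof ClientPoolKey
        && Objects.equals(mAddress, ((ClientPoolKey) o).mAddress);
  }

  @Override
  public int hashCode() {
    return Objects.hashCode(mAddress);
  }
}

class PoolMapSketch {
  private final ConcurrentHashMap<ClientPoolKey, BlockWorkerClientPool>
      mBlockWorkerClientPoolMap = new ConcurrentHashMap<>();

  BlockWorkerClientPool acquire(InetSocketAddress workerAddress) {
    // Returns the cached pool whenever the key matches -- even if that pool
    // holds clients for a dead worker that previously owned this IP.
    return mBlockWorkerClientPoolMap.computeIfAbsent(
        new ClientPoolKey(workerAddress), k -> new BlockWorkerClientPool());
  }
}
```

Because the key carries no notion of which worker process owns the address, a pool created for a dead worker keeps being returned for any new worker that inherits its IP.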

In our production Kubernetes network environment, IPs can be reused. For example, the pod worker1 may be deleted for some reason and its IP recycled; the next time we scale out, a new worker (e.g., worker20) may be assigned worker1's recycled IP.

When the client then tries to connect to worker20, it fails, because the client pool returns worker1's cached BlockWorkerClientPool, which is no longer valid: worker20 has the same IP as the dead worker1 whose pool is still cached.

To Reproduce
In an IP-reuse scenario it can be reproduced easily:

  1. delete worker1
  2. scale out a new worker, workerN, that reuses worker1's IP
  3. the client cannot establish a connection to workerN

Expected behavior
The newly scaled-out workerN serves normally, and the client can establish a correct connection to it.

Urgency
This only happens in the IP-reuse scenario, so it is not especially urgent.

Are you planning to fix it
Yes, I have already fixed it in our production environment in a simple way.
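
For illustration, one simple way to address this (a sketch only; not necessarily the approach taken in PR #18715) is to add a worker-identity field to the key, so that a new worker reusing an old IP no longer collides with the stale cache entry. The mWorkerId field here is hypothetical:

```java
import java.net.InetSocketAddress;
import java.util.Objects;

// Hypothetical variant of the key: adding a worker identity (for example,
// the id the master assigns when a worker registers) distinguishes a new
// worker from a dead one that happened to use the same IP.
final class ClientPoolKeyWithId {
  private final InetSocketAddress mAddress;
  private final long mWorkerId;

  ClientPoolKeyWithId(InetSocketAddress address, long workerId) {
    mAddress = address;
    mWorkerId = workerId;
  }

  @Override
  public boolean equals(Object o) {
    if (!(o instanceof ClientPoolKeyWithId)) {
      return false;
    }
    ClientPoolKeyWithId that = (ClientPoolKeyWithId) o;
    return mWorkerId == that.mWorkerId && Objects.equals(mAddress, that.mAddress);
  }

  @Override
  public int hashCode() {
    return Objects.hash(mAddress, mWorkerId);
  }
}
```

An alternative with the same effect would be to evict the cached pool entry when a connection attempt fails, so that the next acquire rebuilds a fresh pool.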
