Description
Code of Conduct
- I agree to follow this project's Code of Conduct
Search before asking
- I have searched in the issues and found no similar issues.
Describe the bug
The registerShuffle interface of ShuffleServer receives an empty remoteStoragePath, which eventually causes getShuffleResult to fail.
Stranger still, the same shuffle is registered with 15 ShuffleServers in total, and only one of them receives an empty remoteStoragePath.
Why does ShuffleServer receive an empty remoteStoragePath?
org.apache.spark.shuffle.RssShuffleManager#registerShuffle resets remoteStorage from the Spark configuration, which yields an empty path when RSS_REMOTE_STORAGE_PATH is not set (see the code below). In a concurrent scenario, one thread can reset remoteStorage to an empty string while another thread is still using it, so an empty path is sent to a ShuffleServer and getShuffleResult eventually fails.
remoteStorage = new RemoteStorageInfo(sparkConf.get(RssSparkConfig.RSS_REMOTE_STORAGE_PATH.key(), ""));
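The race described above can be replayed deterministically without real threads. The following is only a minimal sketch, not the actual RssShuffleManager code: StorageInfo, SharedFieldManager, fillIfEmpty, resolveLocally, and the hdfs://example/shuffle_data path are all hypothetical names, and the step that fills in a non-empty path after the reset is an assumption about how the path is later resolved. It contrasts a shared mutable field with a value kept local to each registration.

```java
// Sketch only: all class and method names here are hypothetical stand-ins.
final class RemoteStorageRaceSketch {

    /** Hypothetical stand-in for RemoteStorageInfo: an immutable holder for a path. */
    static final class StorageInfo {
        final String path;
        StorageInfo(String path) { this.path = path; }
    }

    /** Racy pattern: every registration resets and refills one shared field. */
    static final class SharedFieldManager {
        private StorageInfo remoteStorage = new StorageInfo("");

        // Mirrors the reset in the issue: empty when the conf key is unset.
        void resetFromConf(String configuredPath) {
            remoteStorage = new StorageInfo(configuredPath);
        }

        // Assumed follow-up step: fill in a path resolved elsewhere.
        void fillIfEmpty(String resolvedPath) {
            if (remoteStorage.path.isEmpty()) {
                remoteStorage = new StorageInfo(resolvedPath);
            }
        }

        // What a registration request would carry at this instant.
        String pathSentToServer() {
            return remoteStorage.path;
        }
    }

    /** Possible alternative: keep the value local to one registration call. */
    static String resolveLocally(String configuredPath, String resolvedPath) {
        StorageInfo local = new StorageInfo(configuredPath);
        if (local.path.isEmpty()) {
            local = new StorageInfo(resolvedPath);
        }
        return local.path;
    }

    /** Replays one bad interleaving of two registrations, step by step. */
    static String replayBadInterleaving() {
        SharedFieldManager manager = new SharedFieldManager();
        // Thread A: reset, then fill in the resolved path.
        manager.resetFromConf("");
        manager.fillIfEmpty("hdfs://example/shuffle_data");
        // Thread B: starts another registration and resets the shared field.
        manager.resetFromConf("");
        // Thread A: still sending requests, now reads the field B just reset.
        return manager.pathSentToServer();
    }

    public static void main(String[] args) {
        System.out.println("shared field: [" + replayBadInterleaving() + "]");
        System.out.println("local value:  ["
            + resolveLocally("", "hdfs://example/shuffle_data") + "]");
    }
}
```

With the shared field, the request built by the first registration observes the second registration's reset. With a local value, each registration only sees the path it resolved itself. Guarding the field with a lock would be another option, at the cost of serializing registrations.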
getShuffleResult exception stack:
[ERROR] 2022-10-19 02:59:51,672 Grpc-997 HdfsStorageManager getStorageByAppId - Can't find HDFS storage for appId[application_1664275719770_10420755_1666119585202]
[ERROR] 2022-10-19 02:59:51,672 Grpc-997 ShuffleServerGrpcService getShuffleResult - Error happened when get shuffle result for appId[application_1664275719770_10420755_1666119585202], shuffleId[4], partitionId[13]
java.lang.NullPointerException
at org.apache.uniffle.server.ShuffleTaskManager.getFinishedBlockIds(ShuffleTaskManager.java:281)
at org.apache.uniffle.server.ShuffleServerGrpcService.getShuffleResult(ShuffleServerGrpcService.java:361)
at org.apache.uniffle.proto.ShuffleServerGrpc$MethodHandlers.invoke(ShuffleServerGrpc.java:923)
at io.grpc.stub.ServerCalls$UnaryServerCallHandler$UnaryServerCallListener.onHalfClose(ServerCalls.java:182)
at io.grpc.PartialForwardingServerCallListener.onHalfClose(PartialForwardingServerCallListener.java:35)
at io.grpc.ForwardingServerCallListener.onHalfClose(ForwardingServerCallListener.java:23)
at io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.halfClosed(ServerCallImpl.java:352)
at io.grpc.internal.ServerImpl$JumpToApplicationThreadServerStreamListener$1HalfClosed.runInContext(ServerImpl.java:866)
at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
at io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:133)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Affects Version(s)
0.6.0
Uniffle Server Log Output
Abnormal ShuffleServer:
[INFO] 2022-10-19 02:59:46,066 Grpc-215 ShuffleServerGrpcService registerShuffle - Get register request for appId[application_1664275719770_10420755_1666119585202], shuffleId[4], remoteStorage[] with 66 partition ranges
Normal ShuffleServer:
[INFO] 2022-10-19 02:59:47,385 Grpc-974 ShuffleServerGrpcService registerShuffle - Get register request for appId[application_1664275719770_10420755_1666119585202], shuffleId[4], remoteStorage[hdfs://xxxxxxx/tmp/rss/shuffle_data] with 67 partition ranges
Uniffle Engine Log Output
No response
Uniffle Server Configurations
No response
Uniffle Engine Configurations
No response
Additional context
No response
Are you willing to submit PR?
- Yes I am willing to submit a PR!