[Bug] Get shuffle result failed caused by concurrent calls to registerShuffle #273

@leixm

Search before asking

  • I have searched in the issues and found no similar issues.

Describe the bug

The registerShuffle interface of ShuffleServer receives an empty remoteStoragePath, which eventually causes getShuffleResult to fail.

Stranger still, the same shuffle is registered with 15 ShuffleServers in total, and only one of them receives an empty remoteStoragePath.

Why does ShuffleServer receive an empty remoteStoragePath?
org.apache.spark.shuffle.RssShuffleManager#registerShuffle first resets remoteStorage to the configured path, which defaults to an empty string (see the code below). Under concurrent calls, one thread can reset remoteStorage to the empty string while another thread is still using it, so a ShuffleServer is registered with an empty path and getShuffleResult eventually fails.
remoteStorage = new RemoteStorageInfo(sparkConf.get(RssSparkConfig.RSS_REMOTE_STORAGE_PATH.key(), ""));
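
The interleaving described above can be replayed deterministically. The following is a minimal sketch, not the actual RssShuffleManager code: the class names, the three-step split (reset, resolve, send), and the path hdfs://example/tmp/rss are all assumptions made for illustration. It first replays two overlapping calls against a shared field, then shows one possible mitigation, keeping the value in a local variable so that concurrent calls cannot overwrite each other.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class RegisterShuffleRaceSketch {
    // Hypothetical resolved path; the real value comes from configuration.
    static final String RESOLVED_PATH = "hdfs://example/tmp/rss";

    /** Hypothetical manager that keeps the storage path in a shared mutable field. */
    static final class SharedFieldManager {
        private String remoteStorage = "";
        final List<String> sentToServers = new ArrayList<>();

        void resetToDefault() { remoteStorage = ""; }             // step 1: reset to the "" default
        void resolve() { remoteStorage = RESOLVED_PATH; }         // step 2: fill in the real path
        void sendToServer() { sentToServers.add(remoteStorage); } // step 3: read the field and send it
    }

    /** Replays one bad interleaving of two registerShuffle-like calls, A and B. */
    static List<String> replayUnsafeInterleaving() {
        SharedFieldManager m = new SharedFieldManager();
        m.resetToDefault(); // call A, step 1
        m.resolve();        // call A, step 2
        m.resetToDefault(); // call B, step 1 lands before A's step 3
        m.sendToServer();   // call A, step 3 sends the empty string
        m.resolve();        // call B, step 2
        m.sendToServer();   // call B, step 3 sends the resolved path
        return m.sentToServers;
    }

    /** Same three steps, but the value lives in a local variable owned by one call. */
    static String registerWithLocal() {
        String remoteStorage = "";     // local default, invisible to other threads
        remoteStorage = RESOLVED_PATH; // resolve
        return remoteStorage;          // value that would be sent to the server
    }

    /** Runs the local-variable version on several threads and collects what each one sent. */
    static List<String> runConcurrently(int threads) {
        List<String> sent = Collections.synchronizedList(new ArrayList<>());
        List<Thread> workers = new ArrayList<>();
        for (int i = 0; i < threads; i++) {
            Thread t = new Thread(() -> sent.add(registerWithLocal()));
            workers.add(t);
            t.start();
        }
        for (Thread t : workers) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return sent;
    }

    public static void main(String[] args) {
        System.out.println(replayUnsafeInterleaving());
        System.out.println(runConcurrently(8));
    }
}
```

In the replay, the first value sent is the empty string even though call A had already resolved the path, which matches the single abnormal ShuffleServer seen in the logs. Synchronizing the whole method would also close the window, at the cost of serializing registrations.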

getShuffleResult exception stack:

[ERROR] 2022-10-19 02:59:51,672 Grpc-997 HdfsStorageManager getStorageByAppId - Can't find HDFS storage for appId[application_1664275719770_10420755_1666119585202]
[ERROR] 2022-10-19 02:59:51,672 Grpc-997 ShuffleServerGrpcService getShuffleResult - Error happened when get shuffle result for appId[application_1664275719770_10420755_1666119585202], shuffleId[4], partitionId[13]
java.lang.NullPointerException
        at org.apache.uniffle.server.ShuffleTaskManager.getFinishedBlockIds(ShuffleTaskManager.java:281)
        at org.apache.uniffle.server.ShuffleServerGrpcService.getShuffleResult(ShuffleServerGrpcService.java:361)
        at org.apache.uniffle.proto.ShuffleServerGrpc$MethodHandlers.invoke(ShuffleServerGrpc.java:923)
        at io.grpc.stub.ServerCalls$UnaryServerCallHandler$UnaryServerCallListener.onHalfClose(ServerCalls.java:182)
        at io.grpc.PartialForwardingServerCallListener.onHalfClose(PartialForwardingServerCallListener.java:35)
        at io.grpc.ForwardingServerCallListener.onHalfClose(ForwardingServerCallListener.java:23)
        at io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.halfClosed(ServerCallImpl.java:352)
        at io.grpc.internal.ServerImpl$JumpToApplicationThreadServerStreamListener$1HalfClosed.runInContext(ServerImpl.java:866)
        at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
        at io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:133)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
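
The two log lines plus the NullPointerException suggest that a storage lookup by appId returned null and the result was then dereferenced. The sketch below illustrates that pattern and a fail-fast alternative. It is not the Uniffle server code: StorageLookupSketch, its methods, and the rule that an empty path registers nothing are assumptions for illustration only.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

final class StorageLookupSketch {
    // Hypothetical registry: appId -> remote storage path.
    private final Map<String, String> storageByAppId = new ConcurrentHashMap<>();

    /** Assumed behavior: a registration with an empty path records nothing. */
    void register(String appId, String remoteStoragePath) {
        if (remoteStoragePath != null && !remoteStoragePath.isEmpty()) {
            storageByAppId.put(appId, remoteStoragePath);
        }
    }

    /** Returns null for an unknown appId; a caller that dereferences the result gets an NPE. */
    String lookupUnsafe(String appId) {
        return storageByAppId.get(appId);
    }

    /** Fails fast with a message naming the appId instead of returning null. */
    String lookupOrFail(String appId) {
        String storage = storageByAppId.get(appId);
        if (storage == null) {
            throw new IllegalStateException(
                "No remote storage registered for appId[" + appId + "]");
        }
        return storage;
    }

    public static void main(String[] args) {
        StorageLookupSketch sketch = new StorageLookupSketch();
        sketch.register("app-1", ""); // empty path, as in the abnormal server log
        System.out.println(sketch.lookupUnsafe("app-1"));
    }
}
```

A descriptive exception at the lookup site would not fix the race on the client, but it would make the empty registration much easier to trace than a bare NullPointerException.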
        

Affects Version(s)

0.6.0

Uniffle Server Log Output

Abnormal ShuffleServer:
[INFO] 2022-10-19 02:59:46,066 Grpc-215 ShuffleServerGrpcService registerShuffle - Get register request for appId[application_1664275719770_10420755_1666119585202], shuffleId[4], remoteStorage[] with 66 partition ranges

Normal ShuffleServer:
[INFO] 2022-10-19 02:59:47,385 Grpc-974 ShuffleServerGrpcService registerShuffle - Get register request for appId[application_1664275719770_10420755_1666119585202], shuffleId[4], remoteStorage[hdfs://xxxxxxx/tmp/rss/shuffle_data] with 67 partition ranges

Uniffle Engine Log Output

No response

Uniffle Server Configurations

No response

Uniffle Engine Configurations

No response

Additional context

No response

Are you willing to submit PR?

  • Yes I am willing to submit a PR!
