[SPARK-42366][SHUFFLE] Log shuffle data corruption diagnose cause #39918
}
logger.info("Shuffle corruption diagnosis took {} ms, checksum file {}, cause {}",
  duration, checksumFile.getAbsolutePath(), cause);
return cause;
Isn't this already logged in the upper layer?
This is part of the FetchFailedException's diagnosis reason - and shows up there.
I am not strongly for or against this ...
+CC @Ngone51
Oh, if it exists, yes, we don't need this.
It's logged up there, but only in the executor logs. So it may be fine to log it in the shuffle service logs too.
Got it~
dongjoon-hyun
left a comment
Looks reasonable.
cc @mridulm
Thanks everyone for the quick review. I want to add some more checksum information so that we can analyze the cause of corruption, like this:
23/02/07 13:56:13.833 main INFO ShuffleChecksumHelper: Shuffle corruption diagnosis took 0 ms, checksum file /private/var/folders/49/rnfrb9c53sg3z4f3fjyqfvz00000gp/T/spark-124b60fb-1de0-4893-8140-544029128392/shuffle_0_0_0.checksum.ADLER32, cause DISK_ISSUE, checksumByReader 196609, checksumByWriter 196608, checksumByReCalculation 196609
23/02/07 13:58:28.002 main INFO ShuffleChecksumHelper: Shuffle corruption diagnosis took 0 ms, checksum file /private/var/folders/49/rnfrb9c53sg3z4f3fjyqfvz00000gp/T/spark-5d2ab0e1-6f7f-437e-902c-0e6ac247a0ea/shuffle_0_0_0.checksum.ADLER32, cause NETWORK_ISSUE, checksumByReader 196608, checksumByWriter 196609, checksumByReCalculation 196609
23/02/07 13:58:59.072 main INFO ShuffleChecksumHelper: Shuffle corruption diagnosis took -1 ms, checksum file /private/var/folders/49/rnfrb9c53sg3z4f3fjyqfvz00000gp/T/spark-ceaeb1b3-efb2-4ef1-9f42-28387222d948/shuffle_0_0_0.checksum.ADLER32, cause UNKNOWN_ISSUE, checksumByReader 196609, checksumByWriter -1, checksumByReCalculation -1
23/02/07 13:58:59.072 main INFO ShuffleChecksumHelper: Shuffle corruption diagnosis took -1 ms, checksum file /private/var/folders/49/rnfrb9c53sg3z4f3fjyqfvz00000gp/T/spark-ceaeb1b3-efb2-4ef1-9f42-28387222d948/shuffle_0_0_0.checksum.ADLER32, cause UNKNOWN_ISSUE, checksumByReader 196609, checksumByWriter -1, checksumByReCalculation -1
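The three causes in these log lines are consistent with a simple comparison of the three checksums. The following is a minimal sketch of that comparison, not the actual ShuffleChecksumHelper code: the class name, the `diagnose` helper, the use of -1 as a "checksum unavailable" sentinel, and the CHECKSUM_VERIFY_PASS outcome are assumptions made for illustration.

```java
// Hypothetical sketch: class and method names are illustrative, not Spark's API.
final class CorruptionDiagnosisSketch {

    /** Outcomes; the first three names are taken from the log lines above. */
    enum Cause { DISK_ISSUE, NETWORK_ISSUE, UNKNOWN_ISSUE, CHECKSUM_VERIFY_PASS }

    /**
     * Compares the checksum seen by the reader, the checksum recorded by the
     * writer, and the checksum recalculated from the block on disk.
     * Assumption: -1 means the value could not be obtained.
     */
    static Cause diagnose(long checksumByReader,
                          long checksumByWriter,
                          long checksumByReCalculation) {
        if (checksumByWriter == -1 || checksumByReCalculation == -1) {
            // Nothing to compare against, so no cause can be inferred.
            return Cause.UNKNOWN_ISSUE;
        }
        if (checksumByWriter != checksumByReCalculation) {
            // The data on disk no longer matches what the writer recorded.
            return Cause.DISK_ISSUE;
        }
        if (checksumByWriter != checksumByReader) {
            // The disk copy is intact, so the data changed in transit.
            return Cause.NETWORK_ISSUE;
        }
        return Cause.CHECKSUM_VERIFY_PASS;
    }

    public static void main(String[] args) {
        // Inputs copied from the three distinct log lines above.
        System.out.println(diagnose(196609L, 196608L, 196609L));
        System.out.println(diagnose(196608L, 196609L, 196609L));
        System.out.println(diagnose(196609L, -1L, -1L));
    }
}
```

With the checksum values from the logs, the sketch yields the same cause that each log line reports.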
At the first impression, it should be |
We can analyze the cause of shuffle block corruption by collecting the shuffle service logs, so the INFO level log is meaningful.
Sorry but I'm not sure that I can agree with you that
This problem doesn't occur very often; I don't have exact statistics yet.
OK, I changed the log level from INFO to DEBUG.
I doubt we should use the DEBUG level in this case. The corruption cause here can only be either the disk issue or the network issue right now. Both of them could be temporary (a problematic disk could be persistent, but Spark doesn't guarantee writing files to the same disk partition each time) or difficult to reproduce. So I'm afraid that with the DEBUG level we could easily miss the cause the first time it happens.
I don't think the INFO level is much help either. Eventually, the problematic machine (or the disk or NIC) should be excluded. Is there any other way to mitigate the situation, @Ngone51?
BTW, @Ngone51 and @cxzl25, I'm wondering if we have the same understanding.
Neither (3) nor (4) is my intention here. I agree with having the INFO level log from the initial PR, and I proposed putting the additional info at the DEBUG level.
Thanks @dongjoon-hyun. +1 from me. I actually disagreed with (3). It was a misunderstanding between us.
INFO Level | DEBUG Level
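One reading of the split discussed above is that the summary line (duration, checksum file, cause) stays at INFO while the three checksum values are logged only at DEBUG. The sketch below illustrates that reading under stated assumptions: it uses java.util.logging so that it is self-contained (with Level.FINE standing in for DEBUG), whereas the snippet under review uses `{}` placeholders, and the class, helper names, and file path are hypothetical. It is not necessarily what was merged.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical sketch: names are illustrative and not taken from Spark.
final class DiagnosisLoggingSketch {
    private static final Logger logger =
        Logger.getLogger(DiagnosisLoggingSketch.class.getName());

    /** Summary message, worded like the INFO line in the snippet under review. */
    static String summary(long durationMs, String checksumFile, String cause) {
        return String.format(
            "Shuffle corruption diagnosis took %d ms, checksum file %s, cause %s",
            durationMs, checksumFile, cause);
    }

    /** Additional checksum details, intended for the more verbose level only. */
    static String details(long reader, long writer, long recalculated) {
        return String.format(
            "checksumByReader %d, checksumByWriter %d, checksumByReCalculation %d",
            reader, writer, recalculated);
    }

    static void logDiagnosis(long durationMs, String checksumFile, String cause,
                             long reader, long writer, long recalculated) {
        // Always visible at the default level, so the cause is not missed.
        logger.info(summary(durationMs, checksumFile, cause));
        // FINE stands in for DEBUG here; skip the formatting work when disabled.
        if (logger.isLoggable(Level.FINE)) {
            logger.fine(details(reader, writer, recalculated));
        }
    }

    public static void main(String[] args) {
        // The path below is a made-up placeholder.
        logDiagnosis(0L, "/tmp/shuffle_0_0_0.checksum.ADLER32", "DISK_ISSUE",
                     196609L, 196608L, 196609L);
    }
}
```

With this arrangement, a rare and hard-to-reproduce corruption still leaves its cause in the shuffle service log by default, and the extra values are available when verbose logging is enabled.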
LuciferYang
left a comment
+1, LGTM
dongjoon-hyun
left a comment
+1, LGTM from my side.
mridulm
left a comment
Looks good to me, merging to master
Merged to master.
Thank you all!
What changes were proposed in this pull request?
Output the cause in the diagnoseCorruption method.
Why are the changes needed?
This makes it convenient to collect the cause of shuffle corruption from the log of the shuffle service deployed in the YARN NodeManager.
Does this PR introduce any user-facing change?
No
How was this patch tested?
Existing UTs.