[SPARK-42366][SHUFFLE] Log shuffle data corruption diagnose cause #39918

cxzl25 · 2023-02-07T02:55:27Z

What changes were proposed in this pull request?

Output the cause in the diagnoseCorruption method.

Why are the changes needed?

This makes it convenient to collect the cause of shuffle corruption from the logs of the shuffle service deployed in the YARN NodeManager.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Existing unit tests.

dongjoon-hyun · 2023-02-07T04:58:46Z

...k-shuffle/src/main/java/org/apache/spark/network/shuffle/checksum/ShuffleChecksumHelper.java

    }
+    logger.info("Shuffle corruption diagnosis took {} ms, checksum file {}, cause {}",
+            duration, checksumFile.getAbsolutePath(), cause);
    return cause;


Isn't this already logged in the upper layer?

This is part of the FetchFailedException's diagnosis reason - and shows up there.

I am not strongly for or against this ...
+CC @Ngone51

Oh, if it exists, yes, we don't need this.

It is there, but only in the executor logs. So it may be fine to log it in the shuffle service logs too.

dongjoon-hyun

Looks reasonable.

cc @mridulm

cxzl25 · 2023-02-07T06:00:53Z

Thanks everyone for the quick review.

I want to add a few more checksum details to the log so that we can analyze the cause of corruption.

Like this

23/02/07 13:56:13.833 main INFO ShuffleChecksumHelper: Shuffle corruption diagnosis took 0 ms, checksum file /private/var/folders/49/rnfrb9c53sg3z4f3fjyqfvz00000gp/T/spark-124b60fb-1de0-4893-8140-544029128392/shuffle_0_0_0.checksum.ADLER32, cause DISK_ISSUE, checksumByReader 196609, checksumByWriter 196608, checksumByReCalculation 196609

23/02/07 13:58:28.002 main INFO ShuffleChecksumHelper: Shuffle corruption diagnosis took 0 ms, checksum file /private/var/folders/49/rnfrb9c53sg3z4f3fjyqfvz00000gp/T/spark-5d2ab0e1-6f7f-437e-902c-0e6ac247a0ea/shuffle_0_0_0.checksum.ADLER32, cause NETWORK_ISSUE, checksumByReader 196608, checksumByWriter 196609, checksumByReCalculation 196609

23/02/07 13:58:59.072 main INFO ShuffleChecksumHelper: Shuffle corruption diagnosis took -1 ms, checksum file /private/var/folders/49/rnfrb9c53sg3z4f3fjyqfvz00000gp/T/spark-ceaeb1b3-efb2-4ef1-9f42-28387222d948/shuffle_0_0_0.checksum.ADLER32, cause UNKNOWN_ISSUE, checksumByReader 196609, checksumByWriter -1, checksumByReCalculation -1

23/02/07 13:58:59.072 main INFO ShuffleChecksumHelper: Shuffle corruption diagnosis took -1 ms, checksum file /private/var/folders/49/rnfrb9c53sg3z4f3fjyqfvz00000gp/T/spark-ceaeb1b3-efb2-4ef1-9f42-28387222d948/shuffle_0_0_0.checksum.ADLER32, cause UNKNOWN_ISSUE, checksumByReader 196609, checksumByWriter -1, checksumByReCalculation -1
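
The three distinct sample lines above are consistent with a simple comparison rule: a writer checksum that differs from the recalculated one points at the disk, a reader checksum that differs from an otherwise matching pair points at the network, and a missing value (-1) leaves the cause unknown. The following is a minimal sketch under that assumption. The class name, method name, the `MISSING` constant, and the `CHECKSUM_VERIFY_PASS` value for the all-match case are illustrative and not taken from this thread; the actual decision logic in ShuffleChecksumHelper is not shown here.

```java
// Hypothetical sketch: names are illustrative, not Spark's actual API.
final class CorruptionCauseSketch {

    // CHECKSUM_VERIFY_PASS is an assumed name for the "everything matches" case.
    enum Cause { DISK_ISSUE, NETWORK_ISSUE, UNKNOWN_ISSUE, CHECKSUM_VERIFY_PASS }

    // Sentinel seen in the sample logs when a checksum could not be obtained.
    static final long MISSING = -1L;

    static Cause diagnose(long checksumByReader,
                          long checksumByWriter,
                          long checksumByReCalculation) {
        // Without both server-side checksums there is nothing to compare.
        if (checksumByWriter == MISSING || checksumByReCalculation == MISSING) {
            return Cause.UNKNOWN_ISSUE;
        }
        // The stored data no longer matches what the writer recorded.
        if (checksumByWriter != checksumByReCalculation) {
            return Cause.DISK_ISSUE;
        }
        // The stored data is intact, so the reader received something different.
        if (checksumByWriter != checksumByReader) {
            return Cause.NETWORK_ISSUE;
        }
        return Cause.CHECKSUM_VERIFY_PASS;
    }

    public static void main(String[] args) {
        // Values copied from the sample log lines in this comment.
        System.out.println(diagnose(196609L, 196608L, 196609L));
        System.out.println(diagnose(196608L, 196609L, 196609L));
        System.out.println(diagnose(196609L, -1L, -1L));
    }
}
```

Feeding in the checksum triples from the log lines reproduces the causes those lines report, which is the only check this sketch is meant to pass.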

dongjoon-hyun · 2023-02-07T06:30:46Z

At first impression, this should be a DEBUG-level log instead of INFO.
What is the additional benefit for the users, @cxzl25 ?

cxzl25 · 2023-02-07T07:27:30Z

What is the additional benefit for the users

We can analyze the cause of shuffle block corruption by collecting the shuffle service logs, so the INFO-level log is meaningful.

dongjoon-hyun · 2023-02-07T07:56:20Z

Sorry but I'm not sure that I can agree with you that the INFO level log is meaningful for that kind of analysis.

May I ask why shuffle block corruption has become a frequent event in your cluster?
Why can't you use the DEBUG and TRACE levels for your analysis, when you can control them via log4j property files?

cxzl25 · 2023-02-07T08:36:58Z

May I ask why shuffle block corruption has become a frequent event in your cluster?

This problem doesn't occur very often; I don't have exact statistics yet.
These logs are meant to support in-depth investigation into the possible causes of this problem.

Why can't you use the DEBUG and TRACE levels for your analysis, when you can control them via log4j property files?

OK, I changed the log level from INFO to DEBUG.

Ngone51 · 2023-02-07T09:22:37Z

I doubt we should use the DEBUG level in this case. The corruption cause here can only be either a disk issue or a network issue right now. Both could be temporary (a problematic disk could be persistent, but Spark doesn't guarantee writing files to the same disk partition each time) or difficult to reproduce. So I'm afraid that with the DEBUG level we could easily miss the cause in the first place.

dongjoon-hyun · 2023-02-07T18:55:38Z

I don't think the INFO level is much help either. Eventually, the problematic machine (or the disk or NIC) should be excluded. Is there any other way to mitigate the situation, @Ngone51 ?

dongjoon-hyun · 2023-02-07T19:01:40Z

BTW, @Ngone51 and @cxzl25, I'm wondering if we have the same understanding.

1. I already gave +1 to the original INFO-level log in this PR.
2. We had a discussion about the log level for the new suggestion:
   [SPARK-42366][SHUFFLE] Log shuffle data corruption diagnose cause #39918 (comment)
3. @cxzl25 converted everything from INFO into DEBUG.
4. @Ngone51 expressed a worry about the DEBUG level.

Neither (3) nor (4) was my intention here. I agree to have INFO level in the initial PR and proposed to put the additional info in DEBUG level.

Ngone51 · 2023-02-08T02:12:46Z

I agree to have INFO level in the initial PR and proposed to put the additional info in DEBUG level.

Thanks @dongjoon-hyun . +1 from me. I actually disagreed with (3). It was a misunderstanding between us.

cxzl25 · 2023-02-08T06:16:36Z

INFO Level

23/02/08 14:12:50.098 main INFO ShuffleChecksumHelper: Shuffle corruption diagnosis took 0 ms, checksum file /private/var/folders/49/rnfrb9c53sg3z4f3fjyqfvz00000gp/T/spark-facc29e8-0712-4dc7-bc00-f047c7cd818c/shuffle_0_0_0.checksum.ADLER32, cause DISK_ISSUE

DEBUG Level

23/02/08 14:12:04.699 main DEBUG ShuffleChecksumHelper: Shuffle corruption diagnosis took 0 ms, checksum file /private/var/folders/49/rnfrb9c53sg3z4f3fjyqfvz00000gp/T/spark-abecd08b-149d-46d3-a0fa-2d063cd2cbe7/shuffle_0_0_0.checksum.ADLER32, cause DISK_ISSUE, checksumByReader 196609, checksumByWriter 196608, checksumByReCalculation 196609
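
The two sample lines suggest that the message always carries the duration, checksum file, and cause, and that the three checksums are appended only when DEBUG is enabled. A minimal sketch of that split is below, written as a pure string-building function so it runs without a logging library. The class name, method name, and the `debugEnabled` flag are hypothetical; how the merged code actually checks the log level is not shown in this thread.

```java
// Hypothetical sketch: builds the message text only, no logging framework involved.
final class DiagnosisLogSketch {

    static String format(boolean debugEnabled,
                         long durationMs,
                         String checksumFile,
                         String cause,
                         long checksumByReader,
                         long checksumByWriter,
                         long checksumByReCalculation) {
        // Fields that appear at both INFO and DEBUG level.
        String base = "Shuffle corruption diagnosis took " + durationMs
            + " ms, checksum file " + checksumFile
            + ", cause " + cause;
        if (!debugEnabled) {
            return base;
        }
        // Extra checksum details, appended only for DEBUG.
        return base
            + ", checksumByReader " + checksumByReader
            + ", checksumByWriter " + checksumByWriter
            + ", checksumByReCalculation " + checksumByReCalculation;
    }

    public static void main(String[] args) {
        // "/tmp/example.checksum.ADLER32" is a made-up path for illustration.
        String file = "/tmp/example.checksum.ADLER32";
        System.out.println(format(false, 0, file, "DISK_ISSUE", 196609L, 196608L, 196609L));
        System.out.println(format(true, 0, file, "DISK_ISSUE", 196609L, 196608L, 196609L));
    }
}
```

Keeping the short form at INFO means the cause is still captured for rare, hard-to-reproduce corruption, while the checksum values stay opt-in.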

LuciferYang

+1, LGTM

dongjoon-hyun

+1, LGTM from my side.

mridulm

Looks good to me, merging to master

mridulm · 2023-02-09T16:37:20Z

Merged to master.
Thanks for fixing this @cxzl25 !
Thanks for the reviews @dongjoon-hyun, @Ngone51, @LuciferYang :-)

dongjoon-hyun · 2023-02-09T17:15:03Z

Thank you all!

log

2ad67a5

github-actions bot added the CORE label Feb 7, 2023

dongjoon-hyun reviewed Feb 7, 2023

dongjoon-hyun approved these changes Feb 7, 2023

Ngone51 approved these changes Feb 7, 2023

cxzl25 added 2 commits February 7, 2023 16:32

checksum

7a13d1f

debug level

cd12044

cxzl25 added 3 commits February 8, 2023 14:02

Revert "debug level"

35173e7

This reverts commit cd12044.

Revert "checksum"

dde9d73

This reverts commit 7a13d1f.

put the additional info in DEBUG level

1f5d182

LuciferYang approved these changes Feb 8, 2023

dongjoon-hyun approved these changes Feb 9, 2023

mridulm approved these changes Feb 9, 2023

mridulm closed this in 201a91b Feb 9, 2023