Skip to content

Conversation

@cxzl25
Copy link
Contributor

@cxzl25 cxzl25 commented Feb 7, 2023

What changes were proposed in this pull request?

Output the cause in the diagnoseCorruption method.

Why are the changes needed?

It is convenient to collect the reason of shuffle corruption from the shuffle service Log deployed by YARN Nodemanager.

Does this PR introduce any user-facing change?

No

How was this patch tested?

exist UT

@github-actions github-actions bot added the CORE label Feb 7, 2023
}
logger.info("Shuffle corruption diagnosis took {} ms, checksum file {}, cause {}",
duration, checksumFile.getAbsolutePath(), cause);
return cause;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not logged in the upper layer?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is part of the FetchFailedException's diagnosis reason - and shows up there.

I am not strongly for or against this ...
+CC @Ngone51

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, if it exists, yes, we don't need this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's up there but the executor logs. So maybe fine to log it too in shuffle service logs.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it~

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks reasonable.

cc @mridulm

@cxzl25
Copy link
Contributor Author

cxzl25 commented Feb 7, 2023

Thanks everyone for the quick review.

I want to add a few more information checksums so that we can analyze the cause of corruption.

Like this

23/02/07 13:56:13.833 main INFO ShuffleChecksumHelper: Shuffle corruption diagnosis took 0 ms, checksum file /private/var/folders/49/rnfrb9c53sg3z4f3fjyqfvz00000gp/T/spark-124b60fb-1de0-4893-8140-544029128392/shuffle_0_0_0.checksum.ADLER32, cause DISK_ISSUE, checksumByReader 196609, checksumByWriter 196608, checksumByReCalculation 196609

23/02/07 13:58:28.002 main INFO ShuffleChecksumHelper: Shuffle corruption diagnosis took 0 ms, checksum file /private/var/folders/49/rnfrb9c53sg3z4f3fjyqfvz00000gp/T/spark-5d2ab0e1-6f7f-437e-902c-0e6ac247a0ea/shuffle_0_0_0.checksum.ADLER32, cause NETWORK_ISSUE, checksumByReader 196608, checksumByWriter 196609, checksumByReCalculation 196609

23/02/07 13:58:59.072 main INFO ShuffleChecksumHelper: Shuffle corruption diagnosis took -1 ms, checksum file /private/var/folders/49/rnfrb9c53sg3z4f3fjyqfvz00000gp/T/spark-ceaeb1b3-efb2-4ef1-9f42-28387222d948/shuffle_0_0_0.checksum.ADLER32, cause UNKNOWN_ISSUE, checksumByReader 196609, checksumByWriter -1, checksumByReCalculation -1

23/02/07 13:58:59.072 main INFO ShuffleChecksumHelper: Shuffle corruption diagnosis took -1 ms, checksum file /private/var/folders/49/rnfrb9c53sg3z4f3fjyqfvz00000gp/T/spark-ceaeb1b3-efb2-4ef1-9f42-28387222d948/shuffle_0_0_0.checksum.ADLER32, cause UNKNOWN_ISSUE, checksumByReader 196609, checksumByWriter -1, checksumByReCalculation -1

@dongjoon-hyun
Copy link
Member

At the first impression, it should be DEBUG level log instead of INFO.
What is the additional benefit for the users, @cxzl25 ?

@cxzl25
Copy link
Contributor Author

cxzl25 commented Feb 7, 2023

What is the additional benefit for the users

We can analyze the cause of shuffle block corruption by collecting shuffle service log, so the INFO level log is meaningful.

@dongjoon-hyun
Copy link
Member

Sorry but I'm not sure that I can agree with you that the INFO level log is meaningful for that kind of analysis.

  • May I ask why Shuffle block corruption become a frequent event in your cluster?
  • Why you cannot use DEBUG and TRACE level for your analysis when you can control via log4j property files?

@cxzl25
Copy link
Contributor Author

cxzl25 commented Feb 7, 2023

May I ask why Shuffle block corruption become a frequent event in your cluster?

This problem doesn't occur very often, I don't have exact statistics yet.
Adding these logs is to do in-depth research and find out the possible reasons for this problem.

Why you cannot use DEBUG and TRACE level for your analysis when you can control via log4j property files?

OK, I adjusted the INFO level to debug level.

@Ngone51
Copy link
Member

Ngone51 commented Feb 7, 2023

I doubt we use the DEBUG level in this case. The corruption cause here can only be either the disk issue or the network issue right now. And both of them could be temporary (problematic disk could be persistent but spark doesn't guarantee writing files on the same disk partition each time) or difficult to reproduce. So I'm afraid using the DEBUG level could miss the cause easily in the first place.

@dongjoon-hyun
Copy link
Member

dongjoon-hyun commented Feb 7, 2023

I don't think INFO level is much helpful neither. Eventually, the problematic machine (or the disk or NIC) should be excluded. Is there any other way to mitigate the situation, @Ngone51 ?

@dongjoon-hyun
Copy link
Member

BTW, @Ngone51 and @cxzl25 . I'm wondering if we are in the same understanding.

  1. I already gave +1 for this previous INFO-level information approval.
  2. We had a discussion about the log level for this new suggestion.
  1. @cxzl25 converted everything from INFO into DEBUG
  2. @Ngone51 expresses a worry about DEBUG level.

Both (3) and (4) are not my intention here. I agree to have INFO level in the initial PR and proposed to put the additional info in DEBUG level.

@Ngone51
Copy link
Member

Ngone51 commented Feb 8, 2023

I agree to have INFO level in the initial PR and proposed to put the additional info in DEBUG level.

Thanks @dongjoon-hyun . +1 from me. I actually disagreed (3). It was a misunderstanding between us.

@cxzl25
Copy link
Contributor Author

cxzl25 commented Feb 8, 2023

INFO Level

23/02/08 14:12:50.098 main INFO ShuffleChecksumHelper: Shuffle corruption diagnosis took 0 ms, checksum file /private/var/folders/49/rnfrb9c53sg3z4f3fjyqfvz00000gp/T/spark-facc29e8-0712-4dc7-bc00-f047c7cd818c/shuffle_0_0_0.checksum.ADLER32, cause DISK_ISSUE

DEBUG Level

23/02/08 14:12:04.699 main DEBUG ShuffleChecksumHelper: Shuffle corruption diagnosis took 0 ms, checksum file /private/var/folders/49/rnfrb9c53sg3z4f3fjyqfvz00000gp/T/spark-abecd08b-149d-46d3-a0fa-2d063cd2cbe7/shuffle_0_0_0.checksum.ADLER32, cause DISK_ISSUE, checksumByReader 196609, checksumByWriter 196608, checksumByReCalculation 196609

Copy link
Contributor

@LuciferYang LuciferYang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, LGTM

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, LGTM from my side.

Copy link
Contributor

@mridulm mridulm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, merging to master

@mridulm mridulm closed this in 201a91b Feb 9, 2023
@mridulm
Copy link
Contributor

mridulm commented Feb 9, 2023

Merged to master.
Thanks for fixing this @cxzl25 !
Thanks for reviews @dongjoon-hyun, @Ngone51, @LuciferYang :-)

@dongjoon-hyun
Copy link
Member

Thank you all!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants