
Conversation

@pan3793 (Member) commented Jun 13, 2024

What changes were proposed in this pull request?

Log TID for "input split" in HadoopRDD and NewHadoopRDD

Why are the changes needed?

This change benefits both cases: structured logging enabled and disabled.

When structured logging is disabled and executor cores > 1, the logs of concurrent tasks are interleaved in stdout, for example:

24/06/12 21:40:10 INFO Executor: Running task 26.0 in stage 2.0 (TID 10)
24/06/12 21:40:10 INFO Executor: Running task 27.0 in stage 2.0 (TID 11)
24/06/12 21:40:11 INFO HadoopRDD: Input split: hdfs://.../part-00025-53bc40ae-399f-4291-b5ac-617c980deb86-c000:0+124138257
24/06/12 21:40:11 INFO HadoopRDD: Input split: hdfs://.../part-00045-53bc40ae-399f-4291-b5ac-617c980deb86-c000:0+121726684

it is hard to tell which file is read by which task, because the tasks run in parallel.

If something goes wrong, the log prints the TID and the exception stack trace. The error may be related to the input data; sometimes the exception message is clear enough to show which file the input came from, but sometimes it is not, and in the latter case the current logs do not let us identify the bad file quickly.

24/06/12 21:40:18 ERROR Executor: Exception in task 27.0 in stage 2.0 (TID 11)
(... exception message)
(... stacktraces)

When structured logging is enabled, exposing the TID as a LogKey makes the logs easier to filter selectively (a self-contained sketch of this pattern follows this description).

Does this PR introduce any user-facing change?

Yes, it adds additional information to the logs.

How was this patch tested?

Manual review, as the change only touches log contents.

Was this patch authored or co-authored using generative AI tooling?

No
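
As a rough illustration of the log"..."/MDC pattern referenced above, here is a minimal, self-contained sketch. It only mimics the mechanism: the object, trait, and interpolator below (StructuredLogSketch, LogKey, the tuple-returning log method) are illustrative stand-ins, not Spark's internal logging API, although the key names TASK_ID and INPUT_SPLIT match the ones used in the actual change.

```scala
// Minimal, self-contained sketch of the idea behind the log"..." / MDC pattern.
// Illustration only; this is NOT Spark's internal logging implementation.
object StructuredLogSketch {

  // Illustrative log keys (Spark keeps its real keys in an internal registry).
  sealed trait LogKey
  case object TASK_ID extends LogKey
  case object INPUT_SPLIT extends LogKey

  // A value tagged with the key it should be exposed under in structured output.
  final case class MDC(key: LogKey, value: Any)

  // A tiny log"..." interpolator: renders the plain-text message and also
  // collects the tagged values as key/value pairs.
  implicit class LogInterpolator(sc: StringContext) {
    def log(args: MDC*): (String, Map[LogKey, Any]) = {
      val rendered = sc.parts.zipAll(args.map(_.value.toString), "", "")
        .map { case (part, value) => part + value }
        .mkString
      (rendered, args.map(m => m.key -> m.value).toMap)
    }
  }

  def main(argv: Array[String]): Unit = {
    val tid = 11L
    val split = "hdfs://.../part-00045-53bc40ae-399f-4291-b5ac-617c980deb86-c000:0+121726684"

    val (message, fields) =
      log"Task (TID ${MDC(TASK_ID, tid)}) input split: ${MDC(INPUT_SPLIT, split)}"

    println(message) // plain-text logs: the TID is embedded in the message
    println(fields)  // structured logs: filter/search by the TASK_ID field
  }
}
```

Both outputs carry the TID: the plain message covers the structured-logging-disabled case, and the key/value map shows why exposing TASK_ID as a LogKey makes structured logs selectable by task.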

@github-actions github-actions bot added the CORE label Jun 13, 2024
@pan3793 (Member, Author) commented Jun 13, 2024

@LuciferYang (Contributor) left a comment


+1, LGTM


  private val split = theSplit.asInstanceOf[HadoopPartition]
- logInfo(log"Input split: ${MDC(INPUT_SPLIT, split.inputSplit)}")
+ logInfo(log"Task (TID ${MDC(TASK_ID, context.taskAttemptId())}) input split: " +
Member

Use TASK_ATTEMPT_ID?

Member Author

context.taskAttemptId() is the task ID; this was clarified in #45834 (comment).
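
For reference, a small sketch of the IDs discussed in this thread, using TaskContext's public accessors; the object and method names below (TaskIdsSketch, describe) are illustrative, only the TaskContext calls are real API:

```scala
import org.apache.spark.TaskContext

object TaskIdsSketch {
  // Call this from inside a running task (e.g. within rdd.mapPartitions { ... }),
  // where TaskContext.get() returns the current task's context.
  // As discussed above, taskAttemptId() is the SparkContext-wide unique ID that
  // executor logs print as "TID n", while attemptNumber() is the per-task retry
  // counter (0 on the first attempt).
  def describe(ctx: TaskContext): String =
    s"stage ${ctx.stageId()}, partition ${ctx.partitionId()}, " +
      s"TID ${ctx.taskAttemptId()}, attempt #${ctx.attemptNumber()}"
}
```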

@gengliangwang (Member) commented

Thanks, merging to master
