Conversation

@jackylee-ch
Contributor

What changes were proposed in this pull request?

After #40064, we always get the same TaskAttemptId for different task attempts that share the same partitionId. This causes different task attempts to write to the same directory.
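A minimal sketch of the fix described above (the helper name is illustrative, not Spark's actual source): the Hadoop attempt number is derived from the globally unique Spark task attempt id rather than from partitionId alone, so retries of the same partition no longer collide on one TaskAttemptId.

```scala
// Hypothetical sketch -- names are illustrative, not Spark's real code.
// Derive the Hadoop attempt number from the globally unique Spark task
// attempt id, masking the sign bit so it is a valid non-negative Int.
def attemptNumber(realTaskId: Long): Int = realTaskId.toInt & Int.MaxValue

// Two attempts of the same partition carry distinct global task ids,
// so they now get distinct attempt numbers (and output directories):
val first = attemptNumber(7L) // first attempt   -> 7
val retry = attemptNumber(9L) // retried attempt -> 9
```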

Does this PR introduce any user-facing change?

No.

How was this patch tested?

GA

Was this patch authored or co-authored using generative AI tooling?

No.

@github-actions github-actions bot added the SQL label May 30, 2024
@jackylee-ch
Contributor Author

cc @LuciferYang @yikf

@LuciferYang
Contributor

LuciferYang commented May 31, 2024

also cc @yaooqinn @HyukjinKwon @cloud-fan @pan3793

Member

@yaooqinn yaooqinn left a comment


+1, LGTM.

Can we have a test case?

@yaooqinn
Member

Please remove the [MINOR] tag and file a Jira ticket for this


  override def createWriter(partitionId: Int, realTaskId: Long): DataWriter[InternalRow] = {
-   val taskAttemptContext = createTaskAttemptContext(partitionId)
+   val taskAttemptContext = createTaskAttemptContext(partitionId, realTaskId.toInt & Int.MaxValue)
Contributor


Is it the same as math.abs?

Contributor Author


Yep, it is the same as Math.abs(realTaskId.toInt).

Contributor


val sparkAttemptNumber = TaskContext.get().taskAttemptId().toInt & Int.MaxValue

sparkAttemptNumber = taskContext.taskAttemptId().toInt & Integer.MAX_VALUE,

There are at least two other similar cases here; should we unify them to math.abs? Of course, that should be another PR. @cloud-fan

Contributor


I don't think perf matters here, and math.abs is definitely more readable.

Contributor


OK, we can unify the three cases above to math.abs in a follow-up.

Contributor

@LuciferYang LuciferYang Jun 3, 2024


@cloud-fan If overflow occurs, realTaskId.toInt & Int.MaxValue and math.abs are not equivalent:

scala> val realTaskId = Long.MaxValue
val realTaskId: Long = 9223372036854775807

scala> val a = realTaskId.toInt
val a: Int = -1

scala> val b = realTaskId.toInt & Int.MaxValue
val b: Int = 2147483647

scala> val c = math.abs(realTaskId.toInt)
val c: Int = 1

scala> val realTaskId = Int.MaxValue.toLong + 1
val realTaskId: Long = 2147483648

scala> val a = realTaskId.toInt
val a: Int = -2147483648

scala> val b = realTaskId.toInt & Int.MaxValue
val b: Int = 0

scala> val c = math.abs(realTaskId.toInt)
val c: Int = -2147483648

Meanwhile, when an overflow occurs, math.abs may still return a negative value, so I suggest we continue using & Int.MaxValue.
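The distinction above reduces to one property: & Int.MaxValue simply clears the sign bit, so it always lands in [0, Int.MaxValue], whereas math.abs(Int.MinValue) overflows and stays negative. A quick check:

```scala
// & Int.MaxValue clears the sign bit -> the result is always >= 0.
val masked = Int.MinValue & Int.MaxValue // 0

// math.abs cannot represent -Int.MinValue as an Int, so the result
// is still negative.
val viaAbs = math.abs(Int.MinValue)      // -2147483648
```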

@jackylee-ch
Contributor Author

Please remove the [MINOR] tag and file a Jira ticket for this

Sure, I will add a follow-up for SPARK-42478 and a test suite for this PR.

@LuciferYang
Contributor

No, SPARK-42478 is not part of the Spark 4.0 cycle; please use a new Jira ticket. @jackylee-ch

@LuciferYang LuciferYang changed the title [SPARK][SQL][MINOR] Fix: V2Write use the same TaskAttemptId for different task attempts [SPARK-48484][SQL] Fix: V2Write use the same TaskAttemptId for different task attempts May 31, 2024
@LuciferYang
Contributor

create SPARK-48484 for this one @jackylee-ch

private[this] val jobTrackerID = SparkHadoopWriterUtils.createJobTrackerID(new Date)
@transient private lazy val jobId = SparkHadoopWriterUtils.createJobID(jobTrackerID, 0)

override def createWriter(partitionId: Int, realTaskId: Long): DataWriter[InternalRow] = {
Contributor


Not related to this PR, but why not name it taskAttemptId? Can realTaskId be something else?

Contributor


Can we use PrivateMethodTester in FileWriterFactorySuite to avoid expanding the scope of this function?

Contributor Author


done

Contributor

@yikf yikf left a comment


thanks @jackylee-ch , lgtm

@jackylee-ch jackylee-ch force-pushed the fix_v2write_use_same_directories_for_different_task_attempts branch from 1988368 to 74782eb Compare May 31, 2024 07:49
Contributor


If we just check createTaskAttemptContext, do we really need to inherit from SharedSparkSession? Can we just inherit from SparkFunSuite?

Contributor Author


We need a Configuration here, as it will be used in createTaskAttemptContext. It's OK to me to just create a new Configuration.
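For reference, the shape such a test might take (the suite name, test name, and method symbol are assumptions based on this thread, not the merged test code): a plain SparkFunSuite with a fresh Hadoop Configuration, using ScalaTest's PrivateMethodTester so createTaskAttemptContext can stay private.

```scala
// Hypothetical sketch only -- not the merged FileWriterFactorySuite.
import org.apache.hadoop.conf.Configuration
import org.scalatest.PrivateMethodTester

class FileWriterFactorySuite extends SparkFunSuite with PrivateMethodTester {
  test("different task attempts get different TaskAttemptIds") {
    val conf = new Configuration() // no SharedSparkSession required
    val createTaskAttemptContext =
      PrivateMethod[Any](Symbol("createTaskAttemptContext"))
    // val ctx1 = factory invokePrivate createTaskAttemptContext(0, 1)
    // val ctx2 = factory invokePrivate createTaskAttemptContext(0, 2)
    // ... assert the two contexts carry different TaskAttemptIDs
  }
}
```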

@jackylee-ch jackylee-ch force-pushed the fix_v2write_use_same_directories_for_different_task_attempts branch from 74782eb to 3bb15e5 Compare May 31, 2024 08:32
LuciferYang pushed a commit that referenced this pull request May 31, 2024
…ent task attempts

### What changes were proposed in this pull request?
After #40064, we always get the same TaskAttemptId for different task attempts that share the same partitionId. This causes different task attempts to write to the same directory.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
GA

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #46811 from jackylee-ch/fix_v2write_use_same_directories_for_different_task_attempts.

Lead-authored-by: jackylee-ch <lijunqing@baidu.com>
Co-authored-by: Kent Yao <yao@apache.org>
Signed-off-by: yangjie01 <yangjie01@baidu.com>
(cherry picked from commit 67d11b1)
Signed-off-by: yangjie01 <yangjie01@baidu.com>
@LuciferYang
Contributor

Merged into master/3.5/3.4. Thanks @jackylee-ch @yaooqinn @cloud-fan @ulysses-you @yikf
