Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

jobs: remove WithTxn #60554

Merged
merged 1 commit into from
Feb 17, 2021
Merged

jobs: remove WithTxn #60554

merged 1 commit into from
Feb 17, 2021

Conversation

thoszhang
Copy link
Contributor

@thoszhang thoszhang commented Feb 13, 2021

This commit removes the Job's WithTxn method and its txn field,
which were used to set the next transaction used by a job, and have been
a source of many bugs due to setting an incorrect or nil transaction.
Now there is an explicit *kv.Txn argument on all the Job methods
that read from the store, where a nil argument still indicates that a
new txn should be created.

Closes #57534.

Release note: None

@cockroach-teamcity
Copy link
Member

This change is Reviewable

@thoszhang thoszhang force-pushed the remove-withtxn branch 6 times, most recently from 98b2ab9 to ea4f9a0 Compare February 16, 2021 23:19
@thoszhang thoszhang marked this pull request as ready for review February 16, 2021 23:20
@thoszhang thoszhang requested review from a team, pbardea and ajwerner and removed request for a team February 16, 2021 23:20
@thoszhang
Copy link
Contributor Author

I attempted to preserve the existing behavior in this PR, even in places where we seem to have intended to update the job inside a transaction closure using the transaction but neglected to use WithTxn. When I tried to "fix" these, I got some semi-consistent CI test timeouts (unclear from which change exactly), and decided that those changes were best made separately. All these places are commented. I don't know if it's worth actually opening a separate issue, but I might take a look if I have time.

@pbardea you were the Bulk IO person automatically assigned, and I don't think you have to review this, but maybe you happen to know something about the job updates in the backfiller and GC job.

Copy link
Contributor

@ajwerner ajwerner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A truly wonderful change!

Reviewed 27 of 27 files at r1.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @lucy-zhang and @pbardea)


pkg/jobs/jobsprotectedts/jobs_protected_ts.go, line 48 at r1 (raw file):

			return false, err
		}
		// TODO: This job update should possibly use the txn.

This should definitely use the txn. LoadJobWithTxn used to set the job.txn field to that txn (it did a WithTxn using the txn).


pkg/sql/backfill.go, line 1273 at r1 (raw file):

				fractionCompleted := origFractionCompleted + fractionLeft*fractionRangesFinished
				// TODO: This job update should possibly use the txn.
				if err := sc.job.FractionProgressed(ctx, nil /* txn */, jobs.FractionUpdater(fractionCompleted)); err != nil {

yes please, this one is just bad.


pkg/sql/create_stats.go, line 535 at r1 (raw file):

			// it can be cleaned up at a higher level.
			// TODO: This job update should possibly use the txn.
			if jobErr := r.job.FractionProgressed(

yes please.


pkg/sql/gcjob/gc_job_utils.go, line 120 at r1 (raw file):

				return err
			}
			// TODO: This job update should possibly use the txn.

the load meant it used to be


pkg/sql/gcjob/gc_job_utils.go, line 264 at r1 (raw file):

			return err
		}
		// TODO: This job update should possibly use the txn.

yes please

Copy link
Contributor

@pbardea pbardea left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎉 As Andrew pointed out, it looks like the GC job was already using the txn with the loading phase, and the backfiller just missed it - much clearer where we're missing these after this change. Thanks!

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @lucy-zhang)

Copy link
Contributor Author

@thoszhang thoszhang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How would the two of you feel about merging this PR which keeps the existing behavior, even where it's incorrect, and making changes in subsequent PRs (possibly during the stability period)? I can open an issue to track this.

I couldn't figure out the test timeouts I was seeing when I did make the changes; nothing seemed obviously hung from the goroutine dump and I wasn't getting the timeouts consistently, so it's possible that combining some things into a single transaction caused more contention and restarts, or something, though that seems kind of implausible. So I think this all deserves more investigation. I can isolate the changes one by one and run the tests on my gceworker (or just in CI), but that will take some time.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @lucy-zhang)


pkg/jobs/jobsprotectedts/jobs_protected_ts.go, line 48 at r1 (raw file):

LoadJobWithTxn used to set the job.txn field to that txn (it did a WithTxn using the txn).

By "used to," do you mean before this PR? I thought the txn specified in LoadJobWithTxn only applies to the method itself, because it internally uses runInTxn, which has a defer func() { j.txn = nil }(). If that's not the case then I will happily make this change.

Copy link
Contributor

@ajwerner ajwerner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for uncovering the badness that we didn't even realize we were dealing with. Merge away

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @lucy-zhang)


pkg/jobs/jobsprotectedts/jobs_protected_ts.go, line 48 at r1 (raw file):

Previously, lucy-zhang (Lucy Zhang) wrote…

LoadJobWithTxn used to set the job.txn field to that txn (it did a WithTxn using the txn).

By "used to," do you mean before this PR? I thought the txn specified in LoadJobWithTxn only applies to the method itself, because it internally uses runInTxn, which has a defer func() { j.txn = nil }(). If that's not the case then I will happily make this change.

😬

Copy link
Contributor

@ajwerner ajwerner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @lucy-zhang)

This commit removes the `Job`'s `WithTxn` method and its `txn` field,
which were used to set the next transaction used by a job, and have been
a source of many bugs due to setting an incorrect or nil transaction.
Now there is an explicit `*kv.Txn` argument on all the `Job` methods
that read from the store, where a nil argument still indicates that a
new txn should be created.

Release note: None
@thoszhang
Copy link
Contributor Author

TFTR. I opened an issue (#60690) and the TODOs refer to it now.

bors r+

@craig
Copy link
Contributor

craig bot commented Feb 17, 2021

This PR was included in a batch that was canceled, it will be automatically retried

@craig
Copy link
Contributor

craig bot commented Feb 17, 2021

Build succeeded:

@craig craig bot merged commit be115b9 into cockroachdb:master Feb 17, 2021
@thoszhang thoszhang deleted the remove-withtxn branch February 17, 2021 22:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

jobs: remove WithTxn
4 participants