Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

workload/tpcc: use multiple column families to avoid contention #30624

Merged
merged 1 commit into from
Oct 23, 2020

Conversation

nvanbenschoten
Copy link
Member

@nvanbenschoten nvanbenschoten commented Sep 25, 2018

This commit adds multiple column families to the warehouse,
district, and customer tables. These column families split
static columns up from columns that are modified by transactions.

This effectively removes all contention points between the NewOrder
and Payment transactions. The transactions will still contend with
other instances of their same txn type, but that's less of an issue
because they should end up being serialized early on, meaning than
intra-txn contention should rarely lead to transaction retries (for
instance, NewOrder serializes on district.d_next_o_id immediately).
These transactions are run about 90% of the time, so this should be a
performance win, especially without wait times.

Release justification: testing only

@cockroach-teamcity
Copy link
Member

This change is Reviewable

@nvanbenschoten
Copy link
Member Author

Also FWIW, my investigation in #30213 revealed that transaction restarts were directly correlated with 99th percentile latency in TPC-C. The graphs were almost identical. Tracing confirmed this, as I could see that nearly all long-running txns were restarted at least once. The situation I saw most often in the traces was newOrder txns being restated due to a contending payment txn.

I mentioned above that:

The transactions will still contend with
other instances of their same txn type, but that's less of an issue
because they should end up being serialized early on, meaning than
intra-txn contention should never lead to transaction retries

I'll clarify that intra-txn contention will still lead to transaction retries due to the read-modify-write operation that they often perform when incrementing counters (see #22778 and #29431), but these are performed as the first statement of the newOrder and payment txns. This means that txn restarts here should always be retriable on the server, so this PR eliminates client-side retries for all txns except the delivery txn (which is rare). That should make this a major performance win.

@nvanbenschoten
Copy link
Member Author

This is kind of a crazy idea, but this change would actually allow us to remove the client-side retry loops for all txns except the delivery txn. As discussed in #28105, this retry loop imposes 2 extra client-to-gateway round trips on each transaction. Since we're CPU bound on the upper ends of TPC-C, removing it might actually provide a moderate performance improvement because it would reduce network traffic each gateway handles and the amount of SQL parsing that it needs to perform.

@nvanbenschoten
Copy link
Member Author

I manually verified that without foreign keys, this change, in conjunction with the fixes for #18168 and #30618, break the dependency between newOrder txns and payment txns. The two transactions can be run on the same district and customer without blocking on each other.

In my limited performance testing on my laptop with --wait=false, I found that the change resulted in somewhere between a 4-5% throughput improvement. Note that this was comparing between #30618 being fixed and it not being fixed. The dataset had multiple column families in both versions of the comparison, so it's possible that the overhead of the column families themselves negates some or all of the benefits of this change. We'll have to measure this. On the other hand, we expect txn retries and txn contention to be more expensive in a distributed cluster, so it's also possible that this is an underestimate of the speedup this will provide.

However, the dependency between the txns was not broken with fks enabled. This is because of #30852. newOrder updates a secondary column familiy in the district table and payment inserts into history, which has the fk foreign key (h_w_id, h_d_id) references district (d_w_id, d_id). The corresponding fk check should not need to query all column families in district, but at the moment, it does.

@jordanlewis
Copy link
Member

@rohany @solongordon see the last point in @nvanbenschoten's last paragraph. Once the optimizer plans FKs, solving #38280 for lookup joins should break the last dependency here.

@jordanlewis
Copy link
Member

Actually, #30618 hasn't been solved yet either (making sure UPDATE RETURNING doesn't fetch too many column families). I think that's on the optimizer team, though. @RaduBerinde is this issue on the horizon?

@RaduBerinde
Copy link
Member

Not really, but I can prioritize it.

@tbg tbg added the X-noremind Bots won't notify about PRs with X-noremind label Jun 19, 2019
ridwanmsharif pushed a commit to ridwanmsharif/cockroach that referenced this pull request Jul 1, 2019
Previously, we used to fetch all columns when a mutation contained
a `RETURNING` clause. This is an issue because it forces us to
retrieve unnecessary data and creates extra contention.
This change adds logic to compute the minimal set of required columns
and fetches only those.

Fixes cockroachdb#30618.
Unblocks cockroachdb#30624.

Release note: None
ridwanmsharif pushed a commit to ridwanmsharif/cockroach that referenced this pull request Jul 3, 2019
Previously, we used to fetch all columns when a mutation contained
a `RETURNING` clause. This is an issue because it forces us to
retrieve unnecessary data and creates extra contention.
This change adds logic to compute the minimal set of required columns
and fetches only those.

Fixes cockroachdb#30618.
Unblocks cockroachdb#30624.

Release note: None
ridwanmsharif pushed a commit to ridwanmsharif/cockroach that referenced this pull request Jul 9, 2019
Previously, we used to fetch all columns when a mutation contained
a `RETURNING` clause. This is an issue because it forces us to
retrieve unnecessary data and creates extra contention.
This change adds logic to compute the minimal set of required columns
and fetches only those.

Fixes cockroachdb#30618.
Unblocks cockroachdb#30624.

Release note: None
ridwanmsharif pushed a commit to ridwanmsharif/cockroach that referenced this pull request Jul 9, 2019
Previously, we used to fetch all columns when a mutation contained
a `RETURNING` clause. This is an issue because it forces us to
retrieve unnecessary data and creates extra contention.
This change adds logic to compute the minimal set of required columns
and fetches only those.

Fixes cockroachdb#30618.
Unblocks cockroachdb#30624.

Release note: None
ridwanmsharif pushed a commit to ridwanmsharif/cockroach that referenced this pull request Jul 11, 2019
Previously, we used to fetch all columns when a mutation contained
a `RETURNING` clause. This is an issue because it forces us to
retrieve unnecessary data and creates extra contention.
This change adds logic to compute the minimal set of required columns
and fetches only those.

Fixes cockroachdb#30618.
Unblocks cockroachdb#30624.

Release note: None
ridwanmsharif pushed a commit to ridwanmsharif/cockroach that referenced this pull request Jul 14, 2019
Previously, we used to fetch all columns when a mutation contained
a `RETURNING` clause. This is an issue because it forces us to
retrieve unnecessary data and creates extra contention.
This change adds logic to compute the minimal set of required columns
and fetches only those.

This change passes down the minimal required return cols
to SQL so it knows to only return columns that's requested.
This fixes the output column calculation done by opt and
makes sure the execPlan's output columns are the same as
output cols of the opt plan.

Fixes cockroachdb#30618.
Unblocks cockroachdb#30624.

Release note: None
ridwanmsharif pushed a commit to ridwanmsharif/cockroach that referenced this pull request Jul 16, 2019
Previously, we used to fetch all columns when a mutation contained
a `RETURNING` clause. This is an issue because it forces us to
retrieve unnecessary data and creates extra contention.
This change adds logic to compute the minimal set of required columns
and fetches only those.

This change passes down the minimal required return cols
to SQL so it knows to only return columns that's requested.
This fixes the output column calculation done by opt and
makes sure the execPlan's output columns are the same as
output cols of the opt plan.

Fixes cockroachdb#30618.
Unblocks cockroachdb#30624.

Release note: None
ridwanmsharif pushed a commit to ridwanmsharif/cockroach that referenced this pull request Jul 16, 2019
Previously, we used to fetch all columns when a mutation contained
a `RETURNING` clause. This is an issue because it forces us to
retrieve unnecessary data and creates extra contention.
This change adds logic to compute the minimal set of required columns
and fetches only those.

This change passes down the minimal required return cols
to SQL so it knows to only return columns that's requested.
This fixes the output column calculation done by opt and
makes sure the execPlan's output columns are the same as
output cols of the opt plan.

Fixes cockroachdb#30618.
Unblocks cockroachdb#30624.

Release note: None
craig bot pushed a commit that referenced this pull request Jul 16, 2019
38594: opt: fetch minimal set of columns on returning mutations r=ridwanmsharif a=ridwanmsharif

Previously, we used to fetch all columns when a mutation contained
a `RETURNING` clause. This is an issue because it forces us to
retrieve unnecessary data and creates extra contention.
This change adds logic to compute the minimal set of required columns
and fetches only those.

Fixes #30618.
Unblocks #30624.

Release note: None

Co-authored-by: Ridwan Sharif <ridwan@cockroachlabs.com>
@nvanbenschoten nvanbenschoten force-pushed the nvanbenschoten/tpccContend branch 2 times, most recently from 804dd41 to d753ea0 Compare October 17, 2019 17:05
@nvanbenschoten nvanbenschoten requested a review from a team as a code owner October 17, 2019 17:05
This commit adds multiple column families to the `warehouse`,
`district`, and `customer` tables. These column families split
static columns up from columns that are modified by transactions.

This effectively removes all contention points between the `NewOrder`
and `Payment` transactions. The transactions will still contend with
other instances of their same txn type, but that's less of an issue
because they should end up being serialized early on, meaning than
intra-txn contention should rarely lead to transaction retries (for
instance, NewOrder serializes on district.d_next_o_id immediately).
These transactions are run about 90% of the time, so this should be a
performance win, especially without wait times.

Release justification: testing only
@nvanbenschoten nvanbenschoten force-pushed the nvanbenschoten/tpccContend branch from d753ea0 to 6b11918 Compare September 4, 2020 18:58
@nvanbenschoten nvanbenschoten changed the title [DNM] tpcc: use multiple column families to avoid contention tpcc: use multiple column families to avoid contention Sep 4, 2020
@nvanbenschoten
Copy link
Member Author

I'm reviving this PR, with slight modifications to the column families so that this all works properly with the handling column families in the optimizer, in foreign key checks, and with implicit SFU locking. With those changes in place, this change achieves its goal of removing all contention points between the NewOrder and Payment transactions.

This showed up very prominently in some of the tests I've been running recently. Specifically, I found that a great way to measure the impact of contention on a workload is to run the workload on our very own high-performance, in-memory database – cockroach demo. Doing so speeds everything else up (after disabling wait times in TPC-C), so that contention becomes significantly more pronounced. With that change in effect, I found that this change almost triples the maximum throughput we can sustain on a single warehouse, increasing from a little less than 2200 tpmC to a little over 5000.

Or course, standard TPC-C with wait times and proper durability isn't nearly as contended, but avoiding dependencies between these transactions still helps. We see this in a complimentary PR I'm about to open.

@jordanlewis do you mind giving this another pass?

@nvanbenschoten nvanbenschoten changed the title tpcc: use multiple column families to avoid contention workload/tpcc: use multiple column families to avoid contention Sep 4, 2020
@nvanbenschoten nvanbenschoten removed the X-noremind Bots won't notify about PRs with X-noremind label Sep 4, 2020
@nvanbenschoten
Copy link
Member Author

@jordanlewis friendly ping on this, now that we're close to the end of the stability period.

Copy link
Member

@jordanlewis jordanlewis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, sorry for missing this last time.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @jordanlewis)

Copy link
Member

@jordanlewis jordanlewis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:, though I didn't review the column family choices carefully myself . Are you going to regenerate the fixtures and so on? That's probably the most annoying thing about this change, but important, right?

Reviewed 6 of 6 files at r1.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained

@nvanbenschoten
Copy link
Member Author

Thanks for taking a look!

Are you going to regenerate the fixtures and so on?

Nope! We recently ripped out all uses of TPC-C fixtures in perf-related tests in favor of IMPORT, specifically so that they no longer need to be regenerated whenever we make minor tweaks like this. The need to regenerate fixtures was a major impediment to tuning the workload, and that didn't seem well justified.

bors r+

@craig
Copy link
Contributor

craig bot commented Oct 23, 2020

Build succeeded:

@craig craig bot merged commit 5681f5d into cockroachdb:master Oct 23, 2020
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Oct 26, 2020
This reverts commit 6b11918.

Fixes cockroachdb#55954.
Fixes cockroachdb#55953.
Fixes cockroachdb#55952.
Fixes cockroachdb#55951.
Fixes cockroachdb#55949.

In cockroachdb#30624, we split various column families up in the TPC-C workload. This
had the unintended consequence of breaking CDC roachtests, because CDC
does not support multiple column families per row:
```
CHANGEFEEDs are currently supported on tables with exactly 1 column family: warehouse has 2
```

This PR reverts that commit. I intend to come back to this and introduce
a `--families` flag to TPC-C (like we have to YCSB) that can be used to
selectively toggle column families on or off. However, I'll do this
after cockroachdb#51897 lands, because it's going to be more effort than it's worth
to conditionally set this flag based on whether the version of workload
used in the test supports it or not.
craig bot pushed a commit that referenced this pull request Oct 26, 2020
55983: Revert "tpcc: use multiple column families to avoid contention" r=nvanbenschoten a=nvanbenschoten

This reverts commit 6b11918.

Fixes #55954.
Fixes #55953.
Fixes #55952.
Fixes #55951.
Fixes #55949.

In #30624, we split various column families up in the TPC-C workload. This
had the unintended consequence of breaking CDC roachtests, because CDC
does not support multiple column families per row:
```
CHANGEFEEDs are currently supported on tables with exactly 1 column family: warehouse has 2
```

This PR reverts that commit. I intend to come back to this and introduce
a `--families` flag to TPC-C (like we have to YCSB) that can be used to
selectively toggle column families on or off. However, I'll do this
after #51897 lands, because it's going to be more effort than it's worth
to conditionally set this flag based on whether the version of workload
used in the test supports it or not.

Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
@nvanbenschoten nvanbenschoten deleted the nvanbenschoten/tpccContend branch November 12, 2020 18:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants