Describing safe, blocking, atomic, pure-mysql cut-over phase #82

shlomi-noach · 2016-06-26T13:58:46Z

Final-finally-finalizationally, here's an asynchronous, safe, atomic cut-over phase.
This solution doesn't cause "table outage" (as in #65).

Here are the steps for a safe, atomic cut-over:

The solution we offer is now based on two connections only (as opposed to three, in the optimistic approach). "Our" connections will be C10, C20. The "normal" app connections are C1..C9, C11..C19, C21..C29.

Connections C1..C9 operate on tbl with normal DML: INSERT, UPDATE, DELETE
Connection C10: CREATE TABLE tbl_old (id int primary key) COMMENT='magic-be-here'
Connection C10: LOCK TABLES tbl WRITE, tbl_old WRITE
Connections C11..C19, newly incoming, issue queries on tbl but are blocked due to the LOCK
Connection C20: RENAME TABLE tbl TO tbl_old, ghost TO tbl
- This is blocked due to the LOCK, but gets prioritized on top connections C11..C19 and on top C1..C9 or any other connection that attempts DML on tbl
Connections C21..C29, newly incoming, issue queries on tbl but are blocked due to the LOCK and due to the RENAME, waiting in queue
Connection C10: checks that C20's RENAME is applied (looks for the blocked RENAME in show processlist)
Connection 10: DROP TABLE tbl_old
Nothing happens yet; tbl is still locked. All other connections still blocked.
Connection 10: UNLOCK TABLES

BAM! The RENAME is first to execute, ghost table is swapped in place of tbl, then C1..C9, C11..C19, C21..C29 all get to operate on the new and shiny tbl

Some notes

We create tbl_old as a blocker for a premature swap
It is allowed for a connection to DROP a table it has under a WRITE LOCK
A blocked RENAME is always prioritized over a blocked INSERT/UPDATE/DELETE, no matter who came first

What happens on failures?

Much fun. Just works; no rollback required.

If C10 errors on the CREATE we do not proceed.
If C10 errors on the LOCK statement, we do not proceed. The table is not locked. App continues to operate as normal.
If C10 dies just as C20 is about to issue the RENAME:
- The lock is released, the queries C1..C9, C11..C19 immediately operate on tbl.
- C20's RENAME immediately fails because tbl_old exists.
- The entire operation is failed, but nothing terrible happens; some queries were blocked for some time is all. We will need to retry everything
If C10 dies while C20 is blocked on RENAME: Mostly similar to the above. Lock released, then C20 fails the RENAME (because tbl_old exists), then all queries resume normal operation
If C20 dies before C10 drops the table, we catch the error and let C10 proceed as planned: DROP, UNLOCK. Nothing terrible happens, some queries were blocked for some time. We will need to retry
If C20 dies just after C10 DROPs the table but before the unlock, same as above.
If both C10 and C20 die, no problem: LOCK is cleared; RENAME lock is cleared. C1..C9, C11..C19, C21..C29 are free to operate on tbl.

No matter what happens, at the end of operation we look for the ghost table. Is it still there? Then we know the operation failed, "atomically". Is it not there? Then it has been renamed to tbl, and the operation worked atomically.

A side note on failure is the matter of cleaning up the magic tbl_old. Here this is a matter of taste. Maybe just let it live and avoid recreating it, or you can drop it if you like.

Impact on app

App connections are guaranteed to be blocked, either until ghost is swapped in, or until operation fails. In the former, they proceed to operate on the new table. In the latter, they proceed to operate on the original table.

Impact on replication

Replication only sees the RENAME. There is no LOCK in the binary logs. Thus, replication sees an atomic two-table swap. There is no table-outage.

The text was updated successfully, but these errors were encountered:

baloo · 2016-08-01T23:33:27Z

Hello @shlomi-noach ,

First thanks for opensourcing gh-ost :) It looks very good :)
The only question remaining in my mind was about this particular subject (cut-over phase), thanks for documenting it. But I believe I now have two more:

What ensures that no other connections writes to tbl table in-between those events:
- Connection C10: CREATE TABLE tbl_old (id int primary key) COMMENT='magic-be-here'
- Connection C10: LOCK TABLES tbl WRITE, tbl_old WRITE
wouldn't it be safer to:
- Connection C10: LOCK TABLES tbl WRITE;
- Connection C10: CREATE TABLE tbl_old (id int primary key) COMMENT='magic-be-here'
- Connection C10: LOCK TABLES tbl_old WRITE;
I believe this way the replication client knowns once the CREATE_TABLE occurs in the replication stream, no writes to tbl will occur ever. And it's safe to proceed further down (with the rename and so on).
Other question: You assume:

A blocked RENAME is always prioritized over a blocked INSERT/UPDATE/DELETE, no matter who came first

From my understanding of the mysql documentation this is only guaranteed by the lock acquirement algorithm that mysql uses [https://dev.mysql.com/doc/refman/5.7/en/lock-tables.html]:
LOCK TABLES acquires locks as follows:
- Sort all tables to be locked in an internally defined order. From the user standpoint, this order is undefined.
- If a table is to be locked with a read and a write lock, put the write lock request before the read lock request.
- Lock one table at a time until the session gets all locks.
I believe it to be defined in https://github.com/percona/percona-server/blob/5.7/sql/mdl.h#L147
Although I believe the code tends to confirm your hypothesis, the documentation clearly says that user-code shouldn't assume on any ordering. Am I missing something?

shlomi-noach · 2016-08-02T05:22:38Z

@baloo, thank you for your review!

What ensures that no other connections writes to tbl table in-between those events:

There is nothing to ensure that, and there is no need to ensure that; the table creation is merely a step in the direction of beginning the cut-over. There is no problem that tbl still takes writes, these are still being read in the binary log and propagated to the ghost table.

Moreover, if we reverse the order as you suggest, implication is we waste more time during the locked phase.

wouldn't it be safer to:

Connection C10: LOCK TABLES tbl WRITE;

Connection C10: CREATE TABLE tbl_old (id int primary key) COMMENT='magic-be-here'

Connection C10: LOCK TABLES tbl_old WRITE;

This doesn't work like that in MySQL, see: http://dev.mysql.com/doc/refman/5.7/en/lock-tables.html:

... A session that requires locks must acquire all the locks that it needs in a single LOCK TABLES statement. ...

You cannot just grab more locks as you go along, you have to grab them all at once.

I believe it to be defined in https://github.com/percona/percona-server/blob/5.7/sql/mdl.h#L147
Although I believe the code tends to confirm your hypothesis, the documentation clearly says that user-code shouldn't assume on any ordering. Am I missing something?

The relevant code in MySQL is https://github.com/mysql/mysql-server/blob/a533e2c786164af9bd276660b972d93649434297/sql/mdl.cc#L2312. You are correct this relies on internal behavior. This behavior exists for over 15 years or so, and, having discussed (informally, Safe Harbour etc.) with engineers - has no intention of going away.

I'm glad you point it out, I'll make it explicit in the fine print.

This entire scheme came to be because of a limitation in MySQL, where you cannot rename a table you have under lock in the same session. I'm in good mind to influence next MySQL version to support the above, rendering this entire algorithm unnecessary.

baloo · 2016-08-02T15:45:45Z

Thanks for clarifications :)

shlomi-noach · 2018-10-29T09:04:13Z

Noteworthy that as of MySQL 8.0.13 there's support for RENAME TABLES under LOCK TABLES, specifically designed for gh-ost, see: https://mysqlserverteam.com/the-mysql-8-0-13-maintenance-release-is-generally-available/

gh-ost is yet to utilize this new functionality.

youzipi · 2019-12-01T18:55:08Z

why need to lock the origin table？
i only figure out one reason：to prevent continuous transaction on origin table, so as to create a time window for the swap operation?
otherwise may fall into a try-fail cycle?

as i already know, rename operation is in-place, no need to rebuild table and permit concurrent DML.

will that happen? or there are some other reasons?

mark @youzipi

shlomi-noach · 2019-12-03T07:45:32Z

rename operation is in-place, no need to rebuild table and permit concurrent DML.

True, but because gh-ost works asynchronously by applying binary log content, there may be some backlog in the binary logs. And so we lock the table to ensure we've applied any relevant changes from the binary log.

youzipi · 2019-12-26T13:20:53Z

Connection C10: CREATE TABLE tbl_old (id int primary key) COMMENT='magic-be-here'
Connection C10: LOCK TABLES tbl WRITE, tbl_old WRITE

Is it necessary to lock the tbl_old?
as just lock tbl can block the C20's RENAME and if C10 fail, the existence of tbl_old can make the RENAME fail.

shlomi-noach · 2019-12-30T14:09:23Z

Is it necessary to lock the tbl_old?

It is necessary but for a different reason than you'd expect: in MySQL, if session A runs a LOCK TABLE some_table, then it is only allowed to operate on some_table. Specifically, it will not be allowed to rename another table.

youzipi · 2020-01-07T08:24:37Z

if session A runs a LOCK TABLE some_table, then it is only allowed to operate on some_table. Specifically, it will not be allowed to rename another table.

but lock and rename are in different session.
C20 did not hold a lock.
Is it comes from the old way of Facebook ,but make no sense in new solution?

shlomi-noach · 2020-01-07T08:41:06Z

but lock and rename are in different session.

You're right. I wasn't paying enough attention.
So, I'm trying to go three years back in time to recall my experiments. I'm assuming I had good reason to lock tbl_old. Possibly you are right and there is no need to lock it -- but also, there is no overhead to it. It is not being used in any way. I have little capacity right now to set up the testing scripts which checked all the locking behaviors. If you have the time, please go ahead. Again, I believe this may be interesting academically, but with little (no?) impact in reality.

Is it comes from the old way of Facebook ,but make no sense in new solution?

I don't understand the question. Facebook did not use any of this, they used two RENAME operations which left a hole/gap/puncture where for a brief period of time, the table does not exist.

My solution avoids that and ensures the table exists at all times.

ghost · 2020-04-25T12:19:41Z

hello

why can't use rename t1 to t1_del, t1_gho to t1 directly,

are there some problems?

shlomi-noach · 2020-04-25T15:10:51Z

It can’t happen because gh-ost tails the binary log, either on master or on the replica, and there may be still events to handle
In the binary logs during the cut-over.

ghost · 2020-04-26T03:45:06Z

It can’t happen because gh-ost tails the binary log, either on master or on the replica, and there may be still events to handle
In the binary logs during the cut-over.

Thanks.
i found 3 chapters which you posted several years ago.
http://code.openark.org/blog/mysql/solving-the-non-atomic-table-swap-take-ii

wukongHH · 2021-06-11T14:06:52Z

Noteworthy that as of MySQL 8.0.13 there's support for RENAME TABLES under LOCK TABLES, specifically designed for gh-ost, see: https://mysqlserverteam.com/the-mysql-8-0-13-maintenance-release-is-generally-available/

gh-ost is yet to utilize this new functionality.

Hi，has gh-ost already utilize this new functionality?

shlomi-noach · 2021-06-11T14:22:33Z

It is unfortunate that the implementation actually does not meet gh-ost's requirement, and using the new implementation sadly complicates the logic rather than simplifies it. I therefore made no progress meeting the new implementation.

wukongHH · 2021-06-11T14:28:25Z

Can I ask why the new implementation can't simplify the logic? I don't understand by myself.o(╥﹏╥)o

shlomi-noach · 2021-06-11T14:39:41Z

It was a couple years ago that I checked. It was about who has to be the owner of the lock. I honestly don’t remember now what the complexity was, and I didn’t write anything public. Sorry about that, and I’m not looking into it now.

…

On Fri, Jun 11, 2021 at 17:28 wukongHH ***@***.***> wrote: Can I ask why the new implementation can't simplify the logic? I don't understand by myself.o(╥﹏╥)o — You are receiving this because you modified the open/close state. Reply to this email directly, view it on GitHub <#82 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AAT4WPWEZM3PA5MYWA2YOJ3TSIMRTANCNFSM4CH2HVNA> .

cenkore · 2021-06-16T09:22:57Z

@shlomi-noach

If C20 dies just after C10 DROPs the table but before the unlock, same as above.

This entry should be incorrect. In this case, the table name should be cut-over normally, because after the session is broken, the lock is released on C10 and C20 will get the lock and cut-over table directly. Finally gh-ost will fail with error Expected magic comment on _xxx_del, did not find it in retry stage and exit, but the cut-over has actually been completed.

shlomi-noach · 2021-06-21T09:43:10Z

I don't think I have the bandwidth to solve this again. Unless someone else has a definitive solution, consider using -cut-over=two-step

shlomi-noach mentioned this issue Jun 26, 2016

Describing safe, blocking, pure-mysql cut-over phase #65

Closed

shlomi-noach added enhancement review wanted labels Jun 26, 2016

shlomi-noach pushed a commit that referenced this issue Jun 27, 2016

an atomic cut-over implementation, as per issue #82

0191b28

shlomi-noach mentioned this issue Jun 27, 2016

an atomic cut-over implementation #84

Merged

shlomi-noach mentioned this issue Jul 8, 2016

supporting --cut-over-lock-timeout-seconds #90

Merged

shlomi-noach removed the review wanted label Jul 24, 2016

shlomi-noach mentioned this issue Jul 26, 2016

Removed legacy 'safe cut-over' #106

Merged

shlomi-noach added documentation and removed enhancement labels Aug 11, 2016

shlomi-noach closed this as completed Aug 11, 2016

shlomi-noach mentioned this issue Jul 3, 2017

about Switching Table operation in Gh-ost #436

Closed

zmoazeni mentioned this issue Aug 30, 2017

Possible problems with gh-ost, pool-of-threads, and hot tables #482

Closed

shlomi-noach mentioned this issue Dec 21, 2017

non-atomic table swap #530

Closed

syrnick mentioned this issue Oct 21, 2018

Atomic cutover is unsound on mysql 5.6.41 #653

Closed

ggunson mentioned this issue Dec 7, 2018

Cutover not succeeding #687

Open

sumonsun mentioned this issue Jul 31, 2019

lots of metalock ,mysql crash #768

Open

brightgogo mentioned this issue Oct 13, 2020

When the _xxxx_gho table is queried, data may be lost. #887

Open

shlomi-noach mentioned this issue Aug 2, 2021

WIP: NativeDDL cut-over with MySQL 8.0 locking vitessio/vitess#8573

Closed

3 tasks

shlomi-noach mentioned this issue Oct 11, 2022

vitess Online DDL atomic cut-over vitessio/vitess#11460

Merged

3 tasks

This was referenced Oct 21, 2022

add rocksdb as transactional engine #1189

Closed

Cut-off transaction is very slow, leading to downtimes in production[Question] #1191

Open

dm-2 mentioned this issue Nov 25, 2022

About adding unique index lost data - There are delete operation scenarios #1205

Open

qsweber mentioned this issue Sep 6, 2024

Atomic cut-over failing due to RENAME running on an unexpected connection id #1449

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Describing safe, blocking, atomic, pure-mysql cut-over phase #82

Describing safe, blocking, atomic, pure-mysql cut-over phase #82

shlomi-noach commented Jun 26, 2016 •

edited

Loading

baloo commented Aug 1, 2016 •

edited

Loading

shlomi-noach commented Aug 2, 2016 •

edited

Loading

baloo commented Aug 2, 2016

shlomi-noach commented Oct 29, 2018

youzipi commented Dec 1, 2019 •

edited

Loading

shlomi-noach commented Dec 3, 2019

youzipi commented Dec 26, 2019 •

edited

Loading

shlomi-noach commented Dec 30, 2019

youzipi commented Jan 7, 2020

shlomi-noach commented Jan 7, 2020 •

edited

Loading

ghost commented Apr 25, 2020

shlomi-noach commented Apr 25, 2020 •

edited

Loading

ghost commented Apr 26, 2020

wukongHH commented Jun 11, 2021

shlomi-noach commented Jun 11, 2021 •

edited

Loading

wukongHH commented Jun 11, 2021

shlomi-noach commented Jun 11, 2021 via email

cenkore commented Jun 16, 2021

shlomi-noach commented Jun 21, 2021

Describing safe, blocking, atomic, pure-mysql cut-over phase #82

Describing safe, blocking, atomic, pure-mysql cut-over phase #82

Comments

shlomi-noach commented Jun 26, 2016 • edited Loading

Here are the steps for a safe, atomic cut-over:

Some notes

What happens on failures?

Impact on app

Impact on replication

baloo commented Aug 1, 2016 • edited Loading

shlomi-noach commented Aug 2, 2016 • edited Loading

baloo commented Aug 2, 2016

shlomi-noach commented Oct 29, 2018

youzipi commented Dec 1, 2019 • edited Loading

shlomi-noach commented Dec 3, 2019

youzipi commented Dec 26, 2019 • edited Loading

shlomi-noach commented Dec 30, 2019

youzipi commented Jan 7, 2020

shlomi-noach commented Jan 7, 2020 • edited Loading

ghost commented Apr 25, 2020

shlomi-noach commented Apr 25, 2020 • edited Loading

ghost commented Apr 26, 2020

wukongHH commented Jun 11, 2021

shlomi-noach commented Jun 11, 2021 • edited Loading

wukongHH commented Jun 11, 2021

shlomi-noach commented Jun 11, 2021 via email

cenkore commented Jun 16, 2021

shlomi-noach commented Jun 21, 2021

shlomi-noach commented Jun 26, 2016 •

edited

Loading

baloo commented Aug 1, 2016 •

edited

Loading

shlomi-noach commented Aug 2, 2016 •

edited

Loading

youzipi commented Dec 1, 2019 •

edited

Loading

youzipi commented Dec 26, 2019 •

edited

Loading

shlomi-noach commented Jan 7, 2020 •

edited

Loading

shlomi-noach commented Apr 25, 2020 •

edited

Loading

shlomi-noach commented Jun 11, 2021 •

edited

Loading