Describing cut-over phase via UDF wait condition #50

shlomi-noach · 2016-06-02T13:00:24Z

I wrote these UDFs (will cross-link once they are in a proper repo):

create_ghost_wait_condition()
destroy_ghost_wait_condition()
ghost_wait_on_condition()

For now, they use a single, global wait condition (maybe in the future we will support multiple).
The wait condition is a lock which is not bound to a connection. So if I:

select create_ghost_wait_condition()

Then the lock is taken, and is kept taken even if my connection dies. Anyone can, at any time:

select destroy_ghost_wait_condition()

And release the lock.

The function ghost_wait_on_condition() returns immediately if the condition is free, or blocks if the condition is taken. Multiple ghost_wait_on_condition() calls can run concurrently and they will all wait. Once destroy_ghost_wait_condition() is called they all get released.
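To make the semantics above concrete, here is a rough in-process model in Python. It is not the UDF code: the class name GhostWaitCondition, the method names, and the return values (1 / -1, True / False) are assumptions made for illustration only.

```python
import threading


class GhostWaitCondition:
    """Toy model of the single, global, connection-independent wait condition."""

    def __init__(self):
        self._guard = threading.Lock()
        self._free = threading.Event()
        self._free.set()  # the condition starts out free

    def create(self):
        """Take the condition. Returns 1 on success, -1 if already taken (assumed values)."""
        with self._guard:
            if not self._free.is_set():
                return -1
            self._free.clear()
            return 1

    def destroy(self):
        """Release the condition. Any caller may do this, not only the creator."""
        self._free.set()
        return 1

    def wait(self, timeout=None):
        """Return at once if the condition is free, otherwise block until it is destroyed.

        Returns True when the condition is free or released, False on timeout.
        """
        return self._free.wait(timeout)
```

Nothing in this model ties the condition to whoever called create(), which mirrors the point that the lock survives the death of the connection that took it.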

Cut-over via wait condition

We have these tables:

tbl - original table
_tbl_gst - ghost table

Sequence of events is:

1. create view _tbl_gst_v as select * from _tbl_gst where ghost_wait_on_condition() is not null with check option
   Breakdown:
   - ghost_wait_on_condition() is always not null
   - with check option means every insert, delete, or update on existing rows validates that the view definition is met, i.e. the where clause is satisfied, i.e. we wait on the lock.
   - the view uses the merge algorithm; all queries are implicitly pushed down to the table (no temporary table buffer)
2. select create_ghost_wait_condition()
   - Nothing happens yet. We got our lock but no one is attempting to use it.
3. rename table tbl to _tbl_old, _tbl_gst_v to tbl
   - we push the view in place of the original table. The view reads/writes to _tbl_gst
   - but it is blocked. insert|delete|update queries operating on _tbl_gst_v are blocked
     - note: except for update or delete that operate on non-existent rows (hence make no change, hence not visible in RBR anyhow, hence irrelevant)
4. Work on the backlog, applying all those last changes read from the binary log onto _tbl_gst
   - noteworthy: _tbl_gst itself is not blocked in any way.
5. select destroy_ghost_wait_condition()
   - writes are now enabled on tbl (our view). It still reads/writes to _tbl_gst.
   - all blocked queries are suddenly released and operate on the view
   - new incoming queries operate on the view
6. rename table tbl to _tbl_gst_v_old, _tbl_gst to tbl
   - we now move our view aside, putting our ghost table in its place.
7. drop view _tbl_gst_v_old
🍕
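The sequence above can be written down as an ordered list of statements for any table name. This is only a sketch: the function name cut_over_statements is hypothetical, the naming scheme (_tbl_gst, _tbl_gst_v, _tbl_old, _tbl_gst_v_old) is copied from this post, and the backlog step is not a SQL statement, so it appears as a comment placeholder.

```python
def cut_over_statements(table):
    """Return the cut-over steps, in order, for the given original table name."""
    ghost = f"_{table}_gst"          # ghost table
    view = f"{ghost}_v"              # view guarding the ghost table
    return [
        f"create view {view} as select * from {ghost} "
        "where ghost_wait_on_condition() is not null with check option",
        "select create_ghost_wait_condition()",
        f"rename table {table} to _{table}_old, {view} to {table}",
        f"-- apply remaining binary log backlog onto {ghost}",
        "select destroy_ghost_wait_condition()",
        f"rename table {table} to {view}_old, {ghost} to {table}",
        f"drop view {view}_old",
    ]


if __name__ == "__main__":
    for number, statement in enumerate(cut_over_statements("tbl"), start=1):
        print(number, statement)
```

Keeping the steps in one ordered list makes it easy to see at which index a premature termination would leave things.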

Will you please run this through your virtual interpreter in your brains? Let's assume the UDFs work perfectly well (and they're simple enough to support that assumption).

shlomi-noach · 2016-06-02T14:27:26Z

Known catch: what if gh-ost terminates prematurely?

If between steps 3 & 5, then we get a blocked view (I already coded a timeout for the wait condition, but that's obviously not an answer). The condition itself does not expire.
If between steps 5 & 6, then we end up reading/writing to a view. This works, but with condition/mutex overhead, and obviously not for the long run.

tomkrouper · 2016-06-02T23:20:57Z

I've read through this once and it hurts my brain. I know this isn't helpful, but I don't want you coming in in the morning thinking no one has commented. I think I'm gonna need to read through it a few more times to get my head around it.

My only concern so far is related to having multiple migrations going at once, having them end near the same time, and having the UDF unlock one before its time. But as I said, my brain, it hurts. So I might be missing the reason why it is not an issue. I'll give it another look in the morning.

shlomi-noach · 2016-06-03T07:12:38Z

I'm so glad to share the pain. If ever I get a burn-out, you can tell the psychiatrist the gh-ost cut-over phase is a major contributor.

shlomi-noach · 2016-06-03T07:15:24Z

> My only concern so far is related to having multiple migrations going at once, having them end near the same time, and having the UDF unlock one before its time.

This is actually OK. Reason: gh-ost will first issue a create_ghost_wait_condition() so as to acquire the lock. This function returns an integer value indicating success or failure (the actual values are yet to be determined; say, < 0 for failure). So if the lock is already being used by another migration, this migration would not be able to acquire it. It will retry a few times, which should work because the lock should only be taken for a couple of seconds. Once it succeeds in acquiring the lock, it is guaranteed to be the only gh-ost in the cut-over phase.
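A minimal sketch of that retry behaviour, in Python for illustration only. The helper acquire_with_retry, its parameters, and the try_create callable (standing in for running `select create_ghost_wait_condition()`) are all hypothetical; the "< 0 means failure" convention is the tentative one mentioned above.

```python
import time


def acquire_with_retry(try_create, attempts=5, delay_seconds=0.5):
    """Call try_create() until it reports success or the attempts run out.

    try_create is a stand-in for `select create_ghost_wait_condition()`;
    a result below zero is treated as "lock held by another migration".
    Returns the number of the successful attempt, or None if never acquired.
    """
    for attempt in range(1, attempts + 1):
        if try_create() >= 0:
            return attempt
        if attempt < attempts:
            time.sleep(delay_seconds)  # another cut-over should finish within seconds
    return None  # caller must not start its cut-over
```

A caller that gets None back would simply postpone its cut-over and keep applying the backlog.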

shlomi-noach · 2016-06-03T13:51:43Z

This is where the UDF code resides, for now: https://github.com/openark/udf-ghost-wait-condition
Code is subject to change.

tomkrouper · 2016-06-03T16:01:31Z

@shlomi-noach in chat you discussed fancy charts and pretty graphs* to help with this. I'm 💯 behind this idea. I think it would be very helpful.

* I may be overstating what you offered.

shlomi-noach · 2016-06-06T09:17:06Z

shlomi-noach · 2016-06-06T10:57:28Z

A few more words on how this idea came to be. As per #26, there are two connections, and the premature death of either would cause a premature cut-over:

- a connection holding a voluntary lock
- a connection issuing a query using the voluntary lock

The solution depicted here answers these two problems:

- Instead of a voluntary lock, which expires upon connection death, we use a global lock, which is connection-independent. It gets released when we explicitly tell it to be released.
- Instead of a blocking query, which can be killed along with its connection, we create the view, which is a way to persist a query. Now, anyone who wants to access the data will directly compete for the lock.

shlomi-noach · 2016-06-14T09:43:33Z

Thank you for spending the time reviewing this. I'm closing this issue as #65 came up, which I believe solves the cut-over without UDFs.

shlomi-noach closed this as completed Jun 14, 2016
