Describing cut-over phase via UDF wait condition #50

Closed
shlomi-noach opened this issue Jun 2, 2016 · 9 comments


@shlomi-noach

shlomi-noach commented Jun 2, 2016

I wrote these UDFs (will cross-link once they are in a proper repo):

  • create_ghost_wait_condition()
  • destroy_ghost_wait_condition()
  • ghost_wait_on_condition()

For now, they use a single, global wait condition (maybe in the future we will support multiple).
The wait condition is a lock which is not bound to a connection. So if I:

select create_ghost_wait_condition()

Then the lock is taken, and it stays taken even if my connection dies. Anyone can, at any time:

select destroy_ghost_wait_condition()

And release the lock.

The function ghost_wait_on_condition() returns immediately if the condition is free, or blocks if the condition is taken. Multiple ghost_wait_on_condition() calls can run concurrently and they will all wait. Once destroy_ghost_wait_condition() is called, they all get released.
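To make these semantics concrete, here is a minimal in-process model in Python. It is not the UDF code: the class name `GhostWaitCondition`, its method names, and the boolean return value of `create()` are all assumptions made for illustration. It only mirrors the behaviour described above: one global condition, not owned by whoever took it, releasable by anyone, with any number of waiters freed at once.

```python
import threading


class GhostWaitCondition:
    """Hypothetical in-process model of the single, global wait condition."""

    def __init__(self):
        self._guard = threading.Lock()
        self._free = threading.Event()
        self._free.set()  # the condition starts out free

    def create(self):
        """Models create_ghost_wait_condition(): take the condition if free."""
        with self._guard:
            if not self._free.is_set():
                return False  # already taken by someone else
            self._free.clear()
            return True

    def destroy(self):
        """Models destroy_ghost_wait_condition(): anyone may release it."""
        self._free.set()

    def wait(self, timeout=None):
        """Models ghost_wait_on_condition(): returns True once it is free."""
        return self._free.wait(timeout)


def demo():
    """Take the condition, park three waiters, then release them all."""
    cond = GhostWaitCondition()
    cond.create()
    released = []

    def waiter(i):
        cond.wait()
        released.append(i)

    threads = [threading.Thread(target=waiter, args=(i,)) for i in range(3)]
    for t in threads:
        t.start()
    # While the condition is taken, a bounded wait times out.
    blocked = not cond.wait(timeout=0.1)
    cond.destroy()
    for t in threads:
        t.join()
    return blocked, sorted(released)
```

Note that nothing ties the condition to the thread that called `create()`. That is the property the UDF is meant to provide relative to a connection.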

Cut-over via wait condition

We have these tables:

  • tbl - original table
  • _tbl_gst - ghost table

Sequence of events is:

  1. create view _tbl_gst_v as select * from _tbl_gst where ghost_wait_on_condition() is not null with check option

    Breakdown:

    • ghost_wait_on_condition() is always not null
    • with check option means every insert, delete, update on existing rows will validate that the view definition is met, i.e. the where clause is satisfied, i.e. we wait on lock.
    • the view uses merge algorithm; all queries are implicitly pushed down to the table (no temporary table buffer)
  2. select create_ghost_wait_condition()

    • Nothing happens yet. We got our lock but no one is attempting to use it.
  3. rename table tbl to _tbl_old, _tbl_gst_v to tbl

    • we push the view instead of the original table. The view reads/writes to _tbl_gst
    • but it is blocked: insert|delete|update queries operating on the view (formerly _tbl_gst_v, now named tbl) are blocked
      • note: except for update or delete that operate on non-existent rows (hence make no change, hence not visible in RBR anyhow, hence irrelevant)
  4. Working on backlog, applying all those last changes read from the binary log onto _tbl_gst

    • noteworthy: _tbl_gst itself is not blocked in any way.
  5. select destroy_ghost_wait_condition()

    • writes are now enabled on tbl (our view). It still reads/writes to _tbl_gst.
    • all blocked queries are suddenly released and operating on the view
    • new incoming queries operate on the view
  6. rename table tbl to _tbl_gst_v_old, _tbl_gst to tbl

    • we now move our view aside, putting our ghost table in its place.
  7. drop view _tbl_gst_v_old

  8. 🍕
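The sequence above can be written down as an ordered list of statements. The sketch below is only that: a hypothetical Python helper (`cutover_statements` is not part of gh-ost) that derives the table and view names from the original table name, following the naming used in this issue. Step 4 is not a SQL statement, so it appears as a comment placeholder.

```python
def cutover_statements(table="tbl"):
    """Return the proposed cut-over steps, in order, for the given table.

    Hypothetical helper: names follow the _tbl_gst / _tbl_gst_v / _tbl_old
    convention used in this issue.
    """
    ghost = f"_{table}_gst"
    view = f"{ghost}_v"
    return [
        # 1. A view whose where clause blocks on the wait condition.
        f"create view {view} as select * from {ghost} "
        f"where ghost_wait_on_condition() is not null with check option",
        # 2. Take the global wait condition.
        "select create_ghost_wait_condition()",
        # 3. Swap the (blocked) view in place of the original table.
        f"rename table {table} to _{table}_old, {view} to {table}",
        # 4. Not SQL: gh-ost applies the remaining backlog meanwhile.
        f"-- apply the remaining binary log backlog onto {ghost}",
        # 5. Release the condition; blocked queries proceed via the view.
        "select destroy_ghost_wait_condition()",
        # 6. Move the view aside and put the ghost table in its place.
        f"rename table {table} to {view}_old, {ghost} to {table}",
        # 7. Clean up the view.
        f"drop view {view}_old",
    ]


if __name__ == "__main__":
    for statement in cutover_statements("tbl"):
        print(statement)
```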

Will you please run this through your virtual interpreter in your brains? Let's assume the UDFs work perfectly well (and they're simple enough to support that assumption).

@shlomi-noach

Known catch: what if gh-ost terminates prematurely?

  • If between steps 3 & 5, then we get a blocked view (I already coded a timeout for the wait condition, but that's obviously not an answer). The condition itself does not expire.
  • If between steps 5 & 6, then we end up reading from and writing to a view. This works, but with condition/mutex overhead, and is obviously not for the long run.

@tomkrouper

I've read through this once and it hurts my brain. I know this isn't helpful, but I don't want you coming in in the morning thinking no one has commented. I think I'm gonna need to read through it a few more times to get my head around it.

My only concern so far is related to having multiple migrations going at once and having them end near the same time and having the UDF unlock one before its time. But as I said, my brain, it hurts. So I might be missing the reason why it is not an issue. I'll give it another look in the morning.

@shlomi-noach

I'm so glad to share the pain. If ever I get a burn-out, you can tell the psychiatrist the gh-ost cut-over phase is a major contributor.

@shlomi-noach

My only concern so far is related to having multiple migrations going at once and having them end near the same time and having the UDF unlock one before its time.

This is actually OK. Reason: gh-ost will first issue a create_ghost_wait_condition() so as to acquire the lock. This function returns an integer value indicating success or failure (the actual value is yet to be determined; say, < 0 for failure). So if the lock is already in use by another migration, this migration will not be able to acquire it. It will retry a few times, which should work because the lock should only be held for a couple of seconds. Once it succeeds in acquiring the lock, it is guaranteed to be the only gh-ost in the cut-over phase.
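A rough sketch of that acquire-with-retry logic, in Python for brevity. Everything here is an assumption: `acquire_wait_condition` and its parameters are hypothetical, `create_condition` stands in for running `select create_ghost_wait_condition()` and reading back the integer, and "negative means failure" is only the tentative convention mentioned above.

```python
import time


def acquire_wait_condition(create_condition, attempts=5, delay_seconds=0.5):
    """Try to take the global wait condition, retrying a few times.

    create_condition: callable returning an int; a negative value is
    treated as "already taken by another migration" (assumed convention).
    Returns the attempt number that succeeded, or None if all failed.
    """
    for attempt in range(1, attempts + 1):
        if create_condition() >= 0:
            return attempt  # we are now the only migration cutting over
        if attempt < attempts:
            time.sleep(delay_seconds)  # another cut-over should end shortly
    return None
```

If every attempt fails, the caller would presumably postpone the cut-over rather than proceed, since proceeding without the lock loses the exclusivity guarantee.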

@shlomi-noach

This is where the UDF code resides, for now: https://github.com/openark/udf-ghost-wait-condition
Code is subject to change.

@tomkrouper

@shlomi-noach in chat you discussed fancy charts and pretty graphs* to help with this. I'm 💯 behind this idea. I think it would be very helpful.

* = I may be overstating what you offered.

@shlomi-noach

[Images: ghost-cutover-udf-wait, slides 002 through 009]

@shlomi-noach

A few more words on how this idea came to be. As per #26, there are two connections, and the premature death of either would cause a premature cut-over:

  1. A connection holding a voluntary lock
  2. A connection issuing a query using the voluntary lock

The solution depicted here answers these two problems:

  1. Instead of a voluntary lock, which expires upon connection death, we use a global lock, which is connection-independent. It gets released when we explicitly tell it to be released.
  2. Instead of a blocking query, which can be killed along with its connection, we create a view, which is a way to persist a query. Now, anyone who wants to access the data will directly compete for the lock.

@shlomi-noach

Thank you for spending the time reviewing this. I'm closing this issue as #65 came up, which I believe solves the cut-over without UDFs.
