Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] Transaction can't guarantee atomicity #487

Closed
ShooterIT opened this issue Feb 16, 2022 · 16 comments · Fixed by #1287
Closed

[BUG] Transaction can't guarantee atomicity #487

ShooterIT opened this issue Feb 16, 2022 · 16 comments · Fixed by #1287
Assignees
Labels
bug type bug

Comments

@ShooterIT
Copy link
Member

Describe the bug
The exclusive guard lock ensures you run the transaction in isolation, but it cannot guarantee the atomicity of the EXEC stage.
You would like to either have all commands applied to the db or none of them applied.
However, if some commands are executed and persisted to rocksdb and the process fails before the tx completes the db sees an intermediate state which is not desired.

Expected behavior
Guarantee the atomicity and consistency.

@Eshcar
Copy link

Eshcar commented Feb 16, 2022

Thanks for opening this issue @ShooterIT
As of late 2021 RocksDB supports transactions , such that no other threads can see the changes in a transaction until it has been committed.
I didnt check the code yet but I assume they already take care of consistency also in case of a failure during the commit phase.

I suggest to use the API they are providing.
This would also allow you to remove the global lock you are using to protect the EXEC stage.

@ShooterIT
Copy link
Member Author

Thanks @Eshcar Yes, i know RocksDB supports, but i worry transaction db would break kvrocks db usage, I will rethink it.
BTW, initially we just want to implement transaction like Redis

@Eshcar
Copy link

Eshcar commented Feb 17, 2022

@ShooterIT - can you demonstrate a problematic scenario that will break other kvrocks commands when using rocksdb txs?
I would be happy to assist in understanding these cases.

@ShooterIT
Copy link
Member Author

@Eshcar Sorry for the slow reply.

Currently, kvrocks directly access db when reading data from db, and write a writebatch when writing data. It doesn't care about rocksdb txs, if we want to use rocksdb txs, there are too many places that is needed to change. And in some places, we directly access DB for checkpoint, WAL..., So i am not sure we could use transaction db easily. If you have ideas, welcome to discuss.

@Eshcar
Copy link

Eshcar commented Mar 2, 2022

Indeed, supporting tx atomicity will create changes to many kvrocks files, if not all. But it has the potential of improving the quality of the code.

Here is my suggestion. At the core of it is the principle of encapsulation:

  1. db_ and storage_ fields of Database class should be private, and not be visible/accessible to all inheriting classes.
  2. Database should include an abstract DbHandler that is responsible for calling db_ and storage_ methods.
  3. SimpleDBHandler and TxDBHandler inherit from DbHandler and implement its API.
  4. When a redis command object is created its dbHandler is set (after calling the c-tor), according to the context.
  5. When running simple commands – set simpleDBHandler; when running a tx, all queued commands in the multi phase are set w the same txDBhandler.

This is the list of methods that reference db_ or storage_ and need to be covered by the handler:
db_:

  1. LatestSnapShot ss(db_)
  2. db_->Get(...)
  3. db_->NewIterator(…)

storage_:

  1. LockGuard guard(…)
  2. batch.Put(…)
  3. storage_->Write()
  4. storage_->GetLockManager()
  5. storage_->Delete()
  6. storage_->IsSlotIdEncoded()
  7. storage_->DeleteRange()

Maybe there are some more methods that I missed.

An example usage of the dbHandler would be:
[Examples of how to replace the code in inheriting commands, and how to implement the methods:]

  • Replace db_->Get() with a dbHandler->get()
  • SimpleDbHandler implements this by db_->Get()
  • TxDbHandler implement this by txn_db->Get(), where txn_db is opened by TransactionDB::Open when the transaction begins.
  • Replace batch.put() with dbHandler->batchPut()
  • SimpleDbHandler implements this by batch.put(), where the batch is only used by this command
  • TxDbHandler implement this by txn_db->Put(), where the same batch is used by all queued command
  • Replace storage_->Write() with dbHandler->write()
  • SimpleDbHandler implements this by storage_->Write()
  • TxDbHandler implement this by txn->Commit(), where txn is created by txn_db->BeginTransaction(), when the tx begins.

So the code and flow for commands is similar to the current code, the only difference is how specific calls are handled at the db_ level based on the context (tx or simple).

Do you see any inherent limitation in this suggestion? (except for the fact that it requires major refactoring of the code).

@ShooterIT
Copy link
Member Author

Impressive idea, truly thanks @Eshcar

Recently, I always thought that, it will be really cool to support ACID transaction for KV NoSql. Even I also want to support interactive transaction, and we can support more features transaction, such as optional sync, isolation level.

MULTI sync yes/no  isolation  ReadCommitted/RepeatableRead
xxx
xxx
EXEC

sync: we call fsync after transactions, i.e. truly guarantee data persistent in disk
isolation: we can easily implement different isolation using rocksdb capacity, of course, we don't support this in the first version.

For code refactor of storage, i ever had a big idea, i think it is not good to directly access db_(pointer of RocksDB) in different data type implementation(redis_zset.cc, redis_hash.cc...), we should not expose it. I think we should provide some simple interfaces in storage layer. such as Put(support multi keys), Get(support multi keys),Del(support multi keys), RangeRead, RangeDelete. Data type implementation layer just call these functions, and we could easily change these function implementation even we can change different storage engine. Of course, we need to change too many places.

Your suggested method is much accessible i think, though it also breaks many places. Recently, i may don't have enough spare time to do this job, i think you can have a try if you want.

@Eshcar
Copy link

Eshcar commented Mar 27, 2022

Thanks @ShooterIT
I do not have the capacity to implement/test/benchmark this solution.
I would be happy to work with/guide anyone from the community that is willing to implement this change.
Let me update soon if I can allocate alternative resources for this effort.

@git-hulk
Copy link
Member

git-hulk commented Feb 26, 2023

Hi, @Eshcar @ShooterIT I have an idea to workaround this bug. We can add a new function to create a shared WriteBatch for the Multi-Exec command like the below:

Status Storage::Begin() {
    is_txn_mode = true;
    txn_write_batch = make_shared<*rocksdb::WriteBatch>();
}

rocksdb::WriteBatch *Storage::GetWriteBatch() {
   if (is_txn_mode) {
      return txn_write_batch.get();
   }
   return make_shared<*rocksdb::WriteBatch>();
}

Status Storage::Commit() {
    is_txn_mode = false;
    // write txn WriteBatch to RocksDB
    txn_write_batch.reset();
}

For the Command Exec, we need to explicitly call the Begin() and Commit() to enter and leave the transaction mode. So that Kvrocks will create a new WriteBatch for each writes operation if it's NOT in transaction mode, and use the shared WriteBatch to collect all write operations if it's in the transaction mode.

@ShooterIT
Copy link
Member Author

it is limited, in transaction, there also are some reading operations, we also hope we can read committed writing operations, maybe you can implement searching WAL(rocksdb supports it) to support this feature, but we can't resolve conflicts when we commit transaction. so i prefer to just use rocksdb transaction to implement real transaction.

@git-hulk
Copy link
Member

git-hulk commented Feb 27, 2023

Yes, RocksDB also supports the WriteBatchWithIndex to implement the Read-Your-Own-Write feature, but it needs to involve all read operations, so I didn't plan to do it in the current stage. Does it make sense to make write operations atomic first, then involve read operations?

Refer: https://rocksdb.org/blog/2015/02/27/write-batch-with-index.html

@ShooterIT
Copy link
Member Author

but if you don't commit write(pending in txn_write_batch), you can't Read-Your-Own-Write in normal mode.

@git-hulk
Copy link
Member

git-hulk commented Feb 27, 2023

@ShooterIT Do you mean WriteBatchWithIndex can only be used in transaction mode? I see the MongoDB will search the txn_write_batch first if exists then from the DB, so it should work if it can be used in the normal model.

@ShooterIT
Copy link
Member Author

RocksDB also supports the WriteBatchWithIndex to implement the Read-Your-Own-Write feature, but it needs to involve all read operations, so I didn't plan to do it in the current stage

You said you don't want to use WriteBatchWithIndex currently, so you can't Read-Your-Own-Write. if you use it , of course, it can.

@git-hulk
Copy link
Member

oh...I got your point now. What I mean is to implement the atomic write operations first, then involves read operations at the next stage.

@ShooterIT
Copy link
Member Author

you shouldn't ask developers not have Read-Your-Own-Write operations in transactions,

set a b
multi
set a c
get a  // if you don't commit write batch, you will get `b` instead of `c`.
set xxx yyy
...
exec

@git-hulk
Copy link
Member

@ShooterIT Yes, we cannot require users to avoid this, what I mean is to separate them into two PRs to make the code review simple.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug type bug
Projects
Development

Successfully merging a pull request may close this issue.

3 participants