Integrate js-db and concurrency testing #419

CMCDragonkai · 2022-07-15T07:40:05Z

Description

This integrates the new js-db 5.0.0 which brings in rocksdb. It now has correct implementation of optimistic transactions using Snapshot Isolation.

In bringing this, all of our concurrency issues should start with a correct foundation. This means any race conflict at the DB layer will throw an ErrorDBTransactionConflict exception. These exceptions will need to either be handled with an automatic retry where it is appropriate, or when it is not appropriate, it should be bubbled all the way the user interface (CLI and GUI) and the user can decide what to do.

Write skews can be resolved with new methods getForUpdate which materialises the read-write conflict into a write-write conflict. Where these may occur, transaction locks should also be used.

The DBTransaction supports both optimistic transactions and pessimistic transactions in the same transaction object. Locking within the transaction should only be done in order to guard against write-skew thrashing (like counter races).

Because the DBTransaction has its own lock, and most operations should be done with respect to the DB, most usages of in-memory locking can be replaced with reifying it to the database. This means some in-memory constructs will be replaced with in-DB structures instead.

Transaction mutual exclusion should start at the boundary of the program. This means at the GRPC service request handler level. A transaction will be created there, and this context will be shared by the transitive closure of all nested functions and methods. This should be sufficient for most usages.

The EFS will continue to use the old leveldb until it also transitions to using the new 5.0.0 js-db. The EFS runs its own database atm, so it's independent of the main PK database. Conversion of EFS to using the new db will be a separate PR, it requires separate testing because it has to remove its own locking in favour of using transaction-native locking.

While integrating this new DB, we will also start experimenting with more formal model-checking tests against our codebase in order to test concurrent phenomena.

Issues Fixed

Tasks

1. keypath and options are swapped for relevant DB and trans methods.
2. GRPC handlers need to begin the transaction contexts.
3. Try and simplfy the if(tran == null) db.withTranF/G blocks. use tran => function() to avoid code bloat. For generators where possible just pass the generator down with a arrow function. Generally try and avoid wrapping it in a withTransactionG if possible.
4. Any simple get/set operations can just use DB directly if tran isn't provided. for single operations a transaction is pointless.
5. lockbox with re-enter-ency is provided, this can be used to avoid conflicts on counters and creation.
6. Remove ACL and Sigchain implementation of withTransactionF with the db version of it. The Sigchain version needs to use the transaction's locking now.
Update DB, EFS and logger dependencies.
Remove any ts-ignore comments for logger usage.

Final checklist

ghost · 2022-07-24T08:27:44Z

👇 Click on the image for a new way to code review

Make big changes easier — review code in small groups of related files
Know where to start — see the whole change at a glance
Take a code tour — explore the change with an interactive tour
Make comments and review — all fully sync’ed with github

Try it now!

Legend

CMCDragonkai · 2022-07-28T08:22:41Z

For better concurrency testing, as well as fast-check, we can bring in jest-extended. It's already starting to be used by EFS which has lot of concurrency behaviour to verify.

CMCDragonkai · 2022-07-31T09:55:25Z

Rebased on master.

This has now jest-extended, fast-check, DB 5.0.1.

CMCDragonkai · 2022-07-31T13:49:01Z

Transaction mutual exclusion should start at the boundary of the program. This means at the GRPC service request handler level. A transaction will be created there, and this context will be shared by the transitive closure of all nested functions and methods. This should be sufficient for most usages.

This means the DBTransaction object is passed into domain methods. So the methods of Queue for example might be:

public addSomething(...args, tran?: DBTransaction) {
}

We're using the last parameter as the optional argument here... but it may be rolled into a general options object too in the future.

But this is the design I've found that works as this also allows the domain objects to be used atomically as each method will create their own transaction context level.

CMCDragonkai · 2022-07-31T14:13:55Z

@tegefaulkes @emmacasolin this PR is an exploration of the concepts of queue, db and tracing. Several PRs are likely to be split off from this PR which will target staging. I'll be setting up a framework from this PR, and then there will be pieces of work sliced out.

The DB integration will require reworking how transactions are used for atomicity.
Tracing can be its own domain that creates span that abstracts over how we currently use logger.getChild
Queue involves a lot of interesting ideas

CMCDragonkai · 2022-08-05T14:27:43Z

By using raw: true, you can insert Buffer.from(lexi.pack(num)) directly as a key without first encoding it as a hex value. (Hex value encoding can still be useful though in case it has to maintain some defined set of values independent from other values when it is concatenated together).

Doing this currently with the timestamped level for the tasks.

CMCDragonkai · 2022-08-05T14:30:22Z

When presenting task ids to the end user, in a similar vein to claim ids, we have to show multibase encoded IDs, as that's the only way for users to be able to easily copy-paste them around. But they are decoded to regular IdSortable when possible. If they are carried around in JSON and record keys they are carried around with string encoding. Something like TaskIdString.

TaskId - the regular `IdSortable` type
TaskIdString - string encoded to be used in record keys
TaskIdEncoded - multibase encoded

CMCDragonkai · 2022-08-05T14:32:10Z

I'm using extractTs from IdSortable to acquire the timestamp portion of the TaskId, this way the ID and the timestamp that is being stored is the same. The timestamp can never change, but we could allow the delay to be dynamically updated, but this is not an essential feature for how we envision using the Queue.

CMCDragonkai · 2022-08-08T02:22:45Z

This will be the PR where we do both the DB integration and Queue/Scheduler experiments initially, then depending on whichever one is ready, they can be separated out to be merged in.

tegefaulkes · 2022-08-08T08:00:25Z

As discussed, I'll separate the changes here into two PRs, The queue changes will move into a new branch feature-queue.

CMCDragonkai · 2022-08-08T12:41:37Z

You can separate the changes already here into several smaller commits too like introducing jest-extended, and moving PromiseDeconstructed out.

CMCDragonkai · 2022-08-08T12:42:04Z

Don't forget to rebase on staging too. We are then making a stacked PR from this.

tegefaulkes · 2022-08-09T02:23:48Z

Summary of changes that need to be made for DB 5.x.x integration.

keypath and options are swapped for relevant DB and trans methods.
GRPC handlers need to begin the transaction contexts.
Any transaction conflicts bubble up to the user for now while we assess how much conflicting may occur.
Try and simplfy the if(tran == null) db.withTranF/G blocks. use tran => function() to avoid code bloat. For generators where possible just pass the generator down with a arrow function. Generally try and avoid wrapping it in a withTransactionG if possible.
Any simple get/set operations can just use DB directly if tran isn't provided. for single operations a transaction is pointless.
lockbox with re-enter-ency is provided, this can be used to avoid conflicts on counters and creation.

tests/acl/ACL.test.ts

CMCDragonkai · 2022-08-22T06:51:46Z

For all the places where you are using getForUpdate, I'm thinking that this is only necessary during write skews. And during the instance methods for the managers. During the creation/destruction of singleton instances, it shouldn't be necessary since they are singletons.

It's only when multiple concurrent calls are likely, as in the creation/destruction of non-singleton instances, and during method invocations likely triggered from multiple RPC handlers.

CMCDragonkai · 2022-08-22T06:52:10Z

Also make sure you're that you are sharing the same DBTransaction instance in all the lifecycle handlers.

CMCDragonkai · 2022-08-22T07:25:11Z

Wherever it is relevant, we can always add locks to ensure serialisation. In fact one could say that all handlers are serialised in some special way, but it is often too complex, and something that we want to minimise to where it is needed. Instead we are going to bubble up the transaction conflict up the handlers. Then we can then come down later from the PK client and add auto-retry where appropriate.

tegefaulkes · 2022-08-22T07:53:47Z

So If I understand the usage of getForUpdate, we'd want to use it where we get from key A and update key B or C based on the contents of A? In that case we'd want to use it for cases were we have relationships between data or indexing?

The example above is for renaming vaults were we update vaultId->vaultName, vaultName->vaultId and delete oldVaultName. Another example is the ACL where we have tight relationships between vaultId -> [nodeId, nodeId] --> permId -> permRef.

So we'd want to use getForUpdate to help enforce consistency for data we're trying to keep relationships between.

CMCDragonkai · 2022-08-22T08:02:03Z

What's an example of a ACL write skew? But yes that's the general idea. Have a read on write skew on wikipedia again to refresh memory. It also depends on whether there are any consistency constraints that are being violated here.

This is kind of why I wanted SSI to avoid this problem entirely since it is a complicated edge case.

tegefaulkes · 2022-08-22T08:09:10Z

For ACL there are two aspects.

we're following a chain of nodeId -> permId -> permRef and end up updating the data in the permRef.
While following this chain we're checking if the nodeIds are missing the permId or permRef and removing these nodes from the permRef data.

Here is a code snippet.

 /**
   * Gets the record of node ids to permission for a given vault id
   * The node ids in the record each represent a unique gestalt
   * If there are no permissions, then an empty record is returned
   */
  @ready(new aclErrors.ErrorACLNotRunning())
  public async getVaultPerm(
    vaultId: VaultId,
    tran?: DBTransaction,
  ): Promise<Record<NodeId, Permission>> {
    if (tran == null) {
      return this.db.withTransactionF((tran) =>
        this.getVaultPerm(vaultId, tran),
      );
    }
    const nodeIds = await tran.get<Record<NodeId, null>>(
      [...this.aclVaultsDbPath, vaultId.toBuffer()],
      false,
    );
    if (nodeIds == null) {
      return {};
    }
    const perms: Record<NodeId, Permission> = {};
    const nodeIdsGc: Set<NodeId> = new Set();
    for (const nodeIdString in nodeIds) {
      const nodeId: NodeId = IdInternal.fromString(nodeIdString);
      const permId = await tran.get(
        [...this.aclNodesDbPath, nodeId.toBuffer()],
        true,
      );
      if (permId == null) {
        // Invalid node id
        nodeIdsGc.add(nodeId);
        continue;
      }
      const permRef = (await tran.get(
        [...this.aclPermsDbPath, permId],
        false,
      )) as Ref<Permission>;
      if (!(vaultId in permRef.object.vaults)) {
        // Vault id is missing from the perm
        nodeIdsGc.add(nodeId);
        continue;
      }
      perms[nodeId] = permRef.object;
    }
    if (nodeIdsGc.size > 0) {
      // Remove invalid node ids
      for (const nodeId of nodeIdsGc) {
        delete nodeIds[nodeId];
      }
      await tran.put(
        [...this.aclVaultsDbPath, vaultId.toBuffer()],
        nodeIds,
        false,
      );
    }
    return perms;
  }

CMCDragonkai · 2022-08-22T08:46:15Z

Ok we have identified where write skews may exist:

O <-- O
 ^
  \+- O

If it is meant to be a 1:1 relationship, then this application constraint is being violated. It's not a transaction conflict.

Solve by adding a lock to force serialisation. Then by adding a getForUpdate which would force a conflict if a lock wasn't there.

This applies to assigning a vault name to a vault id, and also creating a new claim on the sigchain, as the sigchain must be linear.

tegefaulkes · 2022-08-22T09:22:43Z

I think this is good now but we'll have to see what the CI shows us. If it's good then I can start squashing and prepping to merge.

This includes using `getForUpdate` for any counter updates.

… creation

tegefaulkes · 2022-08-22T09:32:47Z

Should be good to merge now.

CMCDragonkai · 2022-08-22T09:40:57Z

CI failures, a timeout and a transaction conflict error.

CMCDragonkai · 2022-08-22T09:41:59Z

Does this PR actually resolve #244?

CMCDragonkai · 2022-08-22T09:46:20Z

For #244, the main idea is that NodeGraph should be taking a tran parameter as per the comment: #244 (comment).... so that way the node graph can be properly transactional and composed with other operations. I don't remember if there was anything else I missed, but @tegefaulkes took over the development of the node graph lately.

CMCDragonkai · 2022-08-22T10:04:06Z

Long term we should remove all the object map locks in favour of the transaction locks. This is because locks don't compose, and all the object maps are just representations of the source of truth on disk. Ideally we should use the tran.queueSuccess and such that can do in-memory changes only after a commit occurs. This means local state inside a transactional function is temporary, and only after a commit succeeds, does global (domain state) get mutated.

Generally I reckon we shouldn't have any operations that is mutating the in-memory state without also involving the underlying disk, so there shouldn't be a situation where they can't coordinate with the transaction locks.

The transaction locks are far more useful generally because they are re-entrant, and they don't release until transaction destruction.

Issue created: #443

src/vaults/VaultManager.ts

CMCDragonkai · 2022-08-22T10:40:34Z

I reckon we still need to fix that the key paths used for locking and also investigate why:

    ✕ fails after receiving undefined singly signed claim (20523 ms)

This is taking longer than 20s. Seems like a possible deadlock.

CMCDragonkai · 2022-08-22T10:41:35Z

I reckon we still need to fix that the key paths used for locking and also investigate why:
    ✕ fails after receiving undefined singly signed claim (20523 ms)
This is taking longer than 20s. Seems like a possible deadlock.

Actually I saw that other places also take about 20 seconds there. So I guess it's not a big deal. So we can merge without addressing this for now.

We got lots of things like queues, tasks, cancellation that is coming in to help with managing our asynchronous timelines and resources.

CMCDragonkai self-assigned this Jul 15, 2022

CMCDragonkai force-pushed the feature-concurrent-state branch 4 times, most recently from fb83c35 to 1974f1e Compare July 24, 2022 08:25

CMCDragonkai changed the title ~~WIP: Integrate js-db and integrate concurrency testing~~ WIP: Integrate js-db and concurrency testing Jul 25, 2022

CMCDragonkai mentioned this pull request Jul 27, 2022

CLI option --format=json should change the STDERR logging to output JSON structured logging #421

Merged

8 tasks

This was referenced Aug 11, 2023

Fix Failing Build Tests on Windows & MacOS MatrixAI/Polykey-CLI#14

Open

ci: merge staging to master #411

Merged

CMCDragonkai force-pushed the feature-concurrent-state branch 2 times, most recently from 1a872fe to 716bcd3 Compare July 31, 2022 09:19

CMCDragonkai linked an issue Aug 1, 2022 that may be closed by this pull request

Generic Non-Blocking Task Management ("Queue") for discovery and nodes domains #329

Closed

CMCDragonkai mentioned this pull request May 15, 2023

Integration Tests for testnet.polykey.com MatrixAI/Polykey-CLI#71

Closed

14 tasks

CMCDragonkai force-pushed the feature-concurrent-state branch from 389e06a to 05ff8b0 Compare August 5, 2022 13:57

CMCDragonkai assigned tegefaulkes Aug 8, 2022

tegefaulkes force-pushed the feature-concurrent-state branch from 88c2ec7 to 3c1b8c8 Compare August 8, 2022 08:15

CMCDragonkai removed a link to an issue Aug 8, 2022

Generic Non-Blocking Task Management ("Queue") for discovery and nodes domains #329

Closed

CMCDragonkai commented Aug 22, 2022

View reviewed changes

tests/acl/ACL.test.ts Show resolved Hide resolved

tegefaulkes added 7 commits August 22, 2022 19:28

fix: vaults locking and transactional locking

16c0279

fix: cleaning up unnecessary locks

48a6ea4

This includes using `getForUpdate` for any counter updates.

fix: some get/set operations just use DB if transaction not given.

8ea2e15

ci: reenabling check tests

0f4d5c2

build: updating @matrixai/db to ^5.0.3 and encryptedfs to ^3.5.6

39ceada

fix: removed ts-ignores for logger version mismatch

4fe9492

fix: using tran.getForUpdate for vaults renaming and sigchain claim…

1b109d9

… creation

tegefaulkes force-pushed the feature-concurrent-state branch from 32ecc60 to 1b109d9 Compare August 22, 2022 09:29

CMCDragonkai changed the title ~~WIP: Integrate js-db and concurrency testing~~ Integrate js-db and concurrency testing Aug 22, 2022

fix: added transaction lock on vaultId for vault renaming

1bc948b

CMCDragonkai commented Aug 22, 2022

View reviewed changes

src/vaults/VaultManager.ts Outdated Show resolved Hide resolved

fix: VaultManager vaultId locking path and using join('') on paths

c75a402

tegefaulkes merged commit 5d90a03 into staging Aug 23, 2022

CMCDragonkai mentioned this pull request Aug 23, 2022

Object map locks should be removed in favour of transaction locks #443

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Integrate js-db and concurrency testing #419

Integrate js-db and concurrency testing #419

CMCDragonkai commented Jul 15, 2022 •

edited by tegefaulkes

Loading

ghost commented Jul 24, 2022 •

edited by ghost

Loading

CMCDragonkai commented Jul 28, 2022

CMCDragonkai commented Jul 31, 2022

CMCDragonkai commented Jul 31, 2022

CMCDragonkai commented Jul 31, 2022

CMCDragonkai commented Aug 5, 2022 •

edited

Loading

CMCDragonkai commented Aug 5, 2022

CMCDragonkai commented Aug 5, 2022

CMCDragonkai commented Aug 8, 2022

tegefaulkes commented Aug 8, 2022

CMCDragonkai commented Aug 8, 2022

CMCDragonkai commented Aug 8, 2022

tegefaulkes commented Aug 9, 2022

CMCDragonkai commented Aug 22, 2022

CMCDragonkai commented Aug 22, 2022

CMCDragonkai commented Aug 22, 2022

tegefaulkes commented Aug 22, 2022

CMCDragonkai commented Aug 22, 2022

tegefaulkes commented Aug 22, 2022

CMCDragonkai commented Aug 22, 2022

tegefaulkes commented Aug 22, 2022

tegefaulkes commented Aug 22, 2022

CMCDragonkai commented Aug 22, 2022

CMCDragonkai commented Aug 22, 2022

CMCDragonkai commented Aug 22, 2022

CMCDragonkai commented Aug 22, 2022 •

edited

Loading

CMCDragonkai commented Aug 22, 2022

CMCDragonkai commented Aug 22, 2022

Integrate js-db and concurrency testing #419

Integrate js-db and concurrency testing #419

Conversation

CMCDragonkai commented Jul 15, 2022 • edited by tegefaulkes Loading

Description

Issues Fixed

Tasks

Final checklist

ghost commented Jul 24, 2022 • edited by ghost Loading

Legend

CMCDragonkai commented Jul 28, 2022

CMCDragonkai commented Jul 31, 2022

CMCDragonkai commented Jul 31, 2022

CMCDragonkai commented Jul 31, 2022

CMCDragonkai commented Aug 5, 2022 • edited Loading

CMCDragonkai commented Aug 5, 2022

CMCDragonkai commented Aug 5, 2022

CMCDragonkai commented Aug 8, 2022

tegefaulkes commented Aug 8, 2022

CMCDragonkai commented Aug 8, 2022

CMCDragonkai commented Aug 8, 2022

tegefaulkes commented Aug 9, 2022

CMCDragonkai commented Aug 22, 2022

CMCDragonkai commented Aug 22, 2022

CMCDragonkai commented Aug 22, 2022

tegefaulkes commented Aug 22, 2022

CMCDragonkai commented Aug 22, 2022

tegefaulkes commented Aug 22, 2022

CMCDragonkai commented Aug 22, 2022

tegefaulkes commented Aug 22, 2022

tegefaulkes commented Aug 22, 2022

CMCDragonkai commented Aug 22, 2022

CMCDragonkai commented Aug 22, 2022

CMCDragonkai commented Aug 22, 2022

CMCDragonkai commented Aug 22, 2022 • edited Loading

CMCDragonkai commented Aug 22, 2022

CMCDragonkai commented Aug 22, 2022

CMCDragonkai commented Jul 15, 2022 •

edited by tegefaulkes

Loading

ghost commented Jul 24, 2022 •

edited by ghost

Loading

CMCDragonkai commented Aug 5, 2022 •

edited

Loading

CMCDragonkai commented Aug 22, 2022 •

edited

Loading