Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Regarding AttachDocument()'s performance with Housekeeping and Garbage Collection #444

Closed
krapie opened this issue Jan 12, 2023 · 5 comments

Comments

@krapie
Copy link
Member

krapie commented Jan 12, 2023

Description:

There is a AttachDocument() API performance issue on current react-tldraw example.
Attaching document is taking almost 14 seconds to complete, and this is severally harming Yorkie's user experience.

I have investigated this issue and I think I have found the root cause of this problem.
I will explain why this is happening, and then mention some suggestions to solve this issue.

Why:

There were two main causes of this problem:

  1. Housekeeping's long period of housekeeping-deactivate-threshold
  2. Absence of detachDocument()or deactivateClient() for several reasons

I will explain those two causes with react-tldraw example mentioned above.
It's a long story, so you can scroll down and just see final timeline for this incident.

in my react-tldraw example, I called two functions client.detach() and client.deactivate() in handleDisconnect() on window close to successfully close Yorkie connections.

function handleDisconnect() {
   if (client === undefined || doc === undefined) return;

   client.detach(doc);
   client.deactivate();
 }

...

window.addEventListener('beforeunload', handleDisconnect);

However handleDisconnect() function was not invoked on window close. So detachDocument() or deactivateClient() were failed. I will call this failed client, a dead client.

But still this incident was in control because Yorkie server's Housekeeping service will deactivate clients that have not been used for a long time.
But, current Yorkie server's default configuration value of deactivate threshold DefaultHousekeepingDeactivateThreshold was 7 * 24 * time.Hour = 7 days (I assume that current Yorkie server on Yorkie SaaS is running on this value).

DefaultHousekeepingDeactivateThreshold = 7 * 24 * time.Hour

Therefore, housekeeping(deactivation) of client on react-tldraw did not happen. This made Yorkie server to keep storing dead client's syncedseqs because updateSyncedSeq() was not performed.

// UpdateSyncedSeq updates the syncedSeq of the given client.
func(c * Client) UpdateSyncedSeq(
...

if!isAttached {
    if _, err = c.collection(colSyncedSeqs).DeleteOne(ctx, bson.M {
        "doc_id": encodedDocID,
        "client_id": encodedClientID,
    }, options.Delete());
    err != nil {
        logging.From(ctx).Error(err)
        return fmt.Errorf("delete synced seq: %w", err)
    }
    return nil
}

After that, in every PushPull(), minSyncedTicket in UpdateAndFindMinSyncedTicket() was not updated because of this dead client's syncedseqs, which will always be the smallest value. To keep in mind, syncedseqs is used to perform Garbage Collection on document. For more information, check https://yorkie.dev/docs/internals#garbage-collection.

// 04. update and find min synced ticket for garbage collection.
// NOTE(hackerwins): Since the client could not receive the response, the
// requested seq(reqPack) is stored instead of the response seq(resPack).
minSyncedTicket, err: = be.DB.UpdateAndFindMinSyncedTicket(
...

Therefore, GC was not performed even several changes were made in the document because minSyncedTicket was always the same.
This made document/snapshot get bigger and bigger as more changes are made in the document.

// ApplyChangePack applies the given change pack into this document.
func (d *Document) ApplyChangePack(pack *change.Pack) error {
...

// 04. Do Garbage collection.
d.GarbageCollect(pack.MinSyncedTicket)

So, response snapshot of AttachDocument() got huge (in react-tldraw exmaple's case, 10.9MB), and this delayed network transfer but also slowed down browser's resource loading (in react-tldraw exmaple's case, 2.05 s network transfer + 7.12 s resource loading). This situation is especially common in whiteboard because tons of changes(shape change, coordinate move, etc) are made in collaborative whiteboard.

To conclude:

  1. detachDocument() and deactivateDocument() were not performed on tldraw close.
  2. Yorkie server's DefaultHousekeepingDeactivateThreshold was 7 days, so housekeeping(deactivation) on dead client was not performed.
  3. syncedseqs of dead client was preserved because updateSyncedSeq() was not performed.
  4. minSyncedTicket in UpdateAndFindMinSyncedTicket() were not changed.
  5. Garbage Collection on further document changes was not performed.
  6. Document -> Snapshot -> response of AttachDocument() got huge.
  7. Huge response delayed network transfer and slowed down browser's resource loading.

Suggestions:

Deactivating(Housekeeping) dead client is very crucial because just one dead client will block minSyncedTicket's update, and the whole document will grow due to GC failure. And this will effect other peer's attach performance.

I think there are several ways to solve this problem.

  • Server side: Make Yorkie server's housekeeping deactivate threshold shorter (setting it too short will make client deactivate too quickly, so adjustments should be made).
  • Client side: Ensure client to perform detach/deactivate. ex: navigator.sendBaecon(deactivate) in web browser.

But eventually, server is responsible for this, because client can be unexpectedly terminated due to several reasons (web browser force close, or device crash/blue screen).

I noticed that server can detect WatchDocuments() stream close on client's dead state. Maybe we can perform some actions when WatchDocuments() stream is closed. I will keep looking for the solution to solve this problem.

P.S:

@hackerwins
Copy link
Member

Great analysis. Based on the analysis, I have a few ideas for improvement:

  1. It would be more efficient to set the client deactivation settings, such as DefaultHousekeepingDeactivateThreshold, for each project individually instead of globally. This is because each Project has different configurations and workloads, and it would allow more precise deactivation conditions.

  2. To reduce the size, data that does not need to be permanently stored should be expressed as Presence. Since CRDT(Document) has loads such as tombstone or GC, data that needs to be propagated to peers but does not need to be stored must be expressed as Presence. This is also related to Extract Room from Document and move Presence from Client to Room #442.

  3. We can optimize Document by not applying the logical clock to non-conflicting properties inside a document. Since the logical clock adds extra overhead to the Document, it may be unnecessary to add it to non-conflicting properties. It is also possible to provide an API to allow users to specify conflicting properties.

  4. We can optimize Document by removing tombstone-marked objects from memory by replacing RHTPQMap with RHT in crdt.Object`.

@krapie
Copy link
Member Author

krapie commented Jan 13, 2023

Thank you for sharing your ideas!

Solution 1: Setting DefaultHousekeepingDeactivateThreshold in individual Project sounds really good to me. I will look into this solution as this solutions seems to be the most feasible way to fix current react-tldraw problem.

Solution 2 and 3 will definitely clear out the root cause of this problem as current react-tldraw example's workloads are almost coming from shape movement(coordinate change).
But I need to customize tldraw core to handle shape movement since there are no APIs to detect shape's final(fist/last) coordinate move. Maybe I can somehow manage to handle this without changing tldraw itself but this will make example very hard to follow.

About solution 4... I have little knowledge about RHTPQMap and RHT... so I have to study about this.

I will keep updating this issue if I have some progress on these solutions.

@krapie
Copy link
Member Author

krapie commented Jan 27, 2023

I have implemented some come for Solution 1: Setting DefaultHousekeepingDeactivateThreshold in individual Project, which is Project's ClientDeactivateThreshold.
PR is still in draft: #454

I have tested it, and it seems to work fine.
But this feature is still in progress, and I need more feedbacks to improve.
For more information about this feature, follow PR link above.

@krapie
Copy link
Member Author

krapie commented Feb 17, 2023

Close issue as PR: Add ClientDeactivateThreshold in Project merged.

@krapie
Copy link
Member Author

krapie commented Feb 28, 2023

After introducing ClientDeactivateThreshold in Project, UX experience got much better.

In react-tldraw examples I mentioned above, before it took 14 seconds for AttachDocument() to finish, but now it only takes 1.5 seconds(!!) to finish, which is 90% performance increase. (ClientDeactivateThreshold on react-tldraw project is set to 1 hour)

캡처

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants