
Adding durability for ClojureScript #14

Open · wants to merge 28 commits into base: master

Conversation

@logseq-cldwalker logseq-cldwalker commented Dec 14, 2023

Hi @tonsky. Thanks for this handy library! This is a conversational PR to see if you'd be interested in the ClojureScript durability that @tiensonqin added as part of our datascript.storage cljs implementation. We are using this forked datascript in our product's feature branch, and it is working well for our needs. The existing cljs tests on this repo are passing, as are the relevant upstream datascript.storage cljs tests. I'm aware there's some minor whitespace and commenting that needs to be cleaned up here, as well as cljs storage tests that need to be ported. I could help with those, but for any design questions I'd defer to @tiensonqin. Would you be interested in a contribution like this?

@logseq-cldwalker logseq-cldwalker marked this pull request as draft December 14, 2023 15:26
@tonsky (Owner) commented Dec 16, 2023

Yes! I initially skipped CLJS implementation because I thought localStorage is too small to be useful, and IndexedDB is async. What are you guys using for storage?

@logseq-cldwalker (Author) commented:

Cool. We're using SQLite via OPFS. Of the OPFS approaches that SQLite offers, we chose the OPFS pool approach because it offers a sync handle. The async approach proved too difficult, and we're unsure if it's doable.

@tonsky (Owner) commented Dec 19, 2023

Please let me know when it’s ready for review

@logseq-cldwalker logseq-cldwalker marked this pull request as ready for review December 19, 2023 22:13
@logseq-cldwalker (Author) commented:

Sure. This is ready for review

@logseq-cldwalker (Author) commented:

@tonsky Sorry. There was actually still another bug with deletion. Since I'm going to be away on holiday vacation, I'm putting this back into draft until I get back the first week of next year. Any low-effort feedback would be welcome. Cheers

@logseq-cldwalker logseq-cldwalker marked this pull request as draft December 21, 2023 14:38
@tonsky (Owner) commented Dec 21, 2023

Sure. Will give it a look

@tonsky (Owner) left a comment:

Overall looks great, but I have a couple of notes.

src-clojure/me/tonsky/persistent_sorted_set.cljs (comment outdated, resolved)
(defprotocol IStorage
(restore [this address])
(accessed [this address])
(store [this node address])
@tonsky (Owner) commented:
I’m not sure what address here does? Clojure version only takes node and storage chooses and returns address, so that it can be compatible with e.g. auto-increment. I think we should keep it consistent between versions
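To make the mismatch concrete, here is a minimal sketch (not from the PR diff) of the contract tonsky describes for the Clojure version: `store` takes only the node, and the storage itself chooses and returns the address. The `memory-storage` implementation below is a hypothetical example with auto-incremented addresses.

```clojure
;; Sketch of the Clojure-side shape: storage allocates addresses.
(defprotocol IStorage
  (store [this node]
    "Persist node; the storage chooses and returns an address for it.")
  (restore [this address]
    "Load and return the node previously stored at address."))

;; Hypothetical in-memory backend with auto-incrementing addresses:
(defn memory-storage []
  (let [*state (atom {:next-addr 0 :nodes {}})]
    (reify IStorage
      (store [_ node]
        (let [{:keys [next-addr]}
              (swap! *state
                     (fn [{:keys [next-addr] :as st}]
                       (-> st
                           (assoc-in [:nodes next-addr] node)
                           (update :next-addr inc))))]
          ;; the address just used is one less than the new counter
          (dec next-addr)))
      (restore [_ address]
        (get-in @*state [:nodes address])))))
```

Because the caller never supplies an address, such a protocol stays compatible with backends like SQLite auto-increment keys, which is the consistency concern raised above.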

@tiensonqin commented Jan 4, 2024

The main idea here is to reuse the existing address to reduce storage usage and get rid of GC if possible.
We observed a 0.1 MB increase in db.sqlite when editing one block in Logseq.

I agree that we should keep it consistent between versions; maybe someone can implement the same idea in the Clojure version, or I can remove both address and _dirty.

@tonsky (Owner) commented:
Wait, you can’t reuse addresses though? A user can keep a reference to an old version of the database that references the old version of the block; if you rewrite it, the old db ref will stop working.

There’s actually a whole deal about this in the Java version: you need to know all live references to run GC without breaking any of them. That was the initial promise of Datomic and DataScript: dbs are immutable, and if you keep a reference to them they’ll keep working.

@tiensonqin commented Jan 31, 2024

I'm sorry for not getting back to you sooner.
What I did is keep using any existing address instead of generating a new address for each branch node and leaf. The Clojure version of DataScript has to remove all the unused addresses during GC here; those unused addresses increase storage usage until they're garbage collected. We aim to delete unused addresses immediately, instead of waiting for GC, to reduce the storage overhead.

(deftype Leaf [keys ^:mutable _address ^:mutable _dirty]
IStore
(store-aux [this storage]
(if (or _dirty (nil? _address))
@tonsky (Owner) commented:
What is the idea behind the _dirty flag here? In the Clojure version, if a node has an address, it is persisted; if the address is nil, it is not persisted (== dirty?). Do we really need two separate flags for this?
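For illustration, here is a minimal sketch of the single-field convention described above, where `_address` alone encodes persistence state. The helper name `store-if-needed` is hypothetical; `store` is assumed to be the storage protocol function that allocates and returns an address.

```clojure
;; Sketch of the Clojure-side convention: one field, two meanings.
;;   (.-_address node) is nil  -> node is new/modified, needs storing
;;   (.-_address node) is addr -> node is already persisted at addr
(defn store-if-needed [storage node]
  (if (some? (.-_address node))
    (.-_address node)                  ;; already persisted, nothing to do
    (let [addr (store storage node)]   ;; storage picks a fresh address
      (set! (.-_address node) addr)    ;; remember where it now lives
      addr)))
```

Under this convention a separate `_dirty` flag is redundant: mutating a node resets its `_address` to nil, which is exactly the "needs storing" signal.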

src-clojure/me/tonsky/persistent_sorted_set.cljs (comment outdated, resolved)
@whilo (Contributor) commented Jan 3, 2024

@logseq-cldwalker What are your main insights into asynchronous support? @pkpkpk and I have also started working on an async version of the persistent-sorted-set. I think both async and sync execution models have benefits and drawbacks.

@tiensonqin commented:
@tonsky Thanks for looking into this PR and all the beautiful work for the Clojure community!

@tiensonqin commented:
@whilo Hey!
I think the main reason for Logseq is that it would take tons of effort to migrate the existing code base to be asynchronous. We still face the challenge that OPFS with SQLite can only be used in a web worker, which is great for not blocking the main UI thread, but it means both queries and transactions will be async. We'll soon experiment with keeping an in-memory DataScript db in the main UI thread for caching, synced with the full db in the worker.

It'll be nice to have async support so that people can choose to store the data in IndexedDB.

@logseq-cldwalker logseq-cldwalker marked this pull request as ready for review January 4, 2024 12:56
@tonsky (Owner) commented Jan 4, 2024

@tiensonqin thanks to you for your consistent support and for such a significant contribution in such a tricky part of the system!

@whilo (Contributor) commented Jan 5, 2024

@tiensonqin Thank you for laying that out; that makes sense. A solution I originally developed for replikativ with the hitchhiker-tree, and that we are currently revisiting (replikativ/datahike#429), is to stream tree fragment deltas incrementally and then update the db root after everything is in storage. That way you can realize your synchronous DataScript scenario: you transact first into a storage system somewhere and then react to its confirmations/updates. I think this would be a simple and nice model for synchronizing Logseq, but I don't know whether you want to treat the markdown files or DataScript as the primary source of truth.

@logseq-cldwalker (Author) commented:

@tonsky Are there things you're waiting for from us? I think Tienson addressed most of the feedback

@tonsky (Owner) commented Jan 18, 2024

Oops, sorry, no. I’ll take a look

@tonsky (Owner) commented Feb 1, 2024

@tiensonqin

Okay, I think I finally understand what you are doing. Sorry it took so long to catch up.

I like the idea! It can’t be the default mode though, default should be normal persistent data structure where you can keep references to old copies.

But as an option, I’d like to have it too. I can imagine an app that can only keep one reference to the latest DB at all times. If that eliminates GC, I see how it is beneficial.

So let’s say we want to get rid of GC entirely. Right now you have two behaviours: some addresses get erased on the next store (the ones being reused, marked with _dirty), and some are erased immediately as the database is changed (delete storage [unused-addresses]).

I propose we move it all to the next store.

  1. Add some sort of queue of freed up addresses to the top level of the tree (through dynamic var?)

  2. When a node gets changed/split/merged, its address is added to that queue. The node itself does not remember its last address; it just gets set to nil, same as in the clj implementation. This lets us get rid of the _dirty flag (has .-address == stored).

  3. When the time comes to store new version of the set, we first do (delete storage unused-addresses) and then let the storage allocate new addresses. Upside: it can happen in a batch call.

So it’s not exactly address reuse, more like freeing addresses as we go and allocating new ones.

If a storage doesn’t want to clean up freed addresses immediately, it can make the implementation noop.

Would that work for you?
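The three-step proposal above can be sketched roughly as follows. All names here (`free-address!`, `walk-and-store`, `*freed-addresses*`) are hypothetical, and `delete` is assumed to be the storage's batch-delete operation; this is an illustration of the flow, not the PR's implementation.

```clojure
;; Step 1: a queue of freed addresses at the top level of the tree,
;; here modeled as a dynamic var bound per store cycle.
(def ^:dynamic *freed-addresses* nil)

;; Step 2: a changed/split/merged node pushes its old address onto
;; the queue and forgets it; nil address == "needs storing", so no
;; separate _dirty flag is required.
(defn free-address! [node]
  (when-some [addr (.-_address node)]
    (vswap! *freed-addresses* conj addr)
    (set! (.-_address node) nil)))

;; Step 3: at store time, first batch-delete everything freed since
;; the last store, then let the storage allocate fresh addresses for
;; all nodes whose address is nil.
(defn store-set! [storage set freed-addresses]
  (delete storage freed-addresses)       ;; one batch call
  (walk-and-store storage set))          ;; returns the new root address
```

As the comment above says, this is not address reuse so much as freeing addresses as you go and allocating new ones, and a storage that does not want eager cleanup can simply implement `delete` as a no-op.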

@logseq-cldwalker (Author) commented:

Tienson is out right now for Chinese New Year; hopefully he can respond soon after he gets back. In the meantime, we've been able to use this PR successfully on databases of up to 9.3M datoms, which translates to ~1.4 GB on disk.

@tonsky (Owner) commented Feb 13, 2024

Awesome!
