Proposal for go-threads improvements #547

mcrakhman · 2021-08-06T05:24:45Z

Hello guys!

This is a draft proposal for changing the current mechanics of go-threads. Before going into the details, we want to summarise the current state of go-threads sync and the motivation behind our proposed changes.

Current state of go-threads sync

Go-threads has multiple logs for each thread. Each log is a single-writer log, so only one peer can write to it, and we keep a counter for each log telling us how many records it has. If a log has a head, we maintain the invariant that we have every record prior to that head.

Many checks rely on this logic today: mainly the GetRecords check, where we know that if the log has a head with a given counter then it has all the records before it, and the putRecords check, which also uses the counter to decide whether we need to add the record and any records before it.
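To make the counter check concrete, here is a minimal sketch of how an incoming record could be classified against the local head counter. The names `PutAction` and `classifyPut` are hypothetical and are not taken from the go-threads code; the sketch also assumes counters increase by one per record.

```go
package main

import "fmt"

// PutAction is a hypothetical result of checking an incoming record
// against the local head counter.
type PutAction string

const (
	Skip    PutAction = "skip"     // record is at or below our head, so we already have it
	Append  PutAction = "append"   // record directly follows our head
	FillGap PutAction = "fill-gap" // records between our head and this one must be fetched first
)

// classifyPut relies on the invariant that every record before the head
// is present locally, so the counters alone decide what to do.
func classifyPut(headCounter, recordCounter int64) PutAction {
	switch {
	case recordCounter <= headCounter:
		return Skip
	case recordCounter == headCounter+1:
		return Append
	default:
		return FillGap
	}
}

func main() {
	fmt.Println(classifyPut(150, 120), classifyPut(150, 151), classifyPut(150, 400))
}
```

Once the invariant is dropped, a check like this can no longer infer "we have it" from the counter alone, which is why the proposal further down needs some other bookkeeping.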

Threads are synced either by pullThread/pushLog, where we explicitly request records from all known peers, or when exchangeEdges runs and we exchange hashes of heads with all our peers to see whether they have more data and we need to call getRecords.

How we use go-threads in Anytype

It will be good to start from the way we use go-threads in our app.

Each thread represents some document and each document consists of changes. Each change corresponds to a record, with the only difference being that it can link to other changes in different logs. Sometimes we capture the current state of the document (e.g. all changes in the document) in one record, which is called a snapshot. A snapshot-based approach may be useful for other apps as well; e.g. ThreadDB could also benefit from it in the case of a huge DB.

In this snapshot, among other things, we store references to the heads of the logs which the snapshot had “seen” at the time of its creation.

To build the document we start from the current heads of the logs and walk back until we reach the common snapshot. The gist is that to build the document we don't need all the records: we only need the records after the common snapshot, and nothing before it.
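A rough sketch of that idea, assuming the common snapshot has already been found and that it stores one head counter per log. `Snapshot`, `Span` and `recordsToLoad` are illustrative names only, not Anytype or go-threads types.

```go
package main

import "fmt"

// Snapshot is a hypothetical snapshot payload: for each log it remembers
// the counter of the head it had "seen" when it was created.
type Snapshot struct {
	SeenHeads map[string]int64
}

// Span is an inclusive range of record counters within one log.
type Span struct {
	From, To int64
}

// recordsToLoad returns, per log, the records needed to rebuild the
// document on top of the snapshot: everything after the head the snapshot
// saw, up to the current head. Logs with nothing new are omitted.
func recordsToLoad(snap Snapshot, heads map[string]int64) map[string]Span {
	out := make(map[string]Span)
	for logID, head := range heads {
		seen := snap.SeenHeads[logID] // zero for logs the snapshot never saw
		if head > seen {
			out[logID] = Span{From: seen + 1, To: head}
		}
	}
	return out
}

func main() {
	snap := Snapshot{SeenHeads: map[string]int64{"log-a": 390, "log-b": 120}}
	heads := map[string]int64{"log-a": 1000, "log-b": 120, "log-c": 7}
	need := recordsToLoad(snap, heads)
	fmt.Println(need["log-a"], need["log-c"], len(need))
}
```

In this example nothing at or below counter 390 of `log-a` is needed, which is exactly the data we would like to be able to drop.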

We also listen for any records added to the thread so that we can rebuild the document over time.

What problems do we have with current implementation

a. Our databases only grow in size and are too large

This becomes a problem, especially on mobile devices, once threads are shared by many users and hold more data.

Because the logs only grow and we maintain the invariant that all records before the head are always present, we can't get rid of records even when we no longer need them (see the snapshot explanation above).

b. The synchronisation speed can be improved

Depending on the size of the thread, we fetch records through bitswap (see the putRecords implementation), and we also fetch some unnecessary records (see a. above).

c. Inconsistent subscribing

We can miss records, because go-threads starts processing records as soon as we create the app.Net object, and our app may not be ready at that point.

d. Pulling records and threads from cold start takes too long

Mostly because, again, we get a lot of unnecessary records and can't control which records we download. Everything is decided by go-threads under the hood.

e. No garbage collecting

There is no way for us to get rid of unneeded records or mark them as such.

f. No way to prioritise what go-threads is downloading at the moment

Again everything is decided by go-threads under the hood and there is no way to control it.

The changes we propose

In general, we want synchronisation to be configurable by the client via some strategy (either a config or a component that determines the strategy). That will make go-threads more "dumb" and give the client full control over it.

Of course we want the changes to be backwards compatible, so by default the strategy will work the same way as before.
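One possible shape for such a strategy, purely as a sketch: `SyncStrategy`, `FetchRange`, `FullSync` and `LastN` are hypothetical names and do not exist in go-threads. The default keeps the current "fetch everything up to the remote head" behaviour, and a client can swap in something narrower.

```go
package main

import "fmt"

// FetchRange is an inclusive range of record counters to download.
type FetchRange struct {
	From, To int64
}

// SyncStrategy is a hypothetical client-supplied hook that decides which
// records to download once a remote head is known.
type SyncStrategy interface {
	Plan(localHead, remoteHead int64) (FetchRange, bool)
}

// FullSync mimics the behaviour described above as the current one:
// fetch every record between the local and the remote head.
type FullSync struct{}

func (FullSync) Plan(local, remote int64) (FetchRange, bool) {
	if remote <= local {
		return FetchRange{}, false
	}
	return FetchRange{From: local + 1, To: remote}, true
}

// LastN fetches at most the N newest records and leaves a gap below them.
type LastN struct {
	N int64
}

func (s LastN) Plan(local, remote int64) (FetchRange, bool) {
	r, ok := FullSync{}.Plan(local, remote)
	if !ok || s.N <= 0 {
		return FetchRange{}, false
	}
	if lowest := remote - s.N + 1; lowest > r.From {
		r.From = lowest
	}
	return r, true
}

func main() {
	var strategy SyncStrategy = FullSync{} // backwards-compatible default
	fmt.Println(strategy.Plan(150, 1000))
	strategy = LastN{N: 100}
	fmt.Println(strategy.Plan(150, 1000))
}
```

Keeping the decision behind a small interface means the library itself stays "dumb": it only executes the range it is handed.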

a. Remove the invariant that we have all the records before head

We will still have heads and their counters synchronised across devices, but we will not guarantee that we have everything before them. That will enable us to “garbage collect” all the records that we don’t need for building our documents.

It is an open question whether, for our convenience, we will maintain a list of ranges of downloaded records, looking something like:
{(hash A, counter 0), (hash B, counter 150)}, {(hash C, counter 390), (hash D, counter 1000)}...

This will let us check whether we have a record with a given counter just by searching this list.
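A minimal sketch of that lookup, assuming the ranges are kept sorted by counter and never overlap. The `Range` type and `hasCounter` function are illustrative and not part of go-threads.

```go
package main

import (
	"fmt"
	"sort"
)

// Range is a hypothetical entry in the list of downloaded ranges: the hash
// and counter of the first and last record of one contiguous run.
type Range struct {
	FromHash    string
	FromCounter int64
	ToHash      string
	ToCounter   int64
}

// hasCounter reports whether a record with the given counter is covered by
// one of the ranges. Ranges must be sorted by counter and must not overlap.
func hasCounter(ranges []Range, counter int64) bool {
	// Binary search for the first range that ends at or after the counter.
	i := sort.Search(len(ranges), func(i int) bool {
		return ranges[i].ToCounter >= counter
	})
	return i < len(ranges) && ranges[i].FromCounter <= counter
}

func main() {
	ranges := []Range{
		{"A", 0, "B", 150},
		{"C", 390, "D", 1000},
	}
	fmt.Println(hasCounter(ranges, 100), hasCounter(ranges, 200), hasCounter(ranges, 1000))
}
```

With the two ranges from the example above, counter 100 and counter 1000 are found, while counter 200 falls into the gap between the ranges.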

b. Introduce on-demand thread following

Drawing an analogy with tail -f, the user can say that they want to follow a certain thread, and only then will go-threads try to synchronise all the records that come after the current head, but not those before it.

c. Introduce pagination

A lot of the time we just need to get N records below a specific hash/counter. This can be the head or any other record. At the same time, we don't want go-threads to download the full log (because we don't need it).

Go-threads currently lacks such an API. For example, in GetRecords you only provide the offset (end point), but loading always starts from the head of the server's log; you cannot provide another starting point.

So essentially we want to control how many records go-threads downloads and from which offset. Right now we cannot do that, because records are thrown away if we don't fill the gap between our current head and the oldest received record. This topic is closely related to removing the invariant that we must have all the records before the head.
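To illustrate the kind of request we have in mind, here is a sketch of "N records below a counter" over an in-memory slice. `PageRequest` and `page` are made-up names; a real API would of course operate on the log store and the network, not on a slice.

```go
package main

import (
	"fmt"
	"sort"
)

// Record is a simplified record: only the counter and hash matter here.
type Record struct {
	Counter int64
	Hash    string
}

// PageRequest is a hypothetical pagination request: return up to Limit
// records whose counter is strictly below Before.
type PageRequest struct {
	Before int64
	Limit  int
}

// page returns the requested records, newest first. The input must be
// sorted by ascending counter; it may contain gaps.
func page(records []Record, req PageRequest) []Record {
	if req.Limit <= 0 {
		return nil
	}
	// Index of the first record at or above the starting point.
	end := sort.Search(len(records), func(i int) bool {
		return records[i].Counter >= req.Before
	})
	start := end - req.Limit
	if start < 0 {
		start = 0
	}
	out := make([]Record, 0, end-start)
	for i := end - 1; i >= start; i-- {
		out = append(out, records[i])
	}
	return out
}

func main() {
	var log []Record
	for c := int64(1); c <= 10; c++ {
		log = append(log, Record{Counter: c, Hash: fmt.Sprintf("h%d", c)})
	}
	for _, r := range page(log, PageRequest{Before: 8, Limit: 3}) {
		fmt.Println(r.Counter, r.Hash)
	}
}
```

The caller picks both the starting point and the page size, so nothing above or below the requested window has to be downloaded.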

d. Change `exchangeEdges` so that it will only sync heads

It will not try to fetch all the records unless we are in follow mode.
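A sketch of how head exchange and follow mode could interact, under the assumption that a head learned from a peer is always recorded but only triggers a download for followed logs. `logState`, `follow` and `onRemoteHead` are hypothetical names, and the sketch optimistically assumes every planned fetch succeeds.

```go
package main

import "fmt"

// logState is a hypothetical per-log bookkeeping entry.
type logState struct {
	head      int64 // newest counter known for this log, always kept in sync
	following bool  // whether the client asked to follow this log
	fetched   int64 // newest counter already requested while following
}

// follow turns on follow mode: only records after the current head are wanted.
func (s *logState) follow() {
	s.following = true
	s.fetched = s.head
}

// onRemoteHead handles a head counter learned from a peer. The head is
// always recorded; a range to fetch, (from, to], is returned only when the
// log is being followed.
func (s *logState) onRemoteHead(counter int64) (from, to int64, fetch bool) {
	if counter <= s.head {
		return 0, 0, false // nothing new
	}
	s.head = counter
	if !s.following {
		return 0, 0, false // heads synced, records left alone
	}
	from, to = s.fetched, counter
	s.fetched = counter
	return from, to, true
}

func main() {
	s := &logState{head: 100}
	fmt.Println(s.onRemoteHead(150)) // not following: head updated, no fetch
	s.follow()
	fmt.Println(s.onRemoteHead(200)) // following: fetch records after 150
}
```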

e. Subscribe from particular record/counter

We want to be able to get all the records starting from some given record or counter, so that no matter when we start subscribing we still get all the needed records.
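As an illustration only, a subscription that first replays stored records from a given counter and then delivers new ones could look like this. `Log`, `Subscribe` and `Append` here are a toy in-memory model, not the go-threads API, and the sketch is not safe for concurrent use.

```go
package main

import "fmt"

// Record is a simplified record with a counter and a payload.
type Record struct {
	Counter int64
	Body    string
}

// Log is a toy in-memory log that supports subscribing from a counter.
type Log struct {
	records []Record       // stored records, ascending by counter
	subs    []func(Record) // callbacks for live records
}

// Subscribe replays every stored record with Counter >= since and then
// registers fn for records appended later, so a late subscriber misses nothing.
func (l *Log) Subscribe(since int64, fn func(Record)) {
	for _, r := range l.records {
		if r.Counter >= since {
			fn(r)
		}
	}
	l.subs = append(l.subs, fn)
}

// Append stores a new record and notifies all subscribers.
func (l *Log) Append(body string) Record {
	r := Record{Counter: int64(len(l.records)) + 1, Body: body}
	l.records = append(l.records, r)
	for _, fn := range l.subs {
		fn(r)
	}
	return r
}

func main() {
	l := &Log{}
	l.Append("a")
	l.Append("b")
	l.Append("c")
	// The app becomes ready late, but still asks for everything from counter 2.
	l.Subscribe(2, func(r Record) { fmt.Println(r.Counter, r.Body) })
	l.Append("d")
}
```

This would also address problem c. above, since the moment at which the app starts listening no longer determines which records it sees.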


sanderpick · 2021-08-12T17:00:58Z

How we use go-threads in Anytype

It will be good to start from the way we use go-threads in our app.

This is very useful, thanks for sharing the details.

Each thread represents some document and each document consists of changes. Each change corresponds to a record, with only difference that it can connect to other changes in different logs. Sometimes we capture the current state of the document (e.g. all changes in the document) in one record which is called the snapshot. Snapshot-based approach may be useful for other apps as well. E.g. ThreadsDB can also benefit from it in case of huge DB.

Yep, sounds useful. On a side note, we have been toying with the idea of moving ThreadDB out of the repo and creating a better interface for "plugins". How is your app layer tied into the core thread layer?

What problems do we have with current implementation

a. Our databases only grow in size and are too large

This becomes a problem especially for mobile devices as soon as the threads are shared by many users and have more data stored there.

And because the logs are only growing and we maintain an invariant that all records before the head are always present it means that we can’t get rid of records even when we don’t need them (see snapshots explanation above).

Makes sense. IIRC, we landed on the invariant so that any peer can fully validate a log.

Does a snapshot "reset" the log in a sense? I.e., in terms of validation, can we simply say, stop traversing at snapshots?
When creating a thread, does it make sense from your use case to provide some auto-snapshot config or max record height? E.g., logs could be rolled up (and GC'd) at some counter size. Or is giving control to the user better (providing a snapshot API)?

b. The synchronisation speed can be improved

Depending on the size of the thread we get stuff through bitswap (see putRecords implementation), also we get some unnecessary records (see a. above).

👍

c. Inconsistent subscribing

We can miss records, because go-threads starts processing records as long as we create the app.Net object. And our app may not be ready for that.

👍 Something to consider when thinking about a common interface to the net layer.

d. Pulling records and threads from cold start takes too long

Mostly because again we get a lot of unnecessary records and we can't control what records do we download. Everything is decided by go-threads under the hood.

e. No garbage collecting

There is no way for us to get rid of unneeded records or mark them as such.

👍 These all sound related to snapshotting

f. No way to prioritise what go-threads is downloading at the moment

Again everything is decided by go-threads under the hood and there is no way to control it.

Makes sense!

The changes we propose

In general we want to make synchronisation to be configurable by client via some strategy (this can be either a config or a component which will determine the strategy). That will make go-threads more "dumb" and the client will have full control over it.

Of course we want to make changes backwards compatible, so by default the strategy will work in the same way as was before.

a. Remove the invariant that we have all the records before head

We will still have heads and their counters synchronised across devices, but we will not guarantee that we have everything before that. That will enable us to “garbage collect” all the records that we don’t need for building our documents.

💯

It is a question if we for our convenience will maintain a list of ranges of downloaded records, looking something like:
{(hash A, counter 0), (hash B, counter 150)}, {(hash C, counter 390, hash D, counter 1000)}...

This will enable us to know if we have some record with counter just by doing a search through this list.

👍 Related to the snapshot questions above, if the user controls snapshotting, sounds like each peer could get into a state where their snapshots are different / overlapping. Maybe that's fine, but it does add complexity when considering pagination. Snapshots at predictable intervals (based on the new counters), might make things simpler.

b. Introduce on-demand thread following

Drawing an analogy from tail -f the user can say that he wants to follow a certain thread and only then go-threads will try to synchronise all the records which come after the current head, but not before it.

👍

c. Introduce pagination

A lot of the time we need just to get N records below a specific hash/counter. This can be head or any other record. But at the same time we don’t want go-threads to fully download the log (because we don’t need it).

Go-threads now lacks such an API, for example in GetRecords you only provide the offset (end point), but loading always starts from head of the server’s log, you cannot provide another starting point.

So essentially we want to be in control of how many records go-threads download and from which offset. Right now we cannot do that, because the records will be thrown away if we don’t fill the gap between our current head and the oldest received record. This topic is closely related to us killing the invariant that we must have all the records before head.

💯

d. Change exchangeEdges so that it will only sync heads

But it will not try to get all the records unless we are in follow mode

👍

e. Subscribe from particular record/counter

We want to be able to get all the records starting from some other record or counter. So no matter when we start subscribing we will still get all the needed records.

So this is like replaying the records? Could this be combined with follow mode with a "since" param? Continuing with the analogy: tail --since=1m -f

This all sounds really good! Full support from our side.
