diffsync should be stateless #2

janmonschke · 2015-05-26T23:03:24Z

At the moment, diffsync stores all user documents that are necessary for the sync-cycle in memory. This has two implications:

Applications with a lot of peers consume a lot of memory and will eventually run out of it.
Scaling diffsync to more than one node is pretty much impossible unless there is application logic in place that routes peers working on the same document onto the same node. Depending on the application's architecture, this is quite hard to achieve. Also, holding this state in memory violates the best practices of many hosting platforms, such as Heroku.
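
To make the routing requirement concrete, here is a minimal sketch of the kind of application logic this would need: hash the document id to pick a node, so that every peer editing that document ends up on the same process. The function name, the node names and the hash are all assumptions for illustration; nothing like this ships with diffsync.

```javascript
// Hypothetical helper: map a document id to one of N nodes so that all
// peers editing the same document land on the same diffsync process.
function nodeForDocument(docId, nodeCount) {
  let hash = 0;
  for (let i = 0; i < docId.length; i++) {
    // Simple 32-bit string hash; `| 0` keeps the value in int32 range.
    hash = (hash * 31 + docId.charCodeAt(i)) | 0;
  }
  return Math.abs(hash) % nodeCount;
}

// Hypothetical node names.
const nodes = ["node-a", "node-b", "node-c"];
const target = nodes[nodeForDocument("shopping-list", nodes.length)];
```

Note that with a plain modulo, adding or removing a node remaps most documents, which is part of why this is hard to get right in practice.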

In my opinion, the users' documents should be kept outside of the diffsync node, e.g. in a Redis instance.

Luckily, diffsync internally already reads data via an asynchronous interface, so the code changes should actually be pretty minimal. Testing should not be too hard either.

The main implication for the user would be a longer sync time, which depends on the type of data store and on how the application's parts (node, intermediate data and permanent data) are distributed. I guess it is okay to have this overhead in favour of getting rid of this extra state.

What do you think?

seidtgeist · 2015-05-27T04:39:31Z

Is this about something other than https://github.com/janmonschke/diffsync#dataadapter?

My thoughts:

diffsync should always ship with an in-memory implementation by default so people can play with it
Could multiple diffsync servers let users work on the same document if there's a shared representation between servers? What then is the difference between clients and servers?

janmonschke · 2015-05-27T07:40:08Z

My thoughts about your thoughts: 💭 🌀 💭

Yes, it's about more than that. It is about how the internal sync documents are handled. But yeah, the API is basically the same with the addition of a removeData method that is used for scenarios when a client disconnects and the shadow documents are not needed anymore.
Yes, the in-memory implementation would still be in there for exactly the reason that you mentioned, much like Express handles sessions.
Indeed, multiple diffsync servers could handle the clients for the same document, and it is one step further towards allowing clients to behave exactly like servers.
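
As a rough sketch of what such an adapter could look like: the class below keeps the callback style of the DataAdapter from the README linked above (`getData`, `storeData`) and adds the `removeData` method mentioned in this comment. The class name, the key format and the exact `removeData` signature are assumptions, not the library's actual code.

```javascript
// In-memory adapter sketch. getData/storeData follow the callback style
// of the README's DataAdapter; removeData is the addition discussed in
// this thread, and its signature here is an assumption.
class InMemoryAdapterSketch {
  constructor() {
    this.cache = new Map();
  }

  getData(id, callback) {
    // Create an empty document on first access so callers always get an object.
    if (!this.cache.has(id)) this.cache.set(id, {});
    callback(null, this.cache.get(id));
  }

  storeData(id, data, callback) {
    this.cache.set(id, data);
    callback(null);
  }

  removeData(id, callback) {
    // Drop data that is no longer needed, e.g. a shadow after a disconnect.
    this.cache.delete(id);
    callback(null);
  }
}

// Usage: store a shadow document, then drop it when the client disconnects.
const adapter = new InMemoryAdapterSketch();
adapter.storeData("shadow:client-1", { text: "hello" }, () => {
  adapter.removeData("shadow:client-1", () => {
    // The shadow document for client-1 is gone at this point.
  });
});
```

The callbacks fire synchronously here only because everything lives in a `Map`; an adapter backed by an external store would invoke them once the network round trip completes.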

janmonschke · 2015-05-27T19:18:01Z

Ouff, just saw that I did not implement the fetching of client shadow documents asynchronously 😄
But I should have done it like that in the first place ;)

janmonschke · 2015-05-27T20:08:44Z

Oh no, running into a bigger problem here. Let's say that clients can reside on arbitrary nodes and those nodes take care of fetching the correct master document and the correct shadow documents for each client.

This leads to the problem that for each sync request, each node has to make up to four DB requests:

1. Get the master document
2. Get the shadow documents
3. Write the shadow documents
4. Write the new master document

These could be reduced to two requests if the shadow documents were embedded inside their master documents (which would be easy with schema-free databases).
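
The difference between the two layouts can be sketched with a fake store that simply counts round trips. `CountingStore`, the key names and the `applyEdits` callback (standing in for the actual diff/patch step) are all hypothetical.

```javascript
// Hypothetical store that counts round trips; not a real database client.
class CountingStore {
  constructor() {
    this.docs = new Map();
    this.requests = 0;
  }
  read(key) {
    this.requests++;
    return this.docs.get(key);
  }
  write(key, value) {
    this.requests++;
    this.docs.set(key, value);
  }
}

// Layout A: master and shadow live under separate keys (four requests).
function syncSeparate(store, docId, clientId, applyEdits) {
  const master = store.read(`master:${docId}`);
  const shadow = store.read(`shadow:${docId}:${clientId}`);
  const next = applyEdits(master, shadow);
  store.write(`shadow:${docId}:${clientId}`, next.shadow);
  store.write(`master:${docId}`, next.master);
}

// Layout B: shadows are embedded in the master record (two requests).
function syncEmbedded(store, docId, clientId, applyEdits) {
  const record = store.read(`doc:${docId}`);
  const next = applyEdits(record.master, record.shadows[clientId]);
  store.write(`doc:${docId}`, {
    master: next.master,
    shadows: { ...record.shadows, [clientId]: next.shadow },
  });
}
```

The embedded layout halves the round trips, but every sync now rewrites the whole record, including the shadows of all other clients.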

But the biggest problem is that the database would have to lock the document from step 1 to step 4 and could only release it afterwards. Without that lock, other nodes could write to the same document in the meantime, which would result in dirty reads and loss of data when writing.
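
A small sketch of the data loss described here, with a plain `Map` standing in for the shared database and two hypothetical nodes interleaving their read and write steps without any lock:

```javascript
// Two nodes run the read-modify-write cycle against the same master
// document without a lock. The store and the node names are hypothetical.
function demonstrateLostUpdate() {
  const store = new Map([["master", { items: [] }]]);

  // Step 1 on both nodes: each reads the same starting state.
  const seenByNodeA = store.get("master");
  const seenByNodeB = store.get("master");

  // Step 4 on node A, then on node B: each writes back its own result,
  // computed from the state it read earlier.
  store.set("master", { items: [...seenByNodeA.items, "edit from A"] });
  store.set("master", { items: [...seenByNodeB.items, "edit from B"] });

  // Node B's write replaced node A's, so A's edit is gone.
  return store.get("master").items;
}

const finalItems = demonstrateLostUpdate(); // ["edit from B"]
```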

Am I right with my assumptions? Or am I overlooking a very simple way to scale this to more than one node without having a load balancer in place that gathers clients working on the same document on the same node?

@episodeyang How did you handle this problem?

winton · 2015-08-30T23:48:50Z

It seems like Redis would be a good fit for storing the master and shadow, maybe paired with node-redlock.
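
To illustrate what a lock would buy here, this sketch serializes the read and write steps per document. It is an in-process stand-in only: `DocumentLocks` is a hypothetical class that works within a single Node.js process and does not use or imitate the node-redlock API. Across several nodes the lock itself would have to live in the shared store.

```javascript
// In-process stand-in for a distributed lock: tasks for the same
// document id run strictly one after another.
class DocumentLocks {
  constructor() {
    this.tails = new Map(); // docId -> promise for the last queued task
  }

  withLock(docId, task) {
    const previous = this.tails.get(docId) || Promise.resolve();
    // Start the task only after the previous one for this document settled.
    const run = previous.then(() => task());
    // Store a tail that never rejects so one failure cannot block the queue.
    this.tails.set(docId, run.catch(() => {}));
    return run;
  }
}

// Usage: two concurrent edits to the same master document.
const store = new Map([["master", { items: [] }]]);
const locks = new DocumentLocks();

async function appendEdit(edit) {
  return locks.withLock("master", async () => {
    const current = store.get("master"); // read the master document
    await new Promise((resolve) => setTimeout(resolve, 5)); // simulated DB latency
    store.set("master", { items: [...current.items, edit] }); // write it back
  });
}
```

Calling `appendEdit` twice without waiting in between keeps both edits, because the second read only happens after the first write. The price is that all syncs on one document are queued, and this sketch never removes entries from `tails`.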

Another alternative is to have the servers share the objects directly with each other. It seemed that Neil Fraser was more intrigued by this idea in his talk.

@janmonschke I'm interested in working on this problem so please let me know your thoughts on those two options.

janmonschke · 2015-08-31T08:25:44Z

@winton thanks for your input :)

I definitely prefer option #2, which would still not make it stateless, but I also think this is the way Neil Fraser was advocating. Happy to follow your work on that!
