diffsync should be stateless #2

Open
janmonschke opened this issue May 26, 2015 · 6 comments

Comments

@janmonschke
Owner

At the moment, diffsync stores all user documents that are necessary for the sync-cycle in memory. This has two implications:

  1. Applications with a lot of peers consume a lot of memory and will eventually run out of it.
  2. Scaling diffsync to more than one node is pretty much impossible unless there is application logic in place that routes peers working on the same document onto the same node. Depending on the application's architecture, this is quite hard to achieve. Also, holding this state in memory violates best practices of many hosting platforms, such as Heroku.

In my opinion, the users' documents should be kept outside of the diffsync node, e.g. in a Redis instance.

Luckily, diffsync internally already reads data via an asynchronous interface, so the code changes should actually be pretty minimal. Testing should not be too hard either.

The main implication for the user would be an elevated sync time which depends on the type of data store and the distribution of the application's parts (node, intermediate data and permanent data). I guess it is okay to have this overhead in favour of getting rid of this extra state.

What do you think?

@seidtgeist

Is this about something other than https://github.com/janmonschke/diffsync#dataadapter?

My thoughts:

  1. diffsync should always ship with an in-memory implementation by default so people can play with it
  2. Could multiple diffsync servers let users work on the same document if there's a shared representation between servers? What then is the difference between clients and servers?

@janmonschke
Owner Author

My thoughts about your thoughts: 💭 🌀 💭

  1. Yes, it's about more than that. It is about how the internal sync documents are handled. But yeah, the API is basically the same with the addition of a removeData method that is used for scenarios when a client disconnects and the shadow documents are not needed anymore.
  2. Yes, the in-memory implementation would still be in there for exactly the reason that you mentioned. Much like how Express handles sessions.
  3. Indeed, multiple diffsync servers could handle the clients for the same document, and it is one step further towards allowing clients to behave exactly like servers.
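
To make the adapter shape discussed above concrete, here is a minimal sketch of an in-memory store with callback-style `getData` and `storeData` plus the proposed `removeData`. The class name, the key scheme, and the exact callback signatures are assumptions for illustration, not diffsync's actual implementation.

```javascript
// Hypothetical adapter sketch: the class name and callback signatures are
// assumptions, not diffsync's real DataAdapter code.
class SketchMemoryAdapter {
  constructor() {
    this.cache = new Map();
  }

  // Look up a document; yields undefined when nothing is stored under id.
  getData(id, callback) {
    callback(null, this.cache.get(id));
  }

  // Store (or overwrite) the document under id.
  storeData(id, data, callback) {
    this.cache.set(id, data);
    callback(null);
  }

  // Proposed addition: drop data that is no longer needed,
  // e.g. a shadow document after its client disconnects.
  removeData(id, callback) {
    this.cache.delete(id);
    callback(null);
  }
}

// Usage: keep a client's shadow while it is connected, drop it on disconnect.
const adapter = new SketchMemoryAdapter();
const shadowId = "doc-1:client-42"; // hypothetical key scheme

let whileConnected;
let afterDisconnect = "not fetched yet";

adapter.storeData(shadowId, { text: "hello" }, () => {
  adapter.getData(shadowId, (err, data) => {
    whileConnected = data;
  });
});

adapter.removeData(shadowId, () => {
  adapter.getData(shadowId, (err, data) => {
    afterDisconnect = data;
  });
});
```

A Redis-backed variant would presumably expose the same three methods and only swap the `Map` for network calls, which is where the extra sync latency mentioned earlier would come from.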

@janmonschke
Owner Author

Ouff, just saw that I did not implement the fetching of client shadow documents asynchronously 😄
But I should have done it like that in the first place ;)

@janmonschke
Owner Author

Oh no, running into a bigger problem here. Let's say that clients can reside on arbitrary nodes and those nodes take care of fetching the correct master document and the correct shadow documents for each client.

This leads to the problem that for each sync request, each node has to make up to four DB requests:

  1. Get the master document
  2. Get the shadow documents
  3. Write the shadow documents
  4. Write the new master document

These requests can be reduced to two requests if shadow documents were embedded inside their master documents (which would be easy for the case of schema-free databases).
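
The request counts above can be checked with a small sketch. It assumes a hypothetical key-value store that counts every read and write, and it stands in for the real patch logic with a trivial `applyEdit` function; none of these names come from diffsync.

```javascript
// Hypothetical store that counts every round trip to the database.
class CountingStore {
  constructor() {
    this.records = new Map();
    this.requests = 0;
  }
  get(key) {
    this.requests += 1;
    return this.records.get(key);
  }
  set(key, value) {
    this.requests += 1;
    this.records.set(key, value);
  }
}

// Layout 1: master and shadows live under separate keys (four requests).
function syncWithSeparateShadows(store, docId, clientId, applyEdit) {
  const master = store.get(`master:${docId}`);            // 1. get master
  const shadows = store.get(`shadows:${docId}`);          // 2. get shadows
  const newShadows = { ...shadows, [clientId]: applyEdit(shadows[clientId]) };
  store.set(`shadows:${docId}`, newShadows);              // 3. write shadows
  store.set(`master:${docId}`, applyEdit(master));        // 4. write master
}

// Layout 2: shadows are embedded in the master record (two requests).
function syncWithEmbeddedShadows(store, docId, clientId, applyEdit) {
  const record = store.get(`doc:${docId}`);               // 1. get everything
  store.set(`doc:${docId}`, {                             // 2. write everything
    master: applyEdit(record.master),
    shadows: { ...record.shadows, [clientId]: applyEdit(record.shadows[clientId]) },
  });
}

// Stand-in for applying a client's patch; real differential sync is more involved.
const applyEdit = (text) => text + "!";

const separate = new CountingStore();
separate.records.set("master:doc-1", "hello");
separate.records.set("shadows:doc-1", { "client-1": "hello" });
syncWithSeparateShadows(separate, "doc-1", "client-1", applyEdit);

const embedded = new CountingStore();
embedded.records.set("doc:doc-1", { master: "hello", shadows: { "client-1": "hello" } });
syncWithEmbeddedShadows(embedded, "doc-1", "client-1", applyEdit);

const requestCounts = { separate: separate.requests, embedded: embedded.requests };
// requestCounts is { separate: 4, embedded: 2 }
```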

But the biggest problem would be that the database would have to lock the document from step 1 to step 4 and could only release it afterwards. Without such a lock, other nodes could write to the same document in the meantime, which would result in dirty reads and lost data when writing.
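
The data loss described here can be reproduced with a deterministic sketch. It assumes a plain `Map` as a stand-in for the shared database and two hypothetical nodes that each read, modify, and write back the whole document without any lock.

```javascript
// Hypothetical shared database; reads and writes copy the document,
// as a network round trip would.
const db = new Map([["doc-1", { todos: [] }]]);

function readDocument(id) {
  return JSON.parse(JSON.stringify(db.get(id)));
}

function writeDocument(id, doc) {
  db.set(id, JSON.parse(JSON.stringify(doc)));
}

// Both nodes read the same version before either one has written.
const seenByNodeA = readDocument("doc-1");
const seenByNodeB = readDocument("doc-1");

// Node A applies its client's edit and writes the document back.
seenByNodeA.todos.push("from client A");
writeDocument("doc-1", seenByNodeA);

// Node B does the same, but starting from its stale copy.
seenByNodeB.todos.push("from client B");
writeDocument("doc-1", seenByNodeB);

const finalTodos = db.get("doc-1").todos;
// finalTodos is ["from client B"]: the edit from client A has been overwritten.
```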

Am I right with my assumptions? Or am I overlooking a very simple way to scale this to more than one node without having a load balancer in place that gathers clients working on the same document on the same node?

@episodeyang How did you handle this problem?

@winton

winton commented Aug 30, 2015

It seems like Redis would be a good fit for storing the master and shadow, maybe paired with node-redlock.
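
As a rough illustration of what a per-document lock buys, here is a sketch that serializes the read-modify-write cycle. It is an in-process stand-in only: the `DocumentLocks` class is hypothetical and does not use Redis or the node-redlock API, and a real multi-node setup would need a distributed lock in its place.

```javascript
// Hypothetical in-process lock, one queue of waiting tasks per document id.
class DocumentLocks {
  constructor() {
    this.waiting = new Map();
  }

  // Run task(release) as soon as the lock for docId is free.
  acquire(docId, task) {
    const queue = this.waiting.get(docId);
    if (queue) {
      queue.push(task); // lock is held: wait in line
      return;
    }
    this.waiting.set(docId, []);
    task(() => this.release(docId));
  }

  // Hand the lock to the next waiting task, or free it.
  release(docId) {
    const next = this.waiting.get(docId).shift();
    if (next) {
      next(() => this.release(docId));
    } else {
      this.waiting.delete(docId);
    }
  }
}

const store = new Map([["doc-1", ["initial"]]]);
const locks = new DocumentLocks();
const order = [];

// Node A takes the lock and reads, but its write is delayed.
let finishNodeA;
locks.acquire("doc-1", (release) => {
  const copy = [...store.get("doc-1")];
  order.push("A read");
  finishNodeA = () => {
    store.set("doc-1", [...copy, "edit from A"]);
    order.push("A wrote");
    release();
  };
});

// Node B asks for the same document while A still holds the lock.
locks.acquire("doc-1", (release) => {
  const copy = [...store.get("doc-1")];
  order.push("B read");
  store.set("doc-1", [...copy, "edit from B"]);
  order.push("B wrote");
  release();
});

finishNodeA(); // A writes and releases; only now does B read.

const finalDocument = store.get("doc-1");
// finalDocument is ["initial", "edit from A", "edit from B"]
```

Both edits survive because node B only reads after node A has written, at the cost of making every sync on a document wait for the previous one.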

Another alternative is to have the servers share the objects directly with each other. It seemed that Neil Fraser was more intrigued by this idea in his talk.

@janmonschke I'm interested in working on this problem so please let me know your thoughts on those two options.

@janmonschke
Owner Author

@winton thanks for your input :)

I definitely prefer option #2, which would still not make it stateless, but I also think this is the way Neil Fraser was advocating. Happy to follow your work on that!
