Guide on cluster setup with the MnesiaCache #10
Sorry for the delay. I've been asked to help start a tree planting organization and I had to set this aside. I'm hesitant to just paste the current mnesia guide that I wrote into a new powauth site file, because I think we should first resolve the initialization logic to work with libcluster.

My understanding of the mnesia initialization logic: the first node needs to know that it is alone so that it can create (rather than replicate) its mnesia data table. If it is not alone, then it replicates the table schema rather than creating it. However, when using libcluster, I cannot think of a way to know for sure that the first-booting node is alone. The only way to know whether there are any connected nodes is after the current node is connected, i.e., after the :connect callback -- which will never happen if the node is alone. Another way to say it: there is no event that says, "you are alone." The only event we have is libcluster's :connect, which means you're not alone. But the first node needs to know that it is alone so that it can create its mnesia data table as a basis for the others to replicate.

The only way out of this that I can think of is simply to wait a bit (10 seconds) before calling Node.list(); if it returns [], then your pow code creates the table rather than replicating it. Am I misunderstanding mnesia copying?
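For concreteness, here is a rough sketch of the wait-and-check idea described above. This is not Pow's or libcluster's prescribed approach; the module name, the 10-second delay, and starting the cache outside a supervisor are all illustrative only, and it assumes `Pow.Store.Backend.MnesiaCache` can be started with an `:extra_db_nodes` config list:

```elixir
defmodule MyApp.DelayedMnesiaStarter do
  @moduledoc """
  Illustrative sketch only: wait a bit so libcluster has a chance to connect
  peers, then decide whether to create the Mnesia table (no peers found) or
  replicate it (peers found).
  """
  use GenServer

  @delay :timer.seconds(10)

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts, name: __MODULE__)

  @impl true
  def init(opts) do
    Process.send_after(self(), :start_cache, @delay)
    {:ok, opts}
  end

  @impl true
  def handle_info(:start_cache, opts) do
    # If Node.list/0 is still empty we assume we are the first node and create
    # the table; otherwise the connected nodes are passed along so the table
    # is replicated from them instead.
    {:ok, _pid} = Pow.Store.Backend.MnesiaCache.start_link(extra_db_nodes: Node.list())

    {:noreply, opts}
  end
end
```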
Yeah, that's how the MnesiaCache works; however, you can just provide a hardcoded list of nodes for it to connect to automatically. If none of the provided hosts are connected it'll start up by itself; otherwise it'll connect to the existing cluster. I'll look into libcluster and see how you can deal with this.
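For reference, a minimal sketch of what that hardcoded list might look like as a child spec, assuming the MnesiaCache's `:extra_db_nodes` option; the node names are placeholders:

```elixir
# Node names below are placeholders. Per the comment above: if none of these
# nodes are reachable, the cache starts up by itself; otherwise it joins the
# existing cluster and replicates the session data.
children = [
  {Pow.Store.Backend.MnesiaCache,
   extra_db_nodes: [:"myapp@10.0.0.1", :"myapp@10.0.0.2"]}
]
```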
Edit: This is what works:

```elixir
defmodule MyApp.MnesiaClusterSupervisor do
  use Supervisor

  def start_link(init_arg) do
    Supervisor.start_link(__MODULE__, init_arg, name: __MODULE__)
  end

  @impl true
  def init(_init_arg) do
    children = [
      {Pow.Store.Backend.MnesiaCache, extra_db_nodes: Node.list()},
      Pow.Store.Backend.MnesiaCache.Unsplit
    ]

    Supervisor.init(children, strategy: :one_for_one)
  end
end
```

And in the application module's `start/2` callback:

```elixir
def start(_type, _args) do
  topologies = [
    example: [
      strategy: Cluster.Strategy.Kubernetes,
      config: [
        # ...
      ]
    ]
  ]

  # List all child processes to be supervised
  children = [
    {Cluster.Supervisor, [topologies, [name: MyApp.ClusterSupervisor]]},
    MyApp.MnesiaClusterSupervisor,
    # Start the Ecto repository
    MyApp.Repo,
    # Start the endpoint when the application starts
    MyAppWeb.Endpoint
    # Starts a worker by calling: MyApp.Worker.start_link(arg)
    # {MyApp.Worker, arg},
  ]

  # See https://hexdocs.pm/elixir/Supervisor.html
  # for other strategies and supported options
  opts = [strategy: :one_for_one, name: MyApp.Supervisor]
  Supervisor.start_link(children, opts)
end
```
Ok, found a solution, updated above example 😄
UPDATE: This is working for a single server for me. I will try load balancing and add/drop in the next few days; when confirmed, I will update the guide and add it to the docs.
@danschultzer, since it's rough I thought I would just paste the whole thing here; please let me know if you don't want me doing this. Search for the phrase "please check the logic in this narrative":

## Mnesia cache store backend
The reason you might use Mnesia is that, in a clustered situation, you need to enable distributed checking of Pow User sessions, regardless of which backend server is chosen (e.g., by a load balancer) to service a given HTTP request. This stateless load balancing is possible because Mnesia, being a distributed database, can replicate the cache across all connected nodes. Hence, there is no need for stateful routing at the load balancer (a.k.a. no need for "sticky sessions"). However, there is a need to connect the nodes! Connecting Elixir nodes is straightforward, but the details can be complicated depending on the infrastructure strategy being used. For example, there is much more to learn if autoscaling or dynamic node discovery are required; however, that is outside the scope of the current guide. This guide will first describe installation and configuration of Pow to use the Mnesia cache store. Then, a few use cases are described, along with considerations relative to Pow User sessions:
### Configuring Pow to use Mnesia: Compile-time Configuration

Mnesia is part of OTP, so there are no additional dependencies to add to mix.exs. Depending on whether you are working in development or production mode, be sure that (in the appropriate config file) you have set the Mnesia directory:

```elixir
config :mnesia, dir: to_charlist(File.cwd!) ++ '/priv/mnesia'
```

and that you have configured Pow to use Mnesia:

```elixir
config :my_app, :pow,
  user: MyApp.Users.User,
  repo: MyApp.Repo,
  # ...
  cache_store_backend: Pow.Store.Backend.MnesiaCache
```

Also, you need to add :mnesia to :extra_applications in mix.exs to ensure that it's also included in the release:

```elixir
def application do
  [
    mod: {MyApp.Application, []},
    extra_applications: [:mnesia, ...],
  ]
end
```

### Installing libcluster

libcluster offers:
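To pull libcluster into the project, the mix.exs dependency would look roughly like this (the version requirement below is only an example; check hex.pm for the current release):

```elixir
# mix.exs
defp deps do
  [
    {:libcluster, "~> 3.3"}
  ]
end
```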
Dan: please check the logic in this narrative:

Mnesia is a distributed, masterless database; masterless, in that the intent is that all nodes will eventually be consistent, and the communication protocol treats all peers equally: any new data added to Mnesia on one node will get replicated to all other nodes. However, nodes can go down! And when they come back up, there is an inconsistency if data were added during the outage. In order to heal this inconsistency, we need to figure out which node should temporarily serve as a master from which the recovering nodes can replicate and return to a consistent state. In other words, we need to pick a node -- such as the node that booted least recently -- and copy the data from that node to all others. I.e., we need to "unsplit" the netsplit.

As it turns out, unsplitting is also the solution to the "boot problem" (above) of a newly booted node creating a new database because it "thinks" it's alone, only to find out seconds later that other nodes were up first and that it should have replicated rather than created the database. Simply put, the solution is that every node, when it boots, can create the database assuming that it's alone, and then let the unsplitting logic detect otherwise and, if so, treat it as a node that needs to be healed. Mnesia has notifications for splitting and healing events (e.g., the :inconsistent_database system event), which Pow's MnesiaCache.Unsplit module listens for in order to heal automatically. First, define a supervisor for the cache and the unsplit worker:

```elixir
defmodule MyApp.MnesiaClusterSupervisor do
  use Supervisor

  def start_link(init_arg) do
    Supervisor.start_link(__MODULE__, init_arg, name: __MODULE__)
  end

  @impl true
  def init(_init_arg) do
    children = [
      {Pow.Store.Backend.MnesiaCache, extra_db_nodes: Node.list()},
      Pow.Store.Backend.MnesiaCache.Unsplit
    ]

    Supervisor.init(children, strategy: :one_for_one)
  end
end
```

Then, add your new MyApp.MnesiaClusterSupervisor to your application's supervision tree, together with libcluster's Cluster.Supervisor, in the start/2 callback:

```elixir
def start(_type, _args) do
  topologies = [
    myapp: [
      strategy: Cluster.Strategy.Gossip
    ]
  ]

  # By default, the libcluster gossip occurs on port 45892,
  # using the multicast address 230.1.1.251 to find other nodes
  children = [
    {Cluster.Supervisor, [topologies, [name: MyApp.ClusterSupervisor]]},
    MyApp.MnesiaClusterSupervisor,
    MyApp.Repo,
    MyAppWeb.Endpoint
  ]

  opts = [strategy: :one_for_one, name: MyApp.Supervisor]
  Supervisor.start_link(children, opts)
end
```

Now, when your app boots, it will have a supervisor to keep your node cluster intact and a supervisor to keep Mnesia consistent, which relies on an unsplit GenServer to automatically do any consistency healing involved. Specifically, when your application starts it will initialize the MnesiaCache, passing along whatever nodes libcluster has already connected (via Node.list/0).

### Adding or removing nodes at run time

As long as at least one node in the cluster stays up, nodes can be added or removed at run time; a node that joins (or rejoins) will replicate the cache from the nodes already connected.

### Using other libraries that also use the Mnesia instance

It's strongly recommended to take into account any libraries that will be using Mnesia for storage before using the Pow.Store.Backend.MnesiaCache.Unsplit module. A common example would be a job queue, where a potential solution to prevent data loss is to simply keep the job queue table on only one server instead of replicating it among all nodes. If you do this, then when a network partition occurs, the job queue table can be excluded from the tables to be flushed (and restored) during healing by setting the Unsplit module's :flush_tables option.

Example: Using libcluster's
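Circling back to the "Using other libraries that also use the Mnesia instance" section above, here is a sketch of how the :flush_tables option might be set; the table names are hypothetical, and the option semantics are as I understand them from Pow's MnesiaCache.Unsplit documentation:

```elixir
children = [
  {Pow.Store.Backend.MnesiaCache, extra_db_nodes: Node.list()},
  # Only tables listed in :flush_tables (besides Pow's own cache table) are
  # flushed and restored from the surviving side when a netsplit heals.
  # The hypothetical single-node :my_job_queue table is deliberately NOT
  # listed here, so its data is left untouched during healing.
  {Pow.Store.Backend.MnesiaCache.Unsplit, flush_tables: [:some_replicated_table]}
]
```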
@danschultzer, actually the above works for 1 node but does not yet work when I add the second: when I open the Gossip port and it tries to connect, both servers crash. More info later.
@danschultzer @sensiblearts Any new info about whether or not this works?
@jwietelmann @danschultzer, I've been away from the code for a few months but am returning next week and will try this again. I'll be upgrading from Pow v1.0.13 to v1.0.19 and then I'll run some experiments with Digital Ocean VPSs.
@sensiblearts - is this working for you in prod yet? I'm using Gigalixir, which has rolling deploys, so distributing the Mnesia cache is the only persistence option I have.
@jacobwarren an alternative is Nebulex or Redis. Though Mnesia has worked fine for me in a production environment, I've heard that distribution issues have been pretty difficult to debug. It makes me reconsider my recommendation of Mnesia, even though it's part of the standard Erlang distribution. I feel it's very underutilized in the Elixir community, unfortunately, with limited documentation. Really wish Mnesia was used more 😞
@jacobwarren, @danschultzer, I intended to test that but it's been a strange year, to say the least. Mnesia is of course working for a single instance, but I have not had (and do not anticipate anytime soon) a need for more than one webserver behind my load balancer. But there is always hope :-) Would either of you like me to test it with 2 or 3 webservers? Or, @danschultzer, do you think it would be a waste of a day for me? (Also, my test would be crude, as I have no automated client testing in place. I would just use multiple browsers, my phone, etc., and see if there are any dropped or missed sessions.) I'm happy to help (I feel that I owe @danschultzer more, for Pow), but I don't want to waste time. If you have a series of steps or a protocol to test this, I welcome the suggestions. Cheers.
@danschultzer and @sensiblearts - I may be looking for the wrong solution. My issue is that I'm on Gigalixir, which has rolling deploys. Whenever I push an update it logs all users out. I'm currently utilizing the Mnesia cache, but because the cache is destroyed every time the container goes down for a new one to go up, it's wiped out. Is there a solution to preserve the cache aside from distributing the application out? P.S. I have a whole lot of guides to submit for using Pow with Absinthe! :)
👏 I found @sensiblearts' new docs and got libcluster working, but got stuck trying to get mnesia to connect correctly. I was starting to untangle it when I finally found this thread. @danschultzer's MyApp.MnesiaClusterSupervisor solution above fixed my mnesia connection problems perfectly (thanks @danschultzer!) and now I have rolling deploys without logging my users out - @jacobwarren I think it would fix your problem as well. Should it be added to the docs as well? This seems like a pretty common situation that others are going to run into.
How did you ensure that? When I was trying with the Gossip strategy (on localhost) it was not deterministic in any way 😄. Maybe I am not getting it right.
@sensiblearts has been writing a guide on cluster distribution strategies with Pow: pow-auth/pow#220
I believe it fits better here, as I like to keep third-party library references in the Pow docs to a minimum (I'm also contemplating moving the swoosh/bamboo integration guide to here).
The guide @sensiblearts has been working on is already extensive and it's a good read for understanding the different strategies: https://github.com/sensiblearts/pow/blob/master/guides/mnesia_cache_store_backend.md