Setting up Cluster with Multiple Nodes - Segmentation Fault #25
Comments
I think this is likely due to Assise not finding the proper interface. Can you change |
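To see which interface actually owns a given address before editing the RPC interface setting, a small lookup along these lines may help. This is only a sketch: `iface_for_ipv4` is a hypothetical helper written for illustration, not part of Assise, and it only handles IPv4.

```c
#include <arpa/inet.h>
#include <ifaddrs.h>
#include <netinet/in.h>
#include <stdio.h>
#include <sys/socket.h>

/* Hypothetical helper (not part of Assise): copies into `name` the interface
 * that owns the IPv4 address `ip`.
 * Returns 0 if found, 1 if no interface has that address,
 * -1 if `ip` is not a valid IPv4 string or the lookup itself fails. */
int iface_for_ipv4(const char *ip, char *name, size_t name_len)
{
    struct in_addr want;
    struct ifaddrs *list, *it;
    int rc = 1;

    if (name_len == 0 || inet_pton(AF_INET, ip, &want) != 1)
        return -1;
    if (getifaddrs(&list) != 0)
        return -1;

    /* Walk every configured address and compare against the one we want. */
    for (it = list; it != NULL; it = it->ifa_next) {
        if (it->ifa_addr == NULL || it->ifa_addr->sa_family != AF_INET)
            continue;
        struct sockaddr_in *sin = (struct sockaddr_in *)it->ifa_addr;
        if (sin->sin_addr.s_addr == want.s_addr) {
            snprintf(name, name_len, "%s", it->ifa_name);
            rc = 0;
            break;
        }
    }
    freeifaddrs(list);
    return rc;
}
```

On the node that holds 10.10.1.3, calling `iface_for_ipv4("10.10.1.3", buf, sizeof buf)` should fill `buf` with the interface name to use; a return value of 1 means no local interface has that address.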
Hi Waleed,
Result
Thank you very much for your help, I set
The error message is a little bit different, on the
On the 10.10.1.2 node it says
There is an additional
Debugging
Through GDB, it also looks like the
Do you happen to know what is the cause of this problem? Does it have something to do with connecting to port on the other node? I have allowed port 12345 on both nodes. Thank you very much for your help!
Thanks for the debugging effort! I suspect this is likely a firewall issue. To test connectivity, you can try running the RPC application in |
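As a rough first check of the firewall theory, a plain TCP probe of the port can be sketched as follows. This is an assumption-laden stand-in, not the RPC test mentioned above: `tcp_port_open` is a hypothetical helper, it only tests TCP reachability, and a successful probe does not prove that RDMA traffic gets through.

```c
#include <arpa/inet.h>
#include <errno.h>
#include <fcntl.h>
#include <netinet/in.h>
#include <poll.h>
#include <stdint.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: tries a non-blocking TCP connect to ip:port.
 * Returns 1 if the connection succeeds within timeout_ms,
 * 0 if it is refused or times out, -1 on bad input or a local error. */
int tcp_port_open(const char *ip, uint16_t port, int timeout_ms)
{
    struct sockaddr_in addr = {0};
    addr.sin_family = AF_INET;
    addr.sin_port = htons(port);
    if (inet_pton(AF_INET, ip, &addr.sin_addr) != 1)
        return -1;

    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags < 0 || fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0) {
        close(fd);
        return -1;
    }

    int result = 0;
    if (connect(fd, (struct sockaddr *)&addr, sizeof addr) == 0) {
        result = 1; /* connected immediately */
    } else if (errno == EINPROGRESS) {
        /* Wait until the socket becomes writable, then read the outcome. */
        struct pollfd pfd = { .fd = fd, .events = POLLOUT };
        if (poll(&pfd, 1, timeout_ms) == 1) {
            int err = 0;
            socklen_t len = sizeof err;
            if (getsockopt(fd, SOL_SOCKET, SO_ERROR, &err, &len) == 0 && err == 0)
                result = 1;
        }
    }
    close(fd);
    return result;
}
```

For the setup in this thread, one would call `tcp_port_open("10.10.1.3", 12345, 1000)` from the other node while something is listening on that port. A result of 0 is consistent with a firewall drop, but it is also what you get when nothing is listening.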
Hi Waleed, Thank you very much for the checks in
Debugging
Here are some of my debugging efforts
Changes
Regarding your previous suggestion on the firewall, it was a great suggestion, thank you! I realised the firewall was enabled on a different network interface. I've enabled incoming and outgoing traffic to and from port 12345 for both nodes on the network interface used by the RDMA
Further information
I am also using an NVM emulation instead of an actual NVM. Do you have any idea regarding the above error? Thank you very much for your help!
I assume you weren't able to run the RPC test. If so, then the error is not Assise-related. The LD_PRELOAD or use of emulated NVM shouldn't be a factor here. I haven't encountered this particular error myself but, if I had to guess, it could simply be a driver issue. It might make sense to first check whether the MLNX_OFED drivers are properly installed and that the required modules are loaded in your kernel (e.g. libmlx5, libmlx4). That could be the culprit. If that doesn't help, you can try posting this on the Mellanox community forums. |
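One quick way to see whether the kernel has registered any RDMA device at all is to list the sysfs class directory. The sketch below assumes a Linux host where RDMA devices appear under `/sys/class/infiniband`; `count_rdma_devices` is a hypothetical helper, and it does not verify that the device or its ports are actually usable.

```c
#include <dirent.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: prints and counts the entries of a sysfs class
 * directory. Returns the number of entries, or -1 if the directory cannot
 * be opened (for example because no RDMA driver is loaded). */
int count_rdma_devices(const char *class_dir)
{
    DIR *dir = opendir(class_dir);
    if (dir == NULL)
        return -1;

    int n = 0;
    struct dirent *e;
    while ((e = readdir(dir)) != NULL) {
        /* Skip the self and parent entries. */
        if (strcmp(e->d_name, ".") == 0 || strcmp(e->d_name, "..") == 0)
            continue;
        printf("RDMA device: %s\n", e->d_name);
        n++;
    }
    closedir(dir);
    return n;
}
```

Calling `count_rdma_devices("/sys/class/infiniband")` and getting -1 or 0 back suggests the drivers are not installed or not loaded, which matches the driver-issue guess above.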
Hi Waleed, Thank you very much, that was indeed the problem. I did not have RDMA set up yet; I was not aware of that requirement during the setup. Do you mind if I add a sentence or two to the README mentioning that a properly configured RDMA device and interfaces are a prerequisite?
Thanks for confirming.
Absolutely! The README can definitely benefit from this. Feel free to do a pull request and I'll merge. |
Thank you Waleed for that! Do you mind if I clarify some things with regard to Assise to help me write proper additional setup instructions?
Thank you very much Waleed for your kind help in clarifying this!
Sorry for the delayed reply! Last few weeks were hectic.
Yes, that's correct.
Our prototype currently doesn't come with an interface to the cluster manager (zookeeper). Only hot replicas, as you noted, are supported as of now.
Correct, all nodes defined in |
Thanks a lot Waleed for the clarification! |
@agnesnatasya |
Hi @caposerenity! Sure! I have a lab cluster with a Mellanox adapter and the InfiniBand drivers installed. I use that to establish the RDMA connection between the nodes.
Hi,
Setup
I am trying to set up a simple cluster with 2 nodes. These are the network interfaces of each node:
In each of these nodes, I set g_n_hot_rep to 2 and RPC interface to
I run KernFS starting from the node that has 10.10.1.3 as its interface.
Result
I received a segmentation fault
Debugging
After debugging, it looks like the segmentation fault comes from libfs/lib/rdma/agent.c, lines 96 and 130: the rdma_cm_id struct is NULL after rdma_create_id.
I also ran the filesystem as a local file system, where g_n_hot_rep = 1 and the RPC interface is set to localhost, and it works.
Do you mind helping me with this problem? Thank you very much!