Questions about installation #2

Open
BinZlP opened this issue Apr 22, 2021 · 1 comment

BinZlP commented Apr 22, 2021

I sent this question via email but got no reply, so I am posting it as an issue.

I used two nodes: one as the compute node and the other as the far-memory node.

Both nodes have the following HW specifications:

  • CPU: Intel Xeon W-2245 (3.9GHz, 8 cores)
  • Memory: 32GB
  • NIC: ConnectX3 EN 40G

and SW specifications:

NIC Drivers
  • Mellanox ConnectX3 OFED 4.2-1.0.0.0 drivers (both nodes)

OS
  • Ubuntu 16.04.7 server (both nodes)
  • Fastswap kernel, compiled and installed following the given instructions (compute node)
  • Linux kernel 4.11.0 (far-memory node)

I wanted to use 24GB of far memory, so I created a 24GB swap file and deactivated all other swap devices on my compute node.


So, the following are my questions:
[screenshot "rmserver_edit": edited farmemserver/rmserver.c]

  1. The paper says you used 32GB of memory for each node. In my case, there's an error when trying to allocate 32GB of memory for the queues with malloc(BUFFER_SIZE). I changed BUFFER_SIZE to 24GB and it works. To use less memory than the default 32GB, is it correct to modify BUFFER_SIZE in farmemserver/rmserver.c as above?
  2. drivers/fastswap_rdma.c gets the number of CPUs from num_online_cpus(), but farmemserver/rmserver.c hard-codes the number of CPUs to 8. If hyperthreading is enabled (as in the paper), num_online_cpus() will return 16 on the compute node and fastswap will try to get 48 queues from the server, but rmserver only creates 24 queues. Is it okay to change NUM_PROCS in farmemserver/rmserver.c to 16, as in the screenshot attached above? (Both edits are sketched just after this list.)
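
Roughly, the two edits look like this. This is only a sketch, assuming both values are plain #define constants in farmemserver/rmserver.c, so the exact definitions in the file may differ:

    /* farmemserver/rmserver.c -- sketch of the two edits described above;
     * check the actual definitions in the file before applying. */

    /* 1. Shrink the far-memory pool from the default 32GB to 24GB so that
     *    malloc(BUFFER_SIZE) succeeds on a 32GB machine. */
    #define BUFFER_SIZE (24ULL * 1024 * 1024 * 1024)

    /* 2. Match num_online_cpus() on the compute node (16 with
     *    hyperthreading enabled). */
    #define NUM_PROCS 16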

[screenshot "fastswap_kernel_error": kernel error on the compute node]
After I compiled and ran rmserver with the edited code, I could successfully load the fastswap_rdma and fastswap modules on the compute node. But when I tried to run the test workloads of cfm, I hit the kernel error shown above on the compute node, and the swap traffic went to the local swap space (the compute node didn't make any RDMA requests). I tried rebooting both machines and setting up far memory again, but the compute node only used the local swap file, without any errors.

  1. Have you ever experienced the same error as above? If so, could you tell me what the problem was and how you solved it?

Thank you for reading, and I would appreciate your reply.

amaro (Collaborator) commented Apr 25, 2021

I just realized that the mailing list seems to be broken. I'll try to get it fixed. Sorry about that!

The paper says you used 32GB of memory for each node. In my case, there's an error when trying to allocate 32GB of memory for the queues with malloc(BUFFER_SIZE). I changed BUFFER_SIZE to 24GB and it works. To use less memory than the default 32GB, is it correct to modify BUFFER_SIZE in farmemserver/rmserver.c as above?

You should modify the source code of the program, and recompile it (in the screenshot, it seems you modified only the comment).

drivers/fastswap_rdma.c gets the number of CPUs from num_online_cpus(), but farmemserver/rmserver.c hard-codes the number of CPUs to 8. If hyperthreading is enabled (as in the paper), num_online_cpus() will return 16 on the compute node and fastswap will try to get 48 queues from the server, but rmserver only creates 24 queues. Is it okay to change NUM_PROCS in farmemserver/rmserver.c to 16, as in the screenshot attached above?

We didn't use hyperthreading in our experiments; quoting from the paper: "We use one hyperthread on each core and disable TurboBoost and CPU frequency scaling in order to reduce variability."
The number of queues must match exactly between fastswap and the memory server. So yes, modifying (and recompiling) rmserver so that it creates the same number of queues fastswap is trying to create should work.
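
As a rough illustration (not code from the repository), and assuming fastswap requests three queues per CPU, which is what the 8 -> 24 and 16 -> 48 counts above imply, you can derive the NUM_PROCS value the memory server needs from the compute node's online CPU count:

    /* Sketch only: compute the NUM_PROCS value rmserver needs from the
     * compute node's online CPU count. QUEUES_PER_CPU = 3 is an
     * assumption inferred from the queue counts discussed in this issue. */
    #include <stdio.h>
    #include <unistd.h>

    #define QUEUES_PER_CPU 3

    int main(void)
    {
        /* Corresponds to what num_online_cpus() reports inside the kernel. */
        long cpus = sysconf(_SC_NPROCESSORS_ONLN);
        printf("compute node online CPUs    : %ld\n", cpus);
        printf("queues fastswap will request: %ld\n", cpus * QUEUES_PER_CPU);
        printf("NUM_PROCS for rmserver      : %ld\n", cpus);
        return 0;
    }

Run it on the compute node; with hyperthreading enabled it should report 16 CPUs and 48 queues.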

Have you ever experienced the same error as above? If so, could you tell me what the problem was and how you solved it?

No, I haven't seen this error before. Can you post your dmesg output from boot until the error shows up?

Can you also post the output of using ib_read_lat between the server and client?

amaro self-assigned this Apr 25, 2021
amaro added the question (Further information is requested) label Apr 25, 2021