ASSERT(cm_->open_thread_local_device(idx) != nullptr) in src/core/rworker.cc #5

Open
psistakis opened this issue Jun 8, 2021 · 6 comments

psistakis commented Jun 8, 2021

Hi,

Have you come across this assertion before? It is preceded by several warnings that port_id 1 on device 1 is not active.

To build the project, I used the suggested flags (cmake -DUSE_RDMA=1 -DONE_SIDED_READ=1 -DROCC_RBUF_SIZE_M=13240 -DRDMA_STORE_SIZE=5000 -DRDMA_CACHE=0 -DTX_LOG_STYLE=2).

I use the default config.xml and have added two hostnames to the hosts.xml file. When I run ./run2.py config.xml noccocc "-t 24 -c 10 -r 100" bank 2, I get the output below.

I have also set use_port_ to 0 in RWorker::choose_rnic_port(), as suggested in #2, since I have one NIC per machine. In addition, I have applied the change described in #4.

I would appreciate any feedback.

Thank you.

Output:

NOCC started with program [noccocc]. at 08-06-2021 11:04:12
[bench_runner.cc:303] Use TCP port 33333
[bench_runner.cc:325] use scale factor: 24; with total 24 threads.
[view.h:48] Start with 0 backups.
[view.cc:10] total 2 backups to assign
[Bank]: check workload 25, 15, 15, 15, 15, 15
[util.cc:167] huge page real size 12.9316G
[rnic.hpp:60] query port_id 1 on device 1 not active.
[bench_runner.cc:135] Total logger area 0.00390625G.
[bench_runner.cc:146] add RDMA store size 4.88281G.
[bench_runner.cc:156] First 4.88867G are left over.
[bench_runner.cc:159] RDMA heap size 8.041G.
[util.cc:167] huge page real size 0.294922G
[util.cc:167] huge page real size 0.294922G
[Bank], total 4800000 accounts loaded
[bank_main.cc:262] check cv balance 46280
[Runner] local db size: 220.746 MB
[Runner] Cache size: 0 MB
[bench_runner.cc:210] backed list num: 0
[bench_listener2.cc:70] try log results to ./results/noccocc_bank_2_24_10_100.log
[rnic.hpp:60] query port_id 1 on device 1 not active.
[rnic.hpp:60] query port_id 1 on device 1 not active.
[rnic.hpp:60] query port_id 1 on device 1 not active.
[rnic.hpp:60] query port_id 1 on device 1 not active.
[rnic.hpp:60] query port_id 1 on device 1 not active.
[rnic.hpp:60] query port_id 1 on device 1 not active.
[rnic.hpp:60] query port_id 1 on device 1 not active.
[rnic.hpp:60] query port_id 1 on device 1 not active.
[rnic.hpp:60] query port_id 1 on device 1 not active.
[rnic.hpp:60] query port_id 1 on device 1 not active.
[rnic.hpp:60] query port_id 1 on device 1 not active.
[rnic.hpp:60] query port_id 1 on device 1 not active.
[rnic.hpp:60] query port_id 1 on device 1 not active.
[rnic.hpp:60] query port_id 1 on device 1 not active.
[rnic.hpp:60] query port_id 1 on device 1 not active.
[rnic.hpp:60] query port_id 1 on device 1 not active.
[rnic.hpp:60] query port_id 1 on device 1 not active.
[rnic.hpp:60] query port_id 1 on device 1 not active.
[rnic.hpp:60] query port_id 1 on device 1 not active.
[rnic.hpp:60] query port_id 1 on device 1 not active.
[rnic.hpp:60] query port_id 1 on device 1 not active.
[rnic.hpp:60] query port_id 1 on device 1 not active.
[rnic.hpp:60] query port_id 1 on device 1 not active.
[rnic.hpp:60] query port_id 1 on device 1 not active.
[rdma_ctrl_impl.hpp:82] wrong dev_id: -1; total 2 found
[rworker.cc:106] Assertion!
[rnic.hpp:60] query port_id 1 on device 1 not active.
[NOCC] Meet an assertion failure!
stack trace:
[rnic.hpp:60] query port_id 1 on device 1 not active.
[rnic.hpp:60] query port_id 1 on device 1 not active.
[rnic.hpp:60] query port_id 1 on device 1 not active.
./noccocc() [0x4c0225]
/lib/x86_64-linux-gnu/libc.so.6 : ()+0x354c0
/lib/x86_64-linux-gnu/libc.so.6 : gsignal()+0x38
/lib/x86_64-linux-gnu/libc.so.6 : abort()+0x16a
./noccocc : nocc::MessageLogger::~MessageLogger()+0x2ee
./noccocc : nocc::oltp::RWorker::init_rdma(char*, unsigned long)+0x452
./noccocc : nocc::oltp::BenchWorker::run()+0x2d1
./noccocc : ndb_thread::pthread_bootstrap(void*)+0xf
/lib/x86_64-linux-gnu/libpthread.so.0 : ()+0x76ba
/lib/x86_64-linux-gnu/libc.so.6 : clone()+0x6d
[ENDING] End benchmarks
[ENDING] send ending messages in SIGINT handler
[ENDING] kill processes
node0 password:
node1 password:
kill try 0
node0 password:
node1 password:
Kill done
[ENDING] kill processes done

@wxdwfc
Collaborator

wxdwfc commented Jun 9, 2021 via email

@psistakis
Author

Hi,

Thanks for the reply.

Based on the output, it seems each machine in the cluster has two devices, and port 1 of the second device is inactive (at least that is my understanding). I do not have physical access to the cluster to confirm this, but I can double-check with someone who does. Is there a way to work around this issue, i.e., use only one device and one port per machine?

Thank you.

Running ibstatus on each machine returns the following:

Infiniband device 'mlx5_0' port 1 status:
default gid: XXX
base lid: 0x6
sm lid: 0x4
state: 4: ACTIVE
phys state: 5: LinkUp
rate: 100 Gb/sec (4X EDR)
link_layer: InfiniBand

Infiniband device 'mlx5_1' port 1 status:
default gid: XXX
base lid: 0xffff
sm lid: 0x0
state: 1: DOWN
phys state: 3: Disabled
rate: 10 Gb/sec (4X SDR)
link_layer: InfiniBand
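
For anyone who wants to check this programmatically rather than by eye, here is a minimal sketch that scans ibstatus-style text and reports which device/port pairs are ACTIVE. It assumes the output layout shown above (a header line of the form "Infiniband device '<name>' port <n> status:" followed by a "state:" line). The PortStatus struct and the parse_ibstatus function are hypothetical names of mine; they are not part of the project or of any RDMA library.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical record: one entry per "Infiniband device '<name>' port <n> status:" block.
struct PortStatus {
    std::string device;   // e.g. "mlx5_0"
    int port = 0;         // port number as printed by ibstatus (1-based)
    bool active = false;  // true when the "state:" line contains ACTIVE
};

// Parse ibstatus-style text. Assumes the layout shown in this thread.
std::vector<PortStatus> parse_ibstatus(const std::string& text) {
    std::vector<PortStatus> out;
    std::istringstream in(text);
    std::string line;
    const std::string header = "Infiniband device '";
    while (std::getline(in, line)) {
        std::string::size_type h = line.find(header);
        if (h != std::string::npos) {
            // Header line: extract the quoted device name and the port number.
            std::string::size_type name_begin = h + header.size();
            std::string::size_type name_end = line.find('\'', name_begin);
            if (name_end == std::string::npos) continue;
            std::string::size_type port_pos = line.find("port ", name_end);
            if (port_pos == std::string::npos) continue;
            PortStatus p;
            p.device = line.substr(name_begin, name_end - name_begin);
            p.port = std::stoi(line.substr(port_pos + 5));
            out.push_back(p);
        } else if (!out.empty()) {
            // Match "state:" at the start of the line (after indentation),
            // which skips the separate "phys state:" line.
            std::string::size_type first = line.find_first_not_of(" \t");
            if (first != std::string::npos && line.compare(first, 6, "state:") == 0) {
                out.back().active = line.find("ACTIVE") != std::string::npos;
            }
        }
    }
    return out;
}
```

Fed the output above, this should mark mlx5_0 port 1 as active and mlx5_1 port 1 as inactive, which matches the "not active" warnings for device 1 in the log.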

@wxdwfc
Collaborator

wxdwfc commented Jun 9, 2021 via email

@psistakis
Author

Hi,

Thank you for your feedback.

Just to make sure I understand: if the X in the name mlx5_X from the output I sent earlier is the port number, then port 0 (mlx5_0) is the active one, correct?

If so, as I mentioned in my first comment, I have already set use_port_ to 0 in RWorker::choose_rnic_port() as suggested in #2. Is that what you are suggesting? I made this change and rebuilt the project, but I get the same output.

Please let me know if I have misunderstood something.

Thank you.

@wxdwfc
Collaborator

wxdwfc commented Jun 9, 2021 via email

@psistakis
Author

Hi,

Thanks for the feedback.

I tried the following and it seems to have worked.

In init_rdma() in src/core/rworker.cc, I set idx to a fixed value (dev_id = 0, port_id = 1) instead of using cm_->convert_port_idx(). More specifically:

RdmaCtrl::DevIdx idx = RdmaCtrl::DevIdx{.dev_id = 0, .port_id = 1};
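
The hard-coded value works for this cluster, but the same idea can be written without fixing the indices: pick the first port that is actually active. Below is a rough sketch of that selection logic only. DevIdx here is a hypothetical stand-in for the project's RdmaCtrl::DevIdx, and NicInfo and first_active_port are illustrative names of mine. How the per-port state is obtained (for example, by querying the NIC) is assumed and not shown.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the project's RdmaCtrl::DevIdx.
struct DevIdx {
    int dev_id;
    int port_id;
};

// Hypothetical description of one NIC: active_ports[i] is true
// when port (i + 1) is ACTIVE. Ports are 1-based, as in ibstatus.
struct NicInfo {
    std::vector<bool> active_ports;
};

// Scan devices in order and write the first active (dev_id, port_id) to `out`.
// Returns false, leaving `out` untouched, when no port is active.
bool first_active_port(const std::vector<NicInfo>& nics, DevIdx& out) {
    for (std::size_t d = 0; d < nics.size(); ++d) {
        for (std::size_t p = 0; p < nics[d].active_ports.size(); ++p) {
            if (nics[d].active_ports[p]) {
                out.dev_id = static_cast<int>(d);
                out.port_id = static_cast<int>(p + 1);
                return true;
            }
        }
    }
    return false;
}
```

With the two devices from the ibstatus output (device 0 port 1 active, device 1 port 1 down), this yields dev_id = 0 and port_id = 1, the same values as the hard-coded line above.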

Thank you for your help! :)
