ASSERT(cm_->open_thread_local_device(idx) != nullptr) in src/core/rworker.cc #5
Comments
Hi,
According to the log line
[rnic.hpp:60] query port_id 1 on device 1 not active.
it seems that the device on your machine is not active. Could you please check the output of `ibstatus`?
On June 9, 2021, at 1:07 AM, Antonis Psistakis ***@***.***> wrote:
Hi,
I would like to ask whether you have seen this assertion before. Before the assertion, there are warning messages saying that port_id 1 on device 1 is not active.
In order to build the project, I used the suggested flags (cmake -DUSE_RDMA=1 -DONE_SIDED_READ=1 -DROCC_RBUF_SIZE_M=13240 -DRDMA_STORE_SIZE=5000 -DRDMA_CACHE=0 -DTX_LOG_STYLE=2).
When I run: ./run2.py config.xml noccocc "-t 24 -c 10 -r 100" bank 2 (I use the default config.xml and I have added two (2) hostnames in the hosts.xml file), I get the output below.
I have also set use_port_ to 0 in RWorker::choose_rnic_port() as suggested in #2, since I have 1 NIC per machine. Furthermore, I have made the change described in #4.
I would appreciate any feedback.
Thank you.
Output:
NOCC started with program [noccocc]. at 08-06-2021 11:04:12
[bench_runner.cc:303] Use TCP port 33333
[bench_runner.cc:325] use scale factor: 24; with total 24 threads.
[view.h:48] Start with 0 backups.
[view.cc:10] total 2 backups to assign
[Bank]: check workload 25, 15, 15, 15, 15, 15
[util.cc:167] huge page real size 12.9316G
[rnic.hpp:60] query port_id 1 on device 1 not active.
[bench_runner.cc:135] Total logger area 0.00390625G.
[bench_runner.cc:146] add RDMA store size 4.88281G.
[bench_runner.cc:156] First 4.88867G are left over.
[bench_runner.cc:159] RDMA heap size 8.041G.
[util.cc:167] huge page real size 0.294922G
[util.cc:167] huge page real size 0.294922G
[Bank], total 4800000 accounts loaded
[bank_main.cc:262] check cv balance 46280
[Runner] local db size: 220.746 MB
[Runner] Cache size: 0 MB
[bench_runner.cc:210] backed list num: 0
[bench_listener2.cc:70] try log results to ./results/noccocc_bank_2_24_10_100.log
[rnic.hpp:60] query port_id 1 on device 1 not active.
[rnic.hpp:60] query port_id 1 on device 1 not active.
[rnic.hpp:60] query port_id 1 on device 1 not active.
[rnic.hpp:60] query port_id 1 on device 1 not active.
[rnic.hpp:60] query port_id 1 on device 1 not active.
[rnic.hpp:60] query port_id 1 on device 1 not active.
[rnic.hpp:60] query port_id 1 on device 1 not active.
[rnic.hpp:60] query port_id 1 on device 1 not active.
[rnic.hpp:60] query port_id 1 on device 1 not active.
[rnic.hpp:60] query port_id 1 on device 1 not active.
[rnic.hpp:60] query port_id 1 on device 1 not active.
[rnic.hpp:60] query port_id 1 on device 1 not active.
[rnic.hpp:60] query port_id 1 on device 1 not active.
[rnic.hpp:60] query port_id 1 on device 1 not active.
[rnic.hpp:60] query port_id 1 on device 1 not active.
[rnic.hpp:60] query port_id 1 on device 1 not active.
[rnic.hpp:60] query port_id 1 on device 1 not active.
[rnic.hpp:60] query port_id 1 on device 1 not active.
[rnic.hpp:60] query port_id 1 on device 1 not active.
[rnic.hpp:60] query port_id 1 on device 1 not active.
[rnic.hpp:60] query port_id 1 on device 1 not active.
[rnic.hpp:60] query port_id 1 on device 1 not active.
[rnic.hpp:60] query port_id 1 on device 1 not active.
[rnic.hpp:60] query port_id 1 on device 1 not active.
[rdma_ctrl_impl.hpp:82] wrong dev_id: -1; total 2 found
[rworker.cc:106] Assertion!
[rnic.hpp:60] query port_id 1 on device 1 not active.
[NOCC] Meet an assertion failure!
stack trace:
[rnic.hpp:60] query port_id 1 on device 1 not active.
[rnic.hpp:60] query port_id 1 on device 1 not active.
[rnic.hpp:60] query port_id 1 on device 1 not active.
./noccocc() [0x4c0225]
/lib/x86_64-linux-gnu/libc.so.6 : ()+0x354c0
/lib/x86_64-linux-gnu/libc.so.6 : gsignal()+0x38
/lib/x86_64-linux-gnu/libc.so.6 : abort()+0x16a
./noccocc : nocc::MessageLogger::~MessageLogger()+0x2ee
./noccocc : nocc::oltp::RWorker::init_rdma(char*, unsigned long)+0x452
./noccocc : nocc::oltp::BenchWorker::run()+0x2d1
./noccocc : ndb_thread::pthread_bootstrap(void*)+0xf
/lib/x86_64-linux-gnu/libpthread.so.0 : ()+0x76ba
/lib/x86_64-linux-gnu/libc.so.6 : clone()+0x6d
[ENDING] End benchmarks
[ENDING] send ending messages in SIGINT handler
[ENDING] kill processes
's password:
's password:
kill try 0
's password:
's password:
Kill done
[ENDING] kill processes done
Hi,
Thanks for sending me more information.
According to the output of `ibstatus`, port 1 on the second NIC (mlx5_1) is not available.
To specify which port each thread uses, you can customize DrTM+H by modifying `choose_rnic_port()` in src/core/rworker.cc so that it uses an active port.
Hopefully this fixes your problem.
Thanks!
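The idea of steering every worker thread to an active port can be sketched as follows. This is only an illustration under stated assumptions, not DrTM+H's actual `choose_rnic_port()`: the `RnicPort` struct and `pick_port_for_thread` function are hypothetical names, the device index is assumed to be 0-based, and the port number is assumed to be 1-based (verbs port numbers start at 1).

```cpp
#include <cstddef>
#include <optional>
#include <vector>

// Hypothetical descriptor for one port of one RDMA device (not a DrTM+H type).
struct RnicPort {
    int dev_id;    // index of the device in enumeration order (assumed 0-based)
    int port_idx;  // port number on that device (verbs port numbers start at 1)
    bool active;   // true if the port state is ACTIVE
};

// Assign a worker thread to one of the active ports, round-robin by thread id.
// Returns std::nullopt when no port is active, so the caller can fail early
// with a clear message instead of hitting an assertion later.
std::optional<RnicPort> pick_port_for_thread(const std::vector<RnicPort>& ports,
                                             std::size_t thread_id) {
    std::vector<RnicPort> active;
    for (const auto& p : ports) {
        if (p.active) active.push_back(p);  // skip DOWN/disabled ports
    }
    if (active.empty()) return std::nullopt;
    return active[thread_id % active.size()];
}
```

With one active and one inactive port, every thread id maps to the single active port, which matches the "one device, one port per machine" setup asked about in this thread.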
On June 9, 2021, at 5:15 PM, Antonis Psistakis ***@***.***> wrote:
Hi,
Thanks for the reply.
Based on the output, it seems the cluster has two devices per machine, and port 1 of the second machine is inactive (at least that is my understanding). I am afraid I do not have physical access to the cluster to confirm this, but I can double-check with someone who has. Is there a way to bypass this issue, i.e., use only one device and one port per machine?
Thank you.
The ibstatus on each machine returns the following:
Infiniband device 'mlx5_0' port 1 status:
default gid: XXX
base lid: 0x6
sm lid: 0x4
state: 4: ACTIVE
phys state: 5: LinkUp
rate: 100 Gb/sec (4X EDR)
link_layer: InfiniBand
Infiniband device 'mlx5_1' port 1 status:
default gid: XXX
base lid: 0xffff
sm lid: 0x0
state: 1: DOWN
phys state: 3: Disabled
rate: 10 Gb/sec (4X SDR)
link_layer: InfiniBand
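To read output like the above programmatically, a small parser can extract the device name, port number, and state of each entry and report the first ACTIVE one. This is a minimal sketch that assumes the text layout shown in this thread (a header line of the form `Infiniband device '<name>' port <n> status:` followed by a `state:` line); `PortStatus`, `parse_ibstatus`, and `first_active` are hypothetical helper names, not part of DrTM+H or of the InfiniBand tools.

```cpp
#include <optional>
#include <sstream>
#include <string>
#include <vector>

// One device/port entry taken from `ibstatus` text output.
struct PortStatus {
    std::string device;  // e.g. "mlx5_0"
    int port = 0;        // port number as printed by ibstatus
    std::string state;   // e.g. "ACTIVE" or "DOWN"
};

// Parse ibstatus-style text into one record per device/port header line.
std::vector<PortStatus> parse_ibstatus(const std::string& text) {
    const std::string header = "Infiniband device '";
    std::vector<PortStatus> ports;
    std::istringstream in(text);
    std::string line;
    while (std::getline(in, line)) {
        auto h = line.find(header);
        if (h != std::string::npos) {
            // Header line: Infiniband device 'mlx5_0' port 1 status:
            auto name_begin = h + header.size();
            auto name_end = line.find('\'', name_begin);
            auto port_pos = line.find("port ", name_end);
            if (name_end == std::string::npos || port_pos == std::string::npos) continue;
            PortStatus p;
            p.device = line.substr(name_begin, name_end - name_begin);
            p.port = std::stoi(line.substr(port_pos + 5));  // throws if no digits follow
            ports.push_back(p);
            continue;
        }
        // Only lines that begin with "state:" count; "phys state:" is skipped.
        auto first = line.find_first_not_of(" \t");
        if (first == std::string::npos || ports.empty()) continue;
        if (line.compare(first, 6, "state:") != 0) continue;
        // "state: 4: ACTIVE" -> keep the text after the last colon.
        auto value = line.find_first_not_of(" \t", line.rfind(':') + 1);
        if (value != std::string::npos) ports.back().state = line.substr(value);
    }
    return ports;
}

// Return the first entry whose state is ACTIVE, if any.
std::optional<PortStatus> first_active(const std::vector<PortStatus>& ports) {
    for (const auto& p : ports) {
        if (p.state == "ACTIVE") return p;
    }
    return std::nullopt;
}
```

Applied to the output quoted above, the sketch would yield two entries, `mlx5_0` port 1 (ACTIVE) and `mlx5_1` port 1 (DOWN), and `first_active` would select the `mlx5_0` entry.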
Hi,
I think you understand correctly. It is strange that using the first device does not resolve the issue; I have not run into this problem before.
I'm sorry, but I cannot help further if you are already using the active device (i.e., dev_id = 0 & port_idx = 1) and the error is still reported.
Thanks.
On June 9, 2021, at 8:05 PM, Antonis Psistakis ***@***.***> wrote:
Hi,
Thank you for your feedback.
Just to make sure I understand: if the name mlx5_X from the output I sent earlier shows the port number, then port 0 (mlx5_0) is the one that is active, correct?
If that is the case, as I mentioned earlier (first comment), I have set use_port_ to 0 in RWorker::choose_rnic_port() as suggested in #2. Is this what you are suggesting? I tried this change earlier and rebuilt the project, but I get the same output.
Please let me know if I have misunderstood something.
Thank you.
Hi, Thanks for the feedback. I tried the following and it seems it worked. In the
Thank you for your help! :)