When I run the RandomForestClassifier benchmark on a node with 2 GPUs, spark-rapids-ml throws the following exception:
terminate called after throwing an instance of 'raft::cuda_error'
  what(): CUDA error encountered at: file=/home/xxx/work.d/ml/cuml/cpp/src/decisiontree/batched-levelalgo/builder.cuh line=328: call='cudaMemsetAsync(done_count, 0, sizeof(int) * max_batch * n_col_blks, builder_stream)', Reason=cudaErrorInvalidValue:invalid argument
Obtained 14 stack frames
#0 in /home/xxx/anaconda3/envs/cuml_dev/lib/libcuml++.so(_ZN4raft9exception18collect_call_stackEv+0x2e) [0x7f586987abe0]
#1 in /home/xxx/anaconda3/envs/cuml_dev/lib/libcuml++.so(_ZN4raft9exceptionC2ENSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE+0x5a) [0x7f586987ab8c]
#2 in /home/xxx/anaconda3/envs/cuml_dev/lib/libcuml++.so(_ZN4raft10cuda_errorC1ERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE+0x37) [0x7f586987c0b1]
#3 in /home/xxx/anaconda3/envs/cuml_dev/lib/libcuml++.so(_ZN2ML2DT7BuilderINS0_21GiniObjectiveFunctionIfiiEEE15assignWorkspaceEPcS5_+0x5fc) [0x7f586a6bd504]
#4 in /home/xxx/anaconda3/envs/cuml_dev/lib/libcuml++.so(_ZN2ML2DT7BuilderINS0_21GiniObjectiveFunctionIfiiEEEC1ERKN4raft8handle_tEP11CUstream_stimRKNS0_18DecisionTreeParamsEPKfPKiiiPN3rmm14device_uvectorIiEEiRKNS0_9QuantilesIfiEE+0x7ad) [0x7f586a69b2ff]
#5 in /home/xxx/anaconda3/envs/cuml_dev/lib/libcuml++.so(_ZN2ML2DT12DecisionTree3fitIfiEESt10shared_ptrINS0_16TreeMetaDataNodeIT_T0_EEERKN4raft8handle_tEP11CUstream_stPKS5_iiPKS6_PN3rmm14device_uvectorIiEEiNS0_18DecisionTreeParamsEmRKNS0_9QuantilesIS5_iEEi+0xa1) [0x7f586a68193a]
#6 in /home/xxx/anaconda3/envs/cuml_dev/lib/libcuml++.so(+0x1a441dd) [0x7f586a7101dd]
#7 in /home/xxxanaconda3/envs/cuml_dev/lib/libgomp.so.1(+0x97818) [0x7f5868a19818]
#8 in /home/xxx/anaconda3/envs/cuml_dev/lib/libgomp.so.1(__kmp_invoke_microtask+0x93) [0x7f5868a363b3]
#9 in /home/xxx/anaconda3/envs/cuml_dev/lib/libgomp.so.1(+0x42194) [0x7f58689c4194]
#10 in /home/xxx/anaconda3/envs/cuml_dev/lib/libgomp.so.1(+0x4189a) [0x7f58689c389a]
#11 in /home/xxx/anaconda3/envs/cuml_dev/lib/libgomp.so.1(+0x96072) [0x7f5868a18072]
#12 in /lib/x86_64-linux-gnu/libpthread.so.0(+0x8609) [0x7f598e381609]
#13 in /lib/x86_64-linux-gnu/libc.so.6(clone+0x43) [0x7f598e142133]
How to repro
The issue can be reproduced in Spark local or standalone mode when a single worker has 2 GPUs.
Generate datasets
Run the training job
Note that the issue does not occur when there are 2 workers, each with 1 GPU.
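For the standalone case, the two layouts compared here differ only in how many GPUs a single worker exposes. The following is a minimal sketch of that difference, not the configuration actually used in this report. The property names are standard Spark resource-scheduling settings, while the helper `worker_gpu_conf` and the choice of one GPU per executor and per task are assumptions made for illustration.

```python
# Hypothetical helper for illustration only. The values below are assumptions,
# not the settings used when the crash was observed.

def worker_gpu_conf(gpus_per_worker: int) -> dict:
    """Return Spark standalone properties for a worker exposing `gpus_per_worker` GPUs."""
    if gpus_per_worker < 1:
        raise ValueError("gpus_per_worker must be >= 1")
    return {
        # Number of GPUs each standalone worker advertises. A real worker also
        # needs spark.worker.resource.gpu.discoveryScript, omitted here.
        "spark.worker.resource.gpu.amount": str(gpus_per_worker),
        # Assumption: each executor takes one GPU and each task uses it exclusively.
        "spark.executor.resource.gpu.amount": "1",
        "spark.task.resource.gpu.amount": "1",
    }


failing = worker_gpu_conf(2)  # 1 worker with 2 GPUs: the layout reported to crash
working = worker_gpu_conf(1)  # 2 workers with 1 GPU each: reported to work

# Properties whose values differ between the two layouts.
differing = sorted(k for k in failing if failing[k] != working[k])
print(differing)
```

Under these assumptions the only differing property is the worker-level GPU amount, which is consistent with the crash depending on two GPUs being visible to the same worker.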