[Bug] RandomForest: cuML throws exception when setting n_streams>1 on the node with 2 GPUs #111

wbo4958 · 2023-02-23T06:53:48Z

Issue Description

When I run the RandomForestClassifier benchmark on a node with 2 GPUs, spark-rapids-ml throws the exception below:

terminate called after throwing an instance of 'raft::cuda_error'
  what():  CUDA error encountered at: file=/home/xxx/work.d/ml/cuml/cpp/src/decisiontree/batched-levelalgo/builder.cuh line=328: call='cudaMemsetAsync(done_count, 0, sizeof(int) * max_batch * n_col_blks, builder_stream)', Reason=cudaErrorInvalidValue:invalid argument
Obtained 14 stack frames
#0 in /home/xxx/anaconda3/envs/cuml_dev/lib/libcuml++.so(_ZN4raft9exception18collect_call_stackEv+0x2e) [0x7f586987abe0]
#1 in /home/xxx/anaconda3/envs/cuml_dev/lib/libcuml++.so(_ZN4raft9exceptionC2ENSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE+0x5a) [0x7f586987ab8c]
#2 in /home/xxx/anaconda3/envs/cuml_dev/lib/libcuml++.so(_ZN4raft10cuda_errorC1ERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE+0x37) [0x7f586987c0b1]
#3 in /home/xxx/anaconda3/envs/cuml_dev/lib/libcuml++.so(_ZN2ML2DT7BuilderINS0_21GiniObjectiveFunctionIfiiEEE15assignWorkspaceEPcS5_+0x5fc) [0x7f586a6bd504]
#4 in /home/xxx/anaconda3/envs/cuml_dev/lib/libcuml++.so(_ZN2ML2DT7BuilderINS0_21GiniObjectiveFunctionIfiiEEEC1ERKN4raft8handle_tEP11CUstream_stimRKNS0_18DecisionTreeParamsEPKfPKiiiPN3rmm14device_uvectorIiEEiRKNS0_9QuantilesIfiEE+0x7ad) [0x7f586a69b2ff]
#5 in /home/xxx/anaconda3/envs/cuml_dev/lib/libcuml++.so(_ZN2ML2DT12DecisionTree3fitIfiEESt10shared_ptrINS0_16TreeMetaDataNodeIT_T0_EEERKN4raft8handle_tEP11CUstream_stPKS5_iiPKS6_PN3rmm14device_uvectorIiEEiNS0_18DecisionTreeParamsEmRKNS0_9QuantilesIS5_iEEi+0xa1) [0x7f586a68193a]
#6 in /home/xxx/anaconda3/envs/cuml_dev/lib/libcuml++.so(+0x1a441dd) [0x7f586a7101dd]
#7 in /home/xxx/anaconda3/envs/cuml_dev/lib/libgomp.so.1(+0x97818) [0x7f5868a19818]
#8 in /home/xxx/anaconda3/envs/cuml_dev/lib/libgomp.so.1(__kmp_invoke_microtask+0x93) [0x7f5868a363b3]
#9 in /home/xxx/anaconda3/envs/cuml_dev/lib/libgomp.so.1(+0x42194) [0x7f58689c4194]
#10 in /home/xxx/anaconda3/envs/cuml_dev/lib/libgomp.so.1(+0x4189a) [0x7f58689c389a]
#11 in /home/xxx/anaconda3/envs/cuml_dev/lib/libgomp.so.1(+0x96072) [0x7f5868a18072]
#12 in /lib/x86_64-linux-gnu/libpthread.so.0(+0x8609) [0x7f598e381609]
#13 in /lib/x86_64-linux-gnu/libc.so.6(clone+0x43) [0x7f598e142133]
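
For triage, the failing call can be pulled out of the what() line mechanically. The following is a minimal sketch, assuming the message keeps the file=/line=/call=/Reason= layout shown above; other RAFT or cuML versions may format it differently. The names CUDA_ERROR_RE and parse_cuda_error are hypothetical helpers, not part of cuML or spark-rapids-ml.

```python
import re

# Hypothetical helper: matches the message layout seen in this report only.
# Assumption: other RAFT/cuML versions may format the message differently.
CUDA_ERROR_RE = re.compile(
    r"file=(?P<file>\S+)\s+line=(?P<line>\d+):\s+"
    r"call='(?P<call>.*)',\s+Reason=(?P<reason>\w+):(?P<detail>.*)"
)


def parse_cuda_error(message):
    """Return the source file, line, CUDA call, and reason, or None if no match."""
    match = CUDA_ERROR_RE.search(message)
    if match is None:
        return None
    fields = match.groupdict()
    fields["line"] = int(fields["line"])
    fields["detail"] = fields["detail"].strip()
    return fields


# The what() line copied from the report above.
sample = (
    "what():  CUDA error encountered at: "
    "file=/home/xxx/work.d/ml/cuml/cpp/src/decisiontree/batched-levelalgo/builder.cuh "
    "line=328: call='cudaMemsetAsync(done_count, 0, "
    "sizeof(int) * max_batch * n_col_blks, builder_stream)', "
    "Reason=cudaErrorInvalidValue:invalid argument"
)

info = parse_cuda_error(sample)
print(info["file"].rsplit("/", 1)[-1], info["line"], info["reason"])
```

Here the parsed fields point at the cudaMemsetAsync call on builder_stream in builder.cuh, which is the same location that frame #3 (assignWorkspace) refers to.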

How to repro

The issue can be reproduced in Spark local or standalone mode when one worker has 2 GPUs.

Generate datasets

python gen_data.py classification --output_dir=/tmp/abc

Run the training job

python benchmark_runner.py random_forest_classifier --train_path=/tmp/abc --n_streams=4 --gpu_worker=2

Note that the issue does not occur when there are 2 workers, each with 1 GPU.
