
2.1.0-SNAPSHOT GPU jars train FAILED on Check failed: device_ == device (0 vs. 3) #9510

Closed
NvTimLiu opened this issue Aug 22, 2023 · 9 comments

Comments

@NvTimLiu

Using the 2.1.0-SNAPSHOT GPU jars, training failed with `src/common/host_device_vector.cu:166: Check failed: device_ == device (0 vs. 3) : New device ordinal is different from previous one.`

Driver log:
driver-log.txt

 00:06:15 INFO BlockManagerInfo: Added broadcast_12_piece0 in memory on 127.0.0.1:35395 (size: 30.0 KiB, free: 16.9 GiB)
  00:06:18 WARN TaskSetManager: Lost task 0.0 in stage 6.0 (TID 9) (127.0.0.1 executor 3): ml.dmlc.xgboost4j.java.XGBoostError: [00:06:18] /workspace/src/common/host_device_vector.cu:166: Check failed: device_ == device (0 vs. 3) : New device ordinal is different from previous one.
 Stack trace:
   [bt] (0) /raid/tmp/libxgboost4j3852614352633587535.so(+0xa0573a) [0x7eee14bb173a]
   [bt] (1) /raid/tmp/libxgboost4j3852614352633587535.so(xgboost::HostDeviceVectorImpl<unsigned long>::SetDevice(int)+0x1af) [0x7eee14bcf1cf]
   [bt] (2) /raid/tmp/libxgboost4j3852614352633587535.so(xgboost::predictor::GPUPredictor::DevicePredictInternal(xgboost::DMatrix*, xgboost::HostDeviceVector<float>*, xgboost::gbm::GBTreeModel const&, unsigned long, unsigned long) const+0x425) [0x7eee14ed51e5]
   [bt] (3) /raid/tmp/libxgboost4j3852614352633587535.so(xgboost::predictor::GPUPredictor::PredictBatch(xgboost::DMatrix*, xgboost::PredictionCacheEntry*, xgboost::gbm::GBTreeModel const&, unsigned int, unsigned int) const+0x117) [0x7eee14ed63b7]
   [bt] (4) /raid/tmp/libxgboost4j3852614352633587535.so(xgboost::gbm::GBTree::PredictBatchImpl(xgboost::DMatrix*, xgboost::PredictionCacheEntry*, bool, int, int) const+0x299) [0x7eee148628d9]
   [bt] (5) /raid/tmp/libxgboost4j3852614352633587535.so(xgboost::LearnerImpl::Predict(std::shared_ptr<xgboost::DMatrix>, bool, xgboost::HostDeviceVector<float>*, int, int, bool, bool, bool, bool, bool)+0x286) [0x7eee148cba86]
   [bt] (6) /raid/tmp/libxgboost4j3852614352633587535.so(XGBoosterPredict+0xee) [0x7eee1457dfce]
   [bt] (7) /raid/tmp/libxgboost4j3852614352633587535.so(Java_ml_dmlc_xgboost4j_java_XGBoostJNI_XGBoosterPredict+0x2c) [0x7eee1454db5c]
   [bt] (8) [0x7efda10183e7]
 
 
        at ml.dmlc.xgboost4j.java.XGBoostJNI.checkCall(XGBoostJNI.java:48)
        at ml.dmlc.xgboost4j.java.Booster.predict(Booster.java:350)
        at ml.dmlc.xgboost4j.java.Booster.predict(Booster.java:419)
        at ml.dmlc.xgboost4j.scala.Booster.predict(Booster.scala:172)
        at ml.dmlc.xgboost4j.scala.spark.XGBoostClassificationModel.producePredictionItrs(XGBoostClassifier.scala:342)
        at ml.dmlc.xgboost4j.scala.rapids.spark.GpuPreXGBoost$.$anonfun$transformDataset$1(GpuPreXGBoost.scala:198)
        at ml.dmlc.xgboost4j.scala.rapids.spark.GpuPreXGBoost$$anon$1.$anonfun$loadNextBatch$2(GpuPreXGBoost.scala:340)
        at ml.dmlc.xgboost4j.scala.rapids.spark.GpuPreXGBoost$.withResource(GpuPreXGBoost.scala:597)
        at ml.dmlc.xgboost4j.scala.rapids.spark.GpuPreXGBoost$$anon$1.loadNextBatch(GpuPreXGBoost.scala:320)
        at ml.dmlc.xgboost4j.scala.rapids.spark.GpuPreXGBoost$$anon$1.hasNext(GpuPreXGBoost.scala:358)
        at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
        at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
        at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
        at org.apache.spark.sql.execution.columnar.DefaultCachedBatchSerializer$$anon$1.hasNext(InMemoryRelation.scala:118)
        at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
        at org.apache.spark.storage.memory.MemoryStore.putIterator(MemoryStore.scala:221)
        at org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:299)
        at org.apache.spark.storage.BlockManager.$anonfun$doPutIterator$1(BlockManager.scala:1423)
        at org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$doPut(BlockManager.scala:1350)
        at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1414)
        at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:1237)
        at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:384)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:335)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
        at org.apache.spark.sql.execution.SQLExecutionRDD.compute(SQLExecutionRDD.scala:55)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
        at org.apache.spark.scheduler.Task.run(Task.scala:131)
        at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
 
@NvTimLiu (Author)

cc @wbo4958

@trivialfis (Member)

@wbo4958 Could you please help work on some new tests when you are available?

@trivialfis (Member)

trivialfis commented Aug 22, 2023

In addition, if the Spark cluster implementation doesn't set the correct device ordinal, we might need to fix the PySpark predict function as well.

Edit: I think (Py)Spark doesn't support GPU-based inference. We should just set the device ordinal to CPU.

Note:
For PySpark, the `_transform` function calls the `predict` method of the sklearn estimator, which in turn calls `inplace_predict`. If the device ordinal of the data doesn't match the booster's, an additional DMatrix is created.

Unless a user hacks into the internals of pyspark-xgboost, the device ordinal of the booster returned from training is always 0.

This is not optimal, since on a multi-GPU machine we might create multiple DMatrix objects for inference. However, as far as I know, there's no plan to support GPU-based transform at the moment. Feel free to correct me if I'm wrong, @wbo4958.
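The predict path described in the note above can be modeled with a minimal sketch. This is plain Python with stand-in names, not the real pyspark-xgboost internals: a booster pinned to ordinal 0 serves `inplace_predict` directly when the data is on the same device, and otherwise falls back to building an extra DMatrix (a copy), which is the multi-GPU inefficiency mentioned above.

```python
# Illustrative sketch only -- Booster and inplace_predict here are
# stand-ins, NOT the actual XGBoost API.

class Booster:
    def __init__(self, device_ordinal=0):
        # After pyspark-xgboost training, the booster's device ordinal
        # is always 0 unless a user hacks into the internals.
        self.device_ordinal = device_ordinal

def inplace_predict(booster, data_device_ordinal):
    """Predict in place when the data already lives on the booster's
    device; otherwise fall back to creating an additional DMatrix."""
    if data_device_ordinal == booster.device_ordinal:
        return "inplace"
    # Device mismatch: an extra DMatrix (a copy) is built for prediction.
    return "extra_dmatrix"

booster = Booster()
print(inplace_predict(booster, 0))  # data on GPU 0 -> predicted in place
print(inplace_predict(booster, 3))  # data on GPU 3 -> extra DMatrix copy
```

The second call is the scenario behind the `device_ == device (0 vs. 3)` assertion in the log: data on GPU 3 meeting a booster left on device 0.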

@wbo4958 (Contributor)

wbo4958 commented Aug 23, 2023

It's true that xgboost-pyspark doesn't support GPU prediction, but xgboost-jvm does support GPU prediction, so it seems something is wrong there.

@trivialfis (Member)

> That's true xgboost-pyspark doesn't support gpu prediction

Then we need to make sure the booster is running on the CPU when transform is called.
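A minimal sketch of that guard, again with stand-in classes rather than the real API (in the Python package for XGBoost 2.0+, the real call would be roughly `booster.set_param({"device": "cpu"})`): move the booster to the CPU before `transform` predicts, so the device-ordinal check can never trip regardless of which GPU the executor holds.

```python
# Illustrative sketch only -- not the actual xgboost-pyspark/jvm code.

CPU = -1  # stand-in ordinal meaning "not pinned to any GPU"

class Booster:
    def __init__(self):
        self.device_ordinal = 0  # left on GPU 0 after training

    def set_device(self, ordinal):
        self.device_ordinal = ordinal

def transform(booster, executor_gpu_ordinal):
    # Guard: force the booster onto the CPU first, so prediction never
    # asserts on a mismatched device ordinal (0 vs. 3 in the log above).
    booster.set_device(CPU)
    return f"predicted on CPU (executor GPU was {executor_gpu_ordinal})"

b = Booster()
print(transform(b, 3))  # prints: predicted on CPU (executor GPU was 3)
```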

@NvTimLiu (Author)

Looks like this issue is not reproducible with the latest GPU SNAPSHOT jars over the weekend. Let's observe it for one or two more days; if it still doesn't reproduce, I'll close the issue. Thanks! cc @wbo4958

@NvTimLiu (Author)

Uploading driver and executor logs: dmlx-xgboost.tgz

@trivialfis mentioned this issue Sep 1, 2023
@trivialfis (Member)

Will close the issue if it's no longer reproducible.

@trivialfis (Member)

Feel free to re-open if there's new information.
