
2.1.0-SNAPSHOT GPU jars train FAILED on Check failed: device_ == device (0 vs. 3) #9510

Closed
NvTimLiu opened this issue Aug 22, 2023 · 9 comments

Comments

@NvTimLiu

Using the 2.1.0-SNAPSHOT GPU jars, training failed with `src/common/host_device_vector.cu:166: Check failed: device_ == device (0 vs. 3) : New device ordinal is different from previous one.`

Driver log:
driver-log.txt

 00:06:15 INFO BlockManagerInfo: Added broadcast_12_piece0 in memory on 127.0.0.1:35395 (size: 30.0 KiB, free: 16.9 GiB)
  00:06:18 WARN TaskSetManager: Lost task 0.0 in stage 6.0 (TID 9) (127.0.0.1 executor 3): ml.dmlc.xgboost4j.java.XGBoostError: [00:06:18] /workspace/src/common/host_device_vector.cu:166: Check failed: device_ == device (0 vs. 3) : New device ordinal is different from previous one.
 Stack trace:
   [bt] (0) /raid/tmp/libxgboost4j3852614352633587535.so(+0xa0573a) [0x7eee14bb173a]
   [bt] (1) /raid/tmp/libxgboost4j3852614352633587535.so(xgboost::HostDeviceVectorImpl<unsigned long>::SetDevice(int)+0x1af) [0x7eee14bcf1cf]
   [bt] (2) /raid/tmp/libxgboost4j3852614352633587535.so(xgboost::predictor::GPUPredictor::DevicePredictInternal(xgboost::DMatrix*, xgboost::HostDeviceVector<float>*, xgboost::gbm::GBTreeModel const&, unsigned long, unsigned long) const+0x425) [0x7eee14ed51e5]
   [bt] (3) /raid/tmp/libxgboost4j3852614352633587535.so(xgboost::predictor::GPUPredictor::PredictBatch(xgboost::DMatrix*, xgboost::PredictionCacheEntry*, xgboost::gbm::GBTreeModel const&, unsigned int, unsigned int) const+0x117) [0x7eee14ed63b7]
   [bt] (4) /raid/tmp/libxgboost4j3852614352633587535.so(xgboost::gbm::GBTree::PredictBatchImpl(xgboost::DMatrix*, xgboost::PredictionCacheEntry*, bool, int, int) const+0x299) [0x7eee148628d9]
   [bt] (5) /raid/tmp/libxgboost4j3852614352633587535.so(xgboost::LearnerImpl::Predict(std::shared_ptr<xgboost::DMatrix>, bool, xgboost::HostDeviceVector<float>*, int, int, bool, bool, bool, bool, bool)+0x286) [0x7eee148cba86]
   [bt] (6) /raid/tmp/libxgboost4j3852614352633587535.so(XGBoosterPredict+0xee) [0x7eee1457dfce]
   [bt] (7) /raid/tmp/libxgboost4j3852614352633587535.so(Java_ml_dmlc_xgboost4j_java_XGBoostJNI_XGBoosterPredict+0x2c) [0x7eee1454db5c]
   [bt] (8) [0x7efda10183e7]
 
 
        at ml.dmlc.xgboost4j.java.XGBoostJNI.checkCall(XGBoostJNI.java:48)
        at ml.dmlc.xgboost4j.java.Booster.predict(Booster.java:350)
        at ml.dmlc.xgboost4j.java.Booster.predict(Booster.java:419)
        at ml.dmlc.xgboost4j.scala.Booster.predict(Booster.scala:172)
        at ml.dmlc.xgboost4j.scala.spark.XGBoostClassificationModel.producePredictionItrs(XGBoostClassifier.scala:342)
        at ml.dmlc.xgboost4j.scala.rapids.spark.GpuPreXGBoost$.$anonfun$transformDataset$1(GpuPreXGBoost.scala:198)
        at ml.dmlc.xgboost4j.scala.rapids.spark.GpuPreXGBoost$$anon$1.$anonfun$loadNextBatch$2(GpuPreXGBoost.scala:340)
        at ml.dmlc.xgboost4j.scala.rapids.spark.GpuPreXGBoost$.withResource(GpuPreXGBoost.scala:597)
        at ml.dmlc.xgboost4j.scala.rapids.spark.GpuPreXGBoost$$anon$1.loadNextBatch(GpuPreXGBoost.scala:320)
        at ml.dmlc.xgboost4j.scala.rapids.spark.GpuPreXGBoost$$anon$1.hasNext(GpuPreXGBoost.scala:358)
        at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
        at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
        at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
        at org.apache.spark.sql.execution.columnar.DefaultCachedBatchSerializer$$anon$1.hasNext(InMemoryRelation.scala:118)
        at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
        at org.apache.spark.storage.memory.MemoryStore.putIterator(MemoryStore.scala:221)
        at org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:299)
        at org.apache.spark.storage.BlockManager.$anonfun$doPutIterator$1(BlockManager.scala:1423)
        at org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$doPut(BlockManager.scala:1350)
        at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1414)
        at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:1237)
        at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:384)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:335)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
        at org.apache.spark.sql.execution.SQLExecutionRDD.compute(SQLExecutionRDD.scala:55)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
        at org.apache.spark.scheduler.Task.run(Task.scala:131)
        at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
 
@NvTimLiu (Author)

cc @wbo4958

@trivialfis (Member)

@wbo4958 Could you please help work on some new tests when you are available?

@trivialfis (Member)

trivialfis commented Aug 22, 2023

In addition, if the Spark cluster implementation doesn't set the correct device ordinal, we might need to fix the PySpark predict function as well.

Edit: I think (Py)Spark doesn't support GPU-based inference. We should just set the device ordinal to CPU.

Note:
For PySpark, the `_transform` function calls the `predict` method of the sklearn estimator, which in turn calls `inplace_predict`. If the device ordinal of the data doesn't match the booster's, an additional DMatrix is created.

Unless a user hacks into the internals of pyspark-xgboost, the device ordinal of the booster returned from training is always 0.

This is not optimal, since on a multi-GPU machine we might create multiple DMatrix objects for inference. However, as far as I know, there's no plan to support GPU-based transform at the moment. Feel free to correct me if I'm wrong, @wbo4958.
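The predict path described in the note above can be modeled with a minimal sketch. This is plain Python with stand-in names, not the real pyspark-xgboost internals: a booster pinned to ordinal 0 serves `inplace_predict` directly when the data is on the same device, and otherwise falls back to building an extra DMatrix (a copy), which is the multi-GPU inefficiency mentioned above.

```python
# Illustrative sketch only -- Booster and inplace_predict here are
# stand-ins, NOT the actual XGBoost API.

class Booster:
    def __init__(self, device_ordinal=0):
        # After pyspark-xgboost training, the booster's device ordinal
        # is always 0 unless a user hacks into the internals.
        self.device_ordinal = device_ordinal

def inplace_predict(booster, data_device_ordinal):
    """Predict in place when the data already lives on the booster's
    device; otherwise fall back to creating an additional DMatrix."""
    if data_device_ordinal == booster.device_ordinal:
        return "inplace"
    # Device mismatch: an extra DMatrix (a copy) is built for prediction.
    return "extra_dmatrix"

booster = Booster()
print(inplace_predict(booster, 0))  # data on GPU 0 -> predicted in place
print(inplace_predict(booster, 3))  # data on GPU 3 -> extra DMatrix copy
```

The second call is the scenario behind the `device_ == device (0 vs. 3)` assertion in the log: data on GPU 3 meeting a booster left on device 0.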

@wbo4958 (Contributor)

wbo4958 commented Aug 23, 2023

It's true that xgboost-pyspark doesn't support GPU prediction, but xgboost-jvm does support GPU prediction, so it seems something is wrong there.

@trivialfis (Member)

> That's true xgboost-pyspark doesn't support gpu prediction

Then we need to make sure the booster is running on the CPU when transform is called.
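A minimal sketch of that guard, again with stand-in classes rather than the real API (in the Python package for XGBoost 2.0+, the real call would be roughly `booster.set_param({"device": "cpu"})`): move the booster to the CPU before `transform` predicts, so the device-ordinal check can never trip regardless of which GPU the executor holds.

```python
# Illustrative sketch only -- not the actual xgboost-pyspark/jvm code.

CPU = -1  # stand-in ordinal meaning "not pinned to any GPU"

class Booster:
    def __init__(self):
        self.device_ordinal = 0  # left on GPU 0 after training

    def set_device(self, ordinal):
        self.device_ordinal = ordinal

def transform(booster, executor_gpu_ordinal):
    # Guard: force the booster onto the CPU first, so prediction never
    # asserts on a mismatched device ordinal (0 vs. 3 in the log above).
    booster.set_device(CPU)
    return f"predicted on CPU (executor GPU was {executor_gpu_ordinal})"

b = Booster()
print(transform(b, 3))  # prints: predicted on CPU (executor GPU was 3)
```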

@NvTimLiu (Author)

Looks like this issue is not reproducible with the latest GPU SNAPSHOT jars over the weekend. Let's observe it for one or two more days; if it still doesn't reproduce, I'll close the issue. Thanks! cc @wbo4958

@NvTimLiu (Author)

Uploading driver and executor logs: dmlx-xgboost.tgz

@trivialfis mentioned this issue Sep 1, 2023
@trivialfis (Member)

Will close the issue if it's no longer reproducible.

@trivialfis (Member)

Feel free to re-open if there's new information.
