
train with caffe Vitis-AI GPU fail #691

Closed · mhanuel26 opened this issue Mar 3, 2022 · 21 comments
@mhanuel26 commented Mar 3, 2022

Hi,

I am getting the following issue while training cf_refinedet_coco_360_480_0.96_5.08G_2.0:

(vitis-ai-caffe) Vitis-AI /workspace/models/AI-Model-Zoo/cf_refinedet_coco_360_480_0.96_5.08G_2.0/code/train > bash train.sh 
../../../caffe-xilinx/build/tools/caffe.bin does not exist, try use path in pre-build docker
F0303 10:14:08.370003   394 gpu_memory.cpp:171] Check failed: error == cudaSuccess (10 vs. 0)  invalid device ordinal
*** Check failure stack trace: ***
    @     0x7ff0e4aaf4dd  google::LogMessage::Fail()
    @     0x7ff0e4ab7071  google::LogMessage::SendToLog()
    @     0x7ff0e4aaeecd  google::LogMessage::Flush()
    @     0x7ff0e4ab076a  google::LogMessageFatal::~LogMessageFatal()
    @     0x7ff0e3760145  caffe::GPUMemory::Manager::update_dev_info()
    @     0x7ff0e37606bf  caffe::GPUMemory::Manager::init()
    @     0x55a72c9920ed  train()
    @     0x55a72c98ba59  main
    @     0x7ff0e1ceac87  __libc_start_main
    @     0x55a72c98c6a8  (unknown)
train.sh: line 37:   394 Aborted                 (core dumped) $exec_path "$@"

Here is the output of nvidia-smi

mhanuel@mhanuel-MSI:~$ nvidia-smi
Thu Mar  3 10:15:15 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03    Driver Version: 510.47.03    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  On   | 00000000:01:00.0  On |                  N/A |
|  0%   36C    P8    24W / 170W |    386MiB / 12288MiB |      1%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      7372      G   /usr/lib/xorg/Xorg                 35MiB |
|    0   N/A  N/A      7851      G   /usr/lib/xorg/Xorg                235MiB |
|    0   N/A  N/A      7976      G   /usr/bin/gnome-shell               40MiB |
|    0   N/A  N/A      8471      G   ...520405909793494209,131072       23MiB |
|    0   N/A  N/A    180023      G   ...AAAAAAAAA= --shared-files       39MiB |
+-----------------------------------------------------------------------------+

What could I be missing?
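For reference, the "invalid device ordinal" check in gpu_memory.cpp usually means the -gpu list passed to caffe names a GPU index that does not exist on this machine. A minimal way to list the visible ordinals (assuming nvidia-smi is available inside the container):

# List GPUs and their indices; the -gpu argument given to caffe
# must only reference indices printed here.
nvidia-smi -L
# e.g. (hypothetical output): GPU 0: NVIDIA GeForce RTX 3060 (UUID: GPU-...)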

@mhanuel26 (Author)

Hi there. After changing the gpu parameter to 0, like this:

caffe_exec caffe train -solver solver.prototxt -gpu 0 -weights $PRETRAIN_WEIGHTS

Caffe training was able to start, but later I got another error down the line, as shown below:

I0303 10:56:51.229724   428 solver.cpp:341] Solving 
I0303 10:56:51.229728   428 solver.cpp:342] Learning Rate Policy: multistep
I0303 10:56:51.230501   428 blocking_queue.cpp:50] Data layer prefetch queue empty
F0303 10:56:52.181033   428 math_functions.cu:27] Check failed: status == CUBLAS_STATUS_SUCCESS (13 vs. 0)  CUBLAS_STATUS_EXECUTION_FAILED
*** Check failure stack trace: ***
    @     0x7f1b3730e4dd  google::LogMessage::Fail()
    @     0x7f1b37316071  google::LogMessage::SendToLog()
    @     0x7f1b3730decd  google::LogMessage::Flush()
    @     0x7f1b3730f76a  google::LogMessageFatal::~LogMessageFatal()
    @     0x7f1b360c93da  caffe::caffe_gpu_gemm<>()
    @     0x7f1b35c985c9  caffe::BaseConvolutionLayer<>::backward_gpu_gemm()
    @     0x7f1b36041a0e  caffe::DeconvolutionLayer<>::Forward_gpu()
    @     0x7f1b35edb442  caffe::Net<>::ForwardFromTo()
    @     0x7f1b35edb537  caffe::Net<>::Forward()
    @     0x7f1b35f4ab64  caffe::Solver<>::Step()
    @     0x7f1b35f4b6b1  caffe::Solver<>::Solve()
    @     0x5606e36d95ce  train()
    @     0x5606e36d2a59  main
    @     0x7f1b34549c87  __libc_start_main
    @     0x5606e36d36a8  (unknown)
train.sh: line 37:   428 Aborted                 (core dumped) $exec_path "$@"

Does anyone have any idea?

@wangxd-xlnx (Contributor)

Hi @mhanuel26

Could you provide your GPU model? This error seems to be related to the GPU model and CUDA version.

Besides that, did you change the GPU id to '0' in train.sh? You could give that a try.

@mhanuel26 (Author)

Hi @wangxd-xlnx

Yes, I changed the GPU id to 0 and training actually starts (before that I was getting an error very early on); it is only after some time that it throws that error.

I am working with an NVIDIA RTX 3060. I am not in front of the PC right now to give you the driver version (I will post it later), but I am using CUDA 11.6 on Ubuntu 20.04. I have been training ssd_mobilenet_v2 and it hasn't failed, though it is really slow compared to the example.
I have tested some metrics with PyTorch built from source and it looks like the NVIDIA tools are set up correctly.

What else can I do to debug?

Thanks,

@mhanuel26 (Author)

Hi @wangxd-xlnx,

Here is the output of nvidia-smi, where you can see the driver version and CUDA version:

Wed Mar  9 14:15:49 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03    Driver Version: 510.47.03    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  On   | 00000000:01:00.0  On |                  N/A |
| 45%   49C    P2    80W / 170W |   9267MiB / 12288MiB |     99%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1122      G   /usr/lib/xorg/Xorg                482MiB |
|    0   N/A  N/A      1465      G   /usr/bin/gnome-shell              154MiB |
|    0   N/A  N/A      2162      G   ...095877176430821094,131072      114MiB |
|    0   N/A  N/A      3933      G   ...AAAAAAAAA= --shared-files       74MiB |
|    0   N/A  N/A      5169      C   caffe                            8435MiB |
+-----------------------------------------------------------------------------+

The command I am running in train.sh is:

caffe_exec caffe train -solver solver.prototxt -gpu 0 -weights $PRETRAIN_WEIGHTS

The first part of the log looks normal:

../../../caffe-xilinx/build/tools/caffe.bin does not exist, try use path in pre-build docker
I0308 22:14:45.251152   296 gpu_memory.cpp:53] GPUMemory::Manager initialized with Caching (CUB) GPU Allocator
I0308 22:14:45.251293   296 gpu_memory.cpp:55] Total memory: 12628656128, Free: 2896297984, dev_info[0]: total=12628656128 free=2896297984
I0308 22:14:45.251365   296 caffe.cpp:213] Using GPUs 0
I0308 22:14:45.251403   296 caffe.cpp:218] GPU 0: NVIDIA GeForce RTX 3060
I0308 22:28:25.298084   296 solver.cpp:51] Initializing solver from parameters: 
test_iter: 10000
test_interval: 2000
base_lr: 1e-06
display: 500
max_iter: 64000
lr_policy: "multistep"
gamma: 0.1
momentum: 0.9
weight_decay: 0.0005
snapshot: 10000
snapshot_prefix: "./snapshot/refinedet"
solver_mode: GPU
device_id: 0
net: "../../float/trainval.prototxt"
test_initialization: false
stepvalue: 32000
stepvalue: 48000
type: "SGD"
ap_version: "11point"
eval_type: "detection"
I0308 22:28:25.505329   296 solver.cpp:99] Creating training net from net file: ../../float/trainval.prototxt
I0308 22:28:25.637007   296 net.cpp:323] The NetState phase (0) differed from the phase (1) specified by a rule in layer data
I0308 22:28:25.637063   296 net.cpp:323] The NetState phase (0) differed from the phase (1) specified by a rule in layer odm_conf_reshape
I0308 22:28:25.637068   296 net.cpp:323] The NetState phase (0) differed from the phase (1) specified by a rule in layer odm_conf_softmax
I0308 22:28:25.637070   296 net.cpp:323] The NetState phase (0) differed from the phase (1) specified by a rule in layer odm_conf_flatten
I0308 22:28:25.637073   296 net.cpp:323] The NetState phase (0) differed from the phase (1) specified by a rule in layer detection_out
I0308 22:28:25.637076   296 net.cpp:323] The NetState phase (0) differed from the phase (1) specified by a rule in layer detection_eval
I0308 22:28:25.637080   296 net.cpp:52] Initializing net from parameters: 
state {

How can I debug this further?

Thanks,

@wangxd-xlnx (Contributor)

Hi @mhanuel26

Please note that if you are able to use the Vitis-AI docker, you just need to activate the conda env 'vitis-ai-caffe'; you don't need to compile the caffe-xilinx source code manually. Building it yourself is only intended for environments where using docker is inconvenient.

caffe-xilinx is precompiled in the conda env 'vitis-ai-caffe', so you can use it directly. Please have another try: exit and re-run the docker, but don't compile caffe-xilinx.

PATH: /opt/vitis_ai/conda/envs/vitis-ai-caffe/bin/
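A minimal sketch of how to confirm the prebuilt binary is the one being picked up (paths follow the conda layout shown above):

conda activate vitis-ai-caffe
# The caffe binary should resolve to the prebuilt one inside the conda env,
# not to a locally compiled caffe-xilinx build:
which caffe
# expected: /opt/vitis_ai/conda/envs/vitis-ai-caffe/bin/caffe

# Optional sanity check that the prebuilt binary can talk to the GPU:
caffe device_query -gpu 0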


@mhanuel26 (Author)

Hi @wangxd-xlnx ,

This is not the same problem. I was able to follow the SSD example for mobilenet_v2 with the VOC dataset and it works correctly under Caffe; the only problem is that it runs very slowly on my RTX 3060 GPU.
The example is
https://github.com/Xilinx/Vitis-AI-Tutorials/blob/master/Design_Tutorials/14-caffe-ssd-pascal/README.md
The guide mentions it takes 6 hours using two GTX 1080 Ti cards; I have only one RTX 3060 and it is taking 120 hours. This is clearly not a sustainable environment.

I0306 09:03:54.903728 81519 solver.cpp:270] Iteration 100 (0.273932 iter/s, 365.054s/100 iter), loss = 14.7898, remaining 121 hours and 33 minutes
I0306 09:03:54.903770 81519 solver.cpp:291]     Train net output #0: mbox_loss = 14.9056 (* 1 = 14.9056 loss)
I0306 09:03:54.903776 81519 sgd_solver.cpp:106] Iteration 100, lr = 0.001

This issue #691 is more related to some mathematical operation of the model under the caffe-xilinx branch.

I was wondering whether there might be some improvement if caffe-xilinx were built on the host.

I should probably open another issue to discuss that topic, but let me know what you think.

Thanks,

@mhanuel26 (Author) commented Mar 13, 2022

Hi @wangxd-xlnx,

I got a similar error when working with the dogs vs. cats design example. Here is the console output (first part omitted).

I0313 19:55:00.769707  9878 net.cpp:284] Network initialization done.
I0313 19:55:00.769755  9878 solver.cpp:63] Solver scaffolding done.
I0313 19:55:00.770098  9878 caffe.cpp:247] Starting Optimization
I0313 19:55:00.770102  9878 solver.cpp:341] Solving alexnetBNnoLRN m2 (as m3 but less DROP and less BN)
I0313 19:55:00.770103  9878 solver.cpp:342] Learning Rate Policy: step
I0313 19:55:00.770746  9878 solver.cpp:424] Iteration 0, Testing net (#0)
I0313 19:55:01.232285  9878 solver.cpp:523]     Test net output #0: accuracy = 0.5
I0313 19:55:01.232316  9878 solver.cpp:523]     Test net output #1: loss = 0.693147 (* 1 = 0.693147 loss)
I0313 19:55:01.232318  9878 solver.cpp:523]     Test net output #2: top-1 = 0.5
F0313 19:55:01.267323  9878 math_functions.cu:27] Check failed: status == CUBLAS_STATUS_SUCCESS (13 vs. 0)  CUBLAS_STATUS_EXECUTION_FAILED
*** Check failure stack trace: ***
    @     0x7f953c7664dd  google::LogMessage::Fail()
    @     0x7f953c76e071  google::LogMessage::SendToLog()
    @     0x7f953c765ecd  google::LogMessage::Flush()
    @     0x7f953c76776a  google::LogMessageFatal::~LogMessageFatal()
    @     0x7f953b5213da  caffe::caffe_gpu_gemm<>()
    @     0x7f953b4c764c  caffe::InnerProductLayer<>::Backward_gpu()
    @     0x7f953b33db03  caffe::Net<>::BackwardFromTo()
    @     0x7f953b33dc5f  caffe::Net<>::Backward()
    @     0x7f953b3a2b6c  caffe::Solver<>::Step()
    @     0x7f953b3a36b1  caffe::Solver<>::Solve()
    @     0x560a580365ce  train()
    @     0x560a5802fa59  main
    @     0x7f95399a1c87  __libc_start_main
    @     0x560a580306a8  (unknown)

It looks very reproducible. Do you have any suggestions on how to debug this further?

The command I ran was:

source caffe/caffe_flow_AlexNet.sh 2>&1 | tee log/logfile_caffe_${CNN}.txt

There was in fact another error before:

F0313 19:29:36.654079  9802 math_functions.cu:27] Check failed: status == CUBLAS_STATUS_SUCCESS (13 vs. 0)  CUBLAS_STATUS_EXECUTION_FAILED
*** Check failure stack trace: ***
    @     0x7f7be88824dd  google::LogMessage::Fail()
    @     0x7f7be888a071  google::LogMessage::SendToLog()
    @     0x7f7be8881ecd  google::LogMessage::Flush()
    @     0x7f7be888376a  google::LogMessageFatal::~LogMessageFatal()
    @     0x7f7be763d3da  caffe::caffe_gpu_gemm<>()
    @     0x7f7be75e364c  caffe::InnerProductLayer<>::Backward_gpu()
    @     0x7f7be7459b03  caffe::Net<>::BackwardFromTo()
    @     0x7f7be7459c5f  caffe::Net<>::Backward()
    @     0x7f7be74beb6c  caffe::Solver<>::Step()
    @     0x7f7be74bf6b1  caffe::Solver<>::Solve()
    @     0x555bcbe385ce  train()
    @     0x555bcbe31a59  main
    @     0x7f7be5abdc87  __libc_start_main
    @     0x555bcbe326a8  (unknown)
Aborted (core dumped)
TRAINING WITH CAFFE


Elapsed time for Caffe training (s):  776.044833


PLOT LEARNING CURVERS (METHOD1)
  File "/workspace/tutorials/caffe-xilinx/tools/extra/parse_log.py", line 166
    print 'Wrote %s' % output_filename
                   ^
SyntaxError: Missing parentheses in call to 'print'. Did you mean print('Wrote %s' % output_filename)?
Traceback (most recent call last):
  File "/workspace/tutorials/VAI-Caffe-ML-CATSvsDOGS/files/caffe/code/5_plot_learning_curve.py", line 56, in <module>
    train_log = pd.read_csv(train_log_path, sep=",")
  File "/opt/vitis_ai/conda/envs/vitis-ai-caffe/lib/python3.6/site-packages/pandas/io/parsers.py", line 685, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/opt/vitis_ai/conda/envs/vitis-ai-caffe/lib/python3.6/site-packages/pandas/io/parsers.py", line 457, in _read
    parser = TextFileReader(fp_or_buf, **kwds)
  File "/opt/vitis_ai/conda/envs/vitis-ai-caffe/lib/python3.6/site-packages/pandas/io/parsers.py", line 895, in __init__
    self._make_engine(self.engine)
  File "/opt/vitis_ai/conda/envs/vitis-ai-caffe/lib/python3.6/site-packages/pandas/io/parsers.py", line 1135, in _make_engine
    self._engine = CParserWrapper(self.f, **self.options)
  File "/opt/vitis_ai/conda/envs/vitis-ai-caffe/lib/python3.6/site-packages/pandas/io/parsers.py", line 1917, in __init__
    self._reader = parsers.TextReader(src, **kwds)
  File "pandas/_libs/parsers.pyx", line 382, in pandas._libs.parsers.TextReader.__cinit__
  File "pandas/_libs/parsers.pyx", line 689, in pandas._libs.parsers.TextReader._setup_parser_source
FileNotFoundError: [Errno 2] File b'/workspace/tutorials/VAI-Caffe-ML-CATSvsDOGS/files/caffe/models/alexnetBNnoLRN/m2/logfile_2_alexnetBNnoLRN.log.train' does not exist: b'/workspace/tutorials/VAI-Caffe-ML-CATSvsDOGS/files/caffe/models/alexnetBNnoLRN/m2/logfile_2_alexnetBNnoLRN.log.train'
COMPUTE PREDICTIONS
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0313 19:43:02.439126  9861 gpu_memory.cpp:53] GPUMemory::Manager initialized with Caching (CUB) GPU Allocator
I0313 19:43:02.439143  9861 gpu_memory.cpp:55] Total memory: 12628656128, Free: 11200626688, dev_info[0]: total=12628656128 free=11200626688
Traceback (most recent call last):
  File "/workspace/tutorials/VAI-Caffe-ML-CATSvsDOGS/files/caffe/code/6_make_predictions.py", line 147, in <module>
    net = caffe.Net(caffe_description, caffe_model, caffe.TEST)

@wangxd-xlnx (Contributor)

Hi @mhanuel26

OK, thanks for your feedback.
If your main problem is the operating efficiency of some tutorials and design examples, my advice is to raise a new issue for that.
So do the train scripts run successfully now?

@mhanuel26 (Author)

Hi @wangxd-xlnx,

As you can see in my last comment, I am getting errors just getting Caffe to work on my box. I wish this were only about efficiency; I have been trying different Caffe examples and none of them work at all.

Is there any way to debug this? If you were to do it, how would you approach it?
Thanks,

@mhanuel26 (Author)

Hi @wangxd-xlnx,

I found something that might be related to the issue. I checked the caffe environment and the cudnn and cudatoolkit packages seem to be fairly old versions, at least compared with the tensorflow2 environment. The tensorflow environment has the same versions as caffe. See below:

(vitis-ai-caffe) Vitis-AI /workspace/models/AI-Model-Zoo > conda list cudnn
# packages in environment at /opt/vitis_ai/conda/envs/vitis-ai-caffe:
#
# Name                    Version                   Build  Channel
cudnn                     7.6.5.32             ha8d7eb6_1    conda-forge
(vitis-ai-caffe) Vitis-AI /workspace/models/AI-Model-Zoo > conda list cudatoolkit
# packages in environment at /opt/vitis_ai/conda/envs/vitis-ai-caffe:
#
# Name                    Version                   Build  Channel
cudatoolkit               10.0.130            hf841e97_10    conda-forge
(vitis-ai-caffe) Vitis-AI /workspace/models/AI-Model-Zoo > conda deactivate
(base) Vitis-AI /workspace/models/AI-Model-Zoo > conda activate vitis-ai-tensorflow 
(vitis-ai-tensorflow) Vitis-AI /workspace/models/AI-Model-Zoo > conda list cudatoolkit
# packages in environment at /opt/vitis_ai/conda/envs/vitis-ai-tensorflow:
#
# Name                    Version                   Build  Channel
cudatoolkit               10.0.130            hf841e97_10    conda-forge
(vitis-ai-tensorflow) Vitis-AI /workspace/models/AI-Model-Zoo > conda list cudnn
# packages in environment at /opt/vitis_ai/conda/envs/vitis-ai-tensorflow:
#
# Name                    Version                   Build  Channel
cudnn                     7.6.5.32             ha8d7eb6_1    conda-forge

Coincidentally, I haven't been able to successfully run a single Caffe or TensorFlow example, but I was able to successfully run a TensorFlow2 example. Here is the same output for the TensorFlow2 environment:

(vitis-ai-tensorflow2) Vitis-AI /workspace/08-tf2_flow/files > conda list cudnn
# packages in environment at /opt/vitis_ai/conda/envs/vitis-ai-tensorflow2:
#
# Name                    Version                   Build  Channel
cudnn                     8.2.1.32             h86fa8c9_0    conda-forge
(vitis-ai-tensorflow2) Vitis-AI /workspace/08-tf2_flow/files > conda list cudatoolkit
# packages in environment at /opt/vitis_ai/conda/envs/vitis-ai-tensorflow2:
#
# Name                    Version                   Build  Channel
cudatoolkit               11.5.1              hcf5317a_10    conda-forge

Do you know how I can build a docker image that uses CUDA 11.5 or 11.6 instead of 10, with the latest or a newer cuDNN version?
Or maybe I can upgrade those packages?

@hanxue (Contributor) commented Mar 15, 2022

Hi @mhanuel26 ,

The nvidia-smi output does not necessarily reflect the actual NVIDIA driver installed on the host machine. On your host machine, can you please list your NVIDIA driver:

$ apt list --installed|grep nvidia-driver

WARNING: apt does not have a stable CLI interface. Use with caution in scripts.

nvidia-driver-470/bionic-security,now 470.86-0ubuntu0.18.04.1 amd64 [installed,u
pgradable to: 470.103.01-0ubuntu1]

You can also use this command to check the NVIDIA driver version

$ modinfo nvidia | head                                     
filename:       /lib/modules/4.15.0-163-generic/updates/dkms/nvidia.ko
firmware:       nvidia/470.86/gsp.bin
alias:          char-major-195-*
version:        470.86
supported:      external

On your host, run docker info and make sure there is an nvidia runtime:

$ docker info
Client:
 Context:    default
 Debug Mode: false
...

Server:
 Containers: 0
  Running: 0
...
...
 Runtimes: nvidia runc io.containerd.runc.v2 io.containerd.runtime.v1.linux
 Default Runtime: runc

@hanxue (Contributor) commented Mar 15, 2022

If your docker setup does not have the nvidia runtime, follow the instructions at https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html#setting-up-nvidia-container-toolkit to install the nvidia-docker2 package and restart Docker.
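For Ubuntu hosts, the install boils down to roughly the following (a sketch; the repository setup step is covered in the linked guide and is omitted here):

# Install the nvidia container runtime and restart Docker:
sudo apt-get update
sudo apt-get install -y nvidia-docker2
sudo systemctl restart docker

# Verify that the nvidia runtime is now registered:
docker info | grep -i runtimes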

@mhanuel26 (Author)

Hi @hanxue, @wangxd-xlnx,

My docker setup did not have the nvidia runtime, so I installed it. Here are the outputs:

mhanuel@mhanuel-MSI:/usr/local/cuda-11.6$ sudo apt list --installed|grep nvidia-driver
[sudo] password for mhanuel: 

WARNING: apt does not have a stable CLI interface. Use with caution in scripts.

nvidia-driver-510/unknown,now 510.47.03-0ubuntu1 amd64 [installed,automatic]
mhanuel@mhanuel-MSI:/usr/local/cuda-11.6$ modinfo nvidia | head 
filename:       /lib/modules/5.13.0-30-generic/updates/dkms/nvidia.ko
firmware:       nvidia/510.47.03/gsp.bin
alias:          char-major-195-*
version:        510.47.03
supported:      external
license:        NVIDIA
srcversion:     AA3CDC718104247365A30A7
alias:          pci:v000010DEd*sv*sd*bc06sc80i00*
alias:          pci:v000010DEd*sv*sd*bc03sc02i00*
alias:          pci:v000010DEd*sv*sd*bc03sc00i00*

mhanuel@mhanuel-MSI:/usr/local/cuda-11.6$ docker info
Client:
 Context:    default
 Debug Mode: false
 Plugins:
  app: Docker App (Docker Inc., v0.9.1-beta3)
  buildx: Docker Buildx (Docker Inc., v0.7.1-docker)
  scan: Docker Scan (Docker Inc., v0.12.0)

Server:
 Containers: 4
  Running: 2
  Paused: 0
  Stopped: 2
 Images: 64
 Server Version: 20.10.12
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Native Overlay Diff: true
  userxattr: false
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Cgroup Version: 1
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: io.containerd.runc.v2 io.containerd.runtime.v1.linux runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 9cc61520f4cd876b86e77edfeb88fbcd536d1f9d
 runc version: v1.0.3-0-gf46b6ba
 init version: de40ad0
 Security Options:
  apparmor
  seccomp
   Profile: default
 Kernel Version: 5.13.0-30-generic
 Operating System: Ubuntu 20.04.4 LTS
 OSType: linux
 Architecture: x86_64
 CPUs: 12
 Total Memory: 31.24GiB
 Name: mhanuel-MSI
 ID: GJRH:AC3I:NI3U:FW6I:TPL5:QOTQ:3DS4:AQK6:UR3A:KDPG:IYY6:NUXI
 Docker Root Dir: /home/mhanuel/devel/DockerImg
 Debug Mode: false
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false

After installing nvidia-docker2 and restarting docker, it shows the nvidia runtime:

mhanuel@mhanuel-MSI:~$ docker info
Client:
 Context:    default
 Debug Mode: false
 Plugins:
  app: Docker App (Docker Inc., v0.9.1-beta3)
  buildx: Docker Buildx (Docker Inc., v0.8.0-docker)
  scan: Docker Scan (Docker Inc., v0.17.0)

Server:
 Containers: 3
  Running: 1
  Paused: 0
  Stopped: 2
 Images: 64
 Server Version: 20.10.13
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Native Overlay Diff: true
  userxattr: false
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Cgroup Version: 1
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: io.containerd.runc.v2 io.containerd.runtime.v1.linux nvidia runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 2a1d4dbdb2a1030dc5b01e96fb110a9d9f150ecc
 runc version: v1.0.3-0-gf46b6ba
 init version: de40ad0
 Security Options:
  apparmor
  seccomp
   Profile: default
 Kernel Version: 5.13.0-30-generic
 Operating System: Ubuntu 20.04.4 LTS
 OSType: linux
 Architecture: x86_64
 CPUs: 12
 Total Memory: 31.24GiB
 Name: mhanuel-MSI
 ID: GJRH:AC3I:NI3U:FW6I:TPL5:QOTQ:3DS4:AQK6:UR3A:KDPG:IYY6:NUXI
 Docker Root Dir: /home/mhanuel/devel/DockerImg
 Debug Mode: false
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false

But the Vitis-AI Model Zoo model still does not work. Here is the console output (a few lines at the start, then the last few lines):

Vitis-AI /workspace/models/AI-Model-Zoo/cf_refinedet_coco_360_480_0.96_5.08G_2.0/code/train > conda activate vitis-ai-caffe
(vitis-ai-caffe) Vitis-AI /workspace/models/AI-Model-Zoo/cf_refinedet_coco_360_480_0.96_5.08G_2.0/code/train > bash train.sh 
../../../caffe-xilinx/build/tools/caffe.bin does not exist, try use path in pre-build docker
I0315 07:08:35.892222   197 gpu_memory.cpp:53] GPUMemory::Manager initialized with Caching (CUB) GPU Allocator
I0315 07:08:35.892367   197 gpu_memory.cpp:55] Total memory: 12628656128, Free: 11178934272, dev_info[0]: total=12628656128 free=11178934272
I0315 07:08:35.892458   197 caffe.cpp:213] Using GPUs 0
I0315 07:08:35.892503   197 caffe.cpp:218] GPU 0: NVIDIA GeForce RTX 3060
I0315 07:21:54.371585   197 solver.cpp:51] Initializing solver from parameters: 
test_iter: 10000


....

....

I0315 07:21:54.654903   197 caffe.cpp:247] Starting Optimization
I0315 07:21:54.654907   197 solver.cpp:341] Solving 
I0315 07:21:54.654909   197 solver.cpp:342] Learning Rate Policy: multistep
I0315 07:21:54.655547   197 blocking_queue.cpp:50] Data layer prefetch queue empty
F0315 07:21:55.471717   197 math_functions.cu:27] Check failed: status == CUBLAS_STATUS_SUCCESS (13 vs. 0)  CUBLAS_STATUS_EXECUTION_FAILED
*** Check failure stack trace: ***
    @     0x7fb56a5e04dd  google::LogMessage::Fail()
    @     0x7fb56a5e8071  google::LogMessage::SendToLog()
    @     0x7fb56a5dfecd  google::LogMessage::Flush()
    @     0x7fb56a5e176a  google::LogMessageFatal::~LogMessageFatal()
    @     0x7fb56939b3da  caffe::caffe_gpu_gemm<>()
    @     0x7fb568f6a5c9  caffe::BaseConvolutionLayer<>::backward_gpu_gemm()
    @     0x7fb569313a0e  caffe::DeconvolutionLayer<>::Forward_gpu()
    @     0x7fb5691ad442  caffe::Net<>::ForwardFromTo()
    @     0x7fb5691ad537  caffe::Net<>::Forward()
    @     0x7fb56921cb64  caffe::Solver<>::Step()
    @     0x7fb56921d6b1  caffe::Solver<>::Solve()
    @     0x561e3d6a75ce  train()
    @     0x561e3d6a0a59  main
    @     0x7fb56781bc87  __libc_start_main
    @     0x561e3d6a16a8  (unknown)
train.sh: line 37:   197 Aborted                 (core dumped) $exec_path "$@"


How can I debug this?

@wangxd-xlnx (Contributor)

Hi @mhanuel26

We have re-analyzed your operation process, and we can provide a solution to the train.sh problem.

Please follow these steps exactly (a command sketch is given after the list):

  1. Pull the docker image and run vitis-ai-gpu:latest.
  2. Activate the conda env vitis-ai-caffe.
  3. Open models/AI-Model-Zoo; don't compile caffe-xilinx, or delete caffe-xilinx first (it is recommended to delete it directly, since you may already have compiled it).
  4. Open cf_refinedet_coco_360_480_0.96_5.08G_2.0 and follow the steps in the readme, including placing the coco2014 dataset and running these two data processing scripts in order:
     convert_coco2voc_like.py
     create_data.py
  5. bash train.sh

Then it will run successfully.
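A rough command-line sketch of the steps above (assuming the docker_run.sh launcher from the Vitis-AI repo root; the location of the two data scripts under code/gen_data is an assumption, so follow the model's readme for exact paths):

# 1-2. start the GPU docker from the Vitis-AI repo root and activate the caffe env
./docker_run.sh xilinx/vitis-ai-gpu:latest
conda activate vitis-ai-caffe

# 3. make sure no locally compiled caffe-xilinx shadows the prebuilt binary
cd /workspace/models/AI-Model-Zoo
rm -rf caffe-xilinx        # or rename it out of the way

# 4. prepare the coco2014 data as described in the model readme
cd cf_refinedet_coco_360_480_0.96_5.08G_2.0/code/gen_data   # assumed script location
python convert_coco2voc_like.py
python create_data.py

# 5. train
cd ../train
bash train.sh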

[screenshot: refinedet training running successfully]

@mhanuel26 (Author)

Hi @wangxd-xlnx , @hanxue ,

That did NOT work. The data generation is OK and I followed your steps exactly. I haven't compiled caffe-xilinx; in fact, you can see that it is using the pre-built docker binary, as shown below:

../../../caffe-xilinx/build/tools/caffe.bin does not exist, try use path in pre-build docker

Vitis-AI /workspace/models/AI-Model-Zoo/cf_refinedet_coco_360_480_0.96_5.08G_2.0/code/train > conda activate vitis-ai-caffe
(vitis-ai-caffe) Vitis-AI /workspace/models/AI-Model-Zoo/cf_refinedet_coco_360_480_0.96_5.08G_2.0/code/train > bash train.sh 
../../../caffe-xilinx/build/tools/caffe.bin does not exist, try use path in pre-build docker
I0315 11:00:36.127033   103 gpu_memory.cpp:53] GPUMemory::Manager initialized with Caching (CUB) GPU Allocator
I0315 11:00:36.127187   103 gpu_memory.cpp:55] Total memory: 12628656128, Free: 10601627648, dev_info[0]: total=12628656128 free=10601627648
I0315 11:00:36.127307   103 caffe.cpp:213] Using GPUs 0
I0315 11:00:36.127347   103 caffe.cpp:218] GPU 0: NVIDIA GeForce RTX 3060

I have renamed the caffe-xilinx folder in the meantime.

The data directory after running those scripts looks like this:

(vitis-ai-caffe) Vitis-AI /workspace/models/AI-Model-Zoo/cf_refinedet_coco_360_480_0.96_5.08G_2.0 > tree -d ./
./
├── code
│   ├── gen_data
│   ├── test
│   └── train
│       └── snapshot
├── data
│   ├── coco2014
│   │   ├── Annotations
│   │   └── Images
│   ├── coco2014_lmdb
│   │   ├── train2014_lmdb
│   │   └── val2014_lmdb
│   └── link_480_360
│       ├── train2014_lmdb -> ../../data/coco2014_lmdb/train2014_lmdb
│       └── val2014_lmdb -> ../../data/coco2014_lmdb/val2014_lmdb
├── float
└── quantized

17 directories

train.sh still fails.

Do you suggest creating a new docker image?

Is there a way I can get the cudnn and cuda-runtime versions updated to match the tensorflow2 conda environment?

Thanks,

@hanxue (Contributor) commented Mar 15, 2022

Hi @mhanuel26 ,

I noticed that you are using a GeForce RTX 3060. The RTX 3060 uses the Ampere architecture and requires at least CUDA 11.0. Unfortunately, Caffe can only be built with CUDA 10.0 and is not compatible with CUDA 11.0.

Is there a chance that you can try with another NVIDIA GPU?
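A quick way to see the compute capability involved and why this combination fails (the compute_cap query field needs a fairly recent nvidia-smi; the PyTorch line assumes the source build mentioned earlier in the thread):

# Ampere (RTX 30xx) cards report compute capability 8.6; CUDA 10.0 can only
# target up to compute capability 7.5 (Turing), so kernels built with it
# cannot run on this GPU.
nvidia-smi --query-gpu=name,compute_cap --format=csv

# Alternative check using PyTorch:
python -c "import torch; print(torch.cuda.get_device_capability(0))"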

@mhanuel26 (Author)

Hi @hanxue , @wangxd-xlnx ,

That probably explains everything.
Not at the moment.

Thanks,

@mhanuel26 reopened this Mar 15, 2022
@mhanuel26 (Author)

Hi @hanxue , @wangxd-xlnx ,

I was looking for the documentation about the compatibility issue you mentioned, but I cannot easily find it.

Could you point me to it, please, before closing this?

Thanks,

@mhanuel26 (Author)

I found this patch; it looks possible to use it with some modifications.

I know we can disable cuDNN for Caffe, so maybe it is a matter of looking at CUDA 11 support for Caffe.
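For anyone attempting that route, cuDNN in Caffe is controlled by the USE_CUDNN switch in Makefile.config; a minimal sketch, assuming caffe-xilinx keeps the BVLC-style Makefile.config.example:

cd /workspace/models/AI-Model-Zoo/caffe-xilinx    # assumed checkout location
cp Makefile.config.example Makefile.config
# Build without cuDNN: make sure USE_CUDNN stays commented out
# (it is commented by default in the BVLC-style example config).
sed -i 's/^USE_CUDNN/# USE_CUDNN/' Makefile.config
make -j"$(nproc)" && make pycaffe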

@hanxue (Contributor) commented Mar 16, 2022

Hi @mhanuel26 ,

This page shows that the RTX 3060 requires at least CUDA 11.1: https://arnon.dk/matching-sm-architectures-arch-and-gencode-for-various-nvidia-cards/

It is not as simple to figure this out directly from the Ampere GPU Architecture Compatibility Guide and the CUDA Compatibility index.
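The practical consequence for a CUDA 11.x build attempt would be updating Caffe's CUDA_ARCH list to include Ampere, roughly like this (a sketch against the BVLC-style Makefile.config, not a verified caffe-xilinx configuration):

# Append an Ampere-capable arch list; the last CUDA_ARCH assignment wins.
# nvcc 11.x also rejects the old compute_30/compute_35 entries, so they
# must not remain in the list that gets used.
cat >> Makefile.config <<'EOF'
CUDA_ARCH := -gencode arch=compute_75,code=sm_75 \
             -gencode arch=compute_86,code=sm_86 \
             -gencode arch=compute_86,code=compute_86
EOF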

@ofekp commented Jul 26, 2023

Just in case it helps someone: as suggested by @hanxue, switching to a different GPU (a 20-series card, which does not use the Ampere architecture) solved this issue for me. Thank you so much, @hanxue.
