
Customizing NVIDIA instructions for reproducibility #7

Open

psyhtest opened this issue Nov 25, 2019 · 36 comments

@psyhtest

I'd like to generate and run TensorRT optimized plan files from the NVIDIA v0.5 submission on my (admittedly out-of-date) NVIDIA devices: Quadro M1000M, GeForce GTX1080 and Jetson TX1.

Unfortunately, following the instructions fails:

$ make generate_engines
...
Traceback (most recent call last):
  File "code/main.py", line 327, in <module>
    main()
  File "code/main.py", line 286, in main
    config_files = find_config_files(benchmarks, scenarios)
  File "/data/anton/inference_results_v0.5/closed/NVIDIA/code/common/__init__.py", line 123, in find_config_files
    system = get_system_id()
  File "/data/anton/inference_results_v0.5/closed/NVIDIA/code/common/__init__.py", line 117, in get_system_id
    raise RuntimeError("Cannot find valid configs for {:d}x {:}".format(count_actual, name))
RuntimeError: Cannot find valid configs for 1x GeForce GTX 1080
Makefile:298: recipe for target 'generate_engines' failed

With a naive intervention:

anton@diviniti:~/projects/mlperf/inference_results_v0.5/closed/NVIDIA$ git diff
diff --git a/closed/NVIDIA/code/common/__init__.py b/closed/NVIDIA/code/common/__init__.py
index 1ab6ee33..a9e07066 100644
--- a/closed/NVIDIA/code/common/__init__.py
+++ b/closed/NVIDIA/code/common/__init__.py
@@ -102,7 +102,10 @@ def get_system_id():
     import pycuda.driver
     import pycuda.autoinit
     name = pycuda.driver.Device(0).name()
+    print(name)
     count_actual = pycuda.driver.Device.count()
+    print(count_actual)
+    print(system_list)
     for system in system_list:
         if system[1] in name and count_actual == system[2]:
             return system[0]

I get a bit of clue about what's going on:

ck virtual env --tags=pycuda --shell_cmd="make generate_engines"
Quadro M1000M
1
[('T4x8', 'Tesla T4', 8), ('T4x20', 'Tesla T4', 20), ('TitanRTXx4', 'TITAN RTX', 4), ('Xavier', 'Xavier', 1)]

OK, I only have one Quadro M1000M GPU in my laptop, and it's clearly not in the list of systems NVIDIA used for their v0.5 submission. The same for the other two devices.

It appears, though, that I would have the same issue reproducing the Alibaba submission using only one Tesla T4, or the DellEMC submission using four Tesla T4s. It's ironic because both Alibaba and DellEMC refer to the NVIDIA submission for reproducibility.

Would it be possible to untangle the instructions to allow generating optimized TensorRT plans for other devices please?

@psyhtest

/cc @DilipSequeira @nvpohanh

@nvpohanh

The problem is that MLPerf runs require that you set a lot of parameters correctly first (like batch size, precision, target_qps, samples_per_query, etc.). It is not easy for MLPerf beginners to get VALID results without extensive tuning and an understanding of how MLPerf works.

In your case, you can add more system config files under measurements, change the parameters, and append your systems to system_list.py. Alibaba and DellEMC did the same thing. Nevertheless, it is not guaranteed to work, especially on old GPUs like Maxwell or Pascal.

@nvpohanh

One thing I do agree with, though, is that we could provide more instructions on how to reproduce Alibaba's and Dell's results.

@psyhtest

Thanks @nvpohanh. I agree that getting valid MLPerf results is not straightforward, and that your harness aims to address this for your submissions and those of your clients.

However, it seems to conflate two things: 1) generating optimized TensorRT plans, which requires parameters like the batch size and precision; 2) running the optimized plans, which requires LoadGen parameters like target_qps, samples_per_query, etc. That's fine for capturing the final result of the tuning, but it would be great to disentangle the two so that new devices can be added and tested quickly. A simple script per benchmark that takes as input the path to an original model, the batch size, the precision, etc., and outputs an optimized TensorRT plan for those parameters would be great.

For example, for MobileNet some interesting manipulations seem to take place (which probably result in higher accuracy than the vanilla ONNX-to-TensorRT conversion), but it's not obvious how one would run this piece of code in isolation from your test harness.
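
For illustration, something like the following is what I have in mind: a minimal sketch that builds a plan straight from an ONNX model with the TensorRT Python API, outside the harness. The function name and defaults are my own, not taken from the submission code, and the sketch skips the accuracy-preserving manipulations mentioned above; the batch size is assumed to be fixed in the ONNX input shape (dynamic shapes would need an optimization profile).

import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.INFO)

def build_plan(onnx_path, plan_path, precision="fp32", workspace_bytes=1 << 30):
    """Build a TensorRT plan file from an ONNX model at the requested precision."""
    explicit_batch = 1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
    with trt.Builder(TRT_LOGGER) as builder, \
         builder.create_network(explicit_batch) as network, \
         trt.OnnxParser(network, TRT_LOGGER) as parser:
        with open(onnx_path, "rb") as f:
            if not parser.parse(f.read()):
                errors = [str(parser.get_error(i)) for i in range(parser.num_errors)]
                raise RuntimeError("ONNX parsing failed: " + "; ".join(errors))
        config = builder.create_builder_config()
        config.max_workspace_size = workspace_bytes
        if precision == "fp16" and builder.platform_has_fast_fp16:
            config.set_flag(trt.BuilderFlag.FP16)
        elif precision == "int8" and builder.platform_has_fast_int8:
            config.set_flag(trt.BuilderFlag.INT8)  # would also need a calibrator or per-tensor ranges
        engine = builder.build_engine(network, config)
        if engine is None:
            raise RuntimeError("Engine build failed")
        with open(plan_path, "wb") as f:
            f.write(engine.serialize())

# e.g. build_plan("mobilenet.onnx", "mobilenet-fp16.plan", precision="fp16")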

@nvpohanh

@psyhtest I realized that we did provide a performance_tuning_guide.adoc that tells users how to tune the MLPerf parameters to get VALID results. The remaining missing part is how to add new systems to system_list.py. Do you agree?

Also, the MobileNet manipulations shouldn't depend on which GPU you use.

@psyhtest

psyhtest commented Nov 27, 2019

Also, the MobileNet manipulations shouldn't depend on which GPU you use.

Precisely my point! I assume the same is true for the other models, which is why it is a bit awkward that a full config must be provided just to run the conversion.

However, if you can provide a simple template that would make your harness accept e.g. Jetson TX1, that would be great.

@nvpohanh

nvpohanh commented Nov 27, 2019

Since I think your focus is on getting things to run rather than on high performance numbers, I would suggest the following:

desktop GPUs (Quadro M1000M or GeForce GTX1080)

  1. Append ("<system_id_1>", "M1000", 1), ("<system_id_2>", "1080", 1) to system_list.py (a sketch follows after this list).
  2. Copy the config files from one of the folders in measurements. I would suggest copying SingleStream or Offline for simplicity.
  3. Modify the following fields in "config.json":
  • input_dtype: set to "fp32"
  • input_format: set to "linear"
  • precision: set to "fp32"
  • use_graphs: set to false
  4. It should then run.
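
To make step 1 concrete, here is a sketch of what the appended entries might look like; the system_id strings are placeholders of your own choosing (they presumably also have to match the folder names you create under measurements/), and matching is a substring test against the pycuda device name plus an exact device count:

# code/common/system_list.py (sketch; existing v0.5 entries shown for context)
system_list = [
    ("T4x8", "Tesla T4", 8),
    ("T4x20", "Tesla T4", 20),
    ("TitanRTXx4", "TITAN RTX", 4),
    ("Xavier", "Xavier", 1),
    # hypothetical additions for the GPUs discussed above:
    ("M1000x1", "M1000", 1),    # substring match against "Quadro M1000M", 1 GPU
    ("GTX1080x1", "1080", 1),   # substring match against "GeForce GTX 1080", 1 GPU
]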

Jetson TX1

This is a more challenging one. Throughout our v0.5 submission, we assume Xavier when detecting an aarch64 system. I would try the following:

  1. Modify __init__.py#L100 so that it returns your system_id (a sketch follows after this list).
  2. Again, copy from the Xavier configs and make the same changes as above to set fp32 mode.
  3. For the SingleStream scenario, this should suffice to run the harness. For other scenarios, you need to pass --gpu_only in RUN_ARGS to turn off the DLA.
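
A rough sketch of the step-1 change; the structure below is illustrative only (the real get_system_id() in code/common/__init__.py differs), and "TX1" is a made-up system_id that would have to match your measurements/ folder:

import platform

def get_system_id():
    # v0.5 assumes any aarch64 machine is Xavier; short-circuit here for a TX1 instead.
    if platform.machine() == "aarch64":
        return "TX1"
    # ... the original pycuda-based detection of desktop/server GPUs continues below ...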

Could you try these? If it does not work, please let me know. Also, just a disclaimer that this won't give you the correct potential performance numbers, but it should at least be able to run. Thanks

@nvpohanh

Actually, there might be a much simpler approach: try passing --configs=<config_1.json>,<config_2.json>,<config_3.json> in RUN_ARGS to specify the configs explicitly, so that our code does not complain that it cannot find them.

@psyhtest

Thanks @nvpohanh, I'll give it a go! Just one correction: I think I still need to use quantization on all these GPUs. Any advice on how to set the config in that case?

@nvpohanh

Then you can try the original configs. I'm not sure whether INT8 works on the M1000, though.

@vilmara

vilmara commented Dec 2, 2019

Since I think you focus on making it run rather than high perf numbers, I would suggest the following:

desktop GPUs (Quadro M1000M or GeForce GTX1080)

  1. Append ("<system_id_1>", "M1000", 1), ("<system_id_2>", "1080", 1) to system_list.py
  2. Copy configs file from one of the folders in measurements. I would suggest that you copy SingleStream or Offline for simplicity

Hi @nvpohanh, thanks for the explanation. I have followed the steps above to successfully append my system with 1x Tesla T4. However, I am getting the error below with the command make generate_engines on Docker version 19.03:

Tracelog:

[TensorRT] ERROR: ../rtSafe/safeContext.cpp (110) - cuBLAS Error in initializeCommonContext: 1 (Could not initialize cublas, please check cuda installation.)
[TensorRT] ERROR: ../rtSafe/safeContext.cpp (110) - cuBLAS Error in initializeCommonContext: 1 (Could not initialize cublas, please check cuda installation.)
Process Process-3:
Traceback (most recent call last):
  File "/usr/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/usr/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/work/code/main.py", line 102, in handle_generate_engine
    b.build_engines()
  File "/work/code/common/builder.py", line 124, in build_engines
    buf = engine.serialize()
AttributeError: 'NoneType' object has no attribute 'serialize'
Traceback (most recent call last):
  File "code/main.py", line 327, in <module>
    main()
  File "code/main.py", line 317, in main
    launch_handle_generate_engine(benchmark_name, benchmark_conf, need_gpu, need_dla)
  File "code/main.py", line 80, in launch_handle_generate_engine
    raise RuntimeError("Building engines failed!")
RuntimeError: Building engines failed!
Makefile:298: recipe for target 'generate_engines' failed
make: *** [generate_engines] Error 1

@nvpohanh

nvpohanh commented Dec 2, 2019

I started seeing this error yesterday. I will investigate it and give you updates.

@nvpohanh

nvpohanh commented Dec 3, 2019

@vilmara This error seems to be gone today. Could you rebuild the Docker image (in particular, re-pull nvcr.io/nvidia/tensorrt:19.09-py3) and see if you still run into this issue? Thanks

@nvpohanh

nvpohanh commented Dec 4, 2019

Okay, I root-caused the issue. The newly published libcublas-dev is built for CUDA 10.2, so it breaks TRT.
Solution: remove libcublas-dev from docker/Dockerfile and it should work.

@vilmara

vilmara commented Dec 6, 2019

Hi @nvpohanh, it is working now, thank you!

@ens-lg4

ens-lg4 commented Feb 18, 2020

Dear @nvpohanh ,

Thank you for your advice about running the benchmarking suite on the GTX-1080 (via the Docker image) and the Jetson-TX1 (installing all of Xavier's dependencies directly on the board and patching Xavier->TX1). I believe I have followed all the instructions that you provided, and we managed to run the benchmarks for the image classification models (both ResNet50 and MobileNet).

However, the same settings and protocol did not work for ssd-small (SSD-MobileNet), and only led to a "Segmentation Fault" type of crash on both systems. It did not matter whether we used the int8/chw4 data format, int8/linear, or fp32/linear, AccuracyOnly or PerformanceOnly - the result was always roughly the same:

root@4331bf547160:/work# make run_harness RUN_ARGS="--benchmarks=ssd-small --scenarios=Offline --test_mode=PerformanceOnly"
[2020-02-17 19:02:15,140 main.py:291 INFO] Using config files: measurements/Velociti/ssd-small/Offline/config.json
[2020-02-17 19:02:15,140 __init__.py:142 INFO] Parsing config file measurements/Velociti/ssd-small/Offline/config.json ...
[2020-02-17 19:02:15,141 main.py:295 INFO] Processing config "Velociti_ssd-small_Offline"
[2020-02-17 19:02:15,141 main.py:111 INFO] Running harness for ssd-small benchmark in Offline scenario...
{'gpu_batch_size': 20, 'gpu_copy_streams': 1, 'gpu_inference_streams': 1, 'gpu_offline_expected_qps': 200, 'input_dtype': 'int8', 'input_format': 'chw4', 'map_path': 'data_maps/coco/val_map.txt', 'precision': 'int8', 'tensor_path': '${PREPROCESSED_DATA_DIR}/coco/val2017/SSDMobileNet/int8_chw4', 'use_graphs': False, 'system_id': 'Velociti', 'scenario': 'Offline', 'benchmark': 'ssd-small', 'config_name': 'Velociti_ssd-small_Offline', 'test_mode': 'PerformanceOnly', 'log_dir': '/work/build/logs/2020.02.17-19.02.14'}
[2020-02-17 19:02:15,178 __init__.py:42 INFO] Running command: ./build/bin/harness_default --plugins="build/plugins/NMSOptPlugin/libnmsoptplugin.so" --logfile_outdir="/work/build/logs/2020.02.17-19.02.14/Velociti/ssd-small/Offline" --logfile_prefix="mlperf_log_" --test_mode="PerformanceOnly" --gpu_copy_streams=1 --gpu_inference_streams=1 --use_graphs=false --gpu_batch_size=20 --map_path="data_maps/coco/val_map.txt" --tensor_path="${PREPROCESSED_DATA_DIR}/coco/val2017/SSDMobileNet/int8_chw4" --gpu_engines="./build/engines/Velociti/ssd-small/Offline/ssd-small-Offline-gpu-b20-int8.plan" --performance_sample_count=256 --max_dlas=0 --mlperf_conf_path="measurements/Velociti/ssd-small/Offline/mlperf.conf" --user_conf_path="measurements/Velociti/ssd-small/Offline/user.conf" --scenario Offline --model ssd-small --response_postprocess coco
&&&& RUNNING Default_Harness # ./build/bin/harness_default
[I] mlperf.conf path: measurements/Velociti/ssd-small/Offline/mlperf.conf
[I] user.conf path: measurements/Velociti/ssd-small/Offline/user.conf
[I] Device:0: ./build/engines/Velociti/ssd-small/Offline/ssd-small-Offline-gpu-b20-int8.plan has been successfully loaded.
[I] Creating batcher thread: 0 EnableBatcherThreadPerDevice: false
Starting warmup. Running for a minimum of 5 seconds.
Finished warmup. Ran for 5.07325s.
Segmentation fault (core dumped)
Traceback (most recent call last):
  File "code/main.py", line 327, in <module>
    main()
  File "code/main.py", line 319, in main
    handle_run_harness(benchmark_name, benchmark_conf, need_gpu, need_dla)
  File "code/main.py", line 141, in handle_run_harness
    result = harness.run_harness()
  File "/work/code/common/harness.py", line 240, in run_harness
    output = run_command(cmd, get_output=True)
  File "/work/code/common/__init__.py", line 58, in run_command
    raise subprocess.CalledProcessError(ret, cmd)
subprocess.CalledProcessError: Command './build/bin/harness_default --plugins="build/plugins/NMSOptPlugin/libnmsoptplugin.so" --logfile_outdir="/work/build/logs/2020.02.17-19.02.14/Velociti/ssd-small/Offline" --logfile_prefix="mlperf_log_" --test_mode="PerformanceOnly" --gpu_copy_streams=1 --gpu_inference_streams=1 --use_graphs=false --gpu_batch_size=20 --map_path="data_maps/coco/val_map.txt" --tensor_path="${PREPROCESSED_DATA_DIR}/coco/val2017/SSDMobileNet/int8_chw4" --gpu_engines="./build/engines/Velociti/ssd-small/Offline/ssd-small-Offline-gpu-b20-int8.plan" --performance_sample_count=256 --max_dlas=0 --mlperf_conf_path="measurements/Velociti/ssd-small/Offline/mlperf.conf" --user_conf_path="measurements/Velociti/ssd-small/Offline/user.conf" --scenario Offline --model ssd-small --response_postprocess coco' returned non-zero exit status 139.
Makefile:303: recipe for target 'run_harness' failed
make: *** [run_harness] Error 1

I tried to feed preprocessed image data directly to the TRT engine loaded from the .plan file, and noticed that although the program expects 701 floats per batch, of which the last (701st) float should convert into an int32 (since it is essentially the number of predicted boxes), I am consistently getting a non-integer there. Perhaps this could explain why the rest of the program crashes?
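
For reference, this is roughly the check I was doing on the raw output; it is a sketch under my own assumptions (the 100x7-boxes-plus-count layout and the field order are inferred from inspecting the tensor, not taken from NVIDIA's code, and whether the count is stored as raw int32 bits or as an integer-valued float is exactly what seems to go wrong):

import numpy as np

def decode_detection_output(raw):
    """raw: the 701 float32 values produced for one image by the NMS plugin output."""
    boxes = raw[:700].reshape(100, 7)                  # assumed layout: [image_id, label, score, x1, y1, x2, y2]
    keep_count = int(raw[700:701].view(np.int32)[0])   # reinterpret the last float's bits as an int32 box count
    return boxes[:keep_count], keep_count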

Does this situation ring any bells? Could I have missed anything?

Many thanks in advance.

@nvpohanh

nvpohanh commented Feb 18, 2020

@ens-lg4 One thing I can think of is the NMS plugin. Could you make sure that you build the corresponding SM version of the NMS plugin? See: https://github.com/mlperf/inference_results_v0.5/blob/master/closed/NVIDIA/code/plugin/NMSOptPlugin/CMakeLists.txt#L81

@ens-lg4

ens-lg4 commented Feb 18, 2020

@nvpohanh , thank you so much!

On the GTX-1080 I assumed the SM version to be 61, and after adding it to the list and recompiling, I can see two important changes: (1) the benchmark no longer crashes; it runs to the end and produces a credible result. (2) If I look at the output tensor coming off the engine, it is now properly structured, containing 701 floats with 100x7 boxes and the number of active boxes at the end.

On the Jetson-TX1 I assumed the SM version to be 53, added it, and recompiled, but nothing really changed, unfortunately. The output still looks broken, and the benchmark still segfaults. Did I not guess the SM version correctly? Or is it not possible to compile the NMS plugin for the TX1?

Many thanks again!

@ens-lg4

ens-lg4 commented Mar 3, 2020

Addition: I tried the same protocol on the Jetson-TX2 (having added SM version 62), but it crashed with a Segmentation Fault in the same way as the Jetson-TX1 did.

Can we assume that neither the TX1 nor the TX2 is supported?

@psyhtest

psyhtest commented Mar 4, 2020

It's worth adding that the TX1 and TX2 are both flashed with JetPack 4.3 (with TensorRT 6.0).

@psyhtest

psyhtest commented Mar 4, 2020

Also, I believe it happens for SSD-MobileNet-1 both with fp32 and fp16. We haven't tried SSD-ResNet yet.

@renganxu

renganxu commented Mar 7, 2020

Hi @nvpohanh, I had an error when running inference on a V100-SXM2-16GB. The following is the error. Do you have any idea why this error happened? I added my config to code/common/system_list.py and the corresponding files in measurements/V100-SXM2-16GBx4.

[2020-03-07 21:55:43,339 builder.py:119 INFO] Building ./build/engines/V100-SXM2-16GBx4/resnet/Offline/resnet-Offline-gpu-b32-int8.plan
[TensorRT] INFO: User provided dynamic ranges for all network tensors. Calibration skipped.
[TensorRT] ERROR: RES2_BR1_BR2C_1: could not find any supported formats consistent with input/output data types
[TensorRT] ERROR: ../builder/cudnnBuilderGraphNodes.cpp (539) - Misc Error in reportPluginError: 0 (could not find any supported formats consistent with input/output data types)
[TensorRT] ERROR: ../builder/cudnnBuilderGraphNodes.cpp (539) - Misc Error in reportPluginError: 0 (could not find any supported formats consistent with input/output data types)
Process Process-3:
Traceback (most recent call last):
  File "/usr/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/usr/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/work/code/main.py", line 102, in handle_generate_engine
    b.build_engines()
  File "/work/code/common/builder.py", line 124, in build_engines
    buf = engine.serialize()
AttributeError: 'NoneType' object has no attribute 'serialize'
Traceback (most recent call last):
  File "code/main.py", line 327, in <module>
    main()
  File "code/main.py", line 317, in main
    launch_handle_generate_engine(benchmark_name, benchmark_conf, need_gpu, need_dla)
  File "code/main.py", line 80, in launch_handle_generate_engine
    raise RuntimeError("Building engines failed!")
RuntimeError: Building engines failed!
Makefile:298: recipe for target 'generate_engines' failed

@vilmara

vilmara commented Mar 9, 2020

Hi @renganxu, I think this error is produced because the V100 doesn't support INT8 Tensor Cores. Have you tried FP16 Tensor Cores? Here is the list of supported precision modes per hardware: https://docs.nvidia.com/deeplearning/sdk/tensorrt-support-matrix/index.html#hardware-precision-matrix

(Screenshot of the hardware precision support matrix omitted.)

@nvpohanh

nvpohanh commented Mar 9, 2020

@vilmara is correct. Please use FP16 on V100.

@dagrayvid

@nvpohanh @vilmara Is there documentation on how to use FP16 in these benchmarks? Do we need to edit the preprocessing script?

@nvpohanh

@dagrayvid You can modify the config.json like this one: https://github.com/mlperf/inference_results_v0.5/blob/master/closed/NVIDIA/measurements/T4x8/resnet/Offline/config.json

For example, change input_dtype to fp32, precision to fp16, and the "int8_linear" in tensor_path to fp32.
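
A small sketch of that edit done programmatically; the path below is a hypothetical copy of the T4x8 config placed under your own system_id, and the three keys are the ones mentioned above:

import json

cfg_path = "measurements/V100-SXM2-16GBx4/resnet/Offline/config.json"  # assumed location of your copied config

with open(cfg_path) as f:
    cfg = json.load(f)

cfg["input_dtype"] = "fp32"                                             # feed unquantized inputs
cfg["precision"] = "fp16"                                               # build the TensorRT engine in FP16
cfg["tensor_path"] = cfg["tensor_path"].replace("int8_linear", "fp32")  # point at fp32 preprocessed tensors

with open(cfg_path, "w") as f:
    json.dump(cfg, f, indent=4)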

@nileshnegi

I am trying to get the NVIDIA implementations of these inference benchmarks running on my 8*V100-SXM2-32GB box. At the moment, I just want to get all benchmarks running.

As suggested earlier, I copied the T4x8 config files and altered them for V100 (input_dtype, precision, path to preprocessed data). With this I am able to run the Offline and SingleStream (wherever supported) scenarios for all 5 benchmarks.
However, with MultiStream and Server, the run_harness command hangs after the warmup step. There are processes (./build/bin/harness_default) spawned on all 8 GPUs, but no activity (GPU utilization is at 0 and the CPU process for harness_default is in the Sleep state).
Attached is the stdout:

[2020-05-06 10:07:13,373 main.py:291 INFO] Using config files: measurements/V100-SXM2-32GBx8/resnet/MultiStream/config.json
[2020-05-06 10:07:13,374 __init__.py:142 INFO] Parsing config file measurements/V100-SXM2-32GBx8/resnet/MultiStream/config.json ...
[2020-05-06 10:07:13,374 main.py:295 INFO] Processing config "V100-SXM2-32GBx8_resnet_MultiStream"
[2020-05-06 10:07:13,374 main.py:111 INFO] Running harness for resnet benchmark in MultiStream scenario...
[2020-05-06 10:07:13,375 harness.py:49 INFO] ===== Harness arguments for resnet =====
[2020-05-06 10:07:13,375 harness.py:51 INFO] deque_timeout_us=50
[2020-05-06 10:07:13,376 harness.py:51 INFO] gpu_batch_size=60
[2020-05-06 10:07:13,376 harness.py:51 INFO] gpu_multi_stream_samples_per_query=1920
[2020-05-06 10:07:13,376 harness.py:51 INFO] input_dtype=fp32
[2020-05-06 10:07:13,376 harness.py:51 INFO] input_format=
[2020-05-06 10:07:13,376 harness.py:51 INFO] map_path=data_maps/imagenet/val_map.txt
[2020-05-06 10:07:13,376 harness.py:51 INFO] precision=fp16
[2020-05-06 10:07:13,376 harness.py:51 INFO] tensor_path=${PREPROCESSED_DATA_DIR}/imagenet/ResNet50/fp32
[2020-05-06 10:07:13,376 harness.py:51 INFO] use_batcher_thread_per_device=True
[2020-05-06 10:07:13,376 harness.py:51 INFO] use_graphs=False
[2020-05-06 10:07:13,376 harness.py:51 INFO] system_id=V100-SXM2-32GBx8
[2020-05-06 10:07:13,376 harness.py:51 INFO] scenario=MultiStream
[2020-05-06 10:07:13,376 harness.py:51 INFO] benchmark=resnet
[2020-05-06 10:07:13,376 harness.py:51 INFO] config_name=V100-SXM2-32GBx8_resnet_MultiStream
[2020-05-06 10:07:13,376 harness.py:51 INFO] verbose=True
[2020-05-06 10:07:13,376 harness.py:51 INFO] test_mode=PerformanceOnly
[2020-05-06 10:07:13,376 harness.py:51 INFO] log_dir=/work/build/logs/2020.05.06-10.07.12
[2020-05-06 10:07:13,376 __init__.py:42 INFO] Running command: ./build/bin/harness_default --verbose=true --logfile_outdir="/work/build/logs/2020.05.06-10.07.12/V100-SXM2-32GBx8/resnet/MultiStream" --logfile_prefix="mlperf_log_" --test_mode="PerformanceOnly" --deque_timeout_us=50 --use_batcher_thread_per_device=true --use_graphs=false --gpu_batch_size=60 --map_path="data_maps/imagenet/val_map.txt" --tensor_path="${PREPROCESSED_DATA_DIR}/imagenet/ResNet50/fp32" --gpu_engines="./build/engines/V100-SXM2-32GBx8/resnet/MultiStream/resnet-MultiStream-gpu-b60-fp16.plan" --performance_sample_count=1024 --max_dlas=0 --mlperf_conf_path="measurements/V100-SXM2-32GBx8/resnet/MultiStream/mlperf.conf" --user_conf_path="measurements/V100-SXM2-32GBx8/resnet/MultiStream/user.conf" --scenario MultiStream --model resnet
{'deque_timeout_us': 50, 'gpu_batch_size': 60, 'gpu_multi_stream_samples_per_query': 1920, 'input_dtype': 'fp32', 'input_format': '', 'map_path': 'data_maps/imagenet/val_map.txt', 'precision': 'fp16', 'tensor_path': '${PREPROCESSED_DATA_DIR}/imagenet/ResNet50/fp32', 'use_batcher_thread_per_device': True, 'use_graphs': False, 'system_id': 'V100-SXM2-32GBx8', 'scenario': 'MultiStream', 'benchmark': 'resnet', 'config_name': 'V100-SXM2-32GBx8_resnet_MultiStream', 'verbose': True, 'test_mode': 'PerformanceOnly', 'log_dir': '/work/build/logs/2020.05.06-10.07.12'}
&&&& RUNNING Default_Harness # ./build/bin/harness_default
[I] mlperf.conf path: measurements/V100-SXM2-32GBx8/resnet/MultiStream/mlperf.conf
[I] user.conf path: measurements/V100-SXM2-32GBx8/resnet/MultiStream/user.conf
[V] [TRT] Plugin Creator registration succeeded - GridAnchor_TRT
[V] [TRT] Plugin Creator registration succeeded - NMS_TRT
[V] [TRT] Plugin Creator registration succeeded - Reorg_TRT
[V] [TRT] Plugin Creator registration succeeded - Region_TRT
[V] [TRT] Plugin Creator registration succeeded - Clip_TRT
[V] [TRT] Plugin Creator registration succeeded - LReLU_TRT
[V] [TRT] Plugin Creator registration succeeded - PriorBox_TRT
[V] [TRT] Plugin Creator registration succeeded - Normalize_TRT
[V] [TRT] Plugin Creator registration succeeded - RPROI_TRT
[V] [TRT] Plugin Creator registration succeeded - BatchedNMS_TRT
[V] [TRT] Plugin Creator registration succeeded - FlattenConcat_TRT
[V] [TRT] Deserialize required 1402082 microseconds.
[I] Device:0: ./build/engines/V100-SXM2-32GBx8/resnet/MultiStream/resnet-MultiStream-gpu-b60-fp16.plan has been successfully loaded.
[V] [TRT] Deserialize required 2031154 microseconds.
[I] Device:1: ./build/engines/V100-SXM2-32GBx8/resnet/MultiStream/resnet-MultiStream-gpu-b60-fp16.plan has been successfully loaded.
[V] [TRT] Deserialize required 2071939 microseconds.
[I] Device:2: ./build/engines/V100-SXM2-32GBx8/resnet/MultiStream/resnet-MultiStream-gpu-b60-fp16.plan has been successfully loaded.
[V] [TRT] Deserialize required 2275912 microseconds.
[I] Device:3: ./build/engines/V100-SXM2-32GBx8/resnet/MultiStream/resnet-MultiStream-gpu-b60-fp16.plan has been successfully loaded.
[V] [TRT] Deserialize required 2556367 microseconds.
[I] Device:4: ./build/engines/V100-SXM2-32GBx8/resnet/MultiStream/resnet-MultiStream-gpu-b60-fp16.plan has been successfully loaded.
[V] [TRT] Deserialize required 2710989 microseconds.
[I] Device:5: ./build/engines/V100-SXM2-32GBx8/resnet/MultiStream/resnet-MultiStream-gpu-b60-fp16.plan has been successfully loaded.
[V] [TRT] Deserialize required 2994004 microseconds.
[I] Device:6: ./build/engines/V100-SXM2-32GBx8/resnet/MultiStream/resnet-MultiStream-gpu-b60-fp16.plan has been successfully loaded.
[V] [TRT] Deserialize required 3267479 microseconds.
[I] Device:7: ./build/engines/V100-SXM2-32GBx8/resnet/MultiStream/resnet-MultiStream-gpu-b60-fp16.plan has been successfully loaded.
[I] Creating batcher thread: 0 EnableBatcherThreadPerDevice: true
[I] Creating batcher thread: 1 EnableBatcherThreadPerDevice: true
[I] Creating batcher thread: 2 EnableBatcherThreadPerDevice: true
[I] Creating batcher thread: 3 EnableBatcherThreadPerDevice: true
[I] Creating batcher thread: 4 EnableBatcherThreadPerDevice: true
[I] Creating batcher thread: 5 EnableBatcherThreadPerDevice: true
[I] Creating batcher thread: 6 EnableBatcherThreadPerDevice: true
[I] Creating batcher thread: 7 EnableBatcherThreadPerDevice: true
Starting warmup. Running for a minimum of 5 seconds.
Finished warmup. Ran for 5.16286s.

How can I debug this?

@nvpohanh

nvpohanh commented May 6, 2020

@nileshnegi For Server, please set server_target_qps to some very low number (like 1) to see if the program can finish without issue.

For MultiStream, please set multi_stream_samples_per_query to a low number (like 1) and set min_query_count to 1. By default, MultiStream will run for at least 4 hours.

@sbillus

sbillus commented Jun 19, 2020

Hi, I am also trying to run the benchmark on a Tesla V100, and based on the suggestions above I have created the configuration file.
But when I run the test, it fails after warmup. It looks like it is unable to find a file that is not supposed to be in the ImageNet validation set anyway.
Below is the console output:

[I] mlperf.conf path: measurements/ml2/resnet/Offline/mlperf.conf
[I] user.conf path: measurements/ml2/resnet/Offline/user.conf
[I] Device:0: ./build/engines/ml2/resnet/Offline/resnet-Offline-gpu-b256-fp16.plan has been successfully loaded.
[I] Device:1: ./build/engines/ml2/resnet/Offline/resnet-Offline-gpu-b256-fp16.plan has been successfully loaded.
[I] Device:2: ./build/engines/ml2/resnet/Offline/resnet-Offline-gpu-b256-fp16.plan has been successfully loaded.
[I] Creating batcher thread: 0 EnableBatcherThreadPerDevice: false
Starting warmup. Running for a minimum of 5 seconds.
Finished warmup. Ran for 5.59176s.
F0619 01:22:14.912485 5454 qsl.hpp:145] Check failed: fs Unable to open: /work/build/preprocessed_data/imagenet/ResNet50/fp32/ILSVRC2012_val_00027229.JPEG.npy
*** Check failure stack trace: ***
@ 0x7f3e75626362 google::LogMessage::Fail()
@ 0x7f3e756262aa google::LogMessage::SendToLog()
@ 0x7f3e75625beb google::LogMessage::Flush()
@ 0x7f3e75629066 google::LogMessageFatal::~LogMessageFatal()
@ 0x55b98e06339f qsl::SampleLibrary::LoadSamplesToRam()
@ 0x55b98e0f57b3 mlperf::loadgen::RunPerformanceMode<>()
@ 0x55b98e0d5512 mlperf::StartTest()
@ 0x55b98e05de95 doInference()
@ 0x55b98e05b88f main
@ 0x7f3e6739fb6b __libc_start_main
@ 0x55b98e05bf9a start
@ (nil) (unknown)
Aborted (core dumped)
Traceback (most recent call last):
  File "code/main.py", line 327, in <module>
    main()
  File "code/main.py", line 319, in main
    handle_run_harness(benchmark_name, benchmark_conf, need_gpu, need_dla)
  File "code/main.py", line 141, in handle_run_harness
    result = harness.run_harness()
  File "/work/code/common/harness.py", line 240, in run_harness
    output = run_command(cmd, get_output=True)
  File "/work/code/common/__init__.py", line 58, in run_command
    raise subprocess.CalledProcessError(ret, cmd)
subprocess.CalledProcessError: Command './build/bin/harness_default --logfile_outdir="/work/build/logs/2020.06.19-01.20.50/ml2/resnet/Offline" --logfile_prefix="mlperf_log_" --gpu_copy_streams=4 --gpu_inference_streams=1 --use_graphs=false --gpu_batch_size=256 --map_path="data_maps/imagenet/val_map.txt" --tensor_path="${PREPROCESSED_DATA_DIR}/imagenet/ResNet50/fp32" --gpu_engines="./build/engines/ml2/resnet/Offline/resnet-Offline-gpu-b256-fp16.plan" --performance_sample_count=1024 --max_dlas=0 --mlperf_conf_path="measurements/ml2/resnet/Offline/mlperf.conf" --user_conf_path="measurements/ml2/resnet/Offline/user.conf" --scenario Offline --model resnet' returned non-zero exit status 134.
Makefile:303: recipe for target 'run_harness' failed
make[1]: *** [run_harness] Error 1
make[1]: Leaving directory '/work'
Makefile:292: recipe for target 'run' failed
make: *** [run] Error 2

I checked: there is no ILSVRC2012_val_00027229.JPEG in the validation set, so why is the benchmark looking for it?

@nvpohanh

@sbillus
Copy link

sbillus commented Jun 19, 2020

Yes, I did. I tried running the preprocessing again just to be sure, and this time all the files already existed. I also checked my val_map.txt file and it doesn't contain ILSVRC2012_val_00027229.JPEG either. I have no idea, then, why it is looking for it.
The engine is generated successfully and the warmup runs fine too. It just fails when it cannot find this image.

@nvpohanh

Ah, you are running on V100! This is not our intended submission platform.

Nevertheless, you can hack the preprocessing script by adding fp32 to the format list here: https://github.com/mlperf/inference_results_v0.5/blob/master/closed/NVIDIA/scripts/preprocess_data.py#L300
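
To illustrate the intent of that hack (the list name below is made up; check the actual code around the linked line in scripts/preprocess_data.py):

# scripts/preprocess_data.py, illustrative only -- the real variable name and location may differ
formats = ["int8_linear", "int8_chw4", "fp32"]   # append "fp32" so fp32 .npy tensors are also generated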

@sbillus

sbillus commented Jun 19, 2020

It worked! Thank you for the help!
I had thought that the fp32 format subdirectory already existed for ImageNet after preprocessing, and since the tensor path to it was already defined in the custom config file, I assumed I didn't need any further tweaks anywhere else to run the benchmark.
(Although I still don't know why it was looking for that image in the case of a format mismatch.)
It makes me wonder whether there is a way to add "fp32" to the COCO dataset preprocessing as well, to benchmark the V100 on the object detection task too.
I will try that and update others in the thread.
EDIT: It works the same way for the COCO dataset too. I just tried the ssd-large benchmark after preprocessing the COCO dataset with the same hack to the preprocessing script suggested above, and it worked!

@rakshithvasudev

Hello @nvpohanh,

I have a very similar issue to the one reported above by nileshnegi. It seems that the resnet Server scenario freezes for some reason. I've modified the config to run on V100 as mentioned above, and also tried setting server_target_qps=1, still with no luck. Any pointers to fix this issue would be appreciated.


root@node002:/work# make run_harness RUN_ARGS="--benchmarks=resnet --scenarios=Server --test_mode=PerformanceOnly"
[2020-07-16 03:11:04,380 main.py:291 INFO] Using config files: measurements/V100-SXM2-32GBx4/resnet/Server/config.json
[2020-07-16 03:11:04,380 __init__.py:142 INFO] Parsing config file measurements/V100-SXM2-32GBx4/resnet/Server/config.json ...
[2020-07-16 03:11:04,380 main.py:295 INFO] Processing config "V100-SXM2-32GBx4_resnet_Server"
[2020-07-16 03:11:04,380 main.py:111 INFO] Running harness for resnet benchmark in Server scenario...
{'active_sms': 50, 'deque_timeout_us': 2000, 'gpu_batch_size': 80, 'gpu_copy_streams': 8, 'gpu_inference_streams': 2, 'input_dtype': 'fp32', 'input_format': 'linear', 'map_path': 'data_maps/imagenet/val_map.txt', 'precision': 'fp16', 'server_target_qps': 1, 'tensor_path': '${PREPROCESSED_DATA_DIR}/imagenet/ResNet50/fp32', 'use_cuda_thread_per_device': True, 'use_deque_limit': True, 'use_graphs': False, 'system_id': 'V100-SXM2-32GBx4', 'scenario': 'Server', 'benchmark': 'resnet', 'config_name': 'V100-SXM2-32GBx4_resnet_Server', 'test_mode': 'PerformanceOnly', 'log_dir': '/work/build/logs/2020.07.16-03.11.03'}
[2020-07-16 03:11:04,382 __init__.py:42 INFO] Running command: LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.2 ./build/bin/harness_default --logfile_outdir="/work/build/logs/2020.07.16-03.11.03/V100-SXM2-32GBx4/resnet/Server" --logfile_prefix="mlperf_log_" --test_mode="PerformanceOnly" --gpu_copy_streams=8 --gpu_inference_streams=2 --use_deque_limit=true --deque_timeout_us=2000 --use_cuda_thread_per_device=true --use_graphs=false --gpu_batch_size=80 --map_path="data_maps/imagenet/val_map.txt" --tensor_path="${PREPROCESSED_DATA_DIR}/imagenet/ResNet50/fp32" --gpu_engines="./build/engines/V100-SXM2-32GBx4/resnet/Server/resnet-Server-gpu-b80-fp16.plan" --performance_sample_count=1024 --max_dlas=0 --mlperf_conf_path="measurements/V100-SXM2-32GBx4/resnet/Server/mlperf.conf" --user_conf_path="measurements/V100-SXM2-32GBx4/resnet/Server/user.conf" --scenario Server --model resnet
&&&& RUNNING Default_Harness # ./build/bin/harness_default
[I] mlperf.conf path: measurements/V100-SXM2-32GBx4/resnet/Server/mlperf.conf
[I] user.conf path: measurements/V100-SXM2-32GBx4/resnet/Server/user.conf
[W] [TRT] Using an engine plan file across different models of devices is not recommended and is likely to affect performance or even cause errors.
[I] Device:0: ./build/engines/V100-SXM2-32GBx4/resnet/Server/resnet-Server-gpu-b80-fp16.plan has been successfully loaded.
[W] [TRT] Using an engine plan file across different models of devices is not recommended and is likely to affect performance or even cause errors.
[I] Device:1: ./build/engines/V100-SXM2-32GBx4/resnet/Server/resnet-Server-gpu-b80-fp16.plan has been successfully loaded.
[W] [TRT] Using an engine plan file across different models of devices is not recommended and is likely to affect performance or even cause errors.
[I] Device:2: ./build/engines/V100-SXM2-32GBx4/resnet/Server/resnet-Server-gpu-b80-fp16.plan has been successfully loaded.
[W] [TRT] Using an engine plan file across different models of devices is not recommended and is likely to affect performance or even cause errors.
[I] Device:3: ./build/engines/V100-SXM2-32GBx4/resnet/Server/resnet-Server-gpu-b80-fp16.plan has been successfully loaded.
[I] Creating batcher thread: 0 EnableBatcherThreadPerDevice: false
[I] Creating cuda thread: 0
[I] Creating cuda thread: 1
[I] Creating cuda thread: 2
[I] Creating cuda thread: 3
Starting warmup. Running for a minimum of 5 seconds.
Finished warmup. Ran for 5.36912s.

@nvpohanh

@rakshithvasudev Please also add --min_query_count=1. By default, LoadGen runs for at least 270k queries, so at target_qps=1, that's three days...

@rakshithvasudev

@nvpohanh it worked with --min_query_count=1. Thanks!
