
Customizing NVIDIA instructions for reproducibility #7

Open

psyhtest opened this issue Nov 25, 2019 · 36 comments

@psyhtest

I'd like to generate and run TensorRT optimized plan files from the NVIDIA v0.5 submission on my (admittedly out-of-date) NVIDIA devices: Quadro M1000M, GeForce GTX1080 and Jetson TX1.

Unfortunately, following the instructions fails:

$ make generate_engines
...
Traceback (most recent call last):
  File "code/main.py", line 327, in <module>
    main()
  File "code/main.py", line 286, in main
    config_files = find_config_files(benchmarks, scenarios)
  File "/data/anton/inference_results_v0.5/closed/NVIDIA/code/common/__init__.py", line 123, in find_config_files
    system = get_system_id()
  File "/data/anton/inference_results_v0.5/closed/NVIDIA/code/common/__init__.py", line 117, in get_system_id
    raise RuntimeError("Cannot find valid configs for {:d}x {:}".format(count_actual, name))
RuntimeError: Cannot find valid configs for 1x GeForce GTX 1080
Makefile:298: recipe for target 'generate_engines' failed

With a naive intervention:

anton@diviniti:~/projects/mlperf/inference_results_v0.5/closed/NVIDIA$ git diff
diff --git a/closed/NVIDIA/code/common/__init__.py b/closed/NVIDIA/code/common/__init__.py
index 1ab6ee33..a9e07066 100644
--- a/closed/NVIDIA/code/common/__init__.py
+++ b/closed/NVIDIA/code/common/__init__.py
@@ -102,7 +102,10 @@ def get_system_id():
     import pycuda.driver
     import pycuda.autoinit
     name = pycuda.driver.Device(0).name()
+    print(name)
     count_actual = pycuda.driver.Device.count()
+    print(count_actual)
+    print(system_list)
     for system in system_list:
         if system[1] in name and count_actual == system[2]:
             return system[0]

I get a bit of clue about what's going on:

ck virtual env --tags=pycuda --shell_cmd="make generate_engines"
Quadro M1000M
1
[('T4x8', 'Tesla T4', 8), ('T4x20', 'Tesla T4', 20), ('TitanRTXx4', 'TITAN RTX', 4), ('Xavier', 'Xavier', 1)]

OK, I only have one Quadro M1000M GPU in my laptop, and it's clearly not in the list of systems NVIDIA used for their v0.5 submission. The same for the other two devices.

It appears, though, that I would have the same issue reproducing the Alibaba submission using only one Tesla T4, or the DellEMC submission using four Tesla T4s. It's ironic because both Alibaba and DellEMC refer to the NVIDIA submission for reproducibility.

Would it be possible to untangle the instructions to allow generating optimized TensorRT plans for other devices please?

@psyhtest

/cc @DilipSequeira @nvpohanh

@nvpohanh

The problem is that MLPerf runs require that you set a lot of parameters correctly first (like batch size, precision, target_qps, samples_per_query, etc.). It is not easy for MLPerf beginners to get VALID results without extensive tuning and an understanding of how MLPerf works.

In your case, you can add more system config files under measurements, change the parameters, and append your systems to system_list.py. Alibaba and DellEMC did the same thing. Nevertheless, it is not guaranteed to work, especially on old GPUs like Maxwell or Pascal.

@nvpohanh

One thing I do agree with, though, is that we could provide more instructions on how to reproduce Alibaba's and Dell's results.

@psyhtest

Thanks @nvpohanh. I agree that getting valid MLPerf results is not straightforward, and that your harness aims to address this for your submissions and those of your clients.

However, it seems to conflate two things: 1) generating optimized TensorRT plans, which requires parameters like the batch size and precision; 2) running the optimized plans, which requires LoadGen parameters like target_qps, samples_per_query, etc. That's fine for capturing the final result of the tuning, but it would be great to disentangle the two so that new devices can be added and tested quickly. A simple script per benchmark that takes as input the path to an original model, the batch size, the precision, etc., and outputs an optimized TensorRT plan for those parameters would be great.

For example, for MobileNet some interesting manipulations seem to take place (which probably result in higher accuracy than the vanilla ONNX-to-TensorRT conversion), but it's not obvious how one would run this piece of code in isolation from your test harness.
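
For illustration, something like the following is what I have in mind: a minimal sketch that builds a plan straight from an ONNX model with the TensorRT Python API, outside the harness. The function name and defaults are my own, not taken from the submission code, and the sketch skips the accuracy-preserving manipulations mentioned above; the batch size is assumed to be fixed in the ONNX input shape (dynamic shapes would need an optimization profile).

import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.INFO)

def build_plan(onnx_path, plan_path, precision="fp32", workspace_bytes=1 << 30):
    """Build a TensorRT plan file from an ONNX model at the requested precision."""
    explicit_batch = 1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
    with trt.Builder(TRT_LOGGER) as builder, \
         builder.create_network(explicit_batch) as network, \
         trt.OnnxParser(network, TRT_LOGGER) as parser:
        with open(onnx_path, "rb") as f:
            if not parser.parse(f.read()):
                errors = [str(parser.get_error(i)) for i in range(parser.num_errors)]
                raise RuntimeError("ONNX parsing failed: " + "; ".join(errors))
        config = builder.create_builder_config()
        config.max_workspace_size = workspace_bytes
        if precision == "fp16" and builder.platform_has_fast_fp16:
            config.set_flag(trt.BuilderFlag.FP16)
        elif precision == "int8" and builder.platform_has_fast_int8:
            config.set_flag(trt.BuilderFlag.INT8)  # would also need a calibrator or per-tensor ranges
        engine = builder.build_engine(network, config)
        if engine is None:
            raise RuntimeError("Engine build failed")
        with open(plan_path, "wb") as f:
            f.write(engine.serialize())

# e.g. build_plan("mobilenet.onnx", "mobilenet-fp16.plan", precision="fp16")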

@nvpohanh

@psyhtest I realized that we did provide a performance_tuning_guide.adoc that tells users how to tune the MLPerf parameters to get VALID results. The remaining missing part is how to add new systems to system_list.py. Do you agree?

Also, the MobileNet manipulations shouldn't depend on which GPU you use.

@psyhtest

psyhtest commented Nov 27, 2019

Also, the MobileNet manipulations shouldn't depend on which GPU you use.

Precisely my point! I assume the same is true for the other models, which is why it is a bit awkward that a full config must be provided just to run the conversion.

However, if you can provide a simple template that would make your harness accept e.g. Jetson TX1, that would be great.

@nvpohanh

nvpohanh commented Nov 27, 2019

Since I think your focus is on getting things to run rather than on high performance numbers, I would suggest the following:

desktop GPUs (Quadro M1000M or GeForce GTX1080)

  1. Append ("<system_id_1>", "M1000", 1), ("<system_id_2>", "1080", 1) to system_list.py (a sketch follows after this list).
  2. Copy the config files from one of the folders in measurements. I would suggest copying SingleStream or Offline for simplicity.
  3. Modify the following fields in "config.json":
  • input_dtype: set to "fp32"
  • input_format: set to "linear"
  • precision: set to "fp32"
  • use_graphs: set to false
  4. It should then run.
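
To make step 1 concrete, here is a sketch of what the appended entries might look like; the system_id strings are placeholders of your own choosing (they presumably also have to match the folder names you create under measurements/), and matching is a substring test against the pycuda device name plus an exact device count:

# code/common/system_list.py (sketch; existing v0.5 entries shown for context)
system_list = [
    ("T4x8", "Tesla T4", 8),
    ("T4x20", "Tesla T4", 20),
    ("TitanRTXx4", "TITAN RTX", 4),
    ("Xavier", "Xavier", 1),
    # hypothetical additions for the GPUs discussed above:
    ("M1000x1", "M1000", 1),    # substring match against "Quadro M1000M", 1 GPU
    ("GTX1080x1", "1080", 1),   # substring match against "GeForce GTX 1080", 1 GPU
]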

Jetson TX1

This is a more challenging one. Throughout our v0.5 submission, we assume Xavier when detecting an aarch64 system. I would try the following:

  1. Modify __init__.py#L100 so that it returns your system_id (a sketch follows after this list).
  2. Again, copy from the Xavier configs and make the same changes as above to set fp32 mode.
  3. For the SingleStream scenario, this should suffice to run the harness. For other scenarios, you need to pass --gpu_only in RUN_ARGS to turn off the DLA.
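
A rough sketch of the step-1 change; the structure below is illustrative only (the real get_system_id() in code/common/__init__.py differs), and "TX1" is a made-up system_id that would have to match your measurements/ folder:

import platform

def get_system_id():
    # v0.5 assumes any aarch64 machine is Xavier; short-circuit here for a TX1 instead.
    if platform.machine() == "aarch64":
        return "TX1"
    # ... the original pycuda-based detection of desktop/server GPUs continues below ...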

Could you try these? If it does not work, please let me know. Also, just a disclaimer that this won't give you the correct potential performance numbers, but it should at least be able to run. Thanks

@nvpohanh

Actually, there might be a much simpler approach: try passing --configs=<config_1.json>,<config_2.json>,<config_3.json> in RUN_ARGS to specify the configs explicitly, so that our code does not complain that it cannot find them.

@psyhtest

Thanks @nvpohanh, I'll give it a go! Just one correction: I think I still need to use quantization on all these GPUs. Any advice on how to set the config in that case?

@nvpohanh

Then you can try the original configs. I'm not sure whether INT8 works on the M1000, though.

@vilmara

vilmara commented Dec 2, 2019

Since I think you focus on making it run rather than high perf numbers, I would suggest the following:

desktop GPUs (Quadro M1000M or GeForce GTX1080)

  1. Append ("<system_id_1>", "M1000", 1), ("<system_id_2>", "1080", 1) to system_list.py
  2. Copy configs file from one of the folders in measurements. I would suggest that you copy SingleStream or Offline for simplicity

Hi @nvpohanh, thanks for the explanation. I have followed the steps above to successfully append my system with 1x Tesla T4. However, I am getting the error below with the command make generate_engines on Docker version 19.03:

Tracelog:

[TensorRT] ERROR: ../rtSafe/safeContext.cpp (110) - cuBLAS Error in initializeCommonContext: 1 (Could not initialize cublas, please check cuda installation.)
[TensorRT] ERROR: ../rtSafe/safeContext.cpp (110) - cuBLAS Error in initializeCommonContext: 1 (Could not initialize cublas, please check cuda installation.)
Process Process-3:
Traceback (most recent call last):
  File "/usr/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/usr/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/work/code/main.py", line 102, in handle_generate_engine
    b.build_engines()
  File "/work/code/common/builder.py", line 124, in build_engines
    buf = engine.serialize()
AttributeError: 'NoneType' object has no attribute 'serialize'
Traceback (most recent call last):
  File "code/main.py", line 327, in <module>
    main()
  File "code/main.py", line 317, in main
    launch_handle_generate_engine(benchmark_name, benchmark_conf, need_gpu, need_dla)
  File "code/main.py", line 80, in launch_handle_generate_engine
    raise RuntimeError("Building engines failed!")
RuntimeError: Building engines failed!
Makefile:298: recipe for target 'generate_engines' failed
make: *** [generate_engines] Error 1

@nvpohanh

nvpohanh commented Dec 2, 2019

I started seeing this error yesterday. I will investigate it and give you updates.

@nvpohanh

nvpohanh commented Dec 3, 2019

@vilmara This error seems to be gone today. Could you rebuild the Docker image (in particular, re-pull nvcr.io/nvidia/tensorrt:19.09-py3) and see if you still run into this issue? Thanks

@nvpohanh

nvpohanh commented Dec 4, 2019

Okay, I root-caused the issue. The newly published libcublas-dev is built for CUDA 10.2, so it breaks TRT.
Solution: remove libcublas-dev from docker/Dockerfile and it should work.

@vilmara

vilmara commented Dec 6, 2019

Hi @nvpohanh, it is working now, thank you!

@ens-lg4

ens-lg4 commented Feb 18, 2020

Dear @nvpohanh ,

Thank you for your advice about running the benchmarking suite on the GTX-1080 (via the Docker image) and the Jetson-TX1 (installing all of Xavier's dependencies directly on the board and patching Xavier->TX1). I believe I have followed all the instructions that you provided, and we managed to run the benchmarks for the image classification models (both ResNet50 and MobileNet).

However, the same settings and protocol did not work for ssd-small (SSD-MobileNet), and only led to a "Segmentation Fault" type of crash on both systems. It did not matter whether we used the int8/chw4 data format, int8/linear, or fp32/linear, AccuracyOnly or PerformanceOnly - the result was always roughly the same:

root@4331bf547160:/work# make run_harness RUN_ARGS="--benchmarks=ssd-small --scenarios=Offline --test_mode=PerformanceOnly"
[2020-02-17 19:02:15,140 main.py:291 INFO] Using config files: measurements/Velociti/ssd-small/Offline/config.json
[2020-02-17 19:02:15,140 __init__.py:142 INFO] Parsing config file measurements/Velociti/ssd-small/Offline/config.json ...
[2020-02-17 19:02:15,141 main.py:295 INFO] Processing config "Velociti_ssd-small_Offline"
[2020-02-17 19:02:15,141 main.py:111 INFO] Running harness for ssd-small benchmark in Offline scenario...
{'gpu_batch_size': 20, 'gpu_copy_streams': 1, 'gpu_inference_streams': 1, 'gpu_offline_expected_qps': 200, 'input_dtype': 'int8', 'input_format': 'chw4', 'map_path': 'data_maps/coco/val_map.txt', 'precision': 'int8', 'tensor_path': '${PREPROCESSED_DATA_DIR}/coco/val2017/SSDMobileNet/int8_chw4', 'use_graphs': False, 'system_id': 'Velociti', 'scenario': 'Offline', 'benchmark': 'ssd-small', 'config_name': 'Velociti_ssd-small_Offline', 'test_mode': 'PerformanceOnly', 'log_dir': '/work/build/logs/2020.02.17-19.02.14'}
[2020-02-17 19:02:15,178 __init__.py:42 INFO] Running command: ./build/bin/harness_default --plugins="build/plugins/NMSOptPlugin/libnmsoptplugin.so" --logfile_outdir="/work/build/logs/2020.02.17-19.02.14/Velociti/ssd-small/Offline" --logfile_prefix="mlperf_log_" --test_mode="PerformanceOnly" --gpu_copy_streams=1 --gpu_inference_streams=1 --use_graphs=false --gpu_batch_size=20 --map_path="data_maps/coco/val_map.txt" --tensor_path="${PREPROCESSED_DATA_DIR}/coco/val2017/SSDMobileNet/int8_chw4" --gpu_engines="./build/engines/Velociti/ssd-small/Offline/ssd-small-Offline-gpu-b20-int8.plan" --performance_sample_count=256 --max_dlas=0 --mlperf_conf_path="measurements/Velociti/ssd-small/Offline/mlperf.conf" --user_conf_path="measurements/Velociti/ssd-small/Offline/user.conf" --scenario Offline --model ssd-small --response_postprocess coco
&&&& RUNNING Default_Harness # ./build/bin/harness_default
[I] mlperf.conf path: measurements/Velociti/ssd-small/Offline/mlperf.conf
[I] user.conf path: measurements/Velociti/ssd-small/Offline/user.conf
[I] Device:0: ./build/engines/Velociti/ssd-small/Offline/ssd-small-Offline-gpu-b20-int8.plan has been successfully loaded.
[I] Creating batcher thread: 0 EnableBatcherThreadPerDevice: false
Starting warmup. Running for a minimum of 5 seconds.
Finished warmup. Ran for 5.07325s.
Segmentation fault (core dumped)
Traceback (most recent call last):
  File "code/main.py", line 327, in <module>
    main()
  File "code/main.py", line 319, in main
    handle_run_harness(benchmark_name, benchmark_conf, need_gpu, need_dla)
  File "code/main.py", line 141, in handle_run_harness
    result = harness.run_harness()
  File "/work/code/common/harness.py", line 240, in run_harness
    output = run_command(cmd, get_output=True)
  File "/work/code/common/__init__.py", line 58, in run_command
    raise subprocess.CalledProcessError(ret, cmd)
subprocess.CalledProcessError: Command './build/bin/harness_default --plugins="build/plugins/NMSOptPlugin/libnmsoptplugin.so" --logfile_outdir="/work/build/logs/2020.02.17-19.02.14/Velociti/ssd-small/Offline" --logfile_prefix="mlperf_log_" --test_mode="PerformanceOnly" --gpu_copy_streams=1 --gpu_inference_streams=1 --use_graphs=false --gpu_batch_size=20 --map_path="data_maps/coco/val_map.txt" --tensor_path="${PREPROCESSED_DATA_DIR}/coco/val2017/SSDMobileNet/int8_chw4" --gpu_engines="./build/engines/Velociti/ssd-small/Offline/ssd-small-Offline-gpu-b20-int8.plan" --performance_sample_count=256 --max_dlas=0 --mlperf_conf_path="measurements/Velociti/ssd-small/Offline/mlperf.conf" --user_conf_path="measurements/Velociti/ssd-small/Offline/user.conf" --scenario Offline --model ssd-small --response_postprocess coco' returned non-zero exit status 139.
Makefile:303: recipe for target 'run_harness' failed
make: *** [run_harness] Error 1

I tried to feed preprocessed image data directly to the TRT engine loaded from the .plan file, and noticed that although the program expects 701 floats per batch, of which the last (701st) float should convert into an int32 (since it is essentially the number of predicted boxes), I am consistently getting a non-integer there. Perhaps this could explain why the rest of the program crashes?
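
For reference, this is roughly the check I was doing on the raw output; it is a sketch under my own assumptions (the 100x7-boxes-plus-count layout and the field order are inferred from inspecting the tensor, not taken from NVIDIA's code, and whether the count is stored as raw int32 bits or as an integer-valued float is exactly what seems to go wrong):

import numpy as np

def decode_detection_output(raw):
    """raw: the 701 float32 values produced for one image by the NMS plugin output."""
    boxes = raw[:700].reshape(100, 7)                  # assumed layout: [image_id, label, score, x1, y1, x2, y2]
    keep_count = int(raw[700:701].view(np.int32)[0])   # reinterpret the last float's bits as an int32 box count
    return boxes[:keep_count], keep_count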

Does this situation ring any bells? Could I have missed anything?

Many thanks in advance.

@nvpohanh

nvpohanh commented Feb 18, 2020

@ens-lg4 One thing I can think of is the NMS plugin. Could you make sure that you build the corresponding SM version of the NMS plugin? See: https://github.com/mlperf/inference_results_v0.5/blob/master/closed/NVIDIA/code/plugin/NMSOptPlugin/CMakeLists.txt#L81

@ens-lg4

ens-lg4 commented Feb 18, 2020

@nvpohanh , thank you so much!

On the GTX-1080 I assumed the SM version to be 61, and after adding it to the list and recompiling, I can see two important changes: (1) the benchmark no longer crashes; it runs to the end and produces a credible result. (2) If I look at the output tensor coming off the engine, it is now properly structured, containing 701 floats with 100x7 boxes and the number of active boxes at the end.

On the Jetson-TX1 I assumed the SM version to be 53, added it, and recompiled, but nothing really changed, unfortunately. The output still looks broken, and the benchmark still segfaults. Did I not guess the SM version correctly? Or is it not possible to compile the NMS plugin for the TX1?

Many thanks again!

@ens-lg4

ens-lg4 commented Mar 3, 2020

Addition: I tried the same protocol on the Jetson-TX2 (having added SM version 62), but it crashed with a Segmentation Fault in the same way as the Jetson-TX1 did.

Can we assume that neither the TX1 nor the TX2 is supported?

@psyhtest

psyhtest commented Mar 4, 2020

It's worth adding that the TX1 and TX2 are both flashed with JetPack 4.3 (with TensorRT 6.0).

@psyhtest

psyhtest commented Mar 4, 2020

Also, I believe it happens for SSD-MobileNet-1 both with fp32 and fp16. We haven't tried SSD-ResNet yet.

@renganxu

renganxu commented Mar 7, 2020

Hi @nvpohanh, I had an error when running inference on a V100-SXM2-16GB. The following is the error. Do you have any idea why this error happened? I added my config to code/common/system_list.py and the corresponding files in measurements/V100-SXM2-16GBx4.

[2020-03-07 21:55:43,339 builder.py:119 INFO] Building ./build/engines/V100-SXM2-16GBx4/resnet/Offline/resnet-Offline-gpu-b32-int8.plan
[TensorRT] INFO: User provided dynamic ranges for all network tensors. Calibration skipped.
[TensorRT] ERROR: RES2_BR1_BR2C_1: could not find any supported formats consistent with input/output data types
[TensorRT] ERROR: ../builder/cudnnBuilderGraphNodes.cpp (539) - Misc Error in reportPluginError: 0 (could not find any supported formats consistent with input/output data types)
[TensorRT] ERROR: ../builder/cudnnBuilderGraphNodes.cpp (539) - Misc Error in reportPluginError: 0 (could not find any supported formats consistent with input/output data types)
Process Process-3:
Traceback (most recent call last):
  File "/usr/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/usr/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/work/code/main.py", line 102, in handle_generate_engine
    b.build_engines()
  File "/work/code/common/builder.py", line 124, in build_engines
    buf = engine.serialize()
AttributeError: 'NoneType' object has no attribute 'serialize'
Traceback (most recent call last):
  File "code/main.py", line 327, in <module>
    main()
  File "code/main.py", line 317, in main
    launch_handle_generate_engine(benchmark_name, benchmark_conf, need_gpu, need_dla)
  File "code/main.py", line 80, in launch_handle_generate_engine
    raise RuntimeError("Building engines failed!")
RuntimeError: Building engines failed!
Makefile:298: recipe for target 'generate_engines' failed

@vilmara

vilmara commented Mar 9, 2020

Hi @renganxu, I think this error is produced because the V100 doesn't support INT8 Tensor Cores. Have you tried FP16 Tensor Cores? Here is the list of supported precision modes per hardware: https://docs.nvidia.com/deeplearning/sdk/tensorrt-support-matrix/index.html#hardware-precision-matrix

(Screenshot of the hardware precision support matrix omitted.)

@nvpohanh

nvpohanh commented Mar 9, 2020

@vilmara is correct. Please use FP16 on V100.

@dagrayvid

@nvpohanh @vilmara Is there documentation on how to use FP16 in these benchmarks? Do we need to edit the preprocessing script?

@nvpohanh

@dagrayvid You can modify the config.json like this one: https://github.com/mlperf/inference_results_v0.5/blob/master/closed/NVIDIA/measurements/T4x8/resnet/Offline/config.json

For example, change input_dtype to fp32, precision to fp16, and the "int8_linear" in tensor_path to fp32.
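
A small sketch of that edit done programmatically; the path below is a hypothetical copy of the T4x8 config placed under your own system_id, and the three keys are the ones mentioned above:

import json

cfg_path = "measurements/V100-SXM2-16GBx4/resnet/Offline/config.json"  # assumed location of your copied config

with open(cfg_path) as f:
    cfg = json.load(f)

cfg["input_dtype"] = "fp32"                                             # feed unquantized inputs
cfg["precision"] = "fp16"                                               # build the TensorRT engine in FP16
cfg["tensor_path"] = cfg["tensor_path"].replace("int8_linear", "fp32")  # point at fp32 preprocessed tensors

with open(cfg_path, "w") as f:
    json.dump(cfg, f, indent=4)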

@nileshnegi

I am trying to get the NVIDIA implementations of these inference benchmarks running on my 8*V100-SXM2-32GB box. At the moment, I just want to get all benchmarks running.

As suggested earlier, I copied the T4x8 config files and altered them for V100 (input_dtype, precision, path to preprocessed data). With this I am able to run the Offline and SingleStream (wherever supported) scenarios for all 5 benchmarks.
However, with MultiStream and Server, the run_harness command hangs after the warmup step. There are processes (./build/bin/harness_default) spawned on all 8 GPUs, but no activity (GPU utilization is at 0 and the CPU process for harness_default is in the Sleep state).
Attached is the stdout:

[2020-05-06 10:07:13,373 main.py:291 INFO] Using config files: measurements/V100-SXM2-32GBx8/resnet/MultiStream/config.json
[2020-05-06 10:07:13,374 __init__.py:142 INFO] Parsing config file measurements/V100-SXM2-32GBx8/resnet/MultiStream/config.json ...
[2020-05-06 10:07:13,374 main.py:295 INFO] Processing config "V100-SXM2-32GBx8_resnet_MultiStream"
[2020-05-06 10:07:13,374 main.py:111 INFO] Running harness for resnet benchmark in MultiStream scenario...
[2020-05-06 10:07:13,375 harness.py:49 INFO] ===== Harness arguments for resnet =====
[2020-05-06 10:07:13,375 harness.py:51 INFO] deque_timeout_us=50
[2020-05-06 10:07:13,376 harness.py:51 INFO] gpu_batch_size=60
[2020-05-06 10:07:13,376 harness.py:51 INFO] gpu_multi_stream_samples_per_query=1920
[2020-05-06 10:07:13,376 harness.py:51 INFO] input_dtype=fp32
[2020-05-06 10:07:13,376 harness.py:51 INFO] input_format=
[2020-05-06 10:07:13,376 harness.py:51 INFO] map_path=data_maps/imagenet/val_map.txt
[2020-05-06 10:07:13,376 harness.py:51 INFO] precision=fp16
[2020-05-06 10:07:13,376 harness.py:51 INFO] tensor_path=${PREPROCESSED_DATA_DIR}/imagenet/ResNet50/fp32
[2020-05-06 10:07:13,376 harness.py:51 INFO] use_batcher_thread_per_device=True
[2020-05-06 10:07:13,376 harness.py:51 INFO] use_graphs=False
[2020-05-06 10:07:13,376 harness.py:51 INFO] system_id=V100-SXM2-32GBx8
[2020-05-06 10:07:13,376 harness.py:51 INFO] scenario=MultiStream
[2020-05-06 10:07:13,376 harness.py:51 INFO] benchmark=resnet
[2020-05-06 10:07:13,376 harness.py:51 INFO] config_name=V100-SXM2-32GBx8_resnet_MultiStream
[2020-05-06 10:07:13,376 harness.py:51 INFO] verbose=True
[2020-05-06 10:07:13,376 harness.py:51 INFO] test_mode=PerformanceOnly
[2020-05-06 10:07:13,376 harness.py:51 INFO] log_dir=/work/build/logs/2020.05.06-10.07.12
[2020-05-06 10:07:13,376 __init__.py:42 INFO] Running command: ./build/bin/harness_default --verbose=true --logfile_outdir="/work/build/logs/2020.05.06-10.07.12/V100-SXM2-32GBx8/resnet/MultiStream" --logfile_prefix="mlperf_log_" --test_mode="PerformanceOnly" --deque_timeout_us=50 --use_batcher_thread_per_device=true --use_graphs=false --gpu_batch_size=60 --map_path="data_maps/imagenet/val_map.txt" --tensor_path="${PREPROCESSED_DATA_DIR}/imagenet/ResNet50/fp32" --gpu_engines="./build/engines/V100-SXM2-32GBx8/resnet/MultiStream/resnet-MultiStream-gpu-b60-fp16.plan" --performance_sample_count=1024 --max_dlas=0 --mlperf_conf_path="measurements/V100-SXM2-32GBx8/resnet/MultiStream/mlperf.conf" --user_conf_path="measurements/V100-SXM2-32GBx8/resnet/MultiStream/user.conf" --scenario MultiStream --model resnet
{'deque_timeout_us': 50, 'gpu_batch_size': 60, 'gpu_multi_stream_samples_per_query': 1920, 'input_dtype': 'fp32', 'input_format': '', 'map_path': 'data_maps/imagenet/val_map.txt', 'precision': 'fp16', 'tensor_path': '${PREPROCESSED_DATA_DIR}/imagenet/ResNet50/fp32', 'use_batcher_thread_per_device': True, 'use_graphs': False, 'system_id': 'V100-SXM2-32GBx8', 'scenario': 'MultiStream', 'benchmark': 'resnet', 'config_name': 'V100-SXM2-32GBx8_resnet_MultiStream', 'verbose': True, 'test_mode': 'PerformanceOnly', 'log_dir': '/work/build/logs/2020.05.06-10.07.12'}
&&&& RUNNING Default_Harness # ./build/bin/harness_default
[I] mlperf.conf path: measurements/V100-SXM2-32GBx8/resnet/MultiStream/mlperf.conf
[I] user.conf path: measurements/V100-SXM2-32GBx8/resnet/MultiStream/user.conf
[V] [TRT] Plugin Creator registration succeeded - GridAnchor_TRT
[V] [TRT] Plugin Creator registration succeeded - NMS_TRT
[V] [TRT] Plugin Creator registration succeeded - Reorg_TRT
[V] [TRT] Plugin Creator registration succeeded - Region_TRT
[V] [TRT] Plugin Creator registration succeeded - Clip_TRT
[V] [TRT] Plugin Creator registration succeeded - LReLU_TRT
[V] [TRT] Plugin Creator registration succeeded - PriorBox_TRT
[V] [TRT] Plugin Creator registration succeeded - Normalize_TRT
[V] [TRT] Plugin Creator registration succeeded - RPROI_TRT
[V] [TRT] Plugin Creator registration succeeded - BatchedNMS_TRT
[V] [TRT] Plugin Creator registration succeeded - FlattenConcat_TRT
[V] [TRT] Deserialize required 1402082 microseconds.
[I] Device:0: ./build/engines/V100-SXM2-32GBx8/resnet/MultiStream/resnet-MultiStream-gpu-b60-fp16.plan has been successfully loaded.
[V] [TRT] Deserialize required 2031154 microseconds.
[I] Device:1: ./build/engines/V100-SXM2-32GBx8/resnet/MultiStream/resnet-MultiStream-gpu-b60-fp16.plan has been successfully loaded.
[V] [TRT] Deserialize required 2071939 microseconds.
[I] Device:2: ./build/engines/V100-SXM2-32GBx8/resnet/MultiStream/resnet-MultiStream-gpu-b60-fp16.plan has been successfully loaded.
[V] [TRT] Deserialize required 2275912 microseconds.
[I] Device:3: ./build/engines/V100-SXM2-32GBx8/resnet/MultiStream/resnet-MultiStream-gpu-b60-fp16.plan has been successfully loaded.
[V] [TRT] Deserialize required 2556367 microseconds.
[I] Device:4: ./build/engines/V100-SXM2-32GBx8/resnet/MultiStream/resnet-MultiStream-gpu-b60-fp16.plan has been successfully loaded.
[V] [TRT] Deserialize required 2710989 microseconds.
[I] Device:5: ./build/engines/V100-SXM2-32GBx8/resnet/MultiStream/resnet-MultiStream-gpu-b60-fp16.plan has been successfully loaded.
[V] [TRT] Deserialize required 2994004 microseconds.
[I] Device:6: ./build/engines/V100-SXM2-32GBx8/resnet/MultiStream/resnet-MultiStream-gpu-b60-fp16.plan has been successfully loaded.
[V] [TRT] Deserialize required 3267479 microseconds.
[I] Device:7: ./build/engines/V100-SXM2-32GBx8/resnet/MultiStream/resnet-MultiStream-gpu-b60-fp16.plan has been successfully loaded.
[I] Creating batcher thread: 0 EnableBatcherThreadPerDevice: true
[I] Creating batcher thread: 1 EnableBatcherThreadPerDevice: true
[I] Creating batcher thread: 2 EnableBatcherThreadPerDevice: true
[I] Creating batcher thread: 3 EnableBatcherThreadPerDevice: true
[I] Creating batcher thread: 4 EnableBatcherThreadPerDevice: true
[I] Creating batcher thread: 5 EnableBatcherThreadPerDevice: true
[I] Creating batcher thread: 6 EnableBatcherThreadPerDevice: true
[I] Creating batcher thread: 7 EnableBatcherThreadPerDevice: true
Starting warmup. Running for a minimum of 5 seconds.
Finished warmup. Ran for 5.16286s.

How can I debug this?

@nvpohanh

nvpohanh commented May 6, 2020

@nileshnegi For Server, please set server_target_qps to some very low number (like 1) to see if the program can finish without issue.

For MultiStream, please set multi_stream_samples_per_query to a low number (like 1) and set min_query_count to 1. By default, MultiStream will run for at least 4 hours.

@sbillus

sbillus commented Jun 19, 2020

Hi, I am also trying to run the benchmark on a Tesla V100, and based on the suggestions above I have created the configuration file.
But when I run the test, it fails after warmup. It looks like it is unable to find a file that is not supposed to be in the ImageNet validation set anyway.
Below is the console output:

[I] mlperf.conf path: measurements/ml2/resnet/Offline/mlperf.conf
[I] user.conf path: measurements/ml2/resnet/Offline/user.conf
[I] Device:0: ./build/engines/ml2/resnet/Offline/resnet-Offline-gpu-b256-fp16.plan has been successfully loaded.
[I] Device:1: ./build/engines/ml2/resnet/Offline/resnet-Offline-gpu-b256-fp16.plan has been successfully loaded.
[I] Device:2: ./build/engines/ml2/resnet/Offline/resnet-Offline-gpu-b256-fp16.plan has been successfully loaded.
[I] Creating batcher thread: 0 EnableBatcherThreadPerDevice: false
Starting warmup. Running for a minimum of 5 seconds.
Finished warmup. Ran for 5.59176s.
F0619 01:22:14.912485 5454 qsl.hpp:145] Check failed: fs Unable to open: /work/build/preprocessed_data/imagenet/ResNet50/fp32/ILSVRC2012_val_00027229.JPEG.npy
*** Check failure stack trace: ***
@ 0x7f3e75626362 google::LogMessage::Fail()
@ 0x7f3e756262aa google::LogMessage::SendToLog()
@ 0x7f3e75625beb google::LogMessage::Flush()
@ 0x7f3e75629066 google::LogMessageFatal::~LogMessageFatal()
@ 0x55b98e06339f qsl::SampleLibrary::LoadSamplesToRam()
@ 0x55b98e0f57b3 mlperf::loadgen::RunPerformanceMode<>()
@ 0x55b98e0d5512 mlperf::StartTest()
@ 0x55b98e05de95 doInference()
@ 0x55b98e05b88f main
@ 0x7f3e6739fb6b __libc_start_main
@ 0x55b98e05bf9a start
@ (nil) (unknown)
Aborted (core dumped)
Traceback (most recent call last):
  File "code/main.py", line 327, in <module>
    main()
  File "code/main.py", line 319, in main
    handle_run_harness(benchmark_name, benchmark_conf, need_gpu, need_dla)
  File "code/main.py", line 141, in handle_run_harness
    result = harness.run_harness()
  File "/work/code/common/harness.py", line 240, in run_harness
    output = run_command(cmd, get_output=True)
  File "/work/code/common/__init__.py", line 58, in run_command
    raise subprocess.CalledProcessError(ret, cmd)
subprocess.CalledProcessError: Command './build/bin/harness_default --logfile_outdir="/work/build/logs/2020.06.19-01.20.50/ml2/resnet/Offline" --logfile_prefix="mlperf_log_" --gpu_copy_streams=4 --gpu_inference_streams=1 --use_graphs=false --gpu_batch_size=256 --map_path="data_maps/imagenet/val_map.txt" --tensor_path="${PREPROCESSED_DATA_DIR}/imagenet/ResNet50/fp32" --gpu_engines="./build/engines/ml2/resnet/Offline/resnet-Offline-gpu-b256-fp16.plan" --performance_sample_count=1024 --max_dlas=0 --mlperf_conf_path="measurements/ml2/resnet/Offline/mlperf.conf" --user_conf_path="measurements/ml2/resnet/Offline/user.conf" --scenario Offline --model resnet' returned non-zero exit status 134.
Makefile:303: recipe for target 'run_harness' failed
make[1]: *** [run_harness] Error 1
make[1]: Leaving directory '/work'
Makefile:292: recipe for target 'run' failed
make: *** [run] Error 2

I checked: there is no ILSVRC2012_val_00027229.JPEG in the validation set, so why is the benchmark looking for it?

@nvpohanh

@sbillus
Copy link

sbillus commented Jun 19, 2020

Yes, I did. I tried running the preprocessing again just to be sure, and this time all the files already existed. I also checked my val_map.txt file and it doesn't contain ILSVRC2012_val_00027229.JPEG either. I have no idea, then, why it is looking for it.
The engine is generated successfully and the warmup runs fine too. It just fails when it cannot find this image.

@nvpohanh

Ah, you are running on V100! This is not our intended submission platform.

Nevertheless, you can hack the preprocessing script by adding fp32 to the format list here: https://github.com/mlperf/inference_results_v0.5/blob/master/closed/NVIDIA/scripts/preprocess_data.py#L300
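
To illustrate the intent of that hack (the list name below is made up; check the actual code around the linked line in scripts/preprocess_data.py):

# scripts/preprocess_data.py, illustrative only -- the real variable name and location may differ
formats = ["int8_linear", "int8_chw4", "fp32"]   # append "fp32" so fp32 .npy tensors are also generated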

@sbillus

sbillus commented Jun 19, 2020

It worked! Thank you for the help!
I had thought that the fp32 format subdirectory already existed for ImageNet after preprocessing, and since the tensor path to it was already defined in the custom config file, I assumed I didn't need any further tweaks anywhere else to run the benchmark.
(Although I still don't know why it was looking for that image in the case of a format mismatch.)
It makes me wonder whether there is a way to add "fp32" to the COCO dataset preprocessing as well, to benchmark the V100 on the object detection task too.
I will try that and update others in the thread.
EDIT: It works the same way for the COCO dataset too. I just tried the ssd-large benchmark after preprocessing the COCO dataset with the same hack to the preprocessing script suggested above, and it worked!

@rakshithvasudev

Hello @nvpohanh,

I have a very similar issue to the one reported above by nileshnegi. It seems that the resnet Server scenario freezes for some reason. I've modified the config to run on V100 as mentioned above, and also tried setting server_target_qps=1, still with no luck. Any pointers to fix this issue would be appreciated.


root@node002:/work# make run_harness RUN_ARGS="--benchmarks=resnet --scenarios=Server --test_mode=PerformanceOnly"
[2020-07-16 03:11:04,380 main.py:291 INFO] Using config files: measurements/V100-SXM2-32GBx4/resnet/Server/config.json
[2020-07-16 03:11:04,380 __init__.py:142 INFO] Parsing config file measurements/V100-SXM2-32GBx4/resnet/Server/config.json ...
[2020-07-16 03:11:04,380 main.py:295 INFO] Processing config "V100-SXM2-32GBx4_resnet_Server"
[2020-07-16 03:11:04,380 main.py:111 INFO] Running harness for resnet benchmark in Server scenario...
{'active_sms': 50, 'deque_timeout_us': 2000, 'gpu_batch_size': 80, 'gpu_copy_streams': 8, 'gpu_inference_streams': 2, 'input_dtype': 'fp32', 'input_format': 'linear', 'map_path': 'data_maps/imagenet/val_map.txt', 'precision': 'fp16', 'server_target_qps': 1, 'tensor_path': '${PREPROCESSED_DATA_DIR}/imagenet/ResNet50/fp32', 'use_cuda_thread_per_device': True, 'use_deque_limit': True, 'use_graphs': False, 'system_id': 'V100-SXM2-32GBx4', 'scenario': 'Server', 'benchmark': 'resnet', 'config_name': 'V100-SXM2-32GBx4_resnet_Server', 'test_mode': 'PerformanceOnly', 'log_dir': '/work/build/logs/2020.07.16-03.11.03'}
[2020-07-16 03:11:04,382 __init__.py:42 INFO] Running command: LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.2 ./build/bin/harness_default --logfile_outdir="/work/build/logs/2020.07.16-03.11.03/V100-SXM2-32GBx4/resnet/Server" --logfile_prefix="mlperf_log_" --test_mode="PerformanceOnly" --gpu_copy_streams=8 --gpu_inference_streams=2 --use_deque_limit=true --deque_timeout_us=2000 --use_cuda_thread_per_device=true --use_graphs=false --gpu_batch_size=80 --map_path="data_maps/imagenet/val_map.txt" --tensor_path="${PREPROCESSED_DATA_DIR}/imagenet/ResNet50/fp32" --gpu_engines="./build/engines/V100-SXM2-32GBx4/resnet/Server/resnet-Server-gpu-b80-fp16.plan" --performance_sample_count=1024 --max_dlas=0 --mlperf_conf_path="measurements/V100-SXM2-32GBx4/resnet/Server/mlperf.conf" --user_conf_path="measurements/V100-SXM2-32GBx4/resnet/Server/user.conf" --scenario Server --model resnet
&&&& RUNNING Default_Harness # ./build/bin/harness_default
[I] mlperf.conf path: measurements/V100-SXM2-32GBx4/resnet/Server/mlperf.conf
[I] user.conf path: measurements/V100-SXM2-32GBx4/resnet/Server/user.conf
[W] [TRT] Using an engine plan file across different models of devices is not recommended and is likely to affect performance or even cause errors.
[I] Device:0: ./build/engines/V100-SXM2-32GBx4/resnet/Server/resnet-Server-gpu-b80-fp16.plan has been successfully loaded.
[W] [TRT] Using an engine plan file across different models of devices is not recommended and is likely to affect performance or even cause errors.
[I] Device:1: ./build/engines/V100-SXM2-32GBx4/resnet/Server/resnet-Server-gpu-b80-fp16.plan has been successfully loaded.
[W] [TRT] Using an engine plan file across different models of devices is not recommended and is likely to affect performance or even cause errors.
[I] Device:2: ./build/engines/V100-SXM2-32GBx4/resnet/Server/resnet-Server-gpu-b80-fp16.plan has been successfully loaded.
[W] [TRT] Using an engine plan file across different models of devices is not recommended and is likely to affect performance or even cause errors.
[I] Device:3: ./build/engines/V100-SXM2-32GBx4/resnet/Server/resnet-Server-gpu-b80-fp16.plan has been successfully loaded.
[I] Creating batcher thread: 0 EnableBatcherThreadPerDevice: false
[I] Creating cuda thread: 0
[I] Creating cuda thread: 1
[I] Creating cuda thread: 2
[I] Creating cuda thread: 3
Starting warmup. Running for a minimum of 5 seconds.
Finished warmup. Ran for 5.36912s.

@nvpohanh

@rakshithvasudev Please also add --min_query_count=1. By default, LoadGen runs for at least 270k queries, so at target_qps=1, that's three days...

@rakshithvasudev

@nvpohanh it worked with --min_query_count=1. Thanks!
