Customizing NVIDIA instructions for reproducibility #7
The problem is that MLPerf runs require you to set a lot of parameters correctly first (batch size, precision, target_qps, samples_per_query, etc.). It is not easy for MLPerf beginners to get VALID results without extensive tuning and an understanding of how MLPerf works. In your case, you can add more system config files into …
One thing I do agree with, though, is that we can provide more instructions about how to reproduce Alibaba's and Dell's results.
Thanks @nvpohanh. I agree that getting valid MLPerf results is not straightforward, and that your harness aims to address this for your submissions and those of your clients. However, it seems to conflate two things: 1) generating optimized TensorRT plans, which requires setting parameters like the batch size and precision; 2) running the optimized plans, which requires setting LoadGen parameters like target_qps, samples_per_query, etc. Coupling them is fine for capturing the final result of the tuning, but it would be great to disentangle them so that new devices can be added and tested quickly. A simple script per benchmark that takes as input the path to an original model, the batch size, the precision, etc., and outputs an optimized TensorRT plan for those parameters would be great. For example, for MobileNet some interesting manipulations seem to take place (which probably result in higher accuracy than the vanilla ONNX-to-TensorRT conversion), but it is not obvious how one would run that piece of code in isolation from your test harness.
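To make the request concrete, a standalone conversion along these lines might look roughly like the sketch below. This is not code from the NVIDIA harness: it assumes the TensorRT 6/7 Python API, the file names are hypothetical, the workspace size is arbitrary, and INT8 would additionally require a calibrator.

```python
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.INFO)

def build_plan(onnx_path, plan_path, fp16=True):
    """Build a serialized TensorRT plan from an ONNX model (minimal sketch)."""
    builder = trt.Builder(TRT_LOGGER)
    # Explicit-batch network, as the ONNX parser expects in recent TensorRT versions.
    # The batch size comes from the ONNX graph itself (or an optimization profile
    # for dynamic shapes), not from a separate flag.
    network = builder.create_network(
        1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
    parser = trt.OnnxParser(network, TRT_LOGGER)

    with open(onnx_path, "rb") as f:
        if not parser.parse(f.read()):
            errors = [str(parser.get_error(i)) for i in range(parser.num_errors)]
            raise RuntimeError("ONNX parsing failed:\n" + "\n".join(errors))

    config = builder.create_builder_config()
    config.max_workspace_size = 1 << 30  # 1 GiB; adjust for the target device
    if fp16 and builder.platform_has_fast_fp16:
        config.set_flag(trt.BuilderFlag.FP16)
    # INT8 would need config.set_flag(trt.BuilderFlag.INT8) plus a calibrator.

    engine = builder.build_engine(network, config)
    with open(plan_path, "wb") as f:
        f.write(engine.serialize())

if __name__ == "__main__":
    build_plan("resnet50_v1.onnx", "resnet50_v1.plan")  # hypothetical file names
```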
@psyhtest I realized that we did provide a performance_tuning_guide.adoc to tell users how to tune the MLPerf parameters to get VALID results. The remaining missing part is how to add new systems to system_list.py. Do you agree? Also, the MobileNet manipulations shouldn't depend on which GPU you use.
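For what it's worth, the kind of addition meant here might look roughly like the hypothetical sketch below. The exact structure of system_list.py may differ, so mirror an existing entry in the real file rather than copying this verbatim; the system id, GPU name, and count shown are assumptions.

```python
# code/common/system_list.py -- hypothetical entry; follow the format of the
# entries already present in the real file.
system_list = ([
    # ... existing entries such as the T4 and Xavier systems ...
    ("GeForce_GTX_1080x1",   # system id used to pick config/measurement directories
     "GeForce GTX 1080",     # GPU name as reported by the driver
     1),                     # number of GPUs in the system
])
```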
Precisely my point! I assume the same is true for the other models? Which is why it is a bit awkward that a full config has to be provided just to run the conversion. However, if you can provide a simple template that would make your harness accept e.g. a Jetson TX1, that would be great.
Since I think you focus on making it run rather than on high performance numbers, I would suggest the following:
- Desktop GPUs (Quadro M1000M or GeForce GTX1080): …
- Jetson TX1: this is a more challenging one. Throughout our v0.5 submission, we assume Xavier when detecting an aarch64 system. I would try the following: …
Could you try these? If it does not work, please let me know. Also, just a disclaimer: this won't give you the full potential performance numbers, but it should at least be able to run. Thanks.
Actually, there might be a way simpler approach: try passing …
Thanks @nvpohanh, I'll give it a go! Just a correction: I think I still need to use quantization on all these GPUs. Any advice on how to set the config in that case?
Then you can try the original configs. Not sure if INT8 works on the M1000 or not, though.
Hi @nvpohanh, thanks for the explanation. I have followed the steps above and successfully added my system with 1x Tesla T4. However, I am getting the error below with the command … Tracelog: …
I started seeing this error yesterday. I will investigate it and give you updates.
@vilmara This error seems to be gone today. Could you rebuild the docker image (especially, re-pull …)?
Okay, root-caused the issue. The newly published libcublas-dev is built for cuda-10.2, so it breaks TRT.
Hi @nvpohanh, it is working now, thank you!
Dear @nvpohanh, thank you for your advice about running the benchmarking suite on a GTX-1080 (via the docker image) and a Jetson TX1 (installing all of Xavier's dependencies directly on the board and patching Xavier->TX1). I believe I have followed all the instructions you provided, and we managed to run the benchmarks for the image classification models (both ResNet50 and MobileNet). However, the same settings and protocol did not work for ssd-small (SSD-MobileNet) and only led to a "Segmentation Fault" type of crash on both systems. It did not matter whether we used the int8/chw4, int8/linear or fp32/linear data formats, or AccuracyOnly vs PerformanceOnly - the result was always roughly the same: …
I tried to directly feed preprocessed image data to the TRT engine loaded from the .plan file, and noticed that although the program expects 701 floats per batch, of which the last (701st) float should convert into an int32 (since it is essentially the number of predicted boxes), I am consistently getting a non-integer there. Perhaps this could explain why the rest of the program crashes? Does this situation ring any bells? Could I have missed anything? Many thanks in advance.
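A minimal sketch of how that trailing value is normally reinterpreted when inspecting the raw output, assuming `output` is the flat float32 array copied back from the engine for one sample (the variable names here are illustrative, not code from the harness):

```python
import numpy as np

# output: flat np.float32 array of length 701 from the SSD-MobileNet engine
# (100 detections x 7 values each, plus one trailing keep_count stored as raw int32 bits)
boxes = output[:700].reshape(100, 7)
keep_count = int(output[-1:].view(np.int32)[0])  # reinterpret the bits; do not cast the float
print(f"{keep_count} valid detections")
```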
@ens-lg4 One thing I can think of is the NMS plugin. Could you make sure that you build the corresponding SM version for the NMS plugin? See: https://github.com/mlperf/inference_results_v0.5/blob/master/closed/NVIDIA/code/plugin/NMSOptPlugin/CMakeLists.txt#L81
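For reference, one quick way to find the SM number to add to that gencode list, assuming PyTorch happens to be installed on the target machine (any CUDA device-query tool gives the same information):

```python
import torch

# Compute capability of GPU 0, e.g. (6, 1) on a GTX 1080 or (5, 3) on a Jetson TX1.
major, minor = torch.cuda.get_device_capability(0)
print(f"Add SM {major}{minor} to the NMS plugin's gencode list in CMakeLists.txt")
```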
@nvpohanh, thank you so much! On the GTX-1080 I assumed the SM version to be 61, and after adding it to the list and recompiling I can see two important changes: (1) the benchmark no longer crashes; it runs to the end and produces a credible result. Also, (2) if I look at the output tensor coming off the engine, it is now properly structured, containing 701 floats with 100x7 boxes and the number of active boxes at the end. On the Jetson TX1 I assumed the SM version to be 53, added it and recompiled, but nothing really changed, unfortunately. The output still looks broken, and the benchmark still segfaults. Did I not guess the SM version correctly? Or is it not possible to compile the NMS plugin for the TX1? Many thanks again!
Addition: I tried the same protocol on a Jetson TX2 (having added SM version 62), but it crashed with a Segmentation Fault in the same way as the Jetson TX1 did. Can we assume that neither the TX1 nor the TX2 is supported?
It's worth adding that the TX1 and TX2 are both flashed with JetPack 4.3 (with TensorRT 6.0).
Also, I believe it happens for SSD-MobileNet-1 with both fp32 and fp16. We haven't tried SSD-ResNet yet.
Hi @nvpohanh, I had an error when running the inference on V100-SXM2-16GB. Do you have any idea why this error happened? I added my config in code/common/system_list.py and the corresponding files in measurements/V100-SXM2-16GBx4. The error is: …
Hi @renganxu, I think this error is produced because the V100 doesn't support INT8 Tensor Cores. Have you tried FP16 Tensor Cores? Here is the list of supported precision modes per hardware: https://docs.nvidia.com/deeplearning/sdk/tensorrt-support-matrix/index.html#hardware-precision-matrix
@vilmara is correct. Please use FP16 on V100.
@dagrayvid You can modify the config.json like this one: https://github.com/mlperf/inference_results_v0.5/blob/master/closed/NVIDIA/measurements/T4x8/resnet/Offline/config.json. For example, change …
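As a rough illustration of the kind of edit meant here, a hedged sketch only: the key names and values below are assumptions (the thread only mentions input_dtype, precision, and the preprocessed-data path as the fields to change), so keep whatever keys the T4x8 file actually contains.

```python
import json

# Start from the T4x8 config linked above and adjust it for the new system.
with open("measurements/T4x8/resnet/Offline/config.json") as f:
    cfg = json.load(f)

cfg["input_dtype"] = "fp16"   # assumed value: V100 runs FP16 instead of INT8
cfg["precision"] = "fp16"
# ...also adjust the expected QPS / batch size and the preprocessed-data path...

with open("measurements/V100-SXM2-16GBx4/resnet/Offline/config.json", "w") as f:
    json.dump(cfg, f, indent=2)
```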
I am trying to get the NVIDIA implementations of these inference benchmarks running on my 8x V100-SXM2-32GB box. At the moment, I just want to get all benchmarks running. As suggested earlier, I copied the T4x8 config files and altered them for V100 (input_dtype, precision, path to preprocessed data). With this I am able to run the Offline and SingleStream (wherever supported) scenarios for all 5 benchmarks. The Server and MultiStream scenarios fail, however: …
How can I debug this? |
@nileshnegi For Server, please set the … For MultiStream, please set …
Hi, I am also trying to run the benchmark on a Tesla V100, and based on the suggestions above I have created the configuration file. The run fails looking for an image that isn't there:

[I] mlperf.conf path: measurements/ml2/resnet/Offline/mlperf.conf
…

I checked, and there is no ILSVRC2012_val_00027229.JPEG in the validation set, so why is the benchmark looking for it?
Did you follow the preprocessing steps mentioned here?
Yes, I did. I tried running the preprocessing again just to be sure, and this time all the files already existed. I also checked my val_map.txt file, and it doesn't contain ILSVRC2012_val_00027229.JPEG either. I have no idea why it is looking for it.
Ah, you are running on V100! This is not our intended submission platform. Nevertheless, you can hack the preprocessing script by adding …
It worked! Thank you for the help!
Hello @nvpohanh, I have a very similar issue to the one reported above by nileshnegi. It seems to me that …
@rakshithvasudev Please also add …
@nvpohanh it worked with --min_query_count=1. Thanks!
I'd like to generate and run TensorRT optimized plan files from the NVIDIA v0.5 submission on my (admittedly out-of-date) NVIDIA devices: Quadro M1000M, GeForce GTX1080 and Jetson TX1.
Unfortunately, following the instructions fails: …
With a naive intervention: …
I get a bit of a clue about what's going on: …
OK, I only have one Quadro M1000M GPU in my laptop, and it's clearly not in the list of systems NVIDIA used for their v0.5 submission. The same goes for the other two devices.
It appears, though, that I would have the same issue reproducing the Alibaba submission using only one Tesla T4, or the DellEMC submission using four Tesla T4s. (It's ironic, because both Alibaba and DellEMC refer to the NVIDIA submission for reproducibility.)
Would it be possible to untangle the instructions to allow generating optimized TensorRT plans for other devices please?