[Performance] Inference gets slower after about 10 runs on an Intel CPU when executing onnxruntime inference in a loop #13651

Open
allen20200111 opened this issue Nov 15, 2022 · 7 comments

allen20200111 commented Nov 15, 2022

Describe the issue

I have one onnxruntime session running on an Intel CPU:
(1) At first, the total inference time is about 200 ms.
(2) After many runs, a single inference takes more than 10 s.

Inference gets slower after some number of runs when I execute onnxruntime inference in a loop. Maybe the thread pool is blocking execution? What should I do?

When the process is blocked, I attach gdb:
0x00007efd417a01a9 in onnxruntime::concurrency::ThreadPool::RunInParallel(std::function<void (unsigned int)>, unsigned int) () from /usr/local/lib64/libonnxruntime.so.1.6.0
Missing separate debuginfos, use: debuginfo-install libgcc-4.8.5-44.el7.x86_64 libstdc++-4.8.5-44.el7.x86_64 libuuid-2.23.2-65.el7.x86_64
(gdb) bt
#0 0x00007efd417a01a9 in onnxruntime::concurrency::ThreadPool::RunInParallel(std::function<void (unsigned int)>, unsigned int) () from /usr/local/lib64/libonnxruntime.so.1.6.0
#1 0x00007efd417a05ce in onnxruntime::concurrency::ThreadPool::ParallelForFixedBlockSizeScheduling(long, long, std::function<void (long, long)> const&) () from /usr/local/lib64/libonnxruntime.so.1.6.0
#2 0x00007efd417a06a5 in onnxruntime::concurrency::ThreadPool::SimpleParallelFor(long, std::function<void (long)> const&) () from /usr/local/lib64/libonnxruntime.so.1.6.0
#3 0x00007efd417ef558 in MlasExecuteThreaded(void (*)(void*, int), void*, int, onnxruntime::concurrency::ThreadPool*) () from /usr/local/lib64/libonnxruntime.so.1.6.0
#4 0x00007efd417b98fc in MlasNchwcConv(long const*, long const*, long const*, long const*, long const*, long const*, unsigned long, float const*, float const*, float const*, float*, MLAS_ACTIVATION const*, bool, onnxruntime::concurrency::ThreadPool*) () from /usr/local/lib64/libonnxruntime.so.1.6.0

To reproduce

Execute onnxruntime inference in a loop on an Intel CPU with one onnxruntime session:
(1) At first, the total time is about 200 ms.
(2) After many test iterations, it takes more than 10 s.

Session options:
Ort::SessionOptions session_options;
session_options.SetGraphOptimizationLevel(GraphOptimizationLevel::ORT_ENABLE_ALL);
m_session = new Ort::Experimental::Session(m_env, model_path, session_options);

Urgency

This issue has blocked my project for two months; please give some help. Thanks.

Platform

Linux

OS Version

CentOS Linux release 7.8.2003 (Core)

ONNX Runtime Installation

build from source

ONNX Runtime Version or Commit ID

1.6.0

ONNX Runtime API

C++

Architecture

X64

Execution Provider

Default CPU

Execution Provider Library Version

cuda 10.2

Model File

No response

Is this a quantized model?

No

github-actions bot added the ep:CUDA label (issues related to the CUDA execution provider) Nov 15, 2022
yuslepukhin (Member) commented:

> what should I do?

I would suggest following the process and sharing with us the model, if possible, and a minimal test program that is used to measure performance.

allen20200111 (Author) commented Nov 16, 2022


#include <assert.h>
#include <onnxruntime/core/session/onnxruntime_cxx_api.h>

#include <iostream>
#include <vector>

#include<sys/timeb.h>
#include <sched.h> 

// Returns wall-clock time in milliseconds.
long long systemtime()
{
    timeb t;
    ftime(&t);
    return t.time * 1000 + t.millitm;
}

// Pin the calling process to a single CPU core.
inline void assignToThisCore(int core_id)
{
    cpu_set_t mask;
    CPU_ZERO(&mask);
    CPU_SET(core_id, &mask);
    sched_setaffinity(0, sizeof(mask), &mask);
}

void run_ort_trt() {
  Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "test");
  const auto& api = Ort::GetApi();
  OrtTensorRTProviderOptionsV2* tensorrt_options;

  Ort::SessionOptions session_options;
  // session_options.SetIntraOpNumThreads(1);

  // session_options.SetGraphOptimizationLevel(GraphOptimizationLevel::ORT_ENABLE_EXTENDED);

  const char* model_path = "aaaa.onnx";
  std::cout << model_path << std::endl;

  //*****************************************************************************************
  // It's not suggested to directly new OrtTensorRTProviderOptionsV2 to get provider options
  //*****************************************************************************************
  //
  // auto tensorrt_options = get_default_trt_provider_options();
  // session_options.AppendExecutionProvider_TensorRT_V2(*tensorrt_options.get());

  //**************************************************************************************************************************
  // It's suggested to use CreateTensorRTProviderOptions() to get provider options
  // since ORT takes care of valid options for you
  //**************************************************************************************************************************
  // Ort::ThrowOnError(api.CreateTensorRTProviderOptions(&tensorrt_options));
  // std::unique_ptr<OrtTensorRTProviderOptionsV2, decltype(api.ReleaseTensorRTProviderOptions)> rel_trt_options(
  //     tensorrt_options, api.ReleaseTensorRTProviderOptions);
  // Ort::ThrowOnError(api.SessionOptionsAppendExecutionProvider_TensorRT_V2(static_cast<OrtSessionOptions*>(session_options),
  //                                                       rel_trt_options.get()));

  std::cout << "Running ORT TRT EP with default provider options" << std::endl;

  Ort::Session session(env, model_path, session_options);

  //*************************************************************************
  // print model input layer (node names, types, shape etc.)
  Ort::AllocatorWithDefaultOptions allocator;

  // print number of model input nodes
  const size_t num_input_nodes = session.GetInputCount();
  std::vector<Ort::AllocatedStringPtr> input_names_ptr;
  std::vector<const char*> input_node_names;
  input_names_ptr.reserve(num_input_nodes);
  input_node_names.reserve(num_input_nodes);
  std::vector<int64_t> input_node_dims;  // simplify... this model has only 1 input node {1, 3, 224, 224}.
                                         // Otherwise need vector<vector<>>

  std::cout << "Number of inputs = " << num_input_nodes << std::endl;

  const int height = 1024;
  const int width = 736;
  // iterate over all input nodes
  for (size_t i = 0; i < num_input_nodes; i++) {
    // print input node names
    auto input_name = session.GetInputNameAllocated(i, allocator);
    std::cout << "Input " << i << " : name =" << input_name.get() << std::endl;
    input_node_names.push_back(input_name.get());
    input_names_ptr.push_back(std::move(input_name));

    // print input node types
    auto type_info = session.GetInputTypeInfo(i);
    auto tensor_info = type_info.GetTensorTypeAndShapeInfo();

    ONNXTensorElementDataType type = tensor_info.GetElementType();
    std::cout << "Input " << i << " : type = " << type << std::endl;

    // print input shapes/dims
    input_node_dims = tensor_info.GetShape();

    input_node_dims[2] = height;
    input_node_dims[3] = width;

    std::cout << "Input " << i << " : num_dims = " << input_node_dims.size() << '\n';
    for (size_t j = 0; j < input_node_dims.size(); j++) {
      std::cout << "Input " << i << " : dim[" << j << "] =" << input_node_dims[j] << '\n';
    }
    std::cout << std::flush;
  }


  constexpr size_t input_tensor_size = height * width * 3;  // simplify ... using known dim values to calculate size
                                                       // use OrtGetTensorShapeElementCount() to get official size!

  std::vector<float> input_tensor_values(input_tensor_size);
  std::vector<const char*> output_node_names = {"output"};

  // initialize input data with values in [0.0, 1.0]
  for (unsigned int i = 0; i < input_tensor_size; i++) input_tensor_values[i] = (float)i / (input_tensor_size + 1);

  // create input tensor object from data values
  auto memory_info = Ort::MemoryInfo::CreateCpu(OrtArenaAllocator, OrtMemTypeDefault);
  std::cout << "Input 95" << '\n';
  auto input_tensor = Ort::Value::CreateTensor<float>(memory_info, input_tensor_values.data(), 
                      input_tensor_size,input_node_dims.data(), 4);
  std::cout << "Input 97" << '\n';
  assert(input_tensor.IsTensor());
  // score model & input tensor, get back output tensor

  std::vector<Ort::Value> output_tensors;
  for (unsigned int i = 0; i < 70; i++) {
    long long start = systemtime();
    output_tensors =
        session.Run(Ort::RunOptions{nullptr}, input_node_names.data(), &input_tensor, 1, output_node_names.data(), 1);
    std::cout << "Input: " << i << "  time: " << systemtime() - start << " ms" << '\n';
  }

  assert(output_tensors.size() == 1 && output_tensors.front().IsTensor());

  // Get pointer to output tensor float values
  float* floatarr = output_tensors.front().GetTensorMutableData<float>();
  
  // assert(abs(floatarr[0] - 0.000045) < 1e-6);

  // score the model, and print scores for first 5 classes
  for (int i = 0; i < 5; i++) {
    std::cout << "Score for class [" << i << "] =  " << floatarr[i] << '\n';
  }
  std::cout << std::flush;

  // Results should be as below...
  // Score for class[0] = 0.000045
  // Score for class[1] = 0.003846
  // Score for class[2] = 0.000125
  // Score for class[3] = 0.001180
  // Score for class[4] = 0.001317

  std::cout << "Done!" << std::endl;
}

int main(int /*argc*/, char*[]) {
  assignToThisCore(2);
  run_ort_trt();

  return 0;
}

allen20200111 (Author) commented:

> what should I do?
> I would suggest following the process and sharing with us the model, if possible, and a minimal test program that is used to measure performance.

Can you give me an email address? I will email the model to you; I cannot upload it here because of the GitHub size limit.

yuslepukhin (Member) commented Nov 16, 2022

> what should I do?
> I would suggest following the process and sharing with us the model, if possible, and a minimal test program that is used to measure performance.
> Can you give me an email address? I will email the model to you; I cannot upload it here because of the GitHub size limit.

I don't think it is a good assumption that email will accept files that are over the GitHub limit. People usually put the file in the cloud and share a link.

You specify ONNX Runtime version 1.6, yet in your example above you use some C++ APIs that only appeared recently. It is unlikely that we are going to issue any patches for 1.6.

We employ memory patterns and pre-allocation. You need to run the model at least 1-2 times as a warm-up before you measure performance, so that we do not keep allocating much more memory (provided you do not use dynamic shapes). Then compute the mean and variance/percentiles over the measured runs.
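
For illustration, a minimal sketch of that measurement protocol (my own example, not code from this thread; the helper name MeasureLatency and its parameters are assumptions): a few untimed warm-up runs followed by timed runs whose mean and percentiles are reported. It assumes the session, input/output names, and input tensor are prepared exactly as in the repro program above.

#include <onnxruntime/core/session/onnxruntime_cxx_api.h>

#include <algorithm>
#include <chrono>
#include <iostream>
#include <numeric>
#include <vector>

// Time `timed_runs` inferences after `warmup_runs` untimed warm-up calls,
// then print the mean, median, and 95th-percentile latency.
void MeasureLatency(Ort::Session& session,
                    const std::vector<const char*>& input_names,
                    Ort::Value& input_tensor,
                    const std::vector<const char*>& output_names,
                    int warmup_runs = 2, int timed_runs = 70) {
  std::vector<double> latencies_ms;
  for (int i = 0; i < warmup_runs + timed_runs; ++i) {
    auto t0 = std::chrono::steady_clock::now();
    auto outputs = session.Run(Ort::RunOptions{nullptr}, input_names.data(), &input_tensor, 1,
                               output_names.data(), output_names.size());
    auto t1 = std::chrono::steady_clock::now();
    if (i >= warmup_runs)  // discard warm-up iterations from the statistics
      latencies_ms.push_back(std::chrono::duration<double, std::milli>(t1 - t0).count());
  }
  std::sort(latencies_ms.begin(), latencies_ms.end());
  const double mean =
      std::accumulate(latencies_ms.begin(), latencies_ms.end(), 0.0) / latencies_ms.size();
  std::cout << "mean: " << mean << " ms, p50: " << latencies_ms[latencies_ms.size() / 2]
            << " ms, p95: " << latencies_ms[latencies_ms.size() * 95 / 100] << " ms" << std::endl;
}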

yuslepukhin self-assigned this Nov 16, 2022
yuslepukhin added the core runtime label and removed the ep:CUDA label Nov 16, 2022
allen20200111 (Author) commented Nov 17, 2022

> We employ memory patterns and pre-allocation. You need to run the model at least 1-2 times as a warm-up before you measure performance, so that we do not keep allocating much more memory (provided you do not use dynamic shapes). Then compute the mean and variance/percentiles over the measured runs.

Thank you very much for your reply.

Link: https://pan.baidu.com/s/1r8i8YOAyz9yU3kxCk7gfZQ?pwd=j5d4
onnxruntime 1.13 behaves the same as 1.6; I have tested both 1.6 and 1.13.
With dynamic shapes, I find the model becomes more and more time-consuming, getting slower and slower.
With static shapes, as shown below, it is sometimes slower and sometimes faster:
Input: 0 time: 345 ms
Input: 1 time: 339 ms
Input: 2 time: 541 ms
Input: 3 time: 420 ms
Input: 4 time: 445 ms
Input: 5 time: 474 ms
Input: 6 time: 428 ms
Input: 7 time: 430 ms
Input: 8 time: 505 ms
Input: 9 time: 497 ms
Input: 10 time: 441 ms
Input: 11 time: 436 ms
Input: 12 time: 650 ms
Input: 13 time: 469 ms
Input: 14 time: 472 ms
Input: 15 time: 496 ms
Input: 16 time: 569 ms
Input: 17 time: 733 ms
Input: 18 time: 394 ms
Input: 19 time: 412 ms
Input: 20 time: 446 ms
Input: 21 time: 576 ms
Input: 22 time: 518 ms
Input: 23 time: 436 ms
Input: 24 time: 641 ms
Input: 25 time: 838 ms
Input: 26 time: 654 ms
Input: 27 time: 1109 ms
Input: 28 time: 486 ms
Input: 29 time: 422 ms
Input: 30 time: 1480 ms
Input: 31 time: 974 ms
Input: 32 time: 874 ms
Input: 33 time: 400 ms
Input: 34 time: 568 ms
Input: 35 time: 591 ms
Input: 36 time: 423 ms
Input: 37 time: 567 ms
Input: 38 time: 856 ms
Input: 39 time: 723 ms
Input: 40 time: 877 ms
Input: 41 time: 476 ms
Input: 42 time: 658 ms
Input: 43 time: 1864 ms
Input: 44 time: 860 ms
Input: 45 time: 1087 ms
Input: 46 time: 1360 ms
Input: 47 time: 417 ms
Input: 48 time: 1121 ms
Input: 49 time: 451 ms
Input: 50 time: 619 ms
Input: 51 time: 630 ms
Input: 52 time: 1494 ms
Input: 53 time: 502 ms
Input: 54 time: 663 ms
Input: 55 time: 1312 ms
Input: 56 time: 1068 ms
Input: 57 time: 435 ms
Input: 58 time: 1179 ms
Input: 59 time: 861 ms
Input: 60 time: 485 ms
Input: 61 time: 1991 ms
Input: 62 time: 482 ms
Input: 63 time: 818 ms
Input: 64 time: 488 ms
Input: 65 time: 606 ms
Input: 66 time: 993 ms
Input: 67 time: 628 ms
Input: 68 time: 608 ms
Input: 69 time: 689 ms
Input: 70 time: 818 ms
Input: 71 time: 627 ms
Input: 72 time: 833 ms
Input: 73 time: 947 ms
Input: 74 time: 1921 ms
Input: 75 time: 966 ms
Input: 76 time: 961 ms
Input: 77 time: 934 ms
Input: 78 time: 1532 ms
Input: 79 time: 2143 ms
Input: 80 time: 931 ms
Input: 81 time: 1378 ms
Input: 82 time: 1894 ms
Input: 83 time: 1297 ms
Input: 84 time: 1960 ms
Input: 85 time: 1567 ms
Input: 86 time: 1255 ms
Input: 87 time: 2298 ms

yuslepukhin (Member) commented:

Baidu seems to require a client installation to download the file, and that is something I am not willing to do on my work computer. Would it be possible to put this on MS OneDrive or Google Drive, or anything else that does not require a proprietary client installation? The size does not seem to be too big for the free tiers.

yuslepukhin (Member) commented:

One thing to suggest: since this is CPU-only, you do not need the memory arena.
options.DisableCpuMemArena();
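
For illustration, a minimal sketch of that suggestion (my own example, not code posted in this thread), combining DisableCpuMemArena() with the optimization level and the placeholder model path already used in the repro program above:

#include <onnxruntime/core/session/onnxruntime_cxx_api.h>

int main() {
  Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "test");

  Ort::SessionOptions options;
  options.DisableCpuMemArena();  // suggestion above: the default CPU EP does not need the arena
  options.SetGraphOptimizationLevel(GraphOptimizationLevel::ORT_ENABLE_ALL);

  const char* model_path = "aaaa.onnx";  // placeholder path from the repro program
  Ort::Session session(env, model_path, options);
  return 0;
}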
