[Performance] Inference gets slower after about 10 runs on an Intel CPU when executing onnxruntime inference in a loop #13651

Open
allen20200111 opened this issue Nov 15, 2022 · 7 comments

allen20200111 commented Nov 15, 2022

Describe the issue

I have one onnxruntime session running on an Intel CPU:
(1) At first, the total inference time is about 200 ms.
(2) After many runs, a single inference takes more than 10 s.

Inference gets slower after some number of runs when I execute onnxruntime inference in a loop. Maybe the thread pool is blocking execution? What should I do?

When the process is blocked, I attach gdb:
0x00007efd417a01a9 in onnxruntime::concurrency::ThreadPool::RunInParallel(std::function<void (unsigned int)>, unsigned int) () from /usr/local/lib64/libonnxruntime.so.1.6.0
Missing separate debuginfos, use: debuginfo-install libgcc-4.8.5-44.el7.x86_64 libstdc++-4.8.5-44.el7.x86_64 libuuid-2.23.2-65.el7.x86_64
(gdb) bt
#0 0x00007efd417a01a9 in onnxruntime::concurrency::ThreadPool::RunInParallel(std::function<void (unsigned int)>, unsigned int) () from /usr/local/lib64/libonnxruntime.so.1.6.0
#1 0x00007efd417a05ce in onnxruntime::concurrency::ThreadPool::ParallelForFixedBlockSizeScheduling(long, long, std::function<void (long, long)> const&) () from /usr/local/lib64/libonnxruntime.so.1.6.0
#2 0x00007efd417a06a5 in onnxruntime::concurrency::ThreadPool::SimpleParallelFor(long, std::function<void (long)> const&) () from /usr/local/lib64/libonnxruntime.so.1.6.0
#3 0x00007efd417ef558 in MlasExecuteThreaded(void (*)(void*, int), void*, int, onnxruntime::concurrency::ThreadPool*) () from /usr/local/lib64/libonnxruntime.so.1.6.0
#4 0x00007efd417b98fc in MlasNchwcConv(long const*, long const*, long const*, long const*, long const*, long const*, unsigned long, float const*, float const*, float const*, float*, MLAS_ACTIVATION const*, bool, onnxruntime::concurrency::ThreadPool*) () from /usr/local/lib64/libonnxruntime.so.1.6.0

To reproduce

Execute onnxruntime inference in a loop on an Intel CPU with one onnxruntime session:
(1) At first, the total time is about 200 ms.
(2) After many test iterations, it takes more than 10 s.

Session options:
Ort::SessionOptions session_options;
session_options.SetGraphOptimizationLevel(GraphOptimizationLevel::ORT_ENABLE_ALL);
m_session = new Ort::Experimental::Session(m_env, model_path, session_options);

Urgency

This issue has blocked my project for two months; please give some help. Thanks.

Platform

Linux

OS Version

CentOS Linux release 7.8.2003 (Core)

ONNX Runtime Installation

build from source

ONNX Runtime Version or Commit ID

1.6.0

ONNX Runtime API

C++

Architecture

X64

Execution Provider

Default CPU

Execution Provider Library Version

cuda 10.2

Model File

No response

Is this a quantized model?

No

github-actions bot added the ep:CUDA label (issues related to the CUDA execution provider) Nov 15, 2022
yuslepukhin (Member) commented:

> what should I do?

I would suggest following the process and sharing with us the model, if possible, and a minimal test program that is used to measure performance.

allen20200111 (Author) commented Nov 16, 2022


#include <assert.h>
#include <onnxruntime/core/session/onnxruntime_cxx_api.h>

#include <iostream>
#include <vector>

#include<sys/timeb.h>
#include <sched.h> 

// Returns wall-clock time in milliseconds.
long long systemtime()
{
    timeb t;
    ftime(&t);
    return t.time * 1000 + t.millitm;
}

// Pin the calling process to a single CPU core.
inline void assignToThisCore(int core_id)
{
    cpu_set_t mask;
    CPU_ZERO(&mask);
    CPU_SET(core_id, &mask);
    sched_setaffinity(0, sizeof(mask), &mask);
}

void run_ort_trt() {
  Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "test");
  const auto& api = Ort::GetApi();
  OrtTensorRTProviderOptionsV2* tensorrt_options;

  Ort::SessionOptions session_options;
  // session_options.SetIntraOpNumThreads(1);

  // session_options.SetGraphOptimizationLevel(GraphOptimizationLevel::ORT_ENABLE_EXTENDED);

  const char* model_path = "aaaa.onnx";
  std::cout << model_path << std::endl;

  //*****************************************************************************************
  // It's not suggested to directly new OrtTensorRTProviderOptionsV2 to get provider options
  //*****************************************************************************************
  //
  // auto tensorrt_options = get_default_trt_provider_options();
  // session_options.AppendExecutionProvider_TensorRT_V2(*tensorrt_options.get());

  //**************************************************************************************************************************
  // It's suggested to use CreateTensorRTProviderOptions() to get provider options
  // since ORT takes care of valid options for you
  //**************************************************************************************************************************
  // Ort::ThrowOnError(api.CreateTensorRTProviderOptions(&tensorrt_options));
  // std::unique_ptr<OrtTensorRTProviderOptionsV2, decltype(api.ReleaseTensorRTProviderOptions)> rel_trt_options(
  //     tensorrt_options, api.ReleaseTensorRTProviderOptions);
  // Ort::ThrowOnError(api.SessionOptionsAppendExecutionProvider_TensorRT_V2(static_cast<OrtSessionOptions*>(session_options),
  //                                                       rel_trt_options.get()));

  std::cout << "Running ORT TRT EP with default provider options" << std::endl;

  Ort::Session session(env, model_path, session_options);

  //*************************************************************************
  // print model input layer (node names, types, shape etc.)
  Ort::AllocatorWithDefaultOptions allocator;

  // print number of model input nodes
  const size_t num_input_nodes = session.GetInputCount();
  std::vector<Ort::AllocatedStringPtr> input_names_ptr;
  std::vector<const char*> input_node_names;
  input_names_ptr.reserve(num_input_nodes);
  input_node_names.reserve(num_input_nodes);
  std::vector<int64_t> input_node_dims;  // simplify... this model has only 1 input node {1, 3, 224, 224}.
                                         // Otherwise need vector<vector<>>

  std::cout << "Number of inputs = " << num_input_nodes << std::endl;

  const int height = 1024;
  const int width = 736;
  // iterate over all input nodes
  for (size_t i = 0; i < num_input_nodes; i++) {
    // print input node names
    auto input_name = session.GetInputNameAllocated(i, allocator);
    std::cout << "Input " << i << " : name =" << input_name.get() << std::endl;
    input_node_names.push_back(input_name.get());
    input_names_ptr.push_back(std::move(input_name));

    // print input node types
    auto type_info = session.GetInputTypeInfo(i);
    auto tensor_info = type_info.GetTensorTypeAndShapeInfo();

    ONNXTensorElementDataType type = tensor_info.GetElementType();
    std::cout << "Input " << i << " : type = " << type << std::endl;

    // print input shapes/dims
    input_node_dims = tensor_info.GetShape();

    input_node_dims[2] = height;
    input_node_dims[3] = width;

    std::cout << "Input " << i << " : num_dims = " << input_node_dims.size() << '\n';
    for (size_t j = 0; j < input_node_dims.size(); j++) {
      std::cout << "Input " << i << " : dim[" << j << "] =" << input_node_dims[j] << '\n';
    }
    std::cout << std::flush;
  }


  constexpr size_t input_tensor_size = height * width * 3;  // simplify ... using known dim values to calculate size
                                                       // use OrtGetTensorShapeElementCount() to get official size!

  std::vector<float> input_tensor_values(input_tensor_size);
  std::vector<const char*> output_node_names = {"output"};

  // initialize input data with values in [0.0, 1.0]
  for (unsigned int i = 0; i < input_tensor_size; i++) input_tensor_values[i] = (float)i / (input_tensor_size + 1);

  // create input tensor object from data values
  auto memory_info = Ort::MemoryInfo::CreateCpu(OrtArenaAllocator, OrtMemTypeDefault);
  std::cout << "Input 95" << '\n';
  auto input_tensor = Ort::Value::CreateTensor<float>(memory_info, input_tensor_values.data(), 
                      input_tensor_size,input_node_dims.data(), 4);
  std::cout << "Input 97" << '\n';
  assert(input_tensor.IsTensor());
  // score model & input tensor, get back output tensor

  std::vector<Ort::Value> output_tensors;
  for (unsigned int i = 0; i < 70; i++) {
    long long start = systemtime();
    output_tensors =
        session.Run(Ort::RunOptions{nullptr}, input_node_names.data(), &input_tensor, 1, output_node_names.data(), 1);
    std::cout << "Input: " << i << "  time: " << systemtime() - start << " ms" << '\n';
  }

  assert(output_tensors.size() == 1 && output_tensors.front().IsTensor());

  // Get pointer to output tensor float values
  float* floatarr = output_tensors.front().GetTensorMutableData<float>();
  
  // assert(abs(floatarr[0] - 0.000045) < 1e-6);

  // score the model, and print scores for first 5 classes
  for (int i = 0; i < 5; i++) {
    std::cout << "Score for class [" << i << "] =  " << floatarr[i] << '\n';
  }
  std::cout << std::flush;

  // Results should be as below...
  // Score for class[0] = 0.000045
  // Score for class[1] = 0.003846
  // Score for class[2] = 0.000125
  // Score for class[3] = 0.001180
  // Score for class[4] = 0.001317

  std::cout << "Done!" << std::endl;
}

int main(int /*argc*/, char*[]) {
  assignToThisCore(2);
  run_ort_trt();

  return 0;
}

allen20200111 (Author) commented:

> what should I do?
> I would suggest following the process and sharing with us the model, if possible, and a minimal test program that is used to measure performance.

Can you give me an email address? I will email the model to you; I cannot upload it here because of the GitHub size limit.

yuslepukhin (Member) commented Nov 16, 2022

> what should I do?
> I would suggest following the process and sharing with us the model, if possible, and a minimal test program that is used to measure performance.
> Can you give me an email address? I will email the model to you; I cannot upload it here because of the GitHub size limit.

I don't think it is a good assumption that email will accept files that are over the GitHub limit. People usually put the file in the cloud and share a link.

You specify ONNX Runtime version 1.6, yet in your example above you use some C++ APIs that only appeared recently. It is unlikely that we are going to issue any patches for 1.6.

We employ memory patterns and pre-allocation. You need to run the model at least 1-2 times as a warm-up before you measure performance, so that we do not keep allocating much more memory (provided you do not use dynamic shapes). Then compute the mean and variance/percentiles over the measured runs.
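
For illustration, a minimal sketch of that measurement protocol (my own example, not code from this thread; the helper name MeasureLatency and its parameters are assumptions): a few untimed warm-up runs followed by timed runs whose mean and percentiles are reported. It assumes the session, input/output names, and input tensor are prepared exactly as in the repro program above.

#include <onnxruntime/core/session/onnxruntime_cxx_api.h>

#include <algorithm>
#include <chrono>
#include <iostream>
#include <numeric>
#include <vector>

// Time `timed_runs` inferences after `warmup_runs` untimed warm-up calls,
// then print the mean, median, and 95th-percentile latency.
void MeasureLatency(Ort::Session& session,
                    const std::vector<const char*>& input_names,
                    Ort::Value& input_tensor,
                    const std::vector<const char*>& output_names,
                    int warmup_runs = 2, int timed_runs = 70) {
  std::vector<double> latencies_ms;
  for (int i = 0; i < warmup_runs + timed_runs; ++i) {
    auto t0 = std::chrono::steady_clock::now();
    auto outputs = session.Run(Ort::RunOptions{nullptr}, input_names.data(), &input_tensor, 1,
                               output_names.data(), output_names.size());
    auto t1 = std::chrono::steady_clock::now();
    if (i >= warmup_runs)  // discard warm-up iterations from the statistics
      latencies_ms.push_back(std::chrono::duration<double, std::milli>(t1 - t0).count());
  }
  std::sort(latencies_ms.begin(), latencies_ms.end());
  const double mean =
      std::accumulate(latencies_ms.begin(), latencies_ms.end(), 0.0) / latencies_ms.size();
  std::cout << "mean: " << mean << " ms, p50: " << latencies_ms[latencies_ms.size() / 2]
            << " ms, p95: " << latencies_ms[latencies_ms.size() * 95 / 100] << " ms" << std::endl;
}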

yuslepukhin self-assigned this Nov 16, 2022
yuslepukhin added the core runtime label and removed the ep:CUDA label Nov 16, 2022
allen20200111 (Author) commented Nov 17, 2022

> We employ memory patterns and pre-allocation. You need to run the model at least 1-2 times as a warm-up before you measure performance, so that we do not keep allocating much more memory (provided you do not use dynamic shapes). Then compute the mean and variance/percentiles over the measured runs.

Thank you very much for your reply.

Link: https://pan.baidu.com/s/1r8i8YOAyz9yU3kxCk7gfZQ?pwd=j5d4
onnxruntime 1.13 behaves the same as 1.6; I have tested both 1.6 and 1.13.
With dynamic shapes, I find the model becomes more and more time-consuming, getting slower and slower.
With static shapes, as shown below, it is sometimes slower and sometimes faster:
Input: 0 time: 345 ms
Input: 1 time: 339 ms
Input: 2 time: 541 ms
Input: 3 time: 420 ms
Input: 4 time: 445 ms
Input: 5 time: 474 ms
Input: 6 time: 428 ms
Input: 7 time: 430 ms
Input: 8 time: 505 ms
Input: 9 time: 497 ms
Input: 10 time: 441 ms
Input: 11 time: 436 ms
Input: 12 time: 650 ms
Input: 13 time: 469 ms
Input: 14 time: 472 ms
Input: 15 time: 496 ms
Input: 16 time: 569 ms
Input: 17 time: 733 ms
Input: 18 time: 394 ms
Input: 19 time: 412 ms
Input: 20 time: 446 ms
Input: 21 time: 576 ms
Input: 22 time: 518 ms
Input: 23 time: 436 ms
Input: 24 time: 641 ms
Input: 25 time: 838 ms
Input: 26 time: 654 ms
Input: 27 time: 1109 ms
Input: 28 time: 486 ms
Input: 29 time: 422 ms
Input: 30 time: 1480 ms
Input: 31 time: 974 ms
Input: 32 time: 874 ms
Input: 33 time: 400 ms
Input: 34 time: 568 ms
Input: 35 time: 591 ms
Input: 36 time: 423 ms
Input: 37 time: 567 ms
Input: 38 time: 856 ms
Input: 39 time: 723 ms
Input: 40 time: 877 ms
Input: 41 time: 476 ms
Input: 42 time: 658 ms
Input: 43 time: 1864 ms
Input: 44 time: 860 ms
Input: 45 time: 1087 ms
Input: 46 time: 1360 ms
Input: 47 time: 417 ms
Input: 48 time: 1121 ms
Input: 49 time: 451 ms
Input: 50 time: 619 ms
Input: 51 time: 630 ms
Input: 52 time: 1494 ms
Input: 53 time: 502 ms
Input: 54 time: 663 ms
Input: 55 time: 1312 ms
Input: 56 time: 1068 ms
Input: 57 time: 435 ms
Input: 58 time: 1179 ms
Input: 59 time: 861 ms
Input: 60 time: 485 ms
Input: 61 time: 1991 ms
Input: 62 time: 482 ms
Input: 63 time: 818 ms
Input: 64 time: 488 ms
Input: 65 time: 606 ms
Input: 66 time: 993 ms
Input: 67 time: 628 ms
Input: 68 time: 608 ms
Input: 69 time: 689 ms
Input: 70 time: 818 ms
Input: 71 time: 627 ms
Input: 72 time: 833 ms
Input: 73 time: 947 ms
Input: 74 time: 1921 ms
Input: 75 time: 966 ms
Input: 76 time: 961 ms
Input: 77 time: 934 ms
Input: 78 time: 1532 ms
Input: 79 time: 2143 ms
Input: 80 time: 931 ms
Input: 81 time: 1378 ms
Input: 82 time: 1894 ms
Input: 83 time: 1297 ms
Input: 84 time: 1960 ms
Input: 85 time: 1567 ms
Input: 86 time: 1255 ms
Input: 87 time: 2298 ms

yuslepukhin (Member) commented:

Baidu seems to require a client installation to download the file, and that is something I am not willing to do on my work computer. Would it be possible to put this on MS OneDrive or Google Drive, or anything else that does not require a proprietary client installation? The size does not seem to be too big for the free tiers.

yuslepukhin (Member) commented:

One thing to suggest: since this is CPU-only, you do not need the memory arena.
options.DisableCpuMemArena();
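
For illustration, a minimal sketch of that suggestion (my own example, not code posted in this thread), combining DisableCpuMemArena() with the optimization level and the placeholder model path already used in the repro program above:

#include <onnxruntime/core/session/onnxruntime_cxx_api.h>

int main() {
  Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "test");

  Ort::SessionOptions options;
  options.DisableCpuMemArena();  // suggestion above: the default CPU EP does not need the arena
  options.SetGraphOptimizationLevel(GraphOptimizationLevel::ORT_ENABLE_ALL);

  const char* model_path = "aaaa.onnx";  // placeholder path from the repro program
  Ort::Session session(env, model_path, options);
  return 0;
}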
