Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bug] [microTVM] Truncated names of fused operators exceed CRT graph executor buffer lengths #8953

Closed
PhilippvK opened this issue Sep 7, 2021 · 15 comments

Comments

@PhilippvK
Copy link
Contributor

I realized that two models (int8 quantized vww and aww of the MLPerf Tiny benchmark) which have worked fine using the graph executor a few month ago, stopped working with the latest TVM version.

Expected behavior

Running the builded models (on x86 for simplicity) via the CRT graph runtime should not result in any error messages, segmentation faults or other crashes.

Actual behavior

The following is printed to the terminal during the Initialization of the graph executor (e.g. while parsing the JSON graph):

Error: string size greater than buffer size (120).
Error at line 331, JSON object expect '}' or ',' but got '"'
Error at line 331, Expect '"' but reach end of line
Error at line 331, Expect ':' but get '"'
do not support key  
invalid format
failed to load an element in `nodes` field in graph executor node.
invalid format
Segmentation fault (core dumped)

The actual error is the the first line while the others just seem to be consequences of the previous problem.

Investigations

The big question to me was why is this only happening for these too models? As the error message is about string sizes I figured out, that this is related to the char func_name[120]; array in the struct TVMOpParam. (See include/tvm/runtime/crt/graph_executor.h)

Looking at the graph.json file I found out that there are in fact function names with a length slightly larger than 120 characters. I also get rid of the error temporarily by limiting the maximum number of fused ops (using relay.FuseOps.max_depth) which results in shorter function names.

Here is the code used by TVM to generate unique truncated function names by appending a hash:

    readable_name_stream_ << "fused";
    auto outputs = this->VisitExpr(prim_func->body);
    auto candidate_name = readable_name_stream_.str();
    constexpr static size_t kMaxFuncNameLength = 80;
    if (candidate_name.size() > kMaxFuncNameLength) {
      std::stringstream truncated_name;
      truncated_name << candidate_name.substr(0, kMaxFuncNameLength);
      truncated_name << "_" << std::hash<std::string>{}(candidate_name) << "_";
      candidate_name = truncated_name.str();
    }

Then I did the following calculation based on this (example) JSON func_name entry: tvmgen_default_fused_nn_conv2d_add_cast_multiply_add_right_shift_cast_add_clip_cast_clip_cast_s_9959535092109263429__2

120 (max allowed function length)
- 80 (truncated original function name)
- 19 (length of appended hash)
- 5 ('fused' prefix)
- 4 (total number of '_' characters)
=  12 (remaining number of characters for the model name INCLUDING the 'tvmgen_' prefix which requires another 7 characters)

This leads to the conclusion that the mod_name passed to the relay.build() function needs to have at most 5 characters which is suboptimal, especially considering that the default name seems to the default (e.g. tvmgen_default).

I suppose that the easiest possible fixes would be either

  • increasing the size of the aforementioned func_name[] array by a few characters
  • or reducing kMaxFuncNameLength by some characters.

In addition it might be good to limit the length of the model name which can be specified by the user to make sure that this does not break again for too long model names in the future.

Why does this only happen for a few models? - Most operator/function names afer the FuseOps transformation do not exceed the kMaxFuncNameLength and therefore do not need to be truncated.

Environment

  • OS: Ubuntu 20.04
  • Architecture: x86_64
  • Python version: 3.8.10
  • TVM version: based on commit 9b034d7 from yesterday
  • Target: MicroTVM (Standalone) using CRT graph executor/runtime (apps/bundle_deploy/bundle_static.c)
  • Script: See next section!

Steps to reproduce

Here is a branch I set up to reproduce this issue using the aforementioned TFLite models:

Step-by-Step instructions for reproducing the issue can be found here.

There is also a GitHub Action which runs the demonstration of the bug. The log output can be investigated here: https://github.com/PhilippvK/tvm/runs/3531514621?check_suite_focus=true

@PhilippvK
Copy link
Contributor Author

I just ran into another issued related to the limited maximum function name in the graph runtime. I got to use a TFLite model with tensor names which are 136 characters long. This exceeds the maximum allowed size of 120 characters and therefore leads to a crash at runtime. I am not very sure we are able to truncate the tensor names on the TVM-side one because the original names may be used for accessing model inputs and outputs in the target software. Increasing the buffer size can fix the issue at relatively small cost but if a default array size >120 is unwanted, it might be possible to make this configurable using crt_config.h instead of hardcoding the array size in graph_executor.h

I am aware of the fact that the standalone graph executor will be replaced by the AoT executor in the future for most use cases but as long it has still relevance, people might run into these issues as well.

@areusch
Copy link
Contributor

areusch commented Sep 19, 2021

cc @mehrdadh could you take a look?

@mehrdadh
Copy link
Member

@PhilippvK I'm working on reproducing this today. Sorry for late response.

@mehrdadh
Copy link
Member

mehrdadh commented Sep 21, 2021

@PhilippvK update: I built three models(kws, vww and ic) with zephyr using these settings and it works fine:

target_model = ZEPHYR_BOARDS[board]
    target = tvm.target.target.micro(
        target_model,
        options=[
            "-link-params=1",
            "--executor=aot",
            "--unpacked-api=1",
            "--interface-api=c",
        ],
    )

will try to replicate it for crt.

@PhilippvK
Copy link
Contributor Author

PhilippvK commented Sep 21, 2021

Only graph runtime is affected, so AOT executor works fine.

Thank you so much for investigating this issue. It should also be reproducible for other models by increasing the length of model name passed to the build function by some amount. As the limit of 120 characters for a function name is hardcoded, an the model name prefix can be arbitrary there is no general solution for this. But the default model name should not lead to any issues in my opinion.

@areusch
Copy link
Contributor

areusch commented Sep 22, 2021

gotcha--okay so we are just exercising these models with AOT at the present moment. I think reducing kMaxFuncNameLength is not a great option because it will constrain all other use cases just to make the C runtime use case work.

as an aside for microTVM we are generally in favor of deprecating the GraphExecutor as soon as AOT reaches parity. i'm okay with making the func_name array length configurable in crt_config.h. i think that would address the issue without requiring more RAM for all GraphExecutor use cases. do you want to submit a PR for this?

@PhilippvK
Copy link
Contributor Author

@areusch I agree, that a low-effort approach such as making the array length configure would be the best way to tackle this issue, given the low relevance of the graph executor in microTVMs future. Preparing a PR for that should be hopefully fairly simple, so I would be happy to do that.

@PhilippvK
Copy link
Contributor Author

@areusch I was going to implement this and I found an inconsistency I want to get rid off first:

Two different constant values (80 & 120) are hardcoded at different locations in the TVM codebase:

  1. char func_name[120]; in graph_executor.h: Makes sense to me
  2. char key[20], value[120]; in graph_executor.c: Used by JSON parser for every field.
  3. #define TVM_CRT_MAX_STRLEN_FUNCTION_NAME 80 in crt_config.h: Used to allocate enough memory in graph_executor.c to store names? However this seems to be unrelated to JSON parsing and rather affects the parsing process of the binary parameters for the model. (The parameter names stored in that names array have are much shorter than 80 characters, e.g. p35)
  4. constexpr static size_t kMaxFuncNameLength = 80; in te_compiler_cache.cc: Used to limit the generated function names BEFORE before the tvmgen_model_name prefix is added to the string.

It makes sense to me that 1. and 2. match but however the name TVM_CRT_MAX_STRLEN_FUNCTION_NAME seems to be misleading as it is not really related to function names.

Please let me know if I am missing anything.

@areusch
Copy link
Contributor

areusch commented Oct 4, 2021

ah i think i see what happened--since we started including the mangling and post-TE-compiler refactor, we haven't updated TVM_CRT_MAX_STRLEN_FUNCTION_NAME to match the new world (where it no longer maps 1:1 to kMaxFuncNameLength). it seems like we could either

  1. update TVM_CRT_MAX_STRLEN_FUNCTION_NAME to mean "max length of the name, after adding all the extra stuff"
  2. introduce length computations like you did above everywhere.

i sort of prefer 1, and add a comment next to kMaxFuncNameLength to keep in sync with crt_config.h and vice versa. What do you think? this would be changing 80 to 120, then.

@PhilippvK
Copy link
Contributor Author

@areusch You are more or less right.

  1. As calculated above, 120 characters is actually not enough for worst case models using the default name prefix default. Therefore we do either have to increase the number even more, or make sure that the resulting names are at least 2 characters shorter.
  2. Contrary to my expectations the constant TVM_CRT_MAX_STRLEN_FUNCTION_NAME is currently only used in a way, which which is somehow not related to function names at all aka. for storing parameter names such as p10, p35,... (See my last message)

The latter does not make sense to me, therefore it feels wrong to just increase TVM_CRT_MAX_STRLEN_FUNCTION_NAME even further because reserving 120 characters to store such small strings seems to be very inefficient.

@areusch
Copy link
Contributor

areusch commented Oct 12, 2021

@PhilippvK you are right, i think I see what happened here. #8380 was authored a long time ago and sync'd up several times, and i think as part of a "cleanup" these constants were changed from TVM_CRT_STRLEN_NAME. unfortunately looks like the cleanup was incomplete/wrong. it looks like there are also some places (e.g. https://github.com/apache/tvm/blob/main/src/runtime/crt/include/tvm/runtime/crt/internal/graph_executor/graph_executor.h#L63, but also grep for [120]) where we should have written TVM_CRT_MAX_STRLEN_FUNCTION_NAME instead. And, it looks like the places where we use TVM_CRT_MAX_STRLEN_FUNCTION_NAME are indeed parameters only.

So it seems like the "correct" thing to do is to define TVM_CRT_MAX_STRLEN_PARAM_NAME, replace the current usages of TVM_CRT_MAX_STRLEN_FUNCTION_NAME with TVM_CRT_MAX_STRLEN_PARAM_NAME, and then fix up any residual [120] to be TVM_CRT_MAX_STRLEN_FUNCTION_NAME. sorry this is a bit more involved than originally anticipated...

@PhilippvK
Copy link
Contributor Author

So it seems like the "correct" thing to do is to define TVM_CRT_MAX_STRLEN_PARAM_NAME, replace the current usages of TVM_CRT_MAX_STRLEN_FUNCTION_NAME with TVM_CRT_MAX_STRLEN_PARAM_NAME, and then fix up any residual [120] to be TVM_CRT_MAX_STRLEN_FUNCTION_NAME. sorry this is a bit more involved than originally anticipated...

@areusch thanks for clearing things up. I should no be able to get rid of these inconsistencies in a PR. What would be a suitable value for TVM_CRT_MAX_STRLEN_PARAM_NAME? I mean, will they always be of the form p35,p36,... or would there be a good reason to have more than lets say 10 characters available for a single parameter?

PhilippvK added a commit to PhilippvK/tvm that referenced this issue Oct 12, 2021
Updates value of `TVM_CRT_MAX_STRLEN_FUNCTION_NAME` from `80` to `120`

Replace all occurences of `[120]` with `[TVM_CRT_MAX_STRLEN_FUNCTION_NAME]`
to maintain consistency and make the array lengths user-configurable.

Introduces `TVM_CRT_MAX_STRLEN_PARAM_NAME` used for parameter names only

Adds comments to `kMaxFuncNameLength` variabe in src/relay/backend/te_compiler_cache.cc
making sure that the values are kept "in sync". (sort of)

See apache#8953 for more context. The actual bug reported there however can
only be fixed by increasing the TVM_CRT_MAX_STRLEN_FUNCTION_NAME to a
value larger than the maximum possible truncated function name length
(including prefixes and suffices)

Example:

  6	['tvmgen' prefix length]
+ 7	['default' model name length]
+ 5	['fused' fused function name prefix length]
+ 80	[truncated function name length]
+ 19	[length of appended hash]
+ 4     [Number of '_' between components]
= 121
@areusch
Copy link
Contributor

areusch commented Oct 14, 2021

@PhilippvK following up on the PR

PhilippvK added a commit to PhilippvK/tvm that referenced this issue Oct 14, 2021
Updates value of `TVM_CRT_MAX_STRLEN_FUNCTION_NAME` from `80` to `120`

Replace all occurences of `[120]` with `[TVM_CRT_MAX_STRLEN_FUNCTION_NAME]`
to maintain consistency and make the array lengths user-configurable.

Introduces `TVM_CRT_MAX_STRLEN_PARAM_NAME` used for parameter names only

Adds comments to `kMaxFuncNameLength` variabe in src/relay/backend/te_compiler_cache.cc
making sure that the values are kept "in sync". (sort of)

See apache#8953 for more context. The actual bug reported there however can
only be fixed by increasing the TVM_CRT_MAX_STRLEN_FUNCTION_NAME to a
value larger than the maximum possible truncated function name length
(including prefixes and suffices)

Example:

  6	['tvmgen' prefix length]
+ 7	['default' model name length]
+ 5	['fused' fused function name prefix length]
+ 80	[truncated function name length]
+ 19	[length of appended hash]
+ 4     [Number of '_' between components]
= 121
PhilippvK added a commit to PhilippvK/tvm that referenced this issue Oct 23, 2021
Updates value of `TVM_CRT_MAX_STRLEN_FUNCTION_NAME` from `80` to `120`

Replace all occurences of `[120]` with `[TVM_CRT_MAX_STRLEN_FUNCTION_NAME]`
to maintain consistency and make the array lengths user-configurable.

Introduces `TVM_CRT_MAX_STRLEN_PARAM_NAME` used for parameter names only

Adds comments to `kMaxFuncNameLength` variabe in src/relay/backend/te_compiler_cache.cc
making sure that the values are kept "in sync". (sort of)

See apache#8953 for more context. The actual bug reported there however can
only be fixed by increasing the TVM_CRT_MAX_STRLEN_FUNCTION_NAME to a
value larger than the maximum possible truncated function name length
(including prefixes and suffices)

Example:

  6	['tvmgen' prefix length]
+ 7	['default' model name length]
+ 5	['fused' fused function name prefix length]
+ 80	[truncated function name length]
+ 19	[length of appended hash]
+ 4     [Number of '_' between components]
= 121
masahi pushed a commit that referenced this issue Oct 24, 2021
* Clean up redundant code in graph_executor.cc

How did these lines ended up here?

* Fix inconsistencies in graph_executor function names handling

Updates value of `TVM_CRT_MAX_STRLEN_FUNCTION_NAME` from `80` to `120`

Replace all occurences of `[120]` with `[TVM_CRT_MAX_STRLEN_FUNCTION_NAME]`
to maintain consistency and make the array lengths user-configurable.

Introduces `TVM_CRT_MAX_STRLEN_PARAM_NAME` used for parameter names only

Adds comments to `kMaxFuncNameLength` variabe in src/relay/backend/te_compiler_cache.cc
making sure that the values are kept "in sync". (sort of)

See #8953 for more context. The actual bug reported there however can
only be fixed by increasing the TVM_CRT_MAX_STRLEN_FUNCTION_NAME to a
value larger than the maximum possible truncated function name length
(including prefixes and suffices)

Example:

  6	['tvmgen' prefix length]
+ 7	['default' model name length]
+ 5	['fused' fused function name prefix length]
+ 80	[truncated function name length]
+ 19	[length of appended hash]
+ 4     [Number of '_' between components]
= 121
@mehrdadh
Copy link
Member

mehrdadh commented Dec 9, 2021

@areusch This is resolved. Can we close it?

@areusch
Copy link
Contributor

areusch commented Dec 9, 2021

yep let's close, @PhilippvK please re-open if you still see a problem here!

@areusch areusch closed this as completed Dec 9, 2021
rafzi pushed a commit to rafzi/tvm that referenced this issue Dec 21, 2021
This can mitigate issues described in apache#8953 without increasing memory requirements
ylc pushed a commit to ylc/tvm that referenced this issue Jan 7, 2022
…e#9255)

* Clean up redundant code in graph_executor.cc

How did these lines ended up here?

* Fix inconsistencies in graph_executor function names handling

Updates value of `TVM_CRT_MAX_STRLEN_FUNCTION_NAME` from `80` to `120`

Replace all occurences of `[120]` with `[TVM_CRT_MAX_STRLEN_FUNCTION_NAME]`
to maintain consistency and make the array lengths user-configurable.

Introduces `TVM_CRT_MAX_STRLEN_PARAM_NAME` used for parameter names only

Adds comments to `kMaxFuncNameLength` variabe in src/relay/backend/te_compiler_cache.cc
making sure that the values are kept "in sync". (sort of)

See apache#8953 for more context. The actual bug reported there however can
only be fixed by increasing the TVM_CRT_MAX_STRLEN_FUNCTION_NAME to a
value larger than the maximum possible truncated function name length
(including prefixes and suffices)

Example:

  6	['tvmgen' prefix length]
+ 7	['default' model name length]
+ 5	['fused' fused function name prefix length]
+ 80	[truncated function name length]
+ 19	[length of appended hash]
+ 4     [Number of '_' between components]
= 121
masahi pushed a commit that referenced this issue Jan 10, 2022
)

This can mitigate issues described in #8953 without increasing memory requirements
ylc pushed a commit to ylc/tvm that referenced this issue Jan 13, 2022
…e#9255)

* Clean up redundant code in graph_executor.cc

How did these lines ended up here?

* Fix inconsistencies in graph_executor function names handling

Updates value of `TVM_CRT_MAX_STRLEN_FUNCTION_NAME` from `80` to `120`

Replace all occurences of `[120]` with `[TVM_CRT_MAX_STRLEN_FUNCTION_NAME]`
to maintain consistency and make the array lengths user-configurable.

Introduces `TVM_CRT_MAX_STRLEN_PARAM_NAME` used for parameter names only

Adds comments to `kMaxFuncNameLength` variabe in src/relay/backend/te_compiler_cache.cc
making sure that the values are kept "in sync". (sort of)

See apache#8953 for more context. The actual bug reported there however can
only be fixed by increasing the TVM_CRT_MAX_STRLEN_FUNCTION_NAME to a
value larger than the maximum possible truncated function name length
(including prefixes and suffices)

Example:

  6	['tvmgen' prefix length]
+ 7	['default' model name length]
+ 5	['fused' fused function name prefix length]
+ 80	[truncated function name length]
+ 19	[length of appended hash]
+ 4     [Number of '_' between components]
= 121
Raghav-Chakravarthy pushed a commit to Raghav-Chakravarthy/tvm that referenced this issue Jan 28, 2022
…ache#9787)

This can mitigate issues described in apache#8953 without increasing memory requirements
ylc pushed a commit to ylc/tvm that referenced this issue Feb 16, 2022
…ache#9787)

This can mitigate issues described in apache#8953 without increasing memory requirements
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants