[Bug] [microTVM] Truncated names of fused operators exceed CRT graph executor buffer lengths #8953

PhilippvK · 2021-09-07T19:02:42Z

I realized that two models (int8 quantized vww and aww of the MLPerf Tiny benchmark) which have worked fine using the graph executor a few month ago, stopped working with the latest TVM version.

Expected behavior

Running the builded models (on x86 for simplicity) via the CRT graph runtime should not result in any error messages, segmentation faults or other crashes.

Actual behavior

The following is printed to the terminal during the Initialization of the graph executor (e.g. while parsing the JSON graph):

Error: string size greater than buffer size (120).
Error at line 331, JSON object expect '}' or ',' but got '"'
Error at line 331, Expect '"' but reach end of line
Error at line 331, Expect ':' but get '"'
do not support key  
invalid format
failed to load an element in `nodes` field in graph executor node.
invalid format
Segmentation fault (core dumped)

The actual error is the the first line while the others just seem to be consequences of the previous problem.

Investigations

The big question to me was why is this only happening for these too models? As the error message is about string sizes I figured out, that this is related to the char func_name[120]; array in the struct TVMOpParam. (See include/tvm/runtime/crt/graph_executor.h)

Looking at the graph.json file I found out that there are in fact function names with a length slightly larger than 120 characters. I also get rid of the error temporarily by limiting the maximum number of fused ops (using relay.FuseOps.max_depth) which results in shorter function names.

Here is the code used by TVM to generate unique truncated function names by appending a hash:

    readable_name_stream_ << "fused";
    auto outputs = this->VisitExpr(prim_func->body);
    auto candidate_name = readable_name_stream_.str();
    constexpr static size_t kMaxFuncNameLength = 80;
    if (candidate_name.size() > kMaxFuncNameLength) {
      std::stringstream truncated_name;
      truncated_name << candidate_name.substr(0, kMaxFuncNameLength);
      truncated_name << "_" << std::hash<std::string>{}(candidate_name) << "_";
      candidate_name = truncated_name.str();
    }

Then I did the following calculation based on this (example) JSON func_name entry: tvmgen_default_fused_nn_conv2d_add_cast_multiply_add_right_shift_cast_add_clip_cast_clip_cast_s_9959535092109263429__2

120 (max allowed function length)
- 80 (truncated original function name)
- 19 (length of appended hash)
- 5 ('fused' prefix)
- 4 (total number of '_' characters)
=  12 (remaining number of characters for the model name INCLUDING the 'tvmgen_' prefix which requires another 7 characters)

This leads to the conclusion that the mod_name passed to the relay.build() function needs to have at most 5 characters which is suboptimal, especially considering that the default name seems to the default (e.g. tvmgen_default).

I suppose that the easiest possible fixes would be either

increasing the size of the aforementioned func_name[] array by a few characters
or reducing kMaxFuncNameLength by some characters.

In addition it might be good to limit the length of the model name which can be specified by the user to make sure that this does not break again for too long model names in the future.

Why does this only happen for a few models? - Most operator/function names afer the FuseOps transformation do not exceed the kMaxFuncNameLength and therefore do not need to be truncated.

Environment

OS: Ubuntu 20.04
Architecture: x86_64
Python version: 3.8.10
TVM version: based on commit 9b034d7 from yesterday
Target: MicroTVM (Standalone) using CRT graph executor/runtime (apps/bundle_deploy/bundle_static.c)
Script: See next section!

Steps to reproduce

Here is a branch I set up to reproduce this issue using the aforementioned TFLite models:

Step-by-Step instructions for reproducing the issue can be found here.

There is also a GitHub Action which runs the demonstration of the bug. The log output can be investigated here: https://github.com/PhilippvK/tvm/runs/3531514621?check_suite_focus=true

The text was updated successfully, but these errors were encountered:

PhilippvK · 2021-09-13T01:50:37Z

I just ran into another issued related to the limited maximum function name in the graph runtime. I got to use a TFLite model with tensor names which are 136 characters long. This exceeds the maximum allowed size of 120 characters and therefore leads to a crash at runtime. I am not very sure we are able to truncate the tensor names on the TVM-side one because the original names may be used for accessing model inputs and outputs in the target software. Increasing the buffer size can fix the issue at relatively small cost but if a default array size >120 is unwanted, it might be possible to make this configurable using crt_config.h instead of hardcoding the array size in graph_executor.h

I am aware of the fact that the standalone graph executor will be replaced by the AoT executor in the future for most use cases but as long it has still relevance, people might run into these issues as well.

areusch · 2021-09-19T18:38:16Z

cc @mehrdadh could you take a look?

mehrdadh · 2021-09-21T18:22:41Z

@PhilippvK I'm working on reproducing this today. Sorry for late response.

mehrdadh · 2021-09-21T19:30:14Z

@PhilippvK update: I built three models(kws, vww and ic) with zephyr using these settings and it works fine:

target_model = ZEPHYR_BOARDS[board]
    target = tvm.target.target.micro(
        target_model,
        options=[
            "-link-params=1",
            "--executor=aot",
            "--unpacked-api=1",
            "--interface-api=c",
        ],
    )

will try to replicate it for crt.

PhilippvK · 2021-09-21T19:46:45Z

Only graph runtime is affected, so AOT executor works fine.

Thank you so much for investigating this issue. It should also be reproducible for other models by increasing the length of model name passed to the build function by some amount. As the limit of 120 characters for a function name is hardcoded, an the model name prefix can be arbitrary there is no general solution for this. But the default model name should not lead to any issues in my opinion.

areusch · 2021-09-22T00:54:34Z

gotcha--okay so we are just exercising these models with AOT at the present moment. I think reducing kMaxFuncNameLength is not a great option because it will constrain all other use cases just to make the C runtime use case work.

as an aside for microTVM we are generally in favor of deprecating the GraphExecutor as soon as AOT reaches parity. i'm okay with making the func_name array length configurable in crt_config.h. i think that would address the issue without requiring more RAM for all GraphExecutor use cases. do you want to submit a PR for this?

PhilippvK · 2021-09-22T08:44:52Z

@areusch I agree, that a low-effort approach such as making the array length configure would be the best way to tackle this issue, given the low relevance of the graph executor in microTVMs future. Preparing a PR for that should be hopefully fairly simple, so I would be happy to do that.

PhilippvK · 2021-10-01T07:29:19Z

@areusch I was going to implement this and I found an inconsistency I want to get rid off first:

Two different constant values (80 & 120) are hardcoded at different locations in the TVM codebase:

char func_name[120]; in graph_executor.h: Makes sense to me
char key[20], value[120]; in graph_executor.c: Used by JSON parser for every field.
#define TVM_CRT_MAX_STRLEN_FUNCTION_NAME 80 in crt_config.h: Used to allocate enough memory in graph_executor.c to store names? However this seems to be unrelated to JSON parsing and rather affects the parsing process of the binary parameters for the model. (The parameter names stored in that names array have are much shorter than 80 characters, e.g. p35)
constexpr static size_t kMaxFuncNameLength = 80; in te_compiler_cache.cc: Used to limit the generated function names BEFORE before the tvmgen_model_name prefix is added to the string.

It makes sense to me that 1. and 2. match but however the name TVM_CRT_MAX_STRLEN_FUNCTION_NAME seems to be misleading as it is not really related to function names.

Please let me know if I am missing anything.

areusch · 2021-10-04T18:36:19Z

ah i think i see what happened--since we started including the mangling and post-TE-compiler refactor, we haven't updated TVM_CRT_MAX_STRLEN_FUNCTION_NAME to match the new world (where it no longer maps 1:1 to kMaxFuncNameLength). it seems like we could either

update TVM_CRT_MAX_STRLEN_FUNCTION_NAME to mean "max length of the name, after adding all the extra stuff"
introduce length computations like you did above everywhere.

i sort of prefer 1, and add a comment next to kMaxFuncNameLength to keep in sync with crt_config.h and vice versa. What do you think? this would be changing 80 to 120, then.

PhilippvK · 2021-10-10T02:46:36Z

@areusch You are more or less right.

As calculated above, 120 characters is actually not enough for worst case models using the default name prefix default. Therefore we do either have to increase the number even more, or make sure that the resulting names are at least 2 characters shorter.
Contrary to my expectations the constant TVM_CRT_MAX_STRLEN_FUNCTION_NAME is currently only used in a way, which which is somehow not related to function names at all aka. for storing parameter names such as p10, p35,... (See my last message)

The latter does not make sense to me, therefore it feels wrong to just increase TVM_CRT_MAX_STRLEN_FUNCTION_NAME even further because reserving 120 characters to store such small strings seems to be very inefficient.

areusch · 2021-10-12T00:19:23Z

@PhilippvK you are right, i think I see what happened here. #8380 was authored a long time ago and sync'd up several times, and i think as part of a "cleanup" these constants were changed from TVM_CRT_STRLEN_NAME. unfortunately looks like the cleanup was incomplete/wrong. it looks like there are also some places (e.g. https://github.com/apache/tvm/blob/main/src/runtime/crt/include/tvm/runtime/crt/internal/graph_executor/graph_executor.h#L63, but also grep for [120]) where we should have written TVM_CRT_MAX_STRLEN_FUNCTION_NAME instead. And, it looks like the places where we use TVM_CRT_MAX_STRLEN_FUNCTION_NAME are indeed parameters only.

So it seems like the "correct" thing to do is to define TVM_CRT_MAX_STRLEN_PARAM_NAME, replace the current usages of TVM_CRT_MAX_STRLEN_FUNCTION_NAME with TVM_CRT_MAX_STRLEN_PARAM_NAME, and then fix up any residual [120] to be TVM_CRT_MAX_STRLEN_FUNCTION_NAME. sorry this is a bit more involved than originally anticipated...

PhilippvK · 2021-10-12T01:58:47Z

So it seems like the "correct" thing to do is to define TVM_CRT_MAX_STRLEN_PARAM_NAME, replace the current usages of TVM_CRT_MAX_STRLEN_FUNCTION_NAME with TVM_CRT_MAX_STRLEN_PARAM_NAME, and then fix up any residual [120] to be TVM_CRT_MAX_STRLEN_FUNCTION_NAME. sorry this is a bit more involved than originally anticipated...

@areusch thanks for clearing things up. I should no be able to get rid of these inconsistencies in a PR. What would be a suitable value for TVM_CRT_MAX_STRLEN_PARAM_NAME? I mean, will they always be of the form p35,p36,... or would there be a good reason to have more than lets say 10 characters available for a single parameter?

Updates value of `TVM_CRT_MAX_STRLEN_FUNCTION_NAME` from `80` to `120` Replace all occurences of `[120]` with `[TVM_CRT_MAX_STRLEN_FUNCTION_NAME]` to maintain consistency and make the array lengths user-configurable. Introduces `TVM_CRT_MAX_STRLEN_PARAM_NAME` used for parameter names only Adds comments to `kMaxFuncNameLength` variabe in src/relay/backend/te_compiler_cache.cc making sure that the values are kept "in sync". (sort of) See apache#8953 for more context. The actual bug reported there however can only be fixed by increasing the TVM_CRT_MAX_STRLEN_FUNCTION_NAME to a value larger than the maximum possible truncated function name length (including prefixes and suffices) Example: 6 ['tvmgen' prefix length] + 7 ['default' model name length] + 5 ['fused' fused function name prefix length] + 80 [truncated function name length] + 19 [length of appended hash] + 4 [Number of '_' between components] = 121

areusch · 2021-10-14T04:11:58Z

@PhilippvK following up on the PR

Updates value of `TVM_CRT_MAX_STRLEN_FUNCTION_NAME` from `80` to `120` Replace all occurences of `[120]` with `[TVM_CRT_MAX_STRLEN_FUNCTION_NAME]` to maintain consistency and make the array lengths user-configurable. Introduces `TVM_CRT_MAX_STRLEN_PARAM_NAME` used for parameter names only Adds comments to `kMaxFuncNameLength` variabe in src/relay/backend/te_compiler_cache.cc making sure that the values are kept "in sync". (sort of) See apache#8953 for more context. The actual bug reported there however can only be fixed by increasing the TVM_CRT_MAX_STRLEN_FUNCTION_NAME to a value larger than the maximum possible truncated function name length (including prefixes and suffices) Example: 6 ['tvmgen' prefix length] + 7 ['default' model name length] + 5 ['fused' fused function name prefix length] + 80 [truncated function name length] + 19 [length of appended hash] + 4 [Number of '_' between components] = 121

* Clean up redundant code in graph_executor.cc How did these lines ended up here? * Fix inconsistencies in graph_executor function names handling Updates value of `TVM_CRT_MAX_STRLEN_FUNCTION_NAME` from `80` to `120` Replace all occurences of `[120]` with `[TVM_CRT_MAX_STRLEN_FUNCTION_NAME]` to maintain consistency and make the array lengths user-configurable. Introduces `TVM_CRT_MAX_STRLEN_PARAM_NAME` used for parameter names only Adds comments to `kMaxFuncNameLength` variabe in src/relay/backend/te_compiler_cache.cc making sure that the values are kept "in sync". (sort of) See #8953 for more context. The actual bug reported there however can only be fixed by increasing the TVM_CRT_MAX_STRLEN_FUNCTION_NAME to a value larger than the maximum possible truncated function name length (including prefixes and suffices) Example: 6 ['tvmgen' prefix length] + 7 ['default' model name length] + 5 ['fused' fused function name prefix length] + 80 [truncated function name length] + 19 [length of appended hash] + 4 [Number of '_' between components] = 121

mehrdadh · 2021-12-09T15:40:46Z

@areusch This is resolved. Can we close it?

areusch · 2021-12-09T20:00:05Z

yep let's close, @PhilippvK please re-open if you still see a problem here!

This can mitigate issues described in apache#8953 without increasing memory requirements

…e#9255) * Clean up redundant code in graph_executor.cc How did these lines ended up here? * Fix inconsistencies in graph_executor function names handling Updates value of `TVM_CRT_MAX_STRLEN_FUNCTION_NAME` from `80` to `120` Replace all occurences of `[120]` with `[TVM_CRT_MAX_STRLEN_FUNCTION_NAME]` to maintain consistency and make the array lengths user-configurable. Introduces `TVM_CRT_MAX_STRLEN_PARAM_NAME` used for parameter names only Adds comments to `kMaxFuncNameLength` variabe in src/relay/backend/te_compiler_cache.cc making sure that the values are kept "in sync". (sort of) See apache#8953 for more context. The actual bug reported there however can only be fixed by increasing the TVM_CRT_MAX_STRLEN_FUNCTION_NAME to a value larger than the maximum possible truncated function name length (including prefixes and suffices) Example: 6 ['tvmgen' prefix length] + 7 ['default' model name length] + 5 ['fused' fused function name prefix length] + 80 [truncated function name length] + 19 [length of appended hash] + 4 [Number of '_' between components] = 121

) This can mitigate issues described in #8953 without increasing memory requirements

…e#9255) * Clean up redundant code in graph_executor.cc How did these lines ended up here? * Fix inconsistencies in graph_executor function names handling Updates value of `TVM_CRT_MAX_STRLEN_FUNCTION_NAME` from `80` to `120` Replace all occurences of `[120]` with `[TVM_CRT_MAX_STRLEN_FUNCTION_NAME]` to maintain consistency and make the array lengths user-configurable. Introduces `TVM_CRT_MAX_STRLEN_PARAM_NAME` used for parameter names only Adds comments to `kMaxFuncNameLength` variabe in src/relay/backend/te_compiler_cache.cc making sure that the values are kept "in sync". (sort of) See apache#8953 for more context. The actual bug reported there however can only be fixed by increasing the TVM_CRT_MAX_STRLEN_FUNCTION_NAME to a value larger than the maximum possible truncated function name length (including prefixes and suffices) Example: 6 ['tvmgen' prefix length] + 7 ['default' model name length] + 5 ['fused' fused function name prefix length] + 80 [truncated function name length] + 19 [length of appended hash] + 4 [Number of '_' between components] = 121

…ache#9787) This can mitigate issues described in apache#8953 without increasing memory requirements

PhilippvK added the type: bug label Sep 7, 2021

PhilippvK mentioned this issue Oct 12, 2021

Fix inconsistencies in graph_executor function names handling #9255

Merged

areusch closed this as completed Dec 9, 2021

rafzi pushed a commit to rafzi/tvm that referenced this issue Dec 21, 2021

te_compiler_cache: reduce name length without loss of information

fc00340

This can mitigate issues described in apache#8953 without increasing memory requirements

rafzi mentioned this issue Dec 21, 2021

te_compiler_cache: reduce name length without loss of information #9787

Merged

masahi pushed a commit that referenced this issue Jan 10, 2022

te_compiler_cache: reduce name length without loss of information (#9787

7481a29

) This can mitigate issues described in #8953 without increasing memory requirements

Raghav-Chakravarthy pushed a commit to Raghav-Chakravarthy/tvm that referenced this issue Jan 28, 2022

te_compiler_cache: reduce name length without loss of information (ap…

f82a3bc

…ache#9787) This can mitigate issues described in apache#8953 without increasing memory requirements

ylc pushed a commit to ylc/tvm that referenced this issue Feb 16, 2022

te_compiler_cache: reduce name length without loss of information (ap…

ba6cd7a

…ache#9787) This can mitigate issues described in apache#8953 without increasing memory requirements

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Bug] [microTVM] Truncated names of fused operators exceed CRT graph executor buffer lengths #8953

[Bug] [microTVM] Truncated names of fused operators exceed CRT graph executor buffer lengths #8953

PhilippvK commented Sep 7, 2021

PhilippvK commented Sep 13, 2021

areusch commented Sep 19, 2021

mehrdadh commented Sep 21, 2021

mehrdadh commented Sep 21, 2021 •

edited

Loading

PhilippvK commented Sep 21, 2021 •

edited

Loading

areusch commented Sep 22, 2021

PhilippvK commented Sep 22, 2021

PhilippvK commented Oct 1, 2021

areusch commented Oct 4, 2021

PhilippvK commented Oct 10, 2021

areusch commented Oct 12, 2021

PhilippvK commented Oct 12, 2021

areusch commented Oct 14, 2021

mehrdadh commented Dec 9, 2021

areusch commented Dec 9, 2021

[Bug] [microTVM] Truncated names of fused operators exceed CRT graph executor buffer lengths #8953

[Bug] [microTVM] Truncated names of fused operators exceed CRT graph executor buffer lengths #8953

Comments

PhilippvK commented Sep 7, 2021

Expected behavior

Actual behavior

Investigations

Environment

Steps to reproduce

PhilippvK commented Sep 13, 2021

areusch commented Sep 19, 2021

mehrdadh commented Sep 21, 2021

mehrdadh commented Sep 21, 2021 • edited Loading

PhilippvK commented Sep 21, 2021 • edited Loading

areusch commented Sep 22, 2021

PhilippvK commented Sep 22, 2021

PhilippvK commented Oct 1, 2021

areusch commented Oct 4, 2021

PhilippvK commented Oct 10, 2021

areusch commented Oct 12, 2021

PhilippvK commented Oct 12, 2021

areusch commented Oct 14, 2021

mehrdadh commented Dec 9, 2021

areusch commented Dec 9, 2021

mehrdadh commented Sep 21, 2021 •

edited

Loading

PhilippvK commented Sep 21, 2021 •

edited

Loading