-
Notifications
You must be signed in to change notification settings - Fork 628
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
AOT call stack optimizations #3773
Conversation
36c0ed8
to
81be41a
Compare
@wenyongh I'm not 100% sure if I haven't missed any places where I should release the frame so I'd appreciate if you feedback especially in that area. Thanks |
spec tests are failing on nuttx, likely unrelated to this issue and I hope this can fix it: #3771 |
81be41a
to
39fc6de
Compare
core/iwasm/aot/aot_runtime.c
Outdated
* because frame allocation is not part of the function) or the function | ||
* within the module (in that case the function itself is responsible for | ||
* allocating / freeing the frame), so we need to check it at runtime. | ||
*/ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the frame of the import function isn't allocated, there may be several issues: (1) the dumped call stacks may be not complete, (2) for GC, the native function's parameters are not committed frame and AOT runtime can not traversing them and add them to GC's root set, GC reclaim process may be invalid, (3) for performance profiling, the execution time of the import function isn't found.
Can we allocate the import function's frame before calling the function and freeing after calling the function? Like what current code does for both import and non-import functions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we allocate the import function's frame before calling the function and freeing after calling the function? Like what current code does for both import and non-import functions.
@wenyongh that's exactly what happens now; for call
opcode we have:
wasm-micro-runtime/core/iwasm/compilation/aot_emit_function.c
Lines 1485 to 1503 in eb4d5bb
if (comp_ctx->aux_stack_frame_type) { | |
if (func_idx < import_func_count | |
&& comp_ctx->call_stack_features.frame_per_function) { | |
if (!aot_alloc_frame_per_function_frame_for_aot_func( | |
comp_ctx, func_ctx, func_idx)) { | |
return false; | |
} | |
} | |
else if (!comp_ctx->call_stack_features.frame_per_function) { | |
if (comp_ctx->aux_stack_frame_type | |
!= AOT_STACK_FRAME_TYPE_STANDARD) { | |
aot_set_last_error("unsupported mode"); | |
return false; | |
} | |
if (!alloc_frame_for_aot_func(comp_ctx, func_ctx, func_idx)) { | |
return false; | |
} | |
} | |
} |
so the frame will be allocated using aot_alloc_frame_per_function_frame_for_aot_func
if this is imported function. For call_indirect
there's always a call to aot_alloc_frame
(there's no condition that checks if it's frame-per-function
mode), and in the aot_alloc_frame
the frame allocation is only skipped for non-imported function; otherwise it's allocated:
wasm-micro-runtime/core/iwasm/aot/aot_runtime.c
Lines 3803 to 3822 in eb4d5bb
bool | |
aot_alloc_frame(WASMExecEnv *exec_env, uint32 func_index) | |
{ | |
AOTModule *module = | |
(AOTModule *)((AOTModuleInstance *)exec_env->module_inst)->module; | |
if (is_frame_per_function(exec_env) | |
&& func_index >= module->import_func_count) { | |
/* in frame per function mode the frame is allocated at | |
the beginning of each frame, so we only need to allocate | |
the frame for imported functions */ | |
return true; | |
} | |
if (is_tiny_frame(exec_env)) { | |
return aot_alloc_tiny_frame(exec_env, func_index); | |
} | |
else { | |
return aot_alloc_standard_frame(exec_env, func_index); | |
} | |
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, but I think we can add check in the compiled LLVM IR of opcode call_indirect: compare the function index with import_func_count, and call aot_alloc_frame if func_idx < import_func_count. It doesn't necessary always calling aot_alloc_frame.
And seems no need to add the flag comp_ctx->call_stack_features.frame_per_function
, my idea is that: (1) in the new code of aot compiler, for both full frame and tiny frame, (a) always allocate frame in the beginning of a non-imported function, (b) always allocate frame before calling the imported function (in both op call and op call_indirect), (c) always emit the WASM_FEATURE_FRAME_PER_FUNCTION flag in the aot file. This simplify the aot compilation and seems no need to aot_alloc_import_frame.
(2) in aot runtime, (a) always aot_alloc_frame before calling import function, (b) aot_alloc_frame before calling non-import function if WASM_FEATURE_FRAME_PER_FUNCTION flag isn't detected, so as to keep backward compatibility.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it should be simpler, please correct me if I am wrong or there is something I missed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, but I think we can add check in the compiled LLVM IR of opcode call_indirect: compare the function index with import_func_count, and call aot_alloc_frame if func_idx < import_func_count. It doesn't necessary always calling aot_alloc_frame.
That's I think possible, I thought it will add slightly more code. However, I just noticed the comparison is already generated in AOT and there are two blocks for import and non-import function, so shouldn't bloat the code.
And seems no need to add the flag comp_ctx->call_stack_features.frame_per_function
Yes, it could potentially be removed; however, I didn't do it (just yet) due to multimodule / dynamic linking - if we always use frame-per-function mode, the callstack will contain duplicates. So I think we should keep both modes in the compiler for now until we figure out how to do it correctly; there are few options that come to my mind:
- explicitly expose frame-per-function as parameter to wamrc, so users can choose the mode
- allow users to define which functions are provided by another module and which functions are imported from native or modules without instrumentation
Either is fine for me, but also I don't think any of the option needs to be implemented right now and can be added in the future without breaking the ABI. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, for me the first option (expose frame-per-function as parameter to wamrc) is better and can be implemented in the future. The latter may be a little complex for developer to choose which functions?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
agree, I'll make it a follow-up task for the #3758
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@wenyongh I also removed aot_free_import_frame
function and moved the allocation of the frame for imported function to the AOT code. Let me know if you have any questions.
eb4d5bb
to
32f09a1
Compare
- implemented TINY / STANDARD frame modes - tiny mode is only able to keep track on the IP and func idx, STANDARD mode provides more capabilities (parameters, stack pointer etc) - implemented FRAME_PER_FUNCTION / FRAME_PER_CALL modes - frame per function adds code at the beginning and at the end of each function for allocating / deallocating stack frame, whereas in per-call mode the frame is allocated before each call. The exception is call to imported function, where frame-per-function mode also allocates the stack before the `call` instruction (as it can't instrument the imported function) At the moment TINY + FRAME_PER_FUNCTION is automatically enabled in case GC or perf profiling are disabled and `values` call stack feature is not requested. In all the other cases STANDARD + FRAME_PER_CALL is used. STANDARD + FRAME_PER_FUNCTION and TINY + FRAME_PER_CALL are currently not implemented but possible, and might be enabled in the future.
…e, instead of doing it in runtime
32f09a1
to
e5ae586
Compare
d168044
to
94c5df0
Compare
94c5df0
to
4dae5a7
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM with minor issues.
4dae5a7
to
b0dc668
Compare
call
instruction (as it can't instrument the imported function)At the moment TINY + FRAME_PER_FUNCTION is automatically enabled in case GC or perf profiling are disabled and
values
call stack feature is not requested. In all the other cases STANDARD + FRAME_PER_CALL is used.STANDARD + FRAME_PER_FUNCTION and TINY + FRAME_PER_CALL are currently not implemented but possible, and might be enabled in the future.
#3758