Getting TVM working with VE #24
Dear Samuel-san,

Thank you for letting us know about the problem. The develop branch supports only limited vector instructions at the moment, and that is what causes your problem. @simoll, do you have any information?
@kaz7 Interesting, it does seem to go through when disabling vectorized operations this way:

```python
with tvm.transform.PassContext(opt_level=3, config={"tir.disable_vectorize": True}, required_pass=["FastMath"]):
    compiled_lib = relay.build(mod, "llvm -mtriple=ve-linux", params=params)
    compiled_lib.export_library("libbertve.so")
```

Thanks for the input!
Now I'm getting a …
This is a remedy if you don't need vectorization or vector intrinsic instructions. You can disable vectorization completely by changing VESubtarget.cpp as shown below.
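The change itself would be along these lines. This is a hedged sketch only, assuming the subtarget exposes an enableVPU() predicate as upstream LLVM's VE target does:

```cpp
// Hypothetical sketch: force the subtarget to report that the VPU is
// unavailable so that no vector instructions are ever selected.
// VESubtarget::enableVPU() is assumed to match upstream LLVM's VE target.
bool VESubtarget::enableVPU() const {
  return false; // was: return EnableVPU;
}
```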
@kaz7 Thanks for the information. I'm getting …
Sorry to hear that. :-(
Please let me know how I can help debug this. For example, is there a way to enable an info dump of some kind?
Hi @saudet, could you try the hpce/develop branch? That one has vectorization and vector instruction selection. You will get the same SegFault as in the develop branch at runtime, but I'd like to hear of any compiler crashes with that branch.
@simoll Thanks for chipping in! I tried the hpce/develop branch …
Ok, I got something working. The cause of the …
Ah, here's the missing piece: BLAS. I was able to link it with …
I'm glad to hear that the problem is fixed. Regarding …
I tried with the …
Some more data points when using the …
We are currently merging the latest …
I'm attaching what I get from … In either case, TVM's vectorizer is disabled as described above. If you spot anything off in that, please let me know! Thanks
@simoll I've tried again with VPU using the latest code from … I'm compiling with and without VPU using the available LLVM options, as shown here: … That gives me these kinds of execution profiles: … In either case, "fused_nn_softmax" accounts for about 40% of the execution time, and "fused_nn_dense_add" for about 25%. The "fused_nn" part is typically just GEMM from BLAS, and we know NLC is still not super fast, but that doesn't explain why the "softmax" ones are so slow, even with VPU. That operation is mostly exp() and divisions, and I remember NCC having problems vectorizing these kinds of operations. What's the status of LLVM-VE in this regard?
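For reference, the inner loops of a softmax have essentially the following shape. This is a minimal scalar sketch, not TVM's generated code, but it shows why vectorized exp() and division throughput dominate here:

```c
#include <math.h>

/* Minimal scalar sketch of a softmax over n elements; TVM's fused kernel
 * differs in layout, but the hot operations are the same: one expf() and
 * one division per element. Compile with -lm. */
void softmax(const float *x, float *y, int n) {
  float max = x[0], sum = 0.0f;
  for (int i = 1; i < n; i++)     /* running maximum, for stability */
    if (x[i] > max) max = x[i];
  for (int i = 0; i < n; i++) {   /* the exponentials */
    y[i] = expf(x[i] - max);
    sum += y[i];
  }
  for (int i = 0; i < n; i++)     /* the divisions */
    y[i] /= sum;
}
```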
@simoll I'm starting to revisit this topic again, and I have a couple of additional questions. The code generated using TVM and the current …

```python
with tvm.transform.PassContext(opt_level=3, required_pass=["FastMath"]):
    compiled_lib = relay.build(mod, "llvm -mtriple=ve-linux -mattr=+vpu -libs=cblas,vednn", params=params)
```

https://github.com/saudet/tvm/blob/aurora/apps/howto_deploy/optimize_bert.py#L120

I've also made sure to set the RV_REPORT environment variable (to 1), and I get the following in the log, so I'm guessing it's doing something, but what does it mean? …
So, I've tried to hard-code …

Is this still expected not to work? Or am I doing something wrong?
I've started looking at RV more closely, and if my understanding is correct, there are no plans to support LLVM vector instructions, but it should be able to autovectorize scalar LLVM IR instructions. Since TVM doesn't offer a way to annotate loops with OpenMP pragmas and the like, I had to hack RV's code, but I've been able to get something going by setting the RV_FORCE_WIDTH environment variable to values between 2 and 256, and by forcing RV to accept given loops by setting this condition to true for specific … For reference, the loops that TVM generates for that BERT model have the following names: "for_body", "for_body2", "for_body2.us.us", "for_body5", "for_begin1.preheader", "for_begin1.preheader.us", "for_begin4.preheader", "for_begin7.preheader", "for_begin7.preheader.1", and "for_begin7.preheader.2". Now, the strange thing is that RV is able to vectorize all of these loops successfully (except one, as shown above), but even if only a single loop is vectorized (it doesn't matter which one), the resulting output at runtime becomes incorrect. Furthermore, for large values of RV_FORCE_WIDTH like 128 and 256, the program also crashes with "corrupted double-linked list", so memory is getting corrupted. For some reason, it sounds like vectorization causes the code to access incorrect areas of memory. Would someone have an idea of where to look next to debug this?
Ah, I figured out what the problem was. RV seems to generate incorrect code when the LLVM flag … @simoll Instead of silently generating incorrect code, RV should abort very loudly with a fatal error when … Now, after making sure …
Progress, at last!

The …
This is great! Can you provide an LLVM IR file for the loops that do not vectorize? It would help if you could provide reduced test cases that isolate the problem.
When you say "LLVM", do you mean the portion largely maintained by @kaz7?
Let's see: I've continued debugging this today, and I can't get "for_body" or "for_body2" to generate correct code even when pretty much everything is disabled. The debug output, that is, with RV_REPORT=1 and LV_DIAG=1, looks like below, which I believe includes the input IR code from TVM: …
Let me know if you need more than that, and how to obtain what you need, and I'll attach it! Other than that, I found that RV can vectorize "for_begin1.preheader", "for_begin1.preheader.us", "for_begin4.preheader", "for_begin7.preheader", "for_begin7.preheader.1", and "for_begin7.preheader.2" correctly when AVL is disabled, that is, with RV_DISABLE_AVL=1. When it's enabled, we get errors like this from LLVM at compile time: …
Those loops mainly contain gather and scatter operations, so it seems there's something funny going on when doing gather and scatter with AVL enabled. Disabling AVL, though, makes everything run more slowly, so it's not something we want to disable. Furthermore, they are the main loops of the "fused_take_transpose_contrib_reverse_reshape_transpose_1" function, which currently takes most of the execution time according to TVM's profiler, but they also appear before the main loops in other functions that still take a lot of time, mainly "fused_reshape_multiply_divide_fast_erf_add_multiply_contrib_reverse_reshape_1" and "fused_subtract_add_sqrt_divide_multiply_add_1", so I'll probably start debugging those myself and get them working first before anything else.
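For anyone reproducing these experiments, the invocations boil down to the following shape; the source file name is a placeholder, and the environment variables are the ones mentioned above:

```
$ RV_REPORT=1 LV_DIAG=1 RV_DISABLE_AVL=1 clang --target=ve-linux -O3 -fopenmp-simd kernel.c
```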
Ok, here's one for you @simoll. Let me know if you'd like me to post bugs like that somewhere else, and I will repost there! I've continued to test stuff, and found that RV is able to vectorize the loops in "fused_take_transpose_contrib_reverse_reshape_transpose_1" without crashing LLVM, so that one in particular does not generate the crash above, but it does crash with SIGBUS at runtime. Next, TVM has a backend that can generate C code as well, so I've used it to extract that "fused_take_transpose_contrib_reverse_reshape_transpose_1" function since it's the one I'd be most interested in getting working at the moment. After cleaning up irrelevant/unused generated code, adding a main() function for testing it, and putting that in a file named, say, fused_transpose_reshape.c, without modifying the actual loops, it looks like this:

```c
#include <stdint.h>
#include <stdlib.h>
#include <stdio.h>

int32_t fused_take_transpose_contrib_reverse_reshape_transpose(float *placeholder, float *T_transpose) {
  #pragma omp simd
  for (int32_t ax0_ax1_fused = 0; ax0_ax1_fused < 768; ++ax0_ax1_fused) {
    for (int32_t ax2_outer = 0; ax2_outer < 8; ++ax2_outer) {
      for (int32_t ax2_inner = 0; ax2_inner < 16; ++ax2_inner) {
        T_transpose[((((ax0_ax1_fused * 128) + (ax2_outer * 16)) + ax2_inner))] = placeholder[((((((ax2_outer * 36864) + (ax2_inner * 2304)) + ((ax0_ax1_fused >> 6) * 192)) + (ax0_ax1_fused & 63)) + 128))];
      }
    }
  }
  return 0;
}

int main() {
  float *in = (float*)malloc(1200000);
  float *out = (float*)malloc(400000);
  fused_take_transpose_contrib_reverse_reshape_transpose(in, out);
  printf("no crash \\(^^)/\n");
}
```

Compiling and running that using NCC and Clang, with and without RV, looks like this:

```
$ ncc -O3 fused_transpose_reshape.c; ./a.out
ncc: vec( 102): moo.c, line 7: Partially vectorized loop.
ncc: opt(1592): moo.c, line 8: Outer loop unrolled inside inner loop.: ax2_outer
ncc: vec( 101): moo.c, line 9: Vectorized loop.
no crash \(^^)/
$ clang --target=ve-linux -O3 fused_transpose_reshape.c; ./a.out
no crash \(^^)/
$ clang --target=ve-linux -O3 fused_transpose_reshape.c -fopenmp-simd; ./a.out
Segmentation fault
```

So, there's something wrong with the generated code. Let me know if you'd like me to simplify the loops a bit more, and I'll do it. Thanks in advance for looking into this!
Thanks! This example code has undefined behavior. The memory allocated by the …
Yeah, I've noticed Clang doing weird things when inlining functions, so I've just tried moving the main() function into another compilation unit to make sure, but the behavior is exactly the same: it crashes with RV and doesn't crash without RV.
@simoll Since I'm not hearing back from you, I've continued debugging today, and this is the simplest function that I could come up with that fails with RV (strided_loop.c):

```c
int strided_loop(int *x) {
  #pragma omp simd
  for (int i = 0; i < 256; ++i) {
    x[i * 2] = i;
  }
  return 0;
}
```

Along with a main() function in a separate compilation unit (strided_loop_main.c):

```c
#include <stdlib.h>
#include <stdio.h>

int main() {
  int *x = (int*)malloc(256 * 2 * sizeof(int));
  strided_loop(x);
  printf("no crash \\(^^)/\n");
}
```

Running that with NCC and Clang, without and with RV:

```
$ ncc -O3 strided_loop.c strided_loop_main.c; ./a.out
ncc: vec( 101): strided_loop.c, line 3: Vectorized loop.
no crash \(^^)/
$ clang --target=ve-linux -O3 strided_loop.c strided_loop_main.c; ./a.out
no crash \(^^)/
$ clang --target=ve-linux -O3 strided_loop.c strided_loop_main.c -fopenmp-simd; ./a.out
Bus error
```

So it looks like RV is unable to vectorize anything that doesn't access memory sequentially. When we only shift the address, like this:

```c
int strided_loop(int *x) {
  #pragma omp simd
  for (int i = 0; i < 512; i += 2) {
    x[i] = i;
  }
  return 0;
}
```

… Is this a known limitation? In any case, as usual, if I don't hear back from you I'll keep debugging by myself, but a little bit of help would be greatly appreciated. /cc @efocht
@simoll I've started peering over the assembly code generated by …

```
	.text
	.file	"strided_loop.c"
	.globl	strided_loop                    # -- Begin function strided_loop
	.p2align	4
	.type	strided_loop,@function
strided_loop:                           # @strided_loop
# %bb.0:
	lea %s1, 256
	lvl %s1
	vseq %v0
	or %s2, 2, (0)1
	vmuls.l %v0, %s2, %v0
	vmuls.l %v0, 4, %v0
	vadds.l %v0, %s0, %v0
	pvseq.lo %v1
	or %s0, 0, (0)1
	vscl %v0, %v1, 0, 0
	b.l.t (, %s10)
.Lfunc_end0:
	.size	strided_loop, .Lfunc_end0-strided_loop
                                        # -- End function
	.ident	"clang version 13.0.0 (https://github.com/sx-aurora-dev/llvm-project/ 629d3be80d3fcb33fb63719f5698014e307a303e)"
	.section	".note.GNU-stack","",@progbits
```

Notice that the arguments for VSCL are apparently inverted: %v0 holds the computed addresses and %v1 the values to store, yet %v0 is passed as the data operand. If I manually switch them to make them correct, I think, that is to say `vscl %v1, %v0, 0, 0`, …
Yes, that's a bug in the llvm-ve vector instruction selection. Thanks for following up on this and actually looking into the underlying issue! I've just checked against the ISA spec. I'll put a patch out on …
Reported by @saudet on github: #24 (comment)
@simoll Thank you for the fix! Now I'm trying to do the same thing with float arrays like this (strided_loop_float.c):

```c
int strided_loop(float *x) {
  #pragma omp simd
  for (int i = 0; i < 256; ++i) {
    x[i * 2] = i;
  }
  return 0;
}
```

And its main function (strided_loop_float_main.c):

```c
#include <stdlib.h>
#include <stdio.h>

int main() {
  float *x = (float*)malloc(256 * 2 * sizeof(float));
  strided_loop(x);
  printf("no crash \\(^^)/\n");
}
```

It kind of works, but only if we use NAS from NCC instead of the assembler from Clang:

```
$ clang --target=ve-linux -O3 strided_loop_float.c strided_loop_float_main.c -fopenmp-simd -S
$ ncc strided_loop_float.s strided_loop_float_main.s
$ ./a.out
no crash \(^^)/
```

However:

```
$ clang --target=ve-linux -O3 strided_loop_float.c strided_loop_float_main.c -fopenmp-simd -c
$ ncc strided_loop_float.o strided_loop_float_main.o
$ ./a.out
Segmentation fault
```

Any ideas what might be going on here?! I'll keep debugging, but I would very much welcome your insights!

EDIT: Actually, it does the same thing with the original with int arrays as well, so this doesn't have anything to do with floats, apparently. Something wrong with LLVM's assembler, I guess?
Ok, thanks @wmeddie. Here's what it looks like for the original strided_loop: …

The only difference that I can see with strided_loop.s is …
I've located where that comes from:

llvm-project/llvm/lib/Target/VE/VVPISelLowering.cpp, lines 416 to 418 in 29d1d34

The Scale that we get here is `i64 = TargetConstant<4>`, so I've tried to change that to see what happens. When we force it to `CDAG.getConstant(4, MVT::i64)`, we get `i64 = Constant<4>` instead, and the final output morphs into an `or %s2, 4, (0)1` followed by `vmuls.l %v0, %s2, %v0`, which is correct, for both `clang -S ...` and `clang -c ...`, which is weird. That way, though, strided_loop.c compiled, assembled, and linked with Clang runs successfully!

@simoll Does that sound like a good fix? I think it's weird that it would choke on a …
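For illustration, the hack amounts to something like the following. This is a hedged sketch only, not the actual code: `CDAG` is the custom-DAG helper used in VVPISelLowering.cpp, and `N` and `ScaleIdx` are placeholders for wherever the scale operand actually lives:

```cpp
// Hypothetical sketch of the workaround: rebuild the scale operand as a
// plain Constant node instead of the TargetConstant that the VE
// instruction selection currently chokes on.
SDValue Scale = N->getOperand(ScaleIdx);                   // i64 = TargetConstant<4>
if (auto *C = dyn_cast<ConstantSDNode>(Scale))
  Scale = CDAG.getConstant(C->getSExtValue(), MVT::i64);   // i64 = Constant<4>
```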
Well, when converting that TargetConstant to a Constant, almost everything else works just fine, so I've pushed that hack on my fork in commit saudet@71bb689. @simoll Please push a proper fix when you get the time! With that, I've been able to make some progress with TVM, as per the latest commit saudet/tvm@54c0d4d. As I mentioned above already, for that BERT model, it generates loops named: …

With the latest fixes, RV can successfully vectorize and accelerate all those loops! RV can also vectorize the following (smaller) loops, but it slows down execution for most of them, so we need another way to differentiate between them: …

Finally, some of the following loops crash LLVM at compile time, as per the issue described at the bottom of #24 (comment): …

To differentiate further the loops that can be accelerated from the ones that cannot, we can select all loops at the function level. Specifically, the loops in these functions become faster when vectorized: …

While the ones in these functions run more slowly when vectorized: …

And finally, these ones crash LLVM, ostensibly inside their for_begin4.preheader and for_begin4.preheader.us loops: …

These last two take less than 0.2% of the total execution time, so I'm happy to ignore them for now. :) When instructing RV to vectorize only the functions and loops that reduce execution time, it beats VML by quite a wide margin, for batch sizes of 1 and 32, respectively: …

According to TVM's profiler, with batch sizes of 32, about ~60% of the time is spent in BLAS, but that means there is still ~40% spent in code generated by LLVM and RV, a figure that isn't exactly negligible...
Given Aurora's hardware specs, compared to a CPU, it should be able to execute that BERT model 4~5 times faster, that is to say, process each sequence of length 128 in ~5 ms. So I've experimented to find out if we could expect more from autovectorization, by assuming that we can make LLVM as good as NCC in that department. The heaviest functions, after the ones using GEMM from BLAS, are fused_take_transpose_contrib_reverse_reshape_transpose and fused_reshape_multiply_divide_fast_erf_add_multiply_contrib_reverse_reshape, so I've extracted those 2 functions from the C backend of TVM, cleaned them up a bit, and put the following in a file named test_functions.c:

```c
#include <stdint.h>
#include <stdlib.h>
#include <stdio.h>

int32_t fused_take_transpose_contrib_reverse_reshape_transpose(float *placeholder, float *T_transpose) {
  #pragma omp simd
  for (int32_t ax0_ax1_fused = 0; ax0_ax1_fused < 4*768; ++ax0_ax1_fused) {
    for (int32_t ax2_outer = 0; ax2_outer < 8; ++ax2_outer) {
      for (int32_t ax2_inner = 0; ax2_inner < 16; ++ax2_inner) {
        T_transpose[((((ax0_ax1_fused * 128) + (ax2_outer * 16)) + ax2_inner))] = placeholder[((((((ax2_outer * 36864) + (ax2_inner * 2304)) + ((ax0_ax1_fused >> 6) * 192)) + (ax0_ax1_fused & 63)) + 128))];
      }
    }
  }
  return 0;
}

int32_t fused_reshape_multiply_divide_fast_erf_add_multiply_contrib_reverse_reshape(float *placeholder, float *T_reshape) {
  #pragma omp simd
  for (int32_t ax0 = 0; ax0 < 128; ++ax0) {
    for (int32_t ax1_outer = 0; ax1_outer < 192; ++ax1_outer) {
      for (int32_t ax1_inner = 0; ax1_inner < 16; ++ax1_inner) {
        float _1 = ((float*)placeholder)[((((ax0 * 3072) + (ax1_outer * 16)) + ax1_inner))] * 7.071068e-01f;
        float _2 = (_1) < (4.000000e+00f) ? (_1) : (4.000000e+00f);
        ((float*)T_reshape)[((((ax0 * 3072) + (ax1_outer * 16)) + ax1_inner))] = ((((float*)placeholder)[((((ax0 * 3072) + (ax1_outer * 16)) + ax1_inner))] * 5.000000e-01f) * (1.000000e+00f + ((((_2) > (-4.000000e+00f) ? (_2) : (-4.000000e+00f)) * (((((_2) > (-4.000000e+00f) ? (_2) : (-4.000000e+00f)) * ((_2) > (-4.000000e+00f) ? (_2) : (-4.000000e+00f))) * (((((_2) > (-4.000000e+00f) ? (_2) : (-4.000000e+00f)) * ((_2) > (-4.000000e+00f) ? (_2) : (-4.000000e+00f))) * (((((_2) > (-4.000000e+00f) ? (_2) : (-4.000000e+00f)) * ((_2) > (-4.000000e+00f) ? (_2) : (-4.000000e+00f))) * (((((_2) > (-4.000000e+00f) ? (_2) : (-4.000000e+00f)) * ((_2) > (-4.000000e+00f) ? (_2) : (-4.000000e+00f))) * (((((_2) > (-4.000000e+00f) ? (_2) : (-4.000000e+00f)) * ((_2) > (-4.000000e+00f) ? (_2) : (-4.000000e+00f))) * (((((_2) > (-4.000000e+00f) ? (_2) : (-4.000000e+00f)) * ((_2) > (-4.000000e+00f) ? (_2) : (-4.000000e+00f))) * -2.726142e-10f) + 2.770681e-08f)) + -2.101024e-06f)) + -5.692506e-05f)) + -7.349906e-04f)) + -2.954600e-03f)) + -1.609603e-02f)) / (((((_2) > (-4.000000e+00f) ? (_2) : (-4.000000e+00f)) * ((_2) > (-4.000000e+00f) ? (_2) : (-4.000000e+00f))) * (((((_2) > (-4.000000e+00f) ? (_2) : (-4.000000e+00f)) * ((_2) > (-4.000000e+00f) ? (_2) : (-4.000000e+00f))) * (((((_2) > (-4.000000e+00f) ? (_2) : (-4.000000e+00f)) * ((_2) > (-4.000000e+00f) ? (_2) : (-4.000000e+00f))) * (((((_2) > (-4.000000e+00f) ? (_2) : (-4.000000e+00f)) * ((_2) > (-4.000000e+00f) ? (_2) : (-4.000000e+00f))) * -1.456607e-05f) + -2.133740e-04f)) + -1.682827e-03f)) + -7.373329e-03f)) + -1.426474e-02f))));
      }
    }
  }
  return 0;
}
```

Along with this main() function in a file named test_main.c:

```c
#include <stdint.h>
#include <stdlib.h>
#include <stdio.h>

int main() {
  float *in = (float*)malloc(1600000);
  float *out = (float*)malloc(1600000);
  for (int i = 0; i < 10000; i++) {
    FUNCTION(in, out);
  }
  printf("no crash \\(^^)/\n");
}
```

And I compiled and ran them using Clang with RV and NCC: …

Surprisingly enough, it looks like RV is already better than NCC at vectorizing these loops! Unfortunately, I think that means we cannot expect to gain a lot more from autovectorization. To approach the speed enjoyed by CPUs and GPUs with TVM, we really need to use the high-level vector information that TVM gives us, via LLVM vector instructions. Right now, LLVM-VE is unable to process those vector instructions, and I am convinced that will prevent us from obtaining higher performance with TVM. A related problem is that NLC does not provide a properly optimized implementation of GEMM. However, once we have an LLVM backend that supports vector instructions, TVM can generate highly optimized GEMM kernels by itself, so that is not a big issue: …

@efocht @simoll @mikishin @kaz7 Are there plans to add support for at least the LLVM vector instructions needed by TVM?
Here's a run with …
Just for completeness, I've also done the same for a batch size of 32, with test_functions_32.c:

```c
#include <stdint.h>
#include <stdlib.h>
#include <stdio.h>

int32_t fused_take_transpose_contrib_reverse_reshape_transpose(float *placeholder, float *T_transpose) {
  #pragma omp simd
  for (int32_t ax0_ax1_fused = 0; ax0_ax1_fused < 24576; ++ax0_ax1_fused) {
    for (int32_t ax2_outer = 0; ax2_outer < 8; ++ax2_outer) {
      for (int32_t ax2_inner = 0; ax2_inner < 16; ++ax2_inner) {
        ((float*)T_transpose)[((((ax0_ax1_fused * 128) + (ax2_outer * 16)) + ax2_inner))] = ((float*)placeholder)[((((((ax2_outer * 1179648) + (ax2_inner * 73728)) + ((ax0_ax1_fused >> 6) * 192)) + (ax0_ax1_fused & 63)) + 128))];
      }
    }
  }
  return 0;
}

int32_t fused_reshape_multiply_divide_fast_erf_add_multiply_contrib_reverse_reshape(float *placeholder, float *T_reshape) {
  #pragma omp simd
  for (int32_t ax0 = 0; ax0 < 4096; ++ax0) {
    for (int32_t ax1_outer = 0; ax1_outer < 192; ++ax1_outer) {
      for (int32_t ax1_inner = 0; ax1_inner < 16; ++ax1_inner) {
        float _1 = ((float*)placeholder)[((((ax0 * 3072) + (ax1_outer * 16)) + ax1_inner))] * 7.071068e-01f;
        float _2 = (_1) < (4.000000e+00f) ? (_1) : (4.000000e+00f);
        ((float*)T_reshape)[((((ax0 * 3072) + (ax1_outer * 16)) + ax1_inner))] = ((((float*)placeholder)[((((ax0 * 3072) + (ax1_outer * 16)) + ax1_inner))] * 5.000000e-01f) * (1.000000e+00f + ((((_2) > (-4.000000e+00f) ? (_2) : (-4.000000e+00f)) * (((((_2) > (-4.000000e+00f) ? (_2) : (-4.000000e+00f)) * ((_2) > (-4.000000e+00f) ? (_2) : (-4.000000e+00f))) * (((((_2) > (-4.000000e+00f) ? (_2) : (-4.000000e+00f)) * ((_2) > (-4.000000e+00f) ? (_2) : (-4.000000e+00f))) * (((((_2) > (-4.000000e+00f) ? (_2) : (-4.000000e+00f)) * ((_2) > (-4.000000e+00f) ? (_2) : (-4.000000e+00f))) * (((((_2) > (-4.000000e+00f) ? (_2) : (-4.000000e+00f)) * ((_2) > (-4.000000e+00f) ? (_2) : (-4.000000e+00f))) * (((((_2) > (-4.000000e+00f) ? (_2) : (-4.000000e+00f)) * ((_2) > (-4.000000e+00f) ? (_2) : (-4.000000e+00f))) * (((((_2) > (-4.000000e+00f) ? (_2) : (-4.000000e+00f)) * ((_2) > (-4.000000e+00f) ? (_2) : (-4.000000e+00f))) * -2.726142e-10f) + 2.770681e-08f)) + -2.101024e-06f)) + -5.692506e-05f)) + -7.349906e-04f)) + -2.954600e-03f)) + -1.609603e-02f)) / (((((_2) > (-4.000000e+00f) ? (_2) : (-4.000000e+00f)) * ((_2) > (-4.000000e+00f) ? (_2) : (-4.000000e+00f))) * (((((_2) > (-4.000000e+00f) ? (_2) : (-4.000000e+00f)) * ((_2) > (-4.000000e+00f) ? (_2) : (-4.000000e+00f))) * (((((_2) > (-4.000000e+00f) ? (_2) : (-4.000000e+00f)) * ((_2) > (-4.000000e+00f) ? (_2) : (-4.000000e+00f))) * (((((_2) > (-4.000000e+00f) ? (_2) : (-4.000000e+00f)) * ((_2) > (-4.000000e+00f) ? (_2) : (-4.000000e+00f))) * -1.456607e-05f) + -2.133740e-04f)) + -1.682827e-03f)) + -7.373329e-03f)) + -1.426474e-02f))));
      }
    }
  }
  return 0;
}
```

And test_main_32.c:

```c
#include <stdint.h>
#include <stdlib.h>
#include <stdio.h>

int main() {
  float *in = (float*)malloc(1600000*32);
  float *out = (float*)malloc(1600000*32);
  for (int i = 0; i < 1000; i++) {
    FUNCTION(in, out);
  }
  printf("no crash \\(^^)/\n");
}
```

Compiled and executed like this: …

In this case, NCC performs even worse, so my conclusion is the same: I think I've done all we can from TVM's side. To increase performance further, I feel that we absolutely need better support for IR vector instructions in LLVM-VE. If anyone has a different opinion, please do let me know what you believe should be done next!
Strided VLD/VST in LLVM-VE should give us a good speedup here. What you got here is with VSC/VGT only.
Are you saying that LLVM-VE currently isn't generating any VLD/VST instructions for strided access? That sounds about right. RV is also generating VSC/VGT for this function, while NCC correctly generates VLD/VST:

```c
#include <stdint.h>
#include <stdlib.h>
#include <stdio.h>

int32_t fused_nn_batch_flatten(float *placeholder, float *tensor) {
  #pragma omp simd
  for (int32_t ax0 = 0; ax0 < 32; ++ax0) {
    for (int32_t ax1_outer = 0; ax1_outer < 48; ++ax1_outer) {
      for (int32_t ax1_inner = 0; ax1_inner < 16; ++ax1_inner) {
        ((float*)tensor)[((((ax0 * 768) + (ax1_outer * 16)) + ax1_inner))] = ((float*)placeholder)[((((ax0 * 768) + (ax1_outer * 16)) + ax1_inner))];
      }
    }
  }
  return 0;
}
```

However, testing that by putting the above in fused_nn_batch_flatten.c and this in fused_nn_batch_flatten_main.c:

```c
#include <stdint.h>
#include <stdlib.h>
#include <stdio.h>

int main() {
  float *in = (float*)malloc(100000);
  float *out = (float*)malloc(100000);
  for (int i = 0; i < 100000; i++) {
    fused_nn_batch_flatten(in, out);
  }
  printf("no crash \\(^^)/\n");
}
```

Still, NCC is only 2x faster: …

So, I don't see how that's going to give us a 4~5x overall speedup... That said, the x86 backend is using MKL, so that's already a lot better than NLC; but still, on a single CPU socket from aurora05, I'm getting ~403 ms for a batch size of 32, and with …
The inner loops run over 16 elements, 32 bits wide each: that's exactly 512 bits. And the memory accesses are mostly contiguous in the inner iteration variable, so I am not surprised AVX-512 is doing well here with LLVM.
It's a lot more than 512 bits wide. If you look at fused_reshape_multiply_divide_fast_erf_add_multiply_contrib_reverse_reshape(), for example, the index used for all loads and stores is …
Thanks for the tip @simoll! I was able to increase the size of almost all inner loops to at least 256 by increasing the default split factor to 256; see commit saudet/tvm@0694ae3. Anything larger than that, like 512, makes RV crash LLVM regardless of the value of RV_FORCE_WIDTH, so I'm guessing that even if RV did support values larger than 256, it wouldn't be optimal. Is my assumption correct? Anyway, with a factor of 256, the total execution time drops another notch to ~508 ms for a batch size of 32, and TVM now says that ~70% of that is spent in GEMM calls to NLC, so it's almost time to start thinking about getting rid of NLC... @mikishin BTW, are there any plans at NEC to accelerate GEMM calls in NLC? If optimized properly, we should be able to make it at least 4~5x faster than it currently is.
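For context, the split factor mentioned above maps to TVM's standard axis-splitting primitive. The snippet below is a minimal, self-contained sketch of the idea (not the code I actually patched), using a factor of 256 to match the VE vector length:

```python
import tvm
from tvm import te

# Minimal sketch: split the innermost axis by a factor of 256 (the VE
# vector length) and mark the inner part for vectorization, which is
# what raising the default split factor achieves for the generated loops.
n = 4096
A = te.placeholder((n,), name="A", dtype="float32")
B = te.compute((n,), lambda i: A[i] * 2.0, name="B")
s = te.create_schedule(B.op)
xo, xi = s[B].split(B.op.axis[0], factor=256)
s[B].vectorize(xi)
print(tvm.lower(s, [A, B], simple_mode=True))
```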
I've started looking at what we could do for GEMM with TVM by following this tutorial using the backend for VE:

```c
#include <stdint.h>
#include <stdlib.h>
#include <stdio.h>

int32_t mmult_baseline(float *A, float *B, float *C) {
  for (int32_t m = 0; m < 1024; m++) {
    for (int32_t n = 0; n < 1024; n++) {
      C[((m*1024) + n)] = 0;
      for (int32_t k = 0; k < 1024; k++) {
        C[((m*1024) + n)] = (C[((m*1024) + n)] + (A[((m*1024) + k)]*B[((k*1024) + n)]));
      }
    }
  }
  return 0;
}

int32_t mmult_tiled(float *A, float *B, float *C) {
  for (int32_t m_outer = 0; m_outer < 4; m_outer++) {
    for (int32_t n_outer = 0; n_outer < 4; n_outer++) {
      for (int32_t m_inner_init = 0; m_inner_init < 256; m_inner_init++) {
        for (int32_t n_inner_init = 0; n_inner_init < 256; n_inner_init++) {
          C[((((m_outer*262144) + (m_inner_init*1024)) + (n_outer*256)) + n_inner_init)] = 0;
        }
      }
      for (int32_t k_outer = 0; k_outer < 256; k_outer++) {
        for (int32_t k_inner = 0; k_inner < 4; k_inner++) {
          for (int32_t m_inner = 0; m_inner < 256; m_inner++) {
            for (int32_t n_inner = 0; n_inner < 256; n_inner++) {
              C[((((m_outer*262144) + (m_inner*1024)) + (n_outer*256)) + n_inner)] = (C[((((m_outer*262144) + (m_inner*1024)) + (n_outer*256)) + n_inner)] + (A[((((m_outer*262144) + (m_inner*1024)) + (k_outer*4)) + k_inner)]*B[((((k_outer*4096) + (k_inner*1024)) + (n_outer*256)) + n_inner)]));
            }
          }
        }
      }
    }
  }
  return 0;
}
```

Along with this main() function in mmult_main.c:

```c
#include <stdint.h>
#include <stdlib.h>
#include <stdio.h>

int main() {
  float *A = (float*)malloc(1024*1024*4);
  float *B = (float*)malloc(1024*1024*4);
  float *C = (float*)malloc(1024*1024*4);
  for (int i = 0; i < 100; i++) {
    MMULT(A, B, C);
  }
  printf("no crash \\(^^)/\n");
}
```

And compiled and executed that with GCC as well as NCC as follows: …

We can see from the above that NCC is also suffering from the same problem! I can only conclude that it is not possible to get compilers for VE to produce efficient code for tiled algorithms like this one, but I am probably just missing something. How can we achieve this without resorting to intrinsics, as is possible with both GCC and LLVM on the x86 backend?
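One thing that might be worth checking, and this is purely an assumption on my part, is whether pointer aliasing is what blocks vectorization of the accumulation: with plain float * arguments, the compiler must assume C may alias A and B. A restrict-qualified inner kernel rules that out:

```c
#include <stdint.h>

/* Hedged sketch: restrict tells the compiler that A, B, and C never
 * alias, removing one common obstacle to vectorizing the C[...]
 * accumulation in the tiled loop above. This shows one (k, m) step
 * of a tile, not the full kernel. */
static void inner_kernel(const float *restrict A, const float *restrict B,
                         float *restrict C) {
  for (int32_t n_inner = 0; n_inner < 256; n_inner++) {
    C[n_inner] += A[0] * B[n_inner];
  }
}
```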
Setting the vector width > 256 shouldn't crash RV or LLVM. If anything, you should be getting packed-mode instructions. I'm not expecting too much benefit from packed mode here, though, because there is no strided packed-mode vector load in the VE ISA. I do expect further improvements for LLVM-VE-RV with vector width == 256 once we support strided VLD. This should become available on the …
Ok, so I guess there's some bug in that packed mode. It's crashing in the same place in LegalizeVectorTypes.cpp that I describe above in #24 (comment), but this only happens when AVL is enabled.

Ok, sounds good, although right now I'm worried about GEMM. According to engineers at NEC, there is absolutely no way to make GEMM any faster than what is already implemented in NLC. The only other thing I can think of trying at the moment is running TVM itself sequentially, thus giving more rope to NLC (and RV) for vectorizing purposes, but launching 8 independent threads like that. I measured an increase in overall throughput of about 1.5x doing it that way, but that's about it. Unless someone can point me at something else to look at, it sounds like we're pretty close to the limit of what the hardware is capable of. The conclusion seems to be that we can achieve higher throughput with cheaper Xeon CPUs that consume less power than Aurora!
Important correction: the figure of ~403 ms mentioned above is actually for 2 sockets of Xeon CPUs (~1 TFLOPS each); I hadn't set the OpenMP thread affinity properly. When correctly limited to a single CPU, it's around ~722 ms, in other words a throughput of ~44 sequences per second. In the case of a single VE, we can get about 1.5 * 32 / 0.508 = 94 sequences per second, so we can say that VE is ~2x faster than the CPU...
So, from what I understood after a lengthy exchange with engineers at NEC, to obtain maximum performance on VE, data needs to be 128-byte aligned, but not 512-byte aligned, or we get bank conflicts. The details are still a bit fuzzy to me (it is very hard to get any information out of NEC), but needless to say, those are very unusual alignment requirements. On any other architecture I am aware of that needs something like 128-byte alignment, 512-byte alignment meets that criterion, so we're good to go. Consequently, support for a setting that avoids specific alignments isn't currently anywhere to be found in TVM. With enough resources, it should be possible to add support for this, and according to the microbenchmarking we've been doing, that should give us an overall ~1.5x boost in performance. And then I'm sure we'll capture the remaining "~0.5x" from future improvements in RV, getting us pretty close to 4 TFLOPS. However, it looks like this is going to require a sizeable investment to get us there. @mikishin
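To make the requirement concrete, here is a minimal sketch (entirely an illustration, not NEC's or TVM's code) of an allocator that returns 128-byte-aligned pointers that deliberately avoid 512-byte boundaries, which is the kind of knob TVM would need to expose:

```c
#include <stdlib.h>
#include <stdint.h>

/* Hypothetical illustration: return a pointer that is 128-byte aligned
 * but never 512-byte aligned, to dodge the bank conflicts described
 * above. The original pointer is stashed just before the returned
 * block so it can be released with ve_free(). */
static void *ve_alloc(size_t size) {
  void *raw = NULL;
  if (posix_memalign(&raw, 512, size + 512) != 0)
    return NULL;                        /* raw is 512-byte aligned */
  uint8_t *p = (uint8_t *)raw + 128;    /* 128-byte aligned, 128 past a 512-byte boundary */
  ((void **)p)[-1] = raw;               /* remember raw for ve_free() */
  return p;
}

static void ve_free(void *p) {
  if (p) free(((void **)p)[-1]);
}
```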
Found by msan -fsanitize-memory-use-after-dtor. ==8259==WARNING: MemorySanitizer: use-of-uninitialized-value #0 0x55dbec54d2b8 in dtorRecord(clang::interp::Block*, char*, clang::interp::Descriptor*) clang/lib/AST/Interp/Descriptor.cpp:150:22 #1 0x55dbec54bfcf in dtorArrayDesc(clang::interp::Block*, char*, clang::interp::Descriptor*) clang/lib/AST/Interp/Descriptor.cpp:97:7 #2 0x55dbec508578 in invokeDtor clang/lib/AST/Interp/InterpBlock.h:79:7 #3 0x55dbec508578 in clang::interp::Program::~Program() clang/lib/AST/Interp/Program.h:55:19 #4 0x55dbec50657a in operator() third_party/crosstool/v18/stable/toolchain/bin/../include/c++/v1/__memory/unique_ptr.h:55:5 #5 0x55dbec50657a in std::__msan::unique_ptr<clang::interp::Program, std::__msan::default_delete<clang::interp::Program>>::~unique_ptr() third_party/crosstool/v18/stable/toolchain/bin/../include/c++/v1/__memory/unique_ptr.h:261:7 #6 0x55dbec5035a1 in clang::interp::Context::~Context() clang/lib/AST/Interp/Context.cpp:27:22 #7 0x55dbebec1daa in operator() third_party/crosstool/v18/stable/toolchain/bin/../include/c++/v1/__memory/unique_ptr.h:55:5 #8 0x55dbebec1daa in std::__msan::unique_ptr<clang::interp::Context, std::__msan::default_delete<clang::interp::Context>>::~unique_ptr() third_party/crosstool/v18/stable/toolchain/bin/../include/c++/v1/__memory/unique_ptr.h:261:7 #9 0x55dbebe285f9 in clang::ASTContext::~ASTContext() clang/lib/AST/ASTContext.cpp:1038:40 #10 0x55dbe941ff13 in llvm::RefCountedBase<clang::ASTContext>::Release() const llvm/include/llvm/ADT/IntrusiveRefCntPtr.h:101:7 #11 0x55dbe94353ef in release llvm/include/llvm/ADT/IntrusiveRefCntPtr.h:159:38 #12 0x55dbe94353ef in release llvm/include/llvm/ADT/IntrusiveRefCntPtr.h:224:7 #13 0x55dbe94353ef in ~IntrusiveRefCntPtr llvm/include/llvm/ADT/IntrusiveRefCntPtr.h:191:27 #14 0x55dbe94353ef in clang::CompilerInstance::setASTContext(clang::ASTContext*) clang/lib/Frontend/CompilerInstance.cpp:178:3 #15 0x55dbe95ad0ad in clang::FrontendAction::EndSourceFile() clang/lib/Frontend/FrontendAction.cpp:1100:8 #16 0x55dbe9445fcf in clang::CompilerInstance::ExecuteAction(clang::FrontendAction&) clang/lib/Frontend/CompilerInstance.cpp:1047:11 #17 0x55dbe6b3afef in clang::ExecuteCompilerInvocation(clang::CompilerInstance*) clang/lib/FrontendTool/ExecuteCompilerInvocation.cpp:266:25 #18 0x55dbe6b13288 in cc1_main(llvm::ArrayRef<char const*>, char const*, void*) clang/tools/driver/cc1_main.cpp:250:15 #19 0x55dbe6b0095f in ExecuteCC1Tool(llvm::SmallVectorImpl<char const*>&) clang/tools/driver/driver.cpp:319:12 #20 0x55dbe6aff41c in clang_main(int, char**) clang/tools/driver/driver.cpp:395:12 #21 0x7f9be07fa632 in __libc_start_main #22 0x55dbe6a702e9 in _start Member fields were destroyed #0 0x55dbe6a7da5d in __sanitizer_dtor_callback_fields compiler-rt/lib/msan/msan_interceptors.cpp:949:5 #1 0x55dbec5094ac in ~SmallVectorImpl llvm/include/llvm/ADT/SmallVector.h:479:7 #2 0x55dbec5094ac in ~SmallVectorImpl llvm/include/llvm/ADT/SmallVector.h:612:3 #3 0x55dbec5094ac in llvm::SmallVector<clang::interp::Record::Base, 8u>::~SmallVector() llvm/include/llvm/ADT/SmallVector.h:1207:3 #4 0x55dbec508e79 in clang::interp::Record::~Record() clang/lib/AST/Interp/Record.h:24:7 #5 0x55dbec508612 in clang::interp::Program::~Program() clang/lib/AST/Interp/Program.h:49:26 #6 0x55dbec50657a in operator() third_party/crosstool/v18/stable/toolchain/bin/../include/c++/v1/__memory/unique_ptr.h:55:5 #7 0x55dbec50657a in std::__msan::unique_ptr<clang::interp::Program, 
std::__msan::default_delete<clang::interp::Program>>::~unique_ptr() third_party/crosstool/v18/stable/toolchain/bin/../include/c++/v1/__memory/unique_ptr.h:261:7 #8 0x55dbec5035a1 in clang::interp::Context::~Context() clang/lib/AST/Interp/Context.cpp:27:22 #9 0x55dbebec1daa in operator() third_party/crosstool/v18/stable/toolchain/bin/../include/c++/v1/__memory/unique_ptr.h:55:5 #10 0x55dbebec1daa in std::__msan::unique_ptr<clang::interp::Context, std::__msan::default_delete<clang::interp::Context>>::~unique_ptr() third_party/crosstool/v18/stable/toolchain/bin/../include/c++/v1/__memory/unique_ptr.h:261:7 #11 0x55dbebe285f9 in clang::ASTContext::~ASTContext() clang/lib/AST/ASTContext.cpp:1038:40 #12 0x55dbe941ff13 in llvm::RefCountedBase<clang::ASTContext>::Release() const llvm/include/llvm/ADT/IntrusiveRefCntPtr.h:101:7 #13 0x55dbe94353ef in release llvm/include/llvm/ADT/IntrusiveRefCntPtr.h:159:38 #14 0x55dbe94353ef in release llvm/include/llvm/ADT/IntrusiveRefCntPtr.h:224:7 #15 0x55dbe94353ef in ~IntrusiveRefCntPtr llvm/include/llvm/ADT/IntrusiveRefCntPtr.h:191:27 #16 0x55dbe94353ef in clang::CompilerInstance::setASTContext(clang::ASTContext*) clang/lib/Frontend/CompilerInstance.cpp:178:3 #17 0x55dbe95ad0ad in clang::FrontendAction::EndSourceFile() clang/lib/Frontend/FrontendAction.cpp:1100:8 #18 0x55dbe9445fcf in clang::CompilerInstance::ExecuteAction(clang::FrontendAction&) clang/lib/Frontend/CompilerInstance.cpp:1047:11 #19 0x55dbe6b3afef in clang::ExecuteCompilerInvocation(clang::CompilerInstance*) clang/lib/FrontendTool/ExecuteCompilerInvocation.cpp:266:25 #20 0x55dbe6b13288 in cc1_main(llvm::ArrayRef<char const*>, char const*, void*) clang/tools/driver/cc1_main.cpp:250:15 #21 0x55dbe6b0095f in ExecuteCC1Tool(llvm::SmallVectorImpl<char const*>&) clang/tools/driver/driver.cpp:319:12 #22 0x55dbe6aff41c in clang_main(int, char**) clang/tools/driver/driver.cpp:395:12 #23 0x7f9be07fa632 in __libc_start_main #24 0x55dbe6a702e9 in _start
…(#80904)" This reverts commit b1ac052. This commit breaks coroutine splitting for non-swift calling convention functions. In this example: ```ll ; ModuleID = 'repro.ll' source_filename = "stdlib/test/runtime/test_llcl.mojo" target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-i128:128-f80:128-n8:16:32:64-S128" target triple = "x86_64-unknown-linux-gnu" @0 = internal constant { i32, i32 } { i32 trunc (i64 sub (i64 ptrtoint (ptr @craSH to i64), i64 ptrtoint (ptr getelementptr inbounds ({ i32, i32 }, ptr @0, i32 0, i32 1) to i64)) to i32), i32 64 } define dso_local void @af_suspend_fn(ptr %0, i64 %1, ptr %2) #0 { ret void } define dso_local void @craSH(ptr %0) #0 { %2 = call token @llvm.coro.id.async(i32 64, i32 8, i32 0, ptr @0) %3 = call ptr @llvm.coro.begin(token %2, ptr null) %4 = getelementptr inbounds { ptr, { ptr, ptr }, i64, { ptr, i1 }, i64, i64 }, ptr poison, i32 0, i32 0 %5 = call ptr @llvm.coro.async.resume() store ptr %5, ptr %4, align 8 %6 = call { ptr, ptr, ptr } (i32, ptr, ptr, ...) @llvm.coro.suspend.async.sl_p0p0p0s(i32 0, ptr %5, ptr @ctxt_proj_fn, ptr @af_suspend_fn, ptr poison, i64 -1, ptr poison) ret void } define dso_local ptr @ctxt_proj_fn(ptr %0) #0 { ret ptr %0 } ; Function Attrs: nomerge nounwind declare { ptr, ptr, ptr } @llvm.coro.suspend.async.sl_p0p0p0s(i32, ptr, ptr, ...) #1 ; Function Attrs: nounwind declare token @llvm.coro.id.async(i32, i32, i32, ptr) #2 ; Function Attrs: nounwind declare ptr @llvm.coro.begin(token, ptr writeonly) #2 ; Function Attrs: nomerge nounwind declare ptr @llvm.coro.async.resume() #1 attributes #0 = { "target-features"="+adx,+aes,+avx,+avx2,+bmi,+bmi2,+clflushopt,+clwb,+clzero,+crc32,+cx16,+cx8,+f16c,+fma,+fsgsbase,+fxsr,+invpcid,+lzcnt,+mmx,+movbe,+mwaitx,+pclmul,+pku,+popcnt,+prfchw,+rdpid,+rdpru,+rdrnd,+rdseed,+sahf,+sha,+sse,+sse2,+sse3,+sse4.1,+sse4.2,+sse4a,+ssse3,+vaes,+vpclmulqdq,+wbnoinvd,+x87,+xsave,+xsavec,+xsaveopt,+xsaves" } attributes #1 = { nomerge nounwind } attributes #2 = { nounwind } ``` This verifier crashes after the `coro-split` pass with ``` cannot guarantee tail call due to mismatched parameter counts musttail call void @af_suspend_fn(ptr poison, i64 -1, ptr poison) LLVM ERROR: Broken function PLEASE submit a bug report to https://github.com/llvm/llvm-project/issues/ and include the crash backtrace. Stack dump: 0. 
I would like to use TVM to compile models like BERT for VE, but I'm encountering errors from LLVM like these:
What do these errors mean? Where should I start debugging this?
I build https://github.com/sx-aurora-dev/llvm-project this way on CentOS 7:
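Something like this — a minimal sketch assuming the standard out-of-tree CMake flow, with VE enabled as an experimental backend (the fork may already enable it by default); paths and flags are illustrative:

```sh
git clone https://github.com/sx-aurora-dev/llvm-project.git
mkdir llvm-project/build && cd llvm-project/build
# VE sits behind LLVM_EXPERIMENTAL_TARGETS_TO_BUILD in upstream LLVM;
# the sx-aurora-dev fork may enable it out of the box.
cmake ../llvm \
    -DCMAKE_BUILD_TYPE=Release \
    -DLLVM_ENABLE_PROJECTS="clang" \
    -DLLVM_TARGETS_TO_BUILD="X86" \
    -DLLVM_EXPERIMENTAL_TARGETS_TO_BUILD="VE"
make -j"$(nproc)"
```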
And https://github.com/apache/incubator-tvm this way:
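Again as a sketch, assuming the usual `config.cmake` flow, with `USE_LLVM` pointed at the build above and a CBLAS implementation enabled for `-libs=cblas`; the paths are illustrative:

```sh
git clone --recursive https://github.com/apache/incubator-tvm.git
mkdir incubator-tvm/build && cd incubator-tvm/build
cp ../cmake/config.cmake .
# Point TVM at the LLVM build above, and enable a CBLAS implementation
# so that "-libs=cblas" works; both settings are illustrative.
echo 'set(USE_LLVM /path/to/llvm-project/build/bin/llvm-config)' >> config.cmake
echo 'set(USE_BLAS openblas)' >> config.cmake
cmake .. && make -j"$(nproc)"
```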
For the default BERT model described in this blog post:
Using this script:
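In outline, the script follows the GluonNLP BERT flow from that post — this is a hypothetical skeleton, not the exact script (the model name, input names, and shapes are illustrative assumptions):

```python
import gluonnlp
import tvm
from tvm import relay

# Load a pretrained BERT base model via GluonNLP (illustrative configuration).
model, _ = gluonnlp.model.get_model(
    "bert_12_768_12",
    dataset_name="book_corpus_wiki_en_uncased",
    pretrained=True, use_classifier=False, use_decoder=False)

# Convert to Relay; batch size 1 and sequence length 128 are assumed here.
shape_dict = {"data0": (1, 128), "data1": (1, 128), "data2": (1,)}
mod, params = relay.frontend.from_mxnet(model, shape_dict)
```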
But replacing everything that comes after

```python
target = "llvm -mcpu=skylake-avx512 -libs=cblas"
```

with the VE target (see the sketch below).

Everything works fine with the x86 target of LLVM, on the same machine with the same binaries, so the problem is specific to the VE target. Any help would be greatly appreciated. Thanks in advance!
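For reference, the VE replacement presumably looks something like this — a minimal sketch assuming the standard Relay build API; the `ve-linux` triple and output filename are assumptions, and `mod`/`params` come from the conversion step above:

```python
import tvm
from tvm import relay

# Same build as on x86, but asking LLVM's VE backend for code generation.
target = "llvm -mtriple=ve-linux"
with tvm.transform.PassContext(opt_level=3):
    compiled_lib = relay.build(mod, target, params=params)
# Cross-compiling the shared library may additionally need a VE-aware
# compiler passed via export_library's cc argument.
compiled_lib.export_library("libbert_ve.so")
```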