Getting TVM working with VE #24

saudet opened this issue Nov 12, 2020 · 49 comments

saudet commented Nov 12, 2020

I would like to use TVM to compile models like BERT for VE, but I'm encountering errors from LLVM like these:

LLVM ERROR: Cannot select: 0x10f7dfc8: v16f32 = fadd 0xd33cfa8, 0x99008b8
  0xd33cfa8: v16f32,ch = load<(load 64 from %ir.uglygep1718, !tbaa !247)> 0x2a31c68, 0x99002a0, undef:i64
    0x99002a0: i64 = add 0x10f7e308, 0x9901008
      0x10f7e308: i64,ch = CopyFromReg 0x2a31c68, Register:i64 %12
        0x9900b90: i64 = Register %12
      0x9901008: i64,ch = CopyFromReg 0x2a31c68, Register:i64 %17
        0xd33d280: i64 = Register %17
    0x9900ac0: i64 = undef
  0x99008b8: v16f32 = fadd 0x99001d0, 0x10f7e988
    0x99001d0: v16f32,ch = load<(load 64 from %ir.uglygep56, !tbaa !250)> 0x2a31c68, 0x10f7dc88, undef:i64
      0x10f7dc88: i64 = add 0x9900578, 0x9901008
        0x9900578: i64,ch = CopyFromReg 0x2a31c68, Register:i64 %16
          0x14856630: i64 = Register %16
        0x9901008: i64,ch = CopyFromReg 0x2a31c68, Register:i64 %17
          0xd33d280: i64 = Register %17
      0x9900ac0: i64 = undef
    0x10f7e988: v16f32,ch = load<(load 64 from %ir.uglygep2, !tbaa !253)> 0x2a31c68, 0x10f7dc20, undef:i64
      0x10f7dc20: i64 = add 0x9900648, 0x9901008
        0x9900648: i64,ch = CopyFromReg 0x2a31c68, Register:i64 %15
          0x9900e68: i64 = Register %15
        0x9901008: i64,ch = CopyFromReg 0x2a31c68, Register:i64 %17
          0xd33d280: i64 = Register %17
      0x9900ac0: i64 = undef
In function: __tvm_parallel_lambda
Aborted

What does that mean? Where should I start to debug this?

I build https://github.com/sx-aurora-dev/llvm-project this way on CentOS 7:

cmake ../llvm
make

And https://github.com/apache/incubator-tvm this way:

cmake -DUSE_LLVM=/path/to/llvm-project/build/bin/llvm-config ..
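
(For completeness, a slightly more explicit variant of the LLVM build step, with Release mode and clang enabled so it can be used for linking later, would look roughly like the sketch below. Whether the VE backend in this fork needs to be listed explicitly in LLVM_TARGETS_TO_BUILD is an assumption on my part.)

cmake ../llvm \
    -DCMAKE_BUILD_TYPE=Release \
    -DLLVM_ENABLE_PROJECTS="clang" \
    -DLLVM_TARGETS_TO_BUILD="X86;VE"
make -j$(nproc)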

For the default BERT model described in this blog post:

Using this script:

with tvm.transform.PassContext(opt_level=3, required_pass=["FastMath"]):
    compiled_lib = relay.build(mod, "llvm -mtriple=ve-linux", params=params)
compiled_lib.export_library("libbertve.so")

Everything works fine with the x86 target of LLVM, on the same machine with the same binaries, so it's something specific to the VE target. Any help would be greatly appreciated! Thanks in advance


saudet commented Nov 18, 2020

/cc @mikishin @efocht

@mikishin

Dear Samuel-san,
I am trying to reach the LLVM developer. If I have any update I will let you know ASAP.


kaz7 commented Nov 18, 2020

Thank you for letting us know about the problem. The develop branch supports only a limited set of vector instructions at the moment, which is what causes your problem. @simoll, do you have any information?


saudet commented Nov 18, 2020

@kaz7 Interesting, it does seem to go through when disabling vectorized operations this way:

with tvm.transform.PassContext(opt_level=3, config={"tir.disable_vectorize": True}, required_pass=["FastMath"]):
    compiled_lib = relay.build(mod, "llvm -mtriple=ve-linux", params=params)
compiled_lib.export_library("libbertve.so")

Thanks for the input!


saudet commented Nov 18, 2020

Now I'm getting a Segmentation fault when trying to run the generated code though :(
@kaz7 Any recommendations about how to debug this?


kaz7 commented Nov 18, 2020

Here is a remedy if you don't need vectorization or vector intrinsic instructions. You can disable vectorization completely by changing VESubtarget.cpp as below.

diff --git a/llvm/lib/Target/VE/VESubtarget.cpp b/llvm/lib/Target/VE/VESubtarget.cpp
index 23a338a..ab0b730 100644
--- a/llvm/lib/Target/VE/VESubtarget.cpp
+++ b/llvm/lib/Target/VE/VESubtarget.cpp
@@ -28,7 +28,7 @@ void VESubtarget::anchor() {}
 VESubtarget &VESubtarget::initializeSubtargetDependencies(StringRef CPU,
                                                           StringRef FS) {
   // Default feature settings
-  EnableVPU = true;
+  EnableVPU = false;

   // Determine default and user specified characteristics
   std::string CPUName = std::string(CPU);


saudet commented Nov 18, 2020

@kaz7 Thanks for the information. I'm getting Segmentation fault at runtime even with VPU disabled though.


kaz7 commented Nov 18, 2020

Sorry to hear that. :-(


saudet commented Nov 18, 2020

Please let me know how I can help debug this, for example, is there a way to enable an info dump of some kind?


simoll commented Nov 18, 2020

Hi @saudet ,

Could you try the hpce/develop branch? That one has vectorization and vector instruction selection. You will get the same SegFault as in the develop branch at runtime, but I'd like to hear of any compiler crashes with that branch.
When the SegFault issue is fixed in develop we can merge that into hpce/develop and talk about vector code :)


saudet commented Nov 19, 2020

@simoll Thanks for chipping in! I tried the hpce/develop branch, but it segfaults when importing the model, so something before anything related to the VE target (or the x86 one for that matter) is broken. Would it be possible to update the branch with new commits from develop?


saudet commented Dec 25, 2020

Ok, I got something working. The cause of the Segmentation fault at runtime was NCC: apparently problems with thread_local that were fixed in 3.1.0. The next issue I faced was an undefined symbol: __ve_grow_stack, which I was able to work around by compiling and linking grow_stack.S manually via NCC. (Is this intended? Or is it a codegen bug?) Now I'm able to run the BERT model correctly, on all 8 threads, but it's running about 100x slower than expected. This is without vectorization, which I'm assuming would speed things up by a factor of maybe 10x, so what could be the cause of the remaining 10x? In any case, I plan to keep investigating this, but please let me know if you think of something! Thanks


saudet commented Dec 26, 2020

Ah, here's the missing piece: BLAS. I was able to link it with blas_openmp from NLC 2.1.0, and that made it over 10x faster. It's now down to about 400 ms for a batch of 1 sequence of length 128 (and about 7000 ms for a batch of 32), which is still about 10x slower than what we'd expect, but I'm assuming that vectorized operations are the final piece of the puzzle...
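
For reference, a minimal sketch of how the external ops get pulled in (this assumes TVM was built with USE_BLAS pointing at NLC's CBLAS, and simply adds -libs=cblas to the target string used above):

with tvm.transform.PassContext(opt_level=3, config={"tir.disable_vectorize": True}, required_pass=["FastMath"]):
    compiled_lib = relay.build(mod, "llvm -mtriple=ve-linux -libs=cblas", params=params)
compiled_lib.export_library("libbertve.so")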


jam7 commented Dec 26, 2020

I'm glad to hear that the problem is fixed. Regarding __ve_grow_stack, it is used in code generated by LLVM for VE. If you use clang to link the target program, the relevant library is linked automatically. If you use ncc, ncc doesn't know about it, so please link $LLVM_INSTALL_PATH/lib/clang/12.0.0/lib/linux/libclang_rt.builtins-ve.a manually. We plan to inline that code in the future to avoid this problem, but it is not finished at the moment.
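
For example, a hypothetical link step along those lines (the object and output file names are only illustrative, and the clang resource directory path depends on your install):

ncc -shared -o libbertve.so lib0.o \
    $LLVM_INSTALL_PATH/lib/clang/12.0.0/lib/linux/libclang_rt.builtins-ve.a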


saudet commented Dec 28, 2020

I tried the hpce/develop branch again (currently at llvm-ve-rv-1.8.0), and I'm not getting a Segmentation fault anymore when importing/exporting the model with TVM, so that's been fixed! It's also happily generating something for TVM's vectorized operations! 👍 However, I am now getting a Segmentation fault at runtime, even when TVM's vectorizer is disabled.
@simoll Would you have some recommendations about what I should be looking at first?


saudet commented Jan 6, 2021

Some more data points when using the hpce/develop branch:

  1. Small graphs like the one from https://github.com/apache/tvm/tree/main/apps/howto_deploy work fine
  2. With the BERT model, there is no Segmentation fault when EnableVPU = false, but if we enable TVM's vectorizer, it runs about 2x more slowly, and the output is incorrect


simoll commented Jan 7, 2021

Some more data points when using the hpce/develop branch:

1. Small graphs like the one from https://github.com/apache/tvm/tree/main/apps/howto_deploy work fine

2. With the BERT model, there is no `Segmentation fault` when `EnableVPU = false`, but if we enable TVM's vectorizer, it runs about 2x more slowly, and the output is incorrect

We are currently merging the latest develop into hpce/develop. I'll look into your issue when that is done. Can you provide an LLVM IR bitcode file of the failing test? That'd help pin the bug down.


saudet commented Jan 8, 2021

I'm attaching what I get from compiled_lib.lib.get_source() for the BERT model above on llvm-ve-rv-1.8.0 with EnableVPU = true (segfaults) and EnableVPU = false (works fine):

In either case, TVM's vectorizer is disabled as per above. If you spot anything off in that, please let me know! Thanks


saudet commented Feb 17, 2021

@simoll I've tried again with VPU using the latest code from hpce/develop, and although in general it still crashes and produces incorrect results, it actually appears to work correctly with the specific BERT model above. However, in this case, it executes a bit more slowly than without VPU. I've also since implemented a TVM backend for VE, so we can now use the tools available in its Python code for debugging purposes, as demonstrated on this line here:

I'm compiling with and without VPU using the available LLVM options as shown here:

That gives me these kinds of execution profiles:

In either case, "fused_nn_softmax" accounts for about 40% of the execution time, and "fused_nn_dense_add" for about 25%. The "fused_nn_dense" part is typically just GEMM from BLAS, and we know NLC is still not super fast, but that doesn't explain why the "softmax" ones, even with VPU, are so slow. That operation is mostly exp() and divisions. I remember NCC having problems vectorizing these kinds of operations. What's the status of LLVM-VE with regard to this?
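
(For anyone wanting to reproduce this kind of per-operator breakdown, it comes from TVM's debug executor, roughly like the sketch below; module names vary slightly across TVM versions, and the device handle here depends on my VE backend, so treat it as illustrative.)

import tvm
from tvm.contrib.debugger import debug_executor  # "debug_runtime" in older TVM versions

dev = tvm.cpu(0)  # placeholder: the actual device comes from the VE backend
m = debug_executor.create(compiled_lib.get_graph_json(), compiled_lib.lib, dev)
m.set_input(**params)
m.run()  # collects per-operator timings (fused_nn_softmax, fused_nn_dense_add, ...)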

kaz7 pushed a commit that referenced this issue Jun 30, 2021
kaz7 pushed a commit that referenced this issue Jul 13, 2021
kaz7 pushed a commit that referenced this issue Jul 29, 2021
kaz7 pushed a commit that referenced this issue Aug 14, 2021

saudet commented Dec 6, 2021

@simoll I'm starting to revisit this topic again, and I have a couple of additional questions. The code generated using TVM and the current hpce/develop branch doesn't crash anymore, outputs correct results, and is not slower than the main develop branch, but it is also not faster, so I'm guessing the VPU doesn't get used, but I really don't have a clue. Just to make sure, is there anything other than these options needed to enable emitting instructions for the VPU?

with tvm.transform.PassContext(opt_level=3, required_pass=["FastMath"]):
    compiled_lib = relay.build(mod, "llvm -mtriple=ve-linux -mattr=+vpu -libs=cblas,vednn", params=params)

https://github.com/saudet/tvm/blob/aurora/apps/howto_deploy/optimize_bert.py#L120

I've also made sure to set the RV_REPORT environment variable (to 1), and I get the following in the log, so I'm guessing it's doing something, but what does it mean?

[15:15:31] /home/saudet/tvm/src/target/llvm/codegen_llvm.cc:97: Set native vector bits to be 128 for ve
rv: loopVecPass::scopeLoop rv:  at 
rv: Analyzing loop exiting block: for_begin1.preheader
rv: loopVecPass::scopeLoop rv:  at 
rv: Analyzing loop exiting block: for_body
rv: loopVecPass::scopeLoop rv:  at 
rv: Analyzing loop exiting block: for_end3
rv: loopVecPass::scopeLoop rv:  at 
rv: Analyzing loop exiting block: for_body2
rv: loopVecPass::scopeLoop rv:  at 
rv: Analyzing loop exiting block: for_body.us
rv: loopVecPass::scopeLoop rv:  at 
rv: Analyzing loop exiting block: for_end3
rv: loopVecPass::scopeLoop rv:  at 
rv: Analyzing loop exiting block: if_end.15
rv: loopVecPass::scopeLoop rv:  at 
rv: Analyzing loop exiting block: for_end3
rv: loopVecPass::scopeLoop rv:  at 
rv: Analyzing loop exiting block: for_body2
rv: loopVecPass::scopeLoop rv:  at 
rv: Analyzing loop exiting block: for_end3
rv: loopVecPass::scopeLoop rv:  at 
rv: Analyzing loop exiting block: for_body2
rv: loopVecPass::scopeLoop rv:  at 
rv: Analyzing loop exiting block: for_end3
rv: loopVecPass::scopeLoop rv:  at 
rv: Analyzing loop exiting block: for_body2
rv:  can not vectorize this non-trivial SCC: Reduction { levelLoop = (2) for_body2 redKind Top elems:
-   %33 = tail call float @llvm.fmuladd.f32(float %32, float %32, float %28)
-   %28 = phi float [ 0.000000e+00, %for_body ], [ %33, %for_body2 ]
}

rv: x unfit loop structure
rv: loopVecPass::scopeLoop rv:  at 
rv: Analyzing loop exiting block: for_body
rv: loopVecPass::scopeLoop rv:  at 
rv: Analyzing loop exiting block: for_begin1.preheader
rv: loopVecPass::scopeLoop rv:  at 
rv: Analyzing loop exiting block: for_end3
rv: loopVecPass::scopeLoop rv:  at 
rv: Analyzing loop exiting block: for_body2
rv: loopVecPass::scopeLoop rv:  at 
rv: Analyzing loop exiting block: for_end3
rv: loopVecPass::scopeLoop rv:  at 
rv: Analyzing loop exiting block: for_body2
rv: loopVecPass::scopeLoop rv:  at 
rv: Analyzing loop exiting block: for_begin1.preheader
rv: loopVecPass::scopeLoop rv:  at 
rv: Analyzing loop exiting block: for_end3
rv: loopVecPass::scopeLoop rv:  at 
rv: Analyzing loop exiting block: for_body2
rv: loopVecPass::scopeLoop rv:  at 
rv: Analyzing loop exiting block: for_end3
rv: loopVecPass::scopeLoop rv:  at 
rv: Analyzing loop exiting block: for_body2
rv: loopVecPass::scopeLoop rv:  at 
rv: Analyzing loop exiting block: for_body
rv: loopVecPass::scopeLoop rv:  at 
rv: Analyzing loop exiting block: for_end3
rv: loopVecPass::scopeLoop rv:  at 
rv: Analyzing loop exiting block: for_body2
rv: loopVecPass::scopeLoop rv:  at 
rv: Analyzing loop exiting block: for_end3
rv: loopVecPass::scopeLoop rv:  at 
rv: Analyzing loop exiting block: for_body2
rv: loopVecPass::scopeLoop rv:  at 
rv: Analyzing loop exiting block: for_end3
rv: loopVecPass::scopeLoop rv:  at 
rv: Analyzing loop exiting block: for_body2
rv: loopVecPass::scopeLoop rv:  at 
rv: Analyzing loop exiting block: for_body
rv: loopVecPass::scopeLoop rv:  at 
rv: Analyzing loop exiting block: for_end3
rv: loopVecPass::scopeLoop rv:  at 
rv: Analyzing loop exiting block: for_body2
rv: loopVecPass::scopeLoop rv:  at 
rv: Analyzing loop exiting block: for_end3
rv: loopVecPass::scopeLoop rv:  at 
rv: Analyzing loop exiting block: for_body2
rv: loopVecPass::scopeLoop rv:  at 
rv: Analyzing loop exiting block: for_begin1.preheader

So, I've tried hard-coding EnableVPU = true back on in VESubtarget.cpp, but with that, LLVM still crashes with errors like:

LLVM ERROR: Cannot select: 0xaf338e0: v16f32 = vselect 0x10779a30, 0xaf33ae8, 0x15f8cc50
  0x10779a30: v16i1 = setcc 0xaf33ae8, 0x15f8cc50, setogt:ch
    0xaf33ae8: v16f32 = vselect 0x10779550, 0x19904100, 0xaf27e28
      0x10779550: v16i1 = setcc 0x19904100, 0xaf27e28, setolt:ch
        0x19904100: v16f32 = fmul 0x15f8d0c8, 0xaf32d18
          0x15f8d0c8: v16f32,ch = load<(load 64 from %ir.uglygep910, !tbaa !3197)> 0xcc11cc8, 0x15f8cab0, undef:i64
            0x15f8cab0: i64 = add 0x19914b10, 0x199145c8
              0x19914b10: i64,ch = CopyFromReg 0xcc11cc8, Register:i64 %8
                0xaf33400: i64 = Register %8
              0x199145c8: i64,ch = CopyFromReg 0xcc11cc8, Register:i64 %11
                0x19914be0: i64 = Register %11
            0x19914768: i64 = undef
          0xaf32d18: v16f32 = VEISD::VEC_BROADCAST ConstantFP:f32<7.071068e-01>, Constant:i32<16>
            0xaf33948: f32 = ConstantFP<7.071068e-01>
            0xaf278e0: i32 = Constant<16>
        0xaf27e28: v16f32 = VEISD::VEC_BROADCAST ConstantFP:f32<4.000000e+00>, Constant:i32<16>
          0x19903bb8: f32 = ConstantFP<4.000000e+00>
          0xaf278e0: i32 = Constant<16>
      0x19904100: v16f32 = fmul 0x15f8d0c8, 0xaf32d18
        0x15f8d0c8: v16f32,ch = load<(load 64 from %ir.uglygep910, !tbaa !3197)> 0xcc11cc8, 0x15f8cab0, undef:i64
          0x15f8cab0: i64 = add 0x19914b10, 0x199145c8
            0x19914b10: i64,ch = CopyFromReg 0xcc11cc8, Register:i64 %8
              0xaf33400: i64 = Register %8
            0x199145c8: i64,ch = CopyFromReg 0xcc11cc8, Register:i64 %11
              0x19914be0: i64 = Register %11
          0x19914768: i64 = undef
        0xaf32d18: v16f32 = VEISD::VEC_BROADCAST ConstantFP:f32<7.071068e-01>, Constant:i32<16>
          0xaf33948: f32 = ConstantFP<7.071068e-01>
          0xaf278e0: i32 = Constant<16>
      0xaf27e28: v16f32 = VEISD::VEC_BROADCAST ConstantFP:f32<4.000000e+00>, Constant:i32<16>
        0x19903bb8: f32 = ConstantFP<4.000000e+00>
        0xaf278e0: i32 = Constant<16>
    0x15f8cc50: v16f32 = VEISD::VEC_BROADCAST ConstantFP:f32<-4.000000e+00>, Constant:i32<16>
      0xaf27dc0: f32 = ConstantFP<-4.000000e+00>
      0xaf278e0: i32 = Constant<16>
  0xaf33ae8: v16f32 = vselect 0x10779550, 0x19904100, 0xaf27e28
    0x10779550: v16i1 = setcc 0x19904100, 0xaf27e28, setolt:ch
      0x19904100: v16f32 = fmul 0x15f8d0c8, 0xaf32d18
        0x15f8d0c8: v16f32,ch = load<(load 64 from %ir.uglygep910, !tbaa !3197)> 0xcc11cc8, 0x15f8cab0, undef:i64
          0x15f8cab0: i64 = add 0x19914b10, 0x199145c8
            0x19914b10: i64,ch = CopyFromReg 0xcc11cc8, Register:i64 %8
              0xaf33400: i64 = Register %8
            0x199145c8: i64,ch = CopyFromReg 0xcc11cc8, Register:i64 %11
              0x19914be0: i64 = Register %11
          0x19914768: i64 = undef
        0xaf32d18: v16f32 = VEISD::VEC_BROADCAST ConstantFP:f32<7.071068e-01>, Constant:i32<16>
          0xaf33948: f32 = ConstantFP<7.071068e-01>
          0xaf278e0: i32 = Constant<16>
      0xaf27e28: v16f32 = VEISD::VEC_BROADCAST ConstantFP:f32<4.000000e+00>, Constant:i32<16>
        0x19903bb8: f32 = ConstantFP<4.000000e+00>
        0xaf278e0: i32 = Constant<16>
    0x19904100: v16f32 = fmul 0x15f8d0c8, 0xaf32d18
      0x15f8d0c8: v16f32,ch = load<(load 64 from %ir.uglygep910, !tbaa !3197)> 0xcc11cc8, 0x15f8cab0, undef:i64
        0x15f8cab0: i64 = add 0x19914b10, 0x199145c8
          0x19914b10: i64,ch = CopyFromReg 0xcc11cc8, Register:i64 %8
            0xaf33400: i64 = Register %8
          0x199145c8: i64,ch = CopyFromReg 0xcc11cc8, Register:i64 %11
            0x19914be0: i64 = Register %11
        0x19914768: i64 = undef
      0xaf32d18: v16f32 = VEISD::VEC_BROADCAST ConstantFP:f32<7.071068e-01>, Constant:i32<16>
        0xaf33948: f32 = ConstantFP<7.071068e-01>
        0xaf278e0: i32 = Constant<16>
    0xaf27e28: v16f32 = VEISD::VEC_BROADCAST ConstantFP:f32<4.000000e+00>, Constant:i32<16>
      0x19903bb8: f32 = ConstantFP<4.000000e+00>
      0xaf278e0: i32 = Constant<16>
  0x15f8cc50: v16f32 = VEISD::VEC_BROADCAST ConstantFP:f32<-4.000000e+00>, Constant:i32<16>
    0xaf27dc0: f32 = ConstantFP<-4.000000e+00>
    0xaf278e0: i32 = Constant<16>
In function: __tvm_parallel_lambda.235
Aborted

Is this still expected to not function? Or am I doing something wrong?

/cc @mikishin @efocht @wmeddie


saudet commented Dec 21, 2021

I've started looking at RV more closely, and if my understanding is correct, there are no plans to support LLVM vector instructions, but it should be able to autovectorize scalar LLVM IR instructions. Since TVM doesn't offer a way to annotate loops with OpenMP pragmas and the like, I had to hack RV's code, but I've been able to get something going by setting the RV_FORCE_WIDTH environment variable to values between 2 and 256, and by forcing it to accept given loops by setting this condition to true for specific L.getName() values instead of checking for L.isAnnotatedParallel():
https://github.com/sx-aurora-dev/rv/blob/hpce/develop/src/passes/LoopVectorizer.cpp#L297
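
Concretely, the hack amounts to something like the following fragment (a hypothetical sketch; the real condition in LoopVectorizer.cpp is structured differently, and the loop names are just examples taken from this BERT module):

// Original idea (roughly): only consider loops annotated as parallel.
//   if (!L.isAnnotatedParallel()) return false;
// Hacked version: force-accept a hand-picked set of TVM-generated loops by name.
llvm::StringRef Name = L.getName();
bool Forced = Name == "for_body2.us.us" || Name == "for_body5" || Name == "for_body5.us.us";
if (!L.isAnnotatedParallel() && !Forced)
  return false;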

For reference, the loops that are generated by TVM for that BERT model have the following names: "for_body", "for_body2", "for_body2.us.us", "for_body5", "for_begin1.preheader", "for_begin1.preheader.us", "for_begin4.preheader", "for_begin7.preheader", "for_begin7.preheader.1", and "for_begin7.preheader.2".

Now, the strange thing is, RV is able to successfully vectorize all of these loops (except the one shown above), but even if only a single loop is vectorized (it doesn't matter which one), the resulting output at runtime becomes incorrect. Furthermore, for large values of RV_FORCE_WIDTH like 128 and 256, the program also crashes with "corrupted double-linked list", so memory is getting corrupted. For some reason, it sounds like vectorization causes the code to access incorrect areas of memory. Would someone have an idea as to where to look next to debug this?


saudet commented Dec 23, 2021

Ah, I figured out what the problem was. RV seems to generate incorrect code when the LLVM flag -mattr=-vpu is set.

@simoll Instead of silently generating incorrect code, RV should abort very loudly with a fatal error when -mattr=-vpu is set.

Now, after making sure -mattr=-vpu is not set, RV is able to vectorize correctly only the "for_body2.us.us", "for_body5", and "for_body5.us.us" loops, but not the others. With only these loops and RV_FORCE_WIDTH=256 though, it's almost as fast as using external ops from VML for batch sizes of 1 and 32, respectively:

Mean Time = 89.7658 ± 0.393104 ms
Mean Time = 969.688 ± 0.455849 ms

Progress, at last!
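
(For reference, those numbers come from the usual TVM timing idiom, presumably something like the sketch below; `module` here is the graph executor created from the compiled library and `dev` the device handle from my VE backend, so both are assumptions outside this snippet.)

import numpy as np

ftimer = module.module.time_evaluator("run", dev, repeat=100)
prof_res = np.array(ftimer().results) * 1000  # convert seconds to milliseconds
print("Mean Time = %g ± %g ms" % (np.mean(prof_res), np.std(prof_res)))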


simoll commented Dec 23, 2021

Ah, I figured out what the problem was. RV seems to generate incorrect code when the LLVM flag -mattr=-vpu is set.

@simoll Instead of silently generating incorrect code, RV should abort very loudly with a fatal error when -mattr=-vpu is set.

The -mattr option is part of LLVM not of RV.
If you pass -mattr=-vpu you are basically telling the VE backend (that translates LLVM IR into VE machine code) that it should scalarize whatever vector instructions are in the LLVM IR. If that code is incorrect when the IR is correct, it is a bug in LLVM.

Now, after making sure -mattr=-vpu is not set, RV is able to vectorize correctly only the "for_body2.us.us", "for_body5", and "for_body5.us.us" loops, but not the others. With only these loops and RV_FORCE_WIDTH=256 though, it's almost as fast as using external ops from VML for batch sizes of 1 and 32, respectively:

Mean Time = 89.7658 ± 0.393104 ms
Mean Time = 969.688 ± 0.455849 ms

Progress, at last!

This is great! Can you provide an LLVM IR file for the loops that do not vectorize? It'd help if you provide reduced test cases that isolate the problem.


saudet commented Dec 24, 2021

The -mattr option is part of LLVM not of RV. If you pass -mattr=-vpu you are basically telling the VE backend (that translates LLVM IR into VE machine code) that it should scalarize whatever vector instructions are in the LLVM IR. If that code is incorrect when the IR is correct, it is a bug in LLVM.

When you say "LLVM", do you mean the portion largely maintained by @kaz7?

This is great! Can you provide an LLVM IR file for the loops that do not vectorize? It'd help if you provide reduced test cases that isolate the problem.

Let's see, I've continued debugging this today, and I can't get "for_body" or "for_body2" to generate correct code even when pretty much everything is disabled. The debug output, that is, with RV_REPORT=1 and LV_DIAG=1, looks like the following, which I believe includes the input IR code from TVM:

rv: loopVecPass::scopeLoop rv:  at 
rv: Analyzing loop exiting block: for_body
rv: loopVecPass: LoopMD {vectorizeEnable = 1, minDepDist = unbounded, }
rv: loopVecPass: with user-provided vector width (RV_FORCE_WIDTH=256)
rv: loopVecPass: Vectorize for_body with VW: 256 , Dependence Distance: unbounded and TripAlignment: 1
rv: Analyzing loop exiting block: for_body
rv:  rv::RVConfig: RVConfig {
	VA:   sa-lattice, foldAllBranches = 0
	opts: enableSplitAllocas = 0, enableStructOpt = 0, enableSROV = 0, enableHeuristicBOSCC = 0, enableCoherentIF = 0, enableOptimizedBlends = 0, enableIRPolish = 0, greedyIPV = 0, maxULPErrorBound = 1.0, useAVL = 0

	nat:  useScatterGather = 0, useSafeDiv = 0
	using LLVM-VP.

	arch: useSSE = 0, useAVX = 0, useAVX2 = 0, useAVX512 = 0, useNEON = 0, useADVSIMD = 0, useVE = 1

}
-- VA result --
VectorizationInfo for LoopRegion (header for_bodyC)

Arguments:
i32 %0 : uni
%1* %1 : uni
i8* %2 : uni

Block %for_bodyC []
  %indvars.ivC = phi i64 [ %20, %for_bodyC.ph ], [ %indvars.iv.nextC, %for_bodyC ] : cont, alignment(256, 1)
  %21 = getelementptr inbounds float, float* %7, i64 %indvars.ivC : stride(4)
  %22 = load float, float* %21, align 4, !tbaa !4 : varying
  %23 = fmul float %22, 0x3F55555560000000 : varying
  %24 = getelementptr inbounds float, float* %4, i64 %indvars.ivC : stride(4)
  store float %23, float* %24, align 4, !tbaa !8 : varying
  %indvars.iv.nextC = add nsw i64 %indvars.ivC, 1 : cont
  %exitcond.notC = icmp eq i64 %indvars.iv.nextC, %wide.trip.count : varying
  %25 = add nuw i64 %indvars.ivC, 512 : uni
  %exitcond.not.vecExit = icmp sge i64 %25, %wide.trip.count : uni
  br i1 %exitcond.not.vecExit, label %for_body.vec2scalar, label %for_bodyC, !prof !11, !llvm.loop !12 : uni

}
-- EOF --
rv: SROV opt disabled (RV_DISABLE_SROV != 0)
rv: divLoopTrans:
	1 uniform loops.
rv: redOpt: optimized 0 reduction chains.
rv: Split allocas opt disabled (RV_DISABLE_SPLITALLOCAS != 0)
rv: Struct opt disabled (RV_DISABLE_STRUCTOPT != 0)
rv: nat memory:
	uni allocas: 0
	slow allocas: 0
	scatter/gather: 0/2, masked 0/0
	inter load/store: 0/0, masked 0/0
	cons load/store: 12/10, masked 0/0
	uni load/store: 0/0, masked 0/0
	store masks (c/u/v): 10/0/0
	load  masks (c/u/v): 14/0/0
rv: nat calls:
	Vectorized: 0/0 fully/semi
	Replicated: 22/0 replicated/cascaded
	RV Intrinsics: 0 intrinsics
-- Vectorized --

for_bodyC.rv:  
...
rv: loopVecPass::scopeLoop rv:  at 
rv: Analyzing loop exiting block: for_body2
rv: loopVecPass: LoopMD {vectorizeEnable = 1, minDepDist = unbounded, }
rv: loopVecPass: with user-provided vector width (RV_FORCE_WIDTH=256)
rv: loopVecPass: Vectorize for_body2 with VW: 256 , Dependence Distance: unbounded and TripAlignment: 3072
rv: Analyzing loop exiting block: for_body2
rv:  rv::RVConfig: RVConfig {
	VA:   sa-lattice, foldAllBranches = 0
	opts: enableSplitAllocas = 0, enableStructOpt = 0, enableSROV = 0, enableHeuristicBOSCC = 0, enableCoherentIF = 0, enableOptimizedBlends = 0, enableIRPolish = 0, greedyIPV = 0, maxULPErrorBound = 1.0, useAVL = 0

	nat:  useScatterGather = 0, useSafeDiv = 0
	using LLVM-VP.

	arch: useSSE = 0, useAVX = 0, useAVX2 = 0, useAVX512 = 0, useNEON = 0, useADVSIMD = 0, useVE = 1

}
-- VA result --
VectorizationInfo for LoopRegion (header for_body2C)

Arguments:
i32 %0 : uni
%1* %1 : uni
i8* %2 : uni

Block %for_body2C []
  %indvars.ivC = phi i64 [ 0, %for_body2C.ph ], [ %indvars.iv.nextC, %for_body2C ] : cont, alignment(256, 1)
  %25 = add nsw i64 %indvars.ivC, %24 : cont
  %26 = getelementptr inbounds float, float* %10, i64 %indvars.ivC : stride(4)
  %27 = load float, float* %26, align 4, !tbaa !4 : varying
  %28 = getelementptr inbounds float, float* %7, i64 %25 : stride(4)
  %29 = load float, float* %28, align 4, !tbaa !8 : varying
  %30 = fadd float %27, %29 : varying
  %31 = getelementptr inbounds float, float* %4, i64 %25 : stride(4)
  store float %30, float* %31, align 4, !tbaa !11 : varying
  %indvars.iv.nextC = add nuw nsw i64 %indvars.ivC, 1 : cont
  %exitcond.notC = icmp eq i64 %indvars.iv.nextC, 3072 : varying
  %32 = add nuw nsw i64 %indvars.ivC, 512 : uni
  %exitcond.not.vecExit = icmp sge i64 %32, 3072 : uni
  br i1 %exitcond.not.vecExit, label %for_body2.vec2scalar, label %for_body2C, !prof !14, !llvm.loop !15 : uni

}
-- EOF --
rv: SROV opt disabled (RV_DISABLE_SROV != 0)
rv: divLoopTrans:
	1 uniform loops.
rv: redOpt: optimized 0 reduction chains.
rv: Split allocas opt disabled (RV_DISABLE_SPLITALLOCAS != 0)
rv: Struct opt disabled (RV_DISABLE_STRUCTOPT != 0)
rv: nat memory:
	uni allocas: 0
	slow allocas: 0
	scatter/gather: 0/0, masked 0/0
	inter load/store: 0/0, masked 0/0
	cons load/store: 16/7, masked 0/0
	uni load/store: 0/0, masked 0/0
	store masks (c/u/v): 7/0/0
	load  masks (c/u/v): 16/0/0
rv: nat calls:
	Vectorized: 0/0 fully/semi
	Replicated: 12/0 replicated/cascaded
	RV Intrinsics: 0 intrinsics
-- Vectorized --

for_body2C.rv:
...

Let me know if you need more than that, and how to obtain what you need, and I'll attach it!

Other than that, I found that RV can vectorize "for_begin1.preheader", "for_begin1.preheader.us", "for_begin4.preheader", "for_begin7.preheader", "for_begin7.preheader.1", and "for_begin7.preheader.2" correctly when AVL is disabled, that is RV_DISABLE_AVL=1. When it's enabled, we get errors like this from LLVM at compile time:

WidenVectorResult #0: t60: v48i8,ch = vp_gather<(load unknown-size, align 64)> t0, t4, t54, TargetConstant:i64<1>, t57, Constant:i32<48>

Do not know how to widen the result of this operator!
UNREACHABLE executed at /home/saudet/llvm-project/llvm/lib/CodeGen/SelectionDAG/LegalizeVectorTypes.cpp:2992!
Aborted

Those loops mainly contain gather and scatter operations, so it seems there's something funny going on when doing gather and scatter with AVL enabled. Disabling AVL though makes everything run more slowly, so it's not something we want to disable.

Furthermore, they are the main loops of the "fused_take_transpose_contrib_reverse_reshape_transpose_1" function, which is currently taking most of the execution time according to TVM's profiler, but they also appear before the main loops in other functions that still take a lot of time, mainly "fused_reshape_multiply_divide_fast_erf_add_multiply_contrib_reverse_reshape_1" and "fused_subtract_add_sqrt_divide_multiply_add_1", so I'll probably start debugging that myself and get them working first before anything else.

@saudet
Copy link
Author

saudet commented Jan 6, 2022

Ok, here's one for you @simoll. Let me know if you'd like me to post bugs like that somewhere else, and I will repost there!
/cc @efocht

I've continued to test stuff, and found that RV is able to vectorize the loops in "fused_take_transpose_contrib_reverse_reshape_transpose_1" without crashing LLVM, so that one in particular does not generate the crash above, but it does crash with SIGBUS at runtime.

Next, TVM has a backend that can generate C code as well, so I've used it to extract that "fused_take_transpose_contrib_reverse_reshape_transpose_1" function since it's the one I'd be most interested in getting working at the moment. After cleaning up irrelevant/unused generated code, adding a main() function for testing it, and putting that in a file named, say fused_transpose_reshape.c, without modifying the actual loops, it looks like this:

#include <stdint.h>
#include <stdlib.h>
#include <stdio.h>

int32_t fused_take_transpose_contrib_reverse_reshape_transpose(float *placeholder, float *T_transpose) {
  #pragma omp simd
  for (int32_t ax0_ax1_fused = 0; ax0_ax1_fused < 768; ++ax0_ax1_fused) {
    for (int32_t ax2_outer = 0; ax2_outer < 8; ++ax2_outer) {
      for (int32_t ax2_inner = 0; ax2_inner < 16; ++ax2_inner) {
        T_transpose[((((ax0_ax1_fused * 128) + (ax2_outer * 16)) + ax2_inner))] = placeholder[((((((ax2_outer * 36864) + (ax2_inner * 2304)) + ((ax0_ax1_fused >> 6) * 192)) + (ax0_ax1_fused & 63)) + 128))];
      }
    }
  }
  return 0;
}

int main() {
    float *in = (float*)malloc(1200000);
    float *out = (float*)malloc(400000);
    fused_take_transpose_contrib_reverse_reshape_transpose(in, out);
    printf("no crash \\(^^)/\n");
}

Compiling and running that using NCC and Clang, with and without RV, looks like this:

$ ncc -O3 fused_transpose_reshape.c; ./a.out 
ncc: vec( 102): moo.c, line 7: Partially vectorized loop.
ncc: opt(1592): moo.c, line 8: Outer loop unrolled inside inner loop.: ax2_outer
ncc: vec( 101): moo.c, line 9: Vectorized loop.
no crash \(^^)/
$ clang --target=ve-linux -O3 fused_transpose_reshape.c; ./a.out
no crash \(^^)/
$ clang --target=ve-linux -O3 fused_transpose_reshape.c -fopenmp-simd; ./a.out
Segmentation fault

So, there's something wrong with the generated code. Let me know if you'd like me to simplify the loops a bit more, and I'll do it. Thanks in advance for looking into this!


simoll commented Jan 11, 2022

Ok, here's one for you @simoll. Let me know if you'd like me to post bugs like that somewhere else, and I will repost there! /cc @efocht

Thanks! This example code has undefined behavior. The memory allocated by the mallocs is read before it is written to. LLVM is likely seeing through this, NCC might not. I'll fix the test case on my end and look into it.
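
For instance, a UB-free variant of the driver only needs the inputs defined before they are read, something like this sketch (array sizes taken from the original malloc calls):

#include <stdint.h>
#include <stdlib.h>
#include <stdio.h>

int32_t fused_take_transpose_contrib_reverse_reshape_transpose(float *placeholder, float *T_transpose);

int main() {
    // calloc zero-initializes, so the kernel no longer reads uninitialized memory
    float *in = (float*)calloc(300000, sizeof(float));   // 1200000 bytes, as before
    float *out = (float*)calloc(100000, sizeof(float));  //  400000 bytes, as before
    fused_take_transpose_contrib_reverse_reshape_transpose(in, out);
    printf("no crash \\(^^)/\n");
    free(in);
    free(out);
    return 0;
}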


saudet commented Jan 12, 2022

Yeah, I've noticed Clang doing weird things when inlining functions, so I've tried just now to move the main() function into another compilation unit just to make sure, but the behavior is exactly the same: Crashes with RV, doesn't crash without RV.


saudet commented Jan 20, 2022

@simoll Since I'm not hearing back from you, I've continued debugging today, and this is the simplest function that I could come up with that fails with RV (strided_loop.c):

int strided_loop(int *x) {
    #pragma omp simd
    for (int i = 0; i < 256; ++i) {
        x[i * 2] = i;
    }
    return 0;
}

Along with a main() function in a separate compilation unit (strided_loop_main.c):

#include <stdlib.h>
#include <stdio.h>

int main() {
    int *x = (int*)malloc(256 * 2 * sizeof(int));
    strided_loop(x);
    printf("no crash \\(^^)/\n");
}

Running that with NCC and Clang, without and with RV:

$ ncc -O3 strided_loop.c strided_loop_main.c; ./a.out 
ncc: vec( 101): strided_loop.c, line 3: Vectorized loop.
no crash \(^^)/
$ clang --target=ve-linux -O3 strided_loop.c strided_loop_main.c; ./a.out
no crash \(^^)/
$ clang --target=ve-linux -O3 strided_loop.c strided_loop_main.c -fopenmp-simd; ./a.out
Bus error

So it looks like RV is unable to vectorize anything that doesn't access memory sequentially. When we only shift the address, like x[i + 2] or x[i - 2], that works, but other operators like x[i / 2], x[i << 1], or x[i >> 1] do not. Incrementing the index by anything other than 1, like this, also does not work:

int strided_loop(int *x) {
    #pragma omp simd
    for (int i = 0; i < 512; i+=2) {
        x[i] = i;
    }
    return 0;
}

Is this a known limitation? In any case, as usual, if I don't hear back from you I'll keep debugging by myself, but a little bit of help would be greatly appreciated. /cc @efocht


saudet commented Jan 20, 2022

@simoll I've started poring over the assembly code generated by ncc -S ... and clang -S ..., and I've noticed that RV seems to generate the VSCL instruction incorrectly. Here's what we get from Clang with RV for the original loop with x[i * 2]:

	.text
	.file	"strided_loop.c"
	.globl	strided_loop                    # -- Begin function strided_loop
	.p2align	4
	.type	strided_loop,@function
strided_loop:                           # @strided_loop
# %bb.0:
	lea %s1, 256
	lvl %s1
	vseq %v0
	or %s2, 2, (0)1
	vmuls.l %v0, %s2, %v0
	vmuls.l %v0, 4, %v0
	vadds.l %v0, %s0, %v0
	pvseq.lo %v1
	or %s0, 0, (0)1
	vscl %v0, %v1, 0, 0
	b.l.t (, %s10)
.Lfunc_end0:
	.size	strided_loop, .Lfunc_end0-strided_loop
                                        # -- End function
	.ident	"clang version 13.0.0 (https://github.com/sx-aurora-dev/llvm-project/ 629d3be80d3fcb33fb63719f5698014e307a303e)"
	.section	".note.GNU-stack","",@progbits

Notice that the arguments for VSCL are apparently inverted. If I manually switch them to what I believe is correct, that is, vscl %v1, %v0, 0, 0, and pass that to NAS, then strided_loop() no longer crashes, and not only that, the resulting output of the function becomes correct! Could you please confirm that this is indeed a bug?


simoll commented Jan 21, 2022

@simoll I've started peering over the assembly code generated by ncc -S ... and clang -S ..., and I've noticed that RV seems to generate the VSCL instruction incorrectly.
[..]
Notice that the arguments for VSCL are apparently inverted. If I manually switch them to make them correct, I think, that is to say vscl %v1, %v0, 0, 0 and pass that to NAS, then strided_loop() does not crash anymore, but not only that, the resulting output of the function becomes correct! Could you please confirm that this is indeed a bug?

Yes, that's a bug in the llvm-ve vector instruction selection. Thanks for following up on this and actually looking into the underlying issue! I've just checked against the ISA spec. I'll put a patch out on hpce/develop.

simoll added a commit that referenced this issue Jan 21, 2022

saudet commented Jan 21, 2022

@simoll Thank you for the fix! Now I'm trying to do the same thing with float arrays like this (strided_loop_float.c):

int strided_loop(float *x) {
    #pragma omp simd
    for (int i = 0; i < 256; ++i) {
        x[i * 2] = i;
    }
    return 0;
}

And its main function (strided_loop_float_main.c):

#include <stdlib.h>
#include <stdio.h>

int main() {
    float *x = (float*)malloc(256 * 2 * sizeof(float));
    strided_loop(x);
    printf("no crash \\(^^)/\n");
}

It kind of works, but only if we use NAS from NCC, instead of the assembler from Clang:

$ clang --target=ve-linux -O3 strided_loop_float.c strided_loop_float_main.c -fopenmp-simd -S
$ ncc strided_loop_float.s strided_loop_float_main.s
$ ./a.out
no crash \(^^)/

However:

$ clang --target=ve-linux -O3 strided_loop_float.c strided_loop_float_main.c -fopenmp-simd -c
$ ncc strided_loop_float.o strided_loop_float_main.o
$ ./a.out
Segmentation fault

Any ideas what might be going on here?! I'll keep debugging, but I would very much welcome your insights!

EDIT: Actually, it does the same thing with the original version with int arrays as well, so this doesn't have anything to do with floats, apparently. Something wrong with LLVM's assembler, I guess?


saudet commented Jan 21, 2022

BTW, is there a disassembler for VE like objdump -d? I can't seem to find anything. Do VE engineers disassemble object files manually?? Is there at least a quick and dirty script somewhere? @efocht @mikishin


wmeddie commented Jan 21, 2022

@saudet

$ /opt/nec/ve/bin/nobjdump --help
Usage: /opt/nec/ve/bin/nobjdump <option(s)> <file(s)>
 Display information from object <file(s)>.


saudet commented Jan 22, 2022

Ok, thanks @wmeddie. Here's what it looks like for the original strided_loop:

strided_loop.o:     file format elf64-ve


Disassembly of section .text:

0000000000000000 <strided_loop>:
   0:	00 01 00 00 	lea	%s1,0x100(0,0)
   4:	00 00 01 06 
   8:	00 00 00 00 	lvl	%s1
   c:	00 81 00 bf 
  10:	00 00 00 00 	vseq	%v0
  14:	00 00 00 99 
  18:	00 00 00 00 	or	%s2,2,(0)1
  1c:	00 02 02 45 
  20:	00 00 00 00 	vmuls.l	%v0,%s2,%v0
  24:	00 82 20 db 
  28:	00 00 00 00 	vmuls.l	%v0,%s4,%v0
  2c:	00 84 20 db 
  30:	00 00 00 00 	vadds.l	%v0,%s0,%v0
  34:	00 80 20 8b 
  38:	00 00 00 01 	pvseq.lo	%v1
  3c:	00 00 40 99 
  40:	00 00 00 00 	or	%s0,0,(0)1
  44:	00 00 00 45 
  48:	00 00 00 01 	vscl	%v1,%v0,0,0
  4c:	00 00 40 b3 
  50:	00 00 00 00 	b.l.t	0x0(,%s10)
  54:	8a 00 3f 19 

The only difference that I can see with strided_loop.s is vmuls.l %v0,%s4,%v0, which is supposed to be vmuls.l %v0, 4, %v0. I don't see any "mul" in clang -S -emit-llvm ... so I'm guessing something is going wrong again in the lowering of llvm.vp.scatter, not in RV itself. But why would it not be happening with clang -S ..., but only with clang -c ...? 🤔


saudet commented Jan 24, 2022

I've located where that vmuls.l comes from. It comes from here:

SDValue ScaleBroadcast = CDAG.createBroadcast(IndexVT, Scale, AVL);
ScaledIndex = CDAG.getNode(VEISD::VVP_MUL, IndexVT,
                           {Index, ScaleBroadcast, Mask, AVL});

The Scale that we get here is i64 = TargetConstant<4>, so I've tried to change that to see what happens, and when we force it to CDAG.getConstant(4, MVT::i64) we get i64 = Constant<4> instead, and the final output morphs into an or %s2, 4, (0)1 followed by vmuls.l %v0, %s2, %v0, which is correct, for both clang -S ... and clang -c ..., which is weird. That way though, strided_loop.c compiled, assembled, and linked with Clang runs successfully!

@simoll Does that sound like a good fix? I think it's weird that it would choke on a TargetConstant but not a Constant, but if we're not supposed to get TargetConstant in that code, it would explain the undefined behavior.


saudet commented Jan 25, 2022

Well, when converting that TargetConstant to a Constant, almost everything else works just fine, so I've pushed that hack to my fork in commit saudet@71bb689. @simoll Please push a proper fix when you get the time!
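
For reference, the hack amounts to roughly the following at the spot quoted a couple of comments up (an illustrative sketch only; the actual change is in the commit linked above, and a proper fix obviously shouldn't hardcode the scale):

// Workaround: the lowering mishandles a TargetConstant scale, so rebuild it as a
// plain Constant node before broadcasting. Hardcoded to 4 only because that is
// the element size in this test case.
SDValue ScaleConst = CDAG.getConstant(4, MVT::i64);
SDValue ScaleBroadcast = CDAG.createBroadcast(IndexVT, ScaleConst, AVL);
ScaledIndex = CDAG.getNode(VEISD::VVP_MUL, IndexVT,
                           {Index, ScaleBroadcast, Mask, AVL});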

With that, I've been able to make some progress with TVM, as per the latest commit saudet/tvm@54c0d4d. As I mentioned above already, for that BERT model, it generates loops named:

for_body
for_body2
for_body2.us.us
for_body5
for_body5.us.us

With the latest fixes, RV can successfully vectorize and accelerate all those loops! RV can also vectorize the following (smaller) loops, but it slows down execution for most of them, so we need another way to differentiate between them:

for_begin1.preheader
for_begin1.preheader.us
for_begin7.preheader
for_begin7.preheader.1
for_begin7.preheader.2

Finally, some of the following loops crash LLVM at compile time as per the issue described at the bottom of #24 (comment):

for_begin4.preheader
for_begin4.preheader.us

To further differentiate the loops that can be accelerated from those that cannot, we can select loops at the function level. Specifically, the loops in these functions become faster when vectorized:

fused_nn_dense_add
fused_nn_dense_add_1
fused_nn_dense_add_3
fused_nn_dense_add_4
fused_take_transpose_contrib_reverse_reshape_transpose_1
fused_contrib_reverse_reshape_transpose_reshape_1
fused_reshape_sequence_mask_1
fused_reshape_7
fused_reshape_6
fused_contrib_reverse_reshape_7

While the ones in these functions run more slowly when vectorized:

fused_nn_dense_add_2
fused_reshape_multiply_divide_fast_erf_add_multiply_contrib_reverse_reshape_1
fused_take_transpose_contrib_reverse_reshape_divide_1
fused_take_transpose_contrib_reverse_reshape_1
fused_reshape_add_1
fused_variance_1
fused_mean_1
fused_subtract_add_sqrt_divide_multiply_add_1
fused_nn_dense_add_fast_tanh
fused_cast_take_cast_take_add_transpose_add_1
fused_reshape_add_cast_expand_dims_broadcast_to_reshape_1

And finally these ones crash LLVM, ostensibly inside their for_begin4.preheader and for_begin4.preheader.us loops:

fused_subtract_add_sqrt_divide_multiply_add_sequence_mask_transpose_strided_slic_1854526830353013479__1
fused_nn_batch_flatten_1

These last two take less than 0.2% of the total execution time, so I'm happy to ignore them for now. :)

When instructing RV to vectorize only the functions and loops that reduce execution time, it beats VML by quite a wide margin, for batch sizes of 1 and 32, respectively:

Mean Time = 81.6431 ± 0.105627 ms   vs  94.1423 ± 0.396907 ms
Mean Time = 653.712 ± 0.417336 ms   vs  915.229 ± 0.394219 ms

According to TVM's profiler, with a batch size of 32, about ~60% of the time is spent in BLAS, but that means there is still ~40% spent in code generated by LLVM and RV, a figure that isn't exactly negligible...


saudet commented Jan 26, 2022

Given Aurora's hardware specs, compared to a CPU, it should be able to execute that BERT model 4~5 times faster, that is to say, process each sequence of length 128 in ~5 ms. So I've experimented to find out whether we could expect more from autovectorization, assuming that we can make LLVM as good as NCC in that department. The heaviest functions, after the ones using GEMM from BLAS, are fused_take_transpose_contrib_reverse_reshape_transpose and fused_reshape_multiply_divide_fast_erf_add_multiply_contrib_reverse_reshape, so I've extracted those two functions from the C backend of TVM, cleaned them up a bit, and put the following in a file named test_functions.c:

#include <stdint.h>
#include <stdlib.h>
#include <stdio.h>

int32_t fused_take_transpose_contrib_reverse_reshape_transpose(float *placeholder, float *T_transpose) {
  #pragma omp simd
  for (int32_t ax0_ax1_fused = 0; ax0_ax1_fused < 4*768; ++ax0_ax1_fused) {
    for (int32_t ax2_outer = 0; ax2_outer < 8; ++ax2_outer) {
      for (int32_t ax2_inner = 0; ax2_inner < 16; ++ax2_inner) {
        T_transpose[((((ax0_ax1_fused * 128) + (ax2_outer * 16)) + ax2_inner))] = placeholder[((((((ax2_outer * 36864) + (ax2_inner * 2304)) + ((ax0_ax1_fused >> 6) * 192)) + (ax0_ax1_fused & 63)) + 128))];
      }
    }
  }
  return 0;
}

int32_t fused_reshape_multiply_divide_fast_erf_add_multiply_contrib_reverse_reshape(float *placeholder, float *T_reshape) {
  #pragma omp simd
  for (int32_t ax0 = 0; ax0 < 128; ++ax0) {
    for (int32_t ax1_outer = 0; ax1_outer < 192; ++ax1_outer) {
      for (int32_t ax1_inner = 0; ax1_inner < 16; ++ax1_inner) {
        float _1 = ((float*)placeholder)[((((ax0 * 3072) + (ax1_outer * 16)) + ax1_inner))] * 7.071068e-01f;
        float _2 = (_1) < (4.000000e+00f) ? (_1) : (4.000000e+00f);
        ((float*)T_reshape)[((((ax0 * 3072) + (ax1_outer * 16)) + ax1_inner))] = ((((float*)placeholder)[((((ax0 * 3072) + (ax1_outer * 16)) + ax1_inner))] * 5.000000e-01f) * (1.000000e+00f + ((((_2) > (-4.000000e+00f) ? (_2) : (-4.000000e+00f)) * (((((_2) > (-4.000000e+00f) ? (_2) : (-4.000000e+00f)) * ((_2) > (-4.000000e+00f) ? (_2) : (-4.000000e+00f))) * (((((_2) > (-4.000000e+00f) ? (_2) : (-4.000000e+00f)) * ((_2) > (-4.000000e+00f) ? (_2) : (-4.000000e+00f))) * (((((_2) > (-4.000000e+00f) ? (_2) : (-4.000000e+00f)) * ((_2) > (-4.000000e+00f) ? (_2) : (-4.000000e+00f))) * (((((_2) > (-4.000000e+00f) ? (_2) : (-4.000000e+00f)) * ((_2) > (-4.000000e+00f) ? (_2) : (-4.000000e+00f))) * (((((_2) > (-4.000000e+00f) ? (_2) : (-4.000000e+00f)) * ((_2) > (-4.000000e+00f) ? (_2) : (-4.000000e+00f))) * (((((_2) > (-4.000000e+00f) ? (_2) : (-4.000000e+00f)) * ((_2) > (-4.000000e+00f) ? (_2) : (-4.000000e+00f))) * -2.726142e-10f) + 2.770681e-08f)) + -2.101024e-06f)) + -5.692506e-05f)) + -7.349906e-04f)) + -2.954600e-03f)) + -1.609603e-02f)) / (((((_2) > (-4.000000e+00f) ? (_2) : (-4.000000e+00f)) * ((_2) > (-4.000000e+00f) ? (_2) : (-4.000000e+00f))) * (((((_2) > (-4.000000e+00f) ? (_2) : (-4.000000e+00f)) * ((_2) > (-4.000000e+00f) ? (_2) : (-4.000000e+00f))) * (((((_2) > (-4.000000e+00f) ? (_2) : (-4.000000e+00f)) * ((_2) > (-4.000000e+00f) ? (_2) : (-4.000000e+00f))) * (((((_2) > (-4.000000e+00f) ? (_2) : (-4.000000e+00f)) * ((_2) > (-4.000000e+00f) ? (_2) : (-4.000000e+00f))) * -1.456607e-05f) + -2.133740e-04f)) + -1.682827e-03f)) + -7.373329e-03f)) + -1.426474e-02f))));
      }
    }
  }
  return 0;
}

Along with this main() function in a file named test_main.c:

#include <stdint.h>
#include <stdlib.h>
#include <stdio.h>

int main() {
    float *in = (float*)malloc(1600000);
    float *out = (float*)malloc(1600000);
    for (int i = 0; i < 10000; i++) {
        FUNCTION(in, out);
    }
    printf("no crash \\(^^)/\n");
}

And compiled and ran them using Clang with RV and NCC:

$ clang --target=ve-linux -O3 -fno-unroll-loops -fopenmp-simd -DFUNCTION=fused_take_transpose_contrib_reverse_reshape_transpose test_functions.c test_main.c; time ./a.out
no crash \(^^)/

real	0m2.111s
user	0m0.033s
sys	0m0.049s

$ clang --target=ve-linux -O3 -fno-unroll-loops -fopenmp-simd -DFUNCTION=fused_reshape_multiply_divide_fast_erf_add_multiply_contrib_reverse_reshape test_functions.c test_main.c; time ./a.out
no crash \(^^)/

real	0m5.303s
user	0m0.034s
sys	0m0.053s

$ ncc -O3 -DFUNCTION=fused_take_transpose_contrib_reverse_reshape_transpose test_functions.c test_main.c; time ./a.out
ncc: vec( 102): test_functions.c, line 7: Partially vectorized loop.
ncc: opt(1592): test_functions.c, line 8: Outer loop unrolled inside inner loop.: ax2_outer
ncc: vec( 101): test_functions.c, line 9: Vectorized loop.
ncc: vec( 101): test_functions.c, line 21: Vectorized loop.
ncc: vec( 103): test_main.c, line 8: Unvectorized loop.
no crash \(^^)/

real	0m2.282s
user	0m0.034s
sys	0m0.054s

$ ncc -O3 -DFUNCTION=fused_reshape_multiply_divide_fast_erf_add_multiply_contrib_reverse_reshape test_functions.c test_main.c; time ./a.out
ncc: vec( 102): test_functions.c, line 7: Partially vectorized loop.
ncc: opt(1592): test_functions.c, line 8: Outer loop unrolled inside inner loop.: ax2_outer
ncc: vec( 101): test_functions.c, line 9: Vectorized loop.
ncc: vec( 101): test_functions.c, line 21: Vectorized loop.
ncc: vec( 103): test_main.c, line 8: Unvectorized loop.
no crash \(^^)/

real	0m9.607s
user	0m0.039s
sys	0m0.046s

Surprisingly enough, it looks like RV is already better than NCC at vectorizing these loops! Unfortunately, I think that means we cannot expect to gain a lot more from autovectorization. To approach the speed enjoyed by CPUs and GPUs with TVM, we really need to use the high-level vector information that TVM gives us, via LLVM vector instructions. Right now, LLVM-VE is unable to process those vector instructions, and I am convinced it will prevent us from obtaining higher performance with TVM.

A related problem is that NLC does not provide a properly optimized implementation of GEMM. However, once we have an LLVM backend that supports vector instructions, TVM can generate highly optimized GEMM kernels by itself, so that is not a big issue:
https://tvm.apache.org/2018/03/23/nmt-transformer-optimize

@efocht @simoll @mikishin @kaz7 Are there plans to add support for at least the LLVM vector instructions needed by TVM?

@saudet
Copy link
Author

saudet commented Jan 26, 2022

Here's a run with VE_PROGINF=detail as requested by @wmeddie:

$ export VE_PROGINF=detail
$ clang --target=ve-linux -O3 -fno-unroll-loops -fopenmp-simd -DFUNCTION=fused_take_transpose_contrib_reverse_reshape_transpose test_functions.c test_main.c; time ./a.out
no crash \(^^)/
            ********  Program  Information  ********
  Real Time (sec)                         :             1.993676
  User Time (sec)                         :             1.991554
  Vector Time (sec)                       :             1.990915
  Inst. Count                             :            256201392
  V. Inst. Count                          :            155520000
  V. Element Count                        :          39813120000
  V. Load Element Count                   :           3932160000
  FLOP Count                              :                    0
  MOPS                                    :         20047.325130
  MOPS (Real)                             :         20020.221805
  MFLOPS                                  :             0.000000
  MFLOPS (Real)                           :             0.000000
  A. V. Length                            :           256.000000
  V. Op. Ratio (%)                        :            99.747753
  L1 Cache Miss (sec)                     :             0.000039
  CPU Port Conf. (sec)                    :             0.526629
  V. Arith. Exec. (sec)                   :             0.497391
  V. Load Exec. (sec)                     :             1.493523
  VLD LLC Hit Element Ratio (%)           :            99.997605
  FMA Element Count                       :                    0
  Power Throttling (sec)                  :             0.000000
  Thermal Throttling (sec)                :             0.000000
  Memory Size Used (MB)                   :           296.000000
  Non Swappable Memory Size Used (MB)     :            84.000000

  Start Time (date)        :        Wed Jan 26 22:30:59 2022 JST
  End   Time (date)        :        Wed Jan 26 22:31:01 2022 JST

real	0m2.140s
user	0m0.040s
sys	0m0.058s


$ clang --target=ve-linux -O3 -fno-unroll-loops -fopenmp-simd -DFUNCTION=fused_reshape_multiply_divide_fast_erf_add_multiply_contrib_reverse_reshape test_functions.c test_main.c; time ./a.out
no crash \(^^)/
            ********  Program  Information  ********
  Real Time (sec)                         :             5.175923
  User Time (sec)                         :             5.173867
  Vector Time (sec)                       :             5.173109
  Inst. Count                             :           1451881392
  V. Inst. Count                          :           1320960000
  V. Element Count                        :         169082880000
  V. Load Element Count                   :           3932160000
  FLOP Count                              :         117964800000
  MOPS                                    :         32709.868659
  MOPS (Real)                             :         32692.499529
  MFLOPS                                  :         22803.182026
  MFLOPS (Real)                           :         22791.073404
  A. V. Length                            :           128.000000
  V. Op. Ratio (%)                        :            99.922630
  L1 Cache Miss (sec)                     :             0.000038
  CPU Port Conf. (sec)                    :             4.712389
  V. Arith. Exec. (sec)                   :             4.631338
  V. Load Exec. (sec)                     :             0.541771
  VLD LLC Hit Element Ratio (%)           :            99.999437
  FMA Element Count                       :                    0
  Power Throttling (sec)                  :             0.000000
  Thermal Throttling (sec)                :             0.000000
  Memory Size Used (MB)                   :           296.000000
  Non Swappable Memory Size Used (MB)     :            84.000000

  Start Time (date)        :        Wed Jan 26 22:31:27 2022 JST
  End   Time (date)        :        Wed Jan 26 22:31:32 2022 JST

real	0m5.320s
user	0m0.035s
sys	0m0.059s


$ ncc -O3 -DFUNCTION=fused_take_transpose_contrib_reverse_reshape_transpose test_functions.c test_main.c; time ./a.out
ncc: vec( 102): test_functions.c, line 7: Partially vectorized loop.
ncc: opt(1592): test_functions.c, line 8: Outer loop unrolled inside inner loop.: ax2_outer
ncc: vec( 101): test_functions.c, line 9: Vectorized loop.
ncc: vec( 101): test_functions.c, line 21: Vectorized loop.
ncc: vec( 103): test_main.c, line 8: Unvectorized loop.
no crash \(^^)/
            ********  Program  Information  ********
  Real Time (sec)                         :             2.156141
  User Time (sec)                         :             2.154043
  Vector Time (sec)                       :             2.152795
  Inst. Count                             :           2902193791
  V. Inst. Count                          :            497960000
  V. Element Count                        :           9512960000
  V. Load Element Count                   :           3932160000
  FLOP Count                              :                    0
  MOPS                                    :          5534.062951
  MOPS (Real)                             :          5527.097179
  MFLOPS                                  :             0.000000
  MFLOPS (Real)                           :             0.000000
  A. V. Length                            :            19.103864
  V. Op. Ratio (%)                        :            79.825504
  L1 Cache Miss (sec)                     :             0.000595
  CPU Port Conf. (sec)                    :             1.228800
  V. Arith. Exec. (sec)                   :             0.192999
  V. Load Exec. (sec)                     :             1.954654
  VLD LLC Hit Element Ratio (%)           :            99.999641
  FMA Element Count                       :                    0
  Power Throttling (sec)                  :             0.000000
  Thermal Throttling (sec)                :             0.000000
  Memory Size Used (MB)                   :           296.000000
  Non Swappable Memory Size Used (MB)     :            84.000000

  Start Time (date)        :        Wed Jan 26 22:31:59 2022 JST
  End   Time (date)        :        Wed Jan 26 22:32:01 2022 JST

real	0m2.301s
user	0m0.041s
sys	0m0.055s


$ ncc -O3 -DFUNCTION=fused_reshape_multiply_divide_fast_erf_add_multiply_contrib_reverse_reshape test_functions.c test_main.c; time ./a.out
ncc: vec( 102): test_functions.c, line 7: Partially vectorized loop.
ncc: opt(1592): test_functions.c, line 8: Outer loop unrolled inside inner loop.: ax2_outer
ncc: vec( 101): test_functions.c, line 9: Vectorized loop.
ncc: vec( 101): test_functions.c, line 21: Vectorized loop.
ncc: vec( 103): test_main.c, line 8: Unvectorized loop.
no crash \(^^)/
            ********  Program  Information  ********
  Real Time (sec)                         :             9.483146
  User Time (sec)                         :             9.480355
  Vector Time (sec)                       :             9.479321
  Inst. Count                             :          14511632030
  V. Inst. Count                          :           7372800000
  V. Element Count                        :         117964800000
  V. Load Element Count                   :           3932160000
  FLOP Count                              :         165150720000
  MOPS                                    :         19834.426419
  MOPS (Real)                             :         19826.566042
  MFLOPS                                  :         17422.089684
  MFLOPS (Real)                           :         17415.185316
  A. V. Length                            :            16.000000
  V. Op. Ratio (%)                        :            96.203116
  L1 Cache Miss (sec)                     :             0.000040
  CPU Port Conf. (sec)                    :             0.000000
  V. Arith. Exec. (sec)                   :             5.090743
  V. Load Exec. (sec)                     :             0.175553
  VLD LLC Hit Element Ratio (%)           :            99.990017
  FMA Element Count                       :          62914560000
  Power Throttling (sec)                  :             0.000000
  Thermal Throttling (sec)                :             0.000000
  Memory Size Used (MB)                   :           296.000000
  Non Swappable Memory Size Used (MB)     :            84.000000

  Start Time (date)        :        Wed Jan 26 22:32:18 2022 JST
  End   Time (date)        :        Wed Jan 26 22:32:27 2022 JST

real	0m9.622s
user	0m0.038s
sys	0m0.052s

@saudet
Copy link
Author

saudet commented Jan 27, 2022

Just for completeness, I've also done the same for a batch size of 32, with test_functions_32.c:

#include <stdint.h>
#include <stdlib.h>
#include <stdio.h>

int32_t fused_take_transpose_contrib_reverse_reshape_transpose(float *placeholder, float *T_transpose) {
  #pragma omp simd
  for (int32_t ax0_ax1_fused = 0; ax0_ax1_fused < 24576; ++ax0_ax1_fused) {
    for (int32_t ax2_outer = 0; ax2_outer < 8; ++ax2_outer) {
      for (int32_t ax2_inner = 0; ax2_inner < 16; ++ax2_inner) {
        ((float*)T_transpose)[((((ax0_ax1_fused * 128) + (ax2_outer * 16)) + ax2_inner))] = ((float*)placeholder)[((((((ax2_outer * 1179648) + (ax2_inner * 73728)) + ((ax0_ax1_fused >> 6) * 192)) + (ax0_ax1_fused & 63)) + 128))];
      }
    }
  }
  return 0;
}

int32_t fused_reshape_multiply_divide_fast_erf_add_multiply_contrib_reverse_reshape(float *placeholder, float *T_reshape) {
  #pragma omp simd
  for (int32_t ax0 = 0; ax0 < 4096; ++ax0) {
    for (int32_t ax1_outer = 0; ax1_outer < 192; ++ax1_outer) {
      for (int32_t ax1_inner = 0; ax1_inner < 16; ++ax1_inner) {
        float _1 = ((float*)placeholder)[((((ax0 * 3072) + (ax1_outer * 16)) + ax1_inner))] * 7.071068e-01f;
        float _2 = (_1) < (4.000000e+00f) ? (_1) : (4.000000e+00f);
        ((float*)T_reshape)[((((ax0 * 3072) + (ax1_outer * 16)) + ax1_inner))] = ((((float*)placeholder)[((((ax0 * 3072) + (ax1_outer * 16)) + ax1_inner))] * 5.000000e-01f) * (1.000000e+00f + ((((_2) > (-4.000000e+00f) ? (_2) : (-4.000000e+00f)) * (((((_2) > (-4.000000e+00f) ? (_2) : (-4.000000e+00f)) * ((_2) > (-4.000000e+00f) ? (_2) : (-4.000000e+00f))) * (((((_2) > (-4.000000e+00f) ? (_2) : (-4.000000e+00f)) * ((_2) > (-4.000000e+00f) ? (_2) : (-4.000000e+00f))) * (((((_2) > (-4.000000e+00f) ? (_2) : (-4.000000e+00f)) * ((_2) > (-4.000000e+00f) ? (_2) : (-4.000000e+00f))) * (((((_2) > (-4.000000e+00f) ? (_2) : (-4.000000e+00f)) * ((_2) > (-4.000000e+00f) ? (_2) : (-4.000000e+00f))) * (((((_2) > (-4.000000e+00f) ? (_2) : (-4.000000e+00f)) * ((_2) > (-4.000000e+00f) ? (_2) : (-4.000000e+00f))) * (((((_2) > (-4.000000e+00f) ? (_2) : (-4.000000e+00f)) * ((_2) > (-4.000000e+00f) ? (_2) : (-4.000000e+00f))) * -2.726142e-10f) + 2.770681e-08f)) + -2.101024e-06f)) + -5.692506e-05f)) + -7.349906e-04f)) + -2.954600e-03f)) + -1.609603e-02f)) / (((((_2) > (-4.000000e+00f) ? (_2) : (-4.000000e+00f)) * ((_2) > (-4.000000e+00f) ? (_2) : (-4.000000e+00f))) * (((((_2) > (-4.000000e+00f) ? (_2) : (-4.000000e+00f)) * ((_2) > (-4.000000e+00f) ? (_2) : (-4.000000e+00f))) * (((((_2) > (-4.000000e+00f) ? (_2) : (-4.000000e+00f)) * ((_2) > (-4.000000e+00f) ? (_2) : (-4.000000e+00f))) * (((((_2) > (-4.000000e+00f) ? (_2) : (-4.000000e+00f)) * ((_2) > (-4.000000e+00f) ? (_2) : (-4.000000e+00f))) * -1.456607e-05f) + -2.133740e-04f)) + -1.682827e-03f)) + -7.373329e-03f)) + -1.426474e-02f))));
      }
    }
  }
  return 0;
}

And test_main_32.c:

#include <stdint.h>
#include <stdlib.h>
#include <stdio.h>

int main() {
    float *in = (float*)malloc(1600000*32);
    float *out = (float*)malloc(1600000*32);
    for (int i = 0; i < 1000; i++) {
        FUNCTION(in, out);
    }
    printf("no crash \\(^^)/\n");
}

Compiled and executed like this:

$ export VE_PROGINF=detail
$ clang --target=ve-linux -O3 -fno-unroll-loops -fopenmp-simd -DFUNCTION=fused_take_transpose_contrib_reverse_reshape_transpose test_functions_32.c test_main_32.c; ./a.out
no crash \(^^)/
            ********  Program  Information  ********
  Real Time (sec)                         :             2.103840
  User Time (sec)                         :             2.100888
  Vector Time (sec)                       :             2.100341
  Inst. Count                             :            204811581
  V. Inst. Count                          :            124416000
  V. Element Count                        :          31850496000
  V. Load Element Count                   :           3145728000
  FLOP Count                              :                    0
  MOPS                                    :         15202.179233
  MOPS (Real)                             :         15177.448857
  MFLOPS                                  :             0.000000
  MFLOPS (Real)                           :             0.000000
  A. V. Length                            :           256.000000
  V. Op. Ratio (%)                        :            99.748220
  L1 Cache Miss (sec)                     :             0.000049
  CPU Port Conf. (sec)                    :             0.421303
  V. Arith. Exec. (sec)                   :             0.430769
  V. Load Exec. (sec)                     :             1.669572
  VLD LLC Hit Element Ratio (%)           :            18.461893
  FMA Element Count                       :                    0
  Power Throttling (sec)                  :             0.000000
  Thermal Throttling (sec)                :             0.000000
  Memory Size Used (MB)                   :           360.000000
  Non Swappable Memory Size Used (MB)     :            84.000000

  Start Time (date)        :        Thu Jan 27 09:14:31 2022 JST
  End   Time (date)        :        Thu Jan 27 09:14:33 2022 JST


$ clang --target=ve-linux -O3 -fno-unroll-loops -fopenmp-simd -DFUNCTION=fused_reshape_multiply_divide_fast_erf_add_multiply_contrib_reverse_reshape test_functions_32.c test_main_32.c; ./a.out
no crash \(^^)/
            ********  Program  Information  ********
  Real Time (sec)                         :            20.353496
  User Time (sec)                         :            20.351115
  Vector Time (sec)                       :            20.349881
  Inst. Count                             :           2765017581
  V. Inst. Count                          :           2113536000
  V. Element Count                        :         541065216000
  V. Load Element Count                   :          12582912000
  FLOP Count                              :         377487360000
  MOPS                                    :         26620.041573
  MOPS (Real)                             :         26615.415383
  MFLOPS                                  :         18549.786745
  MFLOPS (Real)                           :         18546.563053
  A. V. Length                            :           256.000000
  V. Op. Ratio (%)                        :            99.879738
  L1 Cache Miss (sec)                     :             0.000051
  CPU Port Conf. (sec)                    :            16.851214
  V. Arith. Exec. (sec)                   :            16.444695
  V. Load Exec. (sec)                     :             3.904555
  VLD LLC Hit Element Ratio (%)           :            96.496842
  FMA Element Count                       :                    0
  Power Throttling (sec)                  :             0.000000
  Thermal Throttling (sec)                :             0.000000
  Memory Size Used (MB)                   :           360.000000
  Non Swappable Memory Size Used (MB)     :            84.000000

  Start Time (date)        :        Thu Jan 27 09:15:03 2022 JST
  End   Time (date)        :        Thu Jan 27 09:15:23 2022 JST


$ ncc -O3 -DFUNCTION=fused_take_transpose_contrib_reverse_reshape_transpose test_functions_32.c test_main_32.c; ./a.out
ncc: opt(1592): test_functions_32.c, line 8: Outer loop unrolled inside inner loop.: ax2_outer
ncc: vec( 101): test_functions_32.c, line 9: Vectorized loop.
ncc: vec( 101): test_functions_32.c, line 21: Vectorized loop.
ncc: vec( 103): test_main_32.c, line 8: Unvectorized loop.
no crash \(^^)/
            ********  Program  Information  ********
  Real Time (sec)                         :             7.491577
  User Time (sec)                         :             7.488899
  Vector Time (sec)                       :             7.488046
  Inst. Count                             :           2801723030
  V. Inst. Count                          :            393216000
  V. Element Count                        :           6291456000
  V. Load Element Count                   :           3145728000
  FLOP Count                              :                    0
  MOPS                                    :          1161.834850
  MOPS (Real)                             :          1161.299513
  MFLOPS                                  :             0.000000
  MFLOPS (Real)                           :             0.000000
  A. V. Length                            :            16.000000
  V. Op. Ratio (%)                        :            72.315894
  L1 Cache Miss (sec)                     :             0.000052
  CPU Port Conf. (sec)                    :             2.102139
  V. Arith. Exec. (sec)                   :             0.140434
  V. Load Exec. (sec)                     :             7.347604
  VLD LLC Hit Element Ratio (%)           :            94.651079
  FMA Element Count                       :                    0
  Power Throttling (sec)                  :             0.000000
  Thermal Throttling (sec)                :             0.000000
  Memory Size Used (MB)                   :           360.000000
  Non Swappable Memory Size Used (MB)     :            84.000000

  Start Time (date)        :        Thu Jan 27 09:15:57 2022 JST
  End   Time (date)        :        Thu Jan 27 09:16:05 2022 JST


$ ncc -O3 -DFUNCTION=fused_reshape_multiply_divide_fast_erf_add_multiply_contrib_reverse_reshape test_functions_32.c test_main_32.c; ./a.out
ncc: opt(1592): test_functions_32.c, line 8: Outer loop unrolled inside inner loop.: ax2_outer
ncc: vec( 101): test_functions_32.c, line 9: Vectorized loop.
ncc: vec( 101): test_functions_32.c, line 21: Vectorized loop.
ncc: vec( 103): test_main_32.c, line 8: Unvectorized loop.
no crash \(^^)/
            ********  Program  Information  ********
  Real Time (sec)                         :            30.535807
  User Time (sec)                         :            30.533391
  Vector Time (sec)                       :            30.531910
  Inst. Count                             :          46436399030
  V. Inst. Count                          :          23592960000
  V. Element Count                        :         377487360000
  V. Load Element Count                   :          12582912000
  FLOP Count                              :         528482304000
  MOPS                                    :         19705.804619
  MOPS (Real)                             :         19703.341234
  MFLOPS                                  :         17309.135037
  MFLOPS (Real)                           :         17306.971255
  A. V. Length                            :            16.000000
  V. Op. Ratio (%)                        :            96.203248
  L1 Cache Miss (sec)                     :             0.000052
  CPU Port Conf. (sec)                    :             0.000000
  V. Arith. Exec. (sec)                   :            16.290377
  V. Load Exec. (sec)                     :             0.879950
  VLD LLC Hit Element Ratio (%)           :             0.033762
  FMA Element Count                       :         201326592000
  Power Throttling (sec)                  :             0.000000
  Thermal Throttling (sec)                :             0.000000
  Memory Size Used (MB)                   :           360.000000
  Non Swappable Memory Size Used (MB)     :            84.000000

  Start Time (date)        :        Thu Jan 27 09:16:44 2022 JST
  End   Time (date)        :        Thu Jan 27 09:17:14 2022 JST

In this case, NCC performs even worse, so my conclusion is the same: I think I've done all we can from TVM's side. To increase performance further, I feel that we absolutely need better support for IR vector instructions in LLVM-VE. If anyone has a different opinion, please do let me know what you believe should be done next!

@simoll
Copy link
Contributor

simoll commented Jan 27, 2022

In this case, NCC performs even worse, so my conclusion is the same: I think I've done all we can from TVM's side. To increase performance further, I feel that we absolutely need better support for IR vector instructions in LLVM-VE. If anyone has a different opinion, please do let me know what you believe should be done next!

Strided VLD/VST in LLVM-VE should give us a good speedup here. What you got here is with VSC/VGT only.

@saudet
Copy link
Author

saudet commented Jan 28, 2022

In this case, NCC performs even worse, so my conclusion is the same: I think I've done all we can from TVM's side. To increase performance further, I feel that we absolutely need better support for IR vector instructions in LLVM-VE. If anyone has a different opinion, please do let me know what you believe should be done next!

Strided VLD/VST in LLVM-VE should give us a good speedup here. What you got here is with VSC/VGT only.

Are you saying that LLVM-VE currently isn't generating any VLD/VST instructions for strided access? That sounds about right. RV is also generating VSC/VGT for this function, while NCC correctly generates VLD/VST:

#include <stdint.h>
#include <stdlib.h>
#include <stdio.h>

int32_t fused_nn_batch_flatten(float *placeholder, float *tensor) {
  #pragma omp simd
  for (int32_t ax0 = 0; ax0 < 32; ++ax0) {
    for (int32_t ax1_outer = 0; ax1_outer < 48; ++ax1_outer) {
      for (int32_t ax1_inner = 0; ax1_inner < 16; ++ax1_inner) {
        ((float*)tensor)[((((ax0 * 768) + (ax1_outer * 16)) + ax1_inner))] = ((float*)placeholder)[((((ax0 * 768) + (ax1_outer * 16)) + ax1_inner))];
      }
    }
  }
  return 0;
}

However, testing that by putting the above in fused_nn_batch_flatten.c and this in fused_nn_batch_flatten_main.c:

#include <stdint.h>
#include <stdlib.h>
#include <stdio.h>

int main() {
    float *in = (float*)malloc(100000);
    float *out = (float*)malloc(100000);
    for (int i = 0; i < 100000; i++) {
        fused_nn_batch_flatten(in, out);
    }
    printf("no crash \\(^^)/\n");
}

Still, NCC is only 2x faster:

$ export VE_PROGINF=detail
$ clang --target=ve-linux -O3 -fno-unroll-loops -fopenmp-simd fused_nn_batch_flatten.c fused_nn_batch_flatten_main.c; ./a.out
no crash \(^^)/
            ********  Program  Information  ********
  Real Time (sec)                         :             1.524528
  User Time (sec)                         :             1.522354
  Vector Time (sec)                       :             1.521884
  Inst. Count                             :            942421388
  V. Inst. Count                          :            614400000
  V. Element Count                        :          19660800000
  V. Load Element Count                   :           2457600000
  FLOP Count                              :                    0
  MOPS                                    :         13133.722932
  MOPS (Real)                             :         13111.496130
  MFLOPS                                  :             0.000000
  MFLOPS (Real)                           :             0.000000
  A. V. Length                            :            32.000000
  V. Op. Ratio (%)                        :            98.358976
  L1 Cache Miss (sec)                     :             0.000038
  CPU Port Conf. (sec)                    :             1.205292
  V. Arith. Exec. (sec)                   :             0.548571
  V. Load Exec. (sec)                     :             0.973311
  VLD LLC Hit Element Ratio (%)           :            99.999812
  FMA Element Count                       :                    0
  Power Throttling (sec)                  :             0.000000
  Thermal Throttling (sec)                :             0.000000
  Memory Size Used (MB)                   :           296.000000
  Non Swappable Memory Size Used (MB)     :            84.000000

  Start Time (date)        :        Fri Jan 28 09:58:20 2022 JST
  End   Time (date)        :        Fri Jan 28 09:58:22 2022 JST

$ ncc -O3 fused_nn_batch_flatten.c fused_nn_batch_flatten_main.c; ./a.out
ncc: opt(1592): fused_nn_batch_flatten.c, line 8: Outer loop unrolled inside inner loop.: ax1_outer
ncc: vec( 101): fused_nn_batch_flatten.c, line 9: Vectorized loop.
ncc: vec( 103): fused_nn_batch_flatten_main.c, line 8: Unvectorized loop.
no crash \(^^)/
            ********  Program  Information  ********
  Real Time (sec)                         :             0.770533
  User Time (sec)                         :             0.768489
  Vector Time (sec)                       :             0.768072
  Inst. Count                             :           1498021837
  V. Inst. Count                          :            307200000
  V. Element Count                        :           4915200000
  V. Load Element Count                   :           2457600000
  FLOP Count                              :                    0
  MOPS                                    :          7949.143336
  MOPS (Real)                             :          7924.424063
  MFLOPS                                  :             0.000000
  MFLOPS (Real)                           :             0.000000
  A. V. Length                            :            16.000000
  V. Op. Ratio (%)                        :            80.497583
  L1 Cache Miss (sec)                     :             0.000038
  CPU Port Conf. (sec)                    :             0.000000
  V. Arith. Exec. (sec)                   :             0.109714
  V. Load Exec. (sec)                     :             0.658358
  VLD LLC Hit Element Ratio (%)           :            99.999021
  FMA Element Count                       :                    0
  Power Throttling (sec)                  :             0.000000
  Thermal Throttling (sec)                :             0.000000
  Memory Size Used (MB)                   :           296.000000
  Non Swappable Memory Size Used (MB)     :            84.000000

  Start Time (date)        :        Fri Jan 28 09:58:57 2022 JST
  End   Time (date)        :        Fri Jan 28 09:58:58 2022 JST

So, I don't see how that's going to give us 4~5x overall speedup...

That said, in the case of the x86 backend, I'm using MKL, so that's already a lot better than NLC, but still, on a single CPU socket from aurora05, I'm getting ~403 ms for a batch size of 32, and with config={"tir.disable_vectorize": True}, it only degrades to around ~413 ms. It sure looks like LLVM is able to autovectorize pretty much all the code in there optimally for AVX-512. How likely do you think that is @simoll?

@simoll
Copy link
Contributor

simoll commented Jan 28, 2022

That said, in the case of the x86 backend, I'm using MKL, so that's already a lot better than NLC, but still, on a single CPU socket from aurora05, I'm getting ~403 ms for a batch size of 32, and with config={"tir.disable_vectorize": True}, it only degrades to around ~413 ms. It sure looks like LLVM is able to autovectorize pretty much all the code in there optimally for AVX-512. How likely do you think that is @simoll?

The inner loops run over 16 elements, 32 bits wide each.. that's exactly 512bit.. and the memory accesses are mostly contiguous in the inner iteration variable.. I am not surprised AVX512 is doing well here with LLVM.

@saudet
Copy link
Author

saudet commented Jan 28, 2022

The inner loops run over 16 elements, 32 bits wide each.. that's exactly 512bit.. and the memory accesses are mostly contiguous in the inner iteration variable.. I am not surprised AVX512 is doing well here with LLVM.

It's a lot more than 512-bit wide. If you look at fused_reshape_multiply_divide_fast_erf_add_multiply_contrib_reverse_reshape() for example, the index used for all loads and stores is ((((ax0 * 3072) + (ax1_outer * 16)) + ax1_inner)), which is exactly sequential access! Is there a reason both NCC and LLVM can't see through this? If that's the reason this isn't working, maybe there's a way to tell TVM to increase the size of the inner loops...
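
To make that point concrete, here is a hand-written sketch (my own illustration, not anything TVM or RV produces; the function name is made up) of that loop nest collapsed into a single contiguous loop, which is the shape a vectorizer could cover with full 256-element vector operations:

#include <stdint.h>

/* Hand-collapsed sketch of fused_reshape_multiply_divide_fast_erf_add_multiply_contrib_reverse_reshape:
 * the load/store index ax0*3072 + ax1_outer*16 + ax1_inner walks the buffers
 * sequentially, so the three loops reduce to one loop of 128*3072 iterations.
 * The body is the same clamped rational approximation as in the generated
 * code above, just written with named temporaries. */
int32_t fused_erf_collapsed(float *placeholder, float *T_reshape) {
  #pragma omp simd
  for (int32_t i = 0; i < 128 * 3072; ++i) {
    float x = placeholder[i];
    float t = x * 7.071068e-01f;
    t = t < 4.0f ? t : 4.0f;     /* clamp to [-4, 4] */
    t = t > -4.0f ? t : -4.0f;
    float s = t * t;
    float num = t * (s * (s * (s * (s * (s * (s * -2.726142e-10f + 2.770681e-08f)
                + -2.101024e-06f) + -5.692506e-05f) + -7.349906e-04f)
                + -2.954600e-03f) + -1.609603e-02f);
    float den = s * (s * (s * (s * -1.456607e-05f + -2.133740e-04f)
                + -1.682827e-03f) + -7.373329e-03f) + -1.426474e-02f;
    T_reshape[i] = x * 0.5f * (1.0f + num / den);
  }
  return 0;
}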

@saudet
Copy link
Author

saudet commented Jan 31, 2022

Thanks for the tip @simoll! I was able to increase the size of almost all inner loops to at least 256 by increasing the default split factor to 256, see commit saudet/tvm@0694ae3. Anything larger than that like 512 makes RV crash LLVM, regardless of the value of RV_FORCE_WIDTH, so I'm guessing that even if RV did support values larger than 256, it wouldn't be optimal. Is my assumption correct? Anyway, with a factor 256, the total execution time drops another notch to ~508 ms for a batch size of 32, and TVM now says that ~70% of that is spent in GEMM calls to NLC, so it's almost time to start thinking about getting rid of NLC...
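
For readers not familiar with TVM schedules, here is a plain-C illustration (my own sketch, not TIR output, and the function is hypothetical) of what a split factor of 256 means: a long loop is split into an outer loop over chunks and an inner loop of exactly 256 iterations, so each inner trip can map to one full-length VE vector operation:

#include <stdint.h>

/* Illustration only: splitting a length-N loop with factor 256 gives an inner
 * loop whose trip count matches the 256-element VE vector length.
 * N is assumed to be a multiple of 256 here. */
void scaled_copy_split256(const float *src, float *dst, int32_t N) {
  for (int32_t i_outer = 0; i_outer < N / 256; ++i_outer) {
    #pragma omp simd
    for (int32_t i_inner = 0; i_inner < 256; ++i_inner) {
      int32_t i = i_outer * 256 + i_inner;
      dst[i] = src[i] * 2.0f;
    }
  }
}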

@mikishin BTW, are there any plans at NEC to accelerate GEMM calls in NLC? If optimized properly, we should be able to make it at least 4~5x faster than it currently is.

@saudet
Copy link
Author

saudet commented Feb 1, 2022

I've started looking at what we could do for GEMM with TVM by following this tutorial using the backend for VE:
https://tvm.apache.org/docs/how_to/optimize_operators/opt_gemm.html
Although there is no problem getting everything running, it does not seem possible to benefit from any of the optimizations described on that page, other than the "parallel" one (OpenMP). The first and most important optimization for this kind of algorithm is blocking (aka tiling), but doing so on VE actually slows things down. The execution time with the VE backend for the baseline code in that tutorial (1024x1024 * 1024x1024 matrix multiplication) is ~50 ms, but for the code optimized with tiling in 256x256 blocks, which is the least bad case, it rises to ~100 ms. To make sure that this isn't something wrong with RV, I've exported the "baseline" and "tiled" code to C and put it in a file named mmult.c:

#include <stdint.h>
#include <stdlib.h>
#include <stdio.h>

int32_t mmult_baseline(float *A, float *B, float *C) {
  for (int32_t m = 0; m < 1024; m++) {
    for (int32_t n = 0; n < 1024; n++) {
      C[((m*1024) + n)] = 0;
      for (int32_t k = 0; k < 1024; k++) {
        C[((m*1024) + n)] = (C[((m*1024) + n)] + (A[((m*1024) + k)]*B[((k*1024) + n)]));
      }
    }
  }
  return 0;
}

int32_t mmult_tiled(float *A, float *B, float *C) {
  for (int32_t m_outer = 0; m_outer < 4; m_outer++) {
    for (int32_t n_outer = 0; n_outer < 4; n_outer++) {
      for (int32_t m_inner_init = 0; m_inner_init < 256; m_inner_init++) {
        for (int32_t n_inner_init = 0; n_inner_init < 256; n_inner_init++) {
          C[((((m_outer*262144) + (m_inner_init*1024)) + (n_outer*256)) + n_inner_init)] = 0;
        }
      }
      for (int32_t k_outer = 0; k_outer < 256; k_outer++) {
        for (int32_t k_inner = 0; k_inner < 4; k_inner++) {
          for (int32_t m_inner = 0; m_inner < 256; m_inner++) {
            for (int32_t n_inner = 0; n_inner < 256; n_inner++) {
              C[((((m_outer*262144) + (m_inner*1024)) + (n_outer*256)) + n_inner)] = (C[((((m_outer*262144) + (m_inner*1024)) + (n_outer*256)) + n_inner)] + (A[((((m_outer*262144) + (m_inner*1024)) + (k_outer*4)) + k_inner)]*B[((((k_outer*4096) + (k_inner*1024)) + (n_outer*256)) + n_inner)]));
            }
          }
        }
      }
    }
  }
  return 0;
}

Along with this main() function in mmult_main.c:

#include <stdint.h>
#include <stdlib.h>
#include <stdio.h>

int main() {
    float *A = (float*)malloc(1024*1024*4);
    float *B = (float*)malloc(1024*1024*4);
    float *C = (float*)malloc(1024*1024*4);
    for (int i = 0; i < 100; i++) {
        MMULT(A, B, C);
    }
    printf("no crash \\(^^)/\n");
}

And compiled and executed that with GCC as well as NCC as follows:

$ gcc -O3 -DMMULT=mmult_baseline  mmult.c mmult_main.c; time ./a.out
no crash \(^^)/

real	4m5.438s
user	4m5.428s
sys	0m0.009s


$ gcc -O3 -DMMULT=mmult_tiled  mmult.c mmult_main.c; time ./a.out
no crash \(^^)/

real	0m23.088s
user	0m23.081s
sys	0m0.007s

$ ncc -O3 -DMMULT=mmult_baseline  mmult.c mmult_main.c; time ./a.out 
ncc: opt(1589): mmult.c, line 7: Outer loop moved inside inner loop(s).: n
ncc: vec( 101): mmult.c, line 7: Vectorized loop.
ncc: opt(1592): mmult.c, line 9: Outer loop unrolled inside inner loop.: k
ncc: vec( 101): mmult.c, line 9: Vectorized loop.
ncc: vec( 101): mmult.c, line 20: Vectorized loop.
ncc: opt(1418): mmult.c, line 25: Constant-length loop is expanded.
ncc: opt(1590): mmult.c, line 26: Inner loop moved outside outer loop(s).: m_inner
ncc: opt(1772): mmult.c, line 26: Loop nest fused with following nest(s).: m_inner
ncc: opt(1395): mmult.c, line 27: Inner loop stripped and strip loop moved outside outer loop.: n_inner
ncc: vec( 101): mmult.c, line 27: Vectorized loop.
ncc: vec( 103): mmult_main.c, line 9: Unvectorized loop.
no crash \(^^)/

real	0m3.621s
user	0m0.030s
sys	0m0.054s

$ ncc -O4 -DMMULT=mmult_tiled  mmult.c mmult_main.c; time ./a.out 
ncc: opt(1589): mmult.c, line 7: Outer loop moved inside inner loop(s).: n
ncc: vec( 101): mmult.c, line 7: Vectorized loop.
ncc: opt(1592): mmult.c, line 9: Outer loop unrolled inside inner loop.: k
ncc: vec( 101): mmult.c, line 9: Vectorized loop.
ncc: vec( 101): mmult.c, line 20: Vectorized loop.
ncc: opt(1418): mmult.c, line 25: Constant-length loop is expanded.
ncc: opt(1590): mmult.c, line 26: Inner loop moved outside outer loop(s).: m_inner
ncc: opt(1772): mmult.c, line 26: Loop nest fused with following nest(s).: m_inner
ncc: opt(1395): mmult.c, line 27: Inner loop stripped and strip loop moved outside outer loop.: n_inner
ncc: vec( 101): mmult.c, line 27: Vectorized loop.
ncc: vec( 103): mmult_main.c, line 9: Unvectorized loop.
no crash \(^^)/

real	0m13.584s
user	0m0.042s
sys	0m0.042s

We can see from the above that NCC is also suffering from the same problem! I can only conclude that it is not possible to get compilers for VE to produce efficient code for tiled algorithms like this one, but I am probably just missing something. How can we achieve this without resorting to intrinsics, as is possible with both GCC and LLVM on the x86 backend?
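
For what it's worth, here is a rough, untested sketch of the kind of restructuring I have in mind: keep the contiguous n loop innermost and long so it maps to full 256-element vector operations, and tile only over k so the corresponding rows of B get reused while they are still in cache. The block size of 64 is an arbitrary guess, not something I have tuned:

#include <stdint.h>

/* Hypothetical VE-friendly variant of the 1024x1024 matrix multiply above:
 * "ikj" loop order with k tiled, so the innermost loop is a contiguous
 * 1024-element SAXPY over a row of C that a vectorizer can handle directly. */
int32_t mmult_ktiled(float *A, float *B, float *C) {
  for (int32_t m = 0; m < 1024; m++)
    for (int32_t n = 0; n < 1024; n++)
      C[(m*1024) + n] = 0;
  for (int32_t k_outer = 0; k_outer < 1024; k_outer += 64) {
    for (int32_t m = 0; m < 1024; m++) {
      for (int32_t k = k_outer; k < k_outer + 64; k++) {
        float a = A[(m*1024) + k];
        #pragma omp simd
        for (int32_t n = 0; n < 1024; n++) {
          C[(m*1024) + n] += a * B[(k*1024) + n];
        }
      }
    }
  }
  return 0;
}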

@simoll
Copy link
Contributor

simoll commented Feb 1, 2022

Thanks for the tip @simoll! I was able to increase the size of almost all inner loops to at least 256 by increasing the default split factor to 256, see commit saudet/tvm@0694ae3. Anything larger than that like 512 makes RV crash LLVM, regardless of the value of RV_FORCE_WIDTH, so I'm guessing that even if RV did support values larger than 256, it wouldn't be optimal. Is my assumption correct? Anyway, with a factor 256, the total execution time drops another notch to ~508 ms for a batch size of 32, and TVM now says that ~70% of that is spent in GEMM calls to NLC, so it's almost time to start thinking about getting rid of NLC...

Setting the vector width > 256 shouldn't crash RV nor LLVM. If anything, you should be getting packed mode instructions. Not expecting too much benefit for packed mode here though because there is no strided packed-mode vector load in the VE ISA.

I do expect further improvements for LLVM-VE-RV with vector width == 256 and once we support strided VLD. This should become available on the hpce/develop branch in the near future.

@saudet
Copy link
Author

saudet commented Feb 2, 2022

Setting the vector width > 256 shouldn't crash RV nor LLVM. If anything, you should be getting packed mode instructions. Not expecting too much benefit for packed mode here though because there is no strided packed-mode vector load in the VE ISA.

Ok, so I guess there's some bug in that packed mode. It's crashing in the same place in LegalizeVectorTypes.cpp that I described above in #24 (comment), but this only happens when AVL is enabled.

I do expect further improvements for LLVM-VE-RV with vector width == 256 and once we support strided VLD. This should become available on the hpce/develop branch in the near future.

Ok, sounds good, although right now I'm worried about GEMM. According to engineers at NEC, there is absolutely no way to make GEMM any faster than what is already implemented in NLC. The only other thing I can think of to try at the moment is running TVM itself sequentially, thus giving NLC (and RV) more room for vectorization, while launching 8 independent threads like that. I measured an increase in overall throughput of about 1.5x doing it that way, but that's about it. Unless someone can point me at something else to look at, it sounds like we're pretty close to the limit of what the hardware is capable of. The conclusion seems to be that we can achieve higher throughput with cheaper Xeon CPUs that consume less power than Aurora!

@saudet
Copy link
Author

saudet commented Feb 8, 2022

Important correction, the figure of ~403 ms mentioned above is actually for 2 sockets of Xeon CPUs (~1 TFLOP each). I hadn't set the OpenMP thread affinity properly. When correctly limited to a single CPU, it's around ~722 ms. So, in other words, a throughput of ~44 sequences per second. In the case of a single VE, we can get about 1.5 * 32 / 0.508 = 94 sequences per second, so we can say that VE is ~2x faster than CPU...

@saudet
Copy link
Author

saudet commented Feb 15, 2022

So, from what I understood after a lengthy exchange with engineers at NEC, to obtain maximum performance on VE, data needs to be 128-byte aligned but not 512-byte aligned, or we get bank conflicts. The details are still a bit fuzzy to me (it is very hard to get any information out of NEC), but needless to say, those are very unusual alignment requirements. On any other architecture I am aware of that needs something like 128-byte alignment, 512-byte alignment also satisfies that requirement, so unsurprisingly a setting to avoid specific alignments isn't currently anywhere to be found in TVM. With enough resources, it should be possible to add support for this, and according to the microbenchmarking we've been doing, that should give us an overall ~1.5x boost in performance. I'm sure we would then capture the remaining "~0.5x" from future improvements in RV, getting us pretty close to 4 TFLOPS. However, it looks like this is going to require a sizeable investment to get us there. @mikishin
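
For reference, here is a minimal sketch of what allocating such buffers could look like, assuming the "128-byte aligned but not 512-byte aligned" requirement above is accurate; the helper name and the over-allocate-then-offset trick are my own, not anything TVM or NLC provides:

#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: return a pointer that is 128-byte aligned but
 * deliberately NOT 512-byte aligned, by aligning to 512 bytes and then
 * stepping forward by 128 bytes. *base_out must be passed to free(). */
static void *ve_alloc_offset(size_t size, void **base_out) {
  void *base = NULL;
  if (posix_memalign(&base, 512, size + 128) != 0)
    return NULL;
  *base_out = base;
  return (void *)((uintptr_t)base + 128);
}

A real TVM integration would of course have to do this inside the device allocator rather than in user code, but it gives an idea of how small the change is at the allocation site.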

kaz7 pushed a commit that referenced this issue Nov 11, 2022
Found by msan -fsanitize-memory-use-after-dtor.

==8259==WARNING: MemorySanitizer: use-of-uninitialized-value
    #0 0x55dbec54d2b8 in dtorRecord(clang::interp::Block*, char*, clang::interp::Descriptor*) clang/lib/AST/Interp/Descriptor.cpp:150:22
    #1 0x55dbec54bfcf in dtorArrayDesc(clang::interp::Block*, char*, clang::interp::Descriptor*) clang/lib/AST/Interp/Descriptor.cpp:97:7
    #2 0x55dbec508578 in invokeDtor clang/lib/AST/Interp/InterpBlock.h:79:7
    #3 0x55dbec508578 in clang::interp::Program::~Program() clang/lib/AST/Interp/Program.h:55:19
    #4 0x55dbec50657a in operator() third_party/crosstool/v18/stable/toolchain/bin/../include/c++/v1/__memory/unique_ptr.h:55:5
    #5 0x55dbec50657a in std::__msan::unique_ptr<clang::interp::Program, std::__msan::default_delete<clang::interp::Program>>::~unique_ptr() third_party/crosstool/v18/stable/toolchain/bin/../include/c++/v1/__memory/unique_ptr.h:261:7
    #6 0x55dbec5035a1 in clang::interp::Context::~Context() clang/lib/AST/Interp/Context.cpp:27:22
    #7 0x55dbebec1daa in operator() third_party/crosstool/v18/stable/toolchain/bin/../include/c++/v1/__memory/unique_ptr.h:55:5
    #8 0x55dbebec1daa in std::__msan::unique_ptr<clang::interp::Context, std::__msan::default_delete<clang::interp::Context>>::~unique_ptr() third_party/crosstool/v18/stable/toolchain/bin/../include/c++/v1/__memory/unique_ptr.h:261:7
    #9 0x55dbebe285f9 in clang::ASTContext::~ASTContext() clang/lib/AST/ASTContext.cpp:1038:40
    #10 0x55dbe941ff13 in llvm::RefCountedBase<clang::ASTContext>::Release() const llvm/include/llvm/ADT/IntrusiveRefCntPtr.h:101:7
    #11 0x55dbe94353ef in release llvm/include/llvm/ADT/IntrusiveRefCntPtr.h:159:38
    #12 0x55dbe94353ef in release llvm/include/llvm/ADT/IntrusiveRefCntPtr.h:224:7
    #13 0x55dbe94353ef in ~IntrusiveRefCntPtr llvm/include/llvm/ADT/IntrusiveRefCntPtr.h:191:27
    #14 0x55dbe94353ef in clang::CompilerInstance::setASTContext(clang::ASTContext*) clang/lib/Frontend/CompilerInstance.cpp:178:3
    #15 0x55dbe95ad0ad in clang::FrontendAction::EndSourceFile() clang/lib/Frontend/FrontendAction.cpp:1100:8
    #16 0x55dbe9445fcf in clang::CompilerInstance::ExecuteAction(clang::FrontendAction&) clang/lib/Frontend/CompilerInstance.cpp:1047:11
    #17 0x55dbe6b3afef in clang::ExecuteCompilerInvocation(clang::CompilerInstance*) clang/lib/FrontendTool/ExecuteCompilerInvocation.cpp:266:25
    #18 0x55dbe6b13288 in cc1_main(llvm::ArrayRef<char const*>, char const*, void*) clang/tools/driver/cc1_main.cpp:250:15
    #19 0x55dbe6b0095f in ExecuteCC1Tool(llvm::SmallVectorImpl<char const*>&) clang/tools/driver/driver.cpp:319:12
    #20 0x55dbe6aff41c in clang_main(int, char**) clang/tools/driver/driver.cpp:395:12
    #21 0x7f9be07fa632 in __libc_start_main
    #22 0x55dbe6a702e9 in _start

  Member fields were destroyed
    #0 0x55dbe6a7da5d in __sanitizer_dtor_callback_fields compiler-rt/lib/msan/msan_interceptors.cpp:949:5
    #1 0x55dbec5094ac in ~SmallVectorImpl llvm/include/llvm/ADT/SmallVector.h:479:7
    #2 0x55dbec5094ac in ~SmallVectorImpl llvm/include/llvm/ADT/SmallVector.h:612:3
    #3 0x55dbec5094ac in llvm::SmallVector<clang::interp::Record::Base, 8u>::~SmallVector() llvm/include/llvm/ADT/SmallVector.h:1207:3
    #4 0x55dbec508e79 in clang::interp::Record::~Record() clang/lib/AST/Interp/Record.h:24:7
    #5 0x55dbec508612 in clang::interp::Program::~Program() clang/lib/AST/Interp/Program.h:49:26
    #6 0x55dbec50657a in operator() third_party/crosstool/v18/stable/toolchain/bin/../include/c++/v1/__memory/unique_ptr.h:55:5
    #7 0x55dbec50657a in std::__msan::unique_ptr<clang::interp::Program, std::__msan::default_delete<clang::interp::Program>>::~unique_ptr() third_party/crosstool/v18/stable/toolchain/bin/../include/c++/v1/__memory/unique_ptr.h:261:7
    #8 0x55dbec5035a1 in clang::interp::Context::~Context() clang/lib/AST/Interp/Context.cpp:27:22
    #9 0x55dbebec1daa in operator() third_party/crosstool/v18/stable/toolchain/bin/../include/c++/v1/__memory/unique_ptr.h:55:5
    #10 0x55dbebec1daa in std::__msan::unique_ptr<clang::interp::Context, std::__msan::default_delete<clang::interp::Context>>::~unique_ptr() third_party/crosstool/v18/stable/toolchain/bin/../include/c++/v1/__memory/unique_ptr.h:261:7
    #11 0x55dbebe285f9 in clang::ASTContext::~ASTContext() clang/lib/AST/ASTContext.cpp:1038:40
    #12 0x55dbe941ff13 in llvm::RefCountedBase<clang::ASTContext>::Release() const llvm/include/llvm/ADT/IntrusiveRefCntPtr.h:101:7
    #13 0x55dbe94353ef in release llvm/include/llvm/ADT/IntrusiveRefCntPtr.h:159:38
    #14 0x55dbe94353ef in release llvm/include/llvm/ADT/IntrusiveRefCntPtr.h:224:7
    #15 0x55dbe94353ef in ~IntrusiveRefCntPtr llvm/include/llvm/ADT/IntrusiveRefCntPtr.h:191:27
    #16 0x55dbe94353ef in clang::CompilerInstance::setASTContext(clang::ASTContext*) clang/lib/Frontend/CompilerInstance.cpp:178:3
    #17 0x55dbe95ad0ad in clang::FrontendAction::EndSourceFile() clang/lib/Frontend/FrontendAction.cpp:1100:8
    #18 0x55dbe9445fcf in clang::CompilerInstance::ExecuteAction(clang::FrontendAction&) clang/lib/Frontend/CompilerInstance.cpp:1047:11
    #19 0x55dbe6b3afef in clang::ExecuteCompilerInvocation(clang::CompilerInstance*) clang/lib/FrontendTool/ExecuteCompilerInvocation.cpp:266:25
    #20 0x55dbe6b13288 in cc1_main(llvm::ArrayRef<char const*>, char const*, void*) clang/tools/driver/cc1_main.cpp:250:15
    #21 0x55dbe6b0095f in ExecuteCC1Tool(llvm::SmallVectorImpl<char const*>&) clang/tools/driver/driver.cpp:319:12
    #22 0x55dbe6aff41c in clang_main(int, char**) clang/tools/driver/driver.cpp:395:12
    #23 0x7f9be07fa632 in __libc_start_main
    #24 0x55dbe6a702e9 in _start
kaz7 pushed a commit that referenced this issue Mar 3, 2024
…(#80904)"

This reverts commit b1ac052.

This commit breaks coroutine splitting for non-swift calling convention
functions. In this example:

```ll
; ModuleID = 'repro.ll'
source_filename = "stdlib/test/runtime/test_llcl.mojo"
target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-i128:128-f80:128-n8:16:32:64-S128"
target triple = "x86_64-unknown-linux-gnu"

@0 = internal constant { i32, i32 } { i32 trunc (i64 sub (i64 ptrtoint (ptr @craSH to i64), i64 ptrtoint (ptr getelementptr inbounds ({ i32, i32 }, ptr @0, i32 0, i32 1) to i64)) to i32), i32 64 }

define dso_local void @af_suspend_fn(ptr %0, i64 %1, ptr %2) #0 {
  ret void
}

define dso_local void @craSH(ptr %0) #0 {
  %2 = call token @llvm.coro.id.async(i32 64, i32 8, i32 0, ptr @0)
  %3 = call ptr @llvm.coro.begin(token %2, ptr null)
  %4 = getelementptr inbounds { ptr, { ptr, ptr }, i64, { ptr, i1 }, i64, i64 }, ptr poison, i32 0, i32 0
  %5 = call ptr @llvm.coro.async.resume()
  store ptr %5, ptr %4, align 8
  %6 = call { ptr, ptr, ptr } (i32, ptr, ptr, ...) @llvm.coro.suspend.async.sl_p0p0p0s(i32 0, ptr %5, ptr @ctxt_proj_fn, ptr @af_suspend_fn, ptr poison, i64 -1, ptr poison)
  ret void
}

define dso_local ptr @ctxt_proj_fn(ptr %0) #0 {
  ret ptr %0
}

; Function Attrs: nomerge nounwind
declare { ptr, ptr, ptr } @llvm.coro.suspend.async.sl_p0p0p0s(i32, ptr, ptr, ...) #1

; Function Attrs: nounwind
declare token @llvm.coro.id.async(i32, i32, i32, ptr) #2

; Function Attrs: nounwind
declare ptr @llvm.coro.begin(token, ptr writeonly) #2

; Function Attrs: nomerge nounwind
declare ptr @llvm.coro.async.resume() #1

attributes #0 = { "target-features"="+adx,+aes,+avx,+avx2,+bmi,+bmi2,+clflushopt,+clwb,+clzero,+crc32,+cx16,+cx8,+f16c,+fma,+fsgsbase,+fxsr,+invpcid,+lzcnt,+mmx,+movbe,+mwaitx,+pclmul,+pku,+popcnt,+prfchw,+rdpid,+rdpru,+rdrnd,+rdseed,+sahf,+sha,+sse,+sse2,+sse3,+sse4.1,+sse4.2,+sse4a,+ssse3,+vaes,+vpclmulqdq,+wbnoinvd,+x87,+xsave,+xsavec,+xsaveopt,+xsaves" }
attributes #1 = { nomerge nounwind }
attributes #2 = { nounwind }
```

This verifier crashes after the `coro-split` pass with

```
cannot guarantee tail call due to mismatched parameter counts
  musttail call void @af_suspend_fn(ptr poison, i64 -1, ptr poison)
LLVM ERROR: Broken function
PLEASE submit a bug report to https://github.com/llvm/llvm-project/issues/ and include the crash backtrace.
Stack dump:
0.      Program arguments: opt ../../../reduced.ll -O0
 #0 0x00007f1d89645c0e __interceptor_backtrace.part.0 /build/gcc-11-XeT9lY/gcc-11-11.4.0/build/x86_64-linux-gnu/libsanitizer/asan/../../../../src/libsanitizer/sanitizer_common/sanitizer_common_interceptors.inc:4193:28
 #1 0x0000556d94d254f7 llvm::sys::PrintStackTrace(llvm::raw_ostream&, int) /home/ubuntu/modular/third-party/llvm-project/llvm/lib/Support/Unix/Signals.inc:723:22
 #2 0x0000556d94d19a2f llvm::sys::RunSignalHandlers() /home/ubuntu/modular/third-party/llvm-project/llvm/lib/Support/Signals.cpp:105:20
 #3 0x0000556d94d1aa42 SignalHandler(int) /home/ubuntu/modular/third-party/llvm-project/llvm/lib/Support/Unix/Signals.inc:371:36
 #4 0x00007f1d88e42520 (/lib/x86_64-linux-gnu/libc.so.6+0x42520)
 #5 0x00007f1d88e969fc __pthread_kill_implementation ./nptl/pthread_kill.c:44:76
 #6 0x00007f1d88e969fc __pthread_kill_internal ./nptl/pthread_kill.c:78:10
 #7 0x00007f1d88e969fc pthread_kill ./nptl/pthread_kill.c:89:10
 #8 0x00007f1d88e42476 gsignal ./signal/../sysdeps/posix/raise.c:27:6
 #9 0x00007f1d88e287f3 abort ./stdlib/abort.c:81:7
 #10 0x0000556d8944be01 std::vector<llvm::json::Value, std::allocator<llvm::json::Value>>::size() const /usr/include/c++/11/bits/stl_vector.h:919:40
 #11 0x0000556d8944be01 bool std::operator==<llvm::json::Value, std::allocator<llvm::json::Value>>(std::vector<llvm::json::Value, std::allocator<llvm::json::Value>> const&, std::vector<llvm::json::Value, std::allocator<llvm::json::Value>> const&) /usr/include/c++/11/bits/stl_vector.h:1893:23
 #12 0x0000556d8944be01 llvm::json::operator==(llvm::json::Array const&, llvm::json::Array const&) /home/ubuntu/modular/third-party/llvm-project/llvm/include/llvm/Support/JSON.h:572:69
 #13 0x0000556d8944be01 llvm::json::operator==(llvm::json::Value const&, llvm::json::Value const&) (.cold) /home/ubuntu/modular/third-party/llvm-project/llvm/lib/Support/JSON.cpp:204:28
 #14 0x0000556d949ed2bd llvm::report_fatal_error(char const*, bool) /home/ubuntu/modular/third-party/llvm-project/llvm/lib/Support/ErrorHandling.cpp:82:70
 #15 0x0000556d8e37e876 llvm::SmallVectorBase<unsigned int>::size() const /home/ubuntu/modular/third-party/llvm-project/llvm/include/llvm/ADT/SmallVector.h:91:32
 #16 0x0000556d8e37e876 llvm::SmallVectorTemplateCommon<llvm::DiagnosticInfoOptimizationBase::Argument, void>::end() /home/ubuntu/modular/third-party/llvm-project/llvm/include/llvm/ADT/SmallVector.h:282:41
 #17 0x0000556d8e37e876 llvm::SmallVector<llvm::DiagnosticInfoOptimizationBase::Argument, 4u>::~SmallVector() /home/ubuntu/modular/third-party/llvm-project/llvm/include/llvm/ADT/SmallVector.h:1215:24
 #18 0x0000556d8e37e876 llvm::DiagnosticInfoOptimizationBase::~DiagnosticInfoOptimizationBase() /home/ubuntu/modular/third-party/llvm-project/llvm/include/llvm/IR/DiagnosticInfo.h:413:7
 #19 0x0000556d8e37e876 llvm::DiagnosticInfoIROptimization::~DiagnosticInfoIROptimization() /home/ubuntu/modular/third-party/llvm-project/llvm/include/llvm/IR/DiagnosticInfo.h:622:7
 #20 0x0000556d8e37e876 llvm::OptimizationRemark::~OptimizationRemark() /home/ubuntu/modular/third-party/llvm-project/llvm/include/llvm/IR/DiagnosticInfo.h:689:7
 #21 0x0000556d8e37e876 operator() /home/ubuntu/modular/third-party/llvm-project/llvm/lib/Transforms/Coroutines/CoroSplit.cpp:2213:14
 #22 0x0000556d8e37e876 emit<llvm::CoroSplitPass::run(llvm::LazyCallGraph::SCC&, llvm::CGSCCAnalysisManager&, llvm::LazyCallGraph&, llvm::CGSCCUpdateResult&)::<lambda()> > /home/ubuntu/modular/third-party/llvm-project/llvm/include/llvm/Analysis/OptimizationRemarkEmitter.h:83:12
 #23 0x0000556d8e37e876 llvm::CoroSplitPass::run(llvm::LazyCallGraph::SCC&, llvm::AnalysisManager<llvm::LazyCallGraph::SCC, llvm::LazyCallGraph&>&, llvm::LazyCallGraph&, llvm::CGSCCUpdateResult&) /home/ubuntu/modular/third-party/llvm-project/llvm/lib/Transforms/Coroutines/CoroSplit.cpp:2212:13
 #24 0x0000556d8c36ecb1 llvm::detail::PassModel<llvm::LazyCallGraph::SCC, llvm::CoroSplitPass, llvm::AnalysisManager<llvm::LazyCallGraph::SCC, llvm::LazyCallGraph&>, llvm::LazyCallGraph&, llvm::CGSCCUpdateResult&>::run(llvm::LazyCallGraph::SCC&, llvm::AnalysisManager<llvm::LazyCallGraph::SCC, llvm::LazyCallGraph&>&, llvm::LazyCallGraph&, llvm::CGSCCUpdateResult&) /home/ubuntu/modular/third-party/llvm-project/llvm/include/llvm/IR/PassManagerInternal.h:91:3
 #25 0x0000556d91c1a84f llvm::PassManager<llvm::LazyCallGraph::SCC, llvm::AnalysisManager<llvm::LazyCallGraph::SCC, llvm::LazyCallGraph&>, llvm::LazyCallGraph&, llvm::CGSCCUpdateResult&>::run(llvm::LazyCallGraph::SCC&, llvm::AnalysisManager<llvm::LazyCallGraph::SCC, llvm::LazyCallGraph&>&, llvm::LazyCallGraph&, llvm::CGSCCUpdateResult&) /home/ubuntu/modular/third-party/llvm-project/llvm/lib/Analysis/CGSCCPassManager.cpp:90:12
 #26 0x0000556d8c3690d1 llvm::detail::PassModel<llvm::LazyCallGraph::SCC, llvm::PassManager<llvm::LazyCallGraph::SCC, llvm::AnalysisManager<llvm::LazyCallGraph::SCC, llvm::LazyCallGraph&>, llvm::LazyCallGraph&, llvm::CGSCCUpdateResult&>, llvm::AnalysisManager<llvm::LazyCallGraph::SCC, llvm::LazyCallGraph&>, llvm::LazyCallGraph&, llvm::CGSCCUpdateResult&>::run(llvm::LazyCallGraph::SCC&, llvm::AnalysisManager<llvm::LazyCallGraph::SCC, llvm::LazyCallGraph&>&, llvm::LazyCallGraph&, llvm::CGSCCUpdateResult&) /home/ubuntu/modular/third-party/llvm-project/llvm/include/llvm/IR/PassManagerInternal.h:91:3
 #27 0x0000556d91c2162d llvm::ModuleToPostOrderCGSCCPassAdaptor::run(llvm::Module&, llvm::AnalysisManager<llvm::Module>&) /home/ubuntu/modular/third-party/llvm-project/llvm/lib/Analysis/CGSCCPassManager.cpp:278:18
 #28 0x0000556d8c369035 llvm::detail::PassModel<llvm::Module, llvm::ModuleToPostOrderCGSCCPassAdaptor, llvm::AnalysisManager<llvm::Module>>::run(llvm::Module&, llvm::AnalysisManager<llvm::Module>&) /home/ubuntu/modular/third-party/llvm-project/llvm/include/llvm/IR/PassManagerInternal.h:91:3
 #29 0x0000556d9457abc5 llvm::PassManager<llvm::Module, llvm::AnalysisManager<llvm::Module>>::run(llvm::Module&, llvm::AnalysisManager<llvm::Module>&) /home/ubuntu/modular/third-party/llvm-project/llvm/include/llvm/IR/PassManager.h:247:20
 #30 0x0000556d8e30979e llvm::CoroConditionalWrapper::run(llvm::Module&, llvm::AnalysisManager<llvm::Module>&) /home/ubuntu/modular/third-party/llvm-project/llvm/lib/Transforms/Coroutines/CoroConditionalWrapper.cpp:19:74
 #31 0x0000556d8c365755 llvm::detail::PassModel<llvm::Module, llvm::CoroConditionalWrapper, llvm::AnalysisManager<llvm::Module>>::run(llvm::Module&, llvm::AnalysisManager<llvm::Module>&) /home/ubuntu/modular/third-party/llvm-project/llvm/include/llvm/IR/PassManagerInternal.h:91:3
 #32 0x0000556d9457abc5 llvm::PassManager<llvm::Module, llvm::AnalysisManager<llvm::Module>>::run(llvm::Module&, llvm::AnalysisManager<llvm::Module>&) /home/ubuntu/modular/third-party/llvm-project/llvm/include/llvm/IR/PassManager.h:247:20
 #33 0x0000556d89818556 llvm::SmallPtrSetImplBase::isSmall() const /home/ubuntu/modular/third-party/llvm-project/llvm/include/llvm/ADT/SmallPtrSet.h:196:33
 #34 0x0000556d89818556 llvm::SmallPtrSetImplBase::~SmallPtrSetImplBase() /home/ubuntu/modular/third-party/llvm-project/llvm/include/llvm/ADT/SmallPtrSet.h:84:17
 #35 0x0000556d89818556 llvm::SmallPtrSetImpl<llvm::AnalysisKey*>::~SmallPtrSetImpl() /home/ubuntu/modular/third-party/llvm-project/llvm/include/llvm/ADT/SmallPtrSet.h:321:7
 #36 0x0000556d89818556 llvm::SmallPtrSet<llvm::AnalysisKey*, 2u>::~SmallPtrSet() /home/ubuntu/modular/third-party/llvm-project/llvm/include/llvm/ADT/SmallPtrSet.h:427:7
 #37 0x0000556d89818556 llvm::PreservedAnalyses::~PreservedAnalyses() /home/ubuntu/modular/third-party/llvm-project/llvm/include/llvm/IR/Analysis.h:109:7
 #38 0x0000556d89818556 llvm::runPassPipeline(llvm::StringRef, llvm::Module&, llvm::TargetMachine*, llvm::TargetLibraryInfoImpl*, llvm::ToolOutputFile*, llvm::ToolOutputFile*, llvm::ToolOutputFile*, llvm::StringRef, llvm::ArrayRef<llvm::PassPlugin>, llvm::ArrayRef<std::function<void (llvm::PassBuilder&)>>, llvm::opt_tool::OutputKind, llvm::opt_tool::VerifierKind, bool, bool, bool, bool, bool, bool, bool) /home/ubuntu/modular/third-party/llvm-project/llvm/tools/opt/NewPMDriver.cpp:532:10
 #39 0x0000556d897e3939 optMain /home/ubuntu/modular/third-party/llvm-project/llvm/tools/opt/optdriver.cpp:737:27
 #40 0x0000556d89455461 main /home/ubuntu/modular/third-party/llvm-project/llvm/tools/opt/opt.cpp:25:33
 #41 0x00007f1d88e29d90 __libc_start_call_main ./csu/../sysdeps/nptl/libc_start_call_main.h:58:16
 #42 0x00007f1d88e29e40 call_init ./csu/../csu/libc-start.c:128:20
 #43 0x00007f1d88e29e40 __libc_start_main ./csu/../csu/libc-start.c:379:5
 #44 0x0000556d897b6335 _start (/home/ubuntu/modular/.derived/third-party/llvm-project/build-relwithdebinfo-asan/bin/opt+0x150c335)
Aborted (core dumped)
kaz7 pushed a commit that referenced this issue Mar 16, 2024
TestCases/Misc/Linux/sigaction.cpp fails because dlsym() may call malloc
on failure. The wrapped malloc then appears to access thread-local
storage through global-dynamic accesses, calling
___interceptor___tls_get_addr before REAL(__tls_get_addr) has been set,
so we crash inside ___interceptor___tls_get_addr. For example, this can
happen when looking up __isoc23_scanf, which might not exist in some
libcs.

Fix this by marking the thread-local variable accessed inside the
debug checks with the "initial-exec" TLS model, which does not require
__tls_get_addr.
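For reference, a minimal sketch of the idea, assuming GCC/Clang's
`tls_model` attribute; the variable and function names below are
illustrative and are not the ones used in the actual compiler-rt change:

```
// Hypothetical sketch of an "initial-exec" thread-local variable.
// With this TLS model the access compiles to a fixed offset from the
// thread pointer, so reading or writing it never calls __tls_get_addr.
static __thread int checked_lock_depth
    __attribute__((tls_model("initial-exec"))) = 0;

int EnterCheckedLock() {
  // Safe even before REAL(__tls_get_addr) has been initialized.
  return ++checked_lock_depth;
}
```

The trade-off is that initial-exec TLS must be resolvable at load time,
so it is generally only appropriate for variables defined in libraries
present at program startup (as the sanitizer runtimes are), not in
objects loaded later via dlopen().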

This is probably a better alternative to llvm/llvm-project#83886.

This fixes a different crash but is related to llvm/llvm-project#46204.

Backtrace:
```
#0 0x0000000000000000 in ?? ()
#1 0x00007ffff6a9d89e in ___interceptor___tls_get_addr (arg=0x7ffff6b27be8) at /path/to/llvm/compiler-rt/lib/tsan/rtl/tsan_interceptors_posix.cpp:2759
#2 0x00007ffff6a46bc6 in __sanitizer::CheckedMutex::LockImpl (this=0x7ffff6b27be8, pc=140737331846066) at /path/to/llvm/compiler-rt/lib/sanitizer_common/sanitizer_mutex.cpp:218
#3 0x00007ffff6a448b2 in __sanitizer::CheckedMutex::Lock (this=0x7ffff6b27be8, this@entry=0x730000000580) at /path/to/llvm/compiler-rt/lib/tsan/rtl/../../sanitizer_common/sanitizer_mutex.h:129
#4 __sanitizer::Mutex::Lock (this=0x7ffff6b27be8, this@entry=0x730000000580) at /path/to/llvm/compiler-rt/lib/tsan/rtl/../../sanitizer_common/sanitizer_mutex.h:167
#5 0x00007ffff6abdbb2 in __sanitizer::GenericScopedLock<__sanitizer::Mutex>::GenericScopedLock (mu=0x730000000580, this=<optimized out>) at /path/to/llvm/compiler-rt/lib/tsan/rtl/../../sanitizer_common/sanitizer_mutex.h:383
#6 __sanitizer::SizeClassAllocator64<__tsan::AP64>::GetFromAllocator (this=0x7ffff7487dc0 <__tsan::allocator_placeholder>, stat=stat@entry=0x7ffff570db68, class_id=11, chunks=chunks@entry=0x7ffff5702cc8, n_chunks=n_chunks@entry=128) at /path/to/llvm/compiler-rt/lib/tsan/rtl/../../sanitizer_common/sanitizer_allocator_primary64.h:207
#7 0x00007ffff6abdaa0 in __sanitizer::SizeClassAllocator64LocalCache<__sanitizer::SizeClassAllocator64<__tsan::AP64> >::Refill (this=<optimized out>, c=c@entry=0x7ffff5702cb8, allocator=<optimized out>, class_id=<optimized out>)
 at /path/to/llvm/compiler-rt/lib/tsan/rtl/../../sanitizer_common/sanitizer_allocator_local_cache.h:103
#8 0x00007ffff6abd731 in __sanitizer::SizeClassAllocator64LocalCache<__sanitizer::SizeClassAllocator64<__tsan::AP64> >::Allocate (this=0x7ffff6b27be8, allocator=0x7ffff5702cc8, class_id=140737311157448)
 at /path/to/llvm/compiler-rt/lib/tsan/rtl/../../sanitizer_common/sanitizer_allocator_local_cache.h:39
#9 0x00007ffff6abc397 in __sanitizer::CombinedAllocator<__sanitizer::SizeClassAllocator64<__tsan::AP64>, __sanitizer::LargeMmapAllocatorPtrArrayDynamic>::Allocate (this=0x7ffff5702cc8, cache=0x7ffff6b27be8, size=<optimized out>, size@entry=175, alignment=alignment@entry=16)
 at /path/to/llvm/compiler-rt/lib/tsan/rtl/../../sanitizer_common/sanitizer_allocator_combined.h:69
#10 0x00007ffff6abaa6a in __tsan::user_alloc_internal (thr=0x7ffff7ebd980, pc=140737331499943, sz=sz@entry=175, align=align@entry=16, signal=true) at /path/to/llvm/compiler-rt/lib/tsan/rtl/tsan_mman.cpp:198
#11 0x00007ffff6abb0d1 in __tsan::user_alloc (thr=0x7ffff6b27be8, pc=140737331846066, sz=11, sz@entry=175) at /path/to/llvm/compiler-rt/lib/tsan/rtl/tsan_mman.cpp:223
#12 0x00007ffff6a693b5 in ___interceptor_malloc (size=175) at /path/to/llvm/compiler-rt/lib/tsan/rtl/tsan_interceptors_posix.cpp:666
#13 0x00007ffff7fce7f2 in malloc (size=175) at ../include/rtld-malloc.h:56
#14 __GI__dl_exception_create_format (exception=exception@entry=0x7fffffffd0d0, objname=0x7ffff7fc3550 "/path/to/llvm/compiler-rt/cmake-build-all-sanitizers/lib/linux/libclang_rt.tsan-x86_64.so",
 fmt=fmt@entry=0x7ffff7ff2db9 "undefined symbol: %s%s%s") at ./elf/dl-exception.c:157
#15 0x00007ffff7fd50e8 in _dl_lookup_symbol_x (undef_name=0x7ffff6af868b "__isoc23_scanf", undef_map=<optimized out>, ref=0x7fffffffd148, symbol_scope=<optimized out>, version=<optimized out>, type_class=0, flags=2, skip_map=0x7ffff7fc35e0) at ./elf/dl-lookup.c:793
#16 0x00007ffff656d6ed in do_sym (handle=<optimized out>, name=0x7ffff6af868b "__isoc23_scanf", who=0x7ffff6a3bb84 <__interception::InterceptFunction(char const*, unsigned long*, unsigned long, unsigned long)+36>, vers=vers@entry=0x0, flags=flags@entry=2) at ./elf/dl-sym.c:146
#17 0x00007ffff656d9dd in _dl_sym (handle=<optimized out>, name=<optimized out>, who=<optimized out>) at ./elf/dl-sym.c:195
#18 0x00007ffff64a2854 in dlsym_doit (a=a@entry=0x7fffffffd3b0) at ./dlfcn/dlsym.c:40
#19 0x00007ffff7fcc489 in __GI__dl_catch_exception (exception=exception@entry=0x7fffffffd310, operate=0x7ffff64a2840 <dlsym_doit>, args=0x7fffffffd3b0) at ./elf/dl-catch.c:237
#20 0x00007ffff7fcc5af in _dl_catch_error (objname=0x7fffffffd368, errstring=0x7fffffffd370, mallocedp=0x7fffffffd367, operate=<optimized out>, args=<optimized out>) at ./elf/dl-catch.c:256
#21 0x00007ffff64a2257 in _dlerror_run (operate=operate@entry=0x7ffff64a2840 <dlsym_doit>, args=args@entry=0x7fffffffd3b0) at ./dlfcn/dlerror.c:138
#22 0x00007ffff64a28e5 in dlsym_implementation (dl_caller=<optimized out>, name=<optimized out>, handle=<optimized out>) at ./dlfcn/dlsym.c:54
#23 ___dlsym (handle=<optimized out>, name=<optimized out>) at ./dlfcn/dlsym.c:68
#24 0x00007ffff6a3bb84 in __interception::GetFuncAddr (name=0x7ffff6af868b "__isoc23_scanf", trampoline=140737311157448) at /path/to/llvm/compiler-rt/lib/interception/interception_linux.cpp:42
#25 __interception::InterceptFunction (name=0x7ffff6af868b "__isoc23_scanf", ptr_to_real=0x7ffff74850e8 <__interception::real___isoc23_scanf>, func=11, trampoline=140737311157448)
 at /path/to/llvm/compiler-rt/lib/interception/interception_linux.cpp:61
#26 0x00007ffff6a9f2d9 in InitializeCommonInterceptors () at /path/to/llvm/compiler-rt/lib/tsan/rtl/../../sanitizer_common/sanitizer_common_interceptors.inc:10315
```

Reviewed By: vitalybuka, MaskRay

Pull Request: llvm/llvm-project#83890