[Ansor][FLAKY] Bug fix for compute at mutation error #6557

Merged (1 commit) on Sep 25, 2020

@jcf94 jcf94 commented Sep 25, 2020

Bug fix for #6548.

From the error log:

E     tvm._ffi.base.TVMError: Traceback (most recent call last):
E     [bt] (7) /workspace/build/libtvm.so(TVMFuncCall+0x65) [0x7f2cdb26bfb5]
E     [bt] (6) /workspace/build/libtvm.so(+0x4e4dcf) [0x7f2cda602dcf]
E     [bt] (5) /workspace/build/libtvm.so(tvm::auto_scheduler::AutoSchedule(tvm::auto_scheduler::SearchPolicy, tvm::auto_scheduler::TuningOptions)+0x116) [0x7f2cda6021a6]
E     [bt] (4) /workspace/build/libtvm.so(tvm::auto_scheduler::SketchPolicyNode::Search(int, int, int, tvm::auto_scheduler::ProgramMeasurer)+0x214) [0x7f2cda698f64]
E     [bt] (3) /workspace/build/libtvm.so(tvm::auto_scheduler::SketchPolicyNode::SearchOneRound(int, tvm::runtime::Array<tvm::auto_scheduler::State, void>*)+0x19f) [0x7f2cda6987ff]
E     [bt] (2) /workspace/build/libtvm.so(tvm::auto_scheduler::SketchPolicyNode::SampleInitPopulation(tvm::runtime::Array<tvm::auto_scheduler::State, void> const&, int)+0x1fb) [0x7f2cda69395b]
E     [bt] (1) /workspace/build/libtvm.so(tvm::support::parallel_for(int, int, std::function<void (int)> const&, int, std::function<std::vector<std::vector<int, std::allocator<int> >, std::allocator<std::vector<int, std::allocator<int> > > > (int, int, int, int)>)+0x11e8) [0x7f2cdac2b9f8]
E     [bt] (0) /workspace/build/libtvm.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x82) [0x7f2cda606ac2]
E     [bt] (8) /workspace/build/libtvm.so(+0x5756da) [0x7f2cda6936da]
E     [bt] (7) /workspace/build/libtvm.so(tvm::auto_scheduler::InitChangeComputeLocation::Apply(tvm::auto_scheduler::SketchPolicyNode*, tvm::auto_scheduler::State*, std::mersenne_twister_engine<unsigned long, 32ul, 624ul, 397ul, 31ul, 2567483615ul, 11ul, 4294967295ul, 7ul, 2636928640ul, 15ul, 4022730752ul, 18ul, 1812433253ul>*) const+0x22d) [0x7f2cda6a1ccd]
E     [bt] (6) /workspace/build/libtvm.so(tvm::auto_scheduler::ComputeDAG::InferBound(tvm::auto_scheduler::State const&) const+0x253) [0x7f2cda61c783]
E     [bt] (5) /workspace/build/libtvm.so(tvm::auto_scheduler::ComputeDAG::ApplySteps(tvm::runtime::Array<tvm::auto_scheduler::Step, void> const&, tvm::runtime::Array<tvm::te::Stage, void>*, tvm::Map<tvm::te::Stage, tvm::runtime::Array<tvm::tir::IterVar, void>, tvm::runtime::ObjectHash, tvm::runtime::ObjectEqual>*, bool) const+0x5e5) [0x7f2cda61c265]
E     [bt] (4) /workspace/build/libtvm.so(tvm::auto_scheduler::StepApplyToSchedule(tvm::auto_scheduler::Step const&, tvm::runtime::Array<tvm::te::Stage, void>*, tvm::Map<tvm::te::Stage, tvm::runtime::Array<tvm::tir::IterVar, void>, tvm::runtime::ObjectHash, tvm::runtime::ObjectEqual>*, tvm::te::Schedule*, tvm::runtime::Array<tvm::auto_scheduler::Step, void> const&)+0x220) [0x7f2cda6d7170]
E     [bt] (3) /workspace/build/libtvm.so(tvm::auto_scheduler::SplitStepNode::ApplyToSchedule(tvm::runtime::Array<tvm::te::Stage, void>*, tvm::Map<tvm::te::Stage, tvm::runtime::Array<tvm::tir::IterVar, void>, tvm::runtime::ObjectHash, tvm::runtime::ObjectEqual>*) const+0x39) [0x7f2cda6d0d19]
E     [bt] (2) /workspace/build/libtvm.so(tvm::auto_scheduler::ApplySplitToSchedule(tvm::runtime::Array<tvm::te::Stage, void>*, tvm::Map<tvm::te::Stage, tvm::runtime::Array<tvm::tir::IterVar, void>, tvm::runtime::ObjectHash, tvm::runtime::ObjectEqual>*, int, int, tvm::runtime::Array<tvm::runtime::Optional<tvm::Integer>, void> const&, bool)+0xa6) [0x7f2cda6d0576]
E     [bt] (1) /workspace/build/libtvm.so(tvm::runtime::Array<tvm::tir::IterVar, void>::operator[](long) const+0xb6) [0x7f2cda626616]
E     [bt] (0) /workspace/build/libtvm.so(+0x4ef2c2) [0x7f2cda60d2c2]
E     File "/workspace/src/support/parallel_for.cc", line 92
E   TVMError: Parallel_for error with [09:24:10] /workspace/include/tvm/runtime/container.h:683: Check failed: 0 <= i && i < p->size_: IndexError: indexing 4 on an array of size 4

We can see that the test failure was caused by an error in InferBound. @merrymercy

It seems this bug was introduced by #6512, but I'm not sure which part of this mutation rule now produces a wrong result.
Strangely, the bug is not always reproducible: it occurs with a very low probability (which may be caused by multithreading?). There may still be some conditions that our random generator design fails to handle.

cc @tqchen @comaniac @FrozenGene
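
The multithreading suspicion above can be made concrete with a small sketch. The traceback shows the rule receiving a pointer to a `std::mersenne_twister_engine` while running under `parallel_for`. One common way to rule out shared generator state is to give every task its own generator, seeded deterministically. The following is a Python analogy under that assumption, not TVM code; `sample_population` and the seed derivation are hypothetical.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def sample_population(num_states, base_seed, num_workers=4):
    """Draw one random value per state, with a private generator per task."""

    def sample_one(index):
        # Hypothetical seed derivation: each task builds its own generator,
        # so no generator object is ever shared between threads.
        rng = random.Random(base_seed * 1_000_003 + index)
        return rng.randrange(1 << 30)

    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        # pool.map returns results in input order, regardless of scheduling.
        return list(pool.map(sample_one, range(num_states)))
```

Because each value depends only on `base_seed` and the task index, two runs with the same seed give the same population no matter how the threads are scheduled, which also makes a rare failure easier to replay.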

@FrozenGene
Member

Did you try building a debug version of TVM and using gdb --args python ... to see the call stack and which code produces this error? Just using try...catch seems a little brute-force to me.
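
For readers unfamiliar with the pattern being criticized, here is a minimal sketch of what a "catch and discard" workaround generally looks like. The actual diff of this PR is not shown in the conversation, so this is only an illustration in Python; `apply_rule_safely`, `sample_valid`, and the toy rule are hypothetical names.

```python
def apply_rule_safely(rule, state):
    """Apply a mutation rule; return None if the rule hits an index error."""
    try:
        return rule(state)
    except IndexError:
        # The failure is swallowed here, so the caller never learns why
        # the state was invalid. This is the "brute force" part.
        return None


def sample_valid(rule, states):
    """Keep only the results of states on which the rule succeeded."""
    results = []
    for state in states:
        outcome = apply_rule_safely(rule, state)
        if outcome is not None:
            results.append(outcome)
    return results
```

A toy rule such as `lambda s: s[4]` fails on a four-element list in the same way as the "indexing 4 on an array of size 4" error in the log, and the wrapper silently drops that state. The tests stop failing, but the cause of the bad index stays hidden.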

@jcf94
Contributor Author

jcf94 commented Sep 25, 2020

> Did you try building a debug version of TVM and using gdb --args python ... to see the call stack and which code produces this error? Just using try...catch seems a little brute-force to me.

The problem is that this is not always reproducible. The only sure thing is that the bug is caused by the InitChangeComputeLocation() rule.

@FrozenGene
Member

> > Did you try building a debug version of TVM and using gdb --args python ... to see the call stack and which code produces this error? Just using try...catch seems a little brute-force to me.
>
> The problem is that this is not always reproducible. The only sure thing is that the bug is caused by the InitChangeComputeLocation() rule.

One thing you could do is remove the CHECK inside TVM and run the script many times until the program crashes. That will produce a core file, which you can then debug with gdb.
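
The "run it many times until it crashes" suggestion can be sketched as a small driver script. This is an assumption-laden illustration: `run_until_failure` is a hypothetical helper, and the command to run is whatever reproduces the flaky test locally. On Linux, a core file is typically only written if the core size limit allows it (for example after `ulimit -c unlimited` in the launching shell).

```python
import subprocess
import sys


def run_until_failure(cmd, max_runs=1000):
    """Run cmd repeatedly; return (attempt, returncode) for the first failure.

    Returns None if every run exits with status 0. On POSIX systems a
    negative return code means the process was terminated by a signal,
    which is the case that can leave a core file behind.
    """
    for attempt in range(1, max_runs + 1):
        result = subprocess.run(cmd)
        if result.returncode != 0:
            return attempt, result.returncode
    return None
```

A call such as `run_until_failure([sys.executable, "path/to/flaky_test.py"])` (the path is a placeholder) would stop at the first failing run, after which the resulting core file can be opened in gdb together with the Python executable.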

@comaniac comaniac left a comment

I'll merge this PR first to remove the CI flakiness. We should definitely keep digging into InitChangeComputeLocation() to find the root cause.

@comaniac comaniac merged commit 8889c7a into apache:master Sep 25, 2020
@comaniac
Contributor

Thanks @jcf94 @FrozenGene

@jcf94
Contributor Author

jcf94 commented Sep 26, 2020

> Thanks @jcf94 @FrozenGene

Thanks.

@merrymercy
Member

merrymercy commented Sep 27, 2020

This kind of general exception catch is not good for future maintenance. We should dig deeper to find out the underlying cause.
The mutation rules, LoopState, and InferBound all work well in the single thread case. I think some of these components are not thread-safe.

#6512 does not change any logic; it just moves some functions to a different location. Can you confirm whether this is caused by #6512 or #6529?

TusharKanekiDey pushed a commit to TusharKanekiDey/tvm that referenced this pull request Oct 13, 2020
trevor-m pushed a commit to neo-ai/tvm that referenced this pull request Oct 19, 2020