Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Relay] Add a non-recursive LetNode VisitExpr_ for LabelOps Pass to avoid stack overflow #8917

Merged
merged 5 commits into from
Sep 7, 2021

Conversation

FrankYzy
Copy link
Contributor

@FrankYzy FrankYzy commented Sep 3, 2021

When I use the relay VM to compile a graph that contains a huge for loop, the LabelOps pass encountered stack overflow. This problem is caused by the LabelOps applies the default ExprMutator::VisitExpr_(const LetNode* op) function to visit the LetNode, which is a recursive function.

Here is a small testing case:

data = relay.var("data", tvm.ir.TensorType(shape = (relay.Any(),5,2), dtype = "float32"))
shape = relay.op.shape_of(data)

for i in range(1000):
    tmp = relay.op.ones(shape, "float32")
    shape = relay.op.shape_of(tmp)
res = shape

mod = tvm.IRModule()
mod["main"] = relay.Function([data], res)

In this case, I set the for loop with 1000 times, note that the specific number of times that will cause the LabelOps stack overflow may vary in different machine environment.

To fix this bug, I added a non-recursive LetNode VisitExpr_ in the Class LabelOpsMutator.

Copy link
Contributor

@comaniac comaniac left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Also cc @mbrookhart

@Mousius
Copy link
Member

Mousius commented Sep 4, 2021

hi @yanggg1997, thanks for fixing this, is it possible to take the test case you have and create an automated python test which should fail in CI?

@comaniac
Copy link
Contributor

comaniac commented Sep 4, 2021

I'm not sure if it's a good idea to have a test case that attempts to cause stack overflow without this PR. That test case may be flaky due to different stack sizes on different CI machines.

@Mousius
Copy link
Member

Mousius commented Sep 5, 2021

I'm not sure if it's a good idea to have a test case that attempts to cause stack overflow without this PR. That test case may be flaky due to different stack sizes on different CI machines.

Sorry, I meant for it to be included in this PR so we have test coverage and some way of tracking if this regresses. I understand it might not always pick up the issue on all machines but it shouldn't fail if this change is in place?

@FrankYzy
Copy link
Contributor Author

FrankYzy commented Sep 6, 2021

I'm not sure if it's a good idea to have a test case that attempts to cause stack overflow without this PR. That test case may be flaky due to different stack sizes on different CI machines.

Sorry, I meant for it to be included in this PR so we have test coverage and some way of tracking if this regresses. I understand it might not always pick up the issue on all machines but it shouldn't fail if this change is in place?

Hi, thanks for your reply. I'm not sure whether my test case will run successfully on the CI machines with unknown stack size although after adding this PR, because recently I also found that with the loop times increases to a much bigger number, some other Passes will also encounter segmentation faults. For example, when the loop times is set to 1000, the LabelOps encounters stack overflow due to the recursive LetNode VisitExpr_ function. After adding this PR, the fault disappears, but when the loop times is set to 3000 or bigger, other passes will encounter unknown segmentation fault. Therefore, It may be difficult to set the loop times in the test case that only test the LabelOps without triggering other Passes' faults on the CI machine, because the specific loop times is also relative to the stack size. I hope this PR will help to strength the LabelOps Pass, and I will also take time to try to fix segmentation fault caused by other passes in the future.

Copy link
Member

@Mousius Mousius left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for explaining that to me @yanggg1997, that makes total sense!

@masahi masahi merged commit 125a74e into apache:main Sep 7, 2021
ylc pushed a commit to ylc/tvm that referenced this pull request Sep 29, 2021
…void stack overflow (apache#8917)

* Add a non-recursive Let VisitExpr_ for LabelOps

* fake commit to retrigger CI

* fake commit to retrigger the CI

* fix CI issue

* fix CI issue
ylc pushed a commit to ylc/tvm that referenced this pull request Jan 13, 2022
…void stack overflow (apache#8917)

* Add a non-recursive Let VisitExpr_ for LabelOps

* fake commit to retrigger CI

* fake commit to retrigger the CI

* fix CI issue

* fix CI issue
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants