[Relay] Add a non-recursive LetNode VisitExpr_ for LabelOps Pass to avoid stack overflow #8917
LGTM. Also cc @mbrookhart
Hi @yanggg1997, thanks for fixing this. Is it possible to take the test case you have and create an automated Python test that should fail in CI without this fix?
I'm not sure it's a good idea to have a test case that attempts to cause a stack overflow without this PR. That test case may be flaky because stack sizes differ across CI machines.
Sorry, I meant for it to be included in this PR so we have test coverage and some way of tracking if this regresses. I understand it might not always pick up the issue on all machines, but it shouldn't fail if this change is in place?
Hi, thanks for your reply. I'm not sure my test case will run successfully on CI machines with an unknown stack size, even with this PR applied, because I recently found that when the loop count is increased to a much larger number, some other passes also hit segmentation faults. For example, with the loop count set to 1000, LabelOps overflows the stack because of the recursive LetNode VisitExpr_ function. With this PR applied that fault disappears, but with the loop count set to 3000 or more, other passes hit an unknown segmentation fault. It may therefore be difficult to choose a loop count for the test case that exercises only LabelOps without triggering faults in other passes on the CI machine, because the right loop count also depends on the stack size. I hope this PR helps strengthen the LabelOps pass, and I will also take time to try to fix the segmentation faults caused by other passes in the future.
Thanks for explaining that to me @yanggg1997, that makes total sense!
…void stack overflow (apache#8917) * Add a non-recursive Let VisitExpr_ for LabelOps * fake commit to retrigger CI * fake commit to retrigger the CI * fix CI issue * fix CI issue
When I use the Relay VM to compile a graph that contains a huge for loop, the LabelOps pass hits a stack overflow. This happens because LabelOps uses the default
ExprMutator::VisitExpr_(const LetNode* op)
function to visit each LetNode, and that function is recursive. Here is a small test case:
In this case, I set the for loop to run 1000 times. Note that the exact number of iterations needed to make LabelOps overflow the stack may vary between machine environments.
To fix this bug, I added a non-recursive LetNode VisitExpr_ to the LabelOpsMutator class.
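To illustrate the idea, here is a minimal sketch in plain Python, not TVM code and not the C++ change in this PR. It shows why a visitor that recurses into the body of each let binding fails on a long chain, while a loop-based visitor does not. The `Var` and `Let` classes and all function names are hypothetical stand-ins for the Relay nodes, and Python's recursion limit plays the role of the native stack size.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Var:
    """Hypothetical stand-in for a Relay variable."""
    name: str


@dataclass
class Let:
    """Hypothetical stand-in for a Relay LetNode: let var = value in body."""
    var: Var
    value: int
    body: Union["Let", Var]


def build_chain(depth: int) -> Union[Let, Var]:
    """Build `let x0 = 0 in let x1 = 1 in ... x_last`, built from the inside out."""
    expr: Union[Let, Var] = Var(f"x{depth - 1}")
    for i in reversed(range(depth)):
        expr = Let(Var(f"x{i}"), i, expr)
    return expr


def count_lets_recursive(expr: Union[Let, Var]) -> int:
    """Visit the chain recursively: one stack frame per nested Let."""
    if isinstance(expr, Let):
        return 1 + count_lets_recursive(expr.body)
    return 0


def count_lets_iterative(expr: Union[Let, Var]) -> int:
    """Visit the same chain with a loop: constant stack depth."""
    count = 0
    while isinstance(expr, Let):
        count += 1
        expr = expr.body
    return count


if __name__ == "__main__":
    chain = build_chain(50_000)
    print(count_lets_iterative(chain))
    try:
        count_lets_recursive(chain)
    except RecursionError:
        print("recursive visitor exceeded the recursion limit")
```

In Python the failure is a catchable `RecursionError` at a configurable limit. In native code the limit is the thread's stack size, which is why the loop count that triggers the crash differs from machine to machine.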