[Bug] tutorials do not build from a clean source tree #9013
Comments
@electriclilies Lily, in your spelunking through build did you see any obvious global compile engine caching?
Failed again in mainline CI: https://ci.tlcpack.ai/blue/organizations/jenkins/tvm/detail/PR-9050/1/pipeline
No luck on a local repro (using nvidia docker and the ci_gpu image) as nvcc.py is failing for some reason.
@mbs-octoml i think you can repro like this:
it definitely will not repro if you've already built the docs once and don't run
I can repro with my local config just with make clean & make html, no need for docker etc. Good, that's easier.
Same here in my PR's CI. #9053
I also hit the issue in the staging Jenkins.
I'm looking again now.
On the BAD runs:
I can't find any matching rewrite on the GOOD runs -- they're all of the form:
Eh?
So the only difference is
As expected, all is well if I disable AlterOpLayout. I need to log whatever hidden state is driving that rewrite.
Ok, after getting lost in AlterOpLayout I see dev/use_pass_infra.py has @relay.op.register_alter_op_layout("nn.conv2d"), which is obviously sticky and still visible to the later micro_autotune.py. Almost certainly that definition is ill-formed in some way.
So the root problem is that our tutorials need to be hermetic, but there's no 'unregister' mechanism or ability to register under some 'with TvmRegistrationScope()' statement. At least making that layout transform valid will let us hobble along a bit longer, though.
Even better: stop using sphinx_gallery.
Thanks for the detailed investigation @mbs-octoml! I do think we should make the compiler work multiple times in a row. Certainly our unit tests require this, and we will expose a bunch of problems with xdist after it starts reordering them. :)
This is fixed -- I don't have edit rights on issues.
Thanks for fixing this one @mbs-octoml!
I think this might be a compiler caching bug. cc @tqchen @jroesch @mbs-octoml. It was not caught in the regression because the regression rebuilds only changed tutorials.
Steps to reproduce:
git checkout dc2f70e3c8a9b14b9e414ecf768ad32e6c3c3960
rm -rf build
docker/bash.sh ci_gpu tests/scripts/task_config_build_gpu.sh
docker/bash.sh ci_gpu tests/scripts/task_build.sh build -j16
docker/bash.sh ci_gpu bash -c 'cd docs && make clean'
docker/bash.sh ci_gpu tests/scripts/task_ci_setup.sh
docker/bash.sh ci_gpu tests/scripts/task_python_docs.sh
This will show this traceback somewhere along the way. micro_autotune was just trying to build a Relay model, and the shapes look correct to me. Rerunning task_python_docs.sh should cause the tutorials to build.