Error during stage3: Unimplemented MHLO #127

Open
cyz2727327 opened this issue Sep 24, 2022 · 3 comments

Comments

@cyz2727327

Hi everyone,

I am using Windows 10 with a single GPU (Quadro RTX 8000). I ran the code after commenting out the 3 lines of Python code that require 8 GPUs.
I was able to complete the first 2 stages of training on the Chair dataset, though in my case it took significantly longer than expected (50+ hours). However, I was not able to train stage 3 successfully; this error keeps appearing:

" Attempting to fetch value instead of handling error UNIMPLEMENTED: Unimplemented MHLO -> HloOpcode: %113 = mhlo.round_nearest_even %112 : tensor<1048576x8xf32>"

Can anyone give me a hint on how to fix this? Thank you.
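For context on what the failing op computes: `mhlo.round_nearest_even` rounds to the nearest integer and sends exact .5 ties to the even neighbour. The scalar sketch below spells out that rule using only floor and comparisons. The helper name `round_half_even` is hypothetical and is not part of stage3.py. Whether expressing the rounding this way in the training script would get past the error on Windows is an untested assumption.

```python
import math


def round_half_even(x: float) -> float:
    """Round x to the nearest integer; exact .5 ties go to the even neighbour.

    Hypothetical helper for illustration only; expects a finite float.
    """
    lower = math.floor(x)   # largest integer <= x
    diff = x - lower        # fractional part, in [0, 1)
    if diff > 0.5:
        return float(lower + 1)
    if diff < 0.5:
        return float(lower)
    # Exact tie: choose whichever of the two neighbours is even.
    return float(lower if lower % 2 == 0 else lower + 1)


if __name__ == "__main__":
    # Python's built-in round() uses the same tie rule, so the columns agree.
    for value in (0.5, 1.5, 2.5, -0.5, -1.5, 2.4, 2.6):
        print(value, round_half_even(value), round(value))
```

The same arithmetic (floor, compare against 0.5, parity check on ties) carries over to elementwise array operations, which is what an array-level workaround would need.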

@junhua-l

junhua-l commented Oct 1, 2022

I have a similar situation. The first two stages run normally, but every time I try to run stage 3, the following error appears repeatedly and never seems to stop:
jaxlib.xla_extension.XlaRuntimeError: INTERNAL: stream did not block host until done; was already in an error state
2022-10-01 05:44:29.136328: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:1047] could not synchronize on CUDA context: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered :: *** Begin stack trace ***

Strangely, stages 1 and 2 do not take much time for me, but stage 3 takes very long to run.

@Wushuangpin

I also have a similar situation.
root@container-c42911ad3c-daf18020:~/mobilenerf# python stage3.py && shutdown
train images: (100, 800, 800, 3) c2w: (100, 4, 4) hwf: (3,)
test images: (200, 800, 800, 3) c2w: (200, 4, 4) hwf: (3,)
Number of quad faces: 137472
Removing invisible triangles
100%|████████████████████| 100/100 [53:41<00:00, 32.22s/it]
Removing invisible triangles
100%|████████████████████| 300/300 [2:40:52<00:00, 32.17s/it]
Number of quad faces: 89443
Testing
 46%|█████████▍          | 93/200 [50:11<57:10, 32.06s/it]
2022-09-24 19:30:13.127196: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:1163] failed to enqueue async memcpy from device to host: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered; host dst: 0x559810d57c40; GPU src: 0x7f9e77657600; size: 32768=0x8000
2022-09-24 19:30:13.127242: E external/org_tensorflow/tensorflow/stream_executor/stream.cc:344] Error recording event in stream: Error recording CUDA event: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered; not marking stream as bad, as the Event object may be at fault. Monitor for further errors.
2022-09-24 19:30:13.127256: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:618] unable to add host callback: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
 46%|█████████▍          | 93/200 [50:44<58:22, 32.73s/it]
Traceback (most recent call last):
  File "stage3.py", line 2222, in <module>
    out = render_loop(camera_ray_batch(p, hwf), vars, point_UV_grid, texture_alpha, texture_features, test_batch_size)
  File "stage3.py", line 2071, in render_loop
    outs = [render_test([x[i:i+chunk] for x in rays], vars, uv, alp, feat)
  File "stage3.py", line 2071, in <listcomp>
    outs = [render_test([x[i:i+chunk] for x in rays], vars, uv, alp, feat)
  File "stage3.py", line 2055, in render_test
    selected_uv = numpy.array(selected_uv)
  File "/root/miniconda3/lib/python3.8/site-packages/jax/_src/device_array.py", line 264, in __array__
    return np.asarray(self._value, dtype=dtype)
  File "/root/miniconda3/lib/python3.8/site-packages/jax/interpreters/pxla.py", line 674, in _sda_value
    npy_value[self.indices[i]] = self.device_buffers[i].to_py()
jaxlib.xla_extension.XlaRuntimeError: INTERNAL: stream did not block host until done; was already in an error state
2022-09-24 19:30:13.284806: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:1047] could not synchronize on CUDA context: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered :: *** Begin stack trace ***
    _PyGC_CollectNoFail
    PyImport_Cleanup
    Py_FinalizeEx
    Py_RunMain
    Py_BytesMain
    __libc_start_main
*** End stack trace ***
2022-09-24 19:30:13.285110: F external/org_tensorflow/tensorflow/compiler/xla/service/gpu/gpu_executable.cc:284] Check failed: pair.first->SynchronizeAllActivity()
Aborted (core dumped)
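The traceback shows `render_loop` slicing every ray array as `x[i:i+chunk]` and calling `render_test` once per slice. A minimal pure-Python sketch of that chunking pattern is below. `render_in_chunks` and `toy_render` are hypothetical stand-ins, not the functions from stage3.py. A smaller chunk reduces how much data each call handles at once, but it is only an assumption that this is related to the CUDA_ERROR_ILLEGAL_ADDRESS above.

```python
from typing import Callable, List, Sequence


def render_in_chunks(
    rays: Sequence[Sequence[float]],
    render_fn: Callable[[List[Sequence[float]]], List[float]],
    chunk: int,
) -> List[float]:
    """Call render_fn on slices of at most `chunk` rays and join the results.

    `rays` holds one sequence per ray component (for example origins and
    directions), all of the same length.
    """
    if chunk <= 0:
        raise ValueError("chunk must be a positive integer")
    total = len(rays[0])
    results: List[float] = []
    for start in range(0, total, chunk):
        # Slice every component with the same window, as x[i:i+chunk] does.
        batch = [component[start:start + chunk] for component in rays]
        results.extend(render_fn(batch))
    return results


def toy_render(batch: List[Sequence[float]]) -> List[float]:
    """Hypothetical stand-in for a renderer: adds the two components per ray."""
    origins, directions = batch
    return [o + d for o, d in zip(origins, directions)]


if __name__ == "__main__":
    rays = [[1, 2, 3, 4, 5], [10, 20, 30, 40, 50]]
    # The chunk size changes how many calls are made, not the joined result.
    print(render_in_chunks(rays, toy_render, chunk=2))
```

In the real script the equivalent knob appears to be the `test_batch_size` argument passed to `render_loop`.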

@cyz2727327
Author

I fixed this issue by switching to Ubuntu Linux. Now I have a new problem when trying to run the real360 data: not enough memory.
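If the shortage is GPU memory, one setting that may be worth trying is JAX's allocator configuration, which is read from environment variables when JAX starts. The sketch below sets `XLA_PYTHON_CLIENT_PREALLOCATE` and `XLA_PYTHON_CLIENT_MEM_FRACTION` before JAX is imported. The helper `configure_xla_memory` and the 0.90 value are illustrative assumptions. These variables only change how memory is reserved; they do not add capacity, and they will not help if the limit is host RAM.

```python
import os
from typing import Optional


def configure_xla_memory(preallocate: bool = False,
                         mem_fraction: Optional[float] = None) -> None:
    """Set JAX GPU allocator environment variables (hypothetical helper).

    Must run before `import jax`, because the values are read at start-up.
    """
    # Turning preallocation off makes JAX allocate GPU memory on demand.
    os.environ["XLA_PYTHON_CLIENT_PREALLOCATE"] = "true" if preallocate else "false"
    if mem_fraction is not None:
        if not 0.0 < mem_fraction <= 1.0:
            raise ValueError("mem_fraction must be in (0, 1]")
        # Fraction of total GPU memory that JAX may reserve.
        os.environ["XLA_PYTHON_CLIENT_MEM_FRACTION"] = f"{mem_fraction:.2f}"


# Illustrative values only; tune for the actual GPU and dataset.
configure_xla_memory(preallocate=False, mem_fraction=0.90)
# import jax  # import JAX only after the variables above are set
```

Placing the call at the very top of the training script, ahead of any JAX import, is what makes the settings take effect.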
