Error during stage3: Unimplemented MHLO #127

cyz2727327 · 2022-09-24T17:23:27Z

Hi everyone,

I am using Windows 10 with a single GPU (Quadro RTX 8000). I ran the code after commenting out the 3 lines of Python code that require 8 GPUs.
I was able to complete the first 2 stages of training on the Chair dataset, though in my case it took significantly longer than expected (50+ hours). However, I was not able to train stage 3 successfully, and this error keeps appearing:

" Attempting to fetch value instead of handling error UNIMPLEMENTED: Unimplemented MHLO -> HloOpcode: %113 = mhlo.round_nearest_even %112 : tensor<1048576x8xf32>"

Can anyone give me a hint on how to fix this? Thank you.
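For anyone triaging the same message: in JAX, `mhlo.round_nearest_even` typically comes from round-half-to-even rounding such as `jnp.round`, and an "Unimplemented MHLO -> HloOpcode" error suggests the installed backend cannot translate that op. The cause is not confirmed in this thread, so a reasonable first step is simply to record which `jax` and `jaxlib` builds are installed and compare them across machines. The sketch below uses only the standard library; `installed_versions` is a hypothetical helper name, not part of MobileNeRF or JAX.

```python
from importlib import metadata


def installed_versions(packages=("jax", "jaxlib")):
    """Return {package name: version string, or None if not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Package is not installed in this environment.
            versions[name] = None
    return versions


if __name__ == "__main__":
    # Paste this output into a bug report next to the OS and GPU model.
    for name, version in installed_versions().items():
        print(f"{name}: {version}")
```

Comparing this output between a failing and a working environment narrows down whether the difference lies in the JAX build rather than in the stage 3 script.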

junhua-l · 2022-10-01T00:04:01Z

I have a similar situation. The first two stages run normally, but every time I have tried to run stage 3, this error appears and seems to repeat endlessly:
jaxlib.xla_extension.XlaRuntimeError: INTERNAL: stream did not block host until done; was already in an error state
2022-10-01 05:44:29.136328: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:1047] could not synchronize on CUDA context: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered :: *** Begin stack trace ***

Strangely, stages 1 and 2 do not take much time for me, but stage 3 takes very long to run.

Wushuangpin · 2022-10-01T06:38:50Z

I also have a similar situation.
root@container-c42911ad3c-daf18020:~/mobilenerf# python stage3.py && shutdown
train images: (100, 800, 800, 3) c2w: (100, 4, 4) hwf: (3,)
test images: (200, 800, 800, 3) c2w: (200, 4, 4) hwf: (3,)
Number of quad faces: 137472
Removing invisible triangles
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100/100 [53:41<00:00, 32.22s/it]
Removing invisible triangles
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 300/300 [2:40:52<00:00, 32.17s/it]
Number of quad faces: 89443
Testing
 46%|███████████████████████████████████████████████████████████████████▍ | 93/200 [50:11<57:10, 32.06s/it]
2022-09-24 19:30:13.127196: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:1163] failed to enqueue async memcpy from device to host: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered; host dst: 0x559810d57c40; GPU src: 0x7f9e77657600; size: 32768=0x8000
2022-09-24 19:30:13.127242: E external/org_tensorflow/tensorflow/stream_executor/stream.cc:344] Error recording event in stream: Error recording CUDA event: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered; not marking stream as bad, as the Event object may be at fault. Monitor for further errors.
2022-09-24 19:30:13.127256: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:618] unable to add host callback: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
 46%|███████████████████████████████████████████████████████████████████▍ | 93/200 [50:44<58:22, 32.73s/it]
Traceback (most recent call last):
  File "stage3.py", line 2222, in <module>
    out = render_loop(camera_ray_batch(p, hwf), vars, point_UV_grid, texture_alpha, texture_features, test_batch_size)
  File "stage3.py", line 2071, in render_loop
    outs = [render_test([x[i:i+chunk] for x in rays], vars, uv, alp, feat)
  File "stage3.py", line 2071, in <listcomp>
    outs = [render_test([x[i:i+chunk] for x in rays], vars, uv, alp, feat)
  File "stage3.py", line 2055, in render_test
    selected_uv = numpy.array(selected_uv)
  File "/root/miniconda3/lib/python3.8/site-packages/jax/_src/device_array.py", line 264, in __array__
    return np.asarray(self._value, dtype=dtype)
  File "/root/miniconda3/lib/python3.8/site-packages/jax/interpreters/pxla.py", line 674, in _sda_value
    npy_value[self.indices[i]] = self.device_buffers[i].to_py()
jaxlib.xla_extension.XlaRuntimeError: INTERNAL: stream did not block host until done; was already in an error state
2022-09-24 19:30:13.284806: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:1047] could not synchronize on CUDA context: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered :: *** Begin stack trace ***
    _PyGC_CollectNoFail
    PyImport_Cleanup
    Py_FinalizeEx
    Py_RunMain
    Py_BytesMain
    __libc_start_main
*** End stack trace ***
2022-09-24 19:30:13.285110: F external/org_tensorflow/tensorflow/compiler/xla/service/gpu/gpu_executable.cc:284] Check failed: pair.first->SynchronizeAllActivity()
Aborted (core dumped)
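The traceback above shows `render_loop` slicing each ray array as `x[i:i+chunk]` and rendering one slice per call, with `test_batch_size` passed in as the chunk size. As a rough illustration of that pattern only (not the actual `stage3.py` code), here is a minimal pure-Python sketch; `iter_chunks`, `render_in_chunks`, and `render_fn` are hypothetical names. A smaller chunk lowers the amount of data handled per call, but whether that avoids the `CUDA_ERROR_ILLEGAL_ADDRESS` crash is untested here.

```python
def iter_chunks(n_items, chunk):
    """Yield (start, stop) index pairs that cover range(n_items) in steps of `chunk`."""
    if chunk <= 0:
        raise ValueError("chunk must be a positive integer")
    for start in range(0, n_items, chunk):
        yield start, min(start + chunk, n_items)


def render_in_chunks(rays, chunk, render_fn):
    """Call render_fn on aligned slices of every sequence in `rays`.

    `rays` is a list of equally long sequences (for example origins and
    directions); `render_fn` is a stand-in for the real per-chunk renderer.
    """
    n_items = len(rays[0])
    outs = []
    for start, stop in iter_chunks(n_items, chunk):
        outs.append(render_fn([x[start:stop] for x in rays]))
    return outs


if __name__ == "__main__":
    # Toy data: two aligned "ray" sequences and a renderer that adds them.
    rays = [list(range(10)), list(range(10, 20))]
    outs = render_in_chunks(rays, 4, lambda parts: [a + b for a, b in zip(*parts)])
    print([len(o) for o in outs])
```

With 10 items and a chunk size of 4, the renderer is called three times, on slices of length 4, 4, and 2.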

cyz2727327 · 2022-10-01T23:07:24Z

I fixed this issue by switching to Ubuntu Linux. Now I have a new problem when trying to run the real360 data: not enough memory.
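The post does not say whether the shortage is in GPU memory or host RAM. If it is GPU memory, one thing to try is JAX's documented allocator environment variables, `XLA_PYTHON_CLIENT_PREALLOCATE` and `XLA_PYTHON_CLIENT_MEM_FRACTION`, which must be set before `jax` is first imported. The sketch below is only an assumption about what might help, not a confirmed fix for real360; `configure_jax_gpu_memory` is a hypothetical helper name.

```python
import os


def configure_jax_gpu_memory(preallocate=False, mem_fraction=None, env=None):
    """Set JAX GPU allocator environment variables.

    preallocate:  False makes JAX allocate GPU memory on demand instead of
                  reserving a large block at startup.
    mem_fraction: optional fraction (0, 1] of GPU memory to preallocate when
                  preallocation is enabled.
    env:          mapping to modify; defaults to os.environ.
    """
    env = os.environ if env is None else env
    env["XLA_PYTHON_CLIENT_PREALLOCATE"] = "true" if preallocate else "false"
    if mem_fraction is not None:
        if not 0.0 < mem_fraction <= 1.0:
            raise ValueError("mem_fraction must be in (0, 1]")
        env["XLA_PYTHON_CLIENT_MEM_FRACTION"] = f"{mem_fraction:.2f}"
    return env


# Call this at the very top of the script, before any `import jax`.
configure_jax_gpu_memory(preallocate=False)
```

On-demand allocation can make out-of-memory errors appear later rather than at startup, so it mainly helps when other processes share the GPU. If the scene itself does not fit, a smaller batch size or a GPU with more memory is still needed.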
