-
Notifications
You must be signed in to change notification settings - Fork 2.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
'libtpu.so already in use' but actually not used #10192
Comments
@skye I found at least one of the cause of the problem: $ export TF_CPP_MIN_LOG_LEVEL=0
$ python3 -c 'import jax; print(jax.devices())'
2022-04-09 23:15:46.156187: I external/org_tensorflow/tensorflow/core/tpu/tpu_initializer_helper.cc:68] libtpu.so already in used by another process. Not attempting to load libtpu.so in this process.
2022-04-09 23:15:46.156229: I external/org_tensorflow/tensorflow/core/tpu/tpu_api_dlsym_initializer.cc:116] Libtpu path is: libtpu.so
WARNING: Logging before InitGoogle() is written to STDERR
I0409 23:15:46.181567 1339488 tpu_initializer_helper.cc:68] libtpu.so already in used by another process. Not attempting to load libtpu.so in this process.
2022-04-09 23:15:46.195009: I external/org_tensorflow/tensorflow/core/tpu/tpu_initializer_helper.cc:68] libtpu.so already in used by another process. Not attempting to load libtpu.so in this process.
2022-04-09 23:15:46.198059: I external/org_tensorflow/tensorflow/core/tpu/tpu_executor_dlsym_initializer.cc:68] Libtpu path is: libtpu.so
2022-04-09 23:15:46.198096: I external/org_tensorflow/tensorflow/core/tpu/tpu_initializer_helper.cc:68] libtpu.so already in used by another process. Not attempting to load libtpu.so in this process.
2022-04-09 23:15:46.833145: I external/org_tensorflow/tensorflow/compiler/xla/service/service.cc:171] XLA service 0x31c7860 initialized for platform Interpreter (this does not guarantee that XLA will be used). Devices:
2022-04-09 23:15:46.833194: I external/org_tensorflow/tensorflow/compiler/xla/service/service.cc:179] StreamExecutor device (0): Interpreter, <undefined>
2022-04-09 23:15:46.850903: I external/org_tensorflow/tensorflow/compiler/xla/pjrt/tfrt_cpu_pjrt_client.cc:163] TfrtCpuClient created.
2022-04-09 23:15:46.851741: I external/org_tensorflow/tensorflow/stream_executor/tpu/tpu_platform_interface.cc:74] No TPU platform found.
WARNING:absl:No GPU/TPU found, falling back to CPU. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.) So I opened tensorflow/core/tpu/tpu_initializer_helper.cc, it shows that the program checks a lock file So I: $ sudo rm -f /tmp/libtpu_lockfile
$ python3 -c 'import jax; print(jax.devices())' Then it complains about After that the command works: $ python3 -c 'import jax; print(jax.devices())'
[TpuDevice(id=0, process_index=0, coords=(0,0,0), core_on_chip=0), TpuDevice(id=1, process_index=0, coords=(0,0,0), core_on_chip=1), TpuDevice(id=2, process_index=0, coords=(1,0,0), core_on_chip=0), TpuDevice(id=3, process_index=0, coords=(1,0,0), core_on_chip=1), TpuDevice(id=4, process_index=0, coords=(0,1,0), core_on_chip=0), TpuDevice(id=5, process_index=0, coords=(0,1,0), core_on_chip=1), TpuDevice(id=6, process_index=0, coords=(1,1,0), core_on_chip=0), TpuDevice(id=7, process_index=0, coords=(1,1,0), core_on_chip=1)] |
@skye This is indeed the cause of the problem! And this also explains why TPU is available for one user but not for others: (user1) $ ls -l /tmp/libtpu_lockfile
-rw-r--r-- 1 user1 user1 0 Mar 30 10:32 /tmp/libtpu_lockfile
(user1) $ python -c 'import jax; print(jax.devices())' # works
(user1) $ sudo -H -- su -
(root) $ python -c 'import jax; print(jax.devices())' # not working
(root) $ chown root:root /tmp/libtpu_lockfile
(root) $ python -c 'import jax; print(jax.devices())' # works
(root) $ exit
(user1) $ python -c 'import jax; print(jax.devices())' # not working
(user1) $ sudo chmod go+rw /tmp/libtpu_lockfile
(user1) $ python -c 'import jax; print(jax.devices())' # works |
Hey @ayaka14732, sorry for the delay! I was hiking in Nepal 🏔️ Just to make sure I understand, is the issue that |
@skye Haven't seen you on GitHub for a long time. I miss you so much!
Yes!
"works" means can access TPU (prints |
This is strange. The lockfile should be automatically deleted when the Python process exits. Can you rerun the same set of commands again, but make sure the lockfile doesn't exist at the beginning? (i.e. |
Sometimes the Python process was aborted, and therefore the file is not removed. For example, import jax
devices = jax.devices()
import sys
sys.exit(-1) The set of commands above was an explanation that |
I'm having the exact same issue - but deleting the lockfile doesn't seem to fix it. Any pointers would be super appreciated! @skye @ayaka14732 |
@shashank2000 that error message is only triggered by the presence of the lockfile. Please doublecheck that the lockfile isn't there and that you're getting the exact error message? |
@skye I can verify I still see the message even after I delete both EDIT: can confirm that things work 'fine' when I run my commands standalone as opposed to a |
The underlying lock file (/tmp/libtpu_lockfile) is not removed on program exit but reused across runs and users. So the open() permissions for it should be 0666 in order to work with multiple users (different logins on the same TPU VMs for example). The lockfile will only be openable in write mode by the first user that creates it. Setting permissions to 0666 instead of 0644 will allow others to open it and thus have a valid fd for the lockf() call below. |
This would happen if the Python program is still running. I use |
It's been a while since someone commented on this. But if anyone is having the same issue, in my case It also shows the PID, which can then be used to kill the process as mentioned above by @ayaka14732 . |
Sep 2022 Update
Solution: Run
before running Python.
Original Post
We can test if TPU is being used by this command:
python -c 'import jax; print(jax.devices())'
In theory, if the TPU is not in use, it will print
TpuDevice
; otherwise, it will printCpuDevice
, and a warning will be shown:However, sometimes the command shows that TPU is being used, but I can be sure that the TPU is not being used. Besides,
sudo lsof -w /dev/accel0
shows no process is using TPU.In order to rule out the possibility that another process that was using TPU just exited, I reran the command for several times and the results are the same.
This bug even happens when I created multiple users on the TPU VM. I login in as one user and it shows that the TPU is in used, but then I immediately log in as another user and it works fine.
I want to help to debug this issue but I don't know where to start.
The text was updated successfully, but these errors were encountered: