Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

'libtpu.so already in use' but actually not used #10192

Open
ayaka14732 opened this issue Apr 8, 2022 · 12 comments
Open

'libtpu.so already in use' but actually not used #10192

ayaka14732 opened this issue Apr 8, 2022 · 12 comments
Assignees
Labels
bug Something isn't working

Comments

@ayaka14732
Copy link
Member

ayaka14732 commented Apr 8, 2022

Sep 2022 Update

Solution: Run

rm -rf /tmp/libtpu_lockfile /tmp/tpu_logs

before running Python.


Original Post

We can test if TPU is being used by this command:

python -c 'import jax; print(jax.devices())'

In theory, if the TPU is not in use, it will print TpuDevice; otherwise, it will print CpuDevice, and a warning will be shown:

I0000 00:00:1649423660.053391 1924758 f236.cc:165] libtpu.so already in use by another process probably owned by another user. Run "$ sudo lsof -w /dev/accel0" to figure out which process is using the TPU. Not attempting to load libtpu.so in this process.
WARNING:absl:No GPU/TPU found, falling back to CPU. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)

However, sometimes the command shows that TPU is being used, but I can be sure that the TPU is not being used. Besides, sudo lsof -w /dev/accel0 shows no process is using TPU.

In order to rule out the possibility that another process that was using TPU just exited, I reran the command for several times and the results are the same.

This bug even happens when I created multiple users on the TPU VM. I login in as one user and it shows that the TPU is in used, but then I immediately log in as another user and it works fine.

I want to help to debug this issue but I don't know where to start.

@ayaka14732 ayaka14732 added the bug Something isn't working label Apr 8, 2022
@ayaka14732
Copy link
Member Author

@skye I found at least one of the cause of the problem:

$ export TF_CPP_MIN_LOG_LEVEL=0
$ python3 -c 'import jax; print(jax.devices())'
2022-04-09 23:15:46.156187: I external/org_tensorflow/tensorflow/core/tpu/tpu_initializer_helper.cc:68] libtpu.so already in used by another process. Not attempting to load libtpu.so in this process.
2022-04-09 23:15:46.156229: I external/org_tensorflow/tensorflow/core/tpu/tpu_api_dlsym_initializer.cc:116] Libtpu path is: libtpu.so
WARNING: Logging before InitGoogle() is written to STDERR
I0409 23:15:46.181567 1339488 tpu_initializer_helper.cc:68] libtpu.so already in used by another process. Not attempting to load libtpu.so in this process.
2022-04-09 23:15:46.195009: I external/org_tensorflow/tensorflow/core/tpu/tpu_initializer_helper.cc:68] libtpu.so already in used by another process. Not attempting to load libtpu.so in this process.
2022-04-09 23:15:46.198059: I external/org_tensorflow/tensorflow/core/tpu/tpu_executor_dlsym_initializer.cc:68] Libtpu path is: libtpu.so
2022-04-09 23:15:46.198096: I external/org_tensorflow/tensorflow/core/tpu/tpu_initializer_helper.cc:68] libtpu.so already in used by another process. Not attempting to load libtpu.so in this process.
2022-04-09 23:15:46.833145: I external/org_tensorflow/tensorflow/compiler/xla/service/service.cc:171] XLA service 0x31c7860 initialized for platform Interpreter (this does not guarantee that XLA will be used). Devices:
2022-04-09 23:15:46.833194: I external/org_tensorflow/tensorflow/compiler/xla/service/service.cc:179]   StreamExecutor device (0): Interpreter, <undefined>
2022-04-09 23:15:46.850903: I external/org_tensorflow/tensorflow/compiler/xla/pjrt/tfrt_cpu_pjrt_client.cc:163] TfrtCpuClient created.
2022-04-09 23:15:46.851741: I external/org_tensorflow/tensorflow/stream_executor/tpu/tpu_platform_interface.cc:74] No TPU platform found.
WARNING:absl:No GPU/TPU found, falling back to CPU. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)

So I opened tensorflow/core/tpu/tpu_initializer_helper.cc, it shows that the program checks a lock file /tmp/libtpu_lockfile.

So I:

$ sudo rm -f /tmp/libtpu_lockfile
$ python3 -c 'import jax; print(jax.devices())'

Then it complains about /tmp/tpu_logs, so I removed the directory.

After that the command works:

$ python3 -c 'import jax; print(jax.devices())'
[TpuDevice(id=0, process_index=0, coords=(0,0,0), core_on_chip=0), TpuDevice(id=1, process_index=0, coords=(0,0,0), core_on_chip=1), TpuDevice(id=2, process_index=0, coords=(1,0,0), core_on_chip=0), TpuDevice(id=3, process_index=0, coords=(1,0,0), core_on_chip=1), TpuDevice(id=4, process_index=0, coords=(0,1,0), core_on_chip=0), TpuDevice(id=5, process_index=0, coords=(0,1,0), core_on_chip=1), TpuDevice(id=6, process_index=0, coords=(1,1,0), core_on_chip=0), TpuDevice(id=7, process_index=0, coords=(1,1,0), core_on_chip=1)]

@ayaka14732
Copy link
Member Author

@skye This is indeed the cause of the problem! And this also explains why TPU is available for one user but not for others:

(user1) $ ls -l /tmp/libtpu_lockfile
-rw-r--r-- 1 user1 user1 0 Mar 30 10:32 /tmp/libtpu_lockfile
(user1) $ python -c 'import jax; print(jax.devices())'  # works
(user1) $ sudo -H -- su -
(root) $ python -c 'import jax; print(jax.devices())'  # not working
(root) $ chown root:root /tmp/libtpu_lockfile
(root) $ python -c 'import jax; print(jax.devices())'  # works
(root) $ exit
(user1) $ python -c 'import jax; print(jax.devices())'  # not working
(user1) $ sudo chmod go+rw /tmp/libtpu_lockfile
(user1) $ python -c 'import jax; print(jax.devices())'  # works

@skye
Copy link
Member

skye commented May 3, 2022

Hey @ayaka14732, sorry for the delay! I was hiking in Nepal 🏔️

Just to make sure I understand, is the issue that /tmp/libtpu_lockfile sometimes exists even when no process is using the TPU? I'm not sure what "works" and "not works" means in your comment above.

@ayaka14732
Copy link
Member Author

@skye Haven't seen you on GitHub for a long time. I miss you so much!

Just to make sure I understand, is the issue that /tmp/libtpu_lockfile sometimes exists even when no process is using the TPU?

Yes!

I'm not sure what "works" and "not works" means in your comment above.

"works" means can access TPU (prints TpuDevice), otherwise it prints CpuDevice.

@skye
Copy link
Member

skye commented May 3, 2022

This is strange. The lockfile should be automatically deleted when the Python process exits. Can you rerun the same set of commands again, but make sure the lockfile doesn't exist at the beginning? (i.e. ls -l /tmp/libtpu_lockfile should say the file doesn't exist)

@ayaka14732
Copy link
Member Author

Sometimes the Python process was aborted, and therefore the file is not removed. For example,

import jax
devices = jax.devices()
import sys
sys.exit(-1)

The set of commands above was an explanation that /tmp/libtpu_lockfile is indeed the cause of the problem.

@shashank2000
Copy link

shashank2000 commented May 18, 2022

I'm having the exact same issue - but deleting the lockfile doesn't seem to fix it. Any pointers would be super appreciated! @skye @ayaka14732

@skye
Copy link
Member

skye commented May 23, 2022

@shashank2000 that error message is only triggered by the presence of the lockfile. Please doublecheck that the lockfile isn't there and that you're getting the exact error message?

@shashank2000
Copy link

shashank2000 commented May 31, 2022

@skye I can verify I still see the message even after I delete both tpu_logs and the lockfile. I've been running processes as Python subprocesses through a script - is there a reason to believe this could be an issue?

EDIT: can confirm that things work 'fine' when I run my commands standalone as opposed to a subprocess or os.system call which is what used to be the case; is there a reason my parent script (which doesn't even have a jax dependency, all I'm doing is parsing args and running the appropriate training job) is the process holding the lock?

@janimo
Copy link

janimo commented Apr 15, 2023

The underlying lock file (/tmp/libtpu_lockfile) is not removed on program exit but reused across runs and users.

So the open() permissions for it should be 0666 in order to work with multiple users (different logins on the same TPU VMs for example). The lockfile will only be openable in write mode by the first user that creates it.

Setting permissions to 0666 instead of 0644 will allow others to open it and thus have a valid fd for the lockf() call below.

https://github.com/tensorflow/tensorflow/blob/68c69e710f95f9a5a8fa91c1fbcb642166176944/tensorflow/core/tpu/tpu_initializer_helper.cc#L159

@ayaka14732
Copy link
Member Author

@shashank2000

I'm having the exact same issue - but deleting the lockfile doesn't seem to fix it. Any pointers would be super appreciated!

This would happen if the Python program is still running.

I use ps aux | grep python to view all existing Python programs. If the program is still running, I use kill -9 <PID> to kill it.

@JadM133
Copy link

JadM133 commented Nov 7, 2024

It's been a while since someone commented on this.

But if anyone is having the same issue, in my case ps aux | grep python didn't show my processes when connected by ssh to the TPU node, but top did.

It also shows the PID, which can then be used to kill the process as mentioned above by @ayaka14732 .

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

5 participants