
[BUG] RMM log file name contains .dev0 extension when GPU device used is not 0 #721

Closed
rlratzel opened this issue Mar 9, 2021 · 20 comments
Labels: ? - Needs Triage (Need team to review and classify), bug (Something isn't working)

@rlratzel commented Mar 9, 2021

When rmm.enable_logging() is passed a log_file_name, a .dev0 extension is always used, even when the device in use is not device 0.

This reproducer demonstrates the issue:

import cudf
import rmm
from rmm._cuda.gpu import getDevice

import glob
import os
import time

# Enable logging with an explicit log file name; RMM derives the actual
# on-disk file name from this path.
rmm.enable_logging(log_file_name="/tmp/rmmlog.csv")
s = cudf.Series([1])  # trigger a device allocation so something is logged

print(f'CUDA_VISIBLE_DEVICES={os.environ.get("CUDA_VISIBLE_DEVICES")}')
print(f"getDevice() returned: {getDevice()}")
print(f'RMM logs present on disk: {glob.glob("/tmp/rmmlog.*")}')
print("sleeping 10 seconds to check nvidia-smi on the host...")
time.sleep(10)

rmm.mr._flush_logs()
rmm.disable_logging()

output:

$> python /tmp/repro.py
CUDA_VISIBLE_DEVICES=1
getDevice() returned: 0
RMM logs present on disk: ['/tmp/rmmlog.dev0.csv']
sleeping 10 seconds to check nvidia-smi on the host...

nvidia-smi output:

...
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1594      G   /usr/lib/xorg/Xorg                  4MiB |
|    1   N/A  N/A      1594      G   /usr/lib/xorg/Xorg                 64MiB |
|    1   N/A  N/A      1673      G   /usr/bin/gnome-shell               65MiB |
|    1   N/A  N/A     17629      C   python                            609MiB |
+-----------------------------------------------------------------------------+

The process in question here is 17629 using GPU 1.

The expected behavior is to create a logfile with an extension matching the GPU in use, which in the example above would be: /tmp/rmmlog.dev1.csv

NOTE: Just in case this is related, this demo was run in a container (hence the need to check nvidia-smi on the host) with multiple GPUs exposed from the host machine. Setting CUDA_VISIBLE_DEVICES=0 in the container shows the process running on GPU 0 on the host as expected.

rlratzel added the "? - Needs Triage" and "bug" labels on Mar 9, 2021
rlratzel changed the title from "[BUG] RMM log file name sets .dev0 extension when GPU device used is not 0" to "[BUG] RMM log file name contains .dev0 extension when GPU device used is not 0" on Mar 9, 2021
@jrhemstad (Contributor)

CUDA_VISIBLE_DEVICES=1

This makes what would otherwise be device 1 appear as device 0, so what you've described is expected behavior.
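For illustration, a minimal sketch of that remapping (the helper is hypothetical, not an RMM or CUDA API, and assumes CUDA_VISIBLE_DEVICES lists plain integer IDs):

import os

def visible_to_physical(visible_id):
    """Map a CUDA-visible device index back to the physical GPU index."""
    env = os.environ.get("CUDA_VISIBLE_DEVICES")
    if not env:
        return visible_id  # no remapping in effect
    physical_ids = [int(tok) for tok in env.split(",") if tok.strip()]
    return physical_ids[visible_id]

# With CUDA_VISIBLE_DEVICES=1, getDevice() reports 0, but the physical GPU is 1:
print(visible_to_physical(0))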

@rlratzel (Author) commented Mar 9, 2021

CUDA_VISIBLE_DEVICES=1

This makes what would otherwise be device 1 appear as device 0, so what you've described is expected behavior.

Ah, okay, thanks @jrhemstad. That explains what I was about to add to the issue: the contents of the log file are correct even though the log file extension seemed wrong.

Using my example above, is it possible to have RMM create the extension as .dev1 instead, or is there a reason the log file is better off named .dev0 in this case? (cc @shwina). This isn't necessarily a showstopper, since we can form a unique log name ourselves in the call to enable_logging() and just ignore the .devn extension when looking for it afterwards.

@jrhemstad (Contributor)

I honestly don't know where the .dev0 extension is coming from. That must be getting set at the Python level.

Normally you could override the log file by setting RMM_LOG_FILE, but since it is getting passed explicitly, the envvar won't override it.
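As a hedged sketch of the fallback described above (exact behavior may vary by RMM version), enable_logging() picks up RMM_LOG_FILE only when no explicit name is passed:

import os
import rmm

# When no log_file_name is passed, RMM falls back to the RMM_LOG_FILE
# environment variable; an explicit log_file_name takes precedence over it.
os.environ["RMM_LOG_FILE"] = "/tmp/rmmlog.csv"
rmm.enable_logging()   # uses RMM_LOG_FILE
rmm.disable_logging()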

@jakirkham (Member)

This line seems relevant

return f"{name}.dev{id}{ext}"
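For context, a rough reconstruction of what that helper plausibly does (only the quoted return statement is from RMM; the function name and surrounding code are assumptions):

import os

def _append_id(filename, id):
    # Split e.g. "/tmp/rmmlog.csv" into ("/tmp/rmmlog", ".csv") and insert a
    # per-device suffix, so device 0 yields "/tmp/rmmlog.dev0.csv".
    name, ext = os.path.splitext(filename)
    return f"{name}.dev{id}{ext}"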

@shwina (Contributor) commented Mar 9, 2021

@jrhemstad, one question re: your comment: if I set CUDA_VISIBLE_DEVICES=1,0 and then call cudaSetDevice(0), is it actually device 1 being set as the current device?

@jrhemstad (Contributor)

@jrhemstad, one question re: your comment: if I set CUDA_VISIBLE_DEVICES=1,0 and then call cudaSetDevice(0), is it actually device 1 being set as the current device?

Yes.

@jakirkham (Member)

Yeah, Dask-CUDA actually uses this same trick.

@shwina (Contributor) commented Mar 9, 2021

To leave no room for ambiguity, @kkraus14 suggested we don't use device IDs 0...n in the log file names, but rather the device UUIDs -- would that work for you @rlratzel?

@jakirkham (Member) commented Mar 9, 2021

I wonder if users could specify where this info is injected. In other words, if a user provides the filename "rmm_log_{id}.txt", could we have {id} replaced with the UUID?

@shwina (Contributor) commented Mar 9, 2021

@jakirkham - I like that suggestion; what should we do if they don't specify that though? e.g., if they specify filename="rmm_log.txt"?

@jakirkham (Member) commented Mar 9, 2021

Raise a ValueError and complain that it is not present. Alternatively, we could fall back to the current behavior.
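A minimal sketch of that option (the names here are hypothetical, not an RMM API): reject a template that lacks the placeholder, otherwise substitute it.

def expand_log_template(template, device_uuid):
    # Hypothetical: substitute the user's "{id}" placeholder with the device UUID.
    if "{id}" not in template:
        raise ValueError("log file name must contain an '{id}' placeholder")
    return template.replace("{id}", device_uuid)

# expand_log_template("rmm_log_{id}.txt", "GPU-abc123") -> "rmm_log_GPU-abc123.txt"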

@rlratzel (Author) commented Mar 9, 2021

To leave no room for ambiguity, @kkraus14 suggested we don't use device IDs 0...n in the log file names, but rather the device UUIDs -- would that work for you @rlratzel?

I ultimately need to know the final filename to look for. I know the GPU ID because the user told my application about it, so is there a way to map the GPU ID the user is aware of and using (i.e., the one they may have set in CUDA_VISIBLE_DEVICES) to the UUID you would use to generate the log file? If so, then that would work fine.

...user provides the filename "rmm_log_{id}.txt"

FYI, I personally don't need that level of customization, and I might just prefer a documented naming behavior such as what's in place now. If this is better for others, though, then I'm in favor.
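For the GPU-ID-to-UUID mapping asked about above, a hedged sketch using NVML (requires the nvidia-ml-py package; note that NVML enumerates physical devices and ignores CUDA_VISIBLE_DEVICES):

import pynvml

def physical_id_to_uuid(physical_id):
    """Look up the UUID of a physical GPU by its nvidia-smi index."""
    pynvml.nvmlInit()
    try:
        handle = pynvml.nvmlDeviceGetHandleByIndex(physical_id)
        uuid = pynvml.nvmlDeviceGetUUID(handle)
        return uuid.decode() if isinstance(uuid, bytes) else uuid
    finally:
        pynvml.nvmlShutdown()

# physical_id_to_uuid(1) -> e.g. "GPU-..." for the GPU shown as index 1 by nvidia-smi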

@rlratzel (Author) commented Mar 9, 2021

Another option that would work for our application would be to just have rmm.enable_logging() return the final log file name you generate, freeing you up to generate it any way you like.

@shwina (Contributor) commented Mar 9, 2021

On second thoughts, maybe using a UUID just makes things less user-friendly?

From an end-user standpoint, device ID 0 should refer unambiguously to the 0th physical GPU (i.e., the one at the top of the nvidia-smi output), even though internally 0 can actually mean something else depending on CUDA_VISIBLE_DEVICES.

@harrism (Member) commented Mar 9, 2021

I think that depends. Some users who set CUDA_VISIBLE_DEVICES will know and expect the first GPU they set in that list to come out as device 0. Others might expect 0 to be the first physical GPU.

Perhaps we need to log information about the GPU into the header of the log file so that the reader can unambiguously interpret it.

@shwina (Contributor) commented Mar 9, 2021

I think that depends. Some users who set CUDA_VISIBLE_DEVICES will know and expect the first GPU they set in that list to come out as device 0. Others might expect 0 to be the first physical GPU.

I see your point. But environment variables like CUDA_VISIBLE_DEVICES are ephemeral, while file names are "forever". Looking at a file rmm_log.dev0.txt 1 year from now, it would be hard to say which physical GPU it corresponds to.

Perhaps we need to log information about the GPU into the header of the log file so that the reader can unambiguously interpret it.

+1

@jakirkham (Member)

cc @charlesbluca (in case you have thoughts here after having implemented RMM logging support in Dask-CUDA 🙂)

@harrism (Member) commented Mar 9, 2021

Looking at a file rmm_log.dev0.txt 1 year from now, it would be hard to say which physical GPU it corresponds to.

Looking at that file a year later, you aren't likely to have access to the same machine or configuration anyway. I think you really have to log the configuration with the log files if you want to be able to reconstruct it.

@shwina (Contributor) commented Mar 10, 2021

OK maybe 1 year was an exaggeration - apologies :) But even a week from now, it can be difficult -- especially if you're in some sort of shared computing environment where CUDA_VISIBLE_DEVICES is used/changed frequently.

Anyway - after discussing more with @rlratzel, we decided to:

  1. Keep the current behaviour w.r.t. suffixes and CUDA_VISIBLE_DEVICES, but document it carefully.

  2. Provide a get_log_filenames API that returns a device-id-to-filename mapping. For now, the mapping would just be something like {1: "rmmlog.dev1.txt", 0: "rmmlog.dev0.txt", 2: "rmmlog.dev2.txt"}, where the keys are the "internal" device IDs. Users using logging in conjunction with CUDA_VISIBLE_DEVICES will need to do extra bookkeeping to map the internal device IDs back to physical IDs.

A mapping is returned instead of a list because (1) it gives us more flexibility as to how to name the output files, and (2) when initializing RMM, the user can specify devices in any arbitrary order (e.g., devices=[2, 0, 1]), and returning a list wouldn't clarify which log filename corresponds to which device.
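For illustration, a sketch of how the get_log_filenames proposal in (2) could be used (names follow the proposal above, not a guaranteed released interface):

import rmm

# Initialize RMM with logging on two devices; RMM appends a .dev<N> suffix per device.
rmm.reinitialize(devices=[0, 1], logging=True, log_file_name="/tmp/rmmlog.csv")

# Proposed: a mapping from "internal" (CUDA-visible) device IDs to the files
# actually written, e.g. {0: "/tmp/rmmlog.dev0.csv", 1: "/tmp/rmmlog.dev1.csv"}.
filenames = rmm.get_log_filenames()
print(filenames[0])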

@kkraus14 (Contributor)

Not a bug; documentation was updated in #722 to address this.
