
[BUG] RMM log file name contains .dev0 extension when GPU device used is not 0 #721

Closed
rlratzel opened this issue Mar 9, 2021 · 20 comments
Labels: ? - Needs Triage (Need team to review and classify), bug (Something isn't working)

@rlratzel commented Mar 9, 2021

When rmm.enable_logging() is passed a log_file_name, a .dev0 extension is always used, even when the device in use is not device 0.

This reproducer demonstrates the issue:

import cudf
import rmm
from rmm._cuda.gpu import getDevice

import glob
import os
import time

# Enable logging with an explicit log file name; RMM derives the actual
# on-disk file name from this path.
rmm.enable_logging(log_file_name="/tmp/rmmlog.csv")
s = cudf.Series([1])  # trigger a device allocation so something is logged

print(f'CUDA_VISIBLE_DEVICES={os.environ.get("CUDA_VISIBLE_DEVICES")}')
print(f"getDevice() returned: {getDevice()}")
print(f'RMM logs present on disk: {glob.glob("/tmp/rmmlog.*")}')
print("sleeping 10 seconds to check nvidia-smi on the host...")
time.sleep(10)

rmm.mr._flush_logs()
rmm.disable_logging()

output:

$> python /tmp/repro.py
CUDA_VISIBLE_DEVICES=1
getDevice() returned: 0
RMM logs present on disk: ['/tmp/rmmlog.dev0.csv']
sleeping 10 seconds to check nvidia-smi on the host...

nvidia-smi output:

...
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1594      G   /usr/lib/xorg/Xorg                  4MiB |
|    1   N/A  N/A      1594      G   /usr/lib/xorg/Xorg                 64MiB |
|    1   N/A  N/A      1673      G   /usr/bin/gnome-shell               65MiB |
|    1   N/A  N/A     17629      C   python                            609MiB |
+-----------------------------------------------------------------------------+

The process in question here is 17629 using GPU 1.

The expected behavior is to create a logfile with an extension matching the GPU in use, which in the example above would be: /tmp/rmmlog.dev1.csv

NOTE: Just in case this is related, this demo was run in a container (hence the need to check nvidia-smi on the host) with multiple GPUs exposed from the host machine. Setting CUDA_VISIBLE_DEVICES=0 in the container shows the process running on GPU 0 on the host as expected.

rlratzel added the "? - Needs Triage" and "bug" labels on Mar 9, 2021
rlratzel changed the title from "[BUG] RMM log file name sets .dev0 extension when GPU device used is not 0" to "[BUG] RMM log file name contains .dev0 extension when GPU device used is not 0" on Mar 9, 2021
@jrhemstad (Contributor)

CUDA_VISIBLE_DEVICES=1

This makes what would otherwise be device 1 appear as device 0, so what you've described is expected behavior.
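For illustration, a minimal sketch of that remapping (the helper is hypothetical, not an RMM or CUDA API, and assumes CUDA_VISIBLE_DEVICES lists plain integer IDs):

import os

def visible_to_physical(visible_id):
    """Map a CUDA-visible device index back to the physical GPU index."""
    env = os.environ.get("CUDA_VISIBLE_DEVICES")
    if not env:
        return visible_id  # no remapping in effect
    physical_ids = [int(tok) for tok in env.split(",") if tok.strip()]
    return physical_ids[visible_id]

# With CUDA_VISIBLE_DEVICES=1, getDevice() reports 0, but the physical GPU is 1:
print(visible_to_physical(0))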

@rlratzel (Author) commented Mar 9, 2021

CUDA_VISIBLE_DEVICES=1

This makes what would otherwise be device 1 appear as device 0, so what you've described is expected behavior.

Ah, okay, thanks @jrhemstad. That explains what I was about to add to the issue: the contents of the log file are correct even though the log file extension seemed wrong.

Using my example above, is it possible to have RMM create the extension as .dev1 instead, or is there a reason the log file is better off named .dev0 in this case? (cc @shwina). This isn't necessarily a showstopper, since we can form a unique log name ourselves in the call to enable_logging() and just ignore the .devn extension when looking for it afterwards.

@jrhemstad (Contributor)

I honestly don't know where the .dev0 extension is coming from. That must be getting set at the Python level.

Normally you could override the log file by setting RMM_LOG_FILE, but since it is getting passed explicitly, the envvar won't override it.
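As a hedged sketch of the fallback described above (exact behavior may vary by RMM version), enable_logging() picks up RMM_LOG_FILE only when no explicit name is passed:

import os
import rmm

# When no log_file_name is passed, RMM falls back to the RMM_LOG_FILE
# environment variable; an explicit log_file_name takes precedence over it.
os.environ["RMM_LOG_FILE"] = "/tmp/rmmlog.csv"
rmm.enable_logging()   # uses RMM_LOG_FILE
rmm.disable_logging()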

@jakirkham (Member)

This line seems relevant

return f"{name}.dev{id}{ext}"
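For context, a rough reconstruction of what that helper plausibly does (only the quoted return statement is from RMM; the function name and surrounding code are assumptions):

import os

def _append_id(filename, id):
    # Split e.g. "/tmp/rmmlog.csv" into ("/tmp/rmmlog", ".csv") and insert a
    # per-device suffix, so device 0 yields "/tmp/rmmlog.dev0.csv".
    name, ext = os.path.splitext(filename)
    return f"{name}.dev{id}{ext}"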

@shwina (Contributor) commented Mar 9, 2021

@jrhemstad, one question re: your comment: if I set CUDA_VISIBLE_DEVICES=1,0 and then call cudaSetDevice(0), is it actually device 1 being set as the current device?

@jrhemstad (Contributor)

@jrhemstad, one question re: your comment: if I set CUDA_VISIBLE_DEVICES=1,0 and then call cudaSetDevice(0), is it actually device 1 being set as the current device?

Yes.

@jakirkham (Member)

Yeah, Dask-CUDA actually uses this same trick.

@shwina (Contributor) commented Mar 9, 2021

To leave no room for ambiguity, @kkraus14 suggested we don't use device IDs 0...n in the log file names, but rather the device UUIDs -- would that work for you @rlratzel?

@jakirkham (Member) commented Mar 9, 2021

I wonder if users could specify where this info is injected. In other words, if a user provides the filename "rmm_log_{id}.txt", could we have {id} replaced with the UUID?

@shwina (Contributor) commented Mar 9, 2021

@jakirkham - I like that suggestion; what should we do if they don't specify that though? e.g., if they specify filename="rmm_log.txt"?

@jakirkham (Member) commented Mar 9, 2021

Raise a ValueError and complain that it is not present. Alternatively, we could fall back to the current behavior.
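A minimal sketch of that option (the names here are hypothetical, not an RMM API): reject a template that lacks the placeholder, otherwise substitute it.

def expand_log_template(template, device_uuid):
    # Hypothetical: substitute the user's "{id}" placeholder with the device UUID.
    if "{id}" not in template:
        raise ValueError("log file name must contain an '{id}' placeholder")
    return template.replace("{id}", device_uuid)

# expand_log_template("rmm_log_{id}.txt", "GPU-abc123") -> "rmm_log_GPU-abc123.txt"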

@rlratzel (Author) commented Mar 9, 2021

To leave no room for ambiguity, @kkraus14 suggested we don't use device IDs 0...n in the log file names, but rather the device UUIDs -- would that work for you @rlratzel?

I ultimately need to know the final filename to look for. I know the GPU ID because the user told my application about it, so is there a way to map the GPU ID the user is aware of and using (i.e., the one they may have set in CUDA_VISIBLE_DEVICES) to the UUID you would use to generate the log file? If so, then that would work fine.

...user provides the filename "rmm_log_{id}.txt"

FYI, I personally don't need that level of customization, and I might just prefer a documented naming behavior such as what's in place now. If this is better for others, though, then I'm in favor.
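For the GPU-ID-to-UUID mapping asked about above, a hedged sketch using NVML (requires the nvidia-ml-py package; note that NVML enumerates physical devices and ignores CUDA_VISIBLE_DEVICES):

import pynvml

def physical_id_to_uuid(physical_id):
    """Look up the UUID of a physical GPU by its nvidia-smi index."""
    pynvml.nvmlInit()
    try:
        handle = pynvml.nvmlDeviceGetHandleByIndex(physical_id)
        uuid = pynvml.nvmlDeviceGetUUID(handle)
        return uuid.decode() if isinstance(uuid, bytes) else uuid
    finally:
        pynvml.nvmlShutdown()

# physical_id_to_uuid(1) -> e.g. "GPU-..." for the GPU shown as index 1 by nvidia-smi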

@rlratzel (Author) commented Mar 9, 2021

Another option that would work for our application would be to just have rmm.enable_logging() return the final log file name you generate, freeing you up to generate it any way you like.

@shwina (Contributor) commented Mar 9, 2021

On second thoughts, maybe using a UUID just makes things less user-friendly?

From an end-user standpoint, device ID 0 should refer unambiguously to the 0th physical GPU (i.e., the one at the top of the nvidia-smi output), even though internally 0 can actually mean something else depending on CUDA_VISIBLE_DEVICES.

@harrism (Member) commented Mar 9, 2021

I think that depends. Some users who set CUDA_VISIBLE_DEVICES will know and expect the first GPU they set in that list to come out as device 0. Others might expect 0 to be the first physical GPU.

Perhaps we need to log information about the GPU into the header of the log file so that the reader can unambiguously interpret it.

@shwina (Contributor) commented Mar 9, 2021

I think that depends. Some users who set CUDA_VISIBLE_DEVICES will know and expect the first GPU they set in that list to come out as device 0. Others might expect 0 to be the first physical GPU.

I see your point. But environment variables like CUDA_VISIBLE_DEVICES are ephemeral, while file names are "forever". Looking at a file rmm_log.dev0.txt 1 year from now, it would be hard to say which physical GPU it corresponds to.

Perhaps we need to log information about the GPU into the header of the log file so that the reader can unambiguously interpret it.

+1

@jakirkham (Member)

cc @charlesbluca (in case you have thoughts here after having implemented RMM logging support in Dask-CUDA 🙂)

@harrism (Member) commented Mar 9, 2021

Looking at a file rmm_log.dev0.txt 1 year from now, it would be hard to say which physical GPU it corresponds to.

Looking at that file a year later, you aren't likely to have access to the same machine or configuration anyway. I think you really have to log the configuration with the log files if you want to be able to reconstruct it.

@shwina (Contributor) commented Mar 10, 2021

OK maybe 1 year was an exaggeration - apologies :) But even a week from now, it can be difficult -- especially if you're in some sort of shared computing environment where CUDA_VISIBLE_DEVICES is used/changed frequently.

Anyway - after discussing more with @rlratzel, we decided to:

  1. Keep the current behaviour w.r.t. suffixes and CUDA_VISIBLE_DEVICES, but document it carefully.

  2. Provide a get_log_filenames API that returns a device-id-to-filename mapping. For now, the mapping would just be something like {1: "rmmlog.dev1.txt", 0: "rmmlog.dev0.txt", 2: "rmmlog.dev2.txt"}, where the keys are the "internal" device IDs. Users using logging in conjunction with CUDA_VISIBLE_DEVICES will need to do extra bookkeeping to map the internal device IDs back to physical IDs.

A mapping is returned instead of a list because (1) it gives us more flexibility as to how to name the output files, and (2) when initializing RMM, the user can specify devices in any arbitrary order (e.g., devices=[2, 0, 1]), and returning a list wouldn't clarify which log filename corresponds to which device.
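For illustration, a sketch of how the get_log_filenames proposal in (2) could be used (names follow the proposal above, not a guaranteed released interface):

import rmm

# Initialize RMM with logging on two devices; RMM appends a .dev<N> suffix per device.
rmm.reinitialize(devices=[0, 1], logging=True, log_file_name="/tmp/rmmlog.csv")

# Proposed: a mapping from "internal" (CUDA-visible) device IDs to the files
# actually written, e.g. {0: "/tmp/rmmlog.dev0.csv", 1: "/tmp/rmmlog.dev1.csv"}.
filenames = rmm.get_log_filenames()
print(filenames[0])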

@kkraus14 (Contributor)

Not a bug; documentation was updated in #722 to address this.
