
Don't output nvidia-smi failure in automated platform search #693

Closed
IAlibay opened this issue Apr 27, 2023 · 5 comments · Fixed by #699
@IAlibay
Contributor

IAlibay commented Apr 27, 2023

We've had folks a bit confused about this message showing up when they don't have a GPU device:

/bin/sh: nvidia-smi: command not found

Is there a way to avoid outputting this to users?

@ijpulidos
Contributor

Ok, I think I understand the confusion, even though this lives at the DEBUG level of the logger. It shouldn't be hard to filter that out or avoid printing it, unless we have deeper issues with the logging.

At the time this was implemented, we wanted to make sure we had some place in the log that reports the GPUs found in the system, for debugging purposes. We were having cases where the simulation was falling back to run on CPU because there was some problem accessing the GPU/CUDA devices, so we just wanted to make sure the devices were available/found. On the other hand, I think we could try getting the exit code of that call to nvidia-smi and, if the command isn't found, not output anything. Would that be better? (We would still output other errors/messages from it.)
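Something along those lines could look like the following sketch (not the actual implementation; the helper name _query_cuda_devices and the switch from os.popen to subprocess are assumptions):

import logging
import subprocess

logger = logging.getLogger(__name__)

def _query_cuda_devices():
    """Hypothetical helper: run nvidia-smi and return its stdout, or None on failure."""
    try:
        proc = subprocess.run(
            ["nvidia-smi", "--query-gpu=index,gpu_name", "--format=csv,noheader"],
            capture_output=True, text=True,
        )
    except FileNotFoundError:
        # nvidia-smi is not installed; stay silent instead of printing an error.
        return None
    if proc.returncode != 0:
        # nvidia-smi exists but failed (e.g. driver problem); keep that in the debug log.
        logger.debug("nvidia-smi failed with exit code %d: %s", proc.returncode, proc.stderr.strip())
        return None
    return proc.stdout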

@mikemhenry
Contributor

Yes, I think checking the error code is the way to go. For reference, here is the current implementation:

def _display_cuda_devices():
    """Query system nvidia-smi to get available GPU indices and names in debug log."""
    # Read nvidia-smi query, should return empty string if no GPU is found.
    cuda_query_output = os.popen("nvidia-smi --query-gpu=index,gpu_name --format=csv,noheader").read().strip()
    # Split by line jump and comma
    cuda_devices_list = [entry.split(',') for entry in cuda_query_output.split('\n')]
    logger.debug(f"CUDA devices available: {*cuda_devices_list,}")
If it fails, we can output something like "No GPU detected", since that is clearer about what is going on.
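A minimal sketch of what that could look like, assuming a shutil.which pre-check (the exact check is an assumption, not the eventual fix):

import logging
import os
import shutil

logger = logging.getLogger(__name__)

def _display_cuda_devices():
    """Sketch: report available CUDA devices, or a clear message when none are found."""
    # If nvidia-smi is not on PATH there is nothing to query, and calling it
    # through the shell would print "command not found" to the terminal.
    if shutil.which("nvidia-smi") is None:
        logger.debug("No GPU detected (nvidia-smi not available)")
        return
    cuda_query_output = os.popen(
        "nvidia-smi --query-gpu=index,gpu_name --format=csv,noheader"
    ).read().strip()
    if not cuda_query_output:
        logger.debug("No GPU detected")
        return
    cuda_devices_list = [entry.split(',') for entry in cuda_query_output.split('\n')]
    logger.debug(f"CUDA devices available: {*cuda_devices_list,}")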

@mikemhenry
Contributor

Actually, all we need to do is capture the output of the os.popen call and not have it dump to stderr.
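For context, os.popen only captures the command's stdout; the "command not found" message is written by the shell to stderr, which is why it still reaches the terminal. With subprocess both streams can be captured (illustrative sketch, not the eventual patch):

import subprocess

# capture_output=True redirects both stdout and stderr into the CompletedProcess,
# so nothing leaks to the user's terminal even when the command is missing.
proc = subprocess.run(
    "nvidia-smi --query-gpu=index,gpu_name --format=csv,noheader",
    shell=True, capture_output=True, text=True,
)
print(proc.returncode)    # e.g. 127 when the shell cannot find nvidia-smi
print(repr(proc.stderr))  # e.g. '/bin/sh: nvidia-smi: command not found\n'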

@mikemhenry mikemhenry self-assigned this Apr 27, 2023
@mikemhenry
Contributor

I can fix this

@mikemhenry
Contributor

@ijpulidos Why do we split by comma (# Split by line jump and comma)? I am using a slightly different method to invoke the subprocess call to nvidia-smi, and the captured output looks like '0, NVIDIA GeForce RTX 3060 Laptop GPU\n'. I was thinking we would want to save it as 0, NVIDIA GeForce RTX 3060 Laptop GPU (in a list) so that users know the GPU index. Thoughts?
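For example, with the output above, the two approaches give (illustrative only):

output = "0, NVIDIA GeForce RTX 3060 Laptop GPU\n"

# Current approach: split into lines, then split each line on the comma.
nested = [entry.split(',') for entry in output.strip().split('\n')]
# -> [['0', ' NVIDIA GeForce RTX 3060 Laptop GPU']]

# Alternative: keep each CSV line intact so users see "index, name" together.
whole_lines = [line.strip() for line in output.strip().splitlines()]
# -> ['0, NVIDIA GeForce RTX 3060 Laptop GPU']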
