Manage device choice at a single place with set_device #272
Conversation
- remove ruamel_yaml import from active scripts
- fix dontcare2background related to PR NRCan#256
- create set_device function: from dictionary of available devices, sets the device to be used
- check if model can be pushed to device, else catch exception and try with cuda, not cuda:0 (HPC bug)
utils/logger.py (outdated)

    logger = logging.getLogger(name)
    logger.setLevel(level)

    # this ensures all logging levels get marked with the rank zero decorator
cool!
this was @CharlesAuthier's addition though. I'm just moving it from utils.py to logger.py :)
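For readers following the thread, the pattern being moved (wrapping every logging level with a rank-zero decorator) could look something like the stdlib-only sketch below. The decorator body and the environment-variable lookup are assumptions for illustration, not code copied from the repo (projects often use PyTorch Lightning's `rank_zero_only` for this):

```python
import logging
import os
from functools import wraps


def rank_zero_only(fn):
    """Run fn only on the rank-0 process (illustrative; in distributed
    setups the rank is commonly exposed via RANK/LOCAL_RANK env vars)."""
    @wraps(fn)
    def wrapped(*args, **kwargs):
        rank = int(os.environ.get("RANK", os.environ.get("LOCAL_RANK", 0)))
        if rank == 0:
            return fn(*args, **kwargs)
        return None  # silently skip on non-zero ranks
    return wrapped


def get_logger(name=__name__, level=logging.INFO):
    """Hypothetical sketch of the utils/logger.py helper discussed above."""
    logger = logging.getLogger(name)
    logger.setLevel(level)
    # this ensures all logging levels get marked with the rank zero decorator
    for lvl in ("debug", "info", "warning", "error", "critical"):
        setattr(logger, lvl, rank_zero_only(getattr(logger, lvl)))
    return logger
```

With this wrapping in place, a multi-process run emits each log message once instead of once per worker.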
utils/utils.py (outdated)

    else:
        logging.warning(f"\nNo Cuda device available. This process will only run on CPU")
        device = torch.device('cpu')
    try:
I don't think this try statement is necessary.
utils/utils.py (outdated)

    device = torch.device('cpu')
    try:
        models.resnet18().to(device)  # test with a small model
    except (RuntimeError, AssertionError):  # HPC: when device 0 not available. Error: Cuda invalid device ordinal.
It seems that comment is no longer valid; over the summer we solved the "Cuda invalid device ordinal" problem. We can just say # Error!
Ok! If you say the problem is solved, I can simply remove the try/except altogether. Thanks!
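To make the device-selection logic under discussion concrete, here is a hedged, torch-free sketch of what a `set_device` helper could look like once the try/except is dropped. The function body is an assumption for illustration, not the actual utils/utils.py implementation; it only assumes, as the diff above does, that `get_device_ids` returns a dict of available, unused GPU ids (empty when no GPU is usable):

```python
def set_device(gpu_devices_dict):
    """Pick a device string from a dict of available/unused GPUs
    (illustrative sketch; the real helper would pass the result to
    torch.device). Keeping the id explicit avoids assuming 'cuda:0'
    on HPC nodes where device 0 may be busy."""
    if not gpu_devices_dict:
        # No Cuda device available: fall back to CPU.
        return "cpu"
    # Use the first available GPU id, e.g. 'cuda:1' when device 0 is taken.
    first_id = sorted(gpu_devices_dict.keys())[0]
    return f"cuda:{first_id}"
```

In the actual code the returned string would be wrapped with `torch.device(...)` before pushing the model to it.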
    # list of GPU devices that are available and unused. If no GPUs, returns empty dict
    gpu_devices_dict = get_device_ids(num_devices, max_used_ram_perc=max_used_ram, max_used_perc=max_used_perc)
    chunk_size = get_key_def('chunk_size', params['inference'], default=512, expected_type=int)
    chunk_size = calc_inference_chunk_size(gpu_devices_dict=gpu_devices_dict, max_pix_per_mb_gpu=50, default=chunk_size)
Just a note: calculating chunk size dynamically here, specifically at inference, may cause memory problems with test-time augmentation and smoothing. A default size of 512 is advised, since 512 is padded to 512 * 2 during test-time routines.
Ok, do you agree to keep the "automatic chunk_size" option? It's meant to be tunable, i.e. max_pix_per_mb_gpu can be lowered if the automatic way busts memory.
As discussed, we can lower max_pix_per_mb_gpu for now, keeping in mind that we could later optimize its definition during inference.
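To make the tunability discussed here concrete, below is a hypothetical sketch of how a helper like `calc_inference_chunk_size` might derive a chunk size from free GPU memory and `max_pix_per_mb_gpu`. The MB units, the square-root formula, and the rounding to multiples of 256 are all illustrative assumptions, not the repo's actual code:

```python
import math


def calc_inference_chunk_size(gpu_devices_dict, max_pix_per_mb_gpu=50, default=512):
    """Illustrative sketch: scale the inference chunk size with free GPU
    memory. Assumes gpu_devices_dict maps device id -> free memory in MB."""
    if not gpu_devices_dict:
        # No GPUs: keep the configured/default chunk size.
        return default
    # Size against the most memory-constrained GPU.
    smallest_free_mb = min(gpu_devices_dict.values())
    # Total pixels that GPU is assumed to fit; lowering max_pix_per_mb_gpu
    # shrinks the chunk if the automatic size busts memory.
    max_pix = smallest_free_mb * max_pix_per_mb_gpu
    # Edge of a square chunk, rounded down to a multiple of 256
    # (rounding choice is illustrative, not from the PR).
    chunk = int(math.sqrt(max_pix)) // 256 * 256
    # Never go below the advised default.
    return max(chunk, default)
```

Note that test-time augmentation and smoothing pad the chunk further (512 becomes 512 * 2), so any formula here should leave headroom rather than fill the GPU.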
True. I'll address this in PR #276
# Conflicts:
#	utils/data_analysis.py
#	utils/utils.py
* remove unused functions; remove ruamel_yaml import from active scripts; fix dontcare2background related to PR NRCan#256
* create set_device function: from dictionary of available devices, sets the device to be used; check if model can be pushed to device, else catch exception and try with cuda, not cuda:0 (HPC bug)
* remove try/except statement for old HPC bug (device ordinal error)
closes #269