
Manage device choice at a single place with set_device #272

Merged: 4 commits merged into NRCan:develop on Feb 15, 2022

Conversation

@remtav (Collaborator) commented Feb 9, 2022:

closes #269

- remove ruamel_yaml import from active scripts
- fix dontcare2background related to PR NRCan#256
- create set_device function: from a dictionary of available devices, sets the device to be used
- check if model can be pushed to device; otherwise catch the exception and retry with cuda, not cuda:0 (HPC bug)
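The device-selection behaviour described above can be sketched roughly as follows. This is a hypothetical reconstruction from the commit messages, not the PR's actual code: the `check` callback, the dict shape, and the string return type are all assumptions (the real function would return a `torch.device`).

```python
# Hypothetical sketch of the set_device logic described in this PR.
# It avoids importing torch so it stays runnable anywhere.

def set_device(gpu_devices_dict, check=None):
    """Pick a device string from a dict of available GPUs.

    gpu_devices_dict maps GPU ids to usage stats; an empty dict means
    no GPU is available. `check` stands in for pushing a small model to
    the device (e.g. models.resnet18().to(device)); on some HPC nodes
    'cuda:0' raises "invalid device ordinal", hence the 'cuda' fallback.
    """
    if not gpu_devices_dict:
        # No CUDA device available: this process runs on CPU.
        return "cpu"
    first_gpu = sorted(gpu_devices_dict)[0]
    device = f"cuda:{first_gpu}"
    if check is not None:
        try:
            check(device)
        except (RuntimeError, AssertionError):
            # HPC bug: device ordinal not addressable from this process.
            return "cuda"
    return device
```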
utils/logger.py Outdated
logger = logging.getLogger(name)
logger.setLevel(level)

# this ensures all logging levels get marked with the rank zero decorator
A collaborator commented:

cool!

@remtav (Collaborator, author) commented Feb 11, 2022:

this was @CharlesAuthier's addition though. I'm just moving it from utils.py to logger.py :)
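The "rank zero decorator" pattern in the excerpt above (in the spirit of PyTorch Lightning's `rank_zero_only`) can be sketched without any dependency. The `RANK` environment variable and the loop over level names are assumptions for illustration, not the repository's exact code:

```python
import logging
import os
from functools import wraps

def rank_zero_only(fn):
    """Call fn only on the rank-0 process (sketch of the Lightning-style idea)."""
    @wraps(fn)
    def wrapped(*args, **kwargs):
        if int(os.environ.get("RANK", "0")) == 0:
            return fn(*args, **kwargs)
    return wrapped

def get_logger(name, level=logging.INFO):
    logger = logging.getLogger(name)
    logger.setLevel(level)
    # this ensures all logging levels get marked with the rank zero decorator,
    # so only the rank-0 process emits log records in distributed runs
    for lvl in ("debug", "info", "warning", "error", "critical"):
        setattr(logger, lvl, rank_zero_only(getattr(logger, lvl)))
    return logger
```

In a multi-GPU run, every process still constructs the logger, but only rank 0 actually writes records, which keeps distributed logs from being duplicated once per worker.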

utils/utils.py Outdated
else:
logging.warning(f"\nNo Cuda device available. This process will only run on CPU")
device = torch.device('cpu')
try:
A collaborator commented:

I don't think this try statement is necessary.

utils/utils.py Outdated
device = torch.device('cpu')
try:
models.resnet18().to(device) # test with a small model
except (RuntimeError, AssertionError): # HPC: when device 0 not available. Error: Cuda invalid device ordinal.
A collaborator commented:

It seems that comment is no longer valid there; over the summer we solved the "Cuda invalid device ordinal" problem. We can just say # Error!

@remtav (Collaborator, author) commented:

Ok! If you say the problem is solved, I can simply remove the try/except altogether. Thanks!

# list of GPU devices that are available and unused. If no GPUs, returns empty dict
gpu_devices_dict = get_device_ids(num_devices, max_used_ram_perc=max_used_ram, max_used_perc=max_used_perc)
chunk_size = get_key_def('chunk_size', params['inference'], default=512, expected_type=int)
chunk_size = calc_inference_chunk_size(gpu_devices_dict=gpu_devices_dict, max_pix_per_mb_gpu=50, default=chunk_size)
@valhassan (Collaborator) commented Feb 11, 2022:

Just a note: calculating chunk size dynamically here, specifically at inference, may cause memory problems with test-time augmentation and smoothing. A default size of 512 is advised; 512 is padded to 512 * 2 during test-time routines.
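To make the memory concern concrete, here is the back-of-envelope arithmetic, under the rough assumption that memory use scales with pixel count per chunk:

```python
chunk_size = 512
padded = chunk_size * 2           # test-time routines pad each side to 512 * 2

pixels = chunk_size ** 2          # pixels per chunk before padding
padded_pixels = padded ** 2       # pixels per chunk after padding

# Doubling the side length quadruples the pixels, so test-time
# augmentation roughly quadruples per-chunk memory.
ratio = padded_pixels / pixels    # -> 4.0
```

An automatically inflated chunk size therefore leaves much less headroom than it appears to once the test-time padding kicks in.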

@remtav (Collaborator, author) commented:

Ok, but do you agree to keep the "automatic chunk_size" option? It's meant to be tunable, i.e. max_pix_per_mb_gpu can be lowered if the automatic setting busts memory.

A collaborator commented:

As discussed, we can lower max_pix_per_mb_gpu for now, keeping in mind that we could later optimize its definition during inference.

@remtav (Collaborator, author) commented:

True. I'll address this in PR #276
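One plausible shape for the automatic chunk-size heuristic discussed in this thread is sketched below. This is a guess at the idea only, not the repository's actual `calc_inference_chunk_size`: the `free_mb` key, the square-root scaling, and the rounding to a multiple of 256 are all assumptions.

```python
import math

def calc_inference_chunk_size(gpu_devices_dict, max_pix_per_mb_gpu=50, default=512):
    """Scale chunk size with free GPU memory (hypothetical sketch).

    gpu_devices_dict is assumed to map GPU ids to stats that include
    free memory in MB; with no GPUs, fall back to the default.
    """
    if not gpu_devices_dict:
        return default
    free_mb = min(stats["free_mb"] for stats in gpu_devices_dict.values())
    # Total pixels the GPU can hold, then the side length of a square
    # chunk, rounded down to a multiple of 256 for tiling convenience.
    side = int(math.sqrt(free_mb * max_pix_per_mb_gpu))
    return max(256, side - side % 256)
```

Under this sketch, lowering `max_pix_per_mb_gpu` directly shrinks the computed chunk, which is the tuning knob the discussion above refers to.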

@remtav merged commit 685f960 into NRCan:develop on Feb 15, 2022
@remtav deleted the 269-set_device branch on February 16, 2022
remtav added a commit to remtav/geo-deep-learning that referenced this pull request Jul 5, 2022
* - remove unused functions
- remove ruamel_yaml import from active scripts
- fix dontcare2background related to PR NRCan#256

* - create set_device function: from dictionary of available devices, sets the device to be used
- check if model can be pushed to device, else catch exception and try with cuda, not cuda:0 (HPC bug)

* remove try/except statement for old HPC bug (device ordinal error)
Successfully merging this pull request may close these issues.

Refactor: manage device selection and error-handling (HPC) with set_device function