Dockerfile for GPU container. Fix for installing GPU version of MXNet #403

Merged: 8 commits, Oct 21, 2019

strawberrypie
Contributor

Description of changes:

  • fixes a bug in setup.py: the logic for installing the GPU version of MXNet did not work with the current version of the requirements
  • adds Dockerfile.gpu for building a GPU-enabled Docker image

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

Contributor

@jaheba jaheba left a comment

Awesome, thanks!

Do you have any experience running gluon-ts on GPU instances?

setup.py Outdated
Comment on lines 52 to 57
re.subn(
    pattern=mxnet_old,
    repl=mxnet_new,
    string=line.rstrip(),
    count=1,
)[0]
Contributor

How does re.subn help over just str.replace here? We should maybe think about making this more robust.

But more importantly, I don't really like what we are doing here. Are there always compatible releases of both mxnet and mxnet-cu92mkl? And should we be more explicit about which version we install? However, we should probably discuss this in another issue.

Contributor Author

Well, I was using the substitution with a regex like mxnet[><=]?=, but it looks like a simple substitution is enough. I will change it to str.replace.
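
For illustration, here is a minimal sketch of the two approaches discussed in this thread. The names `mxnet_old` and `mxnet_new` come from the reviewed snippet, but their values, the helper function names, and the sample requirement line are assumptions made for this example, not the actual contents of setup.py.

```python
import re

# Assumed values for illustration only; the real setup.py may differ.
mxnet_old = "mxnet"
mxnet_new = "mxnet-cu92mkl"


def replace_with_regex(line: str) -> str:
    # re.subn returns (new_string, number_of_substitutions); [0] keeps the string.
    return re.subn(
        pattern=mxnet_old,
        repl=mxnet_new,
        string=line.rstrip(),
        count=1,
    )[0]


def replace_with_str(line: str) -> str:
    # Plain substring replacement, limited to the first occurrence.
    return line.rstrip().replace(mxnet_old, mxnet_new, 1)


# Hypothetical requirement line, not taken from the project.
sample = "mxnet>=1.0\n"
print(replace_with_regex(sample))
print(replace_with_str(sample))
```

When the pattern is a literal string without regex metacharacters, both functions give the same result, so `str.replace` is the simpler choice. Neither variant checks that the match is the whole package name, which is one way the substitution could be made more robust.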

Regarding the choice of mxnet vs mxnet-cu92mkl: I think we should do the same thing as the MXNet releases and provide separate MXNet, MXNet + CUDA, and MXNet + MKL versions and Docker images.
Personally, I would prefer a GPU version that seamlessly switches to CPU if no GPUs are found, which is how PyTorch works by default. @jaheba do you know if MXNet can work the same way? Currently, the GPU image fails on my machine, which has no NVIDIA GPU (MacBook Pro), and throws the following error:

Traceback (most recent call last):
  File "/usr/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.7/dist-packages/gluonts/shell/__main__.py", line 27, in <module>
    from gluonts.model.estimator import Estimator
  File "/usr/local/lib/python3.7/dist-packages/gluonts/model/estimator.py", line 19, in <module>
    from mxnet.gluon import HybridBlock
  File "/usr/local/lib/python3.7/dist-packages/mxnet/__init__.py", line 24, in <module>
    from .context import Context, current_context, cpu, gpu, cpu_pinned
  File "/usr/local/lib/python3.7/dist-packages/mxnet/context.py", line 24, in <module>
    from .base import classproperty, with_metaclass, _MXClassPropertyMetaClass
  File "/usr/local/lib/python3.7/dist-packages/mxnet/base.py", line 213, in <module>
    _LIB = _load_lib()
  File "/usr/local/lib/python3.7/dist-packages/mxnet/base.py", line 204, in _load_lib
    lib = ctypes.CDLL(lib_path[0], ctypes.RTLD_LOCAL)
  File "/usr/lib/python3.7/ctypes/__init__.py", line 356, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: libcuda.so.1: cannot open shared object file: No such file or directory
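
The traceback shows the import failing because the CUDA driver library `libcuda.so.1` cannot be loaded. As a rough sketch only, one could probe for that library before deciding which MXNet package to use. The helper names `cuda_driver_available` and `pick_mxnet_requirement` are hypothetical and not part of gluon-ts or MXNet, and this check does not make an installed GPU build fall back to CPU at runtime.

```python
import ctypes


def cuda_driver_available(lib_name: str = "libcuda.so.1") -> bool:
    """Return True if the CUDA driver library can be loaded on this machine."""
    try:
        # Same kind of call that fails in the traceback above.
        ctypes.CDLL(lib_name)
    except OSError:
        return False
    return True


def pick_mxnet_requirement(has_cuda: bool) -> str:
    # Package names as discussed in this thread; version pins are left out on purpose.
    return "mxnet-cu92mkl" if has_cuda else "mxnet"


if __name__ == "__main__":
    print(pick_mxnet_requirement(cuda_driver_available()))
```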

Contributor

> Personally, I would prefer a GPU version that seamlessly switches to CPU if no GPUs are found, which is how PyTorch works by default. @jaheba do you know if MXNet can work the same way? Currently, the GPU image fails on my machine, which has no NVIDIA GPU (MacBook Pro), and throws the following error:

Yes, I agree. Maybe @szha can help us out here.

@strawberrypie
Contributor Author

I've just started to experiment with GPU instances. I'm using DeepAR in my project, and with this change p2.xlarge is just a bit faster than c5.4xlarge. SageMaker shows that only 20% of the GPU is used. @jaheba can you suggest something to speed it up? Or should I create an issue to investigate GPU usage?

Contributor

@jaheba jaheba left a comment

Sorry, should have mentioned these before.

But otherwise looks really good to me 👍

strawberrypie and others added 2 commits October 21, 2019 15:54
Co-Authored-By: Jasper Schulz <jasper.b.schulz@googlemail.com>
Co-Authored-By: Jasper Schulz <jasper.b.schulz@googlemail.com>
@jaheba
Contributor

jaheba commented Oct 21, 2019

> I've just started to experiment with GPU instances. I'm using DeepAR in my project, and with this change p2.xlarge is just a bit faster than c5.4xlarge. SageMaker shows that only 20% of the GPU is used. @jaheba can you suggest something to speed it up? Or should I create an issue to investigate GPU usage?

We've also seen no real performance benefit from using GPUs with SageMaker DeepAR. There might be some performance increase when large batch sizes are used.

However, other models (e.g. Wavenet) should benefit much more from using GPUs than DeepAR does.

/cc @vafl

Contributor

@jaheba jaheba left a comment

Thanks, again!

@jaheba jaheba merged commit a894aee into awslabs:master Oct 21, 2019
@jaheba
Contributor

jaheba commented Oct 21, 2019

@strawberrypie oh, I think having a dedicated issue regarding GPUs would be great, thanks.

FadhelA pushed a commit to FadhelA/gluon-ts that referenced this pull request Nov 29, 2019
…awslabs#403)

* Dockerfile for GPU container. Fix for installing GPU version of MXNet

* Typo fix. Replacing requirement without regex.