Install Failure on GCP Deep Learning VM #259

glenn-jocher · 2019-04-17T10:35:25Z

I created a simple GCP Deep Learning VM:
https://cloud.google.com/deep-learning-vm/

I followed the install directions, and the install failed with errors:

git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" .

...
Command "/opt/anaconda3/bin/python -u -c "import setuptools, tokenize;__file__='/tmp/pip-req-build-j0qgf5ds/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, 
__file__, 'exec'))" --cpp_ext --cuda_ext install --record /tmp/pip-record-1yr2fag5/install-record.txt --single-version-externally-managed --compile" failed with error code 1 in /tmp/pip-req-build-j0qgf5ds/
Exception information:
Traceback (most recent call last):
  File "/opt/anaconda3/lib/python3.7/site-packages/pip/_internal/cli/base_command.py", line 143, in main
    status = self.run(options, args)
  File "/opt/anaconda3/lib/python3.7/site-packages/pip/_internal/commands/install.py", line 366, in run
    use_user_site=options.use_user_site,
  File "/opt/anaconda3/lib/python3.7/site-packages/pip/_internal/req/__init__.py", line 49, in install_given_reqs
    **kwargs
  File "/opt/anaconda3/lib/python3.7/site-packages/pip/_internal/req/req_install.py", line 791, in install
    spinner=spinner,
  File "/opt/anaconda3/lib/python3.7/site-packages/pip/_internal/utils/misc.py", line 705, in call_subprocess
    % (command_desc, proc.returncode, cwd))

The Python-only option also failed:

pip install -v --no-cache-dir .

...
Command "/opt/anaconda3/bin/python -u -c "import setuptools, tokenize;__file__='/tmp/pip-req-build-eedemek6/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, 
__file__, 'exec'))" install --record /tmp/pip-record-ehl5a4y7/install-record.txt --single-version-externally-managed --compile" failed with error code 1 in /tmp/pip-req-build-eedemek6/
Exception information:
Traceback (most recent call last):
  File "/opt/anaconda3/lib/python3.7/site-packages/pip/_internal/cli/base_command.py", line 143, in main
    status = self.run(options, args)
  File "/opt/anaconda3/lib/python3.7/site-packages/pip/_internal/commands/install.py", line 366, in run
    use_user_site=options.use_user_site,
  File "/opt/anaconda3/lib/python3.7/site-packages/pip/_internal/req/__init__.py", line 49, in install_given_reqs
    **kwargs
  File "/opt/anaconda3/lib/python3.7/site-packages/pip/_internal/req/req_install.py", line 791, in install
    spinner=spinner,
  File "/opt/anaconda3/lib/python3.7/site-packages/pip/_internal/utils/misc.py", line 705, in call_subprocess
    % (command_desc, proc.returncode, cwd))

It would seem like installation on a GCP Deep Learning VM would be one of the tested use cases here no?? If it doesn't work there of all places, where is it intended to work?

mcarilli · 2019-04-17T21:26:03Z

I'm not sure if this issue is specific to apex. I think you need to make sure your instance has python-dev:
google/python-subprocess32#38
See also
https://medium.com/giscle/setting-up-a-google-cloud-instance-for-deep-learning-d182256cb894
(scroll down to "Installing Tensorflow," which is not directly relevant, but does also say to sudo apt-get install python3-pip python3-dev).

Also, I don't think this issue is related to cpp extension building in particular. I think if the suggested fix resolves your issue for the Python-only build, the cpp and cuda extension build is definitely worth another try.

glenn-jocher · 2019-04-17T22:50:58Z

@mcarilli ah, thanks for the reply! I tried what you said, but they seem to be already installed. For completeness I included all the header information from the VM when it starts up below. These VMs come with PyTorch (and almost everything else) preinstalled. We use them in our GCP Quickstart Guide on our YOLOv3 repo:
https://github.com/ultralytics/yolov3/wiki/GCP-Quickstart

Version: m23
Based on: Debian GNU/Linux 9.8 (stretch) (GNU/Linux 4.9.0-8-amd64 x86_64\n)
Resources:
 * Google Deep Learning Platform StackOverflow: https://stackoverflow.com/questions/tagged/google-dl-platform
 * Google Cloud Documentation: https://cloud.google.com/deep-learning-vm
 * Google Group: https://groups.google.com/forum/#!forum/google-dl-platform

To reinstall Nvidia driver (if needed) run:
sudo /opt/deeplearning/install-driver.sh

This image uses python 3.7 from the Anaconda. Anaconda is installed to:
/opt/anaconda3/

Linux instance-2 4.9.0-8-amd64 #1 SMP Debian 4.9.130-2 (2018-10-27) x86_64
The programs included with the Debian GNU/Linux system are free software;
the exact distribution terms for each program are described in the
individual files in /usr/share/doc/*/copyright.
Debian GNU/Linux comes with ABSOLUTELY NO WARRANTY, to the extent
permitted by applicable law.

ultralytics@instance-2:~$ sudo apt-get install python3-pip python3-dev
Reading package lists... Done
Building dependency tree       
Reading state information... Done
python3-pip is already the newest version (9.0.1-2).
python3-dev is already the newest version (3.5.3-1).
0 upgraded, 0 newly installed, 0 to remove and 12 not upgraded.

mcarilli · 2019-04-17T23:23:16Z

Following https://medium.com/giscle/setting-up-a-google-cloud-instance-for-deep-learning-d182256cb894, maybe the solution is as simple as using pip3 install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" . instead of pip install ...

Before February, Apex served primarily as a research/internal toolkit. The growth in popularity has been a (pleasant) surprise. I only recently made it my full-time project, and I haven't actually tested the install on Google Cloud before, so this is valuable information.

glenn-jocher · 2019-04-18T14:58:11Z

@mcarilli thanks, the change worked. The line I used to successfully install is:

pip3 install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" .

Unfortunately after install the apex module can be found, but not amp:

...
  running install_egg_info
    running egg_info
    creating apex.egg-info
    writing apex.egg-info/PKG-INFO
    writing top-level names to apex.egg-info/top_level.txt
    writing dependency_links to apex.egg-info/dependency_links.txt
    writing manifest file 'apex.egg-info/SOURCES.txt'
    reading manifest file 'apex.egg-info/SOURCES.txt'
    writing manifest file 'apex.egg-info/SOURCES.txt'
    Copying apex.egg-info to /home/ultralytics/.local/lib/python3.5/site-packages/apex-0.1-py3.5.egg-info
    running install_scripts
    writing list of installed files to '/tmp/pip-ln69wwvt-record/install-record.txt'
done
  Removing source in /tmp/pip-5vfngf45-build
Successfully installed apex-0.1
Cleaning up...

ultralytics@instance-2:~/apex$ cd ..
ultralytics@instance-2:~$ python3 -c "import apex"
ultralytics@instance-2:~$ python3 -c "from apex import amp"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
ImportError: cannot import name 'amp' from 'apex' (unknown location)
ultralytics@instance-2:~$ python3 -c "import apex; a=apex.amp"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
AttributeError: module 'apex' has no attribute 'amp'

mcarilli · 2019-04-18T15:11:52Z

This may be an artifact of where you tried to run import apex (from one level above the apex repo directory). I think when you say import apex from ~, Python is attempting to import the cloned repo directory called apex which is obviously not the right thing.

Try this, starting in the apex repo directory:

ultralytics@instance-2:~/apex$ cd ..
ultralytics@instance-2:~$ python
...
>>> import apex
>>> import sys
>>> sys.modules['apex']

should show where the files are being imported from, which should be some system install path, e.g. on my system

>>> sys.modules['apex']
<module 'apex' from '/home/mcarilli/anaconda3/lib/python3.6/site-packages/apex/__init__.py'>
>>>

After installing, you can also try running the L0 tests:

cd tests/L0
python run_test.py

They should all pass if you installed with cpp/cuda extensions.
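
The same check can be done without an interactive session by asking the import system where it would load a name from. The sketch below is only an illustration and is not part of Apex; `describe_import` is a hypothetical helper name. A bare directory with no `__init__.py`, such as a cloned repo sitting in the current directory, is picked up as a namespace package with no origin file, which is consistent with the "(unknown location)" in the traceback above.

```python
import importlib
import importlib.util


def describe_import(name):
    """Report where `import name` would load from, without importing it.

    Returns None if the name cannot be found, otherwise a dict with the
    origin file (None for a namespace package) and the package search paths.
    """
    importlib.invalidate_caches()
    spec = importlib.util.find_spec(name)
    if spec is None:
        return None
    paths = list(spec.submodule_search_locations or [])
    # A plain directory without __init__.py is treated as a namespace package:
    # it has search paths but no real origin file.
    is_namespace = bool(paths) and spec.origin in (None, "namespace")
    return {"origin": spec.origin, "paths": paths, "namespace": is_namespace}


if __name__ == "__main__":
    # "apex" is the name from this thread; any top-level name works.
    print(describe_import("apex"))
```

If `namespace` comes back as true and `paths` points at the cloned repo rather than a site-packages directory, the import is resolving to the source checkout and not to an installed package.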

glenn-jocher · 2019-04-18T15:41:00Z

@mcarilli ah yes, you are right! It was importing from the cloned repo. After I removed the /apex repo, it would no longer import apex.

I'm starting to think this is a conda install issue (the GCP Deep Learning VMs use Anaconda with Python 3.7). Following these directions on installing non-conda packages, I activated the conda environment before trying the install. The install was successful, but the package is missing from conda list and the import fails. I think I somehow need to direct it to install to /opt/anaconda3, because the install output instead mentions a separate Python 3.5: Copying apex.egg-info to /home/ultralytics/.local/lib/python3.5/site-packages/apex-0.1-py3.5.egg-info

ultralytics@instance-2:~$ conda info --envs
WARNING: The conda.compat module is deprecated and will be removed in a future release.
# conda environments:
#
base                  *  /opt/anaconda3
ultralytics@instance-2:~$ source activate base
(base) ultralytics@instance-2:~$ git clone https://github.com/NVIDIA/apex
(base) ultralytics@instance-2:~$ cd apex
(base) ultralytics@instance-2:~/apex$ pip3 install -v --no-cache-dir .
...
Successfully installed apex-0.1
Cleaning up...
(base) ultralytics@instance-2:~/apex$ cd .. && rm -rf apex
(base) ultralytics@instance-2:~$ python3
Python 3.7.1 (default, Dec 14 2018, 19:28:38) 
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from apex import amp
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ModuleNotFoundError: No module named 'apex'
>>>

mcarilli · 2019-04-18T22:16:03Z

Hmm, if I try this on my local machine, it appears to install to the correct location. I'm not sure what's different/lacking about the conda environment on the GCP instance...

apex_fresh$ source activate base
(base) apex_fresh$ pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" .
...
Copying apex.egg-info to /home/mcarilli/anaconda3/lib/python3.6/site-packages/apex-0.1-py3.6.egg-info
...
(base) apex_fresh$ cd ../..
(base) Desktop$ python
Python 3.6.7 |Anaconda custom (64-bit)| (default, Oct 23 2018, 19:16:44) 
[GCC 7.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import apex
>>> from apex import amp
>>>

mcarilli · 2019-04-24T02:55:50Z

Did you ever figure out why conda on GCP was installing to the wrong directory? I'm not a conda expert so if you managed to resolve this issue it will be helpful for future users.

glenn-jocher · 2019-04-24T13:27:24Z

No, no luck. I created a blank PyTorch Deep Learning VM and tried again from scratch, but it's installing to a different Python 3.5 rather than Anaconda. It seems to be an Anaconda issue, and unfortunately I'm not much of a conda expert either. I think pip installs into conda are not always problem-free; I've seen other repos with conda-specific install instructions.
creating /home/ultralytics/.local/lib/python3.5/site-packages/apex/amp

In your above example, you see apex in your conda list right?

mcarilli · 2019-04-29T16:59:02Z

Yes:

(base) apex_fresh$ conda list | grep apex
apex                      0.1                       <pip>

When I've had issues using pip installs in conda environments in the past, I've sometimes resolved them by explicitly running conda install pip within the conda environment before doing pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" . within the same environment.

mcarilli · 2019-05-10T23:24:52Z

Is it possible to use a Docker container on the gcp instance as a potential workaround? There are several options for Docker containers in which we test the Apex install regularly: https://github.com/NVIDIA/apex/tree/master/examples/docker

Even if Docker containers succeed, this does not diminish the importance of having the bare-metal Apex install also work. I'll consult some people who have more experience with conda.

ngimel · 2019-05-11T00:04:05Z

My guess is that it's installing to Python 3.5 because it's using the OS's pip3 (Python 3.5) rather than conda's Python 3.7. You can confirm by running pip3 --version and python --version.

glenn-jocher · 2019-05-11T09:11:12Z

@ngimel yes, you are correct! pip itself directs correctly to anaconda3/lib/python3.7, but pip3 is directing to a local python3.5.

glenn@instance-1:~$ pip3 --version
pip 9.0.1 from /usr/lib/python3/dist-packages (python 3.5)
glenn@instance-1:~$ python --version
Python 3.7.3
glenn@instance-1:~$ pip --version
pip 19.0.3 from /opt/anaconda3/lib/python3.7/site-packages/pip (python 3.7)

@mcarilli so I understand the situation now:

pip attempts to install to the correct location (conda's Python 3.7), but the install fails.
pip3 installs to an incorrect location (OS Python 3.5), so the package is inaccessible from the conda env.

Yes, if you could get someone to spin up a PyTorch 1.1 VM in GCP and work through the apex install that would help tremendously. Docker might be a fallback, but I think might also be a bridge too far for many users.
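
The comparison above can also be scripted. The sketch below parses one line of `pip --version` output, in the shape shown in this thread, and checks whether that pip targets the same Python as a given interpreter. The helper names `parse_pip_version` and `pip_matches_interpreter` are made up for this illustration, and the regular expression assumes the output format seen here; other pip versions may print something different.

```python
import re
import sys

_PIP_VERSION_RE = re.compile(r"^pip (\S+) from (.+) \(python (\d+)\.(\d+)\)\s*$")


def parse_pip_version(line):
    """Parse one line of `pip --version` output.

    Returns (pip_version, install_location, (py_major, py_minor)).
    Raises ValueError if the line does not have the expected shape.
    """
    match = _PIP_VERSION_RE.match(line.strip())
    if match is None:
        raise ValueError("unrecognized pip --version output: %r" % line)
    pip_version, location, major, minor = match.groups()
    return pip_version, location, (int(major), int(minor))


def pip_matches_interpreter(line, interpreter=None):
    """True if the pip described by `line` targets the given (major, minor)."""
    if interpreter is None:
        # Default to the interpreter running this script.
        interpreter = sys.version_info[:2]
    return parse_pip_version(line)[2] == tuple(interpreter)
```

A common way to sidestep the mismatch entirely is to invoke pip through the interpreter you intend to use, as in `python -m pip install ...`, so the package lands in that interpreter's environment.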

ngimel · 2019-05-14T23:09:47Z

I can't repro on the latest pytorch vm (Pytorch 1.1 + fastai 1.0 (CUDA 10.0))

(base) root@tensorflow-1-vm:~/apex# pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" .
...
...
...
    writing list of installed files to '/tmp/pip-record-b_l2sreu/install-record.txt'
done
  Removing source in /tmp/pip-req-build-koc3g9j3
Successfully installed apex-0.1
Cleaning up...

glenn-jocher · 2019-05-18T10:06:45Z

@ngimel I just checked on a new PyTorch 1.1 vm. This time I got a permission denied error:
error: could not create '/opt/anaconda3/lib/python3.7/site-packages/apex': Permission denied

So I tried sudo pip install -v --no-cache-dir . which installs without error, but to the wrong Python (2.7). So I still cannot install apex into Anaconda's Python 3.7.
Copying apex.egg-info to /usr/local/lib/python2.7/dist-packages/apex-0.1-py2.7.egg-info

source activate base
git clone https://github.com/NVIDIA/apex
cd apex
sudo pip install -v --no-cache-dir .
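
One possible explanation, not confirmed anywhere in this thread, is that `sudo` can run with a different `PATH` than the user's shell, so the bare name `pip` resolves to a different executable. The sketch below only illustrates how the order of `PATH` directories decides which `pip` wins; `resolve_command` is a hypothetical helper, and the directories in the comment are examples, not a claim about how any particular VM is configured.

```python
import os
import shutil


def resolve_command(command, path_dirs):
    """Return the executable that `command` resolves to on the given PATH dirs.

    `path_dirs` is an ordered list of directories, searched first to last,
    the same way a shell walks PATH. Returns None if nothing matches.
    """
    return shutil.which(command, path=os.pathsep.join(path_dirs))


# Example (directories are illustrative only):
#   resolve_command("pip", ["/opt/anaconda3/bin", "/usr/bin"])
#   resolve_command("pip", ["/usr/local/bin", "/usr/bin"])
```

Comparing the result for the shell's `PATH` with the result for the `PATH` seen under `sudo` would show whether the two are picking up different pip executables.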

see-- · 2019-06-07T14:49:19Z

Using --user worked for me with python3. I guess that is what you get from using 3 different python versions.

pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" . --user
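
For context on what `--user` changes: pip then installs into the per-user site-packages directory of the interpreter it runs under, which does not need write access to the system or Anaconda install tree. That directory is specific to each Python version, as with the `/home/ultralytics/.local/lib/python3.5/site-packages` path seen earlier in this thread. The sketch below, with the hypothetical helper name `user_site_report`, prints where that directory is for whichever interpreter runs it.

```python
import site
import sys


def user_site_report():
    """Describe the per-user site-packages directory of this interpreter."""
    user_site = site.getusersitepackages()
    return {
        "python": "%d.%d" % sys.version_info[:2],
        "user_site": user_site,
        # False or None means this interpreter ignores the user site directory.
        "enabled": bool(site.ENABLE_USER_SITE),
        # The directory is only added to sys.path if it exists at startup.
        "on_sys_path": user_site in sys.path,
    }


if __name__ == "__main__":
    for key, value in user_site_report().items():
        print(key, "=", value)
```

Running it under each interpreter on the machine shows which user site directory each one would read from.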

glenn-jocher · 2019-06-07T15:13:21Z

@see-- this works! I was able to successfully install on a GCP VM with the following commands:

source activate base
git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --no-cache-dir . --user

UPDATE 1: On running a mixed precision model with the above install I get the following warning: Warning: multi_tensor_applier fused unscale kernel is unavailable, possibly because apex was installed without --cuda_ext --cpp_ext. Using Python fallback.

Installing instead with the following line removed the warning:

source activate base
git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" . --user
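
To check ahead of time whether the compiled extensions are present, rather than waiting for the warning at training time, one option is to ask the import system whether the extension modules can be found. This is a rough sketch: the module names `apex_C` and `amp_C` are an assumption about what the `--cpp_ext` and `--cuda_ext` builds produce and may differ between Apex versions, and `available_extensions` is a hypothetical helper, not an Apex API.

```python
import importlib.util


def available_extensions(names=("apex_C", "amp_C")):
    """Map each extension module name to whether the import system can find it.

    The default names are an assumption about Apex's compiled modules;
    pass your own tuple if your build uses different names.
    """
    return {name: importlib.util.find_spec(name) is not None for name in names}


if __name__ == "__main__":
    print(available_extensions())
```

This only confirms the modules are discoverable by the current interpreter. It does not load them, so it will not catch a build that was compiled against a mismatched CUDA or PyTorch version.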

mcarilli · 2019-06-14T22:24:28Z

Excellent, thanks guys. Sorry I haven't had time to do a deep dive myself, but I'm pinning this issue for others.

sleepinyourhat · 2019-07-26T19:10:03Z

For posterity, I was only able to get this to work (after trying many other things) with:

sudo pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" . --user

(note the sudo.)

morganmcg1 · 2019-10-31T08:50:12Z

I had to use Conda forge to get this working within my conda environment

conda install -c conda-forge nvidia-apex

MuhammadAsadJaved · 2020-01-08T08:51:12Z

Using --user worked for me with python3. I guess that is what you get from using 3 different python versions.
pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" . --user

@see-- @glenn-jocher @sleepinyourhat
If we install with the above command, can we import it in both Python 2.x and Python 3.x? I'm stuck here. My project uses Python 3.x and I am unable to install with pip3. Before that, I tried pip without --user and was able to import in Python 2.x, but I need it in Python 3.x.
So if I install with pip and --user, will it install for all Python versions?

My environment
ubuntu 16.04
CUDA Version 10.0.130
CuDNN 7.4.1
torch.version '1.3.1'
Python 3.5.2

glenn-jocher mentioned this issue Apr 17, 2019

Adding mixed precision training for RTX graphic cards ultralytics/yolov3#210

Closed

mcarilli added gcp and removed gcp labels Apr 17, 2019

glenn-jocher closed this as completed Jun 14, 2019

mcarilli pinned this issue Jun 14, 2019

valeriobasile mentioned this issue Aug 5, 2019

Unable to use learner.fit() because of Apex dependencies utterworks/fast-bert#2

Closed

mcarilli mentioned this issue Nov 4, 2019

apex ImportError on Google Colab #585

Closed
