Docker Build Fails #184

Open
TaekyungHeo opened this issue Jan 4, 2024 · 7 comments

@TaekyungHeo
Member

TaekyungHeo commented Jan 4, 2024

Issue Description

When attempting to build a Docker image using the latest branch of the NeMo-Megatron-Launcher, the build fails.

Steps to Reproduce

Run the Docker build command:

$ docker build .
...
52.26   WARNING: Missing build requirements in pyproject.toml for megatron-core==0.4.0 from https://files.pythonhosted.org/packages/fd/b9/e85da25f4de43dad70d6fd1c21b88db085f471d5348c51cce05dc9e4b0ef/megatron_core-0.4.0.tar.gz#sha256=bb2cd1f4c5746b31a8b4abd676820ddceec272f002873801a519dbbf1352d8ef (from nemo-toolkit==1.21.0rc0).
52.26   WARNING: The project does not specify a build backend, and pip cannot fall back to setuptools without 'wheel'.
52.26   Getting requirements to build wheel: started
52.70   Getting requirements to build wheel: finished with status 'error'
52.70   ERROR: Command errored out with exit status 1:
52.70    command: /usr/bin/python /usr/local/lib/python3.8/dist-packages/pip/_vendor/pep517/in_process/_in_process.py get_requires_for_build_wheel /tmp/tmphls_nknx
52.70        cwd: /tmp/pip-install-nn9_ikhg/megatron-core_0c928c7c63d747598ef18d54f6ec6286
52.70   Complete output (18 lines):
52.70   Traceback (most recent call last):
52.70     File "/usr/local/lib/python3.8/dist-packages/pip/_vendor/pep517/in_process/_in_process.py", line 349, in <module>
52.70       main()
52.70     File "/usr/local/lib/python3.8/dist-packages/pip/_vendor/pep517/in_process/_in_process.py", line 331, in main
52.70       json_out['return_val'] = hook(**hook_input['kwargs'])
52.70     File "/usr/local/lib/python3.8/dist-packages/pip/_vendor/pep517/in_process/_in_process.py", line 117, in get_requires_for_build_wheel
52.70       return hook(config_settings)
52.70     File "/usr/local/lib/python3.8/dist-packages/setuptools/build_meta.py", line 338, in get_requires_for_build_wheel
52.70       return self._get_build_requires(config_settings, requirements=['wheel'])
52.70     File "/usr/local/lib/python3.8/dist-packages/setuptools/build_meta.py", line 320, in _get_build_requires
52.70       self.run_setup()
52.70     File "/usr/local/lib/python3.8/dist-packages/setuptools/build_meta.py", line 483, in run_setup
52.70       super(_BuildMetaLegacyBackend,
52.70     File "/usr/local/lib/python3.8/dist-packages/setuptools/build_meta.py", line 335, in run_setup
52.70       exec(code, locals())
52.70     File "<string>", line 52, in <module>
52.70     File "<string>", line 45, in req_file
52.70   FileNotFoundError: [Errno 2] No such file or directory: 'megatron/core/requirements.txt'
52.70   ----------------------------------------
52.70 WARNING: Discarding https://files.pythonhosted.org/packages/fd/b9/e85da25f4de43dad70d6fd1c21b88db085f471d5348c51cce05dc9e4b0ef/megatron_core-0.4.0.tar.gz#sha256=bb2cd1f4c5746b31a8b4abd676820ddceec272f002873801a519dbbf1352d8ef (from https://pypi.org/simple/megatron-core/). Command errored out with exit status 1: /usr/bin/python /usr/local/lib/python3.8/dist-packages/pip/_vendor/pep517/in_process/_in_process.py get_requires_for_build_wheel /tmp/tmphls_nknx Check the logs for full command output.
52.70 ERROR: Could not find a version that satisfies the requirement megatron-core==0.4.0; extra == "nlp" (from nemo-toolkit[nlp]) (from versions: 0.1.0, 0.2.0, 0.3.0, 0.4.0)
52.70 ERROR: No matching distribution found for megatron-core==0.4.0; extra == "nlp"
52.93 WARNING: You are using pip version 21.2.4; however, version 23.3.2 is available.
52.93 You should consider upgrading via the '/usr/bin/python -m pip install --upgrade pip' command.
------
Dockerfile:110
--------------------
 109 |     ARG NEMO_COMMIT
 110 | >>> RUN git clone https://github.com/NVIDIA/NeMo.git && \
 111 | >>>     cd NeMo && \
 112 | >>>     if [ ! -z $NEMO_COMMIT ]; then \
 113 | >>>         git fetch origin $NEMO_COMMIT && \
 114 | >>>         git checkout FETCH_HEAD; \
 115 | >>>     fi && \
 116 | >>>     pip uninstall -y nemo_toolkit sacrebleu && \
 117 | >>>     pip install -e ".[nlp]" && \
 118 | >>>     cd nemo/collections/nlp/data/language_modeling/megatron && \
 119 | >>>     make
 120 |     
--------------------
ERROR: failed to solve: process "/bin/sh -c git clone https://github.com/NVIDIA/NeMo.git &&     cd NeMo &&     if [ ! -z $NEMO_COMMIT ]; then         git fetch origin $NEMO_COMMIT &&         git checkout FETCH_HEAD;     fi &&     pip uninstall -y nemo_toolkit sacrebleu &&     pip install -e \".[nlp]\" &&     cd nemo/collections/nlp/data/language_modeling/megatron &&     make" did not complete successfully: exit code: 1

Additional Context

  • This issue seems to originate from the megatron_core==0.4.0 package, which is installed as part of the Docker build process.
  • I have already reported this bug to the megatron_core team. Peter Dykas replied that we need to use python3.10.
  • Attempting a workaround by cloning a previous commit of NeMo (using --build-arg NEMO_COMMIT=c7948b26a00c91a7332d9eb04f4d66725e9d62e3) installs a previous megatron package (0.3.0) but leads to failure in the data preparation stage, possibly due to other issues resolved in the latest NeMo version.
TaekyungHeo changed the title from "Docker Build Fails with Latest Branch Due to megatron_core Bug" to "Docker Build Fails" on Jan 4, 2024
@TaekyungHeo
Member Author

Related issue: NVIDIA/Megatron-LM#650

@JanuszL

JanuszL commented Jan 4, 2024

If:

# Install Megatron-core
ARG MEGATRONCORE_COMMIT
RUN git clone https://github.com/NVIDIA/Megatron-LM.git && \
    cd Megatron-LM && \
    if [ ! -z $MEGATRONCORE_COMMIT ]; then \
        git fetch origin $MEGATRONCORE_COMMIT && \
        git checkout FETCH_HEAD; \
    fi && \
    pip install -e .

goes before:

ARG NEMO_COMMIT
RUN git clone https://github.com/NVIDIA/NeMo.git && \
    cd NeMo && \
    if [ ! -z $NEMO_COMMIT ]; then \
        git fetch origin $NEMO_COMMIT && \
        git checkout FETCH_HEAD; \
    fi && \
    pip uninstall -y nemo_toolkit sacrebleu && \
    pip install -e ".[nlp]" && \
    cd nemo/collections/nlp/data/language_modeling/megatron && \
    make

Then it should build from source correctly. Currently NeMo is installed first and Megatron-core, which is its dependency, is installed afterwards; the order should be reversed.

@TaekyungHeo
Member Author

I appreciate your suggestion, @JanuszL! However, it does not seem to work. I will wait for the bugfix from the developers.

My diff for Dockerfile:

diff --git Dockerfile Dockerfile
index b250d0d..f823caf 100644
--- Dockerfile
+++ Dockerfile
@@ -105,6 +105,17 @@ RUN pip uninstall -y apex && \
     fi && \
     pip install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" --global-option="--fast_layer_norm" --global-option="--distribut
 
+# Install Megatron-core
+ARG MEGATRONCORE_COMMIT
+RUN git clone https://github.com/NVIDIA/Megatron-LM.git && \
+    cd Megatron-LM && \
+    if [ ! -z $MEGATRONCORE_COMMIT ]; then \
+        git fetch origin $MEGATRONCORE_COMMIT && \
+        git checkout FETCH_HEAD; \
+    fi && \
+    pip install -e .
+
+
 # Install NeMo
 ARG NEMO_COMMIT
 RUN git clone https://github.com/NVIDIA/NeMo.git && \
@@ -131,17 +142,6 @@ RUN git clone https://github.com/NVIDIA/TransformerEngine.git && \
     fi && \
     git submodule init && git submodule update && \
     NVTE_FRAMEWORK=pytorch NVTE_WITH_USERBUFFERS=1 MPI_HOME=/usr/local/mpi pip install .
-
-# Install Megatron-core
-ARG MEGATRONCORE_COMMIT
-RUN git clone https://github.com/NVIDIA/Megatron-LM.git && \
-    cd Megatron-LM && \
-    if [ ! -z $MEGATRONCORE_COMMIT ]; then \
-        git fetch origin $MEGATRONCORE_COMMIT && \
-        git checkout FETCH_HEAD; \
-    fi && \
-    pip install -e .
-
 # Install launch scripts
 COPY . NeMo-Megatron-Launcher
 RUN cd NeMo-Megatron-Launcher && \

Docker build result:

$ docker build .
104.0   WARNING: Missing build requirements in pyproject.toml for megatron-core==0.4.0 from https://files.pythonhosted.org/packages/fd/b9/e85da25f4de43dad70d6fd1c21b88db085f471d5348c51cce05dc9e4b0ef/megatron_core-0.4.0.tar.gz#sha256=bb2cd1f4c5746b31a8b4abd676820ddceec272f002873801a519dbbf1352d8ef (from nemo-toolkit==1.21.0rc0).
104.0   WARNING: The project does not specify a build backend, and pip cannot fall back to setuptools without 'wheel'.
104.0   Getting requirements to build wheel: started
104.4   Getting requirements to build wheel: finished with status 'error'
104.4   ERROR: Command errored out with exit status 1:
104.4    command: /usr/bin/python /usr/local/lib/python3.8/dist-packages/pip/_vendor/pep517/in_process/_in_process.py get_requires_for_build_wheel /tmp/tmpo5jo_szg
104.4        cwd: /tmp/pip-install-t3ld3nxc/megatron-core_34e185c9e75b43368c5fb95d140aec35
104.4   Complete output (18 lines):
104.4   Traceback (most recent call last):
104.4     File "/usr/local/lib/python3.8/dist-packages/pip/_vendor/pep517/in_process/_in_process.py", line 349, in <module>
104.4       main()
104.4     File "/usr/local/lib/python3.8/dist-packages/pip/_vendor/pep517/in_process/_in_process.py", line 331, in main
104.4       json_out['return_val'] = hook(**hook_input['kwargs'])
104.4     File "/usr/local/lib/python3.8/dist-packages/pip/_vendor/pep517/in_process/_in_process.py", line 117, in get_requires_for_build_wheel
104.4       return hook(config_settings)
104.4     File "/usr/local/lib/python3.8/dist-packages/setuptools/build_meta.py", line 338, in get_requires_for_build_wheel
104.4       return self._get_build_requires(config_settings, requirements=['wheel'])
104.4     File "/usr/local/lib/python3.8/dist-packages/setuptools/build_meta.py", line 320, in _get_build_requires
104.4       self.run_setup()
104.4     File "/usr/local/lib/python3.8/dist-packages/setuptools/build_meta.py", line 483, in run_setup
104.4       super(_BuildMetaLegacyBackend,
104.4     File "/usr/local/lib/python3.8/dist-packages/setuptools/build_meta.py", line 335, in run_setup
104.4       exec(code, locals())
104.4     File "<string>", line 52, in <module>
104.4     File "<string>", line 45, in req_file
104.4   FileNotFoundError: [Errno 2] No such file or directory: 'megatron/core/requirements.txt'
104.4   ----------------------------------------
104.4 WARNING: Discarding https://files.pythonhosted.org/packages/fd/b9/e85da25f4de43dad70d6fd1c21b88db085f471d5348c51cce05dc9e4b0ef/megatron_core-0.4.0.tar.gz#sha256=bb2cd1f4c5746b31a8b4abd676820ddceec272f002873801a519dbbf1352d8ef (from https://pypi.org/simple/megatron-core/). Command errored out with exit status 1: /usr/bin/python /usr/local/lib/python3.8/dist-packages/pip/_vendor/pep517/in_process/_in_process.py get_requires_for_build_wheel /tmp/tmpo5jo_szg Check the logs for full command output.
104.4 ERROR: Could not find a version that satisfies the requirement megatron-core==0.4.0; extra == "nlp" (from nemo-toolkit[nlp]) (from versions: 0.1.0, 0.2.0, 0.3.0, 0.4.0)
104.4 ERROR: No matching distribution found for megatron-core==0.4.0; extra == "nlp"
105.4 WARNING: You are using pip version 21.2.4; however, version 23.3.2 is available.
105.4 You should consider upgrading via the '/usr/bin/python -m pip install --upgrade pip' command.
------
ERROR: failed to solve: failed to solve with frontend dockerfile.v0: failed to build LLB: executor failed running [/bin/sh -c git clone https://github.com/NVIDIA/NeMo.git &&     cd NeMo &&     if [ ! -z $NEMO_COMMIT ]; then         git fetch origin $NEMO_COMMIT &&         git checkout FETCH_HEAD;     fi &&     pip uninstall -y nemo_toolkit sacrebleu &&     pip install -e ".[nlp]" &&     cd nemo/collections/nlp/data/language_modeling/megatron &&     make]: runc did not terminate sucessfully

@JanuszL

JanuszL commented Jan 5, 2024

It seems that the ToT (top of tree) of Megatron-LM builds 0.4.0rc0, while NeMo expects 0.4.0.
On top of your Dockerfile change, you can add the --build-arg "MEGATRONCORE_COMMIT=core_v0.4.0" parameter to the docker build command.
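
For reference, the suggestion above might translate into an invocation like the one below. This is only a sketch: it assumes the build is started from the repository root, as in the original `docker build .`, and it prints the command rather than running it.

```shell
# Tag taken from the comment above. It only has an effect because the
# Dockerfile declares "ARG MEGATRONCORE_COMMIT" before cloning Megatron-LM.
MEGATRONCORE_COMMIT='core_v0.4.0'

# Printed instead of executed so the sketch has no side effects;
# remove the leading "echo" to start the build.
echo docker build --build-arg "MEGATRONCORE_COMMIT=${MEGATRONCORE_COMMIT}" .
```

With the reordered Dockerfile, this should make the Megatron-LM step fetch and check out `core_v0.4.0` before NeMo is installed.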

@TaekyungHeo
Member Author

Thanks, @JanuszL. I tried your suggestion both before and after modifying the Dockerfile. Without the modifications, it still prints out the same error. However, when I change the Dockerfile, the pip installation stage takes an unusually long time.

@JanuszL

JanuszL commented Jan 5, 2024

@TaekyungHeo thank you for checking. I think I may lack the necessary understanding of the build logic used here. Let us wait for the project maintainers to share their thoughts.

@vishakha-lall

This is not a proper fix, but I hit the same issue while installing nemo_toolkit[all]. I worked around it by reverting to the previous release, nemo-toolkit==1.21.0 (released in 2023), instead of the current one, which was released in Jan 2024.
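
A pin like the one described above could look roughly as follows. The exact command that was run is not shown in this thread, so combining the `[all]` extra with the `1.21.0` pin is an assumption; the command is printed rather than executed.

```shell
# Assumed pin: the [all] extra and the 1.21.0 version both come from the
# comment above; the exact command the commenter ran is not shown.
NEMO_SPEC='nemo_toolkit[all]==1.21.0'

# Printed instead of executed; remove the leading "echo" to install.
echo pip install "$NEMO_SPEC"
```

Note that the Dockerfile in this issue installs NeMo from a git checkout with `pip install -e ".[nlp]"`, so this pin applies to a plain pip install rather than to the Docker build itself.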
