Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

up to 2.0.1 #172

Closed
wants to merge 10 commits into from
Closed

up to 2.0.1 #172

wants to merge 10 commits into from

Conversation

ngam
Copy link
Contributor

@ngam ngam commented Jun 16, 2023

Checklist

  • Used a personal fork of the feedstock to propose changes
  • Bumped the build number (if the version is unchanged)
  • Reset the build number to 0 (if the version changed)
  • Re-rendered with the latest conda-smithy (Use the phrase @conda-forge-admin, please rerender in a comment in this PR for automated rerendering)
  • Ensured the license file is being packaged.

@conda-forge-webservices
Copy link
Contributor

Hi! This is the friendly automated conda-forge-linting service.

I just wanted to let you know that I linted all conda-recipes in your PR (recipe) and found it was in an excellent condition.

@ngam
Copy link
Contributor Author

ngam commented Jun 17, 2023

@conda-forge-admin, please rerender

@RaulPPelaez
Copy link
Contributor

RaulPPelaez commented Jun 19, 2023

GCC 10 builds fail due to OneDNN:

  /home/conda/feedstock_root/build_artifacts/pytorch-recipe_1686968646633/work/torch/csrc/jit/codegen/onednn/graph_helper.h:3:10: fatal error: oneapi/dnnl/dnnl_graph.hpp: No such file or directory
      3 | #include <oneapi/dnnl/dnnl_graph.hpp>
        |   

OneDNN does not seem to be listed in the requirements for build: https://github.com/conda-forge/pytorch-cpu-feedstock/blob/f2474772cb070bd3131bbb349e0d6f856c79747a/recipe/meta.yaml
GCC 12 builds fail because of an incompatibility between FBGMM and GCC 12:
pytorch/pytorch#77939
pytorch/FBGEMM#1666
Apparently the error is a false positive and can be ignored. Users got around this error by just silencing maybe-uninitialized (this is recommended in FBGEMM`s README now):

export CXXFLAGS+='-Wno-maybe-uninitialized -Wno-uninitialized -Wno-free-nonheap-object -Wno-nonnull' 
export CFLAGS+='-Wno-maybe-uninitialized -Wno-uninitialized -Wno-free-nonheap-object -Wno-nonnull'

I wanna help, but I do not really know how a feedstock works -.-

@ngam
Copy link
Contributor Author

ngam commented Jun 19, 2023

Thanks, I will follow up with your suggested fixes soon.

We all learn the conda-forge ways by watching 😉

@github-actions
Copy link

Hi! This is the friendly automated conda-forge-webservice.

I tried to rerender for you, but it looks like there was nothing to do.

This message was generated by GitHub actions workflow run https://github.com/conda-forge/pytorch-cpu-feedstock/actions/runs/5315469498.

recipe/build_pytorch.sh Outdated Show resolved Hide resolved
recipe/meta.yaml Outdated
@@ -70,6 +70,7 @@ outputs:
- cudnn # [cuda_compiler_version != "None"]
- nccl # [cuda_compiler_version != "None"]
- magma # [cuda_compiler_version != "None"]
- onednn # [cuda_compiler_version == "11.2"]
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need this for run too? Unsure, but we will assess later

@ngam
Copy link
Contributor Author

ngam commented Jun 19, 2023

We still have the onednn issue. Let me think how we address this......

@RaulPPelaez
Copy link
Contributor

Inspecting the first lines of the failed builds:

## Package Plan ##

  environment location: /home/conda/feedstock_root/build_artifacts/pytorch-recipe_1687227140059/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placeh


The following NEW packages will be INSTALLED:

    _libgcc_mutex:      0.1-conda_forge            conda-forge
    _openmp_mutex:      4.5-2_kmp_llvm             conda-forge
    brotli:             1.0.9-h166bdaf_8           conda-forge
    brotli-bin:         1.0.9-h166bdaf_8           conda-forge
    bzip2:              1.0.8-h7f98852_4           conda-forge
    ca-certificates:    2023.5.7-hbcca054_0        conda-forge
    certifi:            2023.5.7-pyhd8ed1ab_0      conda-forge
    charset-normalizer: 3.1.0-pyhd8ed1ab_0         conda-forge
    cuda-version:       11.8-h70ddcb2_2            conda-forge
    cudatoolkit:        11.8.0-h37601d7_11         conda-forge
    cudnn:              8.8.0.121-h0800d71_1       conda-forge
    future:             0.18.3-pyhd8ed1ab_0        conda-forge
    icu:                72.1-hcb278e6_0            conda-forge
    idna:               3.4-pyhd8ed1ab_0           conda-forge
    ld_impl_linux-64:   2.40-h41732ed_0            conda-forge
    libblas:            3.9.0-16_linux64_mkl       conda-forge
    libbrotlicommon:    1.0.9-h166bdaf_8           conda-forge
    libbrotlidec:       1.0.9-h166bdaf_8           conda-forge
    libbrotlienc:       1.0.9-h166bdaf_8           conda-forge
    libcblas:           3.9.0-16_linux64_mkl       conda-forge
    libffi:             3.4.2-h7f98852_5           conda-forge
    libgcc-ng:          13.1.0-he5830b7_0          conda-forge
    libgomp:            13.1.0-he5830b7_0          conda-forge
    libhwloc:           2.9.1-nocuda_h7313eea_6    conda-forge
    libiconv:           1.17-h166bdaf_0            conda-forge
    liblapack:          3.9.0-16_linux64_mkl       conda-forge
    libmagma:           2.7.1-hc72dce7_3           conda-forge
    libmagma_sparse:    2.7.1-hc72dce7_4           conda-forge
    libnsl:             2.0.0-h7f98852_0           conda-forge
    libprotobuf:        3.21.12-h3eb15da_0         conda-forge
    libsqlite:          3.42.0-h2797004_0          conda-forge
    libstdcxx-ng:       13.1.0-hfd8a6a1_0          conda-forge
    libuuid:            2.38.1-h0b41bf4_0          conda-forge
    libuv:              1.44.2-h166bdaf_0          conda-forge
    libxml2:            2.11.4-h0d562d8_0          conda-forge
    libzlib:            1.2.13-hd590300_5          conda-forge
    llvm-openmp:        16.0.6-h4dfa4b3_0          conda-forge
    magma:              2.7.1-ha770c72_4           conda-forge
    mkl:                2022.2.1-h84fe81f_16997    conda-forge
    mkl-devel:          2022.2.1-ha770c72_16998    conda-forge
    mkl-include:        2022.2.1-h84fe81f_16997    conda-forge
    nccl:               2.18.3.1-h12f7317_0        conda-forge
    ncurses:            6.4-hcb278e6_0             conda-forge
    numpy:              1.21.6-py310h45f3432_0     conda-forge
    openssl:            3.1.1-hd590300_1           conda-forge
    pip:                23.1.2-pyhd8ed1ab_0        conda-forge
    pkg-config:         0.29.2-h36c2ea0_1008       conda-forge
    pysocks:            1.7.1-pyha2e5f31_6         conda-forge
    python:             3.10.11-he550d4f_0_cpython conda-forge
    python_abi:         3.10-3_cp310               conda-forge
    pyyaml:             6.0-py310h5764c6d_5        conda-forge
    readline:           8.2-h8228510_1             conda-forge
    requests:           2.31.0-pyhd8ed1ab_0        conda-forge
    setuptools:         67.7.2-pyhd8ed1ab_0        conda-forge
    six:                1.16.0-pyh6c4a22f_0        conda-forge
    sleef:              3.5.1-h9b69904_2           conda-forge
    tbb:                2021.9.0-hf52228f_0        conda-forge
    tk:                 8.6.12-h27826a3_0          conda-forge
    typing:             3.10.0.0-pyhd8ed1ab_0      conda-forge
    typing_extensions:  4.6.3-pyha770c72_0         conda-forge
    tzdata:             2023c-h71feb2d_0           conda-forge
    urllib3:            2.0.3-pyhd8ed1ab_0         conda-forge
    wheel:              0.40.0-pyhd8ed1ab_0        conda-forge
    xz:                 5.2.6-h166bdaf_0           conda-forge
    yaml:               0.2.5-h7f98852_2           conda-forge
    zstd:               1.5.2-h3eb15da_6           conda-forge

OneDNN is not there

@RaulPPelaez
Copy link
Contributor

I see you added mkl_include to meta.yml, but this is a different package with different headers and names.
Pytorch has the -DUSE_MKLDNN CMake flag to use mkl instead of the newer oneAPI, but there has been effort to translate to oneAPI in pytorch:
pytorch/pytorch#32422
So I guess we should use that. The conda-forge package is just "onednn" and includes the missing header.

@ngam
Copy link
Contributor Author

ngam commented Jun 20, 2023

I tried adding onednn in 48b2813 and it didn’t work. I got the mkl-include idea from looking at how pytorch builds their own conda package (no onednn, but they use mkl-include)

@RaulPPelaez
Copy link
Contributor

What was the error when you added onednn? I cannot see the log.

@ngam
Copy link
Contributor Author

ngam commented Jun 20, 2023

My plan was to add onednn again if mkl-include didn't work (i.e., adding one piece at a time)

@RaulPPelaez
Copy link
Contributor

RaulPPelaez commented Jun 20, 2023

The compiler finds oneDNN, but eventually it fails with this error:

  /home/conda/feedstock_root/build_artifacts/pytorch-recipe_1687262527413/work/torch/csrc/jit/codegen/onednn/operator.h:98:15: error: no matching function for call to 'dnnl::graph::op::set_attr(std::string&, std::__cxx11::basic_string<char>)'
  /home/conda/feedstock_root/build_artifacts/pytorch-recipe_1687262527413/work/torch/csrc/jit/codegen/onednn/operator.h:98:16: note:   cannot convert 'name' (type 'std::string' {aka 'std::__cxx11::basic_string<char>'}) to type 'dnnl::graph::op::attr'

It looks to me like a version mismatch between oneDNN and pytorch. Honestly no idea how to solve it -.-
Sadly I cannot find any references to it online.

These official tips are 4 years old, but its the best I could find. https://github.com/pytorch/pytorch/blame/main/CONTRIBUTING.md#c-development-tips

EDIT:
I found these:
pytorch/pytorch#103745
pytorch/pytorch#97957
Which makes me think that onednn should be included in third_party automatically. My head hurts -.-

@ngam
Copy link
Contributor Author

ngam commented Jun 20, 2023

Which makes me think that onednn should be included in third_party automatically. My head hurts -.-

That was gonna be the next question --- why didn't we encounter this problem before? I am not sure what else is changing, but in our end, it seems nothing changed since the last build. Is there anything that could be related in here: pytorch/pytorch@v2.0.0...v2.0.1

@ngam
Copy link
Contributor Author

ngam commented Jun 20, 2023

I found these:
pytorch/pytorch#103745
pytorch/pytorch#97957

We can either:

  1. Apply one or both these patches to this PR
  2. use the older onednn version these PRs discuss (currently we are pulling the newer onednn)

Do you have a preference?

@RaulPPelaez
Copy link
Contributor

Which makes me think that onednn should be included in third_party automatically. My head hurts -.-

That was gonna be the next question --- why didn't we encounter this problem before? I am not sure what else is changing, but in our end, it seems nothing changed since the last build. Is there anything that could be related in here: pytorch/pytorch@v2.0.0...v2.0.1

As far as I can tell there is nothing in that diff messing with the building process.

@RaulPPelaez
Copy link
Contributor

Do you have a preference?

I would use the older DNN

@RaulPPelaez
Copy link
Contributor

OTOH the CPU builds work, and I believe it is due to this USE_MKLDNN here:

else
if [[ "$target_platform" == *-64 ]]; then
export BLAS="MKL"
else
# Breakpad seems to not work on aarch64 or ppc64le
# https://github.com/pytorch/pytorch/issues/67083
export USE_BREAKPAD=0
fi
export USE_CUDA=0
export USE_MKLDNN=1
export CMAKE_TOOLCHAIN_FILE="${RECIPE_DIR}/cross-linux.cmake"
fi

This branch only runs when cuda compiler is None, so maybe that makes it so oneDNN is not even needed.

@ngam
Copy link
Contributor Author

ngam commented Jun 21, 2023

Smart! Let’s add MKLDNN to the cuda ones too. I will edit the PR with this first, then the older onednn if it doesn’t work

@ngam
Copy link
Contributor Author

ngam commented Jun 21, 2023

Btw, do you know if they added support for grace hopper gpus yet? If so, we may need to edit the compute list we have (TORCH_CUDA_ARCH_LIST). I can double check once everything passes.

@ngam
Copy link
Contributor Author

ngam commented Jun 21, 2023

Don’t hold your breath! The cuda builds won’t finish in 6 hours. What we are hoping for here is that they time out…

@ngam
Copy link
Contributor Author

ngam commented Jun 21, 2023

Well damn… they’re finishing in time. Could you test one of the gpu artifacts to see if they work fine locally? Let me know if you don’t know how to do that

@ngam ngam marked this pull request as ready for review June 21, 2023 23:05
@ngam
Copy link
Contributor Author

ngam commented Jun 21, 2023

@conda-forge/pytorch-cpu @hmaarrfk @h-vetinari @Tobias-Fischer, since when are we able to build cuda112 on the CI here??? This can't be right!

Could you please have a look? Would you like me to incorporate the cuda12 migration in here too?

@ngam
Copy link
Contributor Author

ngam commented Jun 21, 2023

@RaulPPelaez I will add you as a coauthor to commits in this PR if you don't mind. Please object if you don't want me to do that.

@ngam
Copy link
Contributor Author

ngam commented Jun 21, 2023

@conda-forge/pytorch-cpu @hmaarrfk @h-vetinari @Tobias-Fischer, since when are we able to build cuda112 on the CI here??? This can't be right!

Could you please have a look? Would you like me to incorporate the cuda12 migration in here too?

I should have checked the logs more carefully:

  -- Could NOT find CUDA (missing: CUDA_INCLUDE_DIRS) (found version "11.2")
  CMake Warning at cmake/public/cuda.cmake:31 (message):
    Caffe2: CUDA cannot be found.  Depending on whether you are building Caffe2
    or a Caffe2 dependent library, the next warning / error will give you more
    info.
  Call Stack (most recent call first):
    cmake/Dependencies.cmake:43 (include)
    CMakeLists.txt:717 (include)


  CMake Warning at cmake/Dependencies.cmake:66 (message):
    Not compiling with CUDA.  Suppress this warning with -DUSE_CUDA=OFF.
  Call Stack (most recent call first):
    CMakeLists.txt:717 (include)


@RaulPPelaez
Copy link
Contributor

RaulPPelaez commented Jun 22, 2023

The local build looks fine to me. The docker container appears to provide CUDA in /usr/local/cuda. I do not see that message about "CUDA not found".
The only difference between your last commit and my local version is that I did not introduced this line in the CUDA branch in the build script:

export CMAKE_TOOLCHAIN_FILE="${RECIPE_DIR}/cross-linux.cmake"

@Tobias-Fischer
Copy link
Contributor

I don’t know how to fix the problem, but we should check whether we can catch this somehow in a test case. It’s problematic that the build still passes.

@RaulPPelaez
Copy link
Contributor

In my local build, adding the CMAKE_TOOLCHAIN_FILE makes CMake not pick CUDA up. By removing it the error goes away.

Not sure how to go about catching this particular mistake.

@RaulPPelaez RaulPPelaez mentioned this pull request Jun 22, 2023
@jamestwebber
Copy link

What's the status of this? Confusingly for me, I see pytorch 2.0.1 when I run mamba search but the pyro-ppl feedstock is failing its test because it can't satisfy the requirement. I've been waiting on this PR before I update the feed to the latest version.

@RaulPPelaez
Copy link
Contributor

The CUDA builds are giving us a hard time. For some reason CMake is not finding CUDA correctly and we ran out of ideas.
Mysteriously a local build used to pass for me, let me see if the latest commit here still does and I will give it another spin.

Check out the channels in your env, in a base env I only see up to 2.0.0 in conda-forge:

$ mamba search pytorch
...
pytorch                       1.13.1 cuda112py39hb0b7ed5_200  conda-forge         
pytorch                        2.0.0 cpu_py310hd11e9c7_0  conda-forge         
pytorch                        2.0.0 cpu_py311h410fd25_0  conda-forge         
pytorch                        2.0.0 cpu_py38h019455c_0  conda-forge         
pytorch                        2.0.0 cpu_py39he4d1dc0_0  conda-forge         
pytorch                        2.0.0 cuda112py310he33e0d6_200  conda-forge         
pytorch                        2.0.0 cuda112py311h13fee9e_200  conda-forge         
pytorch                        2.0.0 cuda112py38h5e67e12_200  conda-forge         
pytorch                        2.0.0 cuda112py39ha9981d0_200  conda-forge   

Note that the pytorch channel offers 2.0.1.

@hmaarrfk
Copy link
Contributor

@conda-forge-admin, please rerender

@jamestwebber
Copy link

Check out the channels in your env, in a base env I only see up to 2.0.0 in conda-forge:

Ah yeah I'm actually seeing it in pkgs/main, I should remove that channel...

Note that the pytorch channel offers 2.0.1.

Yeah it is definitely available there and I assume most people install it from that channel. But for the purposes of the conda-forge feedstock I like to make sure everything is available before updating. Those that need the latest can always use pip.

@RaulPPelaez
Copy link
Contributor

By looking here:
https://github.com/pytorch/pytorch/blob/0ab74044c2775970d3bc3668454a3152ae18ea82/.ci/docker/common/install_conda.sh#L54
It seems like a particular version of mkl is hardcoded
This dockerfile is used to build with CUDA:
https://github.com/pytorch/pytorch/blob/main/.ci/docker/ubuntu-cuda/Dockerfile
If one follows the breadcrumb trail I believe the MKLDNN variable is set to 0.
With these changes from the latest commit, a11a83c, a local build succeeds for me

modified   recipe/build_pytorch.sh
@@ -125,7 +125,7 @@ if [[ ${cuda_compiler_version} != "None" ]]; then
     export USE_STATIC_CUDNN=0
     export CUDA_TOOLKIT_ROOT_DIR=$CUDA_HOME
     export MAGMA_HOME="${PREFIX}"
-    export USE_MKLDNN=1
+    export USE_MKLDNN=0
 else
     if [[ "$target_platform" == *-64 ]]; then
       export BLAS="MKL"
modified   recipe/meta.yaml
@@ -79,8 +79,8 @@ outputs:
         - requests
         - future
         - six
-        - mkl-devel {{ mkl }}    # [x86]
-        - mkl-include {{ mkl }}  # [x86]
+        - mkl==2021.4.0          # [x86]
+        - mkl-include==2021.4.0  # [x86]
         - libcblas * *_mkl       # [x86]
         - libcblas               # [not x86]
         - liblapack              # [not x86]

The test then fails with:

import: 'torch'                                                                                                                                                              
Traceback (most recent call last):                                                                                                                                           
  File "/home/conda/feedstock_root/build_artifacts/pytorch-recipe_1690209346820/test_tmp/run_test.py", line 2, in <module>                                                   
    import torch                                                                                                                                                             
  File "/home/conda/feedstock_root/build_artifacts/pytorch-recipe_1690209346820/_test_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_pla
cehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_p/lib/python3.10/site-packages/torch/__init__.py", line 228, in <module>              
    _load_global_deps()                                                                                                                                                      
  File "/home/conda/feedstock_root/build_artifacts/pytorch-recipe_1690209346820/_test_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_pla
cehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_p/lib/python3.10/site-packages/torch/__init__.py", line 187, in _load_global_deps     
    raise err                                                                                                                                                                
  File "/home/conda/feedstock_root/build_artifacts/pytorch-recipe_1690209346820/_test_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_pla
cehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_p/lib/python3.10/site-packages/torch/__init__.py", line 168, in _load_global_deps     
    ctypes.CDLL(lib_path, mode=ctypes.RTLD_GLOBAL)                                                                                                                           
  File "/home/conda/feedstock_root/build_artifacts/pytorch-recipe_1690209346820/_test_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_pla
cehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_p/lib/python3.10/ctypes/__init__.py", line 374, in __init__                           
    self._handle = _dlopen(self._name, mode)                                                                                                                                 
OSError: libmkl_intel_lp64.so.1: cannot open shared object file: No such file or directory 

@ngam
Copy link
Contributor Author

ngam commented Jul 25, 2023

See #176 for updates

@ngam ngam mentioned this pull request Jul 31, 2023
5 tasks
@RaulPPelaez RaulPPelaez mentioned this pull request Aug 17, 2023
5 tasks
@shermansiu
Copy link

#176 was closed because the author didn't have enough time to "debug so much for a patch release."

@RaulPPelaez
Copy link
Contributor

At this point I think that going directly for 2.1.0 https://github.com/pytorch/pytorch/releases/tag/v2.1.0

Alas the problems I had compiling this one are probably still there.

Looking at the build instructions, it seems like they are somewhat simpler than before, though:
https://github.com/pytorch/pytorch#from-source

@hmaarrfk
Copy link
Contributor

hmaarrfk commented Nov 4, 2023

i'm not too sure what changed in 4 months but seems like something got fixed
#199

working on getting through this.

CPU build passed, so now onto everything else....

@jakirkham
Copy link
Member

Do we still want this given that PR ( #195 ) added 2.1.0?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants