
feat: yet another attempt to add windows builds #231

Draft: wants to merge 82 commits into main

Conversation

baszalmstra
Member

Checklist

  • Used a personal fork of the feedstock to propose changes
  • Bumped the build number (if the version is unchanged)
  • Reset the build number to 0 (if the version changed)
  • Re-rendered with the latest conda-smithy (Use the phrase @conda-forge-admin, please rerender in a comment in this PR for automated rerendering)
  • Ensured the license file is being packaged.

Fixes #32

This PR is another attempt to add Windows builds (see #134).

For now I have disabled all other builds so that the Windows part can be tested first. I made this PR a draft so we don't accidentally merge it.
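
For reference, the non-Windows variants are skipped via selectors in the recipe; a minimal sketch of that approach (the exact selectors used in this PR may differ):

build:
  skip: true  # [not win]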

@conda-forge-webservices
Contributor

Hi! This is the friendly automated conda-forge-linting service.

I just wanted to let you know that I linted all conda-recipes in your PR (recipe) and found it was in an excellent condition.

I do have some suggestions for making it better though...

For recipe:

  • It looks like the 'libtorch' output doesn't have any tests.

baszalmstra marked this pull request as draft on April 5, 2024, 13:00
recipe/meta.yaml (review comment, outdated and resolved)
@conda-forge-webservices
Contributor

Hi! This is the friendly automated conda-forge-linting service.

I wanted to let you know that I linted all conda-recipes in your PR (recipe) and found some lint.

Here's what I've got...

For recipe:

  • Old-style Python selectors (py27, py35, etc) are only available for Python 2.7, 3.4, 3.5, and 3.6. Please use explicit comparisons with the integer py, e.g. # [py==37] or # [py>=37]. See lines [54]

For recipe:

  • It looks like the 'libtorch' output doesn't have any tests.

@conda-forge-webservices
Contributor

Hi! This is the friendly automated conda-forge-linting service.

I just wanted to let you know that I linted all conda-recipes in your PR (recipe) and found it was in an excellent condition.

I do have some suggestions for making it better though...

For recipe:

  • It looks like the 'libtorch' output doesn't have any tests.

@baszalmstra
Member Author

Both pipelines failed because they ran out of disk space:

FAILED: caffe2/CMakeFiles/torch_cpu.dir/__/torch/csrc/jit/runtime/static/te_wrapper.cpp.obj 
C:\PROGRA~1\MICROS~2\2022\ENTERP~1\VC\Tools\MSVC\1429~1.301\bin\HostX64\x64\cl.exe  /nologo /TP -DAT_PER_OPERATOR_HEADERS -DCAFFE2_BUILD_MAIN_LIB -DCPUINFO_SUPPORTED_PLATFORM=1 -DFMT_HEADER_ONLY=1 -DMINIZ_DISABLE_ZIP_READER_CRC32_CHECKS -DNOMINMAX -DONNXIFI_ENABLE_EXT=1 -DONNX_ML=1 -DONNX_NAMESPACE=onnx_torch -DUSE_C10D_GLOO -DUSE_DISTRIBUTED -DUSE_EXTERNAL_MZCRC -DUSE_MIMALLOC -DWIN32_LEAN_AND_MEAN -D_CRT_SECURE_NO_DEPRECATE=1 -D_UCRT_LEGACY_INFINITY -Dtorch_cpu_EXPORTS -I%SRC_DIR%\build\aten\src -I%SRC_DIR%\aten\src -I%SRC_DIR%\build -I%SRC_DIR% -I%SRC_DIR%\third_party\onnx -I%SRC_DIR%\build\third_party\onnx -I%SRC_DIR%\third_party\foxi -I%SRC_DIR%\build\third_party\foxi -I%SRC_DIR%\third_party\mimalloc\include -I%SRC_DIR%\torch\csrc\api -I%SRC_DIR%\torch\csrc\api\include -I%SRC_DIR%\caffe2\aten\src\TH -I%SRC_DIR%\build\caffe2\aten\src\TH -I%SRC_DIR%\build\caffe2\aten\src -I%SRC_DIR%\build\caffe2\..\aten\src -I%SRC_DIR%\torch\csrc -I%SRC_DIR%\third_party\miniz-2.1.0 -I%SRC_DIR%\third_party\kineto\libkineto\include -I%SRC_DIR%\third_party\kineto\libkineto\src -I%SRC_DIR%\aten\src\ATen\.. -I%SRC_DIR%\c10\.. -I%SRC_DIR%\third_party\pthreadpool\include -I%SRC_DIR%\third_party\cpuinfo\include -I%SRC_DIR%\third_party\fbgemm\include -I%SRC_DIR%\third_party\fbgemm -I%SRC_DIR%\third_party\fbgemm\third_party\asmjit\src -I%SRC_DIR%\third_party\ittapi\src\ittnotify -I%SRC_DIR%\third_party\FP16\include -I%SRC_DIR%\third_party\fmt\include -I%SRC_DIR%\build\third_party\ideep\mkl-dnn\include -I%SRC_DIR%\third_party\ideep\mkl-dnn\src\..\include -I%SRC_DIR%\third_party\flatbuffers\include -external:I%SRC_DIR%\build\third_party\gloo -external:I%SRC_DIR%\cmake\..\third_party\gloo -external:I%SRC_DIR%\third_party\protobuf\src -external:I%SRC_DIR%\third_party\XNNPACK\include -external:I%SRC_DIR%\third_party\ittapi\include -external:I%SRC_DIR%\cmake\..\third_party\eigen -external:I%SRC_DIR%\third_party\ideep\mkl-dnn\include\oneapi\dnnl -external:I%SRC_DIR%\third_party\ideep\include -external:I%SRC_DIR%\caffe2 -external:W0 /DWIN32 /D_WINDOWS /GR /EHsc /bigobj /FS -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOCUPTI -DLIBKINETO_NOROCTRACER -DUSE_FBGEMM -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE /utf-8 /wd4624 /wd4068 /wd4067 /wd4267 /wd4661 /wd4717 /wd4244 /wd4804 /wd4273 -DHAVE_AVX512_CPU_DEFINITION -DHAVE_AVX2_CPU_DEFINITION /O2 /Ob2 /DNDEBUG /bigobj -DNDEBUG -std:c++17 -MD -DCAFFE2_USE_GLOO -DTH_HAVE_THREAD /EHsc /bigobj -O2 -DONNX_BUILD_MAIN_LIB -openmp:experimental /showIncludes /Focaffe2\CMakeFiles\torch_cpu.dir\__\torch\csrc\jit\runtime\static\te_wrapper.cpp.obj /Fdcaffe2\CMakeFiles\torch_cpu.dir\ /FS -c %SRC_DIR%\torch\csrc\jit\runtime\static\te_wrapper.cpp
%SRC_DIR%\torch\csrc\jit\runtime\static\te_wrapper.cpp : fatal error C1085: Cannot write compiler generated file: '%SRC_DIR%\build\caffe2\CMakeFiles\torch_cpu.dir\__\torch\csrc\jit\runtime\static\te_wrapper.cpp.obj': No space left on device

What would be the most idiomatic way to solve this issue?

@weiji14
Member

weiji14 commented Apr 6, 2024

Try following https://conda-forge.org/docs/maintainer/conda_forge_yml/#azure to clear some disk space. Set this in conda-forge.yml

azure:
  free_disk_space: true

and then rerender the feedstock.

@Tobias-Fischer
Contributor

Tobias-Fischer commented Apr 6, 2024

I think there's little we can do: the Azure free disk space setting is already enabled. I'd try and see whether these build locally. Perhaps there is a way to use the Quansight servers for Windows as well, the same way they are used for Linux builds? If not, and if there are volunteers to build these locally, that would be an option; I did that for aarch64 for a while for qt. conda-forge has a Windows server too, but disk space has always been quite restricted there as well, so it might be a bit of a pain.

@jakirkham
Member

Perhaps cross-compiling Windows from Linux is worth trying? Here is a different feedstock PR that does this (conda-forge/polars-feedstock#187).

If we were to use Quansight resources for Windows, being able to run the build on Linux (i.e. cross-compiling) would be very helpful.

@baszalmstra
Member Author

Try following conda-forge.org/docs/maintainer/conda_forge_yml/#azure to clear some disk space. Set this in conda-forge.yml

azure:
  free_disk_space: true

Sadly that's already set:

free_disk_space: true

I think there's little we can do: the Azure free disk space setting is already enabled. I'd try and see whether these build locally. Perhaps there is a way to use the Quansight servers for Windows as well, the same way they are used for Linux builds?

I assume you mean the runners provided through open-gpu-server by Quansight and MetroStar? This PR only builds the CPU-only version, but if we also start building for CUDA I think this is the only feasible way forward (the same goes for other related repositories like tensorflow). However, the open-gpu-server doesn't seem to provide any Windows images. Do you know who I should contact to get the ball rolling?

If not, and if there are volunteers to build these locally, that would be an option

That would be an option, but I'd prefer to automate and open-source things as much as possible. Having something hooked up to this repository would be ideal.

Perhaps cross-compiling Windows from Linux is worth trying?

The native code of the example you linked is Rust, which makes this much easier. I doubt this would be easy to achieve with pytorch.

@baszalmstra
Member Author

I also expect another error once actual linking starts. On my local machine that takes at least 16 GB of memory, and the CUDA version will most likely require more.

@jakirkham
Member

Perhaps cross-compiling Windows from Linux is worth trying?

The native code of the example you linked is Rust, which makes this much easier. I doubt this would be easy to achieve with pytorch.

If we don't try, we won't know

@baszalmstra
Member Author

If we don't try, we won't know

Although that is technically true, it's already hard enough to build pytorch natively, and adding cross-compilation into the mix seems to complicate this even further. I'd much rather first focus on getting native builds working, even if we need to modify the infrastructure to do so. I think having the ability to do resource-intensive Windows builds would be a huge benefit for the conda-forge ecosystem in general.

However, if all else fails cross-compiling seems like a worthwhile avenue to explore.

@bkpoon
Member

bkpoon commented Apr 6, 2024

One thing to try is to move the build from D:\ to a directory that you have write access to on C:\. I have done this on a personal feedstock where I needed much more disk space. You can modify your conda-forge.yml file with

azure:
  settings_win:
    variables:
      CONDA_BLD_PATH: C:\\Miniconda\\envs\\

You should have roughly 70 GB free on C:\.

@baszalmstra
Member Author

Thanks! I added that to the PR. I quickly searched GitHub and it seems C:\bld\ is used more often, so I tried that.
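
Roughly, that amounts to the following in conda-forge.yml (a sketch based on the settings_win pattern above; C:\\bld\\ is just the path I picked):

azure:
  settings_win:
    variables:
      CONDA_BLD_PATH: C:\\bld\\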

@bkpoon
Member

bkpoon commented Apr 6, 2024

Just make sure that the directory exists and is writable. Also, you need to rerender for the variable to be set. This comment should trigger the bot.

@conda-forge-admin, please rerender

@hmaarrfk
Contributor

hmaarrfk commented Apr 6, 2024

This PR only builds the CPU-only version, but if we also start building for CUDA I think this is the only feasible way forward (the same goes for other related repositories like tensorflow). However, the open-gpu-server doesn't seem to provide any Windows images. Do you know who I should contact to get the ball rolling?

A bit of history: back when this feedstock was created 6 years ago, pytorch upstream officially suggested that people install two distinct packages, pytorch-cpu or pytorch-gpu. Therefore it felt appropriate to create a pytorch-cpu package, since it would throw an error for those trying to install pytorch-gpu. These instructions have since changed upstream.

I personally feel that for Windows users, we would HURT their experience by not having a GPU package in 2024.

@baszalmstra
Member Author

I personally feel that for Windows users, we would HURT their experience by not having a GPU package in 2024.

Couldn't agree more. I started with CPU only to be able to make incremental progress. My goal is definitely to be able to build the CUDA version too!

@hmaarrfk
Contributor

hmaarrfk commented Apr 9, 2024

Well, a few things:

  1. I might try to build locally.
  2. Once a local build works for one Python version, I might try to enable the megabuilds. When you build locally, the pytorch library compilation is reused, so the overall build takes roughly "1.2x" time instead of "4x" time, since the library is not recompiled for each Python version.
  3. Try to enable CUDA using the CI.

Typically we "stop" the compilation on the CIs once we reach your stage (it seems to be working well enough...).

@Tobias-Fischer
Contributor

Hi @baszalmstra @hmaarrfk - do you have any updates on this? It would be amazing to see this happen :)!

@baszalmstra
Member Author

@Tobias-Fischer I'm still working on the CUDA builds, but it's a slow process because it takes ages to build them locally, so iteration times are suuuper slow.

In parallel we are also looking into getting large Windows runners into the conda-forge infrastructure.

@baszalmstra
Member Author

Small update:


I have something compiling locally. There are still lots of issues (for example, Windows builds of pytorch 2.1.2 don't compile with Python 3.12), but I'm making steady progress. Currently getting megabuilds to work. I will push when I have something reliably working.

@baszalmstra
Member Author

baszalmstra commented May 12, 2024

I got to the testing stage and noticed this:

- OMP_NUM_THREADS=4 python ./test/run_test.py || true # [not win and not (aarch64 and cuda_compiler_version != "None")]

However, this seems to always fail with the following (this is from the logs of the latest release):

Ignoring disabled issues:  ['']
Unable to import boto3. Will not be emitting metrics.... Reason: No module named 'boto3'
Missing pip dependency: pytest-rerunfailures, please run `pip install -r .ci/docker/requirements-ci.txt`

Some dependencies are missing. In particular:

  • pytest-rerunfailures
  • pytest-shard (not on conda-forge)
  • pytest-flakefinder (not on conda-forge)
  • pytest-xdist

(as can be seen at https://github.com/pytorch/pytorch/blob/6c8c5ad5eaf47a62fafbb4a2747198cbffbf1ff0/test/run_test.py#L1705)

Given that the test is allowed to fail (due to || true), should we just remove it, or put in the effort to fix these tests?

@h-vetinari
Member

Given that the test is allowed to fail (due to || true), should we just remove it, or put in the effort to fix these tests?

The more we fix, the better. If it's really a lot of failures, we might not fix it right away (though depending on the severity of the failures, we might want to think twice about releasing something in that state).

In any case, let's leave the testing in, add the required dependencies, and pick up as many fixes as we can.
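
A rough sketch of what adding those dependencies to the test section of meta.yaml could look like (package names taken from the list above; pytest-shard and pytest-flakefinder would first need to become available on conda-forge):

test:
  requires:
    - pytest
    - pytest-rerunfailures
    - pytest-xdist
    - pytest-shard        # not on conda-forge yet
    - pytest-flakefinder  # not on conda-forge yet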

recipe/meta.yaml (review comment, outdated and resolved)
@h-vetinari
Member

Looks like we're getting straight-up segfaults now:

============================= test session starts =============================
platform win32 -- Python 3.11.11, pytest-8.3.4, pluggy-1.5.0
rootdir: %SRC_DIR%
plugins: hypothesis-6.123.2, flakefinder-1.1.0, rerunfailures-15.0, xdist-3.6.1
Fatal Python error: Aborted

Thread 0x00001950 (most recent call first):
  File "C:\bld\libtorch_1735641247947\_test_env\Lib\socket.py", line 294 in accept
  File "C:\bld\libtorch_1735641247947\_test_env\Lib\site-packages\pytest_rerunfailures.py", line 440 in run_server
  File "C:\bld\libtorch_1735641247947\_test_env\Lib\threading.py", line 982 in run
  File "C:\bld\libtorch_1735641247947\_test_env\Lib\threading.py", line 1045 in _bootstrap_inner
  File "C:\bld\libtorch_1735641247947\_test_env\Lib\threading.py", line 1002 in _bootstrap

Current thread 0x00001adc (most recent call first):
  File "C:\bld\libtorch_1735641247947\_test_env\Lib\site-packages\torch\testing\_internal\common_nn.py", line 362 in kldivloss_with_log_target_no_reduce_test
  File "C:\bld\libtorch_1735641247947\_test_env\Lib\site-packages\torch\testing\_internal\common_nn.py", line 1091 in <module>
  [...]

@mgorny
Contributor

mgorny commented Dec 31, 2024

I don't know how segfaults manifest on Windows, but on Linux "Aborted" usually means a failed assertion. You can then use pytest -s to get the actual assertion message, since it is normally captured by pytest.

@h-vetinari
Member

Pytest should produce regular reporting for failed tests. The "aborted" is an irregular shutdown.

h-vetinari and others added 2 commits December 31, 2024 19:02
Co-authored-by: Michał Górny <mgorny@gentoo.org>
@Tobias-Fischer
Contributor

Error for reference:

2024-12-31T22:08:38.0509971Z OMP: Error #15: Initializing libiomp5md.dll, but found libomp140.x86_64.dll already initialized.
2024-12-31T22:08:38.0513317Z OMP: Hint This means that multiple copies of the OpenMP runtime have been linked into the program. That is dangerous, since it can degrade performance or cause incorrect results. The best thing to do is to ensure that only a single OpenMP runtime is linked into the process, e.g. by avoiding static linking of the OpenMP runtime in any library. As an unsafe, unsupported, undocumented workaround you can set the environment variable KMP_DUPLICATE_LIB_OK=TRUE to allow the program to continue to execute, but that may cause crashes or silently produce incorrect results. For more information, please see http://www.intel.com/software/products/support/.
2024-12-31T22:08:38.5588572Z Fatal Python error: Aborted

I'll see if it also happens with a non-mkl build.

@Tobias-Fischer
Contributor

@conda-forge-admin please rerender

conda-forge-webservices[bot] and others added 2 commits December 31, 2024 22:27
@h-vetinari
Member

I'll see if it also happens with a non-mkl build.

It's another reason why we should get rid of intel-openmp already for this PR

@Tobias-Fischer
Contributor

I'll see if it also happens with a non-mkl build.

It's another reason why we should get rid of intel-openmp already for this PR

Unfortunately, I still don't know enough to fix this; if you know what to do, please go ahead (or provide guidance/pointers on what to do).

@Tobias-Fischer
Contributor

We are back to the cuda_version issue:

2025-01-01T02:39:35.0198872Z ___________________ ERROR collecting test/test_autograd.py ____________________
2025-01-01T02:39:35.0199315Z test\test_autograd.py:62: in <module>
2025-01-01T02:39:35.0199822Z     from torch.testing._internal.common_methods_invocations import mask_not_all_zeros
2025-01-01T02:39:35.0200527Z ..\_test_env\lib\site-packages\torch\testing\_internal\common_methods_invocations.py:54: in <module>
2025-01-01T02:39:35.0201155Z     from torch.testing._internal.opinfo.core import (  # noqa: F401
2025-01-01T02:39:35.0201714Z ..\_test_env\lib\site-packages\torch\testing\_internal\opinfo\__init__.py:4: in <module>
2025-01-01T02:39:35.0202179Z     import torch.testing._internal.opinfo.definitions
2025-01-01T02:39:35.0202639Z ..\_test_env\lib\site-packages\torch\testing\_internal\opinfo\definitions\__init__.py:6: in <module>
2025-01-01T02:39:35.0203112Z     from torch.testing._internal.opinfo.definitions import (
2025-01-01T02:39:35.0203593Z ..\_test_env\lib\site-packages\torch\testing\_internal\opinfo\definitions\linalg.py:1585: in <module>
2025-01-01T02:39:35.0204088Z     _get_torch_cuda_version() < (11, 4), "not available before CUDA 11.3.1"
2025-01-01T02:39:35.0205125Z ..\_test_env\lib\site-packages\torch\testing\_internal\common_cuda.py:240: in _get_torch_cuda_version
2025-01-01T02:39:35.0205867Z     return tuple(int(x) for x in cuda_version.split("."))
2025-01-01T02:39:35.0206289Z ..\_test_env\lib\site-packages\torch\testing\_internal\common_cuda.py:240: in <genexpr>
2025-01-01T02:39:35.0206709Z     return tuple(int(x) for x in cuda_version.split("."))
2025-01-01T02:39:35.0207047Z E   ValueError: invalid literal for int() with base 10: 'None'

/cc @isuruf

@mgorny
Contributor

mgorny commented Jan 1, 2025

Looks like there's some memory corruption after all:

2025-01-01T10:31:54.7488844Z Windows fatal exception: code 0xc0000374
2025-01-01T10:31:54.7489138Z 
2025-01-01T10:31:54.7489270Z Thread 0x00001018 (most recent call first):
2025-01-01T10:31:54.7489844Z   File "C:\bld\libtorch_1735710398339\_test_env\Lib\socket.py", line 295 in accept
2025-01-01T10:31:54.7490764Z   File "C:\bld\libtorch_1735710398339\_test_env\Lib\site-packages\pytest_rerunfailures.py", line 440 in run_server
2025-01-01T10:31:54.7491680Z   File "C:\bld\libtorch_1735710398339\_test_env\Lib\threading.py", line 1012 in run
2025-01-01T10:31:54.7492488Z   File "C:\bld\libtorch_1735710398339\_test_env\Lib\threading.py", line 1075 in _bootstrap_inner
2025-01-01T10:31:54.7493456Z   File "C:\bld\libtorch_1735710398339\_test_env\Lib\threading.py", line 1032 in _bootstrap
2025-01-01T10:31:54.7494004Z 
2025-01-01T10:31:54.7494169Z Current thread 0x00000f90 (most recent call first):
2025-01-01T10:31:54.7495002Z   File "C:\bld\libtorch_1735710398339\_test_env\Lib\site-packages\torch\autograd\graph.py", line 825 in _engine_run_backward
2025-01-01T10:31:54.7496139Z   File "C:\bld\libtorch_1735710398339\_test_env\Lib\site-packages\torch\autograd\__init__.py", line 347 in backward
2025-01-01T10:31:54.7497158Z   File "C:\bld\libtorch_1735710398339\_test_env\Lib\site-packages\torch\_tensor.py", line 581 in backward
2025-01-01T10:31:54.7498032Z   File "C:\bld\libtorch_1735710398339\test_tmp\test\test_nn.py", line 87 in _backward
2025-01-01T10:31:54.7499057Z   File "C:\bld\libtorch_1735710398339\_test_env\Lib\site-packages\torch\testing\_internal\common_nn.py", line 3482 in test_noncontig
2025-01-01T10:31:54.7500290Z   File "C:\bld\libtorch_1735710398339\_test_env\Lib\site-packages\torch\testing\_internal\common_nn.py", line 3432 in __call__
2025-01-01T10:31:54.7501266Z   File "C:\bld\libtorch_1735710398339\test_tmp\test\test_nn.py", line 7213 in <lambda>
2025-01-01T10:31:54.7502265Z   File "C:\bld\libtorch_1735710398339\_test_env\Lib\site-packages\torch\testing\_internal\common_utils.py", line 2979 in wrapper
2025-01-01T10:31:54.7503313Z   File "C:\bld\libtorch_1735710398339\_test_env\Lib\unittest\case.py", line 589 in _callTestMethod
2025-01-01T10:31:54.7504131Z   File "C:\bld\libtorch_1735710398339\_test_env\Lib\unittest\case.py", line 634 in run
2025-01-01T10:31:54.7505735Z   File "C:\bld\libtorch_1735710398339\_test_env\Lib\site-packages\torch\testing\_internal\common_utils.py", line 3084 in _run_custom
2025-01-01T10:31:54.7507308Z   File "C:\bld\libtorch_1735710398339\_test_env\Lib\site-packages\torch\testing\_internal\common_utils.py", line 3112 in run
2025-01-01T10:31:54.7508349Z   File "C:\bld\libtorch_1735710398339\_test_env\Lib\unittest\case.py", line 690 in __call__
2025-01-01T10:31:54.7509233Z   File "C:\bld\libtorch_1735710398339\_test_env\Lib\site-packages\_pytest\unittest.py", line 351 in runtest
2025-01-01T10:31:54.7510261Z   File "C:\bld\libtorch_1735710398339\_test_env\Lib\site-packages\_pytest\runner.py", line 174 in pytest_runtest_call
2025-01-01T10:31:54.7511299Z   File "C:\bld\libtorch_1735710398339\_test_env\Lib\site-packages\pluggy\_callers.py", line 103 in _multicall
2025-01-01T10:31:54.7512269Z   File "C:\bld\libtorch_1735710398339\_test_env\Lib\site-packages\pluggy\_manager.py", line 120 in _hookexec
2025-01-01T10:31:54.7513230Z   File "C:\bld\libtorch_1735710398339\_test_env\Lib\site-packages\pluggy\_hooks.py", line 513 in __call__
2025-01-01T10:31:54.7514183Z   File "C:\bld\libtorch_1735710398339\_test_env\Lib\site-packages\_pytest\runner.py", line 242 in <lambda>
2025-01-01T10:31:54.7515145Z   File "C:\bld\libtorch_1735710398339\_test_env\Lib\site-packages\_pytest\runner.py", line 341 in from_call
2025-01-01T10:31:54.7516210Z   File "C:\bld\libtorch_1735710398339\_test_env\Lib\site-packages\_pytest\runner.py", line 241 in call_and_report
2025-01-01T10:31:54.7517267Z   File "C:\bld\libtorch_1735710398339\_test_env\Lib\site-packages\_pytest\runner.py", line 132 in runtestprotocol
2025-01-01T10:31:54.7518362Z   File "C:\bld\libtorch_1735710398339\_test_env\Lib\site-packages\_pytest\runner.py", line 113 in pytest_runtest_protocol
2025-01-01T10:31:54.7519417Z   File "C:\bld\libtorch_1735710398339\_test_env\Lib\site-packages\pluggy\_callers.py", line 103 in _multicall
2025-01-01T10:31:54.7520394Z   File "C:\bld\libtorch_1735710398339\_test_env\Lib\site-packages\pluggy\_manager.py", line 120 in _hookexec
2025-01-01T10:31:54.7521352Z   File "C:\bld\libtorch_1735710398339\_test_env\Lib\site-packages\pluggy\_hooks.py", line 513 in __call__
2025-01-01T10:31:54.7522351Z   File "C:\bld\libtorch_1735710398339\_test_env\Lib\site-packages\_pytest\main.py", line 362 in pytest_runtestloop
2025-01-01T10:31:54.7523449Z   File "C:\bld\libtorch_1735710398339\_test_env\Lib\site-packages\pluggy\_callers.py", line 103 in _multicall
2025-01-01T10:31:54.7524474Z   File "C:\bld\libtorch_1735710398339\_test_env\Lib\site-packages\pluggy\_manager.py", line 120 in _hookexec
2025-01-01T10:31:54.7525425Z   File "C:\bld\libtorch_1735710398339\_test_env\Lib\site-packages\pluggy\_hooks.py", line 513 in __call__
2025-01-01T10:31:54.7526336Z   File "C:\bld\libtorch_1735710398339\_test_env\Lib\site-packages\_pytest\main.py", line 337 in _main
2025-01-01T10:31:54.7527282Z   File "C:\bld\libtorch_1735710398339\_test_env\Lib\site-packages\_pytest\main.py", line 283 in wrap_session
2025-01-01T10:31:54.7528369Z   File "C:\bld\libtorch_1735710398339\_test_env\Lib\site-packages\_pytest\main.py", line 330 in pytest_cmdline_main
2025-01-01T10:31:54.7529412Z   File "C:\bld\libtorch_1735710398339\_test_env\Lib\site-packages\pluggy\_callers.py", line 103 in _multicall
2025-01-01T10:31:54.7530396Z   File "C:\bld\libtorch_1735710398339\_test_env\Lib\site-packages\pluggy\_manager.py", line 120 in _hookexec
2025-01-01T10:31:54.7531357Z   File "C:\bld\libtorch_1735710398339\_test_env\Lib\site-packages\pluggy\_hooks.py", line 513 in __call__
2025-01-01T10:31:54.7532335Z   File "C:\bld\libtorch_1735710398339\_test_env\Lib\site-packages\_pytest\config\__init__.py", line 175 in main
2025-01-01T10:31:54.7533384Z   File "C:\bld\libtorch_1735710398339\_test_env\Lib\site-packages\_pytest\config\__init__.py", line 201 in console_main
2025-01-01T10:31:54.7534403Z   File "C:\bld\libtorch_1735710398339\_test_env\Lib\site-packages\pytest\__main__.py", line 9 in <module>
2025-01-01T10:31:54.7535150Z   File "<frozen runpy>", line 88 in _run_code
2025-01-01T10:31:54.7535631Z   File "<frozen runpy>", line 198 in _run_module_as_main

@mgorny
Contributor

mgorny commented Jan 2, 2025

Okay, so I don't think the issues we're seeing are trivial to fix, but I think the overall build architecture is good. And I'd like to give rattler-build a shot here without creating merge conflicts. So I'd like to propose the following:

  1. I prepare a fix for non-Windows builds here: either remove the sccache bits from build.sh and make the dependency specific to win (see the sketch after this list), or fix them. To be honest, I'd lean towards the former, as I think sccache is only needed here because Windows builds don't reuse the build tree the way Linux builds do, and it conflicts with my local use of ccache.
  2. We temporarily disable Windows builds again and merge this, perhaps into #309 (Add triton dependency, readd cudss and cusparselt, mention dev speedup tricks in the README) to avoid double rebuilds.
  3. You can work on trying to fix the test issues (I suspect this will mostly involve patches), while I work on getting rattler-build working.
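
For (1), a minimal sketch of making the dependency Windows-only in meta.yaml (assuming sccache currently sits unconditionally in the build requirements; the actual recipe may differ):

requirements:
  build:
    - sccache  # [win]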

WDYT?

@h-vetinari
Member

What concrete benefit do you foresee from a switch to rattler-build? This has nothing to do with the test errors. I'd prefer to keep this on conda-build for now, not least because the Anaconda folks are planning to contribute some of their recipe bits, and this will surely be easier if we stay on the same format.

For the rest of the plan I'm fine: we can merge this PR as long as the Windows builds are still disabled and the Linux builds are passing (and not breaking your local workflow).

@mgorny
Contributor

mgorny commented Jan 2, 2025

Well, it will definitely make my local testing faster: given that ccache makes the actual build very fast, dependency resolution and other processing in conda-build is the bottleneck. But mostly, I'm worried that the growing complexity of the recipe is making the eventual conversion harder, so I'd rather do it sooner rather than later, so that future changes are already made in the new format.

Of course, I'm not insisting. There's definitely some cost involved, like the ugly comment form seen in triton-feedstock now, or the fact that the {% for ... %} blocks will probably have to be inlined.

@mgorny
Contributor

mgorny commented Jan 2, 2025

Hmm, I don't think we have python312_d.lib in Conda-forge (or at least streamlit can't find it).

As a data point, it looks like the upstream build from PyPI doesn't crash — test\test_modules.py passes on my laptop 100%. Not sure how valuable a data point that would be, but perhaps it would make sense to try pip install --force-reinstall torch in the test pipeline, to confirm it's definitely something about the build rather than the environment.
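
A throwaway diagnostic along these lines in the test section of meta.yaml could do that (a sketch only, not something to merge; it simply swaps in the upstream wheel for comparison):

test:
  requires:
    - pip
  commands:
    - pip install --force-reinstall torch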

@mgorny
Contributor

mgorny commented Jan 2, 2025

Looking at https://github.com/pytorch/pytorch/actions/runs/11471282816/job/31928694532, upstream is building with mkl as the BLAS backend and libiomp5md rather than libomp.

@h-vetinari
Member

The OpenMP change I was talking about is not difficult at all: just replace intel-openmp with llvm-openmp.
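
Roughly, that would look like the following in the Windows requirements (a sketch; which sections currently list intel-openmp may differ in the actual recipe):

requirements:
  host:
    - llvm-openmp  # [win]
  run:
    - llvm-openmp  # [win]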
