
feat: yet another attempt to add windows builds #231

Draft: wants to merge 82 commits into main

Conversation

baszalmstra
Member

Checklist

  • Used a personal fork of the feedstock to propose changes
  • Bumped the build number (if the version is unchanged)
  • Reset the build number to 0 (if the version changed)
  • Re-rendered with the latest conda-smithy (Use the phrase @conda-forge-admin, please rerender in a comment in this PR for automated rerendering)
  • Ensured the license file is being packaged.

Fixes #32

This PR is another attempt to add Windows builds (see #134).

For now I have disabled all other builds so that the Windows part can be tested first. I made this PR a draft so we don't accidentally merge it.
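
For reference, the non-Windows variants are skipped via selectors in the recipe; a minimal sketch of that approach (the exact selectors used in this PR may differ):

build:
  skip: true  # [not win]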

@conda-forge-webservices
Contributor

Hi! This is the friendly automated conda-forge-linting service.

I just wanted to let you know that I linted all conda-recipes in your PR (recipe) and found it was in an excellent condition.

I do have some suggestions for making it better though...

For recipe:

  • It looks like the 'libtorch' output doesn't have any tests.

baszalmstra marked this pull request as draft on April 5, 2024, 13:00
recipe/meta.yaml (review comment, outdated and resolved)
@conda-forge-webservices
Contributor

Hi! This is the friendly automated conda-forge-linting service.

I wanted to let you know that I linted all conda-recipes in your PR (recipe) and found some lint.

Here's what I've got...

For recipe:

  • Old-style Python selectors (py27, py35, etc) are only available for Python 2.7, 3.4, 3.5, and 3.6. Please use explicit comparisons with the integer py, e.g. # [py==37] or # [py>=37]. See lines [54]

For recipe:

  • It looks like the 'libtorch' output doesn't have any tests.

@conda-forge-webservices
Contributor

Hi! This is the friendly automated conda-forge-linting service.

I just wanted to let you know that I linted all conda-recipes in your PR (recipe) and found it was in an excellent condition.

I do have some suggestions for making it better though...

For recipe:

  • It looks like the 'libtorch' output doesn't have any tests.

@baszalmstra
Member Author

Both pipelines failed because they ran out of disk space:

FAILED: caffe2/CMakeFiles/torch_cpu.dir/__/torch/csrc/jit/runtime/static/te_wrapper.cpp.obj 
C:\PROGRA~1\MICROS~2\2022\ENTERP~1\VC\Tools\MSVC\1429~1.301\bin\HostX64\x64\cl.exe  /nologo /TP -DAT_PER_OPERATOR_HEADERS -DCAFFE2_BUILD_MAIN_LIB -DCPUINFO_SUPPORTED_PLATFORM=1 -DFMT_HEADER_ONLY=1 -DMINIZ_DISABLE_ZIP_READER_CRC32_CHECKS -DNOMINMAX -DONNXIFI_ENABLE_EXT=1 -DONNX_ML=1 -DONNX_NAMESPACE=onnx_torch -DUSE_C10D_GLOO -DUSE_DISTRIBUTED -DUSE_EXTERNAL_MZCRC -DUSE_MIMALLOC -DWIN32_LEAN_AND_MEAN -D_CRT_SECURE_NO_DEPRECATE=1 -D_UCRT_LEGACY_INFINITY -Dtorch_cpu_EXPORTS -I%SRC_DIR%\build\aten\src -I%SRC_DIR%\aten\src -I%SRC_DIR%\build -I%SRC_DIR% -I%SRC_DIR%\third_party\onnx -I%SRC_DIR%\build\third_party\onnx -I%SRC_DIR%\third_party\foxi -I%SRC_DIR%\build\third_party\foxi -I%SRC_DIR%\third_party\mimalloc\include -I%SRC_DIR%\torch\csrc\api -I%SRC_DIR%\torch\csrc\api\include -I%SRC_DIR%\caffe2\aten\src\TH -I%SRC_DIR%\build\caffe2\aten\src\TH -I%SRC_DIR%\build\caffe2\aten\src -I%SRC_DIR%\build\caffe2\..\aten\src -I%SRC_DIR%\torch\csrc -I%SRC_DIR%\third_party\miniz-2.1.0 -I%SRC_DIR%\third_party\kineto\libkineto\include -I%SRC_DIR%\third_party\kineto\libkineto\src -I%SRC_DIR%\aten\src\ATen\.. -I%SRC_DIR%\c10\.. -I%SRC_DIR%\third_party\pthreadpool\include -I%SRC_DIR%\third_party\cpuinfo\include -I%SRC_DIR%\third_party\fbgemm\include -I%SRC_DIR%\third_party\fbgemm -I%SRC_DIR%\third_party\fbgemm\third_party\asmjit\src -I%SRC_DIR%\third_party\ittapi\src\ittnotify -I%SRC_DIR%\third_party\FP16\include -I%SRC_DIR%\third_party\fmt\include -I%SRC_DIR%\build\third_party\ideep\mkl-dnn\include -I%SRC_DIR%\third_party\ideep\mkl-dnn\src\..\include -I%SRC_DIR%\third_party\flatbuffers\include -external:I%SRC_DIR%\build\third_party\gloo -external:I%SRC_DIR%\cmake\..\third_party\gloo -external:I%SRC_DIR%\third_party\protobuf\src -external:I%SRC_DIR%\third_party\XNNPACK\include -external:I%SRC_DIR%\third_party\ittapi\include -external:I%SRC_DIR%\cmake\..\third_party\eigen -external:I%SRC_DIR%\third_party\ideep\mkl-dnn\include\oneapi\dnnl -external:I%SRC_DIR%\third_party\ideep\include -external:I%SRC_DIR%\caffe2 -external:W0 /DWIN32 /D_WINDOWS /GR /EHsc /bigobj /FS -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOCUPTI -DLIBKINETO_NOROCTRACER -DUSE_FBGEMM -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE /utf-8 /wd4624 /wd4068 /wd4067 /wd4267 /wd4661 /wd4717 /wd4244 /wd4804 /wd4273 -DHAVE_AVX512_CPU_DEFINITION -DHAVE_AVX2_CPU_DEFINITION /O2 /Ob2 /DNDEBUG /bigobj -DNDEBUG -std:c++17 -MD -DCAFFE2_USE_GLOO -DTH_HAVE_THREAD /EHsc /bigobj -O2 -DONNX_BUILD_MAIN_LIB -openmp:experimental /showIncludes /Focaffe2\CMakeFiles\torch_cpu.dir\__\torch\csrc\jit\runtime\static\te_wrapper.cpp.obj /Fdcaffe2\CMakeFiles\torch_cpu.dir\ /FS -c %SRC_DIR%\torch\csrc\jit\runtime\static\te_wrapper.cpp
%SRC_DIR%\torch\csrc\jit\runtime\static\te_wrapper.cpp : fatal error C1085: Cannot write compiler generated file: '%SRC_DIR%\build\caffe2\CMakeFiles\torch_cpu.dir\__\torch\csrc\jit\runtime\static\te_wrapper.cpp.obj': No space left on device

What would be the most idiomatic way to solve this issue?

@weiji14
Member

weiji14 commented Apr 6, 2024

Try following https://conda-forge.org/docs/maintainer/conda_forge_yml/#azure to clear some disk space. Set this in conda-forge.yml

azure:
  free_disk_space: true

and then rerender the feedstock.

@Tobias-Fischer
Contributor

Tobias-Fischer commented Apr 6, 2024

I think there's little we can do: the Azure free disk space setting is already enabled. I'd try and see whether these build locally. Perhaps there is a way to use the Quansight servers for Windows as well, the same way they are used for Linux builds? If not, and if there are volunteers to build these locally, that would be an option; I did that for aarch64 for a while for qt. conda-forge has a Windows server too, but disk space has always been quite restricted there as well, so it might be a bit of a pain.

@jakirkham
Member

Perhaps cross-compiling Windows from Linux is worth trying? Here is a different feedstock PR that does this (conda-forge/polars-feedstock#187).

If we were to use Quansight resources for Windows, being able to run the build on Linux (i.e. cross-compiling) would be very helpful.

@baszalmstra
Member Author

Try following conda-forge.org/docs/maintainer/conda_forge_yml/#azure to clear some disk space. Set this in conda-forge.yml

azure:
  free_disk_space: true

Sadly that's already set:

free_disk_space: true

I think there's little we can do: the Azure free disk space setting is already enabled. I'd try and see whether these build locally. Perhaps there is a way to use the Quansight servers for Windows as well, the same way they are used for Linux builds?

I assume you mean the runners provided through open-gpu-server by Quansight and MetroStar? This PR only builds the CPU-only version, but if we also start building for CUDA I think this is the only feasible way forward (the same goes for other related repositories like tensorflow). However, the open-gpu-server doesn't seem to provide any Windows images. Do you know who I should contact to get the ball rolling?

If not, and if there are volunteers to build these locally, that would be an option

That would be an option, but I'd prefer to automate and open-source things as much as possible. Having something hooked up to this repository would be ideal.

Perhaps cross-compiling Windows from Linux is worth trying?

The native code of the example you linked is Rust, which makes this much easier. I doubt this would be easy to achieve with pytorch.

@baszalmstra
Member Author

I also expect another error once actual linking starts. On my local machine that takes at least 16 GB of memory, and the CUDA version will most likely require more.

@jakirkham
Member

Perhaps cross-compiling Windows from Linux is worth trying?

The native code of the example you linked is Rust, which makes this much easier. I doubt this would be easy to achieve with pytorch.

If we don't try, we won't know

@baszalmstra
Member Author

If we don't try, we won't know

Although that is technically true, it's already hard enough to build pytorch natively, and adding cross-compilation into the mix seems to complicate this even further. I'd much rather first focus on getting native builds working, even if we need to modify the infrastructure to do so. I think having the ability to do resource-intensive Windows builds would be a huge benefit for the conda-forge ecosystem in general.

However, if all else fails cross-compiling seems like a worthwhile avenue to explore.

@bkpoon
Member

bkpoon commented Apr 6, 2024

One thing to try is to move the build from D:\ to a directory that you have write access to on C:\. I have done this on a personal feedstock where I needed much more disk space. You can modify your conda-forge.yml file with

azure:
  settings_win:
    variables:
      CONDA_BLD_PATH: C:\\Miniconda\\envs\\

You should have roughly 70 GB free on C:\.

@baszalmstra
Member Author

Thanks! I added that to the PR. I quickly searched GitHub and it seems C:\bld\ is used more often, so I tried that.
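
Roughly, that amounts to the following in conda-forge.yml (a sketch based on the settings_win pattern above; C:\\bld\\ is just the path I picked):

azure:
  settings_win:
    variables:
      CONDA_BLD_PATH: C:\\bld\\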

@bkpoon
Member

bkpoon commented Apr 6, 2024

Just make sure that the directory exists and is writable. Also, you need to rerender for the variable to be set. This comment should trigger the bot.

@conda-forge-admin, please rerender

@hmaarrfk
Contributor

hmaarrfk commented Apr 6, 2024

This PR only builds the CPU-only version, but if we also start building for CUDA I think this is the only feasible way forward (the same goes for other related repositories like tensorflow). However, the open-gpu-server doesn't seem to provide any Windows images. Do you know who I should contact to get the ball rolling?

A bit of history: back when this feedstock was created 6 years ago, pytorch upstream officially suggested that people install two distinct packages, pytorch-cpu or pytorch-gpu. Therefore it felt appropriate to create a pytorch-cpu package, since it would throw an error for those trying to install pytorch-gpu. These instructions have since changed upstream.

I personally feel that for Windows users, we would HURT their experience by not having a GPU package in 2024.

@baszalmstra
Member Author

I personally feel that for Windows users, we would HURT their experience by not having a GPU package in 2024.

Couldn't agree more. I started with CPU only to be able to make incremental progress. My goal is definitely to be able to build the CUDA version too!

@hmaarrfk
Contributor

hmaarrfk commented Apr 9, 2024

Well, a few things:

  1. I might try to build locally.
  2. Once a local build works for one Python version, I might try to enable the megabuilds. When you build locally, the pytorch library compilation is reused, so the overall build takes roughly "1.2x" time instead of "4x" time, since the library is not recompiled for each Python version.
  3. Try to enable CUDA using the CI.

Typically we "stop" the compilation on the CIs once we reach your stage (it seems to be working well enough...).

@Tobias-Fischer
Contributor

Hi @baszalmstra @hmaarrfk - do you have any updates on this? It would be amazing to see this happen :)!

@baszalmstra
Member Author

@Tobias-Fischer I'm still working on the CUDA builds, but it's a slow process because it takes ages to build them locally, so iteration times are suuuper slow.

In parallel we are also looking into getting large Windows runners into the conda-forge infrastructure.

@baszalmstra
Member Author

Small update:


I have something compiling locally. There are still lots of issues (for example, Windows builds of pytorch 2.1.2 don't compile with Python 3.12), but I'm making steady progress. Currently getting megabuilds to work. I will push when I have something reliably working.

@baszalmstra
Member Author

baszalmstra commented May 12, 2024

I got to the testing stage and noticed this:

- OMP_NUM_THREADS=4 python ./test/run_test.py || true # [not win and not (aarch64 and cuda_compiler_version != "None")]

However, this seems to always fail with the following (this is from the logs of the latest release):

Ignoring disabled issues:  ['']
Unable to import boto3. Will not be emitting metrics.... Reason: No module named 'boto3'
Missing pip dependency: pytest-rerunfailures, please run `pip install -r .ci/docker/requirements-ci.txt`

Some dependencies are missing. In particular:

  • pytest-rerunfailures
  • pytest-shard (not on conda-forge)
  • pytest-flakefinder (not on conda-forge)
  • pytest-xdist

(as can be seen at https://github.com/pytorch/pytorch/blob/6c8c5ad5eaf47a62fafbb4a2747198cbffbf1ff0/test/run_test.py#L1705)

Given that the test is allowed to fail (due to || true), should we just remove it, or put in the effort to fix these tests?

@h-vetinari
Member

Given that the test is allowed to fail (due to || true), should we just remove it, or put in the effort to fix these tests?

The more we fix, the better. If it's really a lot of failures, we might not fix it right away (though depending on the severity of the failures, we might want to think twice about releasing something in that state).

In any case, let's leave the testing in, add the required dependencies, and pick up as many fixes as we can.
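
A rough sketch of what adding those dependencies to the test section of meta.yaml could look like (package names taken from the list above; pytest-shard and pytest-flakefinder would first need to become available on conda-forge):

test:
  requires:
    - pytest
    - pytest-rerunfailures
    - pytest-xdist
    - pytest-shard        # not on conda-forge yet
    - pytest-flakefinder  # not on conda-forge yet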

recipe/meta.yaml (review comment, outdated and resolved)
@h-vetinari
Member

Looks like we're getting straight-up segfaults now:

============================= test session starts =============================
platform win32 -- Python 3.11.11, pytest-8.3.4, pluggy-1.5.0
rootdir: %SRC_DIR%
plugins: hypothesis-6.123.2, flakefinder-1.1.0, rerunfailures-15.0, xdist-3.6.1
Fatal Python error: Aborted

Thread 0x00001950 (most recent call first):
  File "C:\bld\libtorch_1735641247947\_test_env\Lib\socket.py", line 294 in accept
  File "C:\bld\libtorch_1735641247947\_test_env\Lib\site-packages\pytest_rerunfailures.py", line 440 in run_server
  File "C:\bld\libtorch_1735641247947\_test_env\Lib\threading.py", line 982 in run
  File "C:\bld\libtorch_1735641247947\_test_env\Lib\threading.py", line 1045 in _bootstrap_inner
  File "C:\bld\libtorch_1735641247947\_test_env\Lib\threading.py", line 1002 in _bootstrap

Current thread 0x00001adc (most recent call first):
  File "C:\bld\libtorch_1735641247947\_test_env\Lib\site-packages\torch\testing\_internal\common_nn.py", line 362 in kldivloss_with_log_target_no_reduce_test
  File "C:\bld\libtorch_1735641247947\_test_env\Lib\site-packages\torch\testing\_internal\common_nn.py", line 1091 in <module>
  [...]

@mgorny
Contributor

mgorny commented Dec 31, 2024

I don't know how segfaults manifest on Windows, but on Linux "Aborted" usually means a failed assertion. You can then use pytest -s to get the actual assertion message, since it is normally captured by pytest.

@h-vetinari
Member

Pytest should produce regular reporting for failed tests. The "aborted" is an irregular shutdown.

h-vetinari and others added 2 commits December 31, 2024 19:02
Co-authored-by: Michał Górny <mgorny@gentoo.org>
@Tobias-Fischer
Contributor

Error for reference:

2024-12-31T22:08:38.0509971Z OMP: Error #15: Initializing libiomp5md.dll, but found libomp140.x86_64.dll already initialized.
2024-12-31T22:08:38.0513317Z OMP: Hint This means that multiple copies of the OpenMP runtime have been linked into the program. That is dangerous, since it can degrade performance or cause incorrect results. The best thing to do is to ensure that only a single OpenMP runtime is linked into the process, e.g. by avoiding static linking of the OpenMP runtime in any library. As an unsafe, unsupported, undocumented workaround you can set the environment variable KMP_DUPLICATE_LIB_OK=TRUE to allow the program to continue to execute, but that may cause crashes or silently produce incorrect results. For more information, please see http://www.intel.com/software/products/support/.
2024-12-31T22:08:38.5588572Z Fatal Python error: Aborted

I'll see if it also happens with a non-mkl build.

@Tobias-Fischer
Contributor

@conda-forge-admin please rerender

conda-forge-webservices[bot] and others added 2 commits December 31, 2024 22:27
@h-vetinari
Member

I'll see if it also happens with a non-mkl build.

It's another reason why we should get rid of intel-openmp already for this PR

@Tobias-Fischer
Contributor

I'll see if it also happens with a non-mkl build.

It's another reason why we should get rid of intel-openmp already for this PR

Unfortunately, I still don't know enough to fix this; if you know what to do, please go ahead (or provide guidance/pointers on what to do).

@Tobias-Fischer
Contributor

We are back to the cuda_version issue:

2025-01-01T02:39:35.0198872Z ___________________ ERROR collecting test/test_autograd.py ____________________
2025-01-01T02:39:35.0199315Z test\test_autograd.py:62: in <module>
2025-01-01T02:39:35.0199822Z     from torch.testing._internal.common_methods_invocations import mask_not_all_zeros
2025-01-01T02:39:35.0200527Z ..\_test_env\lib\site-packages\torch\testing\_internal\common_methods_invocations.py:54: in <module>
2025-01-01T02:39:35.0201155Z     from torch.testing._internal.opinfo.core import (  # noqa: F401
2025-01-01T02:39:35.0201714Z ..\_test_env\lib\site-packages\torch\testing\_internal\opinfo\__init__.py:4: in <module>
2025-01-01T02:39:35.0202179Z     import torch.testing._internal.opinfo.definitions
2025-01-01T02:39:35.0202639Z ..\_test_env\lib\site-packages\torch\testing\_internal\opinfo\definitions\__init__.py:6: in <module>
2025-01-01T02:39:35.0203112Z     from torch.testing._internal.opinfo.definitions import (
2025-01-01T02:39:35.0203593Z ..\_test_env\lib\site-packages\torch\testing\_internal\opinfo\definitions\linalg.py:1585: in <module>
2025-01-01T02:39:35.0204088Z     _get_torch_cuda_version() < (11, 4), "not available before CUDA 11.3.1"
2025-01-01T02:39:35.0205125Z ..\_test_env\lib\site-packages\torch\testing\_internal\common_cuda.py:240: in _get_torch_cuda_version
2025-01-01T02:39:35.0205867Z     return tuple(int(x) for x in cuda_version.split("."))
2025-01-01T02:39:35.0206289Z ..\_test_env\lib\site-packages\torch\testing\_internal\common_cuda.py:240: in <genexpr>
2025-01-01T02:39:35.0206709Z     return tuple(int(x) for x in cuda_version.split("."))
2025-01-01T02:39:35.0207047Z E   ValueError: invalid literal for int() with base 10: 'None'

/cc @isuruf

@mgorny
Contributor

mgorny commented Jan 1, 2025

Looks like there's some memory corruption after all:

2025-01-01T10:31:54.7488844Z Windows fatal exception: code 0xc0000374
2025-01-01T10:31:54.7489138Z 
2025-01-01T10:31:54.7489270Z Thread 0x00001018 (most recent call first):
2025-01-01T10:31:54.7489844Z   File "C:\bld\libtorch_1735710398339\_test_env\Lib\socket.py", line 295 in accept
2025-01-01T10:31:54.7490764Z   File "C:\bld\libtorch_1735710398339\_test_env\Lib\site-packages\pytest_rerunfailures.py", line 440 in run_server
2025-01-01T10:31:54.7491680Z   File "C:\bld\libtorch_1735710398339\_test_env\Lib\threading.py", line 1012 in run
2025-01-01T10:31:54.7492488Z   File "C:\bld\libtorch_1735710398339\_test_env\Lib\threading.py", line 1075 in _bootstrap_inner
2025-01-01T10:31:54.7493456Z   File "C:\bld\libtorch_1735710398339\_test_env\Lib\threading.py", line 1032 in _bootstrap
2025-01-01T10:31:54.7494004Z 
2025-01-01T10:31:54.7494169Z Current thread 0x00000f90 (most recent call first):
2025-01-01T10:31:54.7495002Z   File "C:\bld\libtorch_1735710398339\_test_env\Lib\site-packages\torch\autograd\graph.py", line 825 in _engine_run_backward
2025-01-01T10:31:54.7496139Z   File "C:\bld\libtorch_1735710398339\_test_env\Lib\site-packages\torch\autograd\__init__.py", line 347 in backward
2025-01-01T10:31:54.7497158Z   File "C:\bld\libtorch_1735710398339\_test_env\Lib\site-packages\torch\_tensor.py", line 581 in backward
2025-01-01T10:31:54.7498032Z   File "C:\bld\libtorch_1735710398339\test_tmp\test\test_nn.py", line 87 in _backward
2025-01-01T10:31:54.7499057Z   File "C:\bld\libtorch_1735710398339\_test_env\Lib\site-packages\torch\testing\_internal\common_nn.py", line 3482 in test_noncontig
2025-01-01T10:31:54.7500290Z   File "C:\bld\libtorch_1735710398339\_test_env\Lib\site-packages\torch\testing\_internal\common_nn.py", line 3432 in __call__
2025-01-01T10:31:54.7501266Z   File "C:\bld\libtorch_1735710398339\test_tmp\test\test_nn.py", line 7213 in <lambda>
2025-01-01T10:31:54.7502265Z   File "C:\bld\libtorch_1735710398339\_test_env\Lib\site-packages\torch\testing\_internal\common_utils.py", line 2979 in wrapper
2025-01-01T10:31:54.7503313Z   File "C:\bld\libtorch_1735710398339\_test_env\Lib\unittest\case.py", line 589 in _callTestMethod
2025-01-01T10:31:54.7504131Z   File "C:\bld\libtorch_1735710398339\_test_env\Lib\unittest\case.py", line 634 in run
2025-01-01T10:31:54.7505735Z   File "C:\bld\libtorch_1735710398339\_test_env\Lib\site-packages\torch\testing\_internal\common_utils.py", line 3084 in _run_custom
2025-01-01T10:31:54.7507308Z   File "C:\bld\libtorch_1735710398339\_test_env\Lib\site-packages\torch\testing\_internal\common_utils.py", line 3112 in run
2025-01-01T10:31:54.7508349Z   File "C:\bld\libtorch_1735710398339\_test_env\Lib\unittest\case.py", line 690 in __call__
2025-01-01T10:31:54.7509233Z   File "C:\bld\libtorch_1735710398339\_test_env\Lib\site-packages\_pytest\unittest.py", line 351 in runtest
2025-01-01T10:31:54.7510261Z   File "C:\bld\libtorch_1735710398339\_test_env\Lib\site-packages\_pytest\runner.py", line 174 in pytest_runtest_call
2025-01-01T10:31:54.7511299Z   File "C:\bld\libtorch_1735710398339\_test_env\Lib\site-packages\pluggy\_callers.py", line 103 in _multicall
2025-01-01T10:31:54.7512269Z   File "C:\bld\libtorch_1735710398339\_test_env\Lib\site-packages\pluggy\_manager.py", line 120 in _hookexec
2025-01-01T10:31:54.7513230Z   File "C:\bld\libtorch_1735710398339\_test_env\Lib\site-packages\pluggy\_hooks.py", line 513 in __call__
2025-01-01T10:31:54.7514183Z   File "C:\bld\libtorch_1735710398339\_test_env\Lib\site-packages\_pytest\runner.py", line 242 in <lambda>
2025-01-01T10:31:54.7515145Z   File "C:\bld\libtorch_1735710398339\_test_env\Lib\site-packages\_pytest\runner.py", line 341 in from_call
2025-01-01T10:31:54.7516210Z   File "C:\bld\libtorch_1735710398339\_test_env\Lib\site-packages\_pytest\runner.py", line 241 in call_and_report
2025-01-01T10:31:54.7517267Z   File "C:\bld\libtorch_1735710398339\_test_env\Lib\site-packages\_pytest\runner.py", line 132 in runtestprotocol
2025-01-01T10:31:54.7518362Z   File "C:\bld\libtorch_1735710398339\_test_env\Lib\site-packages\_pytest\runner.py", line 113 in pytest_runtest_protocol
2025-01-01T10:31:54.7519417Z   File "C:\bld\libtorch_1735710398339\_test_env\Lib\site-packages\pluggy\_callers.py", line 103 in _multicall
2025-01-01T10:31:54.7520394Z   File "C:\bld\libtorch_1735710398339\_test_env\Lib\site-packages\pluggy\_manager.py", line 120 in _hookexec
2025-01-01T10:31:54.7521352Z   File "C:\bld\libtorch_1735710398339\_test_env\Lib\site-packages\pluggy\_hooks.py", line 513 in __call__
2025-01-01T10:31:54.7522351Z   File "C:\bld\libtorch_1735710398339\_test_env\Lib\site-packages\_pytest\main.py", line 362 in pytest_runtestloop
2025-01-01T10:31:54.7523449Z   File "C:\bld\libtorch_1735710398339\_test_env\Lib\site-packages\pluggy\_callers.py", line 103 in _multicall
2025-01-01T10:31:54.7524474Z   File "C:\bld\libtorch_1735710398339\_test_env\Lib\site-packages\pluggy\_manager.py", line 120 in _hookexec
2025-01-01T10:31:54.7525425Z   File "C:\bld\libtorch_1735710398339\_test_env\Lib\site-packages\pluggy\_hooks.py", line 513 in __call__
2025-01-01T10:31:54.7526336Z   File "C:\bld\libtorch_1735710398339\_test_env\Lib\site-packages\_pytest\main.py", line 337 in _main
2025-01-01T10:31:54.7527282Z   File "C:\bld\libtorch_1735710398339\_test_env\Lib\site-packages\_pytest\main.py", line 283 in wrap_session
2025-01-01T10:31:54.7528369Z   File "C:\bld\libtorch_1735710398339\_test_env\Lib\site-packages\_pytest\main.py", line 330 in pytest_cmdline_main
2025-01-01T10:31:54.7529412Z   File "C:\bld\libtorch_1735710398339\_test_env\Lib\site-packages\pluggy\_callers.py", line 103 in _multicall
2025-01-01T10:31:54.7530396Z   File "C:\bld\libtorch_1735710398339\_test_env\Lib\site-packages\pluggy\_manager.py", line 120 in _hookexec
2025-01-01T10:31:54.7531357Z   File "C:\bld\libtorch_1735710398339\_test_env\Lib\site-packages\pluggy\_hooks.py", line 513 in __call__
2025-01-01T10:31:54.7532335Z   File "C:\bld\libtorch_1735710398339\_test_env\Lib\site-packages\_pytest\config\__init__.py", line 175 in main
2025-01-01T10:31:54.7533384Z   File "C:\bld\libtorch_1735710398339\_test_env\Lib\site-packages\_pytest\config\__init__.py", line 201 in console_main
2025-01-01T10:31:54.7534403Z   File "C:\bld\libtorch_1735710398339\_test_env\Lib\site-packages\pytest\__main__.py", line 9 in <module>
2025-01-01T10:31:54.7535150Z   File "<frozen runpy>", line 88 in _run_code
2025-01-01T10:31:54.7535631Z   File "<frozen runpy>", line 198 in _run_module_as_main

@mgorny
Contributor

mgorny commented Jan 2, 2025

Okay, so I don't think the issues we're seeing are trivial to fix, but I think the overall build architecture is good. And I'd like to give rattler-build a shot here without creating merge conflicts. So I'd like to propose the following:

  1. I prepare a fix for non-Windows builds here: either remove the sccache bits from build.sh and make the dependency specific to win (see the sketch after this list), or fix them. To be honest, I'd lean towards the former, as I think sccache is only needed here because Windows builds don't reuse the build tree the way Linux builds do, and it conflicts with my local use of ccache.
  2. We temporarily disable Windows builds again and merge this, perhaps into #309 (Add triton dependency, readd cudss and cusparselt, mention dev speedup tricks in the README) to avoid double rebuilds.
  3. You can work on trying to fix the test issues (I suspect this will mostly involve patches), while I work on getting rattler-build working.
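
For (1), a minimal sketch of making the dependency Windows-only in meta.yaml (assuming sccache currently sits unconditionally in the build requirements; the actual recipe may differ):

requirements:
  build:
    - sccache  # [win]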

WDYT?

@h-vetinari
Member

What concrete benefit do you foresee from a switch to rattler-build? This has nothing to do with the test errors. I'd prefer to keep this on conda-build for now, not least because the Anaconda folks are planning to contribute some of their recipe bits, and this will surely be easier if we stay on the same format.

For the rest of the plan I'm fine: we can merge this PR as long as the Windows builds are still disabled and the Linux builds are passing (and not breaking your local workflow).

@mgorny
Contributor

mgorny commented Jan 2, 2025

Well, it will definitely make my local testing faster: given that ccache makes the actual build very fast, dependency resolution and other processing in conda-build is the bottleneck. But mostly, I'm worried that the growing complexity of the recipe is making the eventual conversion harder, so I'd rather do it sooner rather than later, so that future changes are already made in the new format.

Of course, I'm not insisting. There's definitely some cost involved, like the ugly comment form seen in triton-feedstock now, or the fact that the {% for ... %} blocks will probably have to be inlined.

@mgorny
Contributor

mgorny commented Jan 2, 2025

Hmm, I don't think we have python312_d.lib in Conda-forge (or at least streamlit can't find it).

As a data point, it looks like the upstream build from PyPI doesn't crash — test\test_modules.py passes on my laptop 100%. Not sure how valuable a data point that would be, but perhaps it would make sense to try pip install --force-reinstall torch in the test pipeline, to confirm it's definitely something about the build rather than the environment.
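
A throwaway diagnostic along these lines in the test section of meta.yaml could do that (a sketch only, not something to merge; it simply swaps in the upstream wheel for comparison):

test:
  requires:
    - pip
  commands:
    - pip install --force-reinstall torch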

@mgorny
Contributor

mgorny commented Jan 2, 2025

Looking at https://github.com/pytorch/pytorch/actions/runs/11471282816/job/31928694532, upstream is building with mkl as the BLAS backend and libiomp5md rather than libomp.

@h-vetinari
Member

The OpenMP change I was talking about is not difficult at all: just replace intel-openmp with llvm-openmp.
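
Roughly, that would look like the following in the Windows requirements (a sketch; which sections currently list intel-openmp may differ in the actual recipe):

requirements:
  host:
    - llvm-openmp  # [win]
  run:
    - llvm-openmp  # [win]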
