Make torch dependency more flexible #355
I guess this was meant to be a fractal-tasks-core issue (unless its goal is to provide a way to install different packages on different clusters). Relevant refs on CUDA/pytorch versions and compatibility:
(also: ref #220)
My bad. Yes, it should be a tasks issue :) And the goal would be to allow an admin setting things up or a user installing the core tasks to get more control over which torch version is used. The effect would be that different torch versions are installed on different clusters. Not sure what the best way to make this happen will be, but it shouldn't be a server concern if at all possible :)
A possible way out would be to add package extras, so that one could install the package as
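As a rough illustration only (the extra name and version bound below are made up, and this shows just the basic mechanism of an optional dependency behind an extra, not the harder question of offering several torch versions):

```toml
# Hypothetical sketch: torch becomes an optional dependency that is only
# installed when the "torch" extra is requested; the bound is a placeholder.
[tool.poetry.dependencies]
torch = { version = ">=1.12", optional = true }

[tool.poetry.extras]
torch = ["torch"]
```

A user would then opt in with something like `pip install "fractal-tasks-core[torch]"`, while a plain install would leave torch out.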
Let's rediscuss it.
- Optional extras specify the pytorch version.
- If nothing is specified, `pip install cellpose` will install something (likely the newest pytorch version).
What is our plan regarding torch versions for the fractal-tasks extra? Not the biggest fan of multiple different extra editions tbh, but it would be great to allow the torch installation to work better (i.e. also work "out of the box" on more modern systems than the UZH GPUs)
Refs (to explore further):
See the Poetry docs on dependency specification. Maybe doable by combining

```toml
[tool.poetry.dependencies]
pathlib2 = { version = "^2.2", markers = "python_version <= '3.4' or sys_platform == 'win32'" }
```

with

```toml
[tool.poetry.dependencies]
foo = [
    { version = "<=1.9", python = ">=3.6,<3.8" },
    { version = "^2.0", python = ">=3.8" }
]
```
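For concreteness, a hedged sketch of how those two patterns could be combined for torch (the bounds and markers are placeholders, not tested values):

```toml
# Hypothetical combination: multiple torch constraints, each selected by
# mutually exclusive environment markers; versions are illustrative only.
[tool.poetry.dependencies]
torch = [
    { version = "~1.12", markers = "python_version < '3.10'" },
    { version = ">=2.0", markers = "python_version >= '3.10'" }
]
```

The limitation is that environment markers only see things like the Python version or platform, not the CUDA stack of the particular cluster where the tasks will run.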
We explored multiple options with @mfranzon, and we don't see any that makes sense to us via conditional dependencies or something similar. We then propose that:
Since this is very tedious, we also propose the following workaround for doing it automatically (to be included in fractal-server - we can then open an issue over there):

```json
{
  "package": "string",
  "package_version": "string",
  "package_extras": "string",
  "python_version": "string"
}
```

We could add an additional attribute, like:
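As a purely hypothetical sketch of what such an attribute could look like (the `pinned_package_versions` name and the values below are invented for illustration, not taken from this thread):

```jsonc
// Hypothetical payload; "pinned_package_versions" is an invented name.
{
  "package": "fractal-tasks-core",
  "package_extras": "fractal-tasks",
  "python_version": "3.10",
  "pinned_package_versions": { "torch": "1.12.1" }
}
```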
CAVEAT: this is messing with the package, and thus creates a not-so-clean log of the installation (although we would still also include the additional torch-installation logs). Such an operation is meant to be restricted to very specific cases where there is an important dependency on hardware or system libraries - things that a regular user should not be using.
IMPORTANT NOTE 1:
IMPORTANT NOTE 2:
MINOR NOTE:
Thanks for digging into this! Sounds good to me. I already tested it with torch 2.0.0 on the FMI side and that also works, so I don't see a strong reason for limiting the torch version at all for the time being. Having the
Server work is deferred to fractal-analytics-platform/fractal-server#740.
I have seen no reason for constraints so far, given that 2.0.0 still worked well. We just need torch for cellpose, right? Do we still add it as an explicit dependency for the extras (to make the
Basically, our torch constraint is:
Note that when torch 2.0 is used, this change also introduces additional dependencies (e.g. sympy and mpmath).
Anndata also uses it, but they are not very strict in the dependency version.

To do:
Note: the list below is a bunch of not-very-systematic tests. This is all preliminary, but it'd be nice to understand things clearly, since we are already at it. Here are some raw CI tests:
Finally found the issue (it's a torch 2.0.1 issue, which is exposed by anndata imports but unrelated to anndata).
Current fix: we have to include the torch dependency explicitly, and make it
For the record, the new size of the installed package is quite a bit larger - and I think this is due to the torch 2.0 requirement of nvidia libraries:
Currently, we hardcode torch version 1.12 in the fractal-tasks-core dependencies to make it work well on older UZH GPUs. The tasks themselves don't depend on that torch version though, and they run fine with other torch versions (e.g. 1.13 or even the new 2.0.0).
The 1.12 dependency caused some issues in @gusqgm's Windows Subsystem for Linux test. On the FMI cluster, it's fine on some GPU nodes, but actually runs into the error below on other GPU nodes. I tested with torch 2.0.0 now, and then everything works.
Thus, we should make the torch version more flexible. The correct torch version to install depends on the infrastructure, not the task package.
A workaround until we have it is to manually install a given torch version into the task venv:
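The exact command depends on where the task environment lives; as a hedged sketch (the path and the version are placeholders for a specific setup):

```bash
# Placeholder path and version: call the pip that belongs to the task package's
# venv and pin whichever torch release matches the cluster's GPUs and CUDA stack.
/path/to/FRACTAL_TASKS_DIR/fractal-tasks-core/venv/bin/pip install "torch==2.0.0"
```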
If someone is searching for it, I'm hitting this error message when the torch version doesn't match: