[Feature Request] Code and build system for multiple backends #3120

Closed
Tracked by #3122
njzjz opened this issue Jan 9, 2024 · 1 comment

njzjz commented Jan 9, 2024

Summary

This proposal describes how multiple backends are organized and built.

Detailed Description

Code structure

The code may be organized as follows:

(updated: Jan 27, 2024)

- deepmd
  - tf
  - pt
- source
  - api_cc (include tf and pt)
  - tests
     - common
     - tf
     - pt

deepmd is effectively deepmd_tf, but the name must stay as is to keep compatibility.

Other code is not affected.

CMake

Currently, CMake has TENSORFLOW_ROOT and USE_TF_PYTHON_LIBS. Under this proposal, they are no longer required when USE_PT_PYTHON_LIBS is enabled. However, an error should be raised if no backend is enabled.
One or two backends can be built, as shown below:

image
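
The selection rule described in this section can be sketched in Python as follows. This is only an illustration of the intended behavior, not the project's CMake code: the function `select_backends` and its parameter names are hypothetical, modeled on the option names mentioned in this thread (TENSORFLOW_ROOT, USE_TF_PYTHON_LIBS, USE_PT_PYTHON_LIBS, ENABLE_TENSORFLOW, ENABLE_PYTORCH), and the assumption that USE_PT_PYTHON_LIBS implies the PyTorch backend is mine.

```python
from __future__ import annotations


def select_backends(
    tensorflow_root: str = "",
    use_tf_python_libs: bool = False,
    use_pt_python_libs: bool = False,
    enable_tensorflow: bool = False,
    enable_pytorch: bool = False,
) -> list[str]:
    """Return the backends to build, or raise if none is enabled.

    Hypothetical helper: it mirrors the rule stated in the proposal,
    not the actual CMakeLists.txt logic.
    """
    backends = []
    # TensorFlow is built when requested explicitly, or when one of the
    # existing TensorFlow options is set.
    if enable_tensorflow or tensorflow_root or use_tf_python_libs:
        backends.append("tensorflow")
    # Assumption: the PyTorch Python-libs option implies the PyTorch backend.
    if enable_pytorch or use_pt_python_libs:
        backends.append("pytorch")
    # The proposal asks for an error when no backend is enabled.
    if not backends:
        raise ValueError("No backend is enabled; enable TensorFlow and/or PyTorch.")
    return backends
```

With these assumptions, `select_backends(use_pt_python_libs=True)` builds only the PyTorch backend, and calling `select_backends()` with no options raises an error.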

Pip

By default, both backends are installed. This changes little, because the PT code does not use C++.
Add a new environment variable option to disable TF.
Add optional dependencies for PyTorch, i.e.

pip install deepmd-kit[torch]

Pre-built packages with dependencies

As both TensorFlow and PyTorch are pretty large in size, we need to determine whether to bring them into the same pre-built package...

Further Information, Files, and Links

No response

njzjz added this to the v3.0.0 milestone Jan 9, 2024
wanghan-iapcm pushed a commit that referenced this issue Jan 24, 2024
See #3120.

- CMake: add `ENABLE_TENSORFLOW` and `ENABLE_PYTORCH`.
`BUILD_TENSORFLOW` will be enabled when `TENSORFLOW_ROOT` is not empty
or `USE_TF_PYTHON_LIBS` is on.
- api_cc: add `BUILD_TENSORFLOW` and `BUILD_PYTORCH` definitions. Move
several functions from `common.h` to `commonTF.h` to prevent exposing
them to header files.
- CI: download libtorch in the build/test CC actions.

---------

Signed-off-by: Jinzhe Zeng <jinzhe.zeng@rutgers.edu>
iProzd added a commit to iProzd/deepmd-kit that referenced this issue Jan 24, 2024
* Fix max nbor size related issues (deepmodeling#3157)

* Merge master into devel (deepmodeling#3167)

* [pre-commit.ci] pre-commit autoupdate (deepmodeling#3163)

<!--pre-commit.ci start-->
updates:
- [github.com/astral-sh/ruff-pre-commit: v0.1.13 →
v0.1.14](astral-sh/ruff-pre-commit@v0.1.13...v0.1.14)
<!--pre-commit.ci end-->

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* setup PyTorch C++ interface build environment (deepmodeling#3169)

See deepmodeling#3120.

- CMake: add `ENABLE_TENSORFLOW` and `ENABLE_PYTORCH`.
`BUILD_TENSORFLOW` will be enabled when `TENSORFLOW_ROOT` is not empty
or `USE_TF_PYTHON_LIBS` is on.
- api_cc: add `BUILD_TENSORFLOW` and `BUILD_PYTORCH` definitions. Move
several functions from `common.h` to `commonTF.h` to prevent exposing
them to header files.
- CI: download libtorch in the build/test CC actions.

---------

Signed-off-by: Jinzhe Zeng <jinzhe.zeng@rutgers.edu>

* docs: add TF icons to platform-specific features (deepmodeling#3171)

Fix deepmodeling#3121.

The PyTorch icon can be added when a feature implemented by PyTorch is
added.

However, I can't find a way to add an icon to TOC.


![image](https://github.com/deepmodeling/deepmd-kit/assets/9496702/7f29da27-af81-4850-9da0-79310d216b2d)

Signed-off-by: Jinzhe Zeng <jinzhe.zeng@rutgers.edu>

* add universal Python inference interface DeepPot (deepmodeling#3164)

Need discussion for other classes.

---------

Signed-off-by: Jinzhe Zeng <jinzhe.zeng@rutgers.edu>

* detect version in advance before building deepmd-kit-cu11 (deepmodeling#3172)

Fix deepmodeling#3168.

See:
pypa/setuptools-scm#1006 (comment)

---------

Signed-off-by: Jinzhe Zeng <jinzhe.zeng@rutgers.edu>

---------

Signed-off-by: Jinzhe Zeng <jinzhe.zeng@rutgers.edu>
Co-authored-by: Denghui Lu <denghuilu@pku.edu.cn>
Co-authored-by: Jinzhe Zeng <jinzhe.zeng@rutgers.edu>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
njzjz added a commit to njzjz/deepmd-kit that referenced this issue Jan 29, 2024
Fix deepmodeling#3120.

Signed-off-by: Jinzhe Zeng <jinzhe.zeng@rutgers.edu>

njzjz commented Jan 29, 2024

> Pre-built packages with dependencies
>
> As both TensorFlow and PyTorch are pretty large in size, we need to determine whether to bring them into the same pre-built package...

After investigation, a pre-built package that bundles PyTorch seems infeasible, for the following reasons:

(1) The PyTorch package from PyPI is built with a CXX11_ABI setting that is incompatible with TF;
(2) The PyTorch package built with CXX11_ABI=1, available on their website, requires a higher GLIBC version, which is incompatible with the docker image currently used to make the pre-built package;
(3) PyTorch links against the CUDA libraries directly instead of loading them with dlopen(). This means we would need to ship the CUDA libraries along with it; otherwise, it cannot work even on CPUs.
(4) It's too big!

-rwxr-xr-x. 1 jz748 jz748  55M Dec 12 13:10 libnvrtc-b51b459d.so.12
-rwxr-xr-x. 1 jz748 jz748  68M Dec 12 13:10 libcudnn_ops_train.so.8
-rwxr-xr-x. 1 jz748 jz748  87M Dec 12 13:10 libcudnn_ops_infer.so.8
-rwxr-xr-x. 1 jz748 jz748 103M Dec 12 13:09 libcublas-37d11411.so.12
-rwxr-xr-x. 1 jz748 jz748 111M Dec 12 13:10 libcudnn_adv_train.so.8
-rwxr-xr-x. 1 jz748 jz748 120M Dec 12 13:10 libcudnn_adv_infer.so.8
-rwxr-xr-x. 1 jz748 jz748 127M Dec 12 13:10 libcudnn_cnn_train.so.8
-rwxr-xr-x. 1 jz748 jz748 129M Dec 12 13:10 libtorch_cuda_linalg.so
-rw-r--r--. 1 jz748 jz748 140M Dec 12 13:03 libdnnl.a
-rwxr-xr-x. 1 jz748 jz748 427M Dec 12 13:10 libtorch_cpu.so
-rwxr-xr-x. 1 jz748 jz748 492M Dec 12 13:08 libcublasLt-f97bfc2c.so.12
-rwxr-xr-x. 1 jz748 jz748 605M Dec 12 13:10 libcudnn_cnn_infer.so.8
-rwxr-xr-x. 1 jz748 jz748 1.3G Dec 12 13:10 libtorch_cuda.so
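
To put a number on point (4), the sizes in the listing above can be added up. The sketch below copies the file names and human-readable sizes from that listing; `parse_size` is a hypothetical helper, and it assumes the `M` and `G` suffixes are powers of 1024, as `ls -h` reports them. Because those sizes are rounded, the total is only approximate.

```python
# File names and sizes copied from the `ls -lh` listing above.
LISTING = [
    ("libnvrtc-b51b459d.so.12", "55M"),
    ("libcudnn_ops_train.so.8", "68M"),
    ("libcudnn_ops_infer.so.8", "87M"),
    ("libcublas-37d11411.so.12", "103M"),
    ("libcudnn_adv_train.so.8", "111M"),
    ("libcudnn_adv_infer.so.8", "120M"),
    ("libcudnn_cnn_train.so.8", "127M"),
    ("libtorch_cuda_linalg.so", "129M"),
    ("libdnnl.a", "140M"),
    ("libtorch_cpu.so", "427M"),
    ("libcublasLt-f97bfc2c.so.12", "492M"),
    ("libcudnn_cnn_infer.so.8", "605M"),
    ("libtorch_cuda.so", "1.3G"),
]

# Assumption: suffixes are binary multiples, as printed by `ls -h`.
UNITS = {"K": 1024, "M": 1024**2, "G": 1024**3}


def parse_size(text: str) -> float:
    """Convert a size such as '55M' or '1.3G' to a number of bytes."""
    return float(text[:-1]) * UNITS[text[-1]]


total_bytes = sum(parse_size(size) for _, size in LISTING)
total_gib = total_bytes / 1024**3
print(f"{len(LISTING)} files, about {total_gib:.1f} GiB in total")
```

The thirteen listed files alone come to roughly 3.7 GiB.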

wanghan-iapcm pushed a commit that referenced this issue Jan 30, 2024
Fix #3120.

One can disable building the TensorFlow backend during `pip install` by
setting `DP_ENABLE_TENSORFLOW=0`.

---------

Signed-off-by: Jinzhe Zeng <jinzhe.zeng@rutgers.edu>
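
A build script could read this switch roughly as sketched below. The thread does not show how the variable is actually parsed, so the helper `tensorflow_enabled` is hypothetical, and treating only the value `0` as "disabled" (with the backend enabled by default) is an assumption.

```python
import os


def tensorflow_enabled(environ=None) -> bool:
    """Return False only when DP_ENABLE_TENSORFLOW is set to "0".

    Hypothetical helper; the real build backend may parse the
    variable differently.
    """
    env = os.environ if environ is None else environ
    # Assumption: the TensorFlow backend is enabled unless explicitly disabled.
    return env.get("DP_ENABLE_TENSORFLOW", "1") != "0"


if __name__ == "__main__":
    state = "enabled" if tensorflow_enabled() else "disabled"
    print(f"TensorFlow backend: {state}")
```

Under this assumption, running the install with `DP_ENABLE_TENSORFLOW=0` set in the environment would skip the TensorFlow backend, while leaving the variable unset keeps the default of building it.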
njzjz closed this as completed Jan 30, 2024
iProzd added a commit to iProzd/deepmd-kit that referenced this issue Jan 30, 2024
* throw errors when PyTorch CXX11 ABI is different from TensorFlow (deepmodeling#3201)

If so, throw the following error:
```
-- PyTorch CXX11 ABI: 0
CMake Error at CMakeLists.txt:162 (message):
  PyTorch CXX11 ABI mismatch TensorFlow: 0 != 1
```

Signed-off-by: Jinzhe Zeng <jinzhe.zeng@rutgers.edu>
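
The same consistency check can be sketched in Python. The function `check_cxx11_abi` is hypothetical and is not the CMake code from the commit; it only reproduces the comparison and the error text quoted above. When both frameworks are installed, the two flags could be taken from `torch.compiled_with_cxx11_abi()` and `tf.sysconfig.CXX11_ABI_FLAG`; they are passed in as plain integers here so the sketch has no dependencies.

```python
def check_cxx11_abi(pytorch_abi: int, tensorflow_abi: int) -> None:
    """Raise if the two frameworks were built with different CXX11 ABI flags.

    Hypothetical helper. The flags could come from, for example,
    int(torch.compiled_with_cxx11_abi()) and tf.sysconfig.CXX11_ABI_FLAG.
    """
    pt, tf = int(pytorch_abi), int(tensorflow_abi)
    if pt != tf:
        # Same wording as the CMake error quoted in the commit message.
        raise RuntimeError(f"PyTorch CXX11 ABI mismatch TensorFlow: {pt} != {tf}")


if __name__ == "__main__":
    check_cxx11_abi(1, 1)  # matching flags: nothing happens
    try:
        check_cxx11_abi(0, 1)
    except RuntimeError as err:
        print(err)
```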

* allow disabling TensorFlow backend during Python installation (deepmodeling#3200)

Fix deepmodeling#3120.

One can disable building the TensorFlow backend during `pip install` by
setting `DP_ENABLE_TENSORFLOW=0`.

---------

Signed-off-by: Jinzhe Zeng <jinzhe.zeng@rutgers.edu>

* breaking: pt: add dp model format and refactor pt impl for the fitting net. (deepmodeling#3199)

- add dp model format (backend independent definition) for the fitting
- refactor torch support, compatible with dp model format
- fix mlp issue: the idt should only be used when a skip connection is
available.
- add tools `to_numpy_array` and `to_torch_tensor`.

---------

Co-authored-by: Han Wang <wang_han@iapcm.ac.cn>

* remove duplicated fitting output check. fix codeql (deepmodeling#3202)

Co-authored-by: Han Wang <wang_han@iapcm.ac.cn>

---------

Signed-off-by: Jinzhe Zeng <jinzhe.zeng@rutgers.edu>
Co-authored-by: Jinzhe Zeng <jinzhe.zeng@rutgers.edu>
Co-authored-by: Han Wang <92130845+wanghan-iapcm@users.noreply.github.com>
Co-authored-by: Han Wang <wang_han@iapcm.ac.cn>