
Adds Ascend platform adaptation code #1881

Open · wants to merge 19 commits into devel

Conversation

ZhengdQin (Contributor)

  1. Add a transfer-to-ascend module: the command "dp transfer-to-ascend mix_precision -i water.pb -o Ascend_transfer.pb" transfers a model to a mixed-precision Ascend_transfer.pb that can execute on the Ascend platform.
  2. Modify the dp test module for the Ascend platform.
  3. Modify LAMMPS + DeepMD for the Ascend platform.

@codecov-commenter commented Aug 30, 2022

Codecov Report

Patch coverage: 48.64% and project coverage change: -0.54 ⚠️

Comparison is base (58bed2b) 78.00% compared to head (563f312) 77.47%.

❗ Current head 563f312 differs from the pull request's most recent head d0261d5. Consider uploading reports for commit d0261d5 to get more accurate results.

Additional details and impacted files
@@            Coverage Diff             @@
##            devel    #1881      +/-   ##
==========================================
- Coverage   78.00%   77.47%   -0.54%     
==========================================
  Files         118      116       -2     
  Lines        9853    10139     +286     
==========================================
+ Hits         7686     7855     +169     
- Misses       2167     2284     +117     
Impacted Files Coverage Δ
deepmd/utils/convert.py 14.65% <ø> (+0.49%) ⬆️
deepmd/infer/deep_pot.py 45.16% <9.44%> (-24.95%) ⬇️
deepmd/entrypoints/convert.py 15.38% <50.00%> (ø)
deepmd/entrypoints/transfer.py 75.59% <58.06%> (+3.47%) ⬆️
deepmd/utils/transfer_to_ascend.py 73.68% <73.68%> (ø)
deepmd/entrypoints/transfer_to_ascend.py 80.00% <80.00%> (ø)
deepmd/train/trainer.py 80.68% <85.71%> (+0.02%) ⬆️
deepmd/entrypoints/__init__.py 100.00% <100.00%> (ø)
deepmd/entrypoints/main.py 92.12% <100.00%> (+4.83%) ⬆️
deepmd/entrypoints/train.py 88.39% <100.00%> (+0.26%) ⬆️
... and 18 more

☔ View full report at Codecov.

Review threads (resolved): source/lmp/CMakeLists.txt, source/CMakeLists.txt, deepmd/entrypoints/main.py, deepmd/infer/deep_pot.py, deepmd/utils/argcheck.py, deepmd/entrypoints/transfer_to_ascend.py, deepmd/entrypoints/transfer.py, source/op/prod_env_mat_multi_device.cc, source/api_cc/src/DeepPot.cc (×2)
@njzjz (Member) commented Aug 30, 2022

The title should be more precise.

@github-actions github-actions bot removed the Gromacs label Aug 31, 2022
Review threads (resolved): source/CMakeLists.txt (×2), deepmd/entrypoints/transfer.py, source/op/prod_env_mat_multi_device.cc, source/lmp/ascend_env.sh.in, deepmd/utils/network.py
b_initializer,
trainable = trainable)
variable_summaries(b, 'bias')
if final_layer and GLOBAL_ASCEND_OUT_PRECISION:
Member

@denghuilu It looks similar to mixed_prec. However, the weight for mixed_prec is cast, but the weight for GLOBAL_ASCEND_OUT_PRECISION is not. What do you think about it?

@denghuilu (Member) commented Sep 5, 2022

It seems like they have already set the precisions before running the networks:

jdata["model"]["descriptor"]["precision"] = "float16"
jdata["model"]["fitting_net"]["precision"] = "float16"

Contributor Author

Thanks, we modify the original precision directly. We tried different mixed-precision models on the Ascend platform and found that keeping GLOBAL_ASCEND_OUT_PRECISION at float32 (i.e., only the last bias-add is float32) is important for the accuracy of the transferred Ascend model, so we cast every weight except the last bias-add.
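
A minimal TensorFlow sketch of this casting scheme (illustrative only; function and variable names are assumptions, not the PR's actual code):

    import tensorflow as tf

    def final_layer_mixed(x, w, b):
        # x and w arrive in float16 from the preceding float16 layers
        y = tf.matmul(x, w)                    # float16 matmul
        y = tf.cast(y, tf.float32)             # up-cast just before the last bias-add
        return tf.nn.bias_add(y, tf.cast(b, tf.float32))  # float32 bias-add output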

Member

It has the same aim as mixed_prec, so it would be better to merge these two variables.

Contributor Author

Thanks for the comment! Our mixed-precision model is different from the model defined by mixed_prec: only the last bias-add is float32. Reusing mixed_prec would require changing its code logic and would make it harder to understand.

Review thread (resolved): deepmd/entrypoints/transfer.py
@wanghan-iapcm (Collaborator) left a comment

Could you please add documentation for how to use deepmd-kit on Ascend?

Comment on lines 304 to 308

    if not self.is_compress:
        if self.is_ascend_transfer:
            self._init_from_frz_model()

Collaborator

Do we need a new case for ascend transfer? Is it the same as training with the --init-model option?

Contributor Author

Thanks for the comment. The Ascend transfer is very similar to training with --init-model; the reasons we cannot use --init-model directly are the following:

  1. We may add new functions to the Ascend transfer module in the future, and developing a separate module gives better extensibility.
  2. We cannot use dp train with --init-model directly, since we only build a model without training it. A separate module also lets us automatically modify the input.json (as sketched below), so build, freeze, and transfer can finish in one command.
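
For illustration, a minimal hypothetical sketch of the automatic input.json adjustment (the precision keys are taken from the review discussion above; the file names are assumptions, not the PR's actual code):

    import json

    # Rewrite the training configuration for the mixed-precision build step.
    with open("input.json") as f:
        jdata = json.load(f)
    jdata["model"]["descriptor"]["precision"] = "float16"
    jdata["model"]["fitting_net"]["precision"] = "float16"
    with open("input_ascend.json", "w") as f:  # hypothetical output name
        json.dump(jdata, f, indent=4)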


@@ -458,6 +458,7 @@ def model_args ():
doc_sw_rmin = 'The lower boundary of the interpolation between short-range tabulated interaction and DP. It is only required when `use_srtab` is provided.'
doc_sw_rmax = 'The upper boundary of the interpolation between short-range tabulated interaction and DP. It is only required when `use_srtab` is provided.'
doc_compress_config = 'Model compression configurations'
doc_ascend_transfer = 'Model transfer to ascend mix-precision model'
Collaborator

This argument would not be needed if ascend training is the same as --init-model training.

Contributor Author

Thanks, the Ascend transfer is different from --init-model training; please see the detailed explanation in the reply above.

b_initializer,
trainable = trainable)
variable_summaries(b, 'bias')
if final_layer and GLOBAL_ASCEND_OUT_PRECISION is not None:
Collaborator

I would suggest adding an option out_precision to the interface (default=GLOBAL_TF_FLOAT_PRECISION), so that not only Ascend is supported.
Note: changing the behavior of a function through a global variable is dangerous.

Contributor Author

Good idea! We have removed the global variable and added the option out_precision to the interface.
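
A simplified sketch of what such an interface could look like (hypothetical signature and initializers, not the PR's final code):

    import tensorflow as tf

    def one_layer(inputs, n_out, precision=tf.float16,
                  out_precision=tf.float32, final_layer=False):
        # Weights stay in the layer precision; only the final bias-add
        # runs in out_precision, matching the scheme discussed above.
        n_in = int(inputs.shape[-1])
        w = tf.Variable(tf.random.normal([n_in, n_out], dtype=precision), name="matrix")
        b = tf.Variable(tf.zeros([n_out], dtype=out_precision if final_layer else precision), name="bias")
        hidden = tf.matmul(inputs, w)  # inputs are expected in `precision`
        if final_layer:
            hidden = tf.cast(hidden, out_precision)
        return tf.nn.bias_add(hidden, b)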

GLOBAL_ASCEND_OUT_PRECISION,
b_initializer,
trainable = trainable)
variable_summaries(b, 'bias')
Collaborator

Move variable_summaries out of the if-else?

Contributor Author

Thanks, we have fixed it.

Comment on lines 39 to 41
* @brief Initialize the DP.
* @param[in] model The name of the frozen model file.
* @param[in] gpu_rank The GPU rank. Default is 0.
Collaborator

Please update the docstring.

Contributor Author

Thanks, we have added the reference.

* @param[in] model The name of the frozen model file.
* @param[in] gpu_rank The GPU rank. Default is 0.
**/
void init (const std::string & model, const int & nloc, const int & gpu_rank = 0);
Collaborator

Do you require nlocal to be a constant? This is a very strict restriction, as the number of atoms in a local region may change during MD simulations.

Contributor Author

Yes, nlocal is a constant on the Ascend platform, because we pad the number of atoms of each type. Since nlocal changes during inference, we increase the padded value to 1.1 times the original nlocal.
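
A rough sketch of this padding idea (illustrative Python; the PR implements it in the C++ interface, and the helper name is an assumption):

    import math

    def padded_type_counts(type_count, headroom=1.1):
        # Pad each atom type's count so buffer sizes stay fixed even when
        # the local atom count fluctuates during the MD run.
        return [math.ceil(c * headroom) for c in type_count]

    # e.g. a water system with 4096 O and 8192 H local atoms:
    print(padded_type_counts([4096, 8192]))  # [4506, 9012]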

type_count[type[ii]-1] ++;
}
deep_pot.init_graph (arg[0], type_count, get_file_content(arg[0]));
deep_pot.init (arg[0], nlocal, get_node_rank());
Collaborator

nlocal changes if the number of subregions > 1.

Contributor Author

Thanks, we pad the nlocal value, so it is fine as long as the fluctuation stays within the padded headroom.

@wanghan-iapcm (Collaborator)

Please provide a proper title for this PR.

@amcadmus (Member) commented Sep 7, 2022

Is it possible to provide unit tests for the contributed code?

@denghuilu (Member)

It seems that on non-data-center GPU cards, the transferred model shows an impressive speedup. I have tested the new model in a local 1080 Ti environment and achieved a speedup by a factor of 7.5 (water benchmark system, 12288 atoms):

[screenshot: timing output of the double-precision original model]
[screenshot: timing output of the Ascend-method transferred model]

@njzjz (Member) commented Sep 8, 2022

@denghuilu Are they the same model? The output looks different.

@ZhengdQin ZhengdQin changed the title Devel Adds Ascend platform adaptation code Sep 14, 2022
@ZhengdQin (Contributor Author)

Is it possible to provide unit tests for the contributed code?
Good idea! We will add the tests in the next commit.

@ZhengdQin ZhengdQin closed this Sep 14, 2022
@ZhengdQin ZhengdQin deleted the devel branch September 14, 2022 09:57
@ZhengdQin ZhengdQin restored the devel branch September 14, 2022 09:58
@ZhengdQin ZhengdQin reopened this Sep 14, 2022
@njzjz (Member) left a comment

Please resolve conflicts

Review threads (resolved): deepmd/entrypoints/transfer.py, deepmd/utils/transfer_to_ascend.py (×2), source/CMakeLists.txt (×2)
@github-actions github-actions bot added the Docs label Sep 15, 2022
----------
new_graph : tf.Graph
orginal new graph
Returns :
Member

Suggested change:
- Returns :
+ Returns

Contributor Author

Thanks, we have fixed it.

----------
feed_dict : dict of tensor
Session original feed_dict includes coord, type, box, mesh, natoms.
t_out : list of tensor
Member

The type and the order are different from those in line 408. Please check which is right.

Contributor Author

Thanks, we have fixed it.

@denghuilu (Member) commented Oct 24, 2022

@denghuilu Are they the same model? The output looks different.

No, they are not the same model; the Ascend-method transferred model was cast from the original one, so there may be some truncation error.


6 participants