This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

Improve sparse pull performance for gluon trainer #11429

Merged 18 commits on Jul 9, 2018

Conversation

eric-haibin-lin
Member

Description

  • Introduce a GPU priority queue for row-sparse pull operations.
  • Add an ignore_sparse option to kv.pull, which improves performance for hybrid blocks with dense weights and sparse gradients.
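
To illustrate the idea of a priority queue for pull operations, here is a minimal sketch in plain Python. It is not MXNet's engine code: the class name `PriorityOpQueue`, the operation labels, and the priority values are all hypothetical, and the sketch assumes that a larger priority value means "run earlier".

```python
import heapq
import itertools


class PriorityOpQueue:
    """Toy queue that hands out the highest-priority operation first."""

    def __init__(self):
        self._heap = []
        # Monotonic counter so operations with equal priority stay in FIFO order.
        self._counter = itertools.count()

    def push(self, name, priority=0):
        # heapq is a min-heap, so negate the priority to pop the largest first.
        heapq.heappush(self._heap, (-priority, next(self._counter), name))

    def pop(self):
        _, _, name = heapq.heappop(self._heap)
        return name

    def __len__(self):
        return len(self._heap)


# Hypothetical workload: the sparse pull is given the highest priority.
queue = PriorityOpQueue()
queue.push("pull dense weight, layer 3", priority=-3)
queue.push("row_sparse_pull embedding", priority=0)
queue.push("pull dense weight, layer 1", priority=-1)
order = [queue.pop() for _ in range(len(queue))]
```

With such a queue, a pull that the next forward pass needs soonest can be scheduled ahead of pulls that were enqueued earlier but are needed later.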

@leezu @rahul003

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

  • The PR title starts with [MXNET-$JIRA_ID], where $JIRA_ID refers to the relevant JIRA issue created (except PRs with tiny changes)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
  • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
  • Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
  • Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
  • Code is well-documented:
  • For user-facing API changes, API doc string has been updated.
  • For new C++ functions in header files, their functionalities and arguments are documented.
  • For new examples, README.md is added to explain what the example does, the source of the dataset, expected performance on test set and reference to the original paper if applicable
  • Check the API doc at http://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-$PR_ID/$BUILD_ID/index.html
  • To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

@eric-haibin-lin
Member Author

@haojin2

@eric-haibin-lin
Member Author

@junrushao1994 pls help review engine code. Thanks!

@junrushao
Member

The engine code looks good to me

update_on_kvstore = False
if 'dist' in kvstore.type:
    # kv.pull(row_sparse_grad) is not supported for dist kvstore
    update_on_kvstore = self._contains_sparse_weight or self._contains_sparse_grad
Member

from the comment I'm guessing you meant not self._contains_sparse_weight and not self._contains_sparse_grad

Member Author

This is intended. kv.pull(row_sparse_grad) is not supported for dist kvstore, so we want to set update_on_kvstore = True if there's sparse grad.
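
The decision described in this reply can be sketched as a standalone function. This is a simplified reading of the snippet under review, not the trainer's full logic; the function name `resolve_update_on_kvstore` and its plain-argument signature are hypothetical.

```python
def resolve_update_on_kvstore(kvstore_type, contains_sparse_weight, contains_sparse_grad):
    """Return True when the optimizer update should run on the kvstore."""
    update_on_kvstore = False
    if 'dist' in kvstore_type:
        # A dist kvstore cannot pull row-sparse gradients back to the workers,
        # so any sparse weight or sparse gradient forces the update onto the kvstore.
        update_on_kvstore = contains_sparse_weight or contains_sparse_grad
    return update_on_kvstore


# Sparse gradient on a distributed store: update on the kvstore.
dist_sparse = resolve_update_on_kvstore('dist_sync', False, True)
# Same model on a single-machine store: update stays on the workers.
local_sparse = resolve_update_on_kvstore('device', False, True)
```

The `or` is therefore the intended operator: it switches the flag on in exactly the cases where pulling a sparse gradient would be needed but is unavailable.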

@szha
Member

szha commented Jun 29, 2018

how does the performance look? @eric-haibin-lin @leezu

@eric-haibin-lin
Member Author

eric-haibin-lin commented Jul 1, 2018

@szha For a dense weight with a sparse gradient, this PR pulls the sparse gradient instead of the dense weight. This reduces GPU-to-GPU copy time from 60 ms to less than 1 ms.
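
A rough size calculation shows why copying a row-sparse gradient can be much cheaper than copying a dense weight. The shapes below are hypothetical and are not the benchmark behind the timings above; the sketch also assumes 4-byte values and one 8-byte row index per stored row.

```python
def dense_bytes(num_rows, num_cols, itemsize=4):
    """Bytes needed to copy a full dense matrix."""
    return num_rows * num_cols * itemsize


def row_sparse_bytes(num_nonzero_rows, num_cols, itemsize=4, index_itemsize=8):
    """Bytes needed to copy only the non-zero rows plus one index per row."""
    return num_nonzero_rows * (num_cols * itemsize + index_itemsize)


# Hypothetical embedding: 1M rows of width 512, with 4096 rows touched in a batch.
vocab, dim, touched = 1_000_000, 512, 4_096
dense = dense_bytes(vocab, dim)
sparse = row_sparse_bytes(touched, dim)
ratio = dense / sparse
print(f"dense: {dense} bytes, row-sparse: {sparse} bytes, ratio: {ratio:.0f}x")
```

The saving scales with the fraction of rows that a batch actually touches, so it is largest for big embedding-style weights with small batches.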

Contributor

@haojin2 haojin2 left a comment

LGTM

  * \return 0 when success, -1 when failure happens
  */
 MXNET_DLL int MXKVStorePull(KVStoreHandle handle,
                             mx_uint num,
                             const int* keys,
                             NDArrayHandle* vals,
-                            int priority);
+                            int priority,
+                            bool ignore_sparse = true);
Contributor

The C API doesn't support default argument values.

  * \return 0 when success, -1 when failure happens
  */
 MXNET_DLL int MXKVStorePullEx(KVStoreHandle handle,
                               mx_uint num,
                               const char** keys,
                               NDArrayHandle* vals,
-                              int priority);
+                              int priority,
+                              bool ignore_sparse = true);
Contributor

same

Member Author

Added extra C APIs instead of adding a default value to this one.

@@ -84,6 +84,8 @@ enum class FnProperty {
   kCopyToGPU,
   /*! \brief Prioritized sync operation on CPU */
   kCPUPrioritized,
+  /*! \brief Prioritized sync operation on GPU */
+  kGPUPrioritized,
Contributor

Add it at the end

Member Author

Moved to the end.

-    raise RuntimeError("Cannot set update_on_kvstore to False when sparse "
-                       "gradients and/or sparse weights are present for "
-                       "Parameter '%s'."%param.name)
+    raise RuntimeError("Cannot set update_on_kvstore to False when sparse weights "
Member

Does this have to be an error, or can it be a warning that automatically uses update_on_kvstore?

Member

Also, shouldn't this be outside the if contains_sparse_weight condition?

Member Author

By default update_on_kvstore is None. It's only set if the user provides a value on purpose. I think an explicit error is better, since we cannot satisfy the user's original intent.
If the user sets update_on_kvstore to False and the model contains no sparse weight, that's totally fine. Why should this be outside the if condition?
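
The behaviour argued for here can be sketched as a small validation helper. The function name `validate_update_on_kvstore` and the error wording are illustrative only and are not taken from the PR.

```python
def validate_update_on_kvstore(user_value, contains_sparse_weight, param_name):
    """Reject an explicit update_on_kvstore=False that cannot be honoured.

    user_value is None when the user did not set the option at all.
    """
    if contains_sparse_weight and user_value is False:
        # The user asked for something explicitly, so fail loudly rather than
        # silently overriding the request. Message wording is illustrative.
        raise RuntimeError(
            "update_on_kvstore=False conflicts with sparse weight '%s'" % param_name)
    return user_value


# Unset option with a sparse weight: nothing to reject.
unset = validate_update_on_kvstore(None, True, "embedding_weight")
# Explicit False without sparse weights: accepted as given.
dense_only = validate_update_on_kvstore(False, False, "dense_weight")
```

Because the check sits inside the sparse-weight condition, an explicit False only fails for models where it genuinely cannot be satisfied.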

-        For `RowSparseNDArray` values, this call is ignored,
-        please use ``row_sparse_pull`` instead.
+        pull with `RowSparseNDArray` is not supported for dist kvstore.
+        Please use ``row_sparse_pull`` instead.
Member

Should ignore_sparse be defaulted to false to be consistent with previous behavior?

Member Author

The previous behavior was to always ignore sparse values, so defaulting ignore_sparse to true is consistent.
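
The effect of the flag can be illustrated with a toy in-memory store. `ToyKVStore` is a hypothetical stand-in written for this sketch, not MXNet's kvstore; it only mimics the skip-or-pull decision for values tagged with the `row_sparse` storage type.

```python
class ToyKVStore:
    """In-memory stand-in that tags each value with a storage type."""

    def __init__(self):
        self._data = {}

    def init(self, key, value, stype="default"):
        self._data[key] = (stype, value)

    def pull(self, keys, ignore_sparse=True):
        """Return the pulled values; row_sparse entries are skipped by default."""
        pulled = {}
        for key in keys:
            stype, value = self._data[key]
            if stype == "row_sparse" and ignore_sparse:
                # Matches the old behaviour: the call is a no-op for sparse values.
                continue
            pulled[key] = value
        return pulled


kv = ToyKVStore()
kv.init("dense_w", [1.0, 2.0])
kv.init("sparse_g", {3: [0.5, 0.5]}, stype="row_sparse")

default_pull = kv.pull(["dense_w", "sparse_g"])
full_pull = kv.pull(["dense_w", "sparse_g"], ignore_sparse=False)
```

Callers that never pass the flag get the same result as before the change, while passing `ignore_sparse=False` opts in to pulling the sparse values as well.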

@eric-haibin-lin
Member Author

@piiswrong @rahul003 pls review again, thanks.

@eric-haibin-lin eric-haibin-lin merged commit 266de6b into apache:master Jul 9, 2018
XinYao1994 pushed a commit to XinYao1994/incubator-mxnet that referenced this pull request Aug 29, 2018
* clip sparse grad. fix _reduce for rowsparse param

* fix kvstore init for local kv

* trigger

* pull with ignore sparse

* rsp pull with priority

* add doc;

* fix bug in sparse kvstore

* +kvstore test

* add dist kvstore test

* enhance dist kv test

* fix lint

* fix lint

* CR comments
6 participants