
[MXNET-120] Float16 support for distributed training #10183

Merged
merged 49 commits into apache:master from rahul003:fp16-new on Apr 11, 2018

Conversation

rahul003
Member

@rahul003 rahul003 commented Mar 21, 2018

Description

Supports distributed training with float16 (and technically float64 as well).

  • Modified the parameter server to work at the byte level, so that we can send both fp16 and fp32 gradients when training in fp16 or mixed precision modes. The user does not have to do anything special to handle fp16 gradients; KVStore handles it internally. Even before this change, ps-lite eventually converted data to char; doing the conversion at an earlier step gives MXNet more flexibility.

  • The request type and the data type received by the server are encoded into a single command value computed with the Cantor pairing function. The pairing function maps two numbers x and y to a unique number z, and z can be inverted to recover the original pair x and y (see the sketch after this list).

  • Keys with fp16 gradients are sent as fp16. All computation on the server is done in fp16, and the server responds to workers' pull requests with fp16 data.

  • FP16 support depends heavily on fp16 computation on the CPU, which is currently very slow in MXNet.
    Two changes can greatly improve this speed, and I have made PRs for both. For the best performance, and to reproduce the numbers below, we would need to merge these two PRs:

    • Leveraging F16C instruction set support on x86 machines for fast conversion between float16 and float. PR in MShadow
    • Replacing mshadow Eval with Kernel Launch PR in MXNet
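
As referenced above, here is a minimal C++ sketch (not the PR's exact code; the encoding of request type and dtype as small integers is illustrative) of how two values can be packed into one command with the Cantor pairing function and unpacked again on the server:

#include <cmath>
#include <utility>

// Cantor pairing: z = (x + y) * (x + y + 1) / 2 + y
inline int EncodeCommand(int request_type, int dtype) {
  return (request_type + dtype) * (request_type + dtype + 1) / 2 + dtype;
}

// Inverse of the pairing: recovers (request_type, dtype) from z.
inline std::pair<int, int> DecodeCommand(int z) {
  int w = static_cast<int>((std::sqrt(8.0 * z + 1) - 1) / 2);
  int t = w * (w + 1) / 2;
  int dtype = z - t;
  int request_type = w - dtype;
  return {request_type, dtype};
}

Because the pairing is a bijection on pairs of non-negative integers, the server can recover both the request type and the data type from the single command value it receives.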

Results

8 p3.8xlarge machines, with all 4 GPUs on each machine in use. Throughput in samples/sec for ResNet-50 on synthetic ImageNet data.

FP32 | FP16 | Comments
560 | 150 | Only this PR
610 | 780 | With this PR and PR 9029 in MXNet
610 | 1100 | With this PR, PR 9029 in MXNet, and PR 330 in MShadow

The effect of PR 9029 on fp32 end to end is small, but profiling shows that the addition operations are faster with Kernel launch; the gain is simply hidden by other factors. In the case of fp16 and gradient compression, the effect is very visible.
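
For context, below is a simplified, standalone illustration of the Kernel::Launch pattern that PR 9029 moves the server-side additions to; MXNet's real implementation lives in mxnet_op.h and also dispatches to GPU, so treat this as a sketch rather than the PR's actual code.

#include <cstdint>

struct AddOp {
  // Map is applied independently to each index i, so the loop below can be
  // parallelized with OpenMP on CPU or launched as a CUDA kernel on GPU.
  static inline void Map(int64_t i, float* out, const float* lhs, const float* rhs) {
    out[i] = lhs[i] + rhs[i];
  }
};

template <typename OP, typename... Args>
void Launch(int64_t n, Args... args) {
  #pragma omp parallel for
  for (int64_t i = 0; i < n; ++i) {
    OP::Map(i, args...);
  }
}

// Usage: Launch<AddOp>(n, out_ptr, grad_a, grad_b);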

Checklist

Essentials

  • The PR title starts with [MXNET-$JIRA_ID], where $JIRA_ID refers to the relevant JIRA issue created (except PRs with tiny changes)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
  • Code is well-documented:

Changes

  • Changed kvstore_dist and kvstore_dist_server to receive float, half, and double data types
  • Refactored some sparse server-side handling into functions
  • Added nightly tests and cleaned up some existing tests

Comments

  • Gradient compression is only supported for fp32. I'll fix this in a different PR; the compression code needs to be changed to handle different data types.

@eric-haibin-lin @reminisce @cjolivier01 @anirudh2290 @haojin2 @ptrendx please review

@rahul003 rahul003 requested a review from cjolivier01 as a code owner March 21, 2018 05:54
@rahul003
Member Author

rahul003 commented Mar 21, 2018

@solin319 This allows us to send some keys in fp16 and some in fp32. By using a parameter server with type char, we also avoid interface changes; now the user doesn't have to do anything special to use fp16. Could you also review this?
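
To illustrate the idea (hypothetical names, not the code in this PR): the worker sizes each key's buffer in bytes from its dtype, so fp16 and fp32 keys can share the same char-typed ps-lite interface without the user doing anything special.

#include <cstddef>

// Element sizes keyed by mshadow's type flags (kFloat32 = 0, kFloat64 = 1, kFloat16 = 2).
inline size_t NumBytes(int dtype) {
  switch (dtype) {
    case 0: return 4;   // float32
    case 1: return 8;   // float64
    case 2: return 2;   // float16
    default: return 4;
  }
}

// Push a buffer of num_elems elements with element type dtype.
void PushAsBytes(const void* data, size_t num_elems, int dtype) {
  const char* bytes = static_cast<const char*>(data);
  size_t len = num_elems * NumBytes(dtype);
  // ... hand (bytes, len) to the char-typed ps-lite worker here ...
  (void)bytes; (void)len;
}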

@rahul003
Member Author

rahul003 commented Mar 21, 2018

@marcoabreu I see something weird in the CI build for this. I made some changes to address issues raised by lint, but that broke compilation locally, yet the CI went through. Any idea what's going on?
http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/PR-10183/2/pipeline/

@marcoabreu
Contributor

In what way did it break compilation locally? What's your build environment?

@rahul003
Member Author

Locally I noticed it failing for (GPU, GCC, OpenBLAS) as well as (CPU, Clang). The CI status page says it used the right commit, 8152ca4.

pull_pskv.size *= num_bytes;
CHECK_EQ(static_cast<size_t>(push_pskv.size), compr_size * num_bytes);
CHECK_EQ(static_cast<size_t>(pull_pskv.size), original_size * num_bytes);
CHECK_EQ(push_pskv.lens.size(), num_servers * 2);
Contributor


Can we not do something like:

CHECK_EQ(static_cast<size_t>(push_pskv.size), compr_size);
CHECK_EQ(static_cast<size_t>(pull_pskv.size), original_size);
push_pskv.size *= num_bytes;
pull_pskv.size *= num_bytes;

instead, so that we perform two fewer multiplications?

Member Author


Good idea, will do

@rahul003 rahul003 closed this Mar 21, 2018
@marcoabreu
Contributor

Huh? How come you closed the PR?

@rahul003
Member Author

rahul003 commented Apr 5, 2018

Thanks. Do you know why the integration test doesn't show up in the CI steps?

@marcoabreu
Contributor

The CI steps are cached and only re-evaluated if the stage is actually reached. It will show up once the job reaches the integration stage.

@rahul003
Member Author

rahul003 commented Apr 6, 2018

I added the nightly tests we had for distributed kvstore as integration tests to the CI. And the build has passed.

@everyone Please review and let me know if there are any more concerns

@marcoabreu
Contributor

Thanks a lot, Rahul!

@anirudh2290
Member

@rahul003 can you fix the conflicts?

@rahul003
Member Author

rahul003 commented Apr 9, 2018

@eric-haibin-lin @piiswrong Is this good to be merged?

@@ -23,6 +23,11 @@ ifndef OPENBLAS_ROOT
export OPENBLAS_ROOT=/usr/local/opt/openblas
endif

# use F16C if the architecture supports it, turned off by default
ifndef USE_F16C
Member


What does F16C stand for?

Member Author


Added a better comment. F16C is an instruction set extension supported by newer x86 CPUs; it provides intrinsics for fast conversion between fp16 and fp32, which speeds up fp16 compute on the CPU.
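
As a rough illustration (not MShadow's actual code), F16C exposes intrinsics such as _mm256_cvtph_ps and _cvtsh_ss for converting packed or scalar fp16 values to float32; compiling with -mf16c on GCC/Clang enables them:

#include <immintrin.h>
#include <cstdint>
#include <cstddef>

// Convert n packed float16 values (stored as uint16_t) to float32.
void HalfToFloat(const uint16_t* src, float* dst, size_t n) {
  size_t i = 0;
  for (; i + 8 <= n; i += 8) {
    // Load 8 half-precision values and widen them to 8 floats in one instruction.
    __m128i h = _mm_loadu_si128(reinterpret_cast<const __m128i*>(src + i));
    _mm256_storeu_ps(dst + i, _mm256_cvtph_ps(h));
  }
  for (; i < n; ++i) {
    dst[i] = _cvtsh_ss(src[i]);  // scalar F16C conversion for the tail
  }
}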

@eric-haibin-lin
Member

BTW build is failing

@rahul003
Member Author

Amalgamation build failed after I updated mshadow. I've updated the makefile for amalgamation to correct the build.

@anirudh2290
Member

@eric-haibin-lin can we merge this?

@anirudh2290
Member

@piiswrong can this be merged?

@piiswrong piiswrong merged commit 9a8c164 into apache:master Apr 11, 2018
@rahul003 rahul003 deleted the fp16-new branch April 11, 2018 18:48
dabraude pushed a commit to dabraude/incubator-mxnet that referenced this pull request Apr 12, 2018
* send as char

* fix bug on pull response, and rowsparse on worker side

* three modes

* default to mode 0 and add support for row sparse

* refactor sparse

* rowsparse numbytes fixes

* WIP tests

* update test sync

* remove prints

* refactoring

* Revert "refactoring"

This reverts commit 05ffa1b.

* undo refactoring to keep PR simple

* add wait to stored in pull default

* lint fixes

* undo static cast for recvblob

* lint fixes

* mode 1 changes

* sparse bug fix dtype

* mshadow default

* remove unused var

* remove debug statements

* clearer variables, reduced multiplication, const vars

* add const for more vars, comments

* comment syntax, code watcher, test default val

* remove unnecessary print in test

* trigger ci

* multi precision mode (debugging race condition)

* working rsp pushes

* finish multiprecision for row sparse

* rename num-bytes

* fix bug due to rename of numbytes, and remove debug logs

* address comments

* add integration test

* trigger ci

* integration test

* integration test

* fix path of script

* update mshadow

* disable f16c for amalgamation

* fix amalgamation build

* trigger ci

* disable f16c for jetson
rahul003 added a commit to rahul003/mxnet that referenced this pull request Jun 4, 2018
zheng-da pushed a commit to zheng-da/incubator-mxnet that referenced this pull request Jun 28, 2018
piiswrong pushed a commit that referenced this pull request Jun 28, 2018
XinYao1994 pushed a commit to XinYao1994/incubator-mxnet that referenced this pull request Aug 29, 2018