[MXNET-120] Float16 support for distributed training #10183
Conversation
@solin319 This allows us to send some keys in fp16 and some in fp32. By using a parameter server with type char, we can also avoid interface changes. Now the user doesn't have to do anything special to use fp16. Could you also review this?
@marcoabreu I see something weird in the CI build for this.
To what extent did it break compilation locally? What's your build environment?
Locally I noticed it failing for (GPU GCC Openblas) as well as (CPU Clang). The CI status page says it used the right commit 8152ca4.
src/kvstore/kvstore_dist.h
Outdated
pull_pskv.size *= num_bytes;
CHECK_EQ(static_cast<size_t>(push_pskv.size), compr_size * num_bytes);
CHECK_EQ(static_cast<size_t>(pull_pskv.size), original_size * num_bytes);
CHECK_EQ(push_pskv.lens.size(), num_servers * 2);
Can we not do something like:
CHECK_EQ(static_cast<size_t>(push_pskv.size), compr_size);
CHECK_EQ(static_cast<size_t>(pull_pskv.size), original_size);
push_pskv.size *= num_bytes;
pull_pskv.size *= num_bytes;
instead, so that we perform two fewer multiplications?
Good idea, will do
Huh? How come you closed the PR?
Thanks. Do you know why the integration test doesn't show up in the CI steps?
The CI steps are cached and only re-evaluated if the stage is actually reached. It will show up once the job reaches the integration stage.
I added the nightly tests we had for distributed kvstore as integration tests to the CI, and the build has passed. @everyone Please review and let me know if there are any more concerns.
Thanks a lot, Rahul!
@rahul003 can you fix the conflicts?
@eric-haibin-lin @piiswrong Is this good to be merged?
amalgamation/Makefile
Outdated
@@ -23,6 +23,11 @@ ifndef OPENBLAS_ROOT
export OPENBLAS_ROOT=/usr/local/opt/openblas
endif

# use F16C if the architecture supports it, turned off by default
ifndef USE_F16C
what does 16C stand for?
Added a better comment. F16C is an instruction set extension supported by newer x86 CPUs. It provides intrinsics for faster fp16 compute.
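For context, here is a minimal sketch (my own illustration, not code from this PR) of the scalar F16C conversion intrinsics, assuming an x86 CPU with F16C and a GCC/Clang build with `-mf16c`:

```cpp
// Minimal sketch: F16C provides hardware conversions between fp32 and the
// 16-bit half-precision bit pattern. Build with: g++ -mf16c f16c_demo.cpp
#include <cstdint>
#include <cstdio>
#include <immintrin.h>

int main() {
  const float x = 3.14159f;
  // Float -> half, rounding to nearest even.
  const uint16_t half = _cvtss_sh(x, _MM_FROUND_TO_NEAREST_INT | _MM_FROUND_NO_EXC);
  // Half -> float.
  const float back = _cvtsh_ss(half);
  std::printf("original=%f roundtrip=%f\n", x, back);
  return 0;
}
```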
BTW, the build is failing.
The amalgamation build failed after I updated mshadow. I've updated the amalgamation makefile to fix the build.
@eric-haibin-lin can we merge this?
@piiswrong can this be merged?
* send as char
* fix bug on pull response, and rowsparse on worker side
* three modes
* default to mode 0 and add support for row sparse
* refactor sparse
* rowsparse numbytes fixes
* WIP tests
* update test sync
* remove prints
* refactoring
* Revert "refactoring" (this reverts commit 05ffa1b)
* undo refactoring to keep PR simple
* add wait to stored in pull default
* lint fixes
* undo static cast for recvblob
* lint fixes
* mode 1 changes
* sparse bug fix dtype
* mshadow default
* remove unused var
* remove debug statements
* clearer variables, reduced multiplication, const vars
* add const for more vars, comments
* comment syntax, code watcher, test default val
* remove unnecessary print in test
* trigger ci
* multi precision mode (debugging race condition)
* working rsp pushes
* finish multiprecision for row sparse
* rename num-bytes
* fix bug due to rename of numbytes, and remove debug logs
* address comments
* add integration test
* trigger ci
* integration test
* integration test
* fix path of script
* update mshadow
* disable f16c for amalgamation
* fix amalgamation build
* trigger ci
* disable f16c for jetson
Description
Supports distributed training with float16 (and technically float64 as well).
Modified the parameter server to work at the byte level, so that both fp16 and fp32 gradients can be sent while training in fp16 or mixed precision mode. The user does not have to do anything special to handle fp16 gradients; KVStore handles it internally. ps-lite already converted the data to char at a later step, so doing the conversion earlier gives MXNet more flexibility.
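To make the byte-level idea concrete, here is a minimal sketch of packing a typed gradient buffer into raw bytes so that fp16 and fp32 keys can share one char-typed transport. This is my own illustration under that assumption; `FakeKVPairs` and `PackAsBytes` are hypothetical names, not the PR's code.

```cpp
// Minimal sketch: lengths are tracked in bytes, so the same char-typed
// transport can carry fp16 (2 bytes/element) and fp32 (4 bytes/element).
#include <cstddef>
#include <cstdio>
#include <vector>

struct FakeKVPairs {            // hypothetical stand-in for a char-typed KV message
  std::vector<char> vals;       // raw payload bytes
  std::vector<int> lens;        // per-key lengths, in bytes
};

template <typename DType>
void PackAsBytes(const DType* grad, size_t num_elems, FakeKVPairs* out) {
  const size_t num_bytes = sizeof(DType);               // 2 for fp16, 4 for fp32
  const char* raw = reinterpret_cast<const char*>(grad);
  out->vals.assign(raw, raw + num_elems * num_bytes);   // copy values as raw bytes
  out->lens.push_back(static_cast<int>(num_elems * num_bytes));
}

int main() {
  float grad32[4] = {0.1f, 0.2f, 0.3f, 0.4f};
  FakeKVPairs kv;
  PackAsBytes(grad32, 4, &kv);                          // 16 bytes for an fp32 key
  std::printf("payload bytes: %zu\n", kv.vals.size());
  return 0;
}
```

Because the transport only ever sees byte counts, keys of different dtypes can be mixed in one kvstore without interface changes.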
The type of request and the data type received by the server are encoded in a command computed with the Cantor pairing function. This function generates a unique number `z` from two numbers `x` and `y`, and allows us to invert `z` to recover the unique original pair `x` and `y`.
For keys with fp16 gradients, the values are sent as fp16. All computation on the server is done in fp16, and the server responds to workers' pull requests in fp16.
FP16 support depends heavily on fp16 computation on the CPU, which is currently very slow in MXNet.
Two changes can greatly improve this speed, and I have opened PRs for both. For best performance, and to reproduce the numbers below, both PRs would need to be merged.
Results
8 p3.8xlarge machines, with all 4 GPUs on each machine in use. Throughput in samples/sec for ResNet-50 with synthetic ImageNet data.
The effect of PR 9029 is small for fp32 end to end, but profiling shows the addition operation is faster with the kernel launch; the gain is hidden by other factors. In the case of fp16 and gradient compression, the effect is very visible.
Checklist
Essentials
Changes
Comments
@eric-haibin-lin @reminisce @cjolivier01 @anirudh2290 @haojin2 @ptrendx please review