[Metric] Accelerate the calculation of F1 #9833
Conversation
python/mxnet/metric.py
Outdated
```python
                self.false_negatives += 1.
            else:
                self.true_negatives += 1.
        self.true_positives += ((pred_label == 1) * (label == 1)).sum()
```
You can cache computations such as `predicted_true = (pred_label == 1)` and reuse them later.
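A minimal sketch of what the vectorized update could look like with the comparisons cached; the names and signature here are illustrative, not the exact code in `python/mxnet/metric.py`:

```python
import numpy as np

def update_binary_stats(label, pred_label):
    """Accumulate confusion-matrix counts for binary F1 without a Python loop.

    `label` and `pred_label` are 1-D numpy arrays of 0/1 values.
    Returns (true_positives, false_positives, false_negatives, true_negatives).
    """
    # Cache the element-wise comparisons so each is computed only once.
    pred_true = (pred_label == 1)
    pred_false = ~pred_true
    label_true = (label == 1)
    label_false = ~label_true

    true_positives = (pred_true & label_true).sum()
    false_positives = (pred_true & label_false).sum()
    false_negatives = (pred_false & label_true).sum()
    true_negatives = (pred_false & label_false).sum()
    return true_positives, false_positives, false_negatives, true_negatives
```

Each count is computed once per batch and added to the running totals, replacing the per-sample branches shown in the diff above.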
After some investigation, I find using GPU is not faster because the batch sizes tested are rather small, i.e., 16, 64, 256, 1024. NDArray + GPU is not as fast as numpy in these cases.

```python
import mxnet as mx
import mxnet.ndarray as nd
import numpy as np
import time

# Warm up the GPU
for _ in range(10):
    a = nd.ones((100, 100), ctx=mx.gpu())
    b = a * 2
    b.asnumpy()

N = 100
# Test the speed
for data_shape in [(16,), (64,), (256,), (1024,)]:
    dat_npy = np.random.uniform(0, 1, data_shape)
    dat_nd_gpu = nd.array(dat_npy, ctx=mx.gpu())
    dat_nd_cpu = nd.array(dat_npy, ctx=mx.cpu())
    nd.waitall()

    start = time.time()
    for _ in range(N):
        np_ret = np.sum(dat_npy)
    end = time.time()
    np_time = end - start

    start = time.time()
    for _ in range(N):
        nd_ret = nd.sum(dat_nd_gpu).asscalar()
    end = time.time()
    nd_gpu_time = end - start

    start = time.time()
    for _ in range(N):
        nd_ret = nd.sum(dat_nd_cpu).asscalar()
    end = time.time()
    nd_cpu_time = end - start

    print('sum, data_shape=%s, numpy time=%g, mxnet gpu time=%g, mxnet cpu time=%g'
          % (str(data_shape), np_time, nd_gpu_time, nd_cpu_time))
```

Result:
Given that those are typical cases, would it make sense to always use CPU?
Yes, I think CPU should be used.
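A hedged sketch of what "always use CPU" could mean inside a metric's `update()`: copy predictions and labels to the host once and do all counting in numpy, whatever device produced them. The `to_numpy` helper below is made up for illustration:

```python
import numpy as np
import mxnet.ndarray as nd

def to_numpy(x):
    """Illustrative helper: bring an NDArray (CPU or GPU) onto the host as a numpy array."""
    return x.asnumpy() if isinstance(x, nd.NDArray) else np.asarray(x)

# Inside a metric update, regardless of which device produced the outputs:
# label_np = to_numpy(label)
# pred_np = to_numpy(pred).argmax(axis=-1)
# ...accumulate TP/FP/FN/TN with numpy as in the sketch above...
```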
There's no wait in between the for loops. That should be addressed before calling the results conclusive.
We have
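For context, `.asscalar()` converts a one-element NDArray to a Python scalar via a host copy and therefore waits for the pending computation, so the `asscalar()`-based loops above are already synchronized on every iteration. A minimal illustration:

```python
import mxnet as mx
import mxnet.ndarray as nd

a = nd.ones((1000, 1000), ctx=mx.cpu())
s = nd.sum(a)       # enqueued asynchronously on MXNet's engine
v = s.asscalar()    # copies to a Python float; this call blocks until the sum has finished
print(v)            # 1000000.0
```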
Ah, sorry, I read the results wrong. Looks like numpy sum is an order of magnitude faster than either nd CPU or GPU. What does it look like if we take into account asnumpy()?
Without `asscalar()`, synchronizing with `nd.waitall()` after each timed loop instead:

```python
import mxnet as mx
import mxnet.ndarray as nd
import numpy as np
import time

# Warm up the GPU
for _ in range(10):
    a = nd.ones((100, 100), ctx=mx.gpu())
    b = a * 2
    b.asnumpy()

N = 100
# Test the speed
for data_shape in [(16,), (64,), (256,), (1024,)]:
    dat_npy = np.random.uniform(0, 1, data_shape)
    dat_nd_gpu = nd.array(dat_npy, ctx=mx.gpu())
    dat_nd_cpu = nd.array(dat_npy, ctx=mx.cpu())
    nd.waitall()

    start = time.time()
    for _ in range(N):
        np_ret = np.sum(dat_npy)
    end = time.time()
    np_time = end - start

    start = time.time()
    for _ in range(N):
        nd_ret = nd.sum(dat_nd_gpu)
    nd.waitall()
    end = time.time()
    nd_gpu_time = end - start

    start = time.time()
    for _ in range(N):
        nd_ret = nd.sum(dat_nd_cpu)
    nd.waitall()
    end = time.time()
    nd_cpu_time = end - start

    print('sum, data_shape=%s, numpy time=%g, mxnet gpu time=%g, mxnet cpu time=%g'
          % (str(data_shape), np_time, nd_gpu_time, nd_cpu_time))
```

Result:
Our implementations are much slower than numpy if the inputs are small.
I've tested the case that calls `asnumpy()` first and then sums on the host with numpy:

```python
import mxnet as mx
import mxnet.ndarray as nd
import numpy as np
import time

# Warm up the GPU
for _ in range(10):
    a = nd.ones((100, 100), ctx=mx.gpu())
    b = a * 2
    b.asnumpy()

N = 100
# Test the speed
for data_shape in [(16,), (64,), (256,), (1024,)]:
    dat_npy = np.random.uniform(0, 1, data_shape)
    dat_nd_gpu = nd.array(dat_npy, ctx=mx.gpu())
    dat_nd_cpu = nd.array(dat_npy, ctx=mx.cpu())
    nd.waitall()

    start = time.time()
    for _ in range(N):
        np_ret = np.sum(dat_nd_gpu.asnumpy())
    end = time.time()
    np_copy_from_gpu_time = end - start

    start = time.time()
    for _ in range(N):
        np_ret = np.sum(dat_nd_cpu.asnumpy())
    end = time.time()
    np_copy_from_cpu_time = end - start

    start = time.time()
    for _ in range(N):
        nd_ret = nd.sum(dat_nd_gpu).asscalar()
    end = time.time()
    nd_gpu_time = end - start

    start = time.time()
    for _ in range(N):
        nd_ret = nd.sum(dat_nd_cpu).asscalar()
    end = time.time()
    nd_cpu_time = end - start

    print('sum, data_shape=%s, numpy from gpu=%g, numpy from cpu=%g, mxnet gpu=%g, mxnet cpu=%g'
          % (str(data_shape), np_copy_from_gpu_time, np_copy_from_cpu_time, nd_gpu_time, nd_cpu_time))
```

Result:
* Accelerate the calculation of F1
* cache the mid results
* trigger CI
Description
Accelerate the calculation of F1 by removing the for-loop. I find that using GPU will in fact make the code slower 😅.
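Schematically, the change replaces a per-sample Python loop with whole-array operations; a simplified before/after sketch (not the exact code in `python/mxnet/metric.py`), showing only the true/false positive counts since the other two counts follow the same pattern:

```python
import numpy as np

label = np.random.randint(0, 2, size=1024)
pred_label = np.random.randint(0, 2, size=1024)

# Before (schematic): one Python-level branch per sample.
tp_loop = fp_loop = 0
for y, p in zip(label, pred_label):
    if p == 1 and y == 1:
        tp_loop += 1
    elif p == 1 and y == 0:
        fp_loop += 1

# After (schematic): whole-array comparisons, summed once per batch.
tp_vec = ((pred_label == 1) & (label == 1)).sum()
fp_vec = ((pred_label == 1) & (label == 0)).sum()

# Both formulations count the same events, so the metric value is unchanged.
assert tp_loop == tp_vec and fp_loop == fp_vec
```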
Checklist
Essentials
make lint
Changes
Results of running the script in #9705
Before:
After: