
[Metric] Accelerate the calculation of F1 #9833

Merged: 3 commits from sxjscience:acc_f1 into apache:master on Feb 27, 2018

Conversation

@sxjscience (Member) commented on Feb 20, 2018

Description

Accelerate the calculation of F1 by removing the for-loop. I find that using the GPU actually makes the code slower 😅.

Checklist

Essentials

  • Passed code style checking (make lint)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
  • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
  • Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
  • Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
  • Code is well-documented:
  • For user-facing API changes, API doc string has been updated.
  • For new C++ functions in header files, their functionalities and arguments are documented.
  • For new examples, README.md is added to explain what the example does, the source of the dataset, expected performance on the test set, and a reference to the original paper if applicable
  • To the best of my knowledge, examples are either not affected by this change or have been fixed to be compatible with this change

Changes

  • Accelerate F1 (see the sketch below)
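
For context, the speedup comes from replacing the per-sample Python loop with vectorized boolean arithmetic over the whole batch. A minimal sketch of the idea, assuming binary 0/1 labels (illustrative only; the helper names are hypothetical and this is not the exact code merged here):

import numpy as np

def update_f1_counts(label, pred_label, counts):
    # Evaluate each boolean mask once for the whole batch
    # instead of branching per sample.
    pred_true = (pred_label == 1)
    label_true = (label == 1)
    counts['tp'] += int((pred_true & label_true).sum())
    counts['fp'] += int((pred_true & ~label_true).sum())
    counts['fn'] += int((~pred_true & label_true).sum())
    counts['tn'] += int((~pred_true & ~label_true).sum())

def f1(counts):
    tp, fp, fn = counts['tp'], counts['fp'], counts['fn']
    precision = tp / (tp + fp) if tp + fp else 0.
    recall = tp / (tp + fn) if tp + fn else 0.
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.

counts = {'tp': 0, 'fp': 0, 'fn': 0, 'tn': 0}
update_f1_counts(np.array([1, 0, 1, 1]), np.array([1, 1, 0, 1]), counts)
print(f1(counts))  # precision = recall = 2/3, so F1 = 2/3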

Results of running the script in #9705
Before:

Metric         Data-Ctx  Label-Ctx   Data Size   Batch Size     Output Dim     Elapsed Time
------------------------------------------------------------------------------------------
F1             cpu(0)    cpu(0)      131072      16             2              0.8075
F1             cpu(0)    gpu(0)      131072      16             2              1.7331
F1             gpu(0)    cpu(0)      131072      16             2              1.8095
F1             gpu(0)    gpu(0)      131072      16             2              2.5608
------------------------------------------------------------------------------------------
F1             cpu(0)    cpu(0)      131072      64             2              0.48318
F1             cpu(0)    gpu(0)      131072      64             2              0.68605
F1             gpu(0)    cpu(0)      131072      64             2              0.59764
F1             gpu(0)    gpu(0)      131072      64             2              0.71643
------------------------------------------------------------------------------------------
F1             cpu(0)    cpu(0)      131072      256            2              0.33182
F1             cpu(0)    gpu(0)      131072      256            2              0.49648
F1             gpu(0)    cpu(0)      131072      256            2              0.43996
F1             gpu(0)    gpu(0)      131072      256            2              0.47916
------------------------------------------------------------------------------------------
F1             cpu(0)    cpu(0)      131072      1024           2              0.30854
F1             cpu(0)    gpu(0)      131072      1024           2              0.3868
F1             gpu(0)    cpu(0)      131072      1024           2              0.33063
F1             gpu(0)    gpu(0)      131072      1024           2              0.33789
------------------------------------------------------------------------------------------
F1             cpu(0)    cpu(0)      16384       16             2              0.09547
F1             cpu(0)    gpu(0)      16384       16             2              0.24957
F1             gpu(0)    cpu(0)      16384       16             2              0.325
F1             gpu(0)    gpu(0)      16384       16             2              0.30319
------------------------------------------------------------------------------------------
F1             cpu(0)    cpu(0)      16384       64             2              0.0871
F1             cpu(0)    gpu(0)      16384       64             2              0.11098
F1             gpu(0)    cpu(0)      16384       64             2              0.097242
F1             gpu(0)    gpu(0)      16384       64             2              0.1261
------------------------------------------------------------------------------------------
F1             cpu(0)    cpu(0)      16384       256            2              0.041346
F1             cpu(0)    gpu(0)      16384       256            2              0.055987
F1             gpu(0)    cpu(0)      16384       256            2              0.054622
F1             gpu(0)    gpu(0)      16384       256            2              0.064322
------------------------------------------------------------------------------------------
F1             cpu(0)    cpu(0)      16384       1024           2              0.074404
F1             cpu(0)    gpu(0)      16384       1024           2              0.057779
F1             gpu(0)    cpu(0)      16384       1024           2              0.045733
F1             gpu(0)    gpu(0)      16384       1024           2              0.044807
------------------------------------------------------------------------------------------
F1             cpu(0)    cpu(0)      2048        16             2              0.012862
F1             cpu(0)    gpu(0)      2048        16             2              0.031966
F1             gpu(0)    cpu(0)      2048        16             2              0.028063
F1             gpu(0)    gpu(0)      2048        16             2              0.032354
------------------------------------------------------------------------------------------
F1             cpu(0)    cpu(0)      2048        64             2              0.0078101
F1             cpu(0)    gpu(0)      2048        64             2              0.017859
F1             gpu(0)    cpu(0)      2048        64             2              0.011364
F1             gpu(0)    gpu(0)      2048        64             2              0.014009
------------------------------------------------------------------------------------------
F1             cpu(0)    cpu(0)      2048        256            2              0.0052049
F1             cpu(0)    gpu(0)      2048        256            2              0.0073137
F1             gpu(0)    cpu(0)      2048        256            2              0.0071828
F1             gpu(0)    gpu(0)      2048        256            2              0.0076649
------------------------------------------------------------------------------------------
F1             cpu(0)    cpu(0)      2048        1024           2              0.0050001
F1             cpu(0)    gpu(0)      2048        1024           2              0.0089076
F1             gpu(0)    cpu(0)      2048        1024           2              0.011183
F1             gpu(0)    gpu(0)      2048        1024           2              0.0092695
------------------------------------------------------------------------------------------

After:

Metric         Data-Ctx  Label-Ctx   Data Size   Batch Size     Output Dim     Elapsed Time
------------------------------------------------------------------------------------------
F1             cpu(0)    cpu(0)      131072      16             2              0.69502
F1             cpu(0)    gpu(0)      131072      16             2              1.8135
F1             gpu(0)    cpu(0)      131072      16             2              1.754
F1             gpu(0)    gpu(0)      131072      16             2              2.4507
------------------------------------------------------------------------------------------
F1             cpu(0)    cpu(0)      131072      64             2              0.21756
F1             cpu(0)    gpu(0)      131072      64             2              0.30392
F1             gpu(0)    cpu(0)      131072      64             2              0.48876
F1             gpu(0)    gpu(0)      131072      64             2              0.56596
------------------------------------------------------------------------------------------
F1             cpu(0)    cpu(0)      131072      256            2              0.070438
F1             cpu(0)    gpu(0)      131072      256            2              0.10176
F1             gpu(0)    cpu(0)      131072      256            2              0.14616
F1             gpu(0)    gpu(0)      131072      256            2              0.17256
------------------------------------------------------------------------------------------
F1             cpu(0)    cpu(0)      131072      1024           2              0.027534
F1             cpu(0)    gpu(0)      131072      1024           2              0.049942
F1             gpu(0)    cpu(0)      131072      1024           2              0.031038
F1             gpu(0)    gpu(0)      131072      1024           2              0.035196
------------------------------------------------------------------------------------------
F1             cpu(0)    cpu(0)      16384       16             2              0.11917
F1             cpu(0)    gpu(0)      16384       16             2              0.34919
F1             gpu(0)    cpu(0)      16384       16             2              0.22027
F1             gpu(0)    gpu(0)      16384       16             2              0.42634
------------------------------------------------------------------------------------------
F1             cpu(0)    cpu(0)      16384       64             2              0.029119
F1             cpu(0)    gpu(0)      16384       64             2              0.049232
F1             gpu(0)    cpu(0)      16384       64             2              0.04909
F1             gpu(0)    gpu(0)      16384       64             2              0.081102
------------------------------------------------------------------------------------------
F1             cpu(0)    cpu(0)      16384       256            2              0.0090246
F1             cpu(0)    gpu(0)      16384       256            2              0.014211
F1             gpu(0)    cpu(0)      16384       256            2              0.012915
F1             gpu(0)    gpu(0)      16384       256            2              0.016437
------------------------------------------------------------------------------------------
F1             cpu(0)    cpu(0)      16384       1024           2              0.0036936
F1             cpu(0)    gpu(0)      16384       1024           2              0.011009
F1             gpu(0)    cpu(0)      16384       1024           2              0.0092134
F1             gpu(0)    gpu(0)      16384       1024           2              0.0060875
------------------------------------------------------------------------------------------
F1             cpu(0)    cpu(0)      2048        16             2              0.015834
F1             cpu(0)    gpu(0)      2048        16             2              0.026115
F1             gpu(0)    cpu(0)      2048        16             2              0.025691
F1             gpu(0)    gpu(0)      2048        16             2              0.030914
------------------------------------------------------------------------------------------
F1             cpu(0)    cpu(0)      2048        64             2              0.0044892
F1             cpu(0)    gpu(0)      2048        64             2              0.0088701
F1             gpu(0)    cpu(0)      2048        64             2              0.0074565
F1             gpu(0)    gpu(0)      2048        64             2              0.0094738
------------------------------------------------------------------------------------------
F1             cpu(0)    cpu(0)      2048        256            2              0.0012782
F1             cpu(0)    gpu(0)      2048        256            2              0.0021632
F1             gpu(0)    cpu(0)      2048        256            2              0.002121
F1             gpu(0)    gpu(0)      2048        256            2              0.0024769
------------------------------------------------------------------------------------------
F1             cpu(0)    cpu(0)      2048        1024           2              0.00060058
F1             cpu(0)    gpu(0)      2048        1024           2              0.0007112
F1             gpu(0)    cpu(0)      2048        1024           2              0.00064564
F1             gpu(0)    gpu(0)      2048        1024           2              0.00079536
------------------------------------------------------------------------------------------

A reviewer (Member) commented on this hunk of the diff:

self.false_negatives += 1.
else:
self.true_negatives += 1.
self.true_positives += ((pred_label == 1) * (label == 1)).sum()

you can cache the computation, e.g. predicted_true = pred_label == 1, and use it later.
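
Applied to the quoted lines, that suggestion would look roughly like this (a sketch only, not the exact code committed in this PR; label and pred_label are assumed to be NumPy arrays):

predicted_true = (pred_label == 1)   # cache the mask once and reuse it
label_true = (label == 1)
self.true_positives += (predicted_true * label_true).sum()
self.false_positives += (predicted_true * (1 - label_true)).sum()
self.false_negatives += ((1 - predicted_true) * label_true).sum()
self.true_negatives += ((1 - predicted_true) * (1 - label_true)).sum()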

@sxjscience (Member, Author) commented on Feb 20, 2018

After some investigation, I find that using the GPU is not faster because the tested batch sizes are rather small, i.e., 16, 64, 256, 1024. NDArray on GPU is not as fast as NumPy in these cases.
I've used this script to test the speed:

import mxnet as mx
import mxnet.ndarray as nd
import numpy as np
import time

# Warm up the GPU
for _ in range(10):
    a = nd.ones((100, 100), ctx=mx.gpu())
    b = a * 2
    b.asnumpy()

N = 100

# Test the speed
for data_shape in [(16,), (64,), (256,), (1024,)]:
    dat_npy = np.random.uniform(0, 1, data_shape)
    dat_nd_gpu = nd.array(dat_npy, ctx=mx.gpu())
    dat_nd_cpu = nd.array(dat_npy, ctx=mx.cpu())
    nd.waitall()
    start = time.time()
    for _ in range(N):
        np_ret = np.sum(dat_npy)
    end = time.time()
    np_time = end - start
    start = time.time()
    for _ in range(N):
        nd_ret = nd.sum(dat_nd_gpu).asscalar()
    end = time.time()
    nd_gpu_time = end - start
    start = time.time()
    for _ in range(N):
        nd_ret = nd.sum(dat_nd_cpu).asscalar()
    end = time.time()
    nd_cpu_time = end - start
    print('sum, data_shape=%s, numpy time=%g, mxnet gpu time=%g, mxnet cpu time=%g' %(str(data_shape), np_time, nd_gpu_time, nd_cpu_time))

Result:

sum, data_shape=(16,), numpy time=0.00067687, mxnet gpu time=0.0206566, mxnet cpu time=0.193971
sum, data_shape=(64,), numpy time=0.000299454, mxnet gpu time=0.0147879, mxnet cpu time=0.00626922
sum, data_shape=(256,), numpy time=0.000304699, mxnet gpu time=0.0141888, mxnet cpu time=0.00622177
sum, data_shape=(1024,), numpy time=0.000349522, mxnet gpu time=0.015424, mxnet cpu time=0.00976443

@szha (Member) commented on Feb 20, 2018

Given that those are typical cases, would it make sense to always use cpu?

@sxjscience (Member, Author) replied:
Yes, I think CPU should be used.

@szha (Member) commented on Feb 21, 2018

There's no wait in between the for loops. That should be addressed before calling the results conclusive.

@sxjscience (Member, Author) replied:
We have .asscalar(), so it should be okay not to call nd.waitall() between the for-loops.
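
For context, .asscalar() copies the single-element NDArray into a NumPy scalar, which blocks until the asynchronous computation has finished, so every iteration of the timing loop is already synchronized:

s = nd.sum(dat_nd_gpu)  # asynchronous: the call returns immediately
v = s.asscalar()        # synchronous: blocks until the sum is actually computed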

@szha (Member) commented on Feb 21, 2018

Ah, sorry, I read the results wrong. It looks like numpy's sum is an order of magnitude faster than nd on either CPU or GPU. What does it look like if we take asnumpy() into account?

@sxjscience (Member, Author) replied:
Without .asscalar():

import mxnet as mx
import mxnet.ndarray as nd
import numpy as np
import time

# Warm up the GPU
for _ in range(10):
    a = nd.ones((100, 100), ctx=mx.gpu())
    b = a * 2
    b.asnumpy()

N = 100

# Test the speed
for data_shape in [(16,), (64,), (256,), (1024,)]:
    dat_npy = np.random.uniform(0, 1, data_shape)
    dat_nd_gpu = nd.array(dat_npy, ctx=mx.gpu())
    dat_nd_cpu = nd.array(dat_npy, ctx=mx.cpu())
    nd.waitall()
    start = time.time()
    for _ in range(N):
        np_ret = np.sum(dat_npy)
    end = time.time()
    np_time = end - start
    start = time.time()
    for _ in range(N):
        nd_ret = nd.sum(dat_nd_gpu)
        nd.waitall()
    end = time.time()
    nd_gpu_time = end - start
    start = time.time()
    for _ in range(N):
        nd_ret = nd.sum(dat_nd_cpu)
        nd.waitall()
    end = time.time()
    nd_cpu_time = end - start
    print('sum, data_shape=%s, numpy time=%g, mxnet gpu time=%g, mxnet cpu time=%g' %(str(data_shape), np_time, nd_gpu_time, nd_cpu_time))

Result:

sum, data_shape=(16,), numpy time=0.000400066, mxnet gpu time=0.0181644, mxnet cpu time=0.0328023
sum, data_shape=(64,), numpy time=0.000379086, mxnet gpu time=0.00761223, mxnet cpu time=0.037406
sum, data_shape=(256,), numpy time=0.000583172, mxnet gpu time=0.0079515, mxnet cpu time=0.064379
sum, data_shape=(1024,), numpy time=0.00065589, mxnet gpu time=0.00781155, mxnet cpu time=0.00705242

@sxjscience (Member, Author):
Our implementations are much slower than numpy if the inputs are small.

@sxjscience (Member, Author):
I've tested the case that calls .asnumpy() before using numpy.sum. However, the numpy version is still faster:

import mxnet as mx
import mxnet.ndarray as nd
import numpy as np
import time

# Warm up the GPU
for _ in range(10):
    a = nd.ones((100, 100), ctx=mx.gpu())
    b = a * 2
    b.asnumpy()

N = 100

# Test the speed
for data_shape in [(16,), (64,), (256,), (1024,)]:
    dat_npy = np.random.uniform(0, 1, data_shape)
    dat_nd_gpu = nd.array(dat_npy, ctx=mx.gpu())
    dat_nd_cpu = nd.array(dat_npy, ctx=mx.cpu())
    nd.waitall()
    start = time.time()
    for _ in range(N):
        np_ret = np.sum(dat_nd_gpu.asnumpy())
    end = time.time()
    np_copy_from_gpu_time = end - start
    start = time.time()
    for _ in range(N):
        np_ret = np.sum(dat_nd_cpu.asnumpy())
    end = time.time()
    np_copy_from_cpu_time = end - start
    start = time.time()
    for _ in range(N):
        nd_ret = nd.sum(dat_nd_gpu).asscalar()
    end = time.time()
    nd_gpu_time = end - start
    start = time.time()
    for _ in range(N):
        nd_ret = nd.sum(dat_nd_cpu).asscalar()
    end = time.time()
    nd_cpu_time = end - start
    print('sum, data_shape=%s, numpy from gpu=%g, numpy from cpu=%g, mxnet gpu=%g, mxnet cpu=%g' %(str(data_shape), np_copy_from_gpu_time, np_copy_from_cpu_time, nd_gpu_time, nd_cpu_time))

Result:

sum, data_shape=(16,), numpy from gpu=0.0132685, numpy from cpu=0.00352144, mxnet gpu=0.0388703, mxnet cpu=0.0236092
sum, data_shape=(64,), numpy from gpu=0.0118794, numpy from cpu=0.00273204, mxnet gpu=0.0307364, mxnet cpu=0.0268292
sum, data_shape=(256,), numpy from gpu=0.0102284, numpy from cpu=0.00267339, mxnet gpu=0.0215137, mxnet cpu=0.0267296
sum, data_shape=(1024,), numpy from gpu=0.0122139, numpy from cpu=0.00344729, mxnet gpu=0.025255, mxnet cpu=0.0139427

@piiswrong merged commit b8ae967 into apache:master on Feb 27, 2018.
@sxjscience deleted the acc_f1 branch on March 8, 2018.
rahul003 pushed a commit to rahul003/mxnet referencing this pull request on Jun 4, 2018, and zheng-da pushed a commit to zheng-da/incubator-mxnet referencing it on Jun 28, 2018, both with the commit messages:

* Accelerate the calculation of F1
* cache the mid results
* trigger CI