[Metric] Accelerate the calculation of F1 #9833
Conversation
python/mxnet/metric.py
Outdated
```python
                self.false_negatives += 1.
            else:
                self.true_negatives += 1.
        self.true_positives += ((pred_label == 1) * (label == 1)).sum()
```
You can cache computations such as `predicted_true = (pred_label == 1)` and reuse them later.
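A minimal sketch of what the vectorized update could look like with the comparisons cached; the names and signature here are illustrative, not the exact code in `python/mxnet/metric.py`:

```python
import numpy as np

def update_binary_stats(label, pred_label):
    """Accumulate confusion-matrix counts for binary F1 without a Python loop.

    `label` and `pred_label` are 1-D numpy arrays of 0/1 values.
    Returns (true_positives, false_positives, false_negatives, true_negatives).
    """
    # Cache the element-wise comparisons so each is computed only once.
    pred_true = (pred_label == 1)
    pred_false = ~pred_true
    label_true = (label == 1)
    label_false = ~label_true

    true_positives = (pred_true & label_true).sum()
    false_positives = (pred_true & label_false).sum()
    false_negatives = (pred_false & label_true).sum()
    true_negatives = (pred_false & label_false).sum()
    return true_positives, false_positives, false_negatives, true_negatives
```

Each count is computed once per batch and added to the running totals, replacing the per-sample branches shown in the diff above.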
After some investigation, I find using GPU is not faster because the batch sizes tested are rather small, i.e., 16, 64, 256, 1024. NDArray + GPU is not as fast as numpy in these cases.

```python
import mxnet as mx
import mxnet.ndarray as nd
import numpy as np
import time

# Warm up the GPU
for _ in range(10):
    a = nd.ones((100, 100), ctx=mx.gpu())
    b = a * 2
    b.asnumpy()

N = 100
# Test the speed
for data_shape in [(16,), (64,), (256,), (1024,)]:
    dat_npy = np.random.uniform(0, 1, data_shape)
    dat_nd_gpu = nd.array(dat_npy, ctx=mx.gpu())
    dat_nd_cpu = nd.array(dat_npy, ctx=mx.cpu())
    nd.waitall()

    start = time.time()
    for _ in range(N):
        np_ret = np.sum(dat_npy)
    end = time.time()
    np_time = end - start

    start = time.time()
    for _ in range(N):
        nd_ret = nd.sum(dat_nd_gpu).asscalar()
    end = time.time()
    nd_gpu_time = end - start

    start = time.time()
    for _ in range(N):
        nd_ret = nd.sum(dat_nd_cpu).asscalar()
    end = time.time()
    nd_cpu_time = end - start

    print('sum, data_shape=%s, numpy time=%g, mxnet gpu time=%g, mxnet cpu time=%g'
          % (str(data_shape), np_time, nd_gpu_time, nd_cpu_time))
```

Result:
Given that those are typical cases, would it make sense to always use CPU?
Yes, I think CPU should be used.
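A hedged sketch of what "always use CPU" could mean inside a metric's `update()`: copy predictions and labels to the host once and do all counting in numpy, whatever device produced them. The `to_numpy` helper below is made up for illustration:

```python
import numpy as np
import mxnet.ndarray as nd

def to_numpy(x):
    """Illustrative helper: bring an NDArray (CPU or GPU) onto the host as a numpy array."""
    return x.asnumpy() if isinstance(x, nd.NDArray) else np.asarray(x)

# Inside a metric update, regardless of which device produced the outputs:
# label_np = to_numpy(label)
# pred_np = to_numpy(pred).argmax(axis=-1)
# ...accumulate TP/FP/FN/TN with numpy as in the sketch above...
```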
There's no wait in between the for loops. That should be addressed before calling the results conclusive.
We have
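For context, `.asscalar()` converts a one-element NDArray to a Python scalar via a host copy and therefore waits for the pending computation, so the `asscalar()`-based loops above are already synchronized on every iteration. A minimal illustration:

```python
import mxnet as mx
import mxnet.ndarray as nd

a = nd.ones((1000, 1000), ctx=mx.cpu())
s = nd.sum(a)       # enqueued asynchronously on MXNet's engine
v = s.asscalar()    # copies to a Python float; this call blocks until the sum has finished
print(v)            # 1000000.0
```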
Ah, sorry, I read the results wrong. Looks like numpy sum is an order of magnitude faster than either nd CPU or GPU. What does it look like if we take into account asnumpy()?
Without `asscalar()`, synchronizing with `nd.waitall()` after each timed loop instead:

```python
import mxnet as mx
import mxnet.ndarray as nd
import numpy as np
import time

# Warm up the GPU
for _ in range(10):
    a = nd.ones((100, 100), ctx=mx.gpu())
    b = a * 2
    b.asnumpy()

N = 100
# Test the speed
for data_shape in [(16,), (64,), (256,), (1024,)]:
    dat_npy = np.random.uniform(0, 1, data_shape)
    dat_nd_gpu = nd.array(dat_npy, ctx=mx.gpu())
    dat_nd_cpu = nd.array(dat_npy, ctx=mx.cpu())
    nd.waitall()

    start = time.time()
    for _ in range(N):
        np_ret = np.sum(dat_npy)
    end = time.time()
    np_time = end - start

    start = time.time()
    for _ in range(N):
        nd_ret = nd.sum(dat_nd_gpu)
    nd.waitall()
    end = time.time()
    nd_gpu_time = end - start

    start = time.time()
    for _ in range(N):
        nd_ret = nd.sum(dat_nd_cpu)
    nd.waitall()
    end = time.time()
    nd_cpu_time = end - start

    print('sum, data_shape=%s, numpy time=%g, mxnet gpu time=%g, mxnet cpu time=%g'
          % (str(data_shape), np_time, nd_gpu_time, nd_cpu_time))
```

Result:
Our implementations are much slower than numpy if the inputs are small.
I've tested the case that calls `asnumpy()` first and then sums on the host with numpy:

```python
import mxnet as mx
import mxnet.ndarray as nd
import numpy as np
import time

# Warm up the GPU
for _ in range(10):
    a = nd.ones((100, 100), ctx=mx.gpu())
    b = a * 2
    b.asnumpy()

N = 100
# Test the speed
for data_shape in [(16,), (64,), (256,), (1024,)]:
    dat_npy = np.random.uniform(0, 1, data_shape)
    dat_nd_gpu = nd.array(dat_npy, ctx=mx.gpu())
    dat_nd_cpu = nd.array(dat_npy, ctx=mx.cpu())
    nd.waitall()

    start = time.time()
    for _ in range(N):
        np_ret = np.sum(dat_nd_gpu.asnumpy())
    end = time.time()
    np_copy_from_gpu_time = end - start

    start = time.time()
    for _ in range(N):
        np_ret = np.sum(dat_nd_cpu.asnumpy())
    end = time.time()
    np_copy_from_cpu_time = end - start

    start = time.time()
    for _ in range(N):
        nd_ret = nd.sum(dat_nd_gpu).asscalar()
    end = time.time()
    nd_gpu_time = end - start

    start = time.time()
    for _ in range(N):
        nd_ret = nd.sum(dat_nd_cpu).asscalar()
    end = time.time()
    nd_cpu_time = end - start

    print('sum, data_shape=%s, numpy from gpu=%g, numpy from cpu=%g, mxnet gpu=%g, mxnet cpu=%g'
          % (str(data_shape), np_copy_from_gpu_time, np_copy_from_cpu_time, nd_gpu_time, nd_cpu_time))
```

Result:
* Accelerate the calculation of F1
* cache the mid results
* trigger CI
Description
Accelerate the calculation of F1 by removing the for-loop. I find that using GPU will in fact make the code slower 😅.
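Schematically, the change replaces a per-sample Python loop with whole-array operations; a simplified before/after sketch (not the exact code in `python/mxnet/metric.py`), showing only the true/false positive counts since the other two counts follow the same pattern:

```python
import numpy as np

label = np.random.randint(0, 2, size=1024)
pred_label = np.random.randint(0, 2, size=1024)

# Before (schematic): one Python-level branch per sample.
tp_loop = fp_loop = 0
for y, p in zip(label, pred_label):
    if p == 1 and y == 1:
        tp_loop += 1
    elif p == 1 and y == 0:
        fp_loop += 1

# After (schematic): whole-array comparisons, summed once per batch.
tp_vec = ((pred_label == 1) & (label == 1)).sum()
fp_vec = ((pred_label == 1) & (label == 0)).sum()

# Both formulations count the same events, so the metric value is unchanged.
assert tp_loop == tp_vec and fp_loop == fp_vec
```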
Checklist
Essentials
make lint
Changes
Results of running the script in #9705
Before:
After: