
[BugFix] SSD fully supported on GPUs, updated deploy_ssd tutorial #2510

Merged: 12 commits into apache:master on Jan 30, 2019

Conversation

Laurawly (Contributor)

Thanks to @vinx13's PR #2420, argsort now works on GPUs.
Tested the full SSD pipeline on an NVIDIA K80c and Intel HD Graphics. Performance improved compared with the earlier heterogeneous (partly-on-CPU) results.
Please review @masahi @kevinthesun @zhiics

masahi self-assigned this Jan 25, 2019
#ctx = tvm.gpu(0)
# Use these commented settings to build for opencl.
#target = 'opencl'
#ctx = tvm.gpu(0)
Member

If I remember correctly, for OpenCL it should be tvm.opencl(0) or tvm.cl(0), shouldn't it?

Laurawly (author)

Yes, sorry, I forgot to change it.
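
For reference, a corrected version of those commented tutorial settings might look like the following (a sketch, not necessarily the exact final tutorial code; per the comment above, tvm.opencl(0) and tvm.cl(0) name the same OpenCL context):

import tvm

# Default: build and run on CUDA.
target = 'cuda'
ctx = tvm.gpu(0)

# Use these commented settings to build for OpenCL instead.
# target = 'opencl'
# ctx = tvm.cl(0)   # equivalently, tvm.opencl(0)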

masahi (Member) commented Jan 25, 2019

@Laurawly @vinx13 Can we share the sorting IR between this PR and @vinx13's PR #2420? They look identical.

Laurawly (author)

I agree with putting sort in a common file, and we can add a unit test for it as well.

with ib.for_range(0, batch, for_type="unroll") as b:
    start = b * num_anchors
    with ib.if_scope(tid < num_anchors):
        p_out[start + tid] = tid
Member

Seems storage_sync is missing here; I will update my PR.

Laurawly (author)

@vinx13 Would you like to split argsort out into a separate file so that we can share it? I can add a unit test for it if needed.

Member

@Laurawly What is needed in SSD? It seems you changed num_bbox in my PR to p_index[0]; why is only the first element of p_index used?

Contributor

Maybe we can make argsort a normal topi op? I'll add a CPU implementation later.

Laurawly (author) commented Jan 28, 2019

@vinx13 p_index is the valid_count variable, a 1-D array produced by the multibox operators. So instead of sorting all data.shape[1] entries, we only need to sort the first p_index[0] of them.
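
A small NumPy sketch of that idea (illustrative only, not the actual IR): only the first valid_count = p_index[0] scores get sorted, and the remaining output slots keep their own index, matching the p_out[start + tid] = tid initialization quoted earlier.

import numpy as np

def argsort_first_k(scores, valid_count):
    # Sort only the first `valid_count` scores (descending); leave the
    # rest of the output as the identity mapping.
    order = np.argsort(-scores[:valid_count])
    rest = np.arange(valid_count, scores.shape[0])
    return np.concatenate([order, rest])

scores = np.array([0.2, 0.9, 0.5, 0.1, 0.0], dtype="float32")
print(argsort_first_k(scores, valid_count=3))   # [1 2 0 3 4]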

Member

@Laurawly Shouldn't it be p_index[batch_id]? Are you assuming batch = 1?

Laurawly (author)

@vinx13 p_index only has one dimension, so it should be p_index[0].

Member

@kevinthesun @Laurawly The difficulty with sharing argsort (or extracting it as a topi operator) is that we want sort_num to be either a tvm.Tensor or a constant array, but we can't use a tvm.Expr to subscript a Python array. Do you have any ideas?
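
One way to picture the difficulty (an illustrative sketch, not code from either PR; sort_num_at is a hypothetical helper): dispatching on the type of sort_num works for the tensor case, but the constant-array branch can only be indexed by a plain Python int, which is exactly the limitation described above.

import tvm

def sort_num_at(sort_num, batch_idx):
    # sort_num may be a tvm.Tensor (valid counts computed on the device)
    # or a plain Python list of constants.
    if isinstance(sort_num, tvm.tensor.Tensor):
        return sort_num[batch_idx]        # works even when batch_idx is a tvm.Expr
    # A Python list cannot be indexed by a tvm.Expr, so this branch is only
    # usable when batch_idx is a concrete Python int.
    return tvm.const(sort_num[batch_idx], "int32")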

yzhliu mentioned this pull request Jan 28, 2019
with ib.else_scope():
    start = sizes[tid-1]
    p_out[base_idx + k * axis_mul_after] = tvm.if_then_else(
        k < p_index[tid], index_new[k+start], k)
Member

@Laurawly Still confused: if batch > 1, it should enter this if branch (since axis_mul_before * axis_mul_after > 1). Does p_index[tid] here mean that each batch has a different valid count?

Laurawly (author)

@vinx13 From https://github.com/dmlc/tvm/blob/master/topi/python/topi/cuda/nms.py#L368, axis is always 1, so axis_mul_before and axis_mul_after are both 1.

Member

@Laurawly Since ndim == 2 and axis == 1, the actual loop is:

for i in range(0, 2):
  if i < 1:
     axis_mul_before *= data.shape[i]

I assume that axis_mul_after == 1 and axis_mul_before == data.shape[0], which is the batch size, right?
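
For readers, the surrounding computation being paraphrased looks roughly like this (names follow the snippet above; data_shape is assumed to be (batch, num_anchors)):

# axis == 1 for the 2-D (batch, num_anchors) case discussed here
axis_mul_before = 1
axis_mul_after = 1
for i in range(len(data_shape)):
    if i < axis:
        axis_mul_before *= data_shape[i]   # collects dims before `axis` -> batch
    elif i > axis:
        axis_mul_after *= data_shape[i]    # collects dims after `axis` -> stays 1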

Laurawly (author) commented Jan 28, 2019

@vinx13 Yeah, that's right. I see what you mean: each batch can have a different valid count when batch_size > 1, so I shouldn't have assumed batch_size = 1. I just pushed the changes.

vinx13 (Member) commented Jan 28, 2019

@Laurawly By the way, have you checked for data races in the NMS IR? It seems __syncthreads and a global barrier are needed on CUDA (maybe we should rewrite the IR to avoid the global barrier). I sometimes get incorrect NMS results in my PR.

Laurawly (author)

@vinx13 Does the conflict happen in argsort_ir?

vinx13 (Member) commented Jan 29, 2019

@Laurawly The conflict happens in nms_ir. I replaced blockIdx with vthread and added storage_sync, and it worked, but my current solution is not efficient.

Laurawly (author)

@vinx13 I don't see conflicts in my nms_ir using blockIdx.x, but I'll double-check. Why do you want to replace blockIdx.x with vthread?

vinx13 (Member) commented Jan 29, 2019

@Laurawly If data written by other threads is needed (probably at this line: if_scope(p_out[b * num_anchors * 6 + offset_l] >= 0)), there may be a data race due to the lack of synchronization.

Laurawly (author)

@vinx13 There's no data conflict for p_out in SSD because nms_topk = -1. For the line you mentioned, the writing and reading of p_out happen in the same thread block (if_scope(p_out[b * num_anchors * 6 + offset_i] >= 0) and p_out[b * num_anchors * 6 + offset_i] = -1.0).

vinx13 (Member) commented Jan 30, 2019

@Laurawly The write p_out[base_idx + offset_i] = -1.0 happens in thread i, while the read of p_out[base_idx + offset_l] in the condition (p_out[base_idx + offset_l] >= 0 and p_out[base_idx + offset_l] == p_out[b * num_anchors * 6 + offset_i]) happens in all threads, and there is no synchronization after each iteration of for_range(0, p_valid_count[b]) as l. Is there a conflict in this case?

Laurawly (author) commented Jan 30, 2019

@vinx13 No, because there's a condition that i > l, and because you iterate over l sequentially, the write to p_out[base_idx + offset_l] has already finished by the time you read it. For example, when l == 0: thread 0 cannot write to p_out[base_idx + offset_0], so it returns; threads 1 to n need to read p_out[base_idx + offset_0], but we already know thread 0 won't write to it. When l == 1: threads 0 and 1 won't write to p_out; thread 2 may need to read p_out[base_idx + offset_1] before thread 1 finishes writing to it in iteration l == 0, but in that case p_out[base_idx + offset_1] == -1 and p_out[base_idx + offset_2] != -1, because thread 2 finished much earlier than thread 1. So p_out[base_idx + offset_l] == p_out[base_idx + offset_i] won't be true either way.
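
A sequential Python model of the suppression loop may make the argument easier to follow (purely illustrative, not the actual IR; iou is a placeholder overlap function, and suppressed[i] plays the role of writing -1.0 into p_out for box i). The properties the argument relies on are visible here: index i is only ever suppressed by an earlier, still-valid index l < i, and the outer loop over l advances in order.

def nms_keep(boxes, valid_count, iou, thresh):
    # Sequential reference semantics for the suppression loop.
    suppressed = [False] * valid_count
    for l in range(valid_count):                 # outer loop over l, in order
        if suppressed[l]:
            continue                             # a suppressed box suppresses nothing
        for i in range(l + 1, valid_count):      # only i > l is ever considered
            if not suppressed[i] and iou(boxes[l], boxes[i]) > thresh:
                suppressed[i] = True             # suppress box i
    return [b for i, b in enumerate(boxes[:valid_count]) if not suppressed[i]]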

vinx13 (Member) commented Jan 30, 2019

@Laurawly I see, thanks for your clarification

masahi merged commit 48c16a1 into apache:master on Jan 30, 2019
masahi (Member) commented Jan 30, 2019

Thanks @Laurawly @vinx13 @kevinthesun @zhiics, this is merged.

merrymercy pushed a commit to merrymercy/tvm that referenced this pull request Feb 18, 2019
…ache#2510)

* nms fixed for gpu, tested on cuda and opencl devices, ssd now can run fully on the gpu

* sort updated to use virtual thread

* typo fixed

* fix lint

* fix lint

* add support when batch_size > 1

* intel graphics conv2d bugs fixed for inception_v3

* intel conv2d api updated, nn input size 4 condition added

* review addressed

* move conv_tags to attributes

* opencl ctx fixed

* nms_ir index simplified
wweic pushed a commit to neo-ai/tvm that referenced this pull request Feb 20, 2019