
[Bug][Metaschedule] Tuning trial hanging after one task #12330

Closed
slyubomirsky opened this issue Aug 5, 2022 · 2 comments · Fixed by #13246
Assignees
Labels
needs-triage PRs or issues that need to be investigated by maintainers to find the right assignees to address it type: bug

Comments

@slyubomirsky
Contributor

slyubomirsky commented Aug 5, 2022

I encountered this when trying to run this script over RPC on machines with V100s. Though it was done using Relax, @zxybazh says he thinks it can probably be triggered on mainline as well.

I ran ResNet-50 on V100 with an input shape of (1, 3, 224, 224), using 5 tuning trials. Tuning started hanging on the first tuning task, fused_conv2d_add_relu. It appeared that failures were encountered during the task.

Output from the host:

  input_name: input0
  input_shape: [1, 3, 224, 224]
  input_dtype: float32
/home/ubuntu/tvm-runtime/python/tvm/driver/build_module.py:267: UserWarning: target_host parameter is going to be deprecated. Please pass in tvm.target.Target(target, host=target_host) instead.
  warnings.warn(
INFO:tvm.meta_schedule.runner.rpc_runner:RPCRunner: max_workers = 2
INFO:tvm.meta_schedule.tune:Working directory: /home/ubuntu/dump/
2022-08-05 12:13:55.897 INFO Logging directory: /home/ubuntu/dump/logs
2022-08-05 12:13:55.897 INFO Working directory: /home/ubuntu/dump/
2022-08-05 12:13:55.898 INFO Creating JSONDatabase. Workload at: /home/ubuntu/dump/database_workload.json. Tuning records at: /home/ubuntu/dump/database_tuning_record.json
2022-08-05 12:13:56.063 INFO LocalBuilder: max_workers = 24
2022-08-05 12:13:56.388 INFO Initializing Task #0: "layout_transform"
2022-08-05 12:13:56.459 INFO Initializing Task #1: "fused_conv2d_add_relu"
2022-08-05 12:13:56.726 INFO Initializing Task #2: "max_pool2d"
2022-08-05 12:13:56.866 INFO Initializing Task #3: "fused_conv2d1_add1_relu1"
2022-08-05 12:13:57.114 INFO Initializing Task #4: "fused_contrib_conv2d_winograd_without_weight_transform_add1_relu1"
2022-08-05 12:13:58.024 INFO Initializing Task #5: "fused_conv2d2_add2"
2022-08-05 12:13:58.231 INFO Initializing Task #6: "fused_conv2d2_add2_add3_relu2"
2022-08-05 12:13:58.532 INFO Initializing Task #7: "fused_conv2d3_add1_relu1"
2022-08-05 12:13:58.784 INFO Initializing Task #8: "fused_conv2d4_add4_relu3"
2022-08-05 12:13:59.033 INFO Initializing Task #9: "fused_conv2d5_add5_relu4"
2022-08-05 12:13:59.301 INFO Initializing Task #10: "fused_conv2d7_add6"
2022-08-05 12:13:59.518 INFO Initializing Task #11: "fused_conv2d6_add6_add7_relu5"
2022-08-05 12:13:59.823 INFO Initializing Task #12: "fused_conv2d8_add5_relu4"
2022-08-05 12:14:00.077 INFO Initializing Task #13: "fused_contrib_conv2d_winograd_without_weight_transform1_add5_relu4"
2022-08-05 12:14:00.771 INFO Initializing Task #14: "fused_conv2d9_add8_relu6"
2022-08-05 12:14:01.022 INFO Initializing Task #15: "fused_conv2d10_add9_relu7"
2022-08-05 12:14:01.290 INFO Initializing Task #16: "fused_conv2d12_add10"
2022-08-05 12:14:01.504 INFO Initializing Task #17: "fused_conv2d11_add10_add11_relu8"
2022-08-05 12:14:01.806 INFO Initializing Task #18: "fused_conv2d13_add9_relu7"
2022-08-05 12:14:02.057 INFO Initializing Task #19: "fused_contrib_conv2d_winograd_without_weight_transform2_add9_relu7"
2022-08-05 12:14:02.753 INFO Initializing Task #20: "fused_conv2d14_add12_relu9"
2022-08-05 12:14:03.003 INFO Initializing Task #21: "fused_conv2d15_add13_relu10"
2022-08-05 12:14:03.272 INFO Initializing Task #22: "fused_conv2d17_add14"
2022-08-05 12:14:03.486 INFO Initializing Task #23: "fused_conv2d16_add14_add15_relu11"
2022-08-05 12:14:03.788 INFO Initializing Task #24: "fused_conv2d18_add13_relu10"
2022-08-05 12:14:04.039 INFO Initializing Task #25: "fused_contrib_conv2d_winograd_without_weight_transform3_add13_relu10"
2022-08-05 12:14:04.739 INFO Initializing Task #26: "adaptive_avg_pool2d"
2022-08-05 12:14:04.865 INFO Initializing Task #27: "fused_layout_transform1_reshape_squeeze"
2022-08-05 12:14:05.006 INFO Initializing Task #28: "fused_dense_add16"
2022-08-05 12:14:05.113 INFO 
 ID |                                                                 Name |      FLOP | Weight | Speed (GFLOPS) | Latency (us) | Weighted Latency (us) | Trials | Terminated 
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  0 |                                                     layout_transform |         1 |      1 |            N/A |          N/A |                   N/A |      0 |            
  1 |                                                fused_conv2d_add_relu | 237633536 |      1 |            N/A |          N/A |                   N/A |      0 |            
  2 |                                                           max_pool2d |   1806336 |      1 |            N/A |          N/A |                   N/A |      0 |            
  3 |                                             fused_conv2d1_add1_relu1 |  26091520 |      1 |            N/A |          N/A |                   N/A |      0 |            
  4 |    fused_contrib_conv2d_winograd_without_weight_transform_add1_relu1 | 128651264 |      3 |            N/A |          N/A |                   N/A |      0 |            
  5 |                                                   fused_conv2d2_add2 | 103563264 |      1 |            N/A |          N/A |                   N/A |      0 |            
  6 |                                        fused_conv2d2_add2_add3_relu2 | 105168896 |      3 |            N/A |          N/A |                   N/A |      0 |            
  7 |                                             fused_conv2d3_add1_relu1 | 103161856 |      2 |            N/A |          N/A |                   N/A |      0 |            
  8 |                                             fused_conv2d4_add4_relu3 | 206323712 |      1 |            N/A |          N/A |                   N/A |      0 |            
  9 |                                             fused_conv2d5_add5_relu4 | 231411712 |      1 |            N/A |          N/A |                   N/A |      0 |            
 10 |                                                   fused_conv2d7_add6 | 205922304 |      1 |            N/A |          N/A |                   N/A |      0 |            
 11 |                                        fused_conv2d6_add6_add7_relu5 | 103964672 |      4 |            N/A |          N/A |                   N/A |      0 |            
 12 |                                             fused_conv2d8_add5_relu4 | 102961152 |      3 |            N/A |          N/A |                   N/A |      0 |            
 13 |   fused_contrib_conv2d_winograd_without_weight_transform1_add5_relu4 | 127045632 |      3 |            N/A |          N/A |                   N/A |      0 |            
 14 |                                             fused_conv2d9_add8_relu6 | 205922304 |      1 |            N/A |          N/A |                   N/A |      0 |            
 15 |                                            fused_conv2d10_add9_relu7 | 231311360 |      1 |            N/A |          N/A |                   N/A |      0 |            
 16 |                                                 fused_conv2d12_add10 | 205721600 |      1 |            N/A |          N/A |                   N/A |      0 |            
 17 |                                     fused_conv2d11_add10_add11_relu8 | 103362560 |      6 |            N/A |          N/A |                   N/A |      0 |            
 18 |                                            fused_conv2d13_add9_relu7 | 102860800 |      5 |            N/A |          N/A |                   N/A |      0 |            
 19 |   fused_contrib_conv2d_winograd_without_weight_transform2_add9_relu7 | 114903040 |      5 |            N/A |          N/A |                   N/A |      0 |            
 20 |                                           fused_conv2d14_add12_relu9 | 205721600 |      1 |            N/A |          N/A |                   N/A |      0 |            
 21 |                                          fused_conv2d15_add13_relu10 | 231261184 |      1 |            N/A |          N/A |                   N/A |      0 |            
 22 |                                                 fused_conv2d17_add14 | 205621248 |      1 |            N/A |          N/A |                   N/A |      0 |            
 23 |                                    fused_conv2d16_add14_add15_relu11 | 103061504 |      3 |            N/A |          N/A |                   N/A |      0 |            
 24 |                                          fused_conv2d18_add13_relu10 | 102810624 |      2 |            N/A |          N/A |                   N/A |      0 |            
 25 | fused_contrib_conv2d_winograd_without_weight_transform3_add13_relu10 | 142132224 |      2 |            N/A |          N/A |                   N/A |      0 |            
 26 |                                                  adaptive_avg_pool2d |    102400 |      1 |            N/A |          N/A |                   N/A |      0 |            
 27 |                              fused_layout_transform1_reshape_squeeze |         1 |      1 |            N/A |          N/A |                   N/A |      0 |            
 28 |                                                    fused_dense_add16 |   4097000 |      1 |            N/A |          N/A |                   N/A |      0 |            
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Total trials: 0
Total latency (us): 0

2022-08-05 12:14:05.114 INFO Scheduler picks Task #0: "layout_transform"
2022-08-05 12:14:06.380 INFO Sending 6 sample(s) to builder
2022-08-05 12:14:06.713 INFO Sending 6 sample(s) to runner
2022-08-05 12:14:06.713 INFO Scheduler picks Task #1: "fused_conv2d_add_relu"

The tail of the log for task 1 (excerpted, as it goes on for a long time):

[etc]
2022-08-05 12:36:14.188 INFO Sample-Init-Population summary:
Postproc #0 [meta_schedule.DisallowDynamicLoop(0x423d1d8)]: 0 failure(s)
Postproc #1 [meta_schedule.RewriteCooperativeFetch(0xf2ad228)]: 0 failure(s)
Postproc #2 [meta_schedule.RewriteUnboundBlock(0xf2ad258)]: 0 failure(s)
Postproc #3 [meta_schedule.RewriteParallelVectorizeUnroll(0x3d6c8e8)]: 0 failure(s)
Postproc #4 [meta_schedule.RewriteReductionBlock(0x4d449e8)]: 0 failure(s)
Postproc #5 [meta_schedule.VerifyGPUCode(0x45933f8)]: 1685504 failure(s)
2022-08-05 12:36:15.803 INFO Sample-Init-Population summary:
Postproc #0 [meta_schedule.DisallowDynamicLoop(0x423d1d8)]: 0 failure(s)
Postproc #1 [meta_schedule.RewriteCooperativeFetch(0xf2ad228)]: 0 failure(s)
Postproc #2 [meta_schedule.RewriteUnboundBlock(0xf2ad258)]: 0 failure(s)
Postproc #3 [meta_schedule.RewriteParallelVectorizeUnroll(0x3d6c8e8)]: 0 failure(s)
Postproc #4 [meta_schedule.RewriteReductionBlock(0x4d449e8)]: 0 failure(s)
Postproc #5 [meta_schedule.VerifyGPUCode(0x45933f8)]: 1687552 failure(s)
2022-08-05 12:36:17.411 INFO Sample-Init-Population summary:
Postproc #0 [meta_schedule.DisallowDynamicLoop(0x423d1d8)]: 0 failure(s)
Postproc #1 [meta_schedule.RewriteCooperativeFetch(0xf2ad228)]: 0 failure(s)
Postproc #2 [meta_schedule.RewriteUnboundBlock(0xf2ad258)]: 0 failure(s)
Postproc #3 [meta_schedule.RewriteParallelVectorizeUnroll(0x3d6c8e8)]: 0 failure(s)
Postproc #4 [meta_schedule.RewriteReductionBlock(0x4d449e8)]: 0 failure(s)
Postproc #5 [meta_schedule.VerifyGPUCode(0x45933f8)]: 1689600 failure(s)
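The ever-growing failure count for the VerifyGPUCode postproc above suggests why tuning hangs: initial population sampling keeps resampling until it collects enough valid candidates, so a postproc that rejects (nearly) every sample never lets the loop finish. A minimal, self-contained sketch of that behavior (function and parameter names here are illustrative, not TVM's actual internals; the real loop has no `max_attempts` cap, which is the bug):

```python
import random

def verify_gpu_code(candidate):
    # Stand-in for Postproc #5: for this workload it rejects every
    # sampled schedule, so no candidate ever survives postprocessing.
    return False

def sample_init_population(population_size, max_attempts):
    """Collect candidates that pass the postproc. Without a cap on
    attempts (as in the buggy behavior), an always-failing postproc
    makes this loop forever; the cap here exists only so the sketch
    terminates and the failure count can be observed."""
    population, failures, attempts = [], 0, 0
    while len(population) < population_size and attempts < max_attempts:
        candidate = random.random()  # stand-in for a sampled schedule
        if verify_gpu_code(candidate):
            population.append(candidate)
        else:
            failures += 1
        attempts += 1
    return population, failures

population, failures = sample_init_population(population_size=512,
                                              max_attempts=10_000)
print(len(population), failures)  # 0 10000
```

Every attempt fails, so the population never fills and the failure counter climbs without bound, matching the log above.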

cc @Hzfengsy @junrushao @junrushao1994

@zxybazh zxybazh self-assigned this Aug 5, 2022
@slyubomirsky slyubomirsky changed the title [Bug][Metaschedule] Tuning trial hanging after [Bug][Metaschedule] Tuning trial hanging after one task Aug 11, 2022
@areusch areusch added the needs-triage PRs or issues that need to be investigated by maintainers to find the right assignees to address it label Oct 19, 2022
@junrushao
Member

The number of failures in the stats is quite abnormal. Is the task fused_conv2d_add_relu?

@zxybazh
Member

zxybazh commented Oct 19, 2022

Yes, I think the task is fused_conv2d_add_relu.

xinetzone pushed a commit to daobook/tvm that referenced this issue Nov 10, 2022
This PR introduces a new argument for EvolutionarySearch that limits the failures (defined as rounds of no new generated candidate) in the `SampleInitPopulation` stage. In this way we can avoid the task to be hanging forever in special cases, e.g., some postproc always fails. This should fix apache#12330.
xinetzone pushed a commit to daobook/tvm that referenced this issue Nov 25, 2022
This PR introduces a new argument for EvolutionarySearch that limits the failures (defined as rounds of no new generated candidate) in the `SampleInitPopulation` stage. In this way we can avoid the task to be hanging forever in special cases, e.g., some postproc always fails. This should fix apache#12330.
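The fix described in the commits above can be sketched as a bounded retry loop: count consecutive sampling rounds that produce no new candidate and bail out once a configurable limit is reached, rather than looping forever. A hypothetical sketch (names such as `max_fail_count` and `sample_round` are illustrative; see the actual PR for TVM's implementation):

```python
def sample_init_population(sample_round, population_size, max_fail_count):
    """Build an initial population, but stop early after max_fail_count
    consecutive rounds that yield no new candidate."""
    population = []
    fail_count = 0
    while len(population) < population_size:
        # One round of sampling plus postprocessing; returns the
        # candidates that survived all postprocs (possibly none).
        new_candidates = sample_round()
        if not new_candidates:
            fail_count += 1
            if fail_count >= max_fail_count:
                break  # give up on this task instead of hanging
        else:
            fail_count = 0
            population.extend(new_candidates)
    return population

# A round in which every candidate fails postprocessing:
always_fail = lambda: []
print(len(sample_init_population(always_fail, 512, max_fail_count=5)))  # 0
```

With the limit in place, a task whose postprocs always fail (as in this issue) terminates with an empty population and the scheduler can move on instead of hanging.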