[Ansor][AutoTVM v2.0] Phase 0: Ansor minimum system for auto schedule generating #5962
Conversation
@jcf94 and @merrymercy thanks for all the hard work! Can I request that we add another unresolved issue? In my opinion the written English parts, i.e. comments, explanations, etc., could still use some improvement in both content and grammar, and I would propose that in general we do at least one or two rounds of full documentation polish (comments, examples, tests, tutorials, etc.) before we officially release a feature (in this case, when all of Ansor has landed in master). We tried to do this with Relay, but I think we should continue to strive to do a better job with new features like this.
Does that mean we should just rename ThreadPool to ParallelFor for now? Class renamed, and added some comments on the member functions.
Thanks! That would be of great help since I'm not a native speaker. The documentation does need to be polished.
src/auto_schedule/utils.h
Outdated
 * TODO(merrymercy): Move this to `src/support/parallel_for`
 */
- class ThreadPool {
+ class ParallelFor {
@jcf94 Sorry, I didn't mean to say that we should rename ThreadPool to ParallelFor; instead, we should hide the use of the thread pool behind a parallel_for API, in a similar style to https://docs.microsoft.com/en-us/cpp/parallel/concrt/parallel-algorithms?view=vs-2019#parallel_for
Ok, I get it. However, after checking the current code, I found that we have actually already removed all uses of ThreadPool in this minimum system. I didn't realize that before.
src/auto_schedule/utils.cc
Outdated
  return *pool;
}

void parallel_for(int start, int end, std::function<void(int index)> f, int stride)
@tqchen Added a temporary implementation of parallel_for here.
Thanks @jcf94, let me try to elaborate further. To simplify the abstraction, we should:
- Add src/support/parallel_for.h
- Move the thread pool into parallel_for.cc as an implementation detail, and remove thread_pool from utils.h
- It is unclear whether a thread pool is needed to implement parallel_for at all; it is quite possible that we can just launch n std::thread instances (std::thread is quite lightweight in C++)
- Use parallel_for for all necessary use cases of the thread pool

Also consider removing the stride argument, or making it optional, since stride is not used.
@tqchen Ok, I understand that (the stride argument already defaults to 1 in utils.h), and it's fine to further clean up this code.
I'm just confused about the "does not have to change now" above. :)
Also, the ThreadPool is actually never used in the current code base...
Yes, we should just remove the thread pool
To avoid extra review effort, I removed ThreadPool from the current code base. cc @tqchen
@@ -0,0 +1,34 @@
# Licensed to the Apache Software Foundation (ASF) under one
should this folder be named auto_scheduler?
The namespace is auto_schedule
I think @MarisaKirisame means the namespace should be a noun. So auto_scheduler is better.
That also works. If we all agree, we can send a follow-up PR for it.
Sent new PR for namespace renaming: #6059
I do appreciate and support the proposal. Let's move forward with the feature upstreaming process, and after the major features have been merged into master we can work together to refine the documentation.
I have been looking over this PR a bit more, and it seems like a lot of review comments get dropped when a file is updated. This is no one's fault but a flaw in GitHub's review design, and I think it just means we should be careful in the upcoming Ansor reviews - just because code is merged doesn't mean it is at top quality, and we should continue going over it.
…generating (apache#5962) * Code migration Start (#1) * Init commit: Code migration Start * Add loop_state.cc/h * Add ComputeDAG basic test * Split transform_step out & Update more UTs (#3) * Split transform_step out * Update GetProducers & GetConsumers * Update UTs * Add UT for CacheReadWrite & Some bug fix * Add search_task, measure and serialization (#4) * Add FollowSplit & FollowFusedSplit tests * Update dag.InferBound & its UT * Add search_task, measure and serialization * Update Serialization UT * Add MetaTileRewritePolicy (#5) * Add feature * Add cost_model, meta_tile_rewrite_policy * Add MetaTileRewritePolicy basic UT * Basic Python API for State (#6) * Add Basic Python API for State * Add UTs for State * Add Python API: Measure & Task (#7) * Update the return value of state operation * Add task * Copy measure.py & utils.py * Fix LocalBuilder * Fix LocalRunner * Add ansor.auto_schedule() API; First AutoSchedule working version(#8) * Add basic Python support for ansor.auto_schedule * Update AutoSchedule API * Bug fix for get the attach point of a fused iter * Update UT after infer bug fix * Bug fix & Add python serialization API (#10) * Delete C++ UT hack since Python is ready * Add ndarray.non_empty * Update Serialization python API * Improve code style, python wrapper and test cases (#11) * Update c++ code style and unit test * Update python State wrapper and test cases * fix unit tests * Add RPCRunner & OpenCL/CUDA test (#12) * Add RPCRunner & OpenCL search test * Add CUDA search test * Add RPCRunner test * rebase to upstream/master * Add Ansor basic tutorial (#13) * Add basic tutorial * migrate feature extraction (#14) * Add XGBModel & RPCRunnerWarpper (#15) * Add XGBModel & RPCRunnerWarpper * Revert "Add Parallel Granularity Mutation" * Migrate workload_registry.py (#16) * add workload registry * update * update * add task scheduler (#17) * Add conv2d cuda tutorial with workload registry (#18) * add tune_test.py (the old tune_wkl.py) (#19) * add tune_test.py (the old tune_wkl.py) * update * fix measure * fix for gpu * Code refine for tune_test.py & Add a pre load callback (#20) * Bug fix for tutorials * Add PreLoadMeasuredStates * Add search_callback support for task tuner * Code refine for tune_test.py * Update * Update * Update * Update * Bug fix * Add python custom sketch rule (#21) * Add custom sketch rule * Bug fix * Ansor Relay Integration (without layout rewrite) (#22) * relay integration * Add tune_op_subgraph.py & Some code clean for tune_network.py (#23) * Add single op tune scripts * Add tune subgraph support * Merge all op & all subgraph to one file * Rename file * add explicit_unroll_max_extent (#25) * Add Index simplification & API update (#26) * Add vectorized cooperative_fetching test * Update math simplify for vectorized CF * File rename * Update tune_network * API update * Update PreLoadMeasuredStates & Some bug fix (#27) * Add a threading wrapper to fix the test bug * Set default TVM_USE_AUTO_SCHEDULER to false * Update PreLoadMeasuredStates callback * Add tensorize step for loop_state (#31) * Add tensorize step * State python api update (#33) * Start to update api * Add compute_dag to state * API update * kernel layout rewrite (#28) * kernel layout rewrite * remove some hacks * add defuse_ops pass and move kernel_layout_rewrite pass after fuse_ops pass * set TVM_RELAY_DISABLE_BUILD_CACHE for task extraction and prepare_layout_rewrite * [cache flush] port cache flush to ansor (#32) * Improve relay integration (#34) * tmp checkpoint * Improve relay integration 
* Improve relay integration * Fix xgb error & Simplify dispatcher (#35) * Rename "MetaTileRewritePolicy" to "SketchPolicy". (#36) * Rename "MetaTileRewritePolicy" to "SketchPolicy". * Add a new class for auto_unroll_max_step, storage_offset in StageNode * fix tune_op_subgraph.py * rebase * Migrate all node::make to noderef's construct function (#37) * Start to move xxxnode::make to noderef() * Update * Update * Finish transform_step * Finish comute dag & auto schedule * Update * Update * Update * Update * Update * Code refine * Code refine * Code refine * Update * Update * Some lint fix & Recover the double constructor of tvm::PrimExpr (#39) * lint fix * clang-format-fix * pylint fix * Update * Recover the double constructor of tvm::PrimExpr * Fix pylint * pylint fix * pylint fix * Add MutateComputeLocation and MutateParallel in evolutionary search (#40) * Add MutateComputeLocation and MutateParallel in evolutionary search * fix lint * Improve loop state python API (stage_tensors -> stage_ops) (#41) * improve loop state python API (stage_tensors -> stage_ops) * fix * ComputeDAG bug fix & Add Custom TensorCore Matmul Example (#42) * Bug Fix * Sample example of Custom TensorCore Matmul * Rever Commits, Start to build minimum Ansor system * Code clean for minimum Ansor system * Bug fix & Delete AccessAnalyzer * Delete attachmap & Code clean * Doc update Update statenode::stages from vector to Array * Headfile update & Python doc update * clang-format fix * pylint fix * Update * Doc update * Update * Bug fix after code merge to the new master * clang-format fix * Update * Update * Update std::vector to Array; Update verbosity setting; Some commemts addressed * std::vector->Array & std::string->String * Add init_state to ComputeDAG * Update * Update some unordered_map to Map * clang-format fix * Comments addressed Delete ReplayAndInferBound Delete ReplaySteps & InferBoundCommon * Lint fix * Update * Update * Update * Update * Update * Update * Update * Update * Update * Rename ansor namespace to auto_schedule * Update * Rename ThreadPool to ParallelFor * Add parallel_for * Remove ThreadPool * Update python/tvm/auto_schedule/auto_schedule.py * trigger CI Co-authored-by: Lianmin Zheng <lianminzheng@gmail.com> Co-authored-by: Minmin Sun (孙敏敏) <minmin.smm@alibaba-inc.com> Co-authored-by: Zhao Wu <zhaowu@apache.org>
Hi all,
The last PR of Ansor #5883 is not clear enough for reviewers to fully understand our design. After some discussion, we changed our upstream plan to propose a minimal version of Ansor which contains a small but complete framework, so others can get a better understanding of the whole structure of Ansor.
In [RFC] Ansor: An Auto-scheduler for TVM (AutoTVM v2.0), we introduced the Ansor auto-scheduler, and we reached an agreement that Ansor should eventually replace AutoTVM.
For most existing templates, the current Ansor can already replace them directly, with better performance and less tuning time.
For other special templates (low-precision, sparse), the plan is to introduce search space customization and gradually rewrite them with Ansor's new API.
This PR introduces a self-contained minimum version of Ansor with most of the skeleton in place. It includes the interfaces of the core data structures and an empty search policy that does nothing. More advanced search policies and cost models will come in the next few PRs.
Infrastructure: A Sketch IR for Schedule Searching
Different from AutoTVM, whose tuning spaces are composed of predefined parameters, Ansor constructs a search space by manipulating the loop structures of the given compute DAG.
To enable flexible manipulation of the loop structures, we implemented a lightweight loop structure IR (Intermediate Representation) based on the original TVM IR but specifically for schedule search. The IR is composed of the state and action, which are defined as follows:
State: A state of the schedule search is the loop structure defined by a schedule (i.e., the TVM IR created by tvm.lower). See LoopState in the Key Data Structures for details.
Action: An action is composed of one or more schedule primitives (e.g., split, reorder, fuse) that manipulate a state. See TransformStep in the Key Data Structures for details.
We build a new Sketch IR on top of the existing TVM IR, rather than manipulating TVM IR directly, because the search only needs a lightweight preview of the loop structure that can be copied and modified cheaply during the search. After the search is done, we lower the result to TVM IR with TVM schedule primitives.
Key Data Structures
To help build an overview of Ansor, the following describes the class relations of the important Ansor data structures (the class-relation diagram is omitted here):
ComputeDAG: The compute declaration graph and its related analysis tools.
Related source files: src/ansor/compute_dag.*, python/tvm/ansor/compute_dag.py.
Ansor takes a compute declaration, which can be a single operator or a subgraph, described by tvm.compute as input and converts it into a ComputeDAG. The ComputeDAG implementation includes a set of analyses, such as the total number of float operations, the consumer/producer relations of each operation stage, and whether a stage should be tiled or compute-inlined (some of these analyses will be included in follow-up PRs). These analyses help the search policy make decisions during the schedule search.
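As a rough illustration, the sketch below declares a small matmul with the te API and wraps it in a ComputeDAG. The `ansor` import and the `ComputeDAG([A, B, C])` constructor call are assumptions based on this description; the module layout and exact signature may differ in the merged code.

```python
# Sketch only: the `ansor.ComputeDAG` call is assumed from the description
# above, not copied from this PR.
import tvm
from tvm import te, ansor

N = 512
A = te.placeholder((N, N), name="A")
B = te.placeholder((N, N), name="B")
k = te.reduce_axis((0, N), name="k")
C = te.compute((N, N), lambda i, j: te.sum(A[i, k] * B[k, j], axis=k), name="C")

# Convert the compute declaration (a single op or a subgraph) into a ComputeDAG.
# The DAG carries analyses such as the FLOP count and producer/consumer relations.
dag = ansor.ComputeDAG([A, B, C])
print(dag)  # a loop-nest style preview of the declaration
```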
LoopState: This defines the "state" for the search problem.
Related source files: src/ansor/loop_state.*, python/tvm/ansor/loop_state.py.
Each LoopState corresponds to a specific schedule for the target ComputeDAG. A LoopState consists of: 1. the current loop structure; 2. the transform history used to reach this loop structure.
The loop structure keeps a preview of how the schedule will finally look after lowering (the number of iterators, the extent of each iterator, the compute_at locations, etc.), which helps the search policy make decisions during the search. The transform history is a sequence of TransformStep objects that will finally be mapped to schedule primitives.
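A minimal sketch of these two parts, assuming the initial state can be obtained from the ComputeDAG and exposes `stages` and `transform_steps` attributes (these names are assumptions based on the description above, not the exact merged API):

```python
# Sketch only: `get_init_state`, `stages`, and `transform_steps` are assumed names.
from tvm import te, ansor

A = te.placeholder((1024,), name="A")
B = te.compute((1024,), lambda i: A[i] * 2.0, name="B")
dag = ansor.ComputeDAG([A, B])

state = dag.get_init_state()      # the initial LoopState of this ComputeDAG

# 1. the current loop structure: one entry per operation stage
for stage in state.stages:
    print(stage)

# 2. the transform history used to reach this loop structure
#    (empty for the initial state)
print(state.transform_steps)
```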
TransformStep: This defines the "action" for the search problem, i.e., the schedule primitives for our Sketch IR.
Related source files: src/ansor/transform_step.*, python/tvm/ansor/loop_state.py.
Each step has a corresponding tvm.te schedule primitive. We record all TransformSteps for every state as its transform history. After the schedule search finishes, these transform steps are lowered to their corresponding TVM schedule primitives.
Note: This PR only contains a small subset of the TransformSteps. The complete set of transform steps will come in the next PRs.
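The sketch below shows this correspondence under assumed names: a state-level split records a SplitStep in the transform history, and the same operation is later applied as the standard `split` primitive on a real te schedule.

```python
# Sketch only: the state-level calls are assumed signatures; the
# te.create_schedule / s[B].split calls at the end are standard TVM and show
# what the recorded step is lowered to after the search.
from tvm import te, ansor

A = te.placeholder((1024, 1024), name="A")
B = te.compute((1024, 1024), lambda i, j: A[i, j] + 1.0, name="B")
dag = ansor.ComputeDAG([A, B])
state = dag.get_init_state()

# Recording the action: appends a SplitStep to the history and updates the
# loop-structure preview of stage B.
bo, bi = state.split(B, state[B].iters[0], [32])

# The te-level equivalent, used when the steps are replayed onto a real schedule.
s = te.create_schedule(B.op)
xo, xi = s[B].split(B.op.axis[0], factor=32)
```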
SearchTask: Meta information and hardware parameters for a specific schedule search task.
Related source files: src/ansor/search_task.*.
This structure includes the target ComputeDAG, device information, and some hardware parameters obtained from the system or from user input.
SearchPolicy: The search policy defines how Ansor auto-generates a high-performance schedule for different computations.
Related source files: src/ansor/search_policy/*.
A SearchPolicy takes a SearchTask, system information, and some tuning options as inputs, performs the schedule search, and returns the state with the best performance. The resulting state can later be applied to a TVM schedule. Note that in the Ansor paper (https://arxiv.org/abs/2006.06762) we proposed a sketch generation policy, which achieves good results on various workloads across different devices. In this minimum-system PR, however, we only provide an EmptyPolicy to illustrate the search policy interface.
Ansor Minimum System
This is a brief overview of the Ansor system workflow (the original diagram is omitted here):
1. The user writes the compute declaration with the te API and creates a ComputeDAG structure.
2. The ComputeDAG is used to create a SearchTask structure.
3. A SearchPolicy takes the SearchTask as input and performs the schedule search. During the search process, the SearchPolicy generates multiple candidate states, each of which corresponds to a specific TVM schedule.
In the Ansor system, we use the sketch generation policy described in the paper (to be brought in by later PRs) as the default search policy, which should be enough to cover most use cases. Meanwhile, we will have an RFC for a custom-rule mechanism that enables user-defined template search to serve the same functionality as the current AutoTVM templates. Specifically, we will provide Python APIs for the new IR intended to be used for sketch customization. They look very similar to the existing schedule primitives, as shown in python/tvm/ansor/loop_state.py.
Our goal is to make sure Ansor can cover all AutoTVM functionalities while achieving the same or better performance, so that the community can gradually switch from AutoTVM to Ansor.
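Putting the pieces together, a minimal end-to-end pass through the system might look like the sketch below. All of the `ansor.*` names (SearchTask, EmptyPolicy, auto_schedule, apply_steps_from_state) and the exact return values are assumptions based on this description, not the exact merged API.

```python
# End-to-end sketch only: the ansor.* names and return values below are
# assumptions based on this description.
import tvm
from tvm import te, ansor

# 1. Write the compute declaration with the te API and build a ComputeDAG.
N = 1024
A = te.placeholder((N, N), name="A")
B = te.placeholder((N, N), name="B")
k = te.reduce_axis((0, N), name="k")
C = te.compute((N, N), lambda i, j: te.sum(A[i, k] * B[k, j], axis=k), name="C")
dag = ansor.ComputeDAG([A, B, C])

# 2. Use the ComputeDAG plus target/hardware information to create a SearchTask.
task = ansor.SearchTask(dag, "matmul_1024", tvm.target.Target("llvm"))

# 3. A SearchPolicy performs the schedule search. The minimum system only ships
#    EmptyPolicy, assumed here to return the (unmodified) best state it finds.
state = ansor.auto_schedule(task, search_policy=ansor.EmptyPolicy())

# 4. Replay the best state's transform steps to get a real te schedule,
#    then lower/build as usual.
sched, args = dag.apply_steps_from_state(state)
print(tvm.lower(sched, args, simple_mode=True))
func = tvm.build(sched, args, target="llvm")
```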
More Details on LoopState
In this section, we illustrate what a loop state looks like and how it connects to the current TVM build system.
Take a simple State that includes Split, Fuse, and Reorder steps as an example (the original listing is reconstructed as a sketch below):
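The original listing is not preserved on this page; the sketch below reconstructs the idea with assumed API names: build a state from a small matmul DAG, apply Split, Reorder, and Fuse steps, and print the resulting loop-structure preview and its equivalent Python schedule code.

```python
# Reconstruction sketch only: the ansor.* names and State method signatures
# are assumptions; the original listing is omitted from this page.
from tvm import te, ansor

N = 128
A = te.placeholder((N, N), name="A")
B = te.placeholder((N, N), name="B")
k = te.reduce_axis((0, N), name="k")
C = te.compute((N, N), lambda i, j: te.sum(A[i, k] * B[k, j], axis=k), name="C")

dag = ansor.ComputeDAG([A, B, C])
state = dag.get_init_state()

# Each call below records a TransformStep in the state's transform history.
i, j, rk = state[C].iters
io, ii = state.split(C, i, [8])                # SplitStep
jo, ji = state.split(C, j, [8])                # SplitStep
state.reorder(C, [io, jo, ii, ji, rk])         # ReorderStep
fused = state.fuse(C, [io, jo])                # FuseStep

print(state)                                   # the loop-structure "preview"
# Print the history as equivalent TVM Python schedule API calls (assumed helper).
print(dag.print_python_code_from_state(state))
```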
Printing this state shows the loop structure of the corresponding TVM schedule, i.e., the "preview" (the original example output is not preserved on this page).
The state also stores all of the transform steps required to reach it. We can print this transform history as calls to TVM's Python schedule API, which is useful for debugging or for applying the schedule on an older TVM version without Ansor support.
We can also replay these steps to get a schedule for tvm.lower and tvm.build. The steps of a state can also be serialized into a record in the log file (the example record is not preserved on this page).
Ansor serializes all transform steps to the log file, while AutoTVM serializes the parameters of a predefined template. The log format discussion will be based on https://discuss.tvm.ai/t/rfc-canonicalizing-autotvm-log-format/7038/.
In the next few PRs, we'll introduce the complete search policy, tutorials for single-op/subgraph schedule search, Relay integration and tutorials for end-to-end network schedule search, and custom rules to support customized search spaces.
This is a joint work by @merrymercy @jcf94 @minminsun @FrozenGene @comaniac @yangjunpro @yidawang .