[wip] Register promotion improvements #161
base: master
Conversation
Force-pushed 4d76698 to 8295319
Force-pushed 8295319 to 178b5cf
@caffe2bot retest this please
Force-pushed ea53839 to 815b402
* are mapped to threads (the innermost of them being mapped to thread x) and
* the depth of this mapping can be obtained from threadIdxxScheduleDepthState.
*
* In parciular, the group's footprint must contain only one element and the
particular
.apply_domain(schedule);

// Scheduled accesses contain maps from schedule dimensions to tensor
// subscripts. Compute the relation that between the schedule dimensions
extra "that"
/*
 * Check if the given "group" can be promoted to registers for the given active
 * domain points under full "schedule" where "nThreads" consecutive dimensions
 * are mapped to threads (the innermost of them being mapped to thread x) and
For future reference, can you remind me where the assumption that threadIdx.x is innermost is initially introduced in the context of memory promotion?
Nothing to change now, but as I am reading these pieces again I am wondering where/how badly things will break when we relax that assumption.
size_t nMappedThreads = 0;
for (int j = 0; j < points.dim(isl::dim_type::param); ++j) {
  auto id = points.get_space().get_dim_id(isl::dim_type::param, j);
  for (size_t i = 0; i < mapping::ThreadId::kMaxDim; ++i) {
if (!MappingId::isThreadId(id)) {
continue;
}
I wish, but ids in an isl space are not MappingId, and there is no easy way to convert them.
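For illustration, a minimal sketch of the comparison-based check this forces; it only assumes what the diff below already shows, namely that mapping::ThreadId::makeId(i) yields the isl::id used for thread dimension i and that kMaxDim bounds the number of thread dimensions:

bool isThreadMappingId(isl::id id) {
  // Sketch only: there is no direct isl::id -> MappingId conversion, so
  // compare against every id that ThreadId can produce.
  for (size_t i = 0; i < mapping::ThreadId::kMaxDim; ++i) {
    if (id == mapping::ThreadId::makeId(i)) {
      return true;
    }
  }
  return false;
}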
if (id != mapping::ThreadId::makeId(i)) {
  continue;
}
if (getParamValIfFixed(points, j) ==
You could extend to a templated getParamValIfFixed<T>(points, j) and just compare to 0.
We can just have a comparison operator between isl::val and int. I don't think we should narrow isl::val to int in a call.
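A minimal sketch of such an operator, assuming the isl C++ wrapper exposes a val(ctx, long) constructor, val::get_ctx, and val::eq (wrapping isl_val_eq); these should be checked against the local bindings:

inline bool operator==(isl::val v, long i) {
  // Lift the integer into an isl::val instead of narrowing the isl::val.
  return v.eq(isl::val(v.get_ctx(), i));
}

// Hypothetical use at the call site above:
//   if (getParamValIfFixed(points, j) == 0) { ... }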
if (!hasReuse(*group, fullSched, depth)) {
  continue;
}
// TODO: if something is already in shared, but reuse it within one
Don't you have it backwards here?
First promote to registers at some depth below threadId mappings.
Then promote remaining stuff to shared if extra reuse remains to be exploited or coalescing is bad.
If you first promote to shared and then promote again to private, a bunch of issues can occur:
- missed opportunities to promote to shared because of incorrect size estimate
- extra complexity to undo promotion to shared
- "no point in keeping it in shared _if_ it gets promoted into a register" is only true modulo proper coalescing
for (auto a : isl::UnionAsVector<isl::union_map>(accesses)) {
  if (isl::union_map(a.curry()).intersect_domain(domain).is_empty()) {
makes sense
  }
}

schedule = schedule.unite(current);
prefixMupa = isl::manage(isl_multi_union_pw_aff_intersect_domain(
Export this rather than fall back to the old days? Maybe it is exported later?
Force-pushed 3944e43 to 44a7708
Extract as an overload of Scop::activePromotions taking a set of active statement instances and a tensor id. An overload was chosen because both functions return the same data structure and are semantically close.
Internally, we may need to modify the stored active promotions, but the public functions return either a copy of, or a const reference to, that storage. Extract the logic that finds active promotions into a separate function that returns indexes into the storage, and use it to create a copy inside a public call.
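A hedged sketch of the extracted index-returning function; the matching condition and the PromotionInfo field are assumptions for illustration, not the actual implementation:

std::vector<size_t> Scop::activePromotionsIndexes(
    isl::union_set activePoints,
    isl::id tensorId) const {
  std::vector<size_t> result;
  for (size_t i = 0; i < activePromotions_.size(); ++i) {
    const auto& kvp = activePromotions_[i];
    // Keep promotions of the requested tensor whose active statement
    // instances intersect the given ones (condition simplified here).
    if (kvp.second.tensorId != tensorId) { // hypothetical field
      continue;
    }
    if (!kvp.first.intersect(activePoints).is_empty()) {
      result.push_back(i);
    }
  }
  return result;
}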
If a group of references was promoted into shared memory, but it could also be promoted to registers while covering exactly the same statement instances accessing it, demote it from shared memory before promoting to registers.
These option combinations were failing with previous implementations of double promotion. Make sure they never fail again.
All other ScheduleTree node types are printed in such a way that each set (map) of the union_set (union_map) present in the node is printed on a new line. Do the same for extension nodes.
This creates a private convenience function to obtain a copy of the active promotions specified by a list of their indexes in the storage. Use this function in Scop::promoteGroup to avoid retraversing the list of all promotions twice in a row.
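Its body could plausibly be as simple as the following sketch; promotionsAtIndexes and activePromotions_ appear in the diffs below, while the implementation itself is an assumption:

std::vector<std::pair<isl::union_set, Scop::PromotionInfo>>
Scop::promotionsAtIndexes(const std::vector<size_t>& indexes) const {
  std::vector<std::pair<isl::union_set, Scop::PromotionInfo>> result;
  // Hand out copies of the selected promotions; the storage stays private.
  for (auto idx : indexes) {
    result.push_back(activePromotions_[idx]);
  }
  return result;
}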
In cases when the approximate footprint of the reference group being promoted to registers is not a subset of any of the approximate footprints of the reference groups promoted to shared memory, it is still possible to promote by copying directly from global memory, as long as all overlapping reference groups have only read the data. It will just create multiple copies of the data in different storages without compromising correctness.
In cases where a reference group promoted to registers covered exactly the same accesses as another group promoted to shared memory, the second group was demoted to save shared memory space. However, this led to adverse effects: copying global->shared is performed at the beginning of the block while copying global->register happens deeper in the tree, which does not allow the latency of the loads to be hidden. Keep the group promoted to shared memory and perform a copy shared->register. An alternative solution would be to decrease the promotion scope depth for register promotion. This would require ensuring that loops whose indices are present in subscripts of "register" arrays are fully unrolled, so that the elements of that array are effectively mapped to registers. Since unrolling is expensive in compilation time and is exposed to the autotuner, we would prefer to also expose the register promotion depth in the future.
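Schematically, the difference in copy placement looks like this; the kernel is purely illustrative (TILE and all names are hypothetical), not TC-generated code:

#define TILE 16

__global__ void copyPlacement(const float* A, float* out, int lda) {
  __shared__ float tile[TILE][TILE];
  // global->shared copy at the beginning of the block: the long-latency
  // global loads are issued early and can overlap with other work.
  tile[threadIdx.y][threadIdx.x] =
      A[(blockIdx.y * TILE + threadIdx.y) * lda + blockIdx.x * TILE + threadIdx.x];
  __syncthreads();

  float acc = 0.f;
  for (int k = 0; k < TILE; ++k) {
    // shared->register copy deeper in the tree: cheap, since the data is
    // already on-chip; demoting the shared group would turn this into a
    // late global load whose latency cannot be hidden.
    float r = tile[threadIdx.y][k];
    acc += r;
  }
  out[(blockIdx.y * TILE + threadIdx.y) * gridDim.x * TILE +
      blockIdx.x * TILE + threadIdx.x] = acc;
}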
Force-pushed 44a7708 to 230b99a
So this commit 187406c is where everything happens. The diff is WIP, but I still made a first pass at it; the first remark is that insertIntraCopiesUnder begs to be properly documented.
Regarding the choice of promotion ordering, have you thought about doing it the other way around (and if so, can you comment on the tradeoffs)?
Personally, I would have gone first for promotion to registers.
Then promote remaining stuff to shared if extra reuse remains to be exploited or coalescing is bad.
If you first promote to shared and then promote again to private, a bunch of issues can occur:
- missed opportunities to promote to shared because of incorrect size estimate
- extra complexity to undo promotion to shared
- demotion from shared is only good modulo proper coalescing
I'll make another pass tomorrow with a clear head.
@@ -412,6 +412,11 @@ struct Scop {
       isl::schedule_constraints constraints,
       const SchedulerOptionsView& schedulerOptions);

+  // Get the indexes of active promotions in the activePromotions_.
+  std::vector<size_t> activePromotionsIndexes(
activePromotionsIndices
src/core/polyhedral/scop.cc (Outdated)
-std::vector<std::pair<isl::union_set, Scop::PromotionInfo>>
-Scop::activePromotions(isl::union_set activePoints, isl::id tensorId) {
-  std::vector<std::pair<isl::union_set, Scop::PromotionInfo>> result;
+std::vector<size_t> Scop::activePromotionsIndexes(
activePromotionsIndices
include/tc/core/polyhedral/scop.h (Outdated)
@@ -331,7 +331,9 @@ struct Scop {

   std::vector<std::pair<isl::union_set, Scop::PromotionInfo>> activePromotions(
       isl::union_set activePoints,
-      isl::id tensorId) const;
+      isl::id tensorId) const {
+    return promotionsAtIndexes(activePromotionsIndexes(activePoints, tensorId));
Here and everywhere else, the plural of index is indices :)
Both are correct in English ;) But OK, if you insist.
There are pros and cons to doing it both ways, or even in a single promotion pass. If you want to promote to registers first, and then promote the same reference group to shared, you will have to find and modify the copy expressions that you inserted in all register scopes. If you do it the other way around, you can just demote one shared promotion. It will free some space, but it's straightforward to promote something else there. Conceptually, I found it simpler to always know where your data is currently located (global, shared, register), so I went for shared, then registers.
Practice shows that we may want to use shared memory even if there is no reuse, for latency hiding reasons. This is actually one of the main reasons why you saw perf regressions with registers compared to shared-only.
We can keep the collected data and call the promotion once again (it's greedy).
Cutting a tree branch is far easier than rewriting the functions for copies.
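As a schematic of the ordering defended here, with hypothetical helper names rather than the actual TC API:

void promoteGreedily(Scop& scop, size_t sharedDepth, size_t registerDepth) {
  // Pass 1: shared memory first; also useful for latency hiding even
  // without reuse.
  promoteToSharedAtDepth(scop, sharedDepth); // hypothetical helper
  // Pass 2: registers below the thread mapping; where a shared group
  // covers the same accesses, copy shared->register rather than demoting.
  promoteToRegistersAtDepth(scop, registerDepth); // hypothetical helper
}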
It is even trickier than that.
Before: if a group of references could also be promoted to registers while covering exactly the same statement instances accessing it, demote it from shared memory before promoting to registers.
After: if a group of references can be promoted into registers while covering a subset of statement instances accessing it, copy from shared to registers and back.
Stacked on #149