This repository has been archived by the owner on Apr 28, 2023. It is now read-only.

[wip] Register promotion improvements #161

Open · wants to merge 8 commits into master
Conversation

@ftynse (Contributor) commented Mar 15, 2018:

  • If a group of references was promoted into shared memory, but it could
    also be promoted to registers while covering exactly the same statement
    instances accessing it, demote it from shared memory before promoting to
    registers.
  • If a group of references was promoted into shared memory, and a smaller
    group of references can be promoted into registers while covering a
    subset of the statement instances accessing it, copy from shared to
    registers and back (see the sketch below).
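
For clarity, a minimal sketch of the per-group decision this implies; chooseCopySource and the demote callback are hypothetical illustrations, not the actual TC API:

```cpp
#include <functional>

enum class CopySource { Global, Shared };

// Decide where the register copy of one reference group should read from.
// "demote" stands in for removing the shared-memory copy and its copy code.
CopySource chooseCopySource(
    bool groupIsInShared,
    bool registersCoverSameInstances,
    const std::function<void()>& demote) {
  if (!groupIsInShared) {
    return CopySource::Global; // ordinary global->register promotion
  }
  if (registersCoverSameInstances) {
    demote(); // the shared copy would be fully redundant: drop it first
    return CopySource::Global;
  }
  // Registers cover only a subset of the accessing instances: keep the
  // shared copy and stage the data shared->register and back.
  return CopySource::Shared;
}
```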

Stacked on #149

@ftynse added the wip label Mar 15, 2018
@ftynse force-pushed the register-promotion-improvements branch 4 times, most recently from 4d76698 to 8295319 on March 19, 2018 at 13:08
@ftynse changed the base branch from dev to master on March 19, 2018 at 13:17
@ftynse force-pushed the register-promotion-improvements branch from 8295319 to 178b5cf on March 19, 2018 at 13:25
@prigoyal (Contributor) commented:

@caffe2bot retest this please

@ftynse force-pushed the register-promotion-improvements branch 3 times, most recently from ea53839 to 815b402 on March 22, 2018 at 18:58
@ftynse mentioned this pull request Mar 23, 2018
* are mapped to threads (the innermost of them being mapped to thread x) and
* the depth of this mapping can be obtained from threadIdxxScheduleDepthState.
*
* In parciular, the group's footprint must contain only one element and the
Review comment (Contributor):

particular

.apply_domain(schedule);

// Scheduled accesses contain maps from schedule dimensions to tensor
// subscripts. Compute the relation that between the schedule dimensions
Review comment (Contributor):

extra that

/*
* Check if the given "group" can be promoted to registers for the given active
* domain points under full "schedule" where "nThreads" consecutive dimensions
* are mapped to threads (the innermost of them being mapped to thread x) and
Review comment (Contributor):

For future reference, can you remind me where the assumption that threadIdx.x is innermost is initially introduced in the context of memory promotion?
Nothing to change now but as I am reading these pieces again I am wondering where/how bad things will break when we relax that assumption.

size_t nMappedThreads = 0;
for (int j = 0; j < points.dim(isl::dim_type::param); ++j) {
  auto id = points.get_space().get_dim_id(isl::dim_type::param, j);
  for (size_t i = 0; i < mapping::ThreadId::kMaxDim; ++i) {
Review comment (Contributor), with a suggestion:

if (!MappingId::isThreadId(id)) {
  continue;
}

Reply from @ftynse (author):

I wish, but ids in isl space are not MappingId and there is no easy way to convert them.
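
For reference, a minimal sketch of the workaround this implies, mirroring the loop in the diff (mapping::ThreadId::kMaxDim and makeId are the TC identifiers already used above; the helper name is hypothetical):

```cpp
// An isl::id taken from a space is not a mapping::ThreadId, so the only
// practical check is to compare it against each freshly built thread id.
bool isThreadParam(isl::id id) {
  for (size_t i = 0; i < mapping::ThreadId::kMaxDim; ++i) {
    if (id == mapping::ThreadId::makeId(i)) {
      return true;
    }
  }
  return false;
}
```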

if (id != mapping::ThreadId::makeId(i)) {
  continue;
}
if (getParamValIfFixed(points, j) ==
Review comment (Contributor):

You could extend to a templated

getParamValIfFixed<T>(points, j)

and just compare to 0

Reply from @ftynse (author):

We can just have a comparison operator between isl::val and int. I don't think we should narrow isl::val to int in a call.
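
A minimal sketch of such a comparison operator, assuming an isl C++ wrapper that exposes the raw pointer via get() (isl_val_cmp_si is the underlying C call; the exact wrapper spelling is an assumption):

```cpp
#include <isl/val.h>

// Compare an isl value to a C long without narrowing the isl::val.
// isl_val_cmp_si returns <0, 0, or >0, strcmp-style.
inline bool operator==(const isl::val& v, long i) {
  return isl_val_cmp_si(v.get(), i) == 0;
}
```

With this, the truncated comparison above can take a plain integer on its right-hand side without any conversion at the call site.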

if (!hasReuse(*group, fullSched, depth)) {
  continue;
}
// TODO: if something is already in shared, but reuse it within one
Review comment (Contributor):

Don't you have it backwards here?
First promote to registers at some depth below threadId mappings.
Then promote remaining stuff to shared if extra reuse remains to be exploited or coalescing is bad.

If you first promote to shared and then promote again to private, a bunch of issues can occur:

  1. missed opportunities to promote to shared because of incorrect size estimate
  2. extra complexity to undo promotion to shared
  3. the claim that there is no point in keeping it in shared _if_ it gets promoted into a register is only true modulo proper coalescing

for (auto a : isl::UnionAsVector<isl::union_map>(accesses)) {
  if (isl::union_map(a.curry()).intersect_domain(domain).is_empty()) {
Review comment (Contributor):

makes sense

}
}

schedule = schedule.unite(current);
prefixMupa = isl::manage(isl_multi_union_pw_aff_intersect_domain(
Review comment (Contributor):

Export this rather than fall back to the old days?
Maybe it is exported later?
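
For context, a sketch of what this would look like if/when intersect_domain is exported for multi_union_pw_aff in the C++ bindings (assumed here, following the usual isl binding conventions):

```cpp
// Hypothetical wrapper-based equivalent of the isl::manage(...) fallback:
isl::multi_union_pw_aff restrictToActive(
    isl::multi_union_pw_aff prefix,
    isl::union_set activePoints) {
  return prefix.intersect_domain(activePoints);
}
```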

@ftynse force-pushed the register-promotion-improvements branch from 3944e43 to 44a7708 on March 26, 2018 at 21:05
ftynse added 8 commits (March 26, 2018 15:47):

  • Extract as an overload of Scop::activePromotions taking a set of active
    statement instances and a tensor id. An overload was chosen because both
    functions return the same data structure and are semantically close.
  • Internally, we may need to modify the stored active promotions, but the
    public functions return either a copy of or a const reference to that
    storage. Extract the logic to find active promotions into a separate
    function that returns indexes into the storage, and use it to create a
    copy inside a public call.
  • If a group of references was promoted into shared memory, but it could
    also be promoted to registers while covering exactly the same statement
    instances accessing it, demote it from shared memory before promoting to
    registers.
  • These option combinations were failing with previous implementations of
    double promotion. Make sure they never fail again.
  • All other ScheduleTree node types are printed in such a way that each
    set (map) of the union_set (union_map) present in the node is printed on
    a new line. Do the same for extension nodes.
  • Create a private convenience function to obtain a copy of the active
    promotions specified by a list of their indexes in the storage. Use this
    function in Scop::promoteGroup to avoid retraversing the list of all
    promotions twice in a row.
  • In cases when the approximate footprint of the reference group being
    promoted to registers is not a subset of any of the approximate
    footprints of the reference groups promoted to shared memory, it is
    still possible to promote by copying directly from global memory, as
    long as all overlapping reference groups only read the data (sketched
    after this list). This merely creates multiple copies of the data in
    different memory spaces without compromising correctness.
  • In cases where a reference group promoted to registers covered exactly
    the same accesses as another group promoted to shared memory, the second
    group used to be demoted to save shared memory space. However, this had
    adverse effects: copying global->shared happens at the beginning of the
    block while copying global->register happens deeper in the tree, which
    makes it impossible to hide the latency of the loads. Keep the group
    promoted to shared memory and perform a shared->register copy instead.

    An alternative solution would be to decrease the promotion scope depth
    for register promotion. This would require ensuring that loops whose
    indices appear in the subscripts of "register" arrays are fully
    unrolled, so that the elements of those arrays are effectively mapped to
    registers. Since unrolling is expensive in compilation time and is
    exposed to the autotuner, we would prefer to also expose the register
    promotion depth in the future.
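
A minimal sketch of the read-only legality check described in the seventh commit above, with hypothetical types (TC's actual reference groups track richer access information):

```cpp
#include <vector>

// Hypothetical stand-in for a reference group: all we need here is whether
// any access in the group writes the tensor.
struct RefGroup {
  bool hasWrites;
};

// Copying directly from global memory is safe only if every reference group
// whose approximate footprint overlaps the promoted one is read-only; then
// the duplicated copies in different memory spaces can never diverge.
bool canCopyFromGlobal(const std::vector<RefGroup>& overlappingGroups) {
  for (const auto& group : overlappingGroups) {
    if (group.hasWrites) {
      return false;
    }
  }
  return true;
}
```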
@nicolasvasilache force-pushed the register-promotion-improvements branch from 44a7708 to 230b99a on March 26, 2018 at 21:47
@nicolasvasilache (Contributor) left a comment:

So commit 187406c is where everything happens. The diff is still WIP, but I made a first pass at it anyway; the first remark is that insertIntraCopiesUnder begs to be properly documented.

Regarding the choice of promotion ordering, have you thought about doing it the other way around (and if so can you comment on the tradeoffs)?

Personally, I would have gone for promotion to registers first.
Then promote remaining stuff to shared if extra reuse remains to be exploited or coalescing is bad.

If you first promote to shared and then promote again to private, a bunch of issues can occur:

  1. missed opportunities to promote to shared because of incorrect size estimate
  2. extra complexity to undo promotion to shared
  3. demotion from shared is only good modulo proper coalescing

I'll make another pass tomorrow with a clear head

@@ -412,6 +412,11 @@ struct Scop {
      isl::schedule_constraints constraints,
      const SchedulerOptionsView& schedulerOptions);

  // Get the indexes of active promotions in the activePromotions_.
  std::vector<size_t> activePromotionsIndexes(
Review comment (Contributor):

activePromotionsIndices

std::vector<std::pair<isl::union_set, Scop::PromotionInfo>>
Scop::activePromotions(isl::union_set activePoints, isl::id tensorId) {
  std::vector<std::pair<isl::union_set, Scop::PromotionInfo>> result;
std::vector<size_t> Scop::activePromotionsIndexes(
Review comment (Contributor):

activePromotionsIndices

@@ -331,7 +331,9 @@ struct Scop {

   std::vector<std::pair<isl::union_set, Scop::PromotionInfo>> activePromotions(
       isl::union_set activePoints,
-      isl::id tensorId) const;
+      isl::id tensorId) const {
+    return promotionsAtIndexes(activePromotionsIndexes(activePoints, tensorId));
Review comment (Contributor):

Here and everywhere else, plural of index is indices :)

Reply from @ftynse (author):

Both are correct in English ;) But ok if you insist.

@ftynse (author) commented Mar 29, 2018:

> Don't you have it backwards here?

There are pros and cons to doing it either way, or even to doing it in a single promotion pass.

If you promote to registers first, and then want to promote the same reference group to shared, you have to find and modify the copy expressions that were inserted in all register scopes. If you do it the other way around, you can just demote one shared promotion. It frees up some space, and it is straightforward to promote something else there.

Conceptually, I found it simpler to always know where the data is currently located (global, shared, register), so I went for shared first, then registers.

> First promote to registers at some depth below threadId mappings.
> Then promote remaining stuff to shared if extra reuse remains to be exploited or coalescing is bad.

Practice shows that we may want to use shared memory even if there is no reuse, for latency-hiding reasons. This is actually one of the main reasons why you saw perf regressions with registers compared to shared-only.

> If you first promote to shared and then promote again to private, a bunch of issues can occur:
>
> missed opportunities to promote to shared because of incorrect size estimate

We can keep the collected data and call the promotion once again (it's greedy).

> extra complexity to undo promotion to shared

Cutting a tree branch is far easier than rewriting the functions for copies.

> no point in keeping it in shared _if_ it gets promoted into a register is only true modulo proper coalescing

It is even trickier than that.
