This repository has been archived by the owner on Apr 28, 2023. It is now read-only.

[wip] Register promotion improvements #161

Open · wants to merge 8 commits into master
Conversation

@ftynse (Contributor) commented Mar 15, 2018:

  • If a group of references was promoted into shared memory, but it could
    also be promoted to registers while covering exactly the same statement
    instances accessing it, demote it from shared memory before promoting to
    registers.
  • If a group of references was promoted into shared memory, and a smaller
    group of references can be promoted into registers while covering a
    subset of the statement instances accessing it, copy from shared to
    registers and back (see the sketch below).
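
For clarity, a minimal sketch of the per-group decision this implies; chooseCopySource and the demote callback are hypothetical illustrations, not the actual TC API:

```cpp
#include <functional>

enum class CopySource { Global, Shared };

// Decide where the register copy of one reference group should read from.
// "demote" stands in for removing the shared-memory copy and its copy code.
CopySource chooseCopySource(
    bool groupIsInShared,
    bool registersCoverSameInstances,
    const std::function<void()>& demote) {
  if (!groupIsInShared) {
    return CopySource::Global; // ordinary global->register promotion
  }
  if (registersCoverSameInstances) {
    demote(); // the shared copy would be fully redundant: drop it first
    return CopySource::Global;
  }
  // Registers cover only a subset of the accessing instances: keep the
  // shared copy and stage the data shared->register and back.
  return CopySource::Shared;
}
```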

Stacked on #149

@ftynse added the wip label Mar 15, 2018
@ftynse force-pushed the register-promotion-improvements branch 4 times, most recently from 4d76698 to 8295319 on March 19, 2018 at 13:08
@ftynse changed the base branch from dev to master on March 19, 2018 at 13:17
@ftynse force-pushed the register-promotion-improvements branch from 8295319 to 178b5cf on March 19, 2018 at 13:25
@prigoyal (Contributor) commented:

@caffe2bot retest this please

@ftynse force-pushed the register-promotion-improvements branch 3 times, most recently from ea53839 to 815b402 on March 22, 2018 at 18:58
@ftynse mentioned this pull request Mar 23, 2018
* are mapped to threads (the innermost of them being mapped to thread x) and
* the depth of this mapping can be obtained from threadIdxxScheduleDepthState.
*
* In parciular, the group's footprint must contain only one element and the
Review comment (Contributor):

particular

.apply_domain(schedule);

// Scheduled accesses contain maps from schedule dimensions to tensor
// subscripts. Compute the relation that between the schedule dimensions
Review comment (Contributor):

extra that

/*
* Check if the given "group" can be promoted to registers for the given active
* domain points under full "schedule" where "nThreads" consecutive dimensions
* are mapped to threads (the innermost of them being mapped to thread x) and
Review comment (Contributor):

For future reference, can you remind me where the assumption that threadIdx.x is innermost is initially introduced in the context of memory promotion?
Nothing to change now but as I am reading these pieces again I am wondering where/how bad things will break when we relax that assumption.

size_t nMappedThreads = 0;
for (int j = 0; j < points.dim(isl::dim_type::param); ++j) {
  auto id = points.get_space().get_dim_id(isl::dim_type::param, j);
  for (size_t i = 0; i < mapping::ThreadId::kMaxDim; ++i) {
Review comment (Contributor), with a suggestion:

if (!MappingId::isThreadId(id)) {
  continue;
}

Reply from @ftynse (author):

I wish, but ids in isl space are not MappingId and there is no easy way to convert them.
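
For reference, a minimal sketch of the workaround this implies, mirroring the loop in the diff (mapping::ThreadId::kMaxDim and makeId are the TC identifiers already used above; the helper name is hypothetical):

```cpp
// An isl::id taken from a space is not a mapping::ThreadId, so the only
// practical check is to compare it against each freshly built thread id.
bool isThreadParam(isl::id id) {
  for (size_t i = 0; i < mapping::ThreadId::kMaxDim; ++i) {
    if (id == mapping::ThreadId::makeId(i)) {
      return true;
    }
  }
  return false;
}
```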

if (id != mapping::ThreadId::makeId(i)) {
  continue;
}
if (getParamValIfFixed(points, j) ==
Review comment (Contributor):

You could extend to a templated

getParamValIfFixed<T>(points, j)

and just compare to 0

Reply from @ftynse (author):

We can just have a comparison operator between isl::val and int. I don't think we should narrow isl::val to int in a call.
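
A minimal sketch of such a comparison operator, assuming an isl C++ wrapper that exposes the raw pointer via get() (isl_val_cmp_si is the underlying C call; the exact wrapper spelling is an assumption):

```cpp
#include <isl/val.h>

// Compare an isl value to a C long without narrowing the isl::val.
// isl_val_cmp_si returns <0, 0, or >0, strcmp-style.
inline bool operator==(const isl::val& v, long i) {
  return isl_val_cmp_si(v.get(), i) == 0;
}
```

With this, the truncated comparison above can take a plain integer on its right-hand side without any conversion at the call site.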

if (!hasReuse(*group, fullSched, depth)) {
  continue;
}
// TODO: if something is already in shared, but reuse it within one
Review comment (Contributor):

Don't you have it backwards here?
First promote to registers at some depth below threadId mappings.
Then promote remaining stuff to shared if extra reuse remains to be exploited or coalescing is bad.

If you first promote to shared and then promote again to private, a bunch of issues can occur:

  1. missed opportunities to promote to shared because of incorrect size estimate
  2. extra complexity to undo promotion to shared
  3. the claim that there is no point in keeping it in shared _if_ it gets promoted into a register is only true modulo proper coalescing

for (auto a : isl::UnionAsVector<isl::union_map>(accesses)) {
  if (isl::union_map(a.curry()).intersect_domain(domain).is_empty()) {
Review comment (Contributor):

makes sense

}
}

schedule = schedule.unite(current);
prefixMupa = isl::manage(isl_multi_union_pw_aff_intersect_domain(
Review comment (Contributor):

Export this rather than fall back to the old days?
Maybe it is exported later?
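
For context, a sketch of what this would look like if/when intersect_domain is exported for multi_union_pw_aff in the C++ bindings (assumed here, following the usual isl binding conventions):

```cpp
// Hypothetical wrapper-based equivalent of the isl::manage(...) fallback:
isl::multi_union_pw_aff restrictToActive(
    isl::multi_union_pw_aff prefix,
    isl::union_set activePoints) {
  return prefix.intersect_domain(activePoints);
}
```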

@ftynse force-pushed the register-promotion-improvements branch from 3944e43 to 44a7708 on March 26, 2018 at 21:05
ftynse added 8 commits (March 26, 2018 15:47):

  • Extract as an overload of Scop::activePromotions taking a set of active
    statement instances and a tensor id. An overload was chosen because both
    functions return the same data structure and are semantically close.
  • Internally, we may need to modify the stored active promotions, but the
    public functions return either a copy of or a const reference to that
    storage. Extract the logic to find active promotions into a separate
    function that returns indexes into the storage, and use it to create a
    copy inside a public call.
  • If a group of references was promoted into shared memory, but it could
    also be promoted to registers while covering exactly the same statement
    instances accessing it, demote it from shared memory before promoting to
    registers.
  • These option combinations were failing with previous implementations of
    double promotion. Make sure they never fail again.
  • All other ScheduleTree node types are printed in such a way that each
    set (map) of the union_set (union_map) present in the node is printed on
    a new line. Do the same for extension nodes.
  • Create a private convenience function to obtain a copy of the active
    promotions specified by a list of their indexes in the storage. Use this
    function in Scop::promoteGroup to avoid retraversing the list of all
    promotions twice in a row.
  • In cases when the approximate footprint of the reference group being
    promoted to registers is not a subset of any of the approximate
    footprints of the reference groups promoted to shared memory, it is
    still possible to promote by copying directly from global memory, as
    long as all overlapping reference groups only read the data (sketched
    after this list). This merely creates multiple copies of the data in
    different memory spaces without compromising correctness.
  • In cases where a reference group promoted to registers covered exactly
    the same accesses as another group promoted to shared memory, the second
    group used to be demoted to save shared memory space. However, this had
    adverse effects: copying global->shared happens at the beginning of the
    block while copying global->register happens deeper in the tree, which
    makes it impossible to hide the latency of the loads. Keep the group
    promoted to shared memory and perform a shared->register copy instead.

    An alternative solution would be to decrease the promotion scope depth
    for register promotion. This would require ensuring that loops whose
    indices appear in the subscripts of "register" arrays are fully
    unrolled, so that the elements of those arrays are effectively mapped to
    registers. Since unrolling is expensive in compilation time and is
    exposed to the autotuner, we would prefer to also expose the register
    promotion depth in the future.
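
A minimal sketch of the read-only legality check described in the seventh commit above, with hypothetical types (TC's actual reference groups track richer access information):

```cpp
#include <vector>

// Hypothetical stand-in for a reference group: all we need here is whether
// any access in the group writes the tensor.
struct RefGroup {
  bool hasWrites;
};

// Copying directly from global memory is safe only if every reference group
// whose approximate footprint overlaps the promoted one is read-only; then
// the duplicated copies in different memory spaces can never diverge.
bool canCopyFromGlobal(const std::vector<RefGroup>& overlappingGroups) {
  for (const auto& group : overlappingGroups) {
    if (group.hasWrites) {
      return false;
    }
  }
  return true;
}
```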
@nicolasvasilache force-pushed the register-promotion-improvements branch from 44a7708 to 230b99a on March 26, 2018 at 21:47
@nicolasvasilache (Contributor) left a comment:

So commit 187406c is where everything happens. The diff is still WIP, but I made a first pass at it anyway; the first remark is that insertIntraCopiesUnder begs to be properly documented.

Regarding the choice of promotion ordering, have you thought about doing it the other way around (and if so can you comment on the tradeoffs)?

Personally, I would have gone for promotion to registers first.
Then promote remaining stuff to shared if extra reuse remains to be exploited or coalescing is bad.

If you first promote to shared and then promote again to private, a bunch of issues can occur:

  1. missed opportunities to promote to shared because of incorrect size estimate
  2. extra complexity to undo promotion to shared
  3. demotion from shared is only good modulo proper coalescing

I'll make another pass tomorrow with a clear head

@@ -412,6 +412,11 @@ struct Scop {
      isl::schedule_constraints constraints,
      const SchedulerOptionsView& schedulerOptions);

  // Get the indexes of active promotions in the activePromotions_.
  std::vector<size_t> activePromotionsIndexes(
Review comment (Contributor):

activePromotionsIndices

std::vector<std::pair<isl::union_set, Scop::PromotionInfo>>
Scop::activePromotions(isl::union_set activePoints, isl::id tensorId) {
  std::vector<std::pair<isl::union_set, Scop::PromotionInfo>> result;
std::vector<size_t> Scop::activePromotionsIndexes(
Review comment (Contributor):

activePromotionsIndices

@@ -331,7 +331,9 @@ struct Scop {

   std::vector<std::pair<isl::union_set, Scop::PromotionInfo>> activePromotions(
       isl::union_set activePoints,
-      isl::id tensorId) const;
+      isl::id tensorId) const {
+    return promotionsAtIndexes(activePromotionsIndexes(activePoints, tensorId));
Review comment (Contributor):

Here and everywhere else, plural of index is indices :)

Reply from @ftynse (author):

Both are correct in English ;) But ok if you insist.

@ftynse (author) commented Mar 29, 2018:

> Don't you have it backwards here?

There are pros and cons to doing it either way, or even to doing it in a single promotion pass.

If you promote to registers first, and then want to promote the same reference group to shared, you have to find and modify the copy expressions that were inserted in all register scopes. If you do it the other way around, you can just demote one shared promotion. It frees up some space, and it is straightforward to promote something else there.

Conceptually, I found it simpler to always know where the data is currently located (global, shared, register), so I went for shared first, then registers.

> First promote to registers at some depth below threadId mappings.
> Then promote remaining stuff to shared if extra reuse remains to be exploited or coalescing is bad.

Practice shows that we may want to use shared memory even if there is no reuse, for latency-hiding reasons. This is actually one of the main reasons why you saw perf regressions with registers compared to shared-only.

> If you first promote to shared and then promote again to private, a bunch of issues can occur:
>
> missed opportunities to promote to shared because of incorrect size estimate

We can keep the collected data and call the promotion once again (it's greedy).

> extra complexity to undo promotion to shared

Cutting a tree branch is far easier than rewriting the functions for copies.

> no point in keeping it in shared _if_ it gets promoted into a register is only true modulo proper coalescing

It is even trickier than that.
