Add GPU autoscheduler #6856
Conversation
Step 1: please run …
Yeah, there really is a lot of overlap with Adams2019 here. Would it be tractable to merge these changes into Adams2019, or at least have shared code between them?
.unroll(r, 8);
A.in().compute_at(prod, r).vectorize(_0).unroll(_1);
B.in().compute_at(prod, r).vectorize(_0).unroll(_1);
if (!auto_schedule) {
Here and elsewhere: as of #6838, you should no longer examine the auto_schedule GeneratorParam directly; instead, please call the using_autoscheduler() method.
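A sketch of the suggested change for the quoted schedule (the manual-schedule body is elided):

// Before: examining the GeneratorParam directly
if (!auto_schedule) {
    // ... manual schedule ...
}

// After: query the generator instead
if (!using_autoscheduler()) {
    // ... manual schedule ...
}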
$(BIN)/cost_model/%.a: $(BIN)/cost_model.generator
@mkdir -p $(@D)
$^ -g $* -o $(BIN)/cost_model -f $* target=$(HL_TARGET)-no_runtime auto_schedule=false enable_debug_output=$(ENABLE_DEBUG_OUTPUT) -e stmt,static_library,h,assembly
Here and elsewhere: auto_schedule=false is no longer supported and should be removed.
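Applied to the cost_model rule above, the recipe would simply drop that flag, e.g.:

$^ -g $* -o $(BIN)/cost_model -f $* target=$(HL_TARGET)-no_runtime enable_debug_output=$(ENABLE_DEBUG_OUTPUT) -e stmt,static_library,h,assembly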
$(BIN)/%/demo.a: $(GENERATOR_BIN)/demo.generator $(BIN)/libautoschedule_anderson2021.$(SHARED_EXT)
@mkdir -p $(@D)
HL_WEIGHTS_DIR=$(SRC)/baseline.weights \
$(GENERATOR_BIN)/demo.generator -g demo -o $(@D) -f demo target=$* auto_schedule=true -p $(BIN)/libautoschedule_anderson2021.$(SHARED_EXT) -s Anderson2021
Here and elsewhere: auto_schedule=true -s Anderson2021 is no longer supported syntax. Instead, specify autoscheduler=Anderson2021 to replace both of these.
Branch updated from 5b92284 to bb571dd.
Some of the files (e.g. …). Unfortunately, most of the other files (e.g. …) …
Yep!
No worries then.
@steven-johnson The buildbots are green now.
This looks pretty great, thanks for doing all this work to get this landed! Just a handful of mostly style nits and such, with a few relatively minor things that need addressing (and a few things that would be nice for someone to work on as a followup).
IntrusivePtr<State> optimal_schedule(int beam_size);
};

#ifdef HALIDE_ALLOW_LEGACY_AUTOSCHEDULER_API |
FYI: HALIDE_ALLOW_LEGACY_AUTOSCHEDULER_API is now removed in Halide main, so this should just be removed here and elsewhere. (We can do this as a followup PR if you like.)
That said, here are the (legacy) env vars you can still use when HALIDE_ALLOW_LEGACY_AUTOSCHEDULER_API is defined:

HL_BEAM_SIZE |
FYI: since HALIDE_ALLOW_LEGACY_AUTOSCHEDULER_API is now removed in Halide main, most of these env vars are probably no longer actually in use (settings would be controlled by GeneratorParams instead), so this should probably get pruned (a subsequent PR is fine).
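For instance, instead of setting HL_BEAM_SIZE in the environment, the equivalent knob would presumably be passed on the generator command line as an autoscheduler GeneratorParam, along the lines of the following (the exact sub-parameter names depend on what Anderson2021Params exposes):

autoscheduler=Anderson2021 autoscheduler.beam_size=32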
typedef PerfectHashMap<FunctionDAG::Node::Stage, ScheduleFeatures> StageMapOfScheduleFeatures;

void find_and_apply_schedule(FunctionDAG &dag, const std::vector<Function> &outputs, const Anderson2021Params &params, const Target &target, CostModel *cost_model, int beam_size, StageMapOfScheduleFeatures *schedule_features);
Style nit: this is an awfully long line
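For illustration, one possible wrapping of the declaration above (clang-format would likely produce something similar):

void find_and_apply_schedule(FunctionDAG &dag,
                             const std::vector<Function> &outputs,
                             const Anderson2021Params &params,
                             const Target &target,
                             CostModel *cost_model,
                             int beam_size,
                             StageMapOfScheduleFeatures *schedule_features);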
"Incorrect size for pipeline features"); | ||
int num_stages = 0;
for (const auto &n : dag.nodes) {
    if (!n.is_input) num_stages += (int)n.stages.size();
Style nit, here and elsewhere: generally, Halide always encloses if clauses in braces, even for a single line:
if (...) {
...
}
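Applied to the line quoted above, that would be:

if (!n.is_input) {
    num_stages += (int)n.stages.size();
}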
*(cost_ptrs(i)) = dst(i);
if (std::isnan(dst(i))) {
    any_nans = true;
    aslog(0) << "Prediction " << i << " is NaN. True runtime is " << true_runtimes(i) << "\n";
Generally, we prefer the autoschedulers to be 'quiet' by default -- do we really want/need these to be aslog(0) (vs aslog(1), etc.)?
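For example, raising the threshold so the message only shows up when verbose logging is requested (assuming level 1 is appropriate here):

aslog(1) << "Prediction " << i << " is NaN. True runtime is " << true_runtimes(i) << "\n";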
StageMap<StageMap<bool>> descendants;
root->get_stages_computed_in_each_compute_root_loop(descendants);

aslog(0) << "BEGIN compute locations\n";
Does this need to be aslog(0)?
namespace Internal {
namespace Autoscheduler {

#define MAX_THREADS_PER_BLOCK 1024
Prefer constexpr int for consts like this.
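I.e., roughly:

constexpr int MAX_THREADS_PER_BLOCK = 1024;  // was: #define MAX_THREADS_PER_BLOCK 1024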
#include <cstdint>
#include <vector>

using std::vector;
Never, ever, ever put a using into a .h file in the global namespace.
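If the alias is genuinely useful in this header, one option (sketch only) is to scope it inside the project's namespaces rather than the global namespace; dropping the alias and writing std::vector at the use sites works just as well:

namespace Halide {
namespace Internal {
namespace Autoscheduler {

using std::vector;  // visible only inside this namespace, not globally

}  // namespace Autoscheduler
}  // namespace Internal
}  // namespace Halide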
# benchmarked serially.
BATCH_SIZE=80
EPOCHS=200
NUM_GPUS=$(nvidia-smi --query-gpu=name --format=csv,noheader | wc -l)
Yes, definitely, NVidia is not the only GPU vendor out there :-)
echo "Predict only mode: ON"
fi

DEFAULT_SAMPLES_DIR_NAME="${SAMPLES_DIR:-autotuned_samples}"
Hard-to-parse error messages are not good, please fix :-)
}

cursor = 0;
cost_per_stage_ptrs.clear();
I might be wrong, but it seems that the buildbots are not running the GPU autoscheduler-related tests?
For example, I pulled this branch and tried running anderson2021_test_apps_autoscheduler (i.e., test/autoschedulers/anderson2021/test.cpp), but it instantly fails, complaining that access to cost_per_stage_ptrs is out of bounds. I commented out this line (cost_per_stage_ptrs.clear();) and it seems to work again.
I might be wrong, but it seems that the buildbots are not running the GPU autoscheduler-related tests?
That's correct and as-intended, since this branch hasn't landed yet. The buildbots usually aren't configured to test major new features before they have landed. This PR is close, but has a number of nits that need fixing first.
// Make the vectorized dimension of the inner loop 32 (or as
// close as possible)
int64_t inner_extent = std::min(c->size[vectorized_loop_index], (int64_t)32);

if (c->stage->index == 0) {
    vector<int64_t> tiling(c->node->dimensions, 1);

    // Split into parallelized and serial
    c = c->parallelize_in_tiles(tiling, loop_nest, params, target, true, false);

    if (vectorized_loop_index >= 0) {
        tiling[vectorized_loop_index] = inner_extent;
    }
It seems that vectorized_loop_index == -1 is not checked for here in line 222 (and it seems that -1 is a legal value here, given the subsequent check in line 230), which leads to a crash when running test/autoschedulers/anderson2021/test.cpp. I changed

int64_t inner_extent = std::min(c->size[vectorized_loop_index], (int64_t)32);

to

int64_t inner_extent = vectorized_loop_index != -1 ? std::min(c->size[vectorized_loop_index], (int64_t)32) : (int64_t)32;

to fix this, but I am not sure if this is right.
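The same guard could also be written as an explicit branch (equivalent to the ternary above, and consistent with the brace-style note earlier in this review):

int64_t inner_extent = 32;
if (vectorized_loop_index != -1) {
    inner_extent = std::min(c->size[vectorized_loop_index], (int64_t)32);
}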
@steven-johnson Thanks for the review! Made all the requested changes. Thanks to all the others who left comments/suggestions too.
This appears to be a duplicate of apps/images/rgb_small.png
@@ -6,10 +6,10 @@ using namespace Halide;

class ConvRelu : public Halide::Generator<ConvRelu> {
public:
    Input<Buffer<float, 4>> input{"input"};
These changes look unintentional (same for the file below)
Fixed here: #7475
Add Anderson2021 GPU autoscheduler
This is a draft PR to merge the anderson2021 GPU autoscheduler (#5602). It includes the code (there is some overlap with adams2019, but it will probably be a substantial amount of work to deduplicate them), some tests, utility scripts for generating data/statistics/etc. (these can be removed if desired), and baseline weights (trained on a V100).
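For reference, a minimal sketch of driving this autoscheduler from the C++ API, loosely modeled on test/autoschedulers/anderson2021/test.cpp. The load_plugin / AutoschedulerParams / apply_autoscheduler calls reflect the current plugin API in Halide main and are assumptions here rather than code from this PR; the plugin path and estimates are placeholders.

#include "Halide.h"

using namespace Halide;

int main() {
    // Assumption: the Anderson2021 autoscheduler is built as a plugin and loaded
    // at runtime, like the other autoschedulers (adjust the library name/path).
    load_plugin("libautoschedule_anderson2021.so");

    // A toy pipeline with estimates so the autoscheduler has bounds to work with.
    Func f("f");
    Var x("x"), y("y");
    f(x, y) = x + y;
    f.set_estimate(x, 0, 1024).set_estimate(y, 0, 1024);

    // Anderson2021 schedules for GPUs, so the target needs a GPU feature.
    Target target = get_target_from_environment().with_feature(Target::CUDA);

    Pipeline p(f);
    // Assumption: AutoschedulerParams / apply_autoscheduler as in current Halide main.
    AutoschedulerParams params("Anderson2021");
    p.apply_autoscheduler(target, params);
    return 0;
}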