
Faster vars used tracking in simplify let visitor #8205

Merged: 18 commits, Apr 28, 2024

Conversation

@abadams (Member) commented Apr 19, 2024

This visitor shows up as the main cost of lowering in very large
pipelines.

This visitor is for tracking which lets are actually used for real
inside the body of a let block (as opposed to the tracking we do when
mutating, which is approximate, because we could construct an Expr that
uses a Var and then discard it in a later mutation).

The old implementation made a map of all variables referenced, and then
checked each let name against that map one by one. If there are a small
number of lets outside a huge Stmt, this is bad, because the data
structure has to hold a number of names proportional to the Stmt size
instead of proportional to the number of lets.

This new implementation instead makes a hash set of the let names, and
then traverses the Stmt, removing names from the set as they are
encountered. This is a big speed-up.

We then make the speed-up larger by about the same factor again by doing
the following:

  1. Only add names to the map that might be used based on the recursive
    mutate call. These are very very likely to be used, because we saw them
    at least once, and mutations that remove all uses of a Var are rare.

  2. The visitor should early out when the map becomes empty. The let
    variables are often all used immediately, so this is frequent.
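The approach above can be sketched as follows. This is an illustrative stand-in, not Halide's IR or the PR's actual code: `Node`, `mark_used`, and `unused_lets` are hypothetical names. It shows the hash set of candidate let names shrinking during traversal, with the early-out once the set is empty (optimization 1, restricting which names enter the set in the first place, is noted in the lead text but not modeled here).

```cpp
#include <cassert>
#include <string>
#include <unordered_set>
#include <vector>

// Illustrative expression tree: a node either references a variable
// (non-empty 'var') or is an interior node with children.
struct Node {
    std::string var;
    std::vector<Node> children;
};

// Remove every name that appears in the tree from 'pending'.
// Early out: once 'pending' is empty, every candidate let has been
// seen at least once, so the rest of the tree need not be visited.
void mark_used(const Node &n, std::unordered_set<std::string> &pending) {
    if (pending.empty()) return;  // early out
    if (!n.var.empty()) pending.erase(n.var);
    for (const Node &c : n.children) {
        if (pending.empty()) return;
        mark_used(c, pending);
    }
}

// Returns the subset of 'let_names' with no use in 'body'. The cost is
// proportional to the number of lets plus the portion of the body
// visited before the early out, not to the full body size.
std::unordered_set<std::string>
unused_lets(const Node &body, const std::vector<std::string> &let_names) {
    std::unordered_set<std::string> pending(let_names.begin(), let_names.end());
    mark_used(body, pending);
    return pending;  // whatever survived the traversal is unused
}
```

Contrast with the old scheme, which collected every variable referenced anywhere in the body into a map, so the data structure scaled with the body size rather than the number of lets.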

Speeds up lowering of local laplacian by 1.44x, 2.6x, and 4.8x
for 20, 50, and 100 pyramid levels respectively.

Speeds up lowering of resnet50 by 1.04x.

Speeds up lowering of lens blur by 1.06x.

Built on #8202

Complementary with #8204 though the total speed-up from merging both is not going to be the product of the two speed-ups. The two PRs solve the same local laplacian pathological lowering time problem in two different ways.

Deletes a bunch of code and speeds up lowering time of local laplacian
with 20 pyramid levels by ~2.5%.

It was O(n) for n facts. This makes it O(log(n)).

This was particularly bad for pipelines with lots of inputs or outputs,
because those pipelines have lots of asserts, which make for lots of
facts to substitute in.
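The commit message doesn't show the data structure, but one common way to take a per-query fact lookup from O(n) to O(log n) is to key the facts in a balanced tree rather than scanning a flat list. A minimal illustrative sketch, with a hypothetical `substitute_or_default` helper (not Halide's API):

```cpp
#include <map>
#include <string>

// Illustrative: facts recorded as name -> known constant value.
// std::map is a balanced tree, so find() is O(log n) per query,
// versus O(n) for scanning a flat list of facts on every substitution.
int substitute_or_default(const std::map<std::string, int> &facts,
                          const std::string &name, int fallback) {
    auto it = facts.find(name);  // O(log n)
    return it == facts.end() ? fallback : it->second;
}
```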

Speeds up lowering of local laplacian with 20 pyramid levels (which has
only one input and one output) by 1.09x

Speeds up lowering of the adams 2019 cost model training pipeline (lots
of weight inputs and lots outputs due to derivatives) by 1.5x

Speeds up resnet50 (tons of weight inputs) lowering by 7.3x!

Interval::is_single_point() used to only compare expressions by shallow
equality to see if they are the same Expr object.

However, bounds_of_expr_in_scope works much better if it uses deep
equality instead, so it has a prepass that goes over the provided scope,
calls equal(min, max) on everything, and fixes up anything where deep
equality holds but shallow equality does not.

This prepass costs O(n) for n things in scope, regardless of how complex
the expression being analyzed is. So if you ask for the bounds of, say,
'4' in a context where there are lots of things in the scope, it's
absurdly slow. We were doing this! BoxesTouched calls
bounds_of_expr_in_scope lots of times on small index Exprs within the
same very large scope.

It's better to just make Interval::is_single_point() check deep
equality. This speeds up local laplacian lowering by 1.1x, and resnet50
lowering by 1.5x.
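The shallow/deep distinction can be illustrated with a toy shared-node expression type (`ExprNode`, `constant`, and `mul` are hypothetical stand-ins, not Halide's Expr): shallow equality compares node identity, deep equality compares structure, so two independently built 40*8 trees are deeply but not shallowly equal.

```cpp
#include <memory>

// Toy shared-node expression: 'c' = constant, '*' = multiply.
struct ExprNode {
    char op;
    int value;
    std::shared_ptr<ExprNode> a, b;
};
using Expr = std::shared_ptr<ExprNode>;

Expr constant(int v) {
    return std::make_shared<ExprNode>(ExprNode{'c', v, nullptr, nullptr});
}
Expr mul(const Expr &a, const Expr &b) {
    return std::make_shared<ExprNode>(ExprNode{'*', 0, a, b});
}

// Shallow equality: the very same node object.
bool shallow_equal(const Expr &x, const Expr &y) { return x.get() == y.get(); }

// Deep equality: structurally identical trees.
bool deep_equal(const Expr &x, const Expr &y) {
    if (x.get() == y.get()) return true;
    if (!x || !y || x->op != y->op || x->value != y->value) return false;
    return deep_equal(x->a, y->a) && deep_equal(x->b, y->b);
}
```

With only shallow equality, an interval built from two separately constructed but identical bound Exprs is not recognized as a single point; deep equality recognizes it.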

There were also places where intervals that were a single point were
diverging due to carelessly written code. E.g. the interval [40*8,
40*8], where both of those 40*8s are the same Mul node, was being
simplified like this:

interval.min = simplify(interval.min);
interval.max = simplify(interval.max);

Not only does this do twice the simplification work it needs to, but it
also caused something that was a single point to diverge into not being
a single point, because the repeated constant-folding creates a new
Expr. With the new is_single_point this matters a lot less, but even so,
I centralized simplification of intervals into a single helper that
doesn't do the pointless double-simplification for single points.
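A sketch of the centralization described above, under stated assumptions: `Expr` here is a stand-in shared pointer and `simplify` a stand-in that always allocates a fresh node (as constant-folding does), neither being Halide's real types. Simplifying a single-point interval's bound once and assigning it to both ends halves the work and keeps the two ends the same object.

```cpp
#include <memory>

// Stand-in Expr and simplifier: simplify() "constant-folds" by
// returning a freshly allocated node, as described in the text.
using Expr = std::shared_ptr<const int>;
Expr simplify(const Expr &e) { return std::make_shared<const int>(*e); }

struct Interval {
    Expr min, max;
    bool is_single_point_shallow() const { return min.get() == max.get(); }
};

// The careless pattern: simplifying each end separately makes a
// single point diverge, because each call creates a new node.
Interval simplify_each_end(Interval i) {
    i.min = simplify(i.min);
    i.max = simplify(i.max);
    return i;
}

// Centralized helper: simplify once for single points, preserving
// pointer equality and doing half the work.
Interval simplify_interval(Interval i) {
    if (i.is_single_point_shallow()) {
        i.min = i.max = simplify(i.min);
    } else {
        i.min = simplify(i.min);
        i.max = simplify(i.max);
    }
    return i;
}
```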

Some of these shallowly-unequal but deeply-equal Intervals were being
created in bounds inference itself after the prepass, which may have
been generating suboptimal bounds. This change should fix that in
addition to the compile-time benefits.

Also added a simplify call in SkipStages because I noticed when it
processed specializations it was creating things like (condition) ||
(!condition).
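A minimal sketch of the kind of fold that simplify call enables (`BoolExpr` and `or_is_tautology` are illustrative, not SkipStages' actual code): a disjunction of a condition and its own negation is a tautology and can be folded to true rather than left in the IR.

```cpp
#include <string>

// Illustrative boolean expression: a named condition, possibly negated.
struct BoolExpr {
    std::string var;
    bool negated = false;
};

// Recognize (c || !c): same condition with opposite polarity.
// A simplifier can fold such a disjunction to the constant 'true'
// instead of emitting (condition) || (!condition).
bool or_is_tautology(const BoolExpr &a, const BoolExpr &b) {
    return a.var == b.var && a.negated != b.negated;
}
```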
@rootjalex (Member) commented:

It looks like this PR also contains #8202? Might need to merge with main to clean it up.

@abadams (Member, Author) commented Apr 23, 2024

Merge with main done

@abadams (Member, Author) commented Apr 23, 2024

With the last merge, when combined with #8204, this is 6.7x faster for local laplacian with 100 pyramid levels. But Amdahl's law has kicked in here. For just the pass that was slow (the simplification pass just after vectorization, because it did the first simplification after unify_duplicate_lets), this is actually over 400x faster.

Review thread on this change:

Old code:

    std::map<std::string, int> vars_used;
    count_var_uses(result, vars_used);

New code:

    std::unordered_set<std::string> unused_vars(frames.size() * 2);
Member: Where does frames.size() * 2 come from?

Member: Oh, each frame can maybe insert two names. Maybe add a comment above this definition, for explanation?

@abadams (Author): Actually I think old_uses and new_uses might be mutually exclusive. Will test.

@abadams (Author): Yep. Fixed.

@rootjalex (Member) commented:

Is it reasonable to write tests for this new functionality, or do existing tests already stress it enough?

@abadams (Member, Author) commented Apr 25, 2024

Everything breaks if you remove a let that you should not have. I also confirmed the existing tests fail if you leave too many lets in (it messes up some instruction selection tests).

@abadams (Member, Author) commented Apr 28, 2024

Just needs an approval

@rootjalex (Member) left a review:
Lgtm

@abadams abadams merged commit 64caf31 into main Apr 28, 2024
19 checks passed
3 participants