Some additional performance optimization #119

Merged
merged 51 commits into from
Jul 1, 2021

szeiger
Collaborator

@szeiger szeiger commented Apr 12, 2021

The latest result from 0.4.0 was:

[info] Benchmark           Mode  Cnt    Score   Error  Units
[info] MainBenchmark.main  avgt  160  751.757 ± 6.697  ms/op

This PR brings it down to:

[info] Benchmark           Mode  Cnt    Score   Error  Units
[info] MainBenchmark.main  avgt  160  310.339 ± 1.590  ms/op

The main optimizations are:

  • New scope handling: ValScope is now symbol-based instead of name-based. New bindings are appended at the end with no shadowing; self, $ and super are treated like regular variables in the scope.
  • A new StaticOptimizer which uses a single AST transformation after parsing to implement several of the following optimizations (per source file; stored in the parse cache for reuse)
  • Static binding of standard library calls
  • Arity-specific function calls for standard library functions (ApplyBuiltin) and user-defined functions (Apply)
  • Static resolution of named arguments and defaults in function calls
  • Remove several unnecessary object allocations and indirections (Applyer, lambdas in members & standard library functions)
  • No more ValScope allocation for standard library calls
  • Partial application and peephole optimizations of standard library calls via Builtin.specialize (e.g. pre-compiled patterns in format and strReplaceAll; specialized implementation of length(filter(...)))
  • Static application of functions when all parameters are literals
  • Inlining of literals
  • General optimizations of various standard library functions
  • Make Val a subclass of Lazy to avoid unnecessary wrappers when a Lazy is required for an already computed Val
  • Allow arrays of literals to be treated as literals (similar to static objects introduced in the last round of optimizations)
  • Faster dispatch in the Evaluator methods by using a tableswitch-based dispatch for operators and a lookupswitch ordered by frequency (in our benchmarks) for node types
  • Avoid unnecessary multiple loading of imported files
  • New Renderer implementations based on the latest ujson
  • New JSON parsing directly to Sjsonnet Val without an intermediate ujson AST

New features:

  • ExprTransform and ScopedExprTransform for implementing tree transforms (used by the optimizer and some benchmarks)
  • Benchmarks for parser, optimizer and materializer
  • A profiler for gathering jfr-like profiling data at the level of AST evaluation

val l = visitExpr(lhs)
val r = visitExpr(rhs)
def fail() = Error.fail(s"Unknown binary operation: ${l.prettyName} ${Expr.BinaryOp.name(op)} ${r.prettyName}", pos)
op match {
Contributor

@lihaoyi-databricks lihaoyi-databricks Apr 17, 2021

Should this be annotated : @switch to ensure the tableswitch-based compilation isn't accidentally broken?

Ditto for the visitUnaryOp pattern match above

Collaborator Author

AFAIR @switch is broken. But even if it works, it also accepts a lookupswitch, so it's not much help.

def cast[T: ClassTag: PrettyNamed] =
if (implicitly[ClassTag[T]].runtimeClass.isInstance(this)) this.asInstanceOf[T]
else throw new Error.Delegate(
"Expected " + implicitly[PrettyNamed[T]].s + ", found " + prettyName
)
def pos: Position

private[this] def failAs(err: String): Nothing =
Contributor

Should we make this take an implicit T: PrettyNamed for the error message, rather than passing in the err string manually? That would help ensure that we're consistent with our names for the various Val.* classes


parse("1 + 2 * 3") ==>
-BinaryOp(pos(2), Num(pos(0), 1), BinaryOp.`+`, BinaryOp(pos(6), Num(pos(4), 2), BinaryOp.`*`, Num(pos(8), 3)))
+BinaryOp(pos(2), Num(pos(0), 1), BinaryOp.OP_+, BinaryOp(pos(6), Num(pos(4), 2), BinaryOp.OP_*, Num(pos(8), 3)))
}
test("duplicateFields") {
parseErr("{ a: 1, a: 2 }") ==> """Expected no duplicate field: a:1:14, found "}""""
Contributor

@lihaoyi-databricks lihaoyi-databricks Apr 17, 2021

Can we add a few tests to ParserTests to validate the construction of static Val.Arrs and Val.Objs? A lot of the new changes are semantically indistinguishable from the status quo, and thus wouldn't get validated through the normal course of compiling jsonnet. We're also starting to have edge cases that are worth validating in tests, e.g. nested static arrays, nested static objects, alternating nested static arrays and static objects, static arrays containing a mix of static primitives and other static arrays, etc.

@@ -33,20 +33,24 @@ class Evaluator(parseCache: collection.mutable.HashMap[(Path, String), fastparse
def visitExpr(expr: Expr)
(implicit scope: ValScope): Val = try {
expr match {
case Id(pos, value) => visitId(pos, value)
Contributor

I assume you re-ordered these to try and speed up lookup for the most common cases. How did you find the order? e.g. was it just a guesstimate, did you instrument it to see which ones are most common, or something else?

Collaborator Author

I gathered statistics from our universe benchmark. It may not be representative for all use cases but the current version doesn't seem to be ordered for performance anyway.

Error.fail("Too many args, function has " + params.names.length + " parameter(s)", outerPos)
}
arrI
} else if(params.indices.length < argsSize) {
Contributor

@lihaoyi-databricks lihaoyi-databricks Apr 17, 2021

Would moving this case to the top of the if-else chain allow us to avoid the try-catch above? That would also allow us to avoid duplicating the Error.fail call. AFAICT the most common code path is the last else, so re-ordering the first two cases shouldn't slow things down too much

@lihaoyi-databricks
Contributor

Looks good, left some comments

@szeiger
Collaborator Author

szeiger commented Apr 17, 2021

Some of those might already be obsolete with the further changes I made. Current benchmark time is 483ms. I'm still exploring a few options for improving it.

@szeiger
Collaborator Author

szeiger commented Jun 15, 2021

Updated with the latest changes. I'm running out of ideas and haven't made progress in a while. We should get this version merged. I tested it against universe. It will require 2 changes there because we relied on incorrect behavior of the old Sjsonnet release.

@szeiger
Collaborator Author

szeiger commented Jun 15, 2021

0.4.0 stands at 751 ms in our benchmark. Here's the progress over the course of this PR:

Remove Applyer & optimize function application:

[info] Benchmark           Mode  Cnt    Score   Error  Units
[info] MainBenchmark.main  avgt  160  738.439 ± 6.316  ms/op

Scope-free function application:

[info] Benchmark           Mode  Cnt    Score   Error  Units
[info] MainBenchmark.main  avgt  160  690.672 ± 4.851  ms/op

Optimize Std:

[info] Benchmark           Mode  Cnt    Score   Error  Units
[info] MainBenchmark.main  avgt  160  673.334 ± 5.585  ms/op

Optimize builtinWithDefaults:

[info] Benchmark           Mode  Cnt    Score   Error  Units
[info] MainBenchmark.main  avgt  160  663.788 ± 2.486  ms/op

Optimize Builtin args handling:

[info] Benchmark           Mode  Cnt    Score   Error  Units
[info] MainBenchmark.main  avgt  160  642.562 ± 3.537  ms/op

Shared Val.Strict + Array literals:

[info] Benchmark           Mode  Cnt    Score   Error  Units
[info] MainBenchmark.main  avgt  160  626.566 ± 3.125  ms/op

Static object optimization + Obj.Member without lambdas:

[info] Benchmark           Mode  Cnt    Score   Error  Units
[info] MainBenchmark.main  avgt  160  616.049 ± 4.625  ms/op

Reorder operations by frequency:

[info] Benchmark           Mode  Cnt    Score   Error  Units
[info] MainBenchmark.main  avgt  160  595.176 ± 3.789  ms/op

tableswitch-based operator lookup:

[info] Benchmark           Mode  Cnt    Score   Error  Units
[info] MainBenchmark.main  avgt  160  582.002 ± 5.800  ms/op

Static bindings of Std calls:

[info] Benchmark           Mode  Cnt    Score   Error  Units
[info] MainBenchmark.main  avgt  160  501.764 ± 3.263  ms/op

More static optimization:

[info] Benchmark           Mode  Cnt    Score   Error  Units
[info] MainBenchmark.main  avgt  160  494.850 ± 2.327  ms/op

Fix import caching:

[info] Benchmark           Mode  Cnt    Score   Error  Units
[info] MainBenchmark.main  avgt  160  483.840 ± 3.675  ms/op

Conventional scope handling:

[info] Benchmark           Mode  Cnt    Score   Error  Units
[info] MainBenchmark.main  avgt  160  480.387 ± 2.640  ms/op

Optimize scope handling:

[info] Benchmark           Mode  Cnt    Score   Error  Units
[info] MainBenchmark.main  avgt  160  469.976 ± 2.968  ms/op

Val extends Lazy:

[info] Benchmark           Mode  Cnt    Score   Error  Units
[info] MainBenchmark.main  avgt  160  464.764 ± 1.853  ms/op

Optimize Materializer + Renderer:

[info] Benchmark           Mode  Cnt    Score   Error  Units
[info] MainBenchmark.main  avgt  160  431.154 ± 1.928  ms/op

ApplyN:

[info] Benchmark           Mode  Cnt    Score   Error  Units
[info] MainBenchmark.main  avgt  160  408.612 ± 3.715  ms/op

Add special-case Exprs:

[info] Benchmark           Mode  Cnt    Score   Error  Units
[info] MainBenchmark.main  avgt  160  401.357 ± 1.430  ms/op

Stdlib specialization:

[info] Benchmark           Mode  Cnt    Score   Error  Units
[info] MainBenchmark.main  avgt  160  391.974 ± 2.038  ms/op

% specialization:

[info] Benchmark           Mode  Cnt    Score   Error  Units
[info] MainBenchmark.main  avgt  160  378.203 ± 2.186  ms/op

Static apply + length optimization:

[info] Benchmark           Mode  Cnt    Score   Error  Units
[info] MainBenchmark.main  avgt  160  363.323 ± 2.544  ms/op

MaterializeRenderer:

[info] Benchmark           Mode  Cnt    Score   Error  Units
[info] MainBenchmark.main  avgt  160  337.980 ± 1.187  ms/op

Filter strictness optimization:

[info] Benchmark           Mode  Cnt    Score   Error  Units
[info] MainBenchmark.main  avgt  160  332.285 ± 1.596  ms/op

[info] Benchmark           Mode  Cnt    Score   Error  Units
[info] MainBenchmark.main  avgt  160  331.977 ± 1.212  ms/op

Builtin with defaults & std.setInter:

[info] Benchmark           Mode  Cnt    Score   Error  Units
[info] MainBenchmark.main  avgt  160  322.575 ± 0.885  ms/op

length(filter) specialization:

[info] Benchmark           Mode  Cnt    Score   Error  Units
[info] MainBenchmark.main  avgt  160  314.919 ± 1.103  ms/op

Reduce phase mismatches in scoped transform:

[info] Benchmark           Mode  Cnt    Score   Error  Units
[info] MainBenchmark.main  avgt  160  310.339 ± 1.590  ms/op

@@ -6,7 +6,7 @@ cancelable in Global := true

lazy val main = (project in file("sjsonnet"))
.settings(
-scalacOptions in Compile ++= Seq("-opt:l:inline", "-opt-inline-from:sjsonnet.**"),
+scalacOptions in Compile ++= Seq("-opt:l:inline", "-opt-inline-from:sjsonnet.*,sjsonnet.**"),
Collaborator Author

The -opt-inline-from syntax is confusing. sjsonnet.** only inlines from subpackages (which we don't have). We need to add sjsonnet.* to inline from the sjsonnet package, too. Ultimately it doesn't make a big difference. HotSpot has become so good that it doesn't matter in most cases but the optimizations are not entirely deterministic. Sometimes you end up with a benchmark run that is 10% slower than it should be. Letting scalac do the trivial inlining (and thus creating less work for HotSpot) makes this more reliable.

@lihaoyi-databricks
Contributor

lihaoyi-databricks commented Jun 22, 2021

Looks good, few things before we merge it:

  • Update the PR description to have your latest benchmarks (in your later comment), and try to write up a summary of all the disparate changes you did. e.g. you introduced an optimizer, what you optimize, what changes to the core datatypes are, what common themes like specialization/unboxing/fastpaths occur throughout the PR

  • Flesh out each of the commits with a paragraph describing what you are doing and why. That will help make the blame useful when we merge/rebase it without squashing. At least 2-3 sentences, but many commits deserve more since clearly a lot of thought/work went into them

  • Update the readme.md Architecture section: we have a new phase (static optimizations) and the data types are different (now Val and Expr are related, and Lazy and Strict now exist)

  • Add some docs about how to run the benchmarks and profiler; both on proprietary code (for people at databricks) and how to point it at other codebases (for anyone not at databricks)

Other than that, I have reasonable confidence in the test suite and our ZZZ golden files to catch any bugs before they slip through. The main concern now is to make sure that the knowledge you earned from the hours you spent on this, i.e. why you made each of these commits, is preserved in the codebase/git for future maintainers to pick up and continue. Non-obvious things in particular, like avoiding megamorphic code in hot paths, removing boxing, and fast paths for common usage patterns, would be much harder for someone to reverse-engineer later than to simply read why you did something

szeiger added 15 commits June 22, 2021 19:20
This hides the internal array and provides all the necessary methods to access it efficiently. There are experiments in https://github.com/szeiger/sjsonnet/tree/wip/perf-arropt that didn't improve performance.
Applyer required an extra object and indirection for every call of a higher-order function. We can do without the convenience and use Val.Func directly, passing the EvalScope and FileScope directly.
This introduces the Val.Builtin types for built-in functions of various arities, plus arity-specific apply methods for all functions. The arity-specific methods and types allow us to avoid array allocation for parameters. The Builtin types further avoid allocating a new ValScope when calling a built-in function. This commit starts the refactoring of standard library functions to the new Builtin types, which is an ongoing effort. Normal Val.Func functions are still supported, but they do not benefit from the same optimizations.
Refactor more standard library functions to use `Val.Builtin` and introduce convenience methods in `Val` for type coercions. These take the place of the `ReadWriter` implicits (which are still used in some places). The old way of implementing built-in functions as `Val.Func` via the `builtin` methods handled type coercions automatically (with `ReadWriter`), but the new style of manually implementing `Val.Builtin` is much easier with the convenience methods.
Some performance optimizations for `builtinWithDefaults` to avoid complex collection operations when calling such a function.
This starts the refactoring of built-in functions into objects (instead of anonymous classes) and removes the unnecessary FileScope.
Avoid creating many individual subclasses of `Lazy` for representing strict values (which have already been evaluated or are safe to evaluate immediately because we know they will be evaluated anyway). The new shared `Strict` class assigns the cached value in the constructor (in addition to returning it in `compute`). This looks unnecessary from a functional point of view, but it is important for performance as it avoids the megamorphic `compute` call the first time it is forced. `force` itself is a short monomorphic method that can easily be inlined by HotSpot.
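The effect of pre-filling the cache can be illustrated with a minimal sketch. The classes below are simplified, hypothetical stand-ins for `Lazy` and `Strict` (the field name `cached`, the `Deferred` class and the `StrictDemo` object are assumptions made for this example, not the project's actual code):

```scala
// Simplified, hypothetical stand-ins for Sjsonnet's Lazy and Strict.
abstract class Lazy {
  // Cached result; null means "not computed yet".
  protected var cached: AnyRef = null

  // Subclasses provide the actual computation (many implementations => megamorphic call).
  def compute(): AnyRef

  // Short, final method: returns the cached value, computing it at most once.
  final def force: AnyRef = {
    if (cached == null) cached = compute()
    cached
  }
}

// One shared class for all values that are already evaluated.
final class Strict(value: AnyRef) extends Lazy {
  cached = value                 // pre-fill the cache so force never has to call compute()
  def compute(): AnyRef = value  // still returns the right value if it is ever called
}

// A deferred value that counts how often it is actually computed.
final class Deferred(thunk: () => AnyRef) extends Lazy {
  var computeCalls = 0
  def compute(): AnyRef = { computeCalls += 1; thunk() }
}

object StrictDemo {
  def run(): (AnyRef, AnyRef, Int) = {
    val strict = new Strict("already evaluated")
    val deferred = new Deferred(() => "computed on demand")
    val first = deferred.force   // triggers compute()
    val second = deferred.force  // served from the cache
    (strict.force, second, deferred.computeCalls)
  }
}
```

In this sketch `force` on a `Strict` only ever takes the cached branch, which is the property the commit message relies on.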
Arrays can now be literals. Any array expression encountered by the parser that contains only literals is itself turned into a literal, i.e. an instance of `Val.Arr` rather than `Expr.Arr`.
We already introduced the concept of static objects (created from object literals, containing only members with static names and literal values) in the previous round of optimizations. This can be used for faster key and value lookup.
`Val.Obj.Member` is now an abstract class with an abstract `invoke` method instead of taking a function argument for the member implementation. This avoids the extra object allocation per definition and the extra indirection per call site. When a member returns a statically known value (which is always the case in a static object) we use the special `ConstMember` class which serves the same purpose as `Val.Strict` for `Val.Lazy`.
The main benchmark is still over 600ms at this point. The new parser and materializer benchmarks tell us how much of this time is spent outside the evaluator (a bit over 30ms for the parser and 60ms for the materializer).
This is based on statistics gathered from our benchmark. The class-based lookups in `visitExpr` get compiled to a `lookupswitch` which is generally pretty fast, but with linear time based on the position in the method.
Unary and binary operators already used the same expression types, with the `Op` objects only being used as markers. They can be easily replaced by `Int` literals, thus allowing `visitUnaryOp` and `visitBinaryOp` to be compiled to a `tableswitch` at the outer layer, which makes all operator lookups equally fast.
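As a rough illustration of dispatching on `Int` operator codes, here is a minimal sketch. The constant names (`OP_ADD` and so on) and the `OpDispatch` object are hypothetical and chosen for this example only. Dense constant `Int` cases are what allow the compiler to emit a `tableswitch`; as noted in the review thread, `@switch` would also accept a `lookupswitch`, so the annotation documents intent rather than guaranteeing the instruction.

```scala
import scala.annotation.switch

object OpDispatch {
  // Hypothetical operator codes: dense constant Ints (final val without a type
  // annotation makes them compile-time constants usable in a switch).
  final val OP_ADD = 0
  final val OP_SUB = 1
  final val OP_MUL = 2
  final val OP_DIV = 3

  // Matching on a small dense range of Int constants can be compiled to a jump table,
  // so every operator is reached in the same number of steps.
  def eval(op: Int, l: Double, r: Double): Double = (op: @switch) match {
    case OP_ADD => l + r
    case OP_SUB => l - r
    case OP_MUL => l * r
    case OP_DIV => l / r
    case _      => throw new IllegalArgumentException(s"Unknown binary operation: $op")
  }
}
```

For example, `OpDispatch.eval(OpDispatch.OP_MUL, 2.0, 3.0)` evaluates the multiplication branch.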
This commit introduces the static optimizer. The base class `ExprTransform` implements an AST transformer which can recursively transform and rebase an `Expr`.

`StaticOptimizer` adds a scoped transform (which keeps track of the names that are in scope for each `Expr`) and implements the first optimization: Calls of the form `std.x(...)`, where `std` is the standard library (i.e. the name `std` has not been shadowed by a local definition) and `x` is a valid method name in the standard library, are replaced by one of the new `Expr.ApplyBuiltin` expression types (for various arities). This allows us to skip looking up `std` and `x` again during evaluation.
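The rewrite described above can be sketched on a toy AST. Everything in this block (the node types, `stdNames`, and the `optimize` function) is a hypothetical simplification for illustration and does not mirror the real `Expr` hierarchy or `StaticOptimizer`:

```scala
object StaticBindingSketch {
  // Toy AST, much smaller than the real Expr hierarchy.
  sealed trait Expr
  final case class Id(name: String) extends Expr
  final case class Num(value: Double) extends Expr
  final case class Select(target: Expr, field: String) extends Expr
  final case class Apply(fn: Expr, args: List[Expr]) extends Expr
  final case class Local(name: String, value: Expr, body: Expr) extends Expr
  // Result of the optimization: the built-in is already resolved by name.
  final case class ApplyBuiltin(builtin: String, args: List[Expr]) extends Expr

  // Hypothetical subset of standard library method names.
  val stdNames: Set[String] = Set("length", "filter", "format")

  // `shadowed` tracks the names bound by enclosing `local` expressions.
  def optimize(e: Expr, shadowed: Set[String] = Set.empty): Expr = e match {
    // std.x(...) where `std` is not shadowed and `x` is a known built-in.
    case Apply(Select(Id("std"), name), args) if !shadowed("std") && stdNames(name) =>
      ApplyBuiltin(name, args.map(optimize(_, shadowed)))
    case Apply(fn, args) =>
      Apply(optimize(fn, shadowed), args.map(optimize(_, shadowed)))
    case Select(target, field) =>
      Select(optimize(target, shadowed), field)
    case Local(name, value, body) =>
      // Bindings may be recursive, so the value also sees the new name.
      val inner = shadowed + name
      Local(name, optimize(value, inner), optimize(body, inner))
    case other => other
  }
}
```

In this sketch a call such as `std.length(xs)` becomes `ApplyBuiltin("length", ...)`, while the same call under `local std = 1` is left untouched because `std` no longer refers to the standard library.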
- Add a new benchmark for the optimizer (~2-3ms in our normal benchmark run)
- Introduce the `Resolver` abstraction for resolving imports
- Refactor the scoped transformations from `StaticOptimizer` into a new superclass `ScopedExprTransform`
szeiger added 24 commits June 22, 2021 21:44
We are adding another new step in the optimization of function application expressions. Any `Builtin` now gets the opportunity to rewrite the call site. This is particularly useful for partially applying literal arguments during optimization. For example, the `from` argument of `std.replaceAll` has to be parsed as a regular expression every time the function is called. When it is statically known we now generate a call to a specialized version that performs the parsing only once during optimization.
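A minimal sketch of this kind of partial application, assuming a literal (quoted) search string and using only `java.util.regex`. The names `SpecializeSketch`, `ReplaceAllSpecialized` and the `specialize` hook are hypothetical and do not reflect the actual `Builtin.specialize` signature:

```scala
import java.util.regex.{Matcher, Pattern}

object SpecializeSketch {
  // Generic version: quotes and compiles the pattern on every single call.
  def replaceAll(str: String, from: String, to: String): String =
    Pattern.compile(Pattern.quote(from))
      .matcher(str)
      .replaceAll(Matcher.quoteReplacement(to))

  // Specialized version: created once, when `from` is statically known.
  final class ReplaceAllSpecialized(from: String) {
    private val pattern: Pattern = Pattern.compile(Pattern.quote(from))

    // Only the remaining (dynamic) arguments are passed at call time.
    def apply(str: String, to: String): String =
      pattern.matcher(str).replaceAll(Matcher.quoteReplacement(to))
  }

  // Hypothetical call-site hook: specialize only if the argument is a known literal.
  def specialize(literalFrom: Option[String]): Option[ReplaceAllSpecialized] =
    literalFrom.map(new ReplaceAllSpecialized(_))
}
```

Both versions return the same result; the specialized one simply moves the pattern compilation from every call to a single step at optimization time.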
This is similar to what jfr does for the JVM, but based on the Sjsonnet AST.
The `%` operator can benefit greatly from partial application when the lhs is a string literal, so we add an optimizer rule for this. This is not useful for any other operators. We do not need a generic mechanism like we have for `Builtin` functions.
- As another step in function application optimization we now try to statically apply a `Builtin` function when all arguments are literals.
- Some micro-optimizations for `std.length`.
- Don't parse JSON to an intermediate ujson AST first in `std.parseJson`. The new `ValVisitor` can parse directly to an Sjsonnet `Val`.
- Similarly the new `MaterializeJsonRenderer` used by `std.manifestJson` renders the output without an intermediate ujson AST.
We can avoid allocating a new `ValScope` for each predicate call in `std.filter`. Normally every definition has to copy the existing scope instead of updating a shared array because any value could be read at an arbitrary later time due to lazy evaluation. But this is not the case when a function returns a primitive value. In particular, in `std.filter` we are repeatedly calling the same predicate function and we only check if it returned `Val.True`. The value is not stored anywhere for later use. This makes it safe to reuse a single `ValScope` with a single bindings array for all calls.
Built-in functions with defaults were still treated as `Val.Func`. With the old scope handling we would have needed a more complex implementation to also handle default values but this is no longer a problem. We simply have to pass them on to the superclass.

Now that we can turn `std.setInter` into a `Builtin` we can optimize it further by partially applying a static argument.
Calls of the form `std.length(std.filter(...), ...)` can be optimized to skip creation of the filtered array. We can simply count as we go along. It is not clear at this time if the Jsonnet specification allows us to go further and skip evaluation entirely in cases like `std.length(std.filter(...), ...) == 0` so we still evaluate everything.
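The idea of counting instead of building the filtered array can be sketched as follows. This is a generic illustration on plain Scala collections with hypothetical names (`LengthFilterSketch`, `naive`, `fused`), not the Sjsonnet implementation:

```scala
object LengthFilterSketch {
  // Unoptimized shape: allocates the filtered collection only to take its length.
  def naive[A](pred: A => Boolean, xs: IndexedSeq[A]): Int =
    xs.filter(pred).length

  // Fused shape: evaluates the predicate for every element (no early exit)
  // but only keeps a counter, so no intermediate collection is allocated.
  def fused[A](pred: A => Boolean, xs: IndexedSeq[A]): Int = {
    var count = 0
    var i = 0
    while (i < xs.length) {
      if (pred(xs(i))) count += 1
      i += 1
    }
    count
  }
}
```

Note that the fused version still calls the predicate on every element, in line with the decision above not to skip evaluation.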
Some micro-optimizations
When looking up definitions in the static scope of a `ScopedExprTransform` it is not always possible to see the value at the current phase (i.e. after `StaticOptimizer` vs after `Parser` at the moment; we do not have more phases yet). This is a general problem in any language that allows recursive references in definitions.

Previously we used the simplest possible implementation in `ScopedExprTransform`: All definitions that are made together (e.g. in a single `local` expression) are stored in the scope at the same time (using their value after the previous phase), and then they are evaluated to provide an improved scope in which the body of the `local` can see them with the updated values after the current phase. This prevents full optimization in cases like this:

```
local a = 1,
      b = a,
      c = b;
c
```

When the rhs of `c` is optimized, the scope still contains `b = a` even though `a` has already been inlined (`b = 1`).

With this PR we do the next better thing: Update the scope incrementally to allow back-references to see the current phase.

Note that `a`, `b` and `c` are all allowed to refer to each other and they may be defined in an arbitrary order. In these cases we can still miss some optimizations, but supporting forward references would require lazy evaluation of scopes and recursion detection. In practice most references that benefit from these optimizations are expected to go backwards and we want to keep it simple.
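The difference between the two strategies can be sketched on a toy model that only knows numbers and identifiers. All names here (`IncrementalScopeSketch`, `inline`, `optimizeAllAtOnce`, `optimizeIncrementally`) are hypothetical and greatly simplified compared to `ScopedExprTransform`:

```scala
object IncrementalScopeSketch {
  sealed trait Expr
  final case class Num(value: Double) extends Expr
  final case class Id(name: String) extends Expr

  // Inline an identifier if the scope currently binds it to a literal.
  def inline(e: Expr, scope: Map[String, Expr]): Expr = e match {
    case Id(name) =>
      scope.get(name) match {
        case Some(n: Num) => n
        case _            => e
      }
    case other => other
  }

  // Previous approach: every rhs only sees the bindings as they were before this phase.
  def optimizeAllAtOnce(binds: List[(String, Expr)]): List[(String, Expr)] = {
    val scope = binds.toMap
    binds.map { case (name, rhs) => name -> inline(rhs, scope) }
  }

  // Incremental approach: each optimized binding replaces its old value in the scope
  // before the next binding is processed, so back-references see the current phase.
  def optimizeIncrementally(binds: List[(String, Expr)]): List[(String, Expr)] = {
    val init = (binds.toMap, Vector.empty[(String, Expr)])
    val (_, out) = binds.foldLeft(init) { case ((scope, acc), (name, rhs)) =>
      val optimized = inline(rhs, scope)
      (scope.updated(name, optimized), acc :+ (name -> optimized))
    }
    out.toList
  }
}
```

For the bindings `a = 1, b = a, c = b`, the all-at-once variant leaves `c = b`, while the incremental variant reduces all three to `1`. If the bindings are listed in the opposite order, the incremental variant still misses the forward references, which matches the limitation described above.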
This avoids creating `Lazy` objects in some cases when we already know that evaluation will be strict.
Some refactoring to simplify and generalize these optimizations.
We have to do them before the optimizer because they are based on syntax.
PrettyYamlRenderer relies on subVisitor calls for individual elements (and no subVisitor call for an empty array)
The current behavior (after the scope handling overhaul) is correct (matching the specification and Google Jsonnet) but the error message was misleading. Sjsonnet 0.4 did not detect the illegal call at all.
`extVar` depends on external variables which are part of a specific evaluation. We have to ensure that they do not end up in the shared parse cache.
The result is puzzling but it matches the specification and Google Jsonnet. Returning a nullary `function() true` causes it to be evaluated during materialization to `true` but comparing it explicitly to `true` must yield `false` because there is no implicit evaluation in this case. This was previously broken in Sjsonnet.
They are only needed for tests and benchmarks. All production code uses the new implementations.
Oops, replaceAll accidentally did the work twice.
@szeiger
Collaborator Author

szeiger commented Jun 25, 2021

Updated with additional docs.

@szeiger szeiger merged commit d013527 into databricks:master Jul 1, 2021
stephenamar-db pushed a commit that referenced this pull request Dec 12, 2024
…236)

This PR optimizes `lstripChars` / `rstripChars` / `stripChars` built-in
functions by using the specialization framework from #119 /
0bd255a to pre-compile and re-use
`Pattern` instances when the replacement / strip arguments are
constants.

In Java, `String.replaceAll()` compiles and uses a `Pattern` under the
hood and this is relatively expensive; specialization lets us save this
cost when the `chars`-to-be-stripped are constant.

## Testing


**Correctness**: ran all tests with a manual change to disable static
function application optimizations (a prerequisite required to achieve
full test coverage, as the static application prevents specialization
from kicking in during most tests). In a followup, I think we should
explore adding flags for optimizer features and running all tests
with/without optimization.

**Performance**: ran benchmarks on a large file and with this change I
save ~11% of allocated bytes and ~8% of wall time. I ran that same file
through `RunProfiler` and saw a large speedup in `stripChars`, costing
~619ns/call before and ~154ns/call after.