Benchmark the production code rather than some arbitrary thing #5200
Conversation
Not running the benchmarks yet, because the machine is busy apparently.
Branch updated from 23270c8 to d3b87a3.
/benchmark plutus-benchmark:validation

Comparing benchmark results of 'plutus-benchmark:validation' on '9611fd59f' (base) and 'd3b87a355' (PR). Results table
/benchmark plutus-benchmark:validation

Comparing benchmark results of 'plutus-benchmark:validation' on '9611fd59f' (base) and 'd3b87a355' (PR). Results table
OK, in addition to the issues referenced in the description of the PR, we have more problems:
So why the 4-5% difference? I've checked the Core. This is what we have on the base branch:

-- RHS size: {terms: 7, types: 6, coercions: 63, joins: 0/0}
unsafeEvaluateCekNoEmit'2
:: forall {s}.
State# s
-> (# State# s,
ExBudgetInfo RestrictingSt DefaultUni DefaultFun s #)
unsafeEvaluateCekNoEmit'2
= \ (@s_adFA) (eta_adFB :: State# s_adFA) ->
$wrestricting
(unsafeEvaluateCekNoEmit'3 `cast` <Co:63>)
9223372036854775807#
9223372036854775807#
eta_adFB
-- RHS size: {terms: 26, types: 114, coercions: 98, joins: 0/0}
unsafeEvaluateCekNoEmit'
:: Term NamedDeBruijn DefaultUni DefaultFun ()
-> EvaluationResult (Term NamedDeBruijn DefaultUni DefaultFun ())
unsafeEvaluateCekNoEmit'
= \ (x_abKi :: Term NamedDeBruijn DefaultUni DefaultFun ()) ->
case runCekDeBruijn
(unsafeEvaluateCekNoEmit'3 `cast` <Co:63>)
defaultCekParameters1
(unsafeEvaluateCekNoEmit'2 `cast` <Co:16>)
(noEmitter1 `cast` <Co:19>)
x_abKi
of
{ (e_a9Z4, ds_dbMx, ds1_dbMy) ->
case e_a9Z4 of {
Left ds2_abKw ->
case ds2_abKw of { ErrorWithCause evalErr_abKB cause_abKC ->
case evalErr_abKB of {
InternalEvaluationError err_abMs ->
unsafeEvaluateCekNoEmit'1 err_abMs cause_abKC;
UserEvaluationError ds3_adLv -> EvaluationFailure
}
};
Right term1_abMv -> $WEvaluationSuccess term1_abMv
}
}

and this is what we have in this branch:

-- RHS size: {terms: 7, types: 6, coercions: 63, joins: 0/0}
evaluateCekLikeInProd1
:: forall {s}.
State# s
-> (# State# s,
ExBudgetInfo RestrictingSt DefaultUni DefaultFun s #)
evaluateCekLikeInProd1
= \ (@s_ahLF) (eta_ahLG :: State# s_ahLF) ->
$wrestricting
(evaluateCekLikeInProd2 `cast` <Co:63>)
9223372036854775807#
9223372036854775807#
eta_ahLG
-- RHS size: {terms: 16, types: 76, coercions: 99, joins: 0/0}
evaluateCekLikeInProd
:: Term NamedDeBruijn DefaultUni DefaultFun ()
-> Either
(CekEvaluationException NamedDeBruijn DefaultUni DefaultFun)
(Term NamedDeBruijn DefaultUni DefaultFun ())
evaluateCekLikeInProd
= \ (term_acKg :: Term NamedDeBruijn DefaultUni DefaultFun ()) ->
case getEvalCtx of {
Left l_ahIy -> Left l_ahIy;
Right r_ahIA ->
case runCekDeBruijn
(evaluateCekLikeInProd17 `cast` <Co:63>)
(r_ahIA `cast` <Co:1>)
(evaluateCekLikeInProd1 `cast` <Co:16>)
(noEmitter1 `cast` <Co:19>)
term_acKg
of
{ (getRes_adJt, ds_dfRE, ds1_dfRF) ->
getRes_adJt
}
}

I can't pinpoint what makes this evaluator slower than the one we have on the base branch.
Overall, we should just finish off this PR, i.e. document it and test that the evaluator used for benchmarking doesn't perform discharging (like we test for the production evaluator elsewhere).
@@ -16,11 +17,10 @@ import UntypedPlutusCore as UPLC
 `cabal bench -- plutus-benchmark:validation --benchmark-options crowdfunding`.
 -}
 main :: IO ()
-main = benchWith mkCekBM
+main = evaluate getEvalCtx *> benchWith mkCekBM
Does this definitely ensure that this is shared across the runs? It seems like it would be simpler to use https://hackage.haskell.org/package/criterion-1.6.0.0/docs/Criterion-Types.html#v:env
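For reference, criterion's env has the type env :: NFData env => IO env -> (env -> Benchmark) -> Benchmark. A minimal sketch of how it is typically used; the setup value here is a made-up stand-in, not anything from this PR:

```haskell
-- Generic illustration of criterion's `env`; `setup` stands in for whatever
-- shared, costly-to-build input the benchmarks need.
import Criterion.Main (Benchmark, bench, defaultMain, env, nf)

setup :: IO [Int]                -- any costly, NFData-able input
setup = pure [1 .. 1000000]

sharedInputBench :: Benchmark
sharedInputBench =
  env setup $ \xs ->             -- built once, forced to normal form, outside the timed code
    bench "sum" (nf sum xs)

main :: IO ()
main = defaultMain [sharedInputBench]
```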
Ah, or perhaps it wasn't clear: the creation of the evaluation context has non-trivial cost, but is shared by the ledger between script evaluations. So it shouldn't be part of the benchmarked code.
Does this definitely ensure that this is shared across the runs?

getEvalCtx is a CAF and doesn't get inlined (it has a NOINLINE pragma, and I've verified that it's not getting inlined by looking at the Core anyway). I can't imagine it not being reused; why would GHC ever recompute a non-functional monomorphic top-level value?

Maybe it should be evaluate (force getEvalCtx), though.
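Schematically, the claim about getEvalCtx amounts to a top-level definition like this (stand-in types, not the real one):

```haskell
-- Stand-in for getEvalCtx: a top-level, non-function value (a CAF).
-- GHC allocates it once and every use site shares the same heap object;
-- the NOINLINE pragma stops the definition from being duplicated into call sites.
sharedCtx :: Either String [Integer]
sharedCtx = Right (map (* 2) [1 .. 1000000])
{-# NOINLINE sharedCtx #-}

main :: IO ()
main = do
  print (either length length sharedCtx)  -- first use: the list is computed here
  print (either length length sharedCtx)  -- second use: the evaluated value is reused
```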
It seems like it would be simpler to use https://hackage.haskell.org/package/criterion-1.6.0.0/docs/Criterion-Types.html#v:env
Did you read the description? The function seems completely irrelevant.
Lol, yes, of course it should be evaluate (force getEvalCtx), there's that lazy Either.
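A standalone sketch of that WHNF-vs-NF point, with stand-in types (getEvalCtx itself returns an Either whose payload is the evaluation context):

```haskell
-- Illustrative only: shows why `evaluate` alone is not enough for an Either.
import Control.DeepSeq (force)
import Control.Exception (evaluate)

-- Stand-in for getEvalCtx: the payload under the constructor is still a thunk.
sharedValue :: Either String Int
sharedValue = Right (sum [1 .. 1000000 :: Int])

main :: IO ()
main = do
  _ <- evaluate sharedValue          -- WHNF only: forces the Right constructor,
                                     -- the Int inside stays unevaluated
  _ <- evaluate (force sharedValue)  -- NF: the payload is computed here, before
                                     -- any benchmark starts timing
  pure ()
```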
Did you read the description? The function seems completely irrelevant.
Yes, and I've used it before. I thought the point of it was to clearly compute a value (it lets you use IO so you can do evaluate) before the benchmarking begins, and feed it into the individual benchmarks as a function argument. We could even share it across all the benchmarking runs, which would save us some time.

getEvalCtx is a CAF and doesn't get inlined (it has a NOINLINE pragma and I've verified that it's not getting inlined by looking at the Core anyway). I can't imagine it not being reused, why would GHC ever recompute a non-functional monomorphic top-level value?

This seems like far more complex reasoning than using a function argument and env? That guarantees it'll only be computed once.
I thought the point of it was to clearly compute a value (it lets you use IO so you can do evaluate) before the benchmarking begins, and feed it into the individual benchmarks as a function argument.
But it's not what this function is for!
Motivation. In earlier versions of criterion, all benchmark inputs were always created when a program started running. By deferring the creation of an environment until its associated benchmarks need it, we avoid two problems that this strategy caused:
- Memory pressure distorted the results of unrelated benchmarks. If one benchmark needed e.g. a gigabyte-sized input, it would force the garbage collector to do extra work when running some other benchmark that had no use for that input. Since the data created by an environment is only available when it is in scope, it should be garbage collected before other benchmarks are run.
- The time cost of generating all needed inputs could be significant in cases where no inputs (or just a few) were really needed. This occurred often, for instance when just one out of a large suite of benchmarks was run, or when a user would list the collection of benchmarks without running any.
OK, forget about CAF etc. We only care about getEvalCtx being forced before benchmarking starts. evaluate does exactly that. Once it's forced, it can't be unforced, and whatever they do with env isn't more reliable than that.

And getEvalCtx isn't an input to the benchmarks to even consider env. It's a value that the function being benchmarked uses under the hood; you can't feed the value to the function, it's already inside!

If you want to change evaluateCekLikeInProd to accept an EvaluationContext as an input, I don't mind that; in that case it would make more sense to call env, although as I said, whatever lazy magic it does, we don't need or care about it, it's completely irrelevant for our use case, so even in that case it would make sense to force getEvalCtx manually.
If you want to change evaluateCekLikeInProd to accept an EvaluationContext as an input
Yes, sorry, that's what I was suggesting. evaluateCekLikeInProd just calls getEvalCtx and then calls evaluateTerm. Note that that's actually less like the production version, which really does take it as a parameter, and it's cached externally. This way we're relying on some complicated logic that we can force it here and then apparently recompute it inside a function and have it not repeat work. It seems much clearer to just... pass it as an argument!
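Roughly the shape being suggested, with simplified stand-in types so the sketch is self-contained; the real EvaluationContext, evaluateCekLikeInProd, and benchmark driver are of course more involved than this:

```haskell
import Criterion.Main (bench, defaultMain, env, nf)

-- Stand-ins for the real types and functions; only the overall shape matters here.
type EvaluationContext = [Integer]
type Term = Int

-- Stand-in for the real getEvalCtx.
getEvalCtx :: Either String EvaluationContext
getEvalCtx = Right [1 .. 100000]

-- The evaluation context is an ordinary argument, as in the production evaluator,
-- rather than being fetched via getEvalCtx inside the function.
evaluateCekLikeInProd :: EvaluationContext -> Term -> Either String Term
evaluateCekLikeInProd ctx t = Right (t + fromIntegral (sum ctx))

main :: IO ()
main = defaultMain
  [ env (either fail pure getEvalCtx) $ \ctx ->
      -- the context is built and forced once here, outside the timed code,
      -- and shared by every run of the benchmark below
      bench "script" $ nf (evaluateCekLikeInProd ctx) 42
  ]
```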
OK then, sounds reasonable.
Feel like doing it yourself? I'm trying to give away as much non-costing-related work as possible.
Sure.
Thanks a lot!
  :: Either
       (UPLC.CekEvaluationException UPLC.NamedDeBruijn UPLC.DefaultUni UPLC.DefaultFun)
       EvaluationContext
getEvalCtx = do
I think it would be legit to just do mkDynEvaluationContext <version> PLC.defaultCostModelParams, since we're trying to not include this in the benchmark run.

This is also fine, though.
Since we're trying to not include this in the benchmark run.
What "this"?
In any case, I wanted to stay as close to the production evaluator as possible and not do any "reasoning". getEvalCtx is a CAF that is forced before benchmarking starts and is used across all benchmarks, hence I don't see how it would distort the results (assuming I've fixed the forcing to go to NF rather than WHNF, but even the latter can't distort the results in a meaningful manner).
What "this"?
This function goes through the dance of computing the integers-only version of the cost model, and then feeding it back into the function that converts it into the original form. That's good, insofar as it mirrors the production version, except that this happens in the shared "compute the evaluation context" block, and so shouldn't be included in the benchmarks anyway. So it's fine to use the faster version that just creates the evaluation context without jumping through the hoops.
       EvaluationContext
getEvalCtx = do
  costParams <-
    maybe
@bezirg do we not have a function somewhere for going from CostModelParams to the list of integers?
We probably should, even just so we can test that they round-trip.
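A sketch of what such a round-trip test could look like; toIntegerList and fromIntegerList are hypothetical names for the conversions being discussed, and CostModelParams is simplified to String keys here to keep the example self-contained:

```haskell
import qualified Data.Map.Strict as Map
import Test.QuickCheck (Property, quickCheck, (===))

-- Simplified stand-in: the real CostModelParams is (roughly) a map from
-- textual parameter names to Integers.
type CostModelParams = Map.Map String Integer

-- Hypothetical: flatten the parameters in a fixed (key-sorted) order.
toIntegerList :: CostModelParams -> [Integer]
toIntegerList = Map.elems

-- Hypothetical: rebuild the map from the known parameter names and the flat list.
fromIntegerList :: [String] -> [Integer] -> CostModelParams
fromIntegerList names ints = Map.fromList (zip names ints)

-- Round-trip property: flattening and rebuilding gives back the original parameters.
prop_roundTrip :: [(String, Integer)] -> Property
prop_roundTrip kvs =
  let params = Map.fromList kvs
  in fromIntegerList (Map.keys params) (toIntegerList params) === params

main :: IO ()
main = quickCheck prop_roundTrip
```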
/benchmark plutus-benchmark:validation

Comparing benchmark results of 'plutus-benchmark:validation' on '4497b2f7d' (base) and '098aed22f' (PR). Results table
Hm, it's +6.39% now. Let's try again.
/benchmark plutus-benchmark:validation

Comparing benchmark results of 'plutus-benchmark:validation' on '4497b2f7d' (base) and '098aed22f' (PR). Results table
+5.07%. That's a pretty big gap in benchmarking results TBH.
Comments addressed, @michaelpj merge if you're OK with what's in here.
I'm happy modulo the thing about how we ensure that the evaluation context is shared. I just think we could establish it more simply.
The only difference I can see is that:

I don't know if this is the culprit of the 5%. It shouldn't account for such a big difference.
It's basically zero-cost:

newtype EvaluationContext = EvaluationContext
  { machineParameters :: DefaultMachineParameters
  }

toMachineParameters :: ProtocolVersion -> EvaluationContext -> DefaultMachineParameters
toMachineParameters _ = machineParameters

Now we have a different CAF:
…sectMBO#5200)

* Benchmark the production code rather than some arbitrary thing
* Apply suggestions from code review
Co-authored-by: Michael Peyton Jones <michael.peyton-jones@iohk.io>
* Address comments
---------
Co-authored-by: Michael Peyton Jones <michael.peyton-jones@iohk.io>
Hopefully fixes the issue described here