[ML] Encode distribution model weight style by offset in a fixed size weight array #54
Conversation
droberts195 left a comment
I left a couple of minor comments.
It would also be nice to add a comment to the PR saying which dataset and config you saw the speedup on, and timings before and after the change.
    void propagateLastThreadAssert() {
        if (m_LastException) {
    -       throw *m_LastException;
    +       throw * m_LastException;
Why did this change?
I think this must be something about the version of clang-format I'm using. (Although I thought I was using the same version as Ed.)
I used clang-format 5.0.1 (for the record)
I've also got 5.0.1, having downloaded the pre-built binaries from http://releases.llvm.org/download.html
On running dev-tools/clang-format.sh on this PR branch it reverted this change and also made one other change in CMultivariateNormalConjugate.h.
It seems that we're going to have to mandate an exact version of clang-format for each branch rather than just the major version before we can start failing builds due to formatting.
OK, I'm using 5.0.0, so I think this must be the reason. I'll correct the formatting here.
            return {x1, x2};
        } catch (const std::exception& e) {
    -       LOG_ERROR(<< "Failed to compute confidence interval: " << e.what());
    +       LOG_ERROR("Failed to compute confidence interval: " << e.what());
I think we should stick to the pattern of beginning all log statements with << even though it's not necessary when the first item is a string literal. Otherwise the rules about when a leading << is required will be very complicated.
Thanks, this was a merge error.
            double shift = (r - t) / 2.0;
            logSamplesMoments.add(std::log(x) - shift, n / scale);
        }
    -   } catch (const std::exception& e) {
There is some quite deeply nested and complex logic in the block above. Are you certain none of it uses Boost or Eigen functionality that might throw an exception? If it does then such an exception will now be fatal.
Yes, this is now safe to remove. The nested code is basic statistics, Gauss-Legendre integration, and the local class CLogSampleSquareDeviation, none of which can throw.
I actually thought a bit about whether or not to remove this try/catch. (For example, there are other try/catch blocks I could down-scope as a result of this change, which I haven't touched.) In the end I decided that, since I could remove it altogether, I would go ahead and make the change on understandability grounds.
The original profile which showed this up was a large population analysis attached to issue #53. The probability calculation cache change significantly altered the breakdown of runtime for population analysis, so this is no longer as useful a test case. By profiling a custom standalone executable I built, I verified that this change eliminates the contribution of weight lookup to end-of-bucket processing. I was planning to extract the delta in runtime on the full QA regression suite when this is committed and update #53. The problem is that we need cases where the runtime bottleneck is end-of-bucket processing in the autodetect process, which isn't always the case. The clearest results will be for high cardinality partition analyses running the autodetect process standalone. I'll add runtimes before and after this change for these cases to issue #53.
droberts195 left a comment
LGTM
This should have been done in #54 but slipped through the net as we compile out trace logging in optimised builds.
When this is backported to 6.4 please also backport 1a750aa.
Profiling anomaly detection on a large population (cardinality 2m) showed that accessing weights, which communicate things like sample importance, seasonal heteroskedasticity, and so on, consumes about 8.5% of the total end-of-bucket processing time. We have a small number of possible weights, so switching to encoding the weight style by offset in a fixed-size weights array avoids nearly all of this overhead. I get a concomitant performance improvement on highly partitioned analyses, where end-of-bucket processing is the bottleneck. This should have no impact on any results. A step towards #53.