change to exhaustive search for training data #184
Conversation
Please add labels to your PR so that we can generate release notes automatically. See c062263
```java
private Optional<double[][]> batchShingle(List<double[]> points, int shingleSize) {
    return Optional
        .ofNullable(points)
        .filter(p -> p.size() >= shingleSize)
```
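The diff cuts off after the `filter`; a minimal sketch of how the shingling might complete (the `map` step below is an assumption for illustration, not the PR's exact code):

```java
import java.util.List;
import java.util.Optional;

// Assumed completion: slide a window of shingleSize consecutive points
// and flatten each window into one row of the output matrix.
private Optional<double[][]> batchShingle(List<double[]> points, int shingleSize) {
    return Optional
        .ofNullable(points)
        .filter(p -> p.size() >= shingleSize)
        .map(p -> {
            int dim = p.get(0).length;
            double[][] shingles = new double[p.size() - shingleSize + 1][shingleSize * dim];
            for (int i = 0; i < shingles.length; i++) {
                for (int j = 0; j < shingleSize; j++) {
                    System.arraycopy(p.get(i + j), 0, shingles[i], j * dim, dim);
                }
            }
            return shingles;
        });
}
```

The `filter` is what the discussion below is about: a batch shorter than `shingleSize` yields no shingles at all rather than interpolated ones.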
Add some interpolation if fewer than 25% of the points are missing?
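For concreteness, the kind of limited interpolation being suggested might look like the sketch below (not part of this PR; the 25% cutoff, the linear fill, and all names are assumptions drawn from this comment):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

// Sketch only: linearly fill interior gaps, but only when fewer than
// 25% of the points are missing; otherwise refuse to fabricate data.
final class GapInterpolator {

    static Optional<List<double[]>> fillSmallGaps(List<Optional<double[]>> points) {
        long missing = points.stream().filter(p -> !p.isPresent()).count();
        // Give up when a quarter or more of the points are holes.
        if (missing * 4 >= points.size()) {
            return Optional.empty();
        }
        List<double[]> filled = new ArrayList<>(points.size());
        for (int i = 0; i < points.size(); i++) {
            if (points.get(i).isPresent()) {
                filled.add(points.get(i).get());
                continue;
            }
            int prev = i - 1;
            int next = i + 1;
            while (next < points.size() && !points.get(next).isPresent()) {
                next++;
            }
            // A hole at either edge has only one neighbor, so a linear
            // fill has nothing to anchor against; reject rather than extrapolate.
            if (prev < 0 || next >= points.size()) {
                return Optional.empty();
            }
            double[] lo = filled.get(prev);
            double[] hi = points.get(next).get();
            double w = (double) (i - prev) / (next - prev);
            double[] mid = new double[lo.length];
            for (int d = 0; d < lo.length; d++) {
                mid[d] = lo[d] + w * (hi[d] - lo[d]);
            }
            filled.add(mid);
        }
        return Optional.of(filled);
    }
}
```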
No. The results from the search are known to be missing; that truth should not be manipulated. Interpolation is for when the data is unknown.
Interpolation is for robust results. I don't agree with the "truth should not be manipulated" assertion. Say samples always come in batches of 6 points: you are going to drop them all and start with 0 shingles.
Yes, the behavior in that example is expected. When data is sparse and a shingle is required, the conflict will persist throughout the detector run: even if the training data were populated by introducing non-existent data, real-time prediction would still fail for lack of data. Solving that problem will need a different solution. Let's start with the right one and see the feedback.
We have had this conversation before: once in your initial cold-start CR, where you dropped all samples if you encountered a hole, and once for missing data inside the shingle. We agreed to use limited interpolation and not drop samples. What's the difference this time?
Before this change, due to the small number of samples (a few dozen), interpolation had to be used or there would be no training data at all.
With exhaustive search, there are more samples for training, and knowingly inserting data that is known not to exist becomes unnecessary given its downsides:
- interpolated data affects model accuracy.
- when, how, and even why to fake data is highly controversial and idiosyncratic, so no single solution is likely to satisfy everyone. Say person A thinks 20% interpolation is OK, person B thinks 60% is better, person C thinks a different interpolation method is best, and person D says even 10% is bad. Using the data as-is is the most agreeable approach.

This PR adds standard data-processing improvements over the existing approach, serving as the least controversial foundation, yet it is not a one-way door. If someone has a strong preference, they can always raise a new PR based on this one for everyone to see and accept or reject.
About your downside points:
- no doubt.
- Interpolation is basic in machine learning; I am not sure about the "highly controversial" part. We don't want to put users in front of such a delicate choice as the percentage of fake data; we want to guide them toward a reasonable choice.

I feel the PR can be better by considering data imputation. I don't want a half-done solution. We have already gone this far by running many more queries than we planned; it's better to make that worthwhile and not trivially throw points away.
The controversies are as follows.
- When to interpolate and when not to. With 1% of the data, can anyone "interpolate" the remaining 99%? 10%? 90%? Where is the boundary? Who makes the rules, and on what authority? Even with interpolation, how many more data points are expected to be produced? No one can guarantee that those arbitrary rules will yield more data points; if there are still not enough samples, then what? Keep beating this dead horse?
- How to interpolate/impute. There are many options; for example, why not just filter out the gaps and shingle the remainder? Again, any proposal will be unfair to some and attract challenges, and who is the judge?

It should be clear by now that these are not matters everyone, or even we, can agree on, and they may continue forever given their flexible, ill-defined targets and the lack of any authoritative means to resolve them. What in one person's opinion is completely done can be considered wrong in another's.
```diff
 if (latest.isPresent()) {
-    searchFeatureDao
-        .getFeaturesForSampledPeriods(
-            detector,
-            maxTrainSamples,
-            maxSampleStride,
-            latest.get(),
-            ActionListener.wrap(samples -> processColdStartSamples(samples, listener), listener::onFailure)
-        );
+    List<Entry<Long, Long>> sampleRanges = getColdStartSampleRanges(detector, latest.get());
+    try {
+        searchFeatureDao
+            .getFeatureSamplesForPeriods(
+                detector,
+                sampleRanges,
+                ActionListener.wrap(samples -> processColdStartSamples(samples, listener), listener::onFailure)
+            );
```
Is checking for `latest.isPresent()` (or even passing in `latest` as a param) still needed, since `latest` is not used by `searchFeatureDao.getFeatureSamplesForPeriods` or anywhere else in this function anymore?
It's now used on line 253, in `getColdStartSampleRanges(detector, latest.get())`.
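For reference, a hypothetical sketch of how `getColdStartSampleRanges` might consume `latest` (a reconstruction under assumptions, not the PR's exact code; `trainSampleTimeRangeInHours` and `minTrainSamples` are assumed configuration fields, and the detector types are taken from the plugin's naming):

```java
import java.time.Duration;
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.List;
import java.util.Map.Entry;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

// Hypothetical sketch: walk back from the latest data time in detector-interval
// steps, producing adjacent (start, end) millisecond ranges so that every
// interval in the training window is queried, i.e., an exhaustive search.
private List<Entry<Long, Long>> getColdStartSampleRanges(AnomalyDetector detector, long latestMillis) {
    long interval = ((IntervalTimeConfiguration) detector.getDetectionInterval()).toDuration().toMillis();
    int numSamples = Math
        .max((int) (Duration.ofHours(trainSampleTimeRangeInHours).toMillis() / interval), minTrainSamples);
    return IntStream
        .rangeClosed(1, numSamples)
        .<Entry<Long, Long>>mapToObj(
            i -> new SimpleImmutableEntry<>(
                latestMillis - (numSamples - i + 1) * interval,
                latestMillis - (numSamples - i) * interval
            )
        )
        .collect(Collectors.toList());
}
```

Under this sketch, `latest` anchors the end of the training window, which is why the `isPresent()` check is still needed.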
> Please add labels to your PR so that we can generate release notes automatically. See c062263
Thanks, I see you added the enhancement label to this PR.
Issue #, if available: #180
Description of changes: This PR uses a robust exhaustive search to collect training data samples at model creation time, so the process makes a best effort against missing data in indices. For more discussion, use the issue above.
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.