This repository has been archived by the owner on Aug 2, 2022. It is now read-only.

change to exhaustive search for training data #184

Merged
merged 1 commit into from Jul 10, 2020

Conversation

wnbts
Contributor

@wnbts wnbts commented Jul 2, 2020

Issue #, if available: #180

Description of changes: This pr uses a robust, exhaustive search for collecting training data samples at model creation time, so that the process is best-effort in the face of missing data in indices. For more discussion, see the issue above.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@wnbts wnbts marked this pull request as ready for review July 2, 2020 17:32
@kaituo
Member

kaituo commented Jul 6, 2020

Please add labels to your PR so that we can generate release notes automatically. See c062263

@kaituo kaituo added the enhancement New feature or request label Jul 6, 2020
private Optional<double[][]> batchShingle(List<double[]> points, int shingleSize) {
return Optional
.ofNullable(points)
.filter(p -> p.size() >= shingleSize)
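For context, here is a self-contained sketch of the shingling behavior this thread debates: consecutive points are concatenated into overlapping windows, and any batch with fewer points than the shingle size is dropped entirely. Only the `Optional` chain above comes from the PR; the class name, the `map` body, and the example values are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

public class BatchShingleSketch {
    // Builds shingles of consecutive points; returns empty when there are
    // fewer points than the shingle size (the "drop" behavior under discussion).
    public static Optional<double[][]> batchShingle(List<double[]> points, int shingleSize) {
        return Optional
            .ofNullable(points)
            .filter(p -> p.size() >= shingleSize)
            .map(p -> {
                int numShingles = p.size() - shingleSize + 1;
                int dim = p.get(0).length;
                double[][] shingles = new double[numShingles][shingleSize * dim];
                for (int i = 0; i < numShingles; i++) {
                    for (int j = 0; j < shingleSize; j++) {
                        // each shingle is the concatenation of shingleSize consecutive points
                        System.arraycopy(p.get(i + j), 0, shingles[i], j * dim, dim);
                    }
                }
                return shingles;
            });
    }

    public static void main(String[] args) {
        List<double[]> points = new ArrayList<>();
        for (int i = 1; i <= 4; i++) {
            points.add(new double[] { i });
        }
        // 4 points, shingle size 2 -> 3 overlapping shingles
        System.out.println(batchShingle(points, 2).get().length);
        // 4 points, shingle size 6 -> the whole batch is dropped
        System.out.println(batchShingle(points, 6).isPresent());
    }
}
```

This makes the reviewer's "6 points, 0 shingles" scenario concrete: whenever a contiguous run is shorter than the shingle size, the entire run produces no training data.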
Member

add some interpolation if you miss less than 25% of points?

Contributor Author

no. the gaps in the search results are known to be missing data, and that truth should not be manipulated. interpolation is for when the data is unknown.

Member

Interpolation is for robust results. I don't agree with the "truth should not be manipulated" assertion. Say the samples always come in runs of 6 points: you are going to drop them all and start with 0 shingles.

Contributor Author

yes, the behavior in that example is expected. when data is sparse and a shingle is required, the conflict will persist throughout the detector run: even if training data were populated by introducing non-existent data, real-time prediction would still fail for lack of data. that problem needs a different solution. let's start with the right solution here and see the feedback.

Member

We had this conversation before: once in your initial cold start CR, where you dropped all samples if you encountered a hole, and once for missing data inside the shingle. We agreed to use limited interpolation and not drop samples. What's the difference this time?

Contributor Author

Before this change, because there were so few samples (a few dozen), interpolation had to be used or there would be no training data at all.

With exhaustive search there are more samples for training, and inserting data that is known not to exist becomes unnecessary given its downsides:

  1. interpolated data affects model accuracy.
  2. it is highly controversial and idiosyncratic when to/not to fake data, how, and even why, so no solution is likely to satisfy everyone. Say person A thinks 20% is ok for interpolation, person B thinks 60% is better, person C thinks a different interpolation method is best, and person D says even 10% is bad. Using the data as-is is the most agreeable approach.

This pr adds standard data-processing improvements over the existing approach, serving as the least controversial foundation, yet it is not a one-way door. If someone has a strong preference, they can always raise a new pr based on this one for everyone to see and accept or reject.

Member
@kaituo kaituo Jul 7, 2020

About your downside points:

  1. no doubt
  2. Interpolation is basic in machine learning, so I am not sure about the "highly controversial" part. We don't want to confront users with such a delicate choice about the percentage of fake data; we want to guide users toward a reasonable choice.

I feel the PR can be better by considering data imputation; I don't want a half-done solution. We have already gone this far by running many more queries than we planned, so it's better to make that worthwhile and not trivially throw away points.
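As a sketch of the "limited interpolation" idea being argued for here (not the project's actual code; the class name, the 1-d samples, the endpoint requirement, and the threshold parameter are all assumptions): fill interior gaps linearly only when the missing fraction stays under a cap, otherwise drop the batch.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

public class InterpolationSketch {
    // Linearly fills interior gaps (empty slots) in a series of 1-d samples
    // when the missing fraction is at most maxMissingRatio; otherwise returns
    // empty. Hypothetical illustration of "limited interpolation", not the
    // PR's actual implementation.
    public static Optional<double[]> fillGaps(List<Optional<Double>> samples, double maxMissingRatio) {
        long missing = samples.stream().filter(s -> !s.isPresent()).count();
        if (samples.isEmpty() || (double) missing / samples.size() > maxMissingRatio) {
            return Optional.empty();
        }
        // require present endpoints so every gap is interior and bounded
        if (!samples.get(0).isPresent() || !samples.get(samples.size() - 1).isPresent()) {
            return Optional.empty();
        }
        double[] out = new double[samples.size()];
        out[0] = samples.get(0).get();
        int prev = 0;
        for (int i = 1; i < samples.size(); i++) {
            if (samples.get(i).isPresent()) {
                double lo = out[prev], hi = samples.get(i).get();
                // linear interpolation across the gap between prev and i
                for (int j = prev + 1; j < i; j++) {
                    out[j] = lo + (hi - lo) * (j - prev) / (i - prev);
                }
                out[i] = hi;
                prev = i;
            }
        }
        return Optional.of(out);
    }

    public static void main(String[] args) {
        List<Optional<Double>> s = Arrays
            .asList(Optional.of(1.0), Optional.empty(), Optional.of(3.0), Optional.of(4.0));
        // 1 of 4 points missing (25%) -> filled
        System.out.println(Arrays.toString(fillGaps(s, 0.25).get()));
        // 2 of 4 points missing (50%) exceeds the cap -> dropped
        List<Optional<Double>> sparse = Arrays
            .asList(Optional.of(1.0), Optional.empty(), Optional.empty(), Optional.of(4.0));
        System.out.println(fillGaps(sparse, 0.25).isPresent());
    }
}
```

Both sides of the debate are visible in the threshold parameter: it is exactly the arbitrary "20% vs 60%" boundary the author objects to, and exactly the guided default the reviewer wants to provide.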

Contributor Author

the controversies are as follows.

  1. when to interpolate and when not to. with 1% of the data, can anyone "interpolate" the other 99%? 10%? 90%? where is the boundary, and who makes that rule, on what authority? even with interpolation, how many more data points can be expected? no one can guarantee those arbitrary rules will produce enough data points, and if there are still not enough samples, then what? keep beating this dead horse?
  2. how to interpolate/impute. there are many options. for example, why not just filter out the gaps and shingle the remaining points? again, any proposal will seem unfair to someone and attract challenges, and who is the judge?

it should be clear by now that these are not matters everyone, or even we, can agree on, and the debate may continue forever given its flexible, ill-defined targets and the lack of an authority to resolve it. what one person considers completely done can look wrong to another.

Comment on lines 252 to +262
  if (latest.isPresent()) {
-     searchFeatureDao
-         .getFeaturesForSampledPeriods(
-             detector,
-             maxTrainSamples,
-             maxSampleStride,
-             latest.get(),
-             ActionListener.wrap(samples -> processColdStartSamples(samples, listener), listener::onFailure)
-         );
+     List<Entry<Long, Long>> sampleRanges = getColdStartSampleRanges(detector, latest.get());
+     try {
+         searchFeatureDao
+             .getFeatureSamplesForPeriods(
+                 detector,
+                 sampleRanges,
+                 ActionListener.wrap(samples -> processColdStartSamples(samples, listener), listener::onFailure)
+             );
Contributor

Is checking for latest.isPresent() (or even passing in latest as a param) still needed, since latest is not used by searchFeatureDao.getFeatureSamplesForPeriods or anywhere else in this function anymore?

Contributor Author

it's now used in line 253.
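The diff calls `getColdStartSampleRanges(detector, latest.get())` but does not show its body. A plausible sketch of what such a helper could compute, assuming evenly spaced millisecond ranges ending at the latest data timestamp (the class name, parameters, and interval logic here are all assumptions, not the PR's actual implementation):

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map.Entry;

public class SampleRangeSketch {
    // Builds numSamples consecutive [start, end) millisecond ranges that end
    // at endTimeMs, for exhaustively querying feature samples per period.
    // Hypothetical stand-in for a getColdStartSampleRanges-style helper.
    public static List<Entry<Long, Long>> sampleRanges(long endTimeMs, long intervalMs, int numSamples) {
        List<Entry<Long, Long>> ranges = new ArrayList<>();
        long start = endTimeMs - intervalMs * numSamples;
        for (int i = 0; i < numSamples; i++) {
            // each range covers exactly one detection interval
            ranges.add(new SimpleEntry<>(start + i * intervalMs, start + (i + 1) * intervalMs));
        }
        return ranges;
    }

    public static void main(String[] args) {
        List<Entry<Long, Long>> ranges = sampleRanges(10_000L, 1_000L, 3);
        System.out.println(ranges.size());            // one range per sample
        System.out.println(ranges.get(2).getValue()); // last range ends at endTimeMs
    }
}
```

Under this reading, `latest.get()` supplies the end timestamp for the ranges, which is why the `latest.isPresent()` check is still needed even though `latest` no longer flows into the search call itself.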

Contributor Author
@wnbts wnbts left a comment

> Please add labels to your PR so that we can generate release notes automatically. See c062263

thanks, i see you added the enhancement label to this pr.

Labels: enhancement (New feature or request)
3 participants