change to exhaustive search for training data #184
Conversation
Please add labels to your PR so that we can generate release notes automatically. See c062263
```java
private Optional<double[][]> batchShingle(List<double[]> points, int shingleSize) {
    return Optional
        .ofNullable(points)
        .filter(p -> p.size() >= shingleSize)
```
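The diff cuts off after the `filter`; a minimal sketch of how the shingling might complete (the `map` step below is an assumption for illustration, not the PR's exact code):

```java
import java.util.List;
import java.util.Optional;

// Assumed completion: slide a window of shingleSize consecutive points
// and flatten each window into one row of the output matrix.
private Optional<double[][]> batchShingle(List<double[]> points, int shingleSize) {
    return Optional
        .ofNullable(points)
        .filter(p -> p.size() >= shingleSize)
        .map(p -> {
            int dim = p.get(0).length;
            double[][] shingles = new double[p.size() - shingleSize + 1][shingleSize * dim];
            for (int i = 0; i < shingles.length; i++) {
                for (int j = 0; j < shingleSize; j++) {
                    System.arraycopy(p.get(i + j), 0, shingles[i], j * dim, dim);
                }
            }
            return shingles;
        });
}
```

The `filter` is what the discussion below is about: a batch shorter than `shingleSize` yields no shingles at all rather than interpolated ones.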
Add some interpolation if fewer than 25% of the points are missing?
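For concreteness, the kind of limited interpolation being suggested might look like the sketch below (not part of this PR; the 25% cutoff, the linear fill, and all names are assumptions drawn from this comment):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

// Sketch only: linearly fill interior gaps, but only when fewer than
// 25% of the points are missing; otherwise refuse to fabricate data.
final class GapInterpolator {

    static Optional<List<double[]>> fillSmallGaps(List<Optional<double[]>> points) {
        long missing = points.stream().filter(p -> !p.isPresent()).count();
        // Give up when a quarter or more of the points are holes.
        if (missing * 4 >= points.size()) {
            return Optional.empty();
        }
        List<double[]> filled = new ArrayList<>(points.size());
        for (int i = 0; i < points.size(); i++) {
            if (points.get(i).isPresent()) {
                filled.add(points.get(i).get());
                continue;
            }
            int prev = i - 1;
            int next = i + 1;
            while (next < points.size() && !points.get(next).isPresent()) {
                next++;
            }
            // A hole at either edge has only one neighbor, so a linear
            // fill has nothing to anchor against; reject rather than extrapolate.
            if (prev < 0 || next >= points.size()) {
                return Optional.empty();
            }
            double[] lo = filled.get(prev);
            double[] hi = points.get(next).get();
            double w = (double) (i - prev) / (next - prev);
            double[] mid = new double[lo.length];
            for (int d = 0; d < lo.length; d++) {
                mid[d] = lo[d] + w * (hi[d] - lo[d]);
            }
            filled.add(mid);
        }
        return Optional.of(filled);
    }
}
```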
No. The results from the search are known to be missing; that truth should not be manipulated. Interpolation is for when the data is unknown.
Interpolation is for robust results. I don't agree with the "truth should not be manipulated" assertion. Say samples always come in batches of 6 points: you are going to drop them all and start with 0 shingles.
Yes, the behavior in that example is expected. When data is sparse and a shingle is required, the conflict will persist throughout the detector run: even if the training data were populated by introducing non-existent data, real-time prediction would still fail for lack of data. Solving that problem will need a different solution. Let's start with the right one and see the feedback.
We have had this conversation before: once in your initial cold-start CR, where you dropped all samples if you encountered a hole, and once for missing data inside the shingle. We agreed to use limited interpolation and not drop samples. What's the difference this time?
Before this change, due to the small number of samples (a few dozen), interpolation had to be used or there would be no training data at all.
With exhaustive search, there are more samples for training, and knowingly inserting data that is known not to exist becomes unnecessary given its downsides:
- interpolated data affects model accuracy.
- when, how, and even why to fake data is highly controversial and idiosyncratic, so no single solution is likely to satisfy everyone. Say person A thinks 20% interpolation is OK, person B thinks 60% is better, person C thinks a different interpolation method is best, and person D says even 10% is bad. Using the data as-is is the most agreeable approach.

This PR adds standard data-processing improvements over the existing approach, serving as the least controversial foundation, yet it is not a one-way door. If someone has a strong preference, they can always raise a new PR based on this one for everyone to see and accept or reject.
About your downside points:
- no doubt.
- Interpolation is basic in machine learning; I am not sure about the "highly controversial" part. We don't want to put users in front of such a delicate choice as the percentage of fake data; we want to guide them toward a reasonable choice.

I feel the PR can be better by considering data imputation. I don't want a half-done solution. We have already gone this far by running many more queries than we planned; it's better to make that worthwhile and not trivially throw points away.
The controversies are as follows.
- When to interpolate and when not to. With 1% of the data, can anyone "interpolate" the remaining 99%? 10%? 90%? Where is the boundary? Who makes the rules, and on what authority? Even with interpolation, how many more data points are expected to be produced? No one can guarantee that those arbitrary rules will yield more data points; if there are still not enough samples, then what? Keep beating this dead horse?
- How to interpolate/impute. There are many options; for example, why not just filter out the gaps and shingle the remainder? Again, any proposal will be unfair to some and attract challenges, and who is the judge?

It should be clear by now that these are not matters everyone, or even we, can agree on, and they may continue forever given their flexible, ill-defined targets and the lack of any authoritative means to resolve them. What in one person's opinion is completely done can be considered wrong in another's.
```diff
 if (latest.isPresent()) {
-    searchFeatureDao
-        .getFeaturesForSampledPeriods(
-            detector,
-            maxTrainSamples,
-            maxSampleStride,
-            latest.get(),
-            ActionListener.wrap(samples -> processColdStartSamples(samples, listener), listener::onFailure)
-        );
+    List<Entry<Long, Long>> sampleRanges = getColdStartSampleRanges(detector, latest.get());
+    try {
+        searchFeatureDao
+            .getFeatureSamplesForPeriods(
+                detector,
+                sampleRanges,
+                ActionListener.wrap(samples -> processColdStartSamples(samples, listener), listener::onFailure)
+            );
```
Is checking for `latest.isPresent()` (or even passing in `latest` as a param) still needed, since `latest` is not used by `searchFeatureDao.getFeatureSamplesForPeriods` or anywhere else in this function anymore?
It's now used on line 253, in `getColdStartSampleRanges(detector, latest.get())`.
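For reference, a hypothetical sketch of how `getColdStartSampleRanges` might consume `latest` (a reconstruction under assumptions, not the PR's exact code; `trainSampleTimeRangeInHours` and `minTrainSamples` are assumed configuration fields, and the detector types are taken from the plugin's naming):

```java
import java.time.Duration;
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.List;
import java.util.Map.Entry;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

// Hypothetical sketch: walk back from the latest data time in detector-interval
// steps, producing adjacent (start, end) millisecond ranges so that every
// interval in the training window is queried, i.e., an exhaustive search.
private List<Entry<Long, Long>> getColdStartSampleRanges(AnomalyDetector detector, long latestMillis) {
    long interval = ((IntervalTimeConfiguration) detector.getDetectionInterval()).toDuration().toMillis();
    int numSamples = Math
        .max((int) (Duration.ofHours(trainSampleTimeRangeInHours).toMillis() / interval), minTrainSamples);
    return IntStream
        .rangeClosed(1, numSamples)
        .<Entry<Long, Long>>mapToObj(
            i -> new SimpleImmutableEntry<>(
                latestMillis - (numSamples - i + 1) * interval,
                latestMillis - (numSamples - i) * interval
            )
        )
        .collect(Collectors.toList());
}
```

Under this sketch, `latest` anchors the end of the training window, which is why the `isPresent()` check is still needed.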
> Please add labels to your PR so that we can generate release notes automatically. See c062263
Thanks, I see you added the enhancement label to this PR.
Issue #, if available: #180
Description of changes: This PR uses a robust exhaustive search to collect training data samples at model creation time, so the process makes a best effort against missing data in indices. For more discussion, use the issue above.
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.