-
Notifications
You must be signed in to change notification settings - Fork 16
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fear(#23): Replace filterIncompleteRecords boolean with Imputation Enum for Enhanced Data Handling #61
Conversation
This commit introduces the 'handler' package under 'de.eudx.data' to handle the cleaning and processing of incomplete records within a dataset, in correspondence with issue Samyssmile#23. Changes Made: - Added 'handler' package under 'de.edux.data' - Introduced 'EIncompleteRecordsHandlerStratetgy' enum with the constans: DO_NOT_HANDLE, DROP_RECORDS, FILL_RECORDS_WITH_AVERAGE - Introduced 'IIncompleteRecordsHandler' interface - Implemented 'DropIncompleteRecordsHandler' class - Implemented 'AverageFillIncompleteRecordsHandler' class - Implemented 'DoNotHandleIncompleteRecords' class because of DataProcessorTest.testLoadDataWithoutNormalizationAndShuffling() - Started to write the tests for the above mentioned classes and their methods - Revired the existing codebase to be compatible with the new enum
|
||
return dataset.stream().filter(this::containsOnlyCompletedFeatures).toList(); | ||
if (cleanedDataset.size() < dataset.size() * 0.5) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the reason for 0.5?
return dataset.stream().filter(this::containsOnlyCompletedFeatures).toList(); | ||
if (cleanedDataset.size() < dataset.size() * 0.5) { | ||
throw new RuntimeException( | ||
"More than 50% of the records will be dropped with this IncompleteRecordsHandlerStrategy. " |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why?
@@ -26,41 +26,51 @@ void initializeList() { | |||
void dropRecordsWithIncompleteCategoricalFeature() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks like drop test in AverageFill Test class. Do Drop test in Drop Test Class and AverageFill Tests in AverageFill Test Classes...
} | ||
|
||
@Test | ||
void testDropThreeIncompleteResults() { | ||
void testDropTwoIncompleteResult() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pls rename the tests with this pattern "should*What Test should do" e.g. shouldDropTwoIncompleResults
@@ -0,0 +1,17 @@ | |||
package de.edux.data.handler; | |||
|
|||
public enum EIncompleteRecordsHandlerStrategy { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In Java World wie never prefix enums with 'E'. As in isssue#23 described you need name it "Imputation" here.
Imputation .DROP_RECORDS....
|
||
import java.util.List; | ||
|
||
public interface IIncompleteRecordsHandler { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IImputationHandler
public List<T> loadDataSetFromCSV(File csvFile, char csvSeparator, boolean normalize, boolean shuffle, boolean filterIncompleteRecords) { | ||
List<String[]> x = csvDataReader.readFile(csvFile, csvSeparator); | ||
List<T> unmodifiableDataset = csvDataReader.readFile(csvFile, csvSeparator) | ||
public List<T> loadDataSetFromCSV(File csvFile, char csvSeparator, boolean normalize, boolean shuffle, EIncompleteRecordsHandlerStrategy incompleteRecordHandlerStrategy) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need one more method here
public List<T> loadDataSetFromCSV(File csvFile, char csvSeparator, boolean normalize, boolean shuffle){ this.loadDataSetFromCSV(...., Imputation.Do Nothing)
This PR introduces the 'handler' package under 'de.eudx.data' to handle the cleaning and processing
of incomplete records within a dataset, in correspondence with issue #23.
Changes Made:
DO_NOT_HANDLE, DROP_RECORDS, FILL_RECORDS_WITH_AVERAGE
of DataProcessorTest.testLoadDataWithoutNormalizationAndShuffling()