[HOPSWORKS-1982] Deequ statistics for Feature Groups/Training Datasets #96
if (statisticsEnabled) {
  statisticsEngine.computeStatistics(this, featureData);
}
Shouldn't this call the computeStatistics() method? Otherwise you might end up computing feature statistics for the online feature store, which is not bad per se in this case, as you are not querying NDB, but it might confuse users.
I agree with the confusion part. I just wanted to reuse the DataFrame since we already have it, instead of rereading it. I am not sure Spark is smart enough to recognize that it's already there.
On the other hand, this way the user would always have the statistics from the very first creation of the feature group, even if it is purely online.
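The trade-off discussed above can be sketched in plain Java, without Spark. This is only an illustration under stated assumptions: `StatisticsReuseSketch`, `readFromStore`, `storeReads`, and the toy mean statistic are hypothetical names invented here, not part of hsfs or Deequ. Option 1 reuses the data the caller already holds, while option 2 rereads it so that only persisted data is described.

```java
import java.util.List;

class StatisticsReuseSketch {
    // Counts reads of the hypothetical offline store, to make the extra cost visible.
    static int storeReads = 0;

    // Hypothetical stand-in for reading the feature group back from storage.
    static List<Double> readFromStore() {
        storeReads++;
        return List.of(1.0, 2.0, 3.0);
    }

    // Toy "statistic": the mean of a single numeric feature.
    static double mean(List<Double> values) {
        return values.stream().mapToDouble(Double::doubleValue).average().orElse(Double.NaN);
    }

    // Option 1: reuse the data the caller already holds; no extra read is needed.
    static double computeStatistics(List<Double> featureData) {
        return mean(featureData);
    }

    // Option 2: reread from the store, so the statistics describe persisted data only.
    static double computeStatistics() {
        return mean(readFromStore());
    }

    public static void main(String[] args) {
        List<Double> featureData = List.of(1.0, 2.0, 3.0);
        System.out.println("reuse:  " + computeStatistics(featureData));
        System.out.println("reread: " + computeStatistics());
        System.out.println("store reads: " + storeReads);
    }
}
```

In this sketch both options give the same result, but only option 2 pays for a read. Whether that cost is real in Spark depends on caching and the query plan, which the sketch does not model.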
*/
@JsonIgnore
public Statistics getStatistics() throws FeatureStoreException, IOException {
  return statisticsEngine.getLast(this);
This validates my point about the Python API. Here we return an object containing commit_time and content, which I think is good.
Also, the object in Python exposes content and commit time as its only accessible members.
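For illustration, the shape being described (an object whose only accessible members are the commit time and the content) might look like the following in Java. This is a sketch, not the actual hsfs `Statistics` class: the class name `StatisticsSketch`, the `String` field types, and the sample values are assumptions made for the example.

```java
// Hypothetical value object: immutable, exposing only commit time and content.
final class StatisticsSketch {
    private final String commitTime; // assumed to be a string; the real type may differ
    private final String content;    // assumed to hold the serialized statistics

    StatisticsSketch(String commitTime, String content) {
        this.commitTime = commitTime;
        this.content = content;
    }

    String getCommitTime() {
        return commitTime;
    }

    String getContent() {
        return content;
    }

    public static void main(String[] args) {
        // Placeholder values, for illustration only.
        StatisticsSketch stats = new StatisticsSketch("20200101000000", "{\"columns\": []}");
        System.out.println(stats.getCommitTime());
        System.out.println(stats.getContent());
    }
}
```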
java/src/main/java/com/logicalclocks/hsfs/engine/FeatureGroupEngine.java
java/src/main/java/com/logicalclocks/hsfs/metadata/FeatureGroupApi.java
java/src/main/java/com/logicalclocks/hsfs/metadata/StatisticsApi.java
LGTM, just change the Deequ group ID.
Changed the groupId. We should merge this Java PR first, and then I need to rebase #84 so that it also has the right groupId.
Either this or the Python PR should be merged first, and then the other one needs to be rebased.