Pyspark job for feature batch retrieval #1021
Conversation
Force-pushed from f247df1 to 579b05d
join_keys: List[str],
feature_table: DataFrame,
features: List[str],
feature_prefix: str = "",
Would you mind explaining this please?
conf (Dict):
    Configuration for the retrieval job, in json format. Sample configuration as follows:

    sample_conf = {
This can be moved into an example https://sphinxcontrib-napoleon.readthedocs.io/en/latest/example_google.html
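For reference, a minimal sketch of what that could look like as a Google-style `Example:` section (the function name and config keys here are illustrative, not the actual schema used by this job):

```python
def retrieve_historical_features(spark, conf):
    """Run a batch feature retrieval job.

    Args:
        conf (Dict): Configuration for the retrieval job, in json format.

    Example:
        >>> sample_conf = {
        ...     "entity": {"format": "csv", "path": "gs://bucket/entity.csv"},
        ...     "feature_tables": [{"format": "parquet", "path": "gs://bucket/features"}],
        ... }
        >>> result_df = retrieve_historical_features(spark, sample_conf)
    """
```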
Force-pushed from 5488926 to 6e28c56
/test test-end-to-end-batch-dataflow
) -> DataFrame:
    """Perform an as-of join between entity and feature table, given a maximum age tolerance.

    Join conditions:
    1. Join key values match.
Should we talk about entities here? Or are you keeping it generic?
    2. Feature event timestamp is the closest match possible to the entity event timestamp,
       but must not be more recent than the entity event timestamp, and the difference must
       not be greater than max_age, unless max_age is not specified.
    3. If more than one feature table row satisfies conditions 1 and 2, the feature row with the
What happens if there is no match?
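For context, a minimal sketch of an as-of join that implements these three conditions in PySpark. The column names, the left-join choice, and the assumption that join keys plus entity timestamp identify an entity row are mine for illustration, not necessarily what this PR does:

```python
from typing import List

from pyspark.sql import DataFrame, Window
from pyspark.sql import functions as F


def as_of_join_sketch(
    entity_df: DataFrame,
    feature_df: DataFrame,
    join_keys: List[str],
    features: List[str],
    max_age: str = "7 day",
) -> DataFrame:
    entity = entity_df.alias("entity")
    feature = feature_df.alias("feature")

    # Condition 1: join key values match.
    conditions = [F.col(f"entity.{k}") == F.col(f"feature.{k}") for k in join_keys]
    # Condition 2: the feature timestamp must not be more recent than the
    # entity timestamp, and must lie within max_age of it.
    conditions.append(F.col("feature.event_timestamp") <= F.col("entity.event_timestamp"))
    conditions.append(
        F.col("feature.event_timestamp")
        >= F.col("entity.event_timestamp") - F.expr(f"INTERVAL {max_age}")
    )

    # Left join keeps entity rows with no match; their feature columns are null.
    joined = entity.join(feature, on=conditions, how="left")

    # Condition 3: among qualifying feature rows, keep the most recent one.
    window = Window.partitionBy(
        *[F.col(f"entity.{k}") for k in join_keys],
        F.col("entity.event_timestamp"),
    ).orderBy(F.col("feature.event_timestamp").desc())
    deduped = joined.withColumn("_rank", F.row_number().over(window)).filter(
        F.col("_rank") == 1
    )

    return deduped.select("entity.*", *[F.col(f"feature.{f}") for f in features])
```

In this sketch, at least, the no-match question has a clear answer: with a left join, an entity row without any qualifying feature row survives with null feature values.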
features (List[str]):
    The feature columns which should be present in the result dataframe.
feature_prefix (str):
    Feature column prefix for the result dataframe.
Can you explain WHY this would need to be used?
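One plausible reason (my assumption, not stated in the PR): when the same entity is joined against several feature tables, two tables may expose identically named columns, and a per-table prefix keeps them apart in the result. A hypothetical sketch:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical: two feature tables both have a column named "amount".
transactions = spark.createDataFrame([(1001, 9.5)], ["id", "amount"])
refunds = spark.createDataFrame([(1001, 2.0)], ["id", "amount"])

# Prefixing the second table's feature columns avoids an ambiguous
# "amount" column in the joined result.
refunds_prefixed = refunds.select("id", F.col("amount").alias("refunds__amount"))
joined = transactions.join(refunds_prefixed, on="id", how="left")
# joined columns: id, amount, refunds__amount
```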
    pass


def verify_schema(
Please add a comment here
@@ -0,0 +1,2 @@
id,event_timestamp
1001,2020-09-02T00:00:00.000
Can we please add a few more rows, like 5-10. One row is asking for trouble. Same for rest
Even better to generate thousands to test how multiple partitions can affect result
Yes, using code to generate data isn't a bad idea, as long as we avoid too much rand() without a seed.
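A sketch of what seeded generation could look like (illustrative; the row count, schema, and partition count are arbitrary choices, not the PR's test code):

```python
import random
from datetime import datetime, timedelta

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

rng = random.Random(42)  # fixed seed keeps the test data reproducible
base = datetime(2020, 9, 1)
rows = [
    (1000 + rng.randrange(100), base + timedelta(hours=rng.randrange(240)))
    for _ in range(5000)
]
# repartition spreads the data across several partitions, so the test
# also exercises partition-boundary behaviour of the join.
entity_df = spark.createDataFrame(rows, ["id", "event_timestamp"]).repartition(8)
```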
    The feature columns which should be present in the result dataframe.
feature_prefix (str):
    Feature column prefix for the result dataframe.
max_age (str):
Why is it a str? What's the format?
This is for the PySpark interval expression, e.g. 13 hour, 6 day, 60 second, 3 month. I can also change this to an integer type instead of a string, in which case the max age would be in seconds, similar to the current behaviour. I just thought that using a string might be more convenient for users, since they don't need to convert days / months to seconds, though that would be a breaking change. So should I use seconds and an integer instead?
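For reference, a quick sketch of how such an interval string can be consumed on the PySpark side (column names are illustrative):

```python
from pyspark.sql import functions as F

max_age = "6 day"  # any Spark interval expression: "13 hour", "60 second", ...
# Lower bound for the feature timestamp, relative to the entity timestamp:
lower_bound = F.col("entity.event_timestamp") - F.expr(f"INTERVAL {max_age}")
```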
@khorshuheng I think we need to add a minimum lower boundary
sdk/python/tests/test_as_of_join.py (Outdated)
    assert_dataframe_equal(joined_df, expected_joined_df)


def test_entity_filter(
Not sure what is tested here. Description would be helpful
Force-pushed from d1b9392 to 71e7578
While I agree with that, I am not sure how the minimum lower boundary should be specified. Should it be hard-coded within the SDK? If not, where should we retrieve this lower boundary?
Force-pushed from 71e7578 to 1c60fff
Signed-off-by: Khor Shu Heng <khor.heng@gojek.com>
Force-pushed from 1c19162 to 43a81aa
…rge dataframe Signed-off-by: Khor Shu Heng <khor.heng@gojek.com>
Force-pushed from 43a81aa to 1a36737
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: khorshuheng, pyalex

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
/lgtm
What this PR does / why we need it:
This is the prerequisite for using PySpark for the batch retrieval job instead of the BQ Client / Feast Batch Serving.
Which issue(s) this PR fixes:
This PR contains a standalone pyspark job (i.e. it has no Feast or other external dependencies) to perform batch feature retrieval. Feast users are not expected to use this pyspark script directly. Rather, the Feast SDK will be responsible for submitting the pyspark job to a Spark cluster, which will be implemented in a separate PR.
Does this PR introduce a user-facing change?: