
Pyspark job for feature batch retrieval #1021

Merged · 6 commits merged from spark-retrieval into feast-dev:master on Oct 7, 2020

Conversation

@khorshuheng (Collaborator) commented Sep 30, 2020

What this PR does / why we need it:
This is a prerequisite for using PySpark for the batch retrieval job, instead of the BQ client / Feast Batch Serving.

Which issue(s) this PR fixes:
This PR contains a standalone PySpark job (i.e. it has no Feast or other external dependencies) to perform batch feature retrieval. Feast users are not expected to use this PySpark script directly. Rather, the Feast SDK will be responsible for submitting the PySpark job to a Spark cluster, which will be implemented in a separate PR.

Does this PR introduce a user-facing change?:

NONE

join_keys: List[str],
feature_table: DataFrame,
features: List[str],
feature_prefix: str = "",
Member

Would you mind explaining this please?

conf (Dict):
Configuration for the retrieval job, in json format. Sample configuration as follows:

sample_conf = {

@khorshuheng (Collaborator Author)

/test test-end-to-end-batch-dataflow

@khorshuheng added the kind/feature (New feature or request) label on Oct 2, 2020
) -> DataFrame:
"""Perform an as of join between entity and feature table, given a maximum age tolerance.
Join conditions:
1. Join keys values match.
@woop (Member), Oct 5, 2020

Should we talk about entities here? Or are you keeping it generic?

2. Feature event timestamp is the closest match possible to the entity event timestamp,
but must not be more recent than the entity event timestamp, and the difference must
not be greater than max_age, unless max_age is not specified.
3. If more than one feature table rows satisfy condition 1 and 2, feature row with the
Member

What happens if there is no match?
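
To make the join conditions above concrete, here is a minimal PySpark sketch of such an as-of join. The function and column names are illustrative rather than the PR's actual implementation, and the null-on-no-match behaviour is an assumption about how a left join would handle entities with no qualifying feature row, not something stated in this thread:

```python
from typing import List, Optional

from pyspark.sql import DataFrame, Window
from pyspark.sql import functions as F


def as_of_join_sketch(
    entity_df: DataFrame,
    feature_df: DataFrame,
    join_keys: List[str],
    max_age: Optional[str] = None,
) -> DataFrame:
    e = entity_df.alias("e")
    # Rename the feature timestamp so it does not clash with the entity's.
    f = feature_df.withColumnRenamed("event_timestamp", "feature_ts").alias("f")

    # Condition 1: join key values match.
    cond = [F.col(f"e.{k}") == F.col(f"f.{k}") for k in join_keys]
    # Condition 2: the feature row is not more recent than the entity row ...
    cond.append(F.col("f.feature_ts") <= F.col("e.event_timestamp"))
    # ... and not older than entity timestamp - max_age, when max_age is given.
    if max_age:
        cond.append(
            F.col("f.feature_ts")
            >= F.col("e.event_timestamp") - F.expr(f"interval {max_age}")
        )

    # Left join: entities with no qualifying feature row are kept, with null
    # feature values (assumed behaviour for the no-match case).
    joined = e.join(f, on=cond, how="left")

    # Condition 3: if several feature rows qualify, keep the most recent one.
    # (The duplicate f-side key columns are left in place to keep the sketch short.)
    w = Window.partitionBy(
        *[F.col(f"e.{k}") for k in join_keys], F.col("e.event_timestamp")
    ).orderBy(F.col("f.feature_ts").desc())
    return (
        joined.withColumn("_rn", F.row_number().over(w))
        .where(F.col("_rn") == 1)
        .drop("_rn")
    )
```

Under left-join semantics, an entity row with no match would simply carry null feature values, which is one plausible answer to the question above.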

features (List[str]):
The feature columns which should be present in the result dataframe.
feature_prefix (str):
Feature column prefix for the result dataframe.
Member

Can you explain WHY this would need to be used?
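
One plausible motivation (an inference, not stated in the thread): when features from more than one feature table are joined onto the same entity dataframe, a per-table prefix keeps identically named feature columns from colliding. A hypothetical illustration, with made-up table and column names:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Both feature tables happen to expose a column named "rating"; a per-table
# prefix keeps the columns distinct once both are joined onto the entity rows.
drivers = spark.createDataFrame([(1001, 4.8)], ["driver_id", "rating"])
restaurants = spark.createDataFrame([(2001, 4.2)], ["restaurant_id", "rating"])

drivers_prefixed = drivers.select(
    "driver_id", F.col("rating").alias("driver__rating")
)
restaurants_prefixed = restaurants.select(
    "restaurant_id", F.col("rating").alias("restaurant__rating")
)
# The joined result would carry driver__rating and restaurant__rating, so
# there is no ambiguity about which table a feature came from.
```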

pass


def verify_schema(
Member

Please add a comment here

@@ -0,0 +1,2 @@
id,event_timestamp
1001,2020-09-02T00:00:00.000
@woop (Member), Oct 5, 2020

Can we please add a few more rows, like 5-10? One row is asking for trouble. Same for the rest.

Collaborator

Even better to generate thousands, to test how multiple partitions can affect the result.

@woop (Member), Oct 5, 2020

Yes, using code to generate data isn't a bad idea, as long as we avoid too much rand() without a seed.
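
A minimal sketch of what seeded generation along these lines could look like; the row count, the extra feature_value column, and the partition count are illustrative and not the PR's actual fixtures:

```python
import random
from datetime import datetime, timedelta

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

random.seed(42)  # fixed seed so test failures are reproducible across runs
base = datetime(2020, 9, 1)

rows = [
    (1000 + i, base + timedelta(minutes=random.randint(0, 7 * 24 * 60)), random.random())
    for i in range(5000)
]

# Repartition on purpose so partition-dependent join bugs have a chance to show up.
entity_df = spark.createDataFrame(
    rows, ["id", "event_timestamp", "feature_value"]
).repartition(8)
```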

The feature columns which should be present in the result dataframe.
feature_prefix (str):
Feature column prefix for the result dataframe.
max_age (str):
Collaborator

Why is it a str? What's the format?

Collaborator Author

This is for the PySpark interval expression: for example, 13 hour, 6 day, 60 second, 3 month. I can also change this to an integer type instead of a string, in which case the max age would be in seconds, similar to the current behaviour. I just thought that using a string might be more convenient for users, since they don't need to convert days / months to seconds, though that would be a breaking change. So should I use seconds and an integer instead?
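
For context, this is roughly how such an interval string can be used in a filter, next to the integer-seconds alternative (a sketch with made-up column names, not the PR's code):

```python
from pyspark.sql import functions as F

max_age = "6 day"  # Spark SQL interval expression, as described above

# Interval-string variant: the unit travels with the value.
within_age = F.col("feature_ts") >= F.col("entity_ts") - F.expr(f"interval {max_age}")

# Integer-seconds variant: equivalent, but the caller must convert 6 days to 518400.
max_age_seconds = 6 * 24 * 60 * 60
within_age_secs = (
    F.col("entity_ts").cast("long") - F.col("feature_ts").cast("long") <= max_age_seconds
)
```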

@pyalex (Collaborator) commented Oct 5, 2020

@khorshuheng I think we need to add a minimum lower boundary,
min(event_timestamp) of all entities - max_age,
and apply it as a default filter to feature_table. Otherwise no filters are pushed down to BQ, and we load too much unnecessary data.

assert_dataframe_equal(joined_df, expected_joined_df)


def test_entity_filter(
Collaborator

Not sure what is tested here. A description would be helpful.

@khorshuheng force-pushed the spark-retrieval branch 3 times, most recently from d1b9392 to 71e7578, on October 5, 2020 14:17
@khorshuheng (Collaborator Author)

@khorshuheng I think we need to add a minimum lower boundary,
min(event_timestamp) of all entities - max_age,
and apply it as a default filter to feature_table. Otherwise no filters are pushed down to BQ, and we load too much unnecessary data.

While I agree with that, I am not sure how the minimum lower boundary should be specified. Should it be hard-coded within the SDK? If not, where should we retrieve this lower boundary from?
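
One possible way to derive that boundary without hard-coding it is to take it from the entity dataframe itself. The sketch below is purely illustrative of pyalex's suggestion, assumes entity_df, feature_df and max_age as in the sketches above, and is not what the PR ultimately settled on:

```python
from pyspark.sql import functions as F

# Lower bound = min(event_timestamp) across all entities, minus max_age.
min_entity_ts = entity_df.agg(F.min("event_timestamp")).first()[0]

# Applying the filter before the join gives the source a predicate it can push
# down, so a BigQuery-backed feature table would scan far less data.
feature_df = feature_df.where(
    F.col("event_timestamp")
    >= F.expr(f"cast('{min_entity_ts}' as timestamp) - interval {max_age}")
)
```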

@khorshuheng force-pushed the spark-retrieval branch 2 times, most recently from 1c19162 to 43a81aa, on October 6, 2020 08:57
…rge dataframe (Signed-off-by: Khor Shu Heng <khor.heng@gojek.com>)
@feast-ci-bot (Collaborator)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: khorshuheng, pyalex

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@pyalex (Collaborator) commented Oct 7, 2020

/lgtm

@feast-ci-bot merged commit d1807c9 into feast-dev:master on Oct 7, 2020