Local Spark Provider which supports submitting feature join jobs in local Spark #644

Merged — 10 commits merged into feathr-ai:main on Sep 20, 2022

Conversation


@Yuqing-cat Yuqing-cat commented Sep 5, 2022

Description

Resolves #643

How was this PR tested?

Run pytest test/test_local_spark_e2e.py; remember to set export SPARK_CONFIG__SPARK_CLUSTER=local first.

  • Please note that the current client requires self.credential = DefaultAzureCredential(exclude_interactive_browser_credential=False) if credential is None else credential to build the default registry (see the sketch below for context).
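For context, that default-credential fallback sits in the client constructor. A minimal sketch of the surrounding code — only the quoted line is from the PR; the enclosing class and parameter names here are assumptions:

```python
from azure.identity import DefaultAzureCredential

class FeathrClient:
    def __init__(self, credential=None):
        # Fall back to DefaultAzureCredential when no credential is supplied.
        # exclude_interactive_browser_credential=False keeps interactive
        # browser login available as a last-resort credential source, which
        # building the default registry currently relies on.
        self.credential = (
            DefaultAzureCredential(exclude_interactive_browser_credential=False)
            if credential is None
            else credential
        )
```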

Does this PR introduce any user-facing changes?

  • No. You can skip the rest of this section.
  • Yes. Make sure to clarify your proposed changes.

Included in this PR

  • A new _FeathrDLocalSparkJobLauncher
  • Support for local as a spark_cluster option, alongside databricks and synapse (a usage sketch follows this list)
  • Test case: test_local_spark_e2e.py (a pure local e2e test)
    - non udf feature join
    - udf feature join with wasb observation path
    - feature materialization with Redis online store (*the Spark job logs look correct, but the Redis write does not seem to work; still investigating)
  • Document: local-spark-provider.md
  • feature_conf/features.conf is removed so that it will not affect test results.
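Putting the items above together, a local feature-join run would look roughly like this (a sketch only: the key, feature, and column names are illustrative, while the client calls follow the existing feathr Python SDK):

```python
import os
from feathr import (FeathrClient, FeatureQuery, ObservationSettings,
                    TypedKey, ValueType)

# Select the new local provider instead of databricks/synapse.
os.environ["SPARK_CONFIG__SPARK_CLUSTER"] = "local"

client = FeathrClient(config_path="feathr_config.yaml")

# Illustrative key and feature names; any registered anchored feature works.
trip_key = TypedKey(key_column="DOLocationID",
                    key_column_type=ValueType.INT32)
query = FeatureQuery(feature_list=["f_trip_distance"], key=trip_key)
settings = ObservationSettings(
    observation_path="./green_tripdata_2020-04.csv",  # local observation data
    event_timestamp_column="lpep_dropoff_datetime",
    timestamp_format="yyyy-MM-dd HH:mm:ss",
)

# Submits the feature join job to a locally spawned Spark, not a cluster.
client.get_offline_features(observation_settings=settings,
                            feature_query=query,
                            output_path="./debug/output.avro")
```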

@Yuqing-cat Yuqing-cat added the documentation, feature, and python labels Sep 5, 2022
@blrchen blrchen requested a review from hangfei September 5, 2022 08:40
@xiaoyongzhu xiaoyongzhu added the safe to test label Sep 5, 2022
@xiaoyongzhu
Member

Thanks! I have minor comments but overall LGTM.

@blrchen
Collaborator

blrchen commented Sep 5, 2022

Please investigate whether the CI test failure is related.

@Yuqing-cat
Collaborator Author

Please investigate whether the CI test failure is related.

Not sure why this happens, as no Scala change is made in this PR. Will investigate.
[info] *** 4 TESTS FAILED ***
[error] Failed tests:
[error] com.linkedin.feathr.offline.swa.TestSlidingWindowFeatureUtils
[error] (Test / test) sbt.TestsFailedException: Tests unsuccessful
[error] Total time: 179 s (02:59), completed Sep 5, 2022 1:00:41 PM

@xiaoyongzhu
Member

The Scala part should be irrelevant; sometimes there are random failures there (not sure why).

But the local testing code failed, apparently because the env is forced to synapse or databricks:

https://github.com/linkedin/feathr/blob/main/.github/workflows/pull_request_push_test.yml#L105

So @Yuqing-cat, you might want to add a local-only test (i.e., force the env to run only the local test); one way to do that is sketched below.
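A minimal sketch, assuming the test module gates itself on the same environment variable the PR already uses (the pytest marker approach is a suggestion, not something the PR prescribes):

```python
import os
import pytest

# Run this module only when the env explicitly selects the local Spark
# provider, so CI jobs pinned to synapse/databricks skip it automatically.
pytestmark = pytest.mark.skipif(
    os.environ.get("SPARK_CONFIG__SPARK_CLUSTER") != "local",
    reason="requires SPARK_CONFIG__SPARK_CLUSTER=local",
)
```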

@Yuqing-cat
Collaborator Author

https://github.com/linkedin/feathr/blob/main/.github/workflows/pull_request_push_test.yml#L105

Is there any best practice for local tests?
A straightforward way would be to move test_local_spark_e2e.py to another folder, like feathr_project/local_test.
Or I could put test_local_spark_e2e.py into feathr_project/test/local_test and move all the remaining tests into feathr_project/test/test.
However, I'm not sure whether this would introduce conflicts or risks to the current code base.
Any suggestions?

@windoze
Member

windoze commented Sep 6, 2022

https://github.com/linkedin/feathr/blob/main/.github/workflows/pull_request_push_test.yml#L105

Is there any best practice for local tests? A straightforward way would be to move test_local_spark_e2e.py to another folder, like feathr_project/local_test. Or I could put test_local_spark_e2e.py into feathr_project/test/local_test and move all the remaining tests into feathr_project/test/test. However, I'm not sure whether this would introduce conflicts or risks to the current code base. Any suggestions?

You can just create a test script that uses another feathr_config.yaml, or set an environment variable to override the Spark provider setting; Spark should be able to run within the Docker container.
You may also need to reduce the executor count to avoid consuming too many resources, since the CI env may not tolerate memory hogs.
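For example, a sketch of the env-override route (SPARK_CONFIG__SPARK_CLUSTER is the override named in the PR description; the second key below only illustrates the same double-underscore convention and is not a confirmed feathr setting):

```python
import os

# Override the Spark provider for this process instead of editing
# feathr_config.yaml; feathr reads SPARK_CONFIG__* env overrides.
os.environ["SPARK_CONFIG__SPARK_CLUSTER"] = "local"

# Keep the local Spark footprint small for CI. This key is an illustrative
# placeholder following the same convention, not a confirmed setting.
os.environ["SPARK_CONFIG__LOCAL__MASTER"] = "local[2]"
```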

@xiaoyongzhu
Member

@Yuqing-cat I was looking through the test failures and think they are related to this line:

default_registry_client

We should probably make that line optional if no registry service is set in the YAML config / env config.
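One possible shape for that change (a hedged sketch; registry_endpoint and _build_registry are illustrative names, not actual feathr identifiers):

```python
from azure.identity import DefaultAzureCredential

def _init_registry(self, registry_endpoint=None, credential=None):
    # Only build the credential and registry client when a registry endpoint
    # is actually configured, so a purely local run never touches Azure auth.
    if not registry_endpoint:
        self.registry = None  # local-only run: skip the registry entirely
        return
    self.credential = credential or DefaultAzureCredential(
        exclude_interactive_browser_credential=False
    )
    self.registry = self._build_registry(registry_endpoint, self.credential)
```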

@HanumathRao

@Yuqing-cat Thanks for making the changes. This will be very useful for newcomers trying out the e2e test case on a local machine.

I tried some local testing based on this branch and am running into an issue (which I think is related to a bug in the feathr code). I am curious whether you encountered it.

Please find the stack trace below.

22/09/12 20:07:26 INFO SlidingWindowAggregationJoiner: Selected max window duration PT2160H across all anchors for source abfss://feathrazuretest3fs@feathrazuretest3storage.dfs.core.windows.net/demo_data/green_tripdata_2020-04.csv
Exception in thread "main" java.lang.IllegalArgumentException: Wrong FS: abfss://feathrazuretest3fs@feathrazuretest3storage.dfs.core.windows.net/demo_data/green_tripdata_2020-04.csv/daily, expected: file:///
    at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:807)
    at org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:105)
    at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:774)
    at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:1100)
    at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:769)
    at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:462)
    at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1760)
    at com.linkedin.feathr.offline.util.HdfsUtils$.exists(HdfsUtils.scala:453)
    at com.linkedin.feathr.offline.source.pathutil.LocalPathChecker.exists(LocalPathChecker.scala:50)
    at com.linkedin.feathr.offline.source.pathutil.TimeBasedHdfsPathAnalyzer.analyze(TimeBasedHdfsPathAnalyzer.scala:51)
    at com.linkedin.feathr.offline.transformation.AnchorToDataSourceMapper.getWindowAggAnchorDFMapForJoin(AnchorToDataSourceMapper.scala:102)
    at com.linkedin.feathr.offline.swa.SlidingWindowAggregationJoiner.$anonfun$joinWindowAggFeaturesAsDF$8(SlidingWindowAggregationJoiner.scala:142)
    at scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:293)
    at scala.collection.immutable.Map$Map1.foreach(Map.scala:193)
    at scala.collection.TraversableLike.flatMap(TraversableLike.scala:293)
    at scala.collection.TraversableLike.flatMap$(TraversableLike.scala:290)
    at scala.collection.AbstractTraversable.flatMap(Traversable.scala:108)
    at com.linkedin.feathr.offline.swa.SlidingWindowAggregationJoiner.joinWindowAggFeaturesAsDF(SlidingWindowAggregationJoiner.scala:122)
    at com.linkedin.feathr.offline.join.DataFrameFeatureJoiner.joinSWAFeatures(DataFrameFeatureJoiner.scala:325)
    at com.linkedin.feathr.offline.join.DataFrameFeatureJoiner.joinFeaturesAsDF(DataFrameFeatureJoiner.scala:196)
    at com.linkedin.feathr.offline.client.FeathrClient.joinFeaturesAsDF(FeathrClient.scala:267)
    at com.linkedin.feathr.offline.client.FeathrClient.doJoinObsAndFeatures(FeathrClient.scala:186)
    at com.linkedin.feathr.offline.job.FeatureJoinJob$.getFeathrClientAndJoinFeatures(FeatureJoinJob.scala:141)
    at com.linkedin.feathr.offline.job.FeatureJoinJob$.feathrJoinRun(FeatureJoinJob.scala:193)
    at com.linkedin.feathr.offline.job.FeatureJoinJob$.run(FeatureJoinJob.scala:79)
    at com.linkedin.feathr.offline.job.FeatureJoinJob$.main(FeatureJoinJob.scala:349)
    at com.linkedin.feathr.offline.job.FeatureJoinJob.main(FeatureJoinJob.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
    at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:958)
    at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
    at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
    at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
    at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1046)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1055)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

@Yuqing-cat
Collaborator Author

@Yuqing-cat Thanks for making the changes. This will be very useful for newcomers trying out the e2e test case on a local machine.

I tried some local testing based on this branch and am running into an issue (which I think is related to a bug in the feathr code). I am curious whether you encountered it.

[stack trace quoted above]

Hi @HanumathRao, thank you for your feedback.
The "Wrong FS ... expected: file:///" error means the local filesystem is being asked to resolve an abfss:// URI. Please replace every occurrence of 'abfss://feathrazuretest3fs@feathrazuretest3storage.dfs.core.windows.net/demo_data/green_tripdata_2020-04.csv' in your config file or Python sample code with a local path such as './green_tripdata_2020-04.csv' (download the sample file there first), and see whether that resolves the issue.
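A sketch of the suggested change (ObservationSettings follows the feathr Python SDK; the timestamp column name is illustrative, taken from the green-taxi sample data):

```python
from feathr import ObservationSettings

# After downloading green_tripdata_2020-04.csv into the working directory,
# point the observation data at the local copy rather than the abfss:// URI.
settings = ObservationSettings(
    observation_path="./green_tripdata_2020-04.csv",
    event_timestamp_column="lpep_dropoff_datetime",  # illustrative column
    timestamp_format="yyyy-MM-dd HH:mm:ss",
)
```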

Signed-off-by: Yuqing Wei <weiyuqing021@outlook.com>
@xiaoyongzhu
Member

Approved, but we also need to make sure that users can run this locally without setting Azure credentials.

@blrchen
Collaborator

blrchen commented Sep 19, 2022

@Yuqing-cat can you please take a look at the CI failures before merging?

@Yuqing-cat Yuqing-cat merged commit b299379 into feathr-ai:main Sep 20, 2022
@Yuqing-cat
Collaborator Author

@Yuqing-cat can you please take a look at the CI failures before merging?

Please check:
https://github.com/feathr-ai/feathr/actions/runs/3087050712/jobs/4992014137
I intended to create a pure local (no Azure credential) e2e test, but found that some credential logic is embedded in the Spark critical path. More effort is needed to decouple it.

hyingyang-linkedin pushed a commit to hyingyang-linkedin/feathr that referenced this pull request Oct 25, 2022
Local Spark Provider which supports to submit feature join job in local spark (feathr-ai#644)

* local spark feature join job with local file

* update local spark with udf support

* add feature gen support in local spark

* update test case

* remove unused feature conf, update doc

* expose master as input and refine local spark provider

Signed-off-by: Yuqing Wei <weiyuqing021@outlook.com>
windoze added a commit that referenced this pull request Nov 9, 2022
* Add Data Models in Feathr

This RB is to create data models based on proposal: https://microsoft-my.sharepoint.com/:w:/g/personal/djkim_linkedin_biz/EZspGt7jJlRAqHTICZg3UbcBgQQ_VncOgM48hKW--T8qkg?e=T4N3zw

* Update models.py

* Update models.py

* Update models.py

* Update models.py

* Update models.py

* Add attributes to data models

Add data attributes to data models

* Added _scproxy necessary for MacOS (#651)

* Added _scproxy necessary for MacOS

Signed-off-by: changyonglik <theeahlag@gmail.com>

* Changed to conditional import

Signed-off-by: changyonglik <theeahlag@gmail.com>

* Added comments

Signed-off-by: changyonglik <theeahlag@gmail.com>

Signed-off-by: changyonglik <theeahlag@gmail.com>

* Add docs for consuming features in online environment (#609)

* Create consume-features.md

* Update consume-features.md

* rename docs

* Update model-inference-with-feathr.md

* Update README.md

* update docs per feedback

* Update streaming-source-ingestion.md

* update docs

* update docs

* Update azure-deployment-arm.md

* Update model-inference-with-feathr.md

* add sign off message

Signed-off-by: Xiaoyong Zhu xiaoyzhu@outlook.com

* fix comments

* Delete deploy-feathr-api-as-webapp.md

* Update model-inference-with-feathr.md

Signed-off-by: Xiaoyong Zhu xiaoyzhu@outlook.com

* Clean up after moving to LFAI (#665)

* Clean up after moving to LFAI

Clean up after moving to LFAI

* Update README.md

* Updating docker version in ARM template to use latest release tagged image (#668)

* Adding DevSkim linter to Github actions

* Fix in ARM template to pull latest tagged release image from dockerhub

* Removing dev skim file from this branch

* Fixing linkedin org reference

* Added prettier documentation (#672)

* Added prettier documentation

Signed-off-by: changyonglik <theeahlag@gmail.com>

* Fixed prettier documentation

Signed-off-by: changyonglik <theeahlag@gmail.com>

Signed-off-by: changyonglik <theeahlag@gmail.com>

* UI: Add data source detail page (#620)

UI: Add data source detail page

* Add aerospike sink (#632)

* squash commit and avoid conflict

* Revert legacy purview client issue

* Fix typo

* Remove auth from assink

* Update aerospike guidance document

* Change port param to int

* Remove reference to aerospike in sbt (#680)

* Extend RBAC to support project id as input (#673)

* extend rbac to support project id as input

Signed-off-by: Yuqing Wei <weiyuqing021@outlook.com>

* update registry docs and interface

Signed-off-by: Yuqing Wei <weiyuqing021@outlook.com>

* user name case sensitive hot fix

Signed-off-by: Yuqing Wei <weiyuqing021@outlook.com>

Signed-off-by: Yuqing Wei <weiyuqing021@outlook.com>

* Local Spark Provider which supports to submit feature join job in local spark (#644)

* local spark feature join job with local file

* update local spark with udf support

* add feature gen support in local spark

* update test case

* remove unused feature conf, update doc

* expose master as input and refine local spark provider

Signed-off-by: Yuqing Wei <weiyuqing021@outlook.com>

* Fixing issue with docker image on demo apps not getting updated (#686)

Fixes #685
Look at the screenshot in the issue with the fixes.

Basically it seems for dockerhub images, we don't need to pass in the full URL (domain name) for the image name while publishing them to webapps.

* Lock python dependency versions (#690)

* Update setup.py

* Update setup.py

* Update setup.py

* Apply 'aggregation_features' parameter to merge dataframes (#667)

* Apply 'aggregation_features' parameter to merge dataframes

* modify test cases

* modify test case filter rule to keep same results as before

* add typekey check and improve previous changes

* merge to main and quick change

* revert change by mistake

* Apply this parameter to HDSF sink and add comments

* quick fix

* quick improve

Co-authored-by: Enya-Yx <enya@v-ellinlu-2.fareast.corp.microsoft.com>
Co-authored-by: enya-yx <enya@7633599-06281.northamerica.corp.microsoft.com>
Co-authored-by: enya-yx <enya@LAPTOP-NBH6175C.redmond.corp.microsoft.com>

* Fix data source detail page in rbac registry (#698)

Signed-off-by: Yuqing Wei <weiyuqing021@outlook.com>

* Fix multi-keyed feature in anchor (direct purview) (#676)

When using the old purview client (not the registry client), if features inside an anchor have the same set of keys, then when calling get_feature_from_registry only the first key will be used.

This PR handles the situation where each feature has multiple keys and the collection of keys is identical among the features inside an anchor.

* Fix path with #LATEST (#684)

* Fix Feature value adaptor and UDF adaptor on Spark executors (#660)

* Fix Feature value adaptor and UDF adaptor on Spark executors

* Fix path with #LATEST

* Add comments

* Defer version bump

* Enhance SQL Registry Error Messages (#674)

Right now the SQL registry returns all errors as 500 Internal Error; this PR improves the error handling and returns 400/404/409 on the corresponding criteria.

It also introduces an environment variable REGISTRY_DEBUGGING: when it is set to a non-empty string, the returned HTTP error will include detailed traceback info. This variable should only be used for debugging purposes.

* bump version to 0.8.0 (#694)

Signed-off-by: Yuqing Wei <weiyuqing021@outlook.com>

Signed-off-by: Yuqing Wei <weiyuqing021@outlook.com>

* Fix feature type bug where inferred feature type might not be honored when all the feature types are not provided (#701)

* Update setup.py (#702)

* fix rbac+purview web app issue (#700)

Resolves #699

Root cause:
Purview Registry starts more slowly than the SQL registry, while the RBAC layer adds a dependency on its API during RBAC init, which causes the web app to crash.
Trials and fix:
Trial: adding a sleep(60) command in start.sh makes the deployment succeed.
Fix: move the registry API dependency outside of RBAC init; log the failure as a Runtime Exception.

* Remove hard coded resources in docs (#696)

Remove hard coded resources, such as synapse, app endpoint, redis and so on in docs to avoid causing confusion.

* Add e2e test for purview registry and rbac registry (#689)

* Add e2e test for purview registry and rbac registry

* Add purview and rbac env e2e to registry tests

* Fix merge issue

* Update test use runtime jar from maven for spark submission to cover databricks (#706)

* Enhance databricks submission error message (#710)

Enhance databricks submission error message

* Enhance purview registry error messages (#709)

* Enhance purview registry error messages

* Update doc for REGISTRY_DEBUGGING

* hot fix databricks es dependency issue (#713)

Signed-off-by: Yuqing Wei <weiyuqing021@outlook.com>

Signed-off-by: Yuqing Wei <weiyuqing021@outlook.com>

* Fix materialize to sql e2e test failure (#717)

* Fix materialize to sql e2e test failure
* Update sql server name

* Add Data Models in Feathr (#659)

* Add Data Models in Feathr

This RB is to create data models based on proposal: https://microsoft-my.sharepoint.com/:w:/g/personal/djkim_linkedin_biz/EZspGt7jJlRAqHTICZg3UbcBgQQ_VncOgM48hKW--T8qkg?e=T4N3zw

* Update models.py

* Update models.py

* Update models.py

* Update models.py

* Update models.py

* Revert "Enhance purview registry error messages (#709)" (#720)

This reverts commit 059f2b4.

* Improve Avro GenericRecord and SpecificRecord based row-level extractor performance (#723)

* 1. In SparkRowExtractor.scala, add a new extractor method which can be extended to batch-preprocess the source dataframe into RDD[IndexedRecord].
2. In FeatureTransformation.scala, add logic to extract features from RDD[IndexedRecord].
3. Improve some error messages.

* Save lookup feature definition to HOCON files (#732)

* Fix function string parsing (#725)

* Add version. Fix function string parsing

Signed-off-by: Jun Ki Min <42475935+loomlike@users.noreply.github.com>

* Add unit test

Signed-off-by: Jun Ki Min <42475935+loomlike@users.noreply.github.com>

* Add comments

Signed-off-by: Jun Ki Min <42475935+loomlike@users.noreply.github.com>

Signed-off-by: Jun Ki Min <42475935+loomlike@users.noreply.github.com>

* Apply a same credential within each sample (#718)

Co-authored-by: enya-yx <enya@v-ellinlu-2.fareast.corp.microsoft.com>

* Enable incremental for HDFS sink (#695)

* Enable incremental for HDFS sink

* Add docstring

* Add docs

* minor fix

* minor changes

* quick fix

Co-authored-by: enya-yx <enya@v-ellinlu-2.fareast.corp.microsoft.com>

* #492 fix, fail only if different sources have same name (#733)

* Remove unused credentials and deprecated purview settings (#708)

* Remove unused credentials and deprecated purview settings

* Revoke token submitted by mistaken (#730)

* Update product_recommendation_demo.ipynb

* Fix synapse errors not print out issue (#734)

Co-authored-by: enya-yx <enya@v-ellinlu-2.fareast.corp.microsoft.com>

* Spark config passing bug fix for local spark submission (#729)

* Fix local spark output file-format bug

Signed-off-by: Jun Ki Min <42475935+loomlike@users.noreply.github.com>

* Add dev dependencies. Add unit-test for local spark job launcher

Signed-off-by: Jun Ki Min <42475935+loomlike@users.noreply.github.com>

* Fix local spark submission unused param error

Signed-off-by: Jun Ki Min <42475935+loomlike@users.noreply.github.com>

Signed-off-by: Jun Ki Min <42475935+loomlike@users.noreply.github.com>

* Fix direct purview client missing transformation (#736)

* Revert "Derived feature bugfix (#121)" (#731)

This reverts commit fa645f3.

* Support SWA with groupBy to 1d tensor conversion (#748)

* Support SWA with groupby to 1d tensor conversion

* Rijai/armfix (#742)

* Adding DevSkim linter to Github actions

* Update docker-publish.yml

* Update docker-publish.yml

* Update docker-publish.yml

* Update docker-publish.yml

* Update docker-publish.yml

* Update docker-publish.yml

* Update docker-publish.yml

* Update docker-publish.yml

* Making ARM instructions for Owner role and AAD App more clear

* Removing devskim file

* Reverting the changes to docker file to match with feathr/main

* bump version to 0.8.2 (#722)

Signed-off-by: Yuqing Wei <weiyuqing021@outlook.com>

Signed-off-by: Yuqing Wei <weiyuqing021@outlook.com>

* Added latest deltalake version (#735)

* Added latest deltalake version

* Changed == to <= for deltalake installation

* Changed <= to >=

* #474 Disable local mode (#738)

* Allow recreating entities for PurView registry (#691)

* Allow recreating entities for PurView registry

* Use constants

* Adding DevSkim linter to Github actions (#657)

* Adding DevSkim linter to Github actions
* Ignoring .git and test folder

* Fix icons in UI cannot auto scale (#737) (#744)

* Fix icons in UI cannot auto scale (#737)

* Fix home.css code style issue

* Expose 'timePartitionPattern' in Python API [ WIP ] (#714)

* Expose 'timePartitionPattern'

* add test case

* Add test cases and docstring

* delete local files

* quick fix

Co-authored-by: enya-yx <enya@LAPTOP-NBH6175C.redmond.corp.microsoft.com>
Co-authored-by: enya-yx <enya@v-ellinlu-2.fareast.corp.microsoft.com>

* Setting up component governance pipeline (#655)

[skip ci]

* Add docs to explain on feature materialization behavior (#688)

* Update materializing-features.md

* Update materializing-features.md

* Fix protobuf version (#711)

* Fix protobuf version

* quick fix

Co-authored-by: enya-yx <enya@v-ellinlu-2.fareast.corp.microsoft.com>

* Add some notes based on on-call issues (#753)

* Add some notes based on on-call issues

* quick fix

Co-authored-by: enya-yx <enya@v-ellinlu-2.fareast.corp.microsoft.com>

* Refine spark runtime error message (#755)

Signed-off-by: Yuqing Wei <weiyuqing021@outlook.com>

* Serialization bug due to version incompatibility between azure-core and msrest (#763)

* Adding DevSkim linter to Github actions

* Fix in ARM template to pull latest tagged release image from dockerhub

* Removing dev skim file from this branch

* Fixing linkedin org reference

* Removing the docker index url from dockerhub image name as it seems to cause problem with the update

* Adding to the right file, had a dockerhub workflow file with different name

* Adding debug statements to test udf issue on Synapse

* Adding more print statements

* Pinning msrest version to work with pinned version of azure-core

* Removing debug code from previous branch

* Unify Python SDK Build Version and decouple Feathr Maven Version (#746)

* unify python package version and enable env setting for scala version

Signed-off-by: Yuqing Wei <weiyuqing021@outlook.com>

* update docs and decouple maven version

Signed-off-by: Yuqing Wei <weiyuqing021@outlook.com>

* change version back to 0.8.0 to avoid conflicts

Signed-off-by: Yuqing Wei <weiyuqing021@outlook.com>

* fix typo

Signed-off-by: Yuqing Wei <weiyuqing021@outlook.com>

Signed-off-by: Yuqing Wei <weiyuqing021@outlook.com>

* replace hard code string in notebook and align with others (#765)

Signed-off-by: Yuqing Wei <weiyuqing021@outlook.com>

Signed-off-by: Yuqing Wei <weiyuqing021@outlook.com>

* Add flag to enable generation non-agg features (#719)

* Add flag to enable generation non-agg features

* Typo

* Resolve comments

* rollback 0.8.2 version bump PR (#771)

Signed-off-by: Yuqing Wei <weiyuqing021@outlook.com>

* Refactor Product Recommendation sample notebook  (#743)

* Adding DevSkim linter to Github actions

* Update docker-publish.yml

* Update docker-publish.yml

* Update docker-publish.yml

* Update docker-publish.yml

* Update docker-publish.yml

* Update docker-publish.yml

* Update docker-publish.yml

* Update docker-publish.yml

* Removing devskim file

* Restructuring the Prod Reco sample

* Adjusting headings

* Minor changes

* Removing changes to docker publish file

* Addressing PR comments, moving Product recommendation notebook sample to Synapse folder since it is strongly tied to Synapse

* Addressing PR comments

* Fixing images

* Removing the need to pass email id as we could directly compute object Id using az command, also making CLI instructions clearer that it is for advance users

* Update role-management page in UI (#751) (#764)

* Update role-management page in UI (#751)

* fix home.css LF file

* fix RoleForm eslint warning

* remove import dayjs

Signed-off-by: Boli Guan <ifendoe@gmail.com>

* Change components to arrow function.

Signed-off-by: Boli Guan <ifendoe@gmail.com>

Signed-off-by: Boli Guan <ifendoe@gmail.com>

* Create Feature less module in UI code and import alias (#768)

* Add craco devDependencies

Signed-off-by: Boli Guan <ifendoe@gmail.com>

* Add classnames, @ant-design/icons,eslint-plugin.. dependencies.

Signed-off-by: Boli Guan <ifendoe@gmail.com>

* Update .editorconfig and .eslintrc

* Update .editorconfig

Signed-off-by: Boli Guan <ifendoe@gmail.com>

Signed-off-by: Boli Guan <ifendoe@gmail.com>

* Add dev and notebook dependencies. Add extra dependency installation to the test pipeline yml (#773)

Signed-off-by: Jun Ki Min <42475935+loomlike@users.noreply.github.com>

Signed-off-by: Jun Ki Min <42475935+loomlike@users.noreply.github.com>

* Fix Windows compatibility issues (#776)

* Update _databricks_submission.py

* Update feathr-configuration-and-env.md

* Update feathr-configuration-and-env.md

* Update _databricks_submission.py

* Update models.py

* Update models.py

* Address comments

* Remove sourceRef

* Add functionType

Signed-off-by: changyonglik <theeahlag@gmail.com>
Signed-off-by: Xiaoyong Zhu xiaoyzhu@outlook.com
Signed-off-by: Yuqing Wei <weiyuqing021@outlook.com>
Signed-off-by: Jun Ki Min <42475935+loomlike@users.noreply.github.com>
Signed-off-by: Boli Guan <ifendoe@gmail.com>
Co-authored-by: Chang Yong Lik <51813538+ahlag@users.noreply.github.com>
Co-authored-by: Xiaoyong Zhu <xiaoyongzhu@users.noreply.github.com>
Co-authored-by: Richin Jain <rijai@microsoft.com>
Co-authored-by: Yihui Guo <yihgu@microsoft.com>
Co-authored-by: Yuqing Wei <weiyuqing021@outlook.com>
Co-authored-by: Enya-Yx <108409954+enya-yx@users.noreply.github.com>
Co-authored-by: Enya-Yx <enya@v-ellinlu-2.fareast.corp.microsoft.com>
Co-authored-by: enya-yx <enya@7633599-06281.northamerica.corp.microsoft.com>
Co-authored-by: enya-yx <enya@LAPTOP-NBH6175C.redmond.corp.microsoft.com>
Co-authored-by: Jinghui Mo <jmo@linkedin.com>
Co-authored-by: 徐辰 <windoze@0d0a.com>
Co-authored-by: Blair Chen <blrchen@users.noreply.github.com>
Co-authored-by: Jun Ki Min <42475935+loomlike@users.noreply.github.com>
Co-authored-by: Hangfei Lin <hnlin@linkedin.com>
Co-authored-by: Boli Guan <ifendoe@gmail.com>

Successfully merging this pull request may close these issues.

Support to Run Feature Join Job in Local Spark Environment