-
Notifications
You must be signed in to change notification settings - Fork 259
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve Avro GenericRecord and SpecificRecord based row-level extractor performance #723
Improve Avro GenericRecord and SpecificRecord based row-level extractor performance #723
Conversation
…extended to do batch preprocess source dataframe into RDD[IndexRecord]. 2. In FeatureTransformation.scala, add logic to extract features from RDD[IndexedRecord]. 3. Improve some error messages.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall LGTM. Just a couple comments on naming and documentation.
src/main/scala/com/linkedin/feathr/offline/job/FeatureTransformation.scala
Outdated
Show resolved
Hide resolved
src/main/scala/com/linkedin/feathr/offline/job/FeatureTransformation.scala
Show resolved
Hide resolved
src/main/scala/com/linkedin/feathr/common/SparkRowExtractor.scala
Outdated
Show resolved
Hide resolved
src/main/scala/com/linkedin/feathr/offline/job/FeatureTransformation.scala
Outdated
Show resolved
Hide resolved
Seems the sbt test is failed. Not sure if it's related with this change. |
src/main/scala/com/linkedin/feathr/common/SparkRowExtractor.scala
Outdated
Show resolved
Hide resolved
src/main/scala/com/linkedin/feathr/offline/anchored/keyExtractor/MVELSourceKeyExtractor.scala
Outdated
Show resolved
Hide resolved
src/main/scala/com/linkedin/feathr/offline/anchored/keyExtractor/MVELSourceKeyExtractor.scala
Show resolved
Hide resolved
src/main/scala/com/linkedin/feathr/offline/job/FeatureTransformation.scala
Show resolved
Hide resolved
7175ecd
to
e87d596
Compare
e87d596
to
c617cd6
Compare
Thanks, this PR looks good to me. I also like the added unit test to make sure the parsing works correctly. Have a few minor comments, and also let's make sure it works when reading features from the registry (that's where most of the edge cases are from). |
src/main/scala/com/linkedin/feathr/common/WorkWithAvroRdd.scala
Outdated
Show resolved
Hide resolved
src/main/scala/com/linkedin/feathr/offline/anchored/keyExtractor/MVELSourceKeyExtractor.scala
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
…or performance (feathr-ai#723) * 1. In SparkRowExtractor.scala, add new extractor method which can be extended to do batch preprocess source dataframe into RDD[IndexRecord]. 2. In FeatureTransformation.scala, add logic to extract features from RDD[IndexedRecord]. 3. Improve some error messages.
* Add Data Models in Feathr This RB is to create data models based on proposal: https://microsoft-my.sharepoint.com/:w:/g/personal/djkim_linkedin_biz/EZspGt7jJlRAqHTICZg3UbcBgQQ_VncOgM48hKW--T8qkg?e=T4N3zw * Update models.py * Update models.py * Update models.py * Update models.py * Update models.py * Add attributes to data models Add data attributes to data models * Added _scproxy necessary for MacOS (#651) * Added _scproxy necessary for MacOS Signed-off-by: changyonglik <theeahlag@gmail.com> * Changed to conditional import Signed-off-by: changyonglik <theeahlag@gmail.com> * Added comments Signed-off-by: changyonglik <theeahlag@gmail.com> Signed-off-by: changyonglik <theeahlag@gmail.com> * Add docs for consuming features in online environment (#609) * Create consume-features.md * Update consume-features.md * rename docs * Update model-inference-with-feathr.md * Update README.md * update docs per feedback * Update streaming-source-ingestion.md * update docs * update docs * Update azure-deployment-arm.md * Update model-inference-with-feathr.md * add sign off message Signed-off-by: Xiaoyong Zhu xiaoyzhu@outlook.com * fix comments * Delete deploy-feathr-api-as-webapp.md * Update model-inference-with-feathr.md Signed-off-by: Xiaoyong Zhu xiaoyzhu@outlook.com * Clean up after moving to LFAI (#665) * Clean up after moving to LFAI Clean up after moving to LFAI * Update README.md * Updating docker version in ARM template to use latest release tagged image (#668) * Adding DevSkim linter to Github actions * Fix in ARM template to pull latest tagged release image from dockerhub * Removing dev skim file from this branch * Fixing linkedin org reference * Added prettier documentation (#672) * Added prettier documentation Signed-off-by: changyonglik <theeahlag@gmail.com> * Fixed prettier documentation Signed-off-by: changyonglik <theeahlag@gmail.com> Signed-off-by: changyonglik <theeahlag@gmail.com> * UI: Add data source detail page (#620) UI: Add data source detail page * Add aerospike sink (#632) * squash commit and avoid conflict * Revert legacy purview client issue * Fix typo * Remove auth from assink * Update aerospike guidance document * Chaneg port param to int * Remove reference to aerospike in sbt (#680) * Extend RBAC to support project id as input (#673) * extend rbac to support project id as input Signed-off-by: Yuqing Wei <weiyuqing021@outlook.com> * update registry docs and interface Signed-off-by: Yuqing Wei <weiyuqing021@outlook.com> * user name case sensitive hot fix Signed-off-by: Yuqing Wei <weiyuqing021@outlook.com> Signed-off-by: Yuqing Wei <weiyuqing021@outlook.com> * Local Spark Provider which supports to submit feature join job in local spark (#644) * local spark feature join job with local file * update local spark with udf support * add feature gen support in local spark * update test case * remove unused feature conf, update doc * expose master as input and refine local spark provider Signed-off-by: Yuqing Wei <weiyuqing021@outlook.com> * Fixing issue with docker image on demo apps not getting updated (#686) Fixes #685 Look at the screenshot in the issue with the fixes. Basically it seems for dockerhub images, we don't need to pass in the full URL (domain name) for the image name while publishing them to webapps. * Lock python dependency versions (#690) * Update setup.py * Update setup.py * Update setup.py * Apply 'aggregation_features' parameter to merge dataframes (#667) * Apply 'aggregation_features' parameter to merge dataframes * modify test cases * modify test case filter rule to keep same results as before * add typekey check and improve previous changes * merge to main and quick change * revert change by mistake * Apply this parameter to HDSF sink and add comments * quick fix * quick improve Co-authored-by: Enya-Yx <enya@v-ellinlu-2.fareast.corp.microsoft.com> Co-authored-by: enya-yx <enya@7633599-06281.northamerica.corp.microsoft.com> Co-authored-by: enya-yx <enya@LAPTOP-NBH6175C.redmond.corp.microsoft.com> * Fix data source detail page in rbac registry (#698) Signed-off-by: Yuqing Wei <weiyuqing021@outlook.com> * Fix multi-keyed feature in anchor (direct purview) (#676) When using old purview client (not registry client), if features inside an anchor has a same SET of keys , when calling get_feature_from_registry, only the first key will be used. This PR handles the situation where each feature has multiple keys, and the collection of keys are identical among features inside an anchor. * Fix path with #LATEST (#684) * Fix Feature value adaptor and UDF adaptor on Spark executors (#660) * Fix Feature value adaptor and UDF adaptor on Spark executors * Fix path with #LATEST * Add comments * Defer version bump * Enhance SQL Registry Error Messages (#674) Right now the SQL registry returns all errors as 500 Internal Error, this PR improves the error handling and returns 400/404/409 on corresponding criteria. Also it introduces an environment variable REGISTRY_DEBUGGING, the returned HTTP error will include the detailed track back info when it's set to non-empty string. This variable should only be used for debugging purposes. * bump version to 0.8.0 (#694) Signed-off-by: Yuqing Wei <weiyuqing021@outlook.com> Signed-off-by: Yuqing Wei <weiyuqing021@outlook.com> * Fix feature type bug where inferred feature type might not be honored when all the feature types are not provided (#701) * Update setup.py (#702) * fix rbac+purview web app issue (#700) Resolves #699 Root cause: Purview Registry starts too slow (than SQL registry) while RBAC layer add a dependency to its API in RBAC init which causes the web app crash Trials and Fix Trial: Add a sleep(60) command in start.sh will make the deployment successful Fix: Move the registry api dependency outside of RBAC init; Log the failure as Runtime Exception * Remove hard coded resources in docs (#696) Remove hard coded resources, such as synapse, app endpoint, redis and so on in docs to avoid causing confusion. * Add e2e test for purview registry and rbac registry (#689) * Add e2e test for purview registry and rbac registry * Add purview and rbac env e2e to registry tests * Fix merge issue * Update test use runtime jar from maven for spark submission to cover databricks (#706) * Enhance databricks submission error message (#710) Enhance databricks submission error message * Enhance purview registry error messages (#709) * Enhance purview registry error messages * Update doc for REGISTRY_DEBUGGING * hot fix databricks es dependency issue (#713) Signed-off-by: Yuqing Wei <weiyuqing021@outlook.com> Signed-off-by: Yuqing Wei <weiyuqing021@outlook.com> * Fix materialize to sql e2e test failure (#717) * Fix materialize to sql e2e test failure * Update sql server name * Add Data Models in Feathr (#659) * Add Data Models in Feathr This RB is to create data models based on proposal: https://microsoft-my.sharepoint.com/:w:/g/personal/djkim_linkedin_biz/EZspGt7jJlRAqHTICZg3UbcBgQQ_VncOgM48hKW--T8qkg?e=T4N3zw * Update models.py * Update models.py * Update models.py * Update models.py * Update models.py * Revert "Enhance purview registry error messages (#709)" (#720) This reverts commit 059f2b4. * Improve Avro GenericRecord and SpecificRecord based row-level extractor performance (#723) * 1. In SparkRowExtractor.scala, add new extractor method which can be extended to do batch preprocess source dataframe into RDD[IndexRecord]. 2. In FeatureTransformation.scala, add logic to extract features from RDD[IndexedRecord]. 3. Improve some error messages. * Save lookup feature definition to HOCON files (#732) * Fix function string parsing (#725) * Add version. Fix function string parsing Signed-off-by: Jun Ki Min <42475935+loomlike@users.noreply.github.com> * Add unit test Signed-off-by: Jun Ki Min <42475935+loomlike@users.noreply.github.com> * Add comments Signed-off-by: Jun Ki Min <42475935+loomlike@users.noreply.github.com> Signed-off-by: Jun Ki Min <42475935+loomlike@users.noreply.github.com> * Apply a same credential within each sample (#718) Co-authored-by: enya-yx <enya@v-ellinlu-2.fareast.corp.microsoft.com> * Enable incremental for HDFS sink (#695) * Enable incremental for HDFS sink * Add docstring * Add docs * minor fix * minor changes * quick fix Co-authored-by: enya-yx <enya@v-ellinlu-2.fareast.corp.microsoft.com> * #492 fix, fail only if different sources have same name (#733) * Remove unused credentials and deprecated purview settings (#708) * Remove unused credentials and deprecated purview settings * Revoke token submitted by mistaken (#730) * Update product_recommendation_demo.ipynb * Fix synapse errors not print out issue (#734) Co-authored-by: enya-yx <enya@v-ellinlu-2.fareast.corp.microsoft.com> * Spark config passing bug fix for local spark submission (#729) * Fix local spark output file-format bug Signed-off-by: Jun Ki Min <42475935+loomlike@users.noreply.github.com> * Add dev dependencies. Add unit-test for local spark job launcher Signed-off-by: Jun Ki Min <42475935+loomlike@users.noreply.github.com> * Fix local spark submission unused param error Signed-off-by: Jun Ki Min <42475935+loomlike@users.noreply.github.com> Signed-off-by: Jun Ki Min <42475935+loomlike@users.noreply.github.com> * Fix direct purview client missing transformation (#736) * Revert "Derived feature bugfix (#121)" (#731) This reverts commit fa645f3. * Support SWA with groupBy to 1d tensor conversion (#748) * Support SWA with groupby to 1d tensor conversion * Rijai/armfix (#742) * Adding DevSkim linter to Github actions * Update docker-publish.yml * Update docker-publish.yml * Update docker-publish.yml * Update docker-publish.yml * Update docker-publish.yml * Update docker-publish.yml * Update docker-publish.yml * Update docker-publish.yml * Making ARM instructions for Owner role and AAD App more clear * Removing devskim file * Reverting the changes to docker file to match with feathr/main * bump version to 0.8.2 (#722) Signed-off-by: Yuqing Wei <weiyuqing021@outlook.com> Signed-off-by: Yuqing Wei <weiyuqing021@outlook.com> * Added latest deltalake version (#735) * Added latest deltalake version * Changed == to <= for deltalake installation * Changed <= to >= * #474 Disable local mode (#738) * Allow recreating entities for PurView registry (#691) * Allow recreating entities for PurView registry * Use constants * Adding DevSkim linter to Github actions (#657) * Adding DevSkim linter to Github actions * Ignoring .git and test folder * Fix icons in UI cannot auto scale (#737) (#744) * Fix icons in UI cannot auto scale (#737) * Fix home.css code style issue * Expose 'timePartitionPattern' in Python API [ WIP ] (#714) * Expose 'timePartitionPattern' * add test case * Add test cases and docstring * delete local files * quick fix Co-authored-by: enya-yx <enya@LAPTOP-NBH6175C.redmond.corp.microsoft.com> Co-authored-by: enya-yx <enya@v-ellinlu-2.fareast.corp.microsoft.com> * Setting up component governance pipeline (#655) [skip ci] * Add docs to explain on feature materialization behavior (#688) * Update materializing-features.md * Update materializing-features.md * Fix protobuf version (#711) * Fix protobuf version * quick fix Co-authored-by: enya-yx <enya@v-ellinlu-2.fareast.corp.microsoft.com> * Add some notes based on on-call issues (#753) * Add some notes based on on-call issues * quick fix Co-authored-by: enya-yx <enya@v-ellinlu-2.fareast.corp.microsoft.com> * Refine spark runtime error message (#755) Signed-off-by: Yuqing Wei <weiyuqing021@outlook.com> * Serialization bug due to version incompatibility between azure-core and msrest (#763) * Adding DevSkim linter to Github actions * Fix in ARM template to pull latest tagged release image from dockerhub * Removing dev skim file from this branch * Fixing linkedin org reference * Removing the docker index url from dockerhub image name as it seems to cause problem with the update * Adding to the right file, had a dockerhub workflow file with different name * Adding debug statements to test udf issue on Synapse * Adding more print statements * Pinning msrest version to work with pinned version of azure-core * Removing debug code from previous branch * Unify Python SDK Build Version and decouple Feathr Maven Version (#746) * unify python package version and enable env setting for scala version Signed-off-by: Yuqing Wei <weiyuqing021@outlook.com> * update docs and decouple maven version Signed-off-by: Yuqing Wei <weiyuqing021@outlook.com> * change version back to 0.8.0 to avoid conflicts Signed-off-by: Yuqing Wei <weiyuqing021@outlook.com> * fix typo Signed-off-by: Yuqing Wei <weiyuqing021@outlook.com> Signed-off-by: Yuqing Wei <weiyuqing021@outlook.com> * replace hard code string in notebook and align with others (#765) Signed-off-by: Yuqing Wei <weiyuqing021@outlook.com> Signed-off-by: Yuqing Wei <weiyuqing021@outlook.com> * Add flag to enable generation non-agg features (#719) * Add flag to enable generation non-agg features * Typo * Resolve comments * rollback 0.8.2 version bump PR (#771) Signed-off-by: Yuqing Wei <weiyuqing021@outlook.com> * Refactor Product Recommendation sample notebook (#743) * Adding DevSkim linter to Github actions * Update docker-publish.yml * Update docker-publish.yml * Update docker-publish.yml * Update docker-publish.yml * Update docker-publish.yml * Update docker-publish.yml * Update docker-publish.yml * Update docker-publish.yml * Removing devskim file * Restructuring the Prod Reco sample * Adjusting headings * Minor changes * Removing changes to docker publish file * Addressing PR comments, moving Product recommendation notebook sample to Synapse folder since it is strongly tied to Synapse * Addressing PR comments * Fixing images * Removing the need to pass email id as we could directly compute object Id using az command, also making CLI instructions clearer that it is for advance users * Update role-management page in UI (#751) (#764) * Update role-management page in UI (#751) * fix home.css LF file * fix RoleForm eslint warning * remove import dayjs Signed-off-by: Boli Guan <ifendoe@gmail.com> * Change components to arrow function. Signed-off-by: Boli Guan <ifendoe@gmail.com> Signed-off-by: Boli Guan <ifendoe@gmail.com> * Create Feature less module in UI code and import alias (#768) * Add craco devDependencies Signed-off-by: Boli Guan <ifendoe@gmail.com> * Add classnames, @ant-design/icons,eslint-plugin.. dependencies. Signed-off-by: Boli Guan <ifendoe@gmail.com> * Update .editorconfig and .eslintrc * Update .editorconfig Signed-off-by: Boli Guan <ifendoe@gmail.com> Signed-off-by: Boli Guan <ifendoe@gmail.com> * Add dev and notebook dependencies. Add extra dependency installation to the test pipeline yml (#773) Signed-off-by: Jun Ki Min <42475935+loomlike@users.noreply.github.com> Signed-off-by: Jun Ki Min <42475935+loomlike@users.noreply.github.com> * Fix Windows compatibility issues (#776) * Update _databricks_submission.py * Update feathr-configuration-and-env.md * Update feathr-configuration-and-env.md * Update _databricks_submission.py * Update models.py * Update models.py * Address comments * Remove sourceRef * Add functionType Signed-off-by: changyonglik <theeahlag@gmail.com> Signed-off-by: Xiaoyong Zhu xiaoyzhu@outlook.com Signed-off-by: Yuqing Wei <weiyuqing021@outlook.com> Signed-off-by: Jun Ki Min <42475935+loomlike@users.noreply.github.com> Signed-off-by: Boli Guan <ifendoe@gmail.com> Co-authored-by: Chang Yong Lik <51813538+ahlag@users.noreply.github.com> Co-authored-by: Xiaoyong Zhu <xiaoyongzhu@users.noreply.github.com> Co-authored-by: Richin Jain <rijai@microsoft.com> Co-authored-by: Yihui Guo <yihgu@microsoft.com> Co-authored-by: Yuqing Wei <weiyuqing021@outlook.com> Co-authored-by: Enya-Yx <108409954+enya-yx@users.noreply.github.com> Co-authored-by: Enya-Yx <enya@v-ellinlu-2.fareast.corp.microsoft.com> Co-authored-by: enya-yx <enya@7633599-06281.northamerica.corp.microsoft.com> Co-authored-by: enya-yx <enya@LAPTOP-NBH6175C.redmond.corp.microsoft.com> Co-authored-by: Jinghui Mo <jmo@linkedin.com> Co-authored-by: 徐辰 <windoze@0d0a.com> Co-authored-by: Blair Chen <blrchen@users.noreply.github.com> Co-authored-by: Jun Ki Min <42475935+loomlike@users.noreply.github.com> Co-authored-by: Hangfei Lin <hnlin@linkedin.com> Co-authored-by: Boli Guan <ifendoe@gmail.com>
Description
Improve Avro GenericRecord and SpecificRecord based row-level extractor performance by supporting convert source dataframe into RDD[IndexedRecord] in batch manner.
Major changes: