Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Clear Feathr UDF state and configuration template in work directory #557

Merged
merged 6 commits into from
Aug 8, 2022

Conversation

xiaoyongzhu
Copy link
Member

@xiaoyongzhu xiaoyongzhu commented Aug 7, 2022

The first issue

In the existing code, we didn't remove generated_feathr_pyspark_metadata when building the features. This is problematic in a few end users, in particular: If UDFs are defined and then are removed, since this file is not cleared, the code will still think there are UDFs, which will either yield wrong results, or exist incorrectly.

This PR makes sure that we remove generated_feathr_pyspark_metadata and a few UDF files every time users build features.

The second issue (#559 )

This is an issue which isn't very obvious. Sometimes after running get_offline_features, then running materialize_features API, the materialize_features API will not be successful, and in many cases there's no values in the online store such as Redis.

This only happens when using databricks.

This is caused by the fact that if the databricks configuration is not a string (i.e. end users use a dict to provide all the required configurations), then there's a line in the code

submission_params = self.config_template

Since self.config_template is a dict, this is actually a reference rather than a copy of self.config_template. In the code later, submission_params will be modified and the value will be carried over across jobs, which will cause different jobs share the same state, and will cause unexpected behaviors.

Other issues

This PR also fixes a few OS compatibility issues (when parsing paths we always assume it's Linux style which isn't true), and fix a few typos.

@xiaoyongzhu xiaoyongzhu changed the title Remove udf state in feathr work directory Clear Feathr UDF state in work directory Aug 7, 2022
Yuqing-cat
Yuqing-cat previously approved these changes Aug 7, 2022
@xiaoyongzhu xiaoyongzhu changed the title Clear Feathr UDF state in work directory Clear Feathr UDF state and configuration template in work directory Aug 8, 2022
@xiaoyongzhu xiaoyongzhu merged commit 443d881 into main Aug 8, 2022
@xiaoyongzhu xiaoyongzhu deleted the xiaoyzhu/fix_udf_cannot_find branch August 8, 2022 15:01
ahlag pushed a commit to ahlag/feathr that referenced this pull request Aug 26, 2022
…eathr-ai#557)

* Remove udf state in feathr work directory

* Update _preprocessing_pyudf_manager.py

* Update _preprocessing_pyudf_manager.py

* update test

* fix join/gen job mix issues

* Update _preprocessing_pyudf_manager.py
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Feathr get_offline_features API and materialize_features will share the state
3 participants