Python UDF in Ingestion being used for feature validation #1234

Merged: 24 commits, Dec 22, 2020 (diff shows changes from 22 commits)
11 changes: 11 additions & 0 deletions .github/workflows/complete.yml
@@ -128,6 +128,10 @@ jobs:
java-version: '11'
java-package: jdk
architecture: x64
- uses: actions/setup-python@v2
with:
python-version: '3.6'
architecture: 'x64'
- uses: actions/cache@v2
with:
path: ~/.m2/repository
@@ -136,6 +140,13 @@
${{ runner.os }}-it-maven-
- name: Run integration tests
run: make test-java-integration
- name: Save report
uses: actions/upload-artifact@v2
if: failure()
with:
name: it-report
path: spark/ingestion/target/test-reports/TestSuite.txt
retention-days: 5

tests-docker-compose:
needs:
29 changes: 28 additions & 1 deletion .github/workflows/master_only.yml
@@ -76,4 +76,31 @@ jobs:
VERSION=${RELEASE_VERSION:1}
make build-java-no-tests REVISION=${VERSION}
gsutil cp ./spark/ingestion/target/feast-ingestion-spark-${VERSION}.jar gs://${PUBLISH_BUCKET}/spark/ingestion/
fi
fi

publish-ingestion-pylibs:
strategy:
matrix:
os: [ ubuntu-latest, macos-latest ]
python-version: [ 3.6, 3.7, 3.8 ]
env:
PUBLISH_BUCKET: feast-jobs
runs-on: ${{ matrix.os }}
steps:
- uses: actions/checkout@v2
- uses: GoogleCloudPlatform/github-actions/setup-gcloud@master
with:
version: '290.0.1'
export_default_credentials: true
project_id: ${{ secrets.GCP_PROJECT_ID }}
service_account_key: ${{ secrets.GCP_SA_KEY }}
- name: Set up Python
uses: actions/setup-python@v2
with:
python-version: ${{ matrix.python-version }}
- name: Create libs archive
env:
PY_VERSION: ${{ matrix.python-version }}
run: |
export PLATFORM=$(python -c 'import platform; print(platform.system().lower())')
./infra/scripts/build-ingestion-py-dependencies.sh "py${PY_VERSION}-$PLATFORM" gs://${PUBLISH_BUCKET}/spark/validation/
6 changes: 3 additions & 3 deletions Makefile
@@ -45,19 +45,19 @@ lint-java:
${MVN} --no-transfer-progress spotless:check

test-java:
${MVN} --no-transfer-progress test
${MVN} --no-transfer-progress -DskipITs=true test
Collaborator: Why? (I'm not super familiar with our Java test machinery; does it now skip some tests it didn't skip before?)

Collaborator Author (@pyalex, Dec 21, 2020):

TL;DR: mvn test runs only unit tests; everything else should be skipped.

We have two toggles, skipITs and skipUTs, meaning skip Integration Tests and skip Unit Tests; we need them to run those test suites separately. Before, there was no need for skipITs, since in the Maven pipeline the test phase comes before verify (which is used for ITs). But in the Spark module we have some additional steps (the generate-test-sources phase) that are required only by integration tests and are not needed by unit tests (see spark/ingestion/pom.xml). To skip those steps I added this flag here.
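The wiring is visible further down in this PR: skipITs is declared as a property in the root pom.xml and gates the Python UDF setup execution in spark/ingestion/pom.xml. A condensed sketch of the pattern, assembled from those snippets (not a complete pom):

<!-- pom.xml: defaults, overridable on the command line with -DskipUTs=true / -DskipITs=true -->
<properties>
  <skipUTs>false</skipUTs>
  <skipITs>false</skipITs>
</properties>

<!-- spark/ingestion/pom.xml: IT-only generate-test-sources step, skipped for unit-test runs -->
<execution>
  <id>Python UDF setup</id>
  <phase>generate-test-sources</phase>
  <configuration>
    <skip>${skipITs}</skip>
    <executable>./setup.sh</executable>
  </configuration>
</execution>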


test-java-integration:
${MVN} --no-transfer-progress -Dmaven.javadoc.skip=true -Dgpg.skip -DskipUTs=true clean verify

test-java-with-coverage:
${MVN} --no-transfer-progress test jacoco:report-aggregate
${MVN} --no-transfer-progress -DskipITs=true test jacoco:report-aggregate

build-java:
${MVN} clean verify

build-java-no-tests:
${MVN} --no-transfer-progress -Dmaven.javadoc.skip=true -Dgpg.skip -DskipUTs=true -Drevision=${REVISION} clean package
${MVN} --no-transfer-progress -Dmaven.javadoc.skip=true -Dgpg.skip -DskipUTs=true -DskipITs=true -Drevision=${REVISION} clean package

# Python SDK

17 changes: 17 additions & 0 deletions infra/scripts/build-ingestion-py-dependencies.sh
@@ -0,0 +1,17 @@
#!/usr/bin/env bash

PLATFORM=$1
DESTINATION=$2
PACKAGES=${PACKAGES:-"great-expectations==0.13.2 pyarrow==2.0.0"}

tmp_dir=$(mktemp -d)

pip3 install -t ${tmp_dir}/libs $PACKAGES

cd $tmp_dir
tar -czf pylibs-ge-$PLATFORM.tar.gz libs/
if [[ $DESTINATION == gs* ]]; then
gsutil cp pylibs-ge-$PLATFORM.tar.gz $DESTINATION
else
mv pylibs-ge-$PLATFORM.tar.gz $DESTINATION
fi
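A sketch of running the script locally: the package list can be overridden via PACKAGES, and a non-gs:// destination simply receives the archive via mv (the CI job above passes gs://feast-jobs/spark/validation/ instead).

# build the bundle for a chosen platform tag and keep it locally
PACKAGES="great-expectations==0.13.2 pyarrow==2.0.0" \
  ./infra/scripts/build-ingestion-py-dependencies.sh py3.7-linux /tmp
# result: /tmp/pylibs-ge-py3.7-linux.tar.gz containing a libs/ directory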
1 change: 1 addition & 0 deletions pom.xml
@@ -105,6 +105,7 @@
<parent.basedir>${maven.multiModuleProjectDirectory}</parent.basedir>

<skipUTs>false</skipUTs>
<skipITs>false</skipITs>
<feast.auth.providers.http.client.package.name>feast.common.auth.providers.http.client</feast.auth.providers.http.client.package.name>
</properties>

4 changes: 4 additions & 0 deletions sdk/python/feast/constants.py
@@ -222,6 +222,10 @@ class ConfigOptions(metaclass=ConfigMeta):
#: https://github.com/gojekfarm/stencil
STENCIL_URL: str = ""

#: If set to true, rows that do not pass custom validation (see feast.contrib.validation)
#: will not be saved to online storage
INGESTION_DROP_INVALID_ROWS = "False"

#: EMR cluster to run Feast Spark Jobs in
EMR_CLUSTER_ID: Optional[str] = None

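A hedged sketch of enabling the new option, assuming that, like the other ConfigOptions entries, it can be supplied through the SDK configuration; both forms below (the FEAST_-prefixed environment variable and the Client keyword argument) are illustrative assumptions, not confirmed API:

import os

from feast import Client

# environment-variable form (hypothetical name derived from the option key)
os.environ["FEAST_INGESTION_DROP_INVALID_ROWS"] = "True"

# or as a client option (also illustrative)
client = Client(ingestion_drop_invalid_rows="True")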
Empty file.
Empty file.
13 changes: 13 additions & 0 deletions sdk/python/feast/contrib/validation/base.py
@@ -0,0 +1,13 @@
import io

try:
from pyspark import cloudpickle
except ImportError:
raise ImportError("pyspark must be installed to enable validation functionality")


def serialize_udf(fun, return_type) -> bytes:
buffer = io.BytesIO()
command = (fun, return_type)
cloudpickle.dump(command, buffer)
return buffer.getvalue()
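To illustrate what serialize_udf produces, a minimal round-trip sketch; deserialization here uses the standard pickle module purely for inspection, whereas in practice the ingestion job unpickles the payload on the Spark side:

import pickle

import pandas as pd
from pyspark.sql.types import BooleanType

from feast.contrib.validation.base import serialize_udf


def all_non_negative(df: pd.DataFrame) -> pd.Series:
    # toy check: a row is valid when its "value" column is non-negative
    return df["value"] >= 0


payload = serialize_udf(all_non_negative, BooleanType())

# cloudpickle output is readable by the regular pickle module
fun, return_type = pickle.loads(payload)
print(fun(pd.DataFrame({"value": [1, -2, 3]})))  # True, False, True
print(return_type)                               # BooleanType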
126 changes: 126 additions & 0 deletions sdk/python/feast/contrib/validation/ge.py
@@ -0,0 +1,126 @@
import io
import json
from typing import TYPE_CHECKING
from urllib.parse import urlparse

import pandas as pd

from feast.constants import ConfigOptions
from feast.contrib.validation.base import serialize_udf
from feast.staging.storage_client import get_staging_client

try:
from great_expectations.core import ExpectationSuite
from great_expectations.dataset import PandasDataset
except ImportError:
raise ImportError(
"great_expectations must be installed to enable validation functionality. "
"Please install feast[validation]"
)

try:
from pyspark.sql.types import BooleanType
except ImportError:
raise ImportError(
"pyspark must be installed to enable validation functionality. "
"Please install feast[validation]"
)


if TYPE_CHECKING:
from feast import Client, FeatureTable


GE_PACKED_ARCHIVE = "https://storage.googleapis.com/feast-jobs/spark/validation/pylibs-ge-%(platform)s.tar.gz"
Collaborator: How does this work from a distribution perspective? I worry that, if this is not in any way tied to the Feast version, we can't upgrade GE without breaking older versions of the ingestion job.

Maybe we don't have to address this now, especially while this is contrib, but sometime down the road we probably need to pin the version of this tarball to a Feast version somehow.

Collaborator Author (@pyalex, Dec 21, 2020): I know that smells. But since the feature is experimental, I think it's OK for now. As an option for the future, we could put this archive inside the ingestion JAR or a Docker image (jobservice).

_UNSET = object()


class ValidationUDF:
def __init__(self, name: str, pickled_code: bytes):
self.name = name
self.pickled_code = pickled_code


def create_validation_udf(name: str, expectations: ExpectationSuite) -> ValidationUDF:
"""
Wraps your expectations into a Spark UDF.

Expectations should be generated & validated using a training dataset:
>>> from great_expectations.dataset import PandasDataset
>>> ds = PandasDataset.from_dataset(your_training_df)
>>> ds.expect_column_values_to_be_between('column', 0, 100)

>>> expectations = ds.get_expectation_suite()

Important: your expectations should pass on the training dataset; only successful checks
will be converted and stored in the ExpectationSuite.

Now you can create a UDF that will validate data during ingestion:
>>> create_validation_udf("myValidation", expectations)

:param name: name of the validation UDF
:param expectations: collection of expectations gathered on the training dataset
:return: ValidationUDF with serialized code
"""

def udf(df: pd.DataFrame) -> pd.Series:
ds = PandasDataset.from_dataset(df)
result = ds.validate(expectations, result_format="COMPLETE")
valid_rows = pd.Series([True] * df.shape[0])

for check in result.results:
if check.success:
continue

if check.exception_info["raised_exception"]:
# ToDo: probably we should mark all rows as invalid
continue

valid_rows.iloc[check.result["unexpected_index_list"]] = False

return valid_rows

pickled_code = serialize_udf(udf, BooleanType())
return ValidationUDF(name, pickled_code)


def apply_validation(
client: "Client",
feature_table: "FeatureTable",
udf: ValidationUDF,
validation_window_secs: int,
include_py_libs=_UNSET,
):
"""
Uploads the validation UDF code to the staging location and stores the path to the
UDF code and to the required Python libraries as FeatureTable labels.
"""
include_py_libs = (
include_py_libs if include_py_libs is not _UNSET else GE_PACKED_ARCHIVE
)

staging_location = client._config.get(ConfigOptions.SPARK_STAGING_LOCATION).rstrip(
"/"
)
staging_scheme = urlparse(staging_location).scheme
staging_client = get_staging_client(staging_scheme)

pickled_code_fp = io.BytesIO(udf.pickled_code)
remote_path = f"{staging_location}/udfs/{udf.name}.pickle"
staging_client.upload_fileobj(
pickled_code_fp, f"{udf.name}.pickle", remote_uri=urlparse(remote_path)
)

feature_table.labels.update(
{
"_validation": json.dumps(
dict(
name=udf.name,
pickled_code_path=remote_path,
include_archive_path=include_py_libs,
)
),
"_streaming_trigger_secs": str(validation_window_secs),
}
)
client.apply_feature_table(feature_table)
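Putting the pieces together, a hedged end-to-end sketch: the training DataFrame, the table name "driver_statistics", and the client settings are placeholders, and SPARK_STAGING_LOCATION must be configured for the upload inside apply_validation to succeed.

import pandas as pd
from great_expectations.dataset import PandasDataset

from feast import Client
from feast.contrib.validation.ge import apply_validation, create_validation_udf

# 1. Build an expectation suite from a training dataset (only passing checks are kept)
training_df = pd.DataFrame({"num_rides": [10, 20, 35, 50]})
ds = PandasDataset.from_dataset(training_df)
ds.expect_column_values_to_be_between("num_rides", 0, 100)
expectations = ds.get_expectation_suite()

# 2. Wrap the suite into a serialized Spark UDF
udf = create_validation_udf("validate_rides", expectations)

# 3. Upload the UDF and attach it to the feature table via labels
client = Client()  # placeholder: core URL, staging location, etc. come from config/env
feature_table = client.get_feature_table("driver_statistics")
apply_validation(client, feature_table, udf, validation_window_secs=30)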
7 changes: 7 additions & 0 deletions sdk/python/feast/pyspark/abc.py
@@ -307,6 +307,7 @@ def __init__(
statsd_port: Optional[int] = None,
deadletter_path: Optional[str] = None,
stencil_url: Optional[str] = None,
drop_invalid_rows: bool = False,
):
self._feature_table = feature_table
self._source = source
@@ -318,6 +319,7 @@
self._statsd_port = statsd_port
self._deadletter_path = deadletter_path
self._stencil_url = stencil_url
self._drop_invalid_rows = drop_invalid_rows

def _get_redis_config(self):
return dict(host=self._redis_host, port=self._redis_port, ssl=self._redis_ssl)
@@ -362,6 +364,9 @@ def get_arguments(self) -> List[str]:
if self._stencil_url:
args.extend(["--stencil-url", self._stencil_url])

if self._drop_invalid_rows:
args.extend(["--drop-invalid"])

return args


@@ -430,6 +435,7 @@ def __init__(
statsd_port: Optional[int] = None,
deadletter_path: Optional[str] = None,
stencil_url: Optional[str] = None,
drop_invalid_rows: bool = False,
):
super().__init__(
feature_table,
@@ -442,6 +448,7 @@
statsd_port,
deadletter_path,
stencil_url,
drop_invalid_rows,
)
self._extra_jars = extra_jars

2 changes: 2 additions & 0 deletions sdk/python/feast/pyspark/launcher.py
@@ -148,6 +148,7 @@ def _feature_table_to_argument(
for n in feature_table.entities
],
"max_age": feature_table.max_age.ToSeconds() if feature_table.max_age else None,
"labels": dict(feature_table.labels),
}


@@ -288,6 +289,7 @@ def get_stream_to_online_ingestion_params(
and client._config.getint(opt.STATSD_PORT),
deadletter_path=client._config.get(opt.DEADLETTER_PATH),
stencil_url=client._config.get(opt.STENCIL_URL),
drop_invalid_rows=client._config.get(opt.INGESTION_DROP_INVALID_ROWS),
)


1 change: 1 addition & 0 deletions sdk/python/feast/pyspark/launchers/aws/emr_utils.py
@@ -332,6 +332,7 @@ def _stream_ingestion_step(
],
"Args": ["spark-submit", "--class", "feast.ingestion.IngestionJob"]
+ jars_args
+ ["--conf", "spark.yarn.isPython=true"]
Collaborator: Is there any downside to doing this? Do you know why it isn't always set to true in Spark?

Collaborator Author: It's YARN-specific, and from what I found it is used mostly to enable distribution of Python-related files (like pyspark.zip) to YARN workers. It is set by spark-submit when the main file is a .py file, which is not the case for our IngestionJob.

Collaborator: Makes sense. Maybe add a comment there for future generations?

+ ["--packages", BQ_SPARK_PACKAGE, jar_path]
+ args,
"Jar": "command-runner.jar",
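For future generations (as requested above), a sketch of the command this step assembles: per the discussion, spark.yarn.isPython=true makes YARN distribute the Python runtime pieces (e.g. pyspark.zip) to workers even though the main artifact is a JAR, which the pickled validation UDF relies on. The jar, package, and trailing arguments below are placeholders.

spark-submit --class feast.ingestion.IngestionJob \
  --conf spark.yarn.isPython=true \
  --packages <BQ_SPARK_PACKAGE> <jar_path> \
  <ingestion arguments, e.g. --drop-invalid when INGESTION_DROP_INVALID_ROWS is enabled>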
2 changes: 2 additions & 0 deletions sdk/python/feast/pyspark/launchers/gcloud/dataproc.py
@@ -284,6 +284,8 @@ def dataproc_submit(
"spark.executor.instances": self.executor_instances,
"spark.executor.cores": self.executor_cores,
"spark.executor.memory": self.executor_memory,
"spark.pyspark.driver.python": "python3.6",
"spark.pyspark.python": "python3.6",
}

properties.update(extra_properties)
3 changes: 2 additions & 1 deletion sdk/python/requirements-ci.txt
@@ -20,4 +20,5 @@ pytest-lazy-fixture==0.6.3
pytest-timeout==1.4.2
pytest-ordering==0.6.*
pytest-mock==1.10.4
PyYAML==5.3.1
PyYAML==5.3.1
great-expectations==0.13.2
5 changes: 4 additions & 1 deletion sdk/python/setup.py
@@ -77,7 +77,10 @@
install_requires=REQUIRED,
# https://stackoverflow.com/questions/28509965/setuptools-development-requirements
# Install dev requirements with: pip install -e .[dev]
extras_require={"dev": ["mypy-protobuf==1.*", "grpcio-testing==1.*"]},
extras_require={
"dev": ["mypy-protobuf==1.*", "grpcio-testing==1.*"],
"validation": ["great_expectations==0.13.2", "pyspark==3.0.1"]
},
include_package_data=True,
license="Apache",
classifiers=[
23 changes: 23 additions & 0 deletions spark/ingestion/pom.xml
@@ -258,6 +258,11 @@
<goals>
<goal>test</goal>
</goals>
<configuration>
<reportsDirectory>${project.build.directory}/test-reports</reportsDirectory>
<junitxml>.</junitxml>
<filereports>WDF TestSuite.txt</filereports>
</configuration>
</execution>
</executions>
</plugin>
@@ -366,6 +371,24 @@
<skipTests>true</skipTests>
</configuration>
</plugin>
<plugin>
<artifactId>exec-maven-plugin</artifactId>
<groupId>org.codehaus.mojo</groupId>
<executions>
<execution>
<id>Python UDF setup</id>
<phase>generate-test-sources</phase>
<goals>
<goal>exec</goal>
</goals>
<configuration>
<skip>${skipITs}</skip>
<executable>./setup.sh</executable>
<workingDirectory>${basedir}/src/test/resources/python/</workingDirectory>
</configuration>
</execution>
</executions>
</plugin>
</plugins>
<pluginManagement>
<plugins>