
Commit

[FSTORE-345] Update documentation to reflect supported methods in hsfs engines (#814) (#815)

(cherry picked from commit 05c1481)
moritzmeister authored Oct 4, 2022
1 parent 9a44eca commit a0b1a89
Showing 7 changed files with 74 additions and 5 deletions.
6 changes: 6 additions & 0 deletions auto_doc.py
@@ -288,6 +288,12 @@
"bigquery_properties": keras_autodoc.get_properties(
"hsfs.storage_connector.BigQueryConnector"
),
"kafka_methods": keras_autodoc.get_methods(
"hsfs.storage_connector.KafkaConnector", exclude=["from_response_json"]
),
"kafka_properties": keras_autodoc.get_properties(
"hsfs.storage_connector.KafkaConnector"
),
},
"api/statistics_config_api.md": {
"statistics_config": ["hsfs.statistics_config.StatisticsConfig"],
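The `auto_doc.py` entries above wire template placeholders such as `{{kafka_methods}}` to API documentation generated by `keras_autodoc` from the `hsfs.storage_connector.KafkaConnector` class. As a rough illustration of the rendering step (a hypothetical `render_page` helper and sample data, not keras_autodoc's actual implementation), each placeholder in the markdown template is replaced by the generated text:

```python
# Simplified sketch of template placeholder substitution, in the spirit of
# rendering docs/templates/api/storage_connector_api.md. The render_page
# helper and the sample mapping values are illustrative, not hsfs/keras_autodoc API.

def render_page(template: str, mapping: dict) -> str:
    """Replace each {{key}} placeholder with its generated documentation string."""
    for key, value in mapping.items():
        template = template.replace("{{" + key + "}}", value)
    return template

template = (
    "## Kafka\n\n"
    "### Properties\n\n{{kafka_properties}}\n\n"
    "### Methods\n\n{{kafka_methods}}\n"
)
mapping = {
    "kafka_properties": "- `bootstrap_servers`",
    "kafka_methods": "- `read_stream()`",
}
page = render_page(template, mapping)
print(page)
```

In the real pipeline, `keras_autodoc.get_methods(..., exclude=["from_response_json"])` produces the method documentation that ends up in place of `{{kafka_methods}}`.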
12 changes: 11 additions & 1 deletion docs/templates/api/storage_connector_api.md
@@ -80,7 +80,6 @@ Read more about encryption on [Google Documentation.](https://cloud.google.com/s
The storage connector uses the Google `gcs-connector-hadoop` behind the scenes. For more information, check out [Google Cloud Storage Connector for Spark and Hadoop](
https://github.com/GoogleCloudDataproc/hadoop-connectors/tree/master/gcs#google-cloud-storage-connector-for-spark-and-hadoop 'google-cloud-storage-connector-for-spark-and-hadoop')


### Properties

{{gcs_properties}}
@@ -100,10 +99,21 @@ on service accounts and creating keyfile in GCP, read [Google Cloud documentatio
The storage connector uses the Google `spark-bigquery-connector` behind the scenes.
To read more about the spark connector, like the spark options or usage, check [Apache Spark SQL connector for Google BigQuery.](https://github.com/GoogleCloudDataproc/spark-bigquery-connector#usage
'github.com/GoogleCloudDataproc/spark-bigquery-connector')

### Properties

{{bigquery_properties}}

### Methods

{{bigquery_methods}}

## Kafka

### Properties

{{kafka_properties}}

### Methods

{{kafka_methods}}
8 changes: 8 additions & 0 deletions python/hsfs/constructor/query.py
@@ -86,6 +86,14 @@ def read(
It is possible to specify the storage (online/offline) to read from and the
type of the output DataFrame (Spark, Pandas, Numpy, Python Lists).
!!! warning "External Feature Group Engine Support"
    **Spark only**
    Reading a Query containing an External Feature Group directly into a
    Pandas DataFrame using Python/Pandas as the engine is not supported.
    However, you can use the Query API to create Feature Views/Training
    Data containing External Feature Groups.
# Arguments
online: Read from online storage. Defaults to `False`.
dataframe_type: DataFrame type to return. Defaults to `"default"`.
35 changes: 34 additions & 1 deletion python/hsfs/feature_group.py
@@ -1120,6 +1120,16 @@ def insert_stream(
[q.name for q in sqm.active]
```
!!! warning "Engine Support"
    **Spark only**
    Stream ingestion using Pandas/Python as the engine is currently not supported.
    Python/Pandas has no notion of streaming.
!!! warning "Data Validation Support"
    `insert_stream` does not perform any data validation using Great Expectations,
    even when an expectation suite is attached.
# Arguments
features: Features in Streaming Dataframe to be saved.
query_name: It is possible to optionally specify a name for the query to
@@ -1610,7 +1620,30 @@ def save(self):
self._statistics_engine.compute_statistics(self, self.read())

def read(self, dataframe_type="default"):
"""Get the feature group as a DataFrame.
!!! warning "Engine Support"
    **Spark only**
    Reading an External Feature Group directly into a Pandas DataFrame using
    Python/Pandas as the engine is not supported. However, you can use the
    Query API to create Feature Views/Training Data containing External
    Feature Groups.
# Arguments
dataframe_type: str, optional. Possible values are `"default"`, `"spark"`,
`"pandas"`, `"numpy"` or `"python"`, defaults to `"default"`.
# Returns
    `DataFrame`: The feature data, in the type specified by `dataframe_type`:
`pyspark.DataFrame`. A Spark DataFrame.
`pandas.DataFrame`. A Pandas DataFrame.
`numpy.ndarray`. A two-dimensional Numpy array.
`list`. A two-dimensional Python list.
# Raises
`RestAPIError`.
"""
engine.get_instance().set_job_group(
"Fetching Feature group",
"Getting feature group: {} from the featurestore {}".format(
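The `dataframe_type` argument documented above selects the return representation of `read()`. A minimal sketch of the kind of conversion this implies, using a hypothetical `convert_dataframe` helper over a pandas DataFrame (the real dispatch lives in the hsfs engine layer and also covers Spark DataFrames):

```python
# Hypothetical helper mirroring the documented return types of read().
# With the Python engine, "default" resolves to pandas; "spark" is omitted
# here because it requires a SparkSession.
import numpy as np
import pandas as pd

def convert_dataframe(df: pd.DataFrame, dataframe_type: str = "default"):
    dataframe_type = dataframe_type.lower()
    if dataframe_type in ("default", "pandas"):
        return df                      # pandas.DataFrame
    if dataframe_type == "numpy":
        return df.to_numpy()           # two-dimensional NumPy array
    if dataframe_type == "python":
        return df.to_numpy().tolist()  # two-dimensional Python list
    raise ValueError(f"Unknown dataframe_type: {dataframe_type}")

df = pd.DataFrame({"id": [1, 2], "amount": [10.0, 20.0]})
rows = convert_dataframe(df, "python")
print(rows)
```

Note that converting to `"numpy"` or `"python"` upcasts mixed-type columns to a common dtype, which is one reason the DataFrame forms are usually preferable.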
9 changes: 7 additions & 2 deletions python/hsfs/feature_view.py
@@ -1005,8 +1005,13 @@ def get_training_data(
Get training data from storage or feature groups.
!!! info
    If a materialised training dataset has been deleted, use `recreate_training_dataset()`
    to recreate the training data.
!!! warning "External Storage Support"
    Training data written to external storage using a Storage Connector other
    than S3 currently cannot be read through HSFS APIs with Python as the
    engine; use the storage's native client instead.
# Arguments
version: training dataset version
6 changes: 5 additions & 1 deletion python/hsfs/storage_connector.py
@@ -889,7 +889,11 @@ def read_stream(
):
"""Reads a Kafka stream from a topic or multiple topics into a Dataframe.
!!! warning "Engine Support"
    **Spark only**
    Reading from data streams using Pandas/Python as the engine is currently not supported.
    Python/Pandas has no notion of streaming.
# Arguments
topic: Name or pattern of the topic(s) to subscribe to.
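With Spark as the engine, `read_stream` resolves the Kafka connector's settings into options for Spark's `kafka` streaming source. A sketch of that translation step, with a hypothetical `build_kafka_options` helper and sample broker addresses (the option keys are Spark Structured Streaming's standard Kafka source options, but the helper itself is not hsfs API):

```python
# Illustrative translation of Kafka connector settings into Spark "kafka"
# source options. build_kafka_options and the sample values are hypothetical.

def build_kafka_options(bootstrap_servers, topic, security_protocol="SSL"):
    return {
        "kafka.bootstrap.servers": ",".join(bootstrap_servers),
        "kafka.security.protocol": security_protocol,
        "subscribe": topic,
    }

options = build_kafka_options(["broker1:9092", "broker2:9092"], "transactions")

# With a live SparkSession, these options would feed the streaming reader:
# df = spark.readStream.format("kafka").options(**options).load()
print(options)
```

This is why the method is Spark-only: the resulting object is a Spark streaming DataFrame, for which pandas has no equivalent.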
3 changes: 3 additions & 0 deletions python/hsfs/training_dataset.py
@@ -254,6 +254,9 @@ def save(
lists or Numpy ndarrays.
From v2.5 onward, filters are saved along with the `Query`.
!!! warning "Engine Support"
    Creating Training Datasets from DataFrames is only supported using Spark as the engine.
# Arguments
features: Feature data to be materialized.
write_options: Additional write options as key-value pairs, defaults to `{}`.
