From 4355497bc894a40f80518d321c77a1cc66d06353 Mon Sep 17 00:00:00 2001
From: Fabio Buso
Date: Thu, 6 Aug 2020 00:28:52 +0200
Subject: [PATCH 1/2] add readme before making the repository public

---
 README.md | 61 ++++++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 60 insertions(+), 1 deletion(-)

diff --git a/README.md b/README.md
index 5d66dda8b9..3613c573d0 100644
--- a/README.md
+++ b/README.md
@@ -1,3 +1,62 @@
 Python feature store API.
+=========================
 
-Based on design doc: https://git.logicalclocks.com/logicalclocks/ft_api_v2/blob/master/dsl_design_doc.md
\ No newline at end of file
+HSFS is the new library for interacting with the Hopsworks Feature Store. It makes it easier to create new features, feature groups and training datasets.
+
+The library can be used in two modes:
+- Spark mode: for data engineering jobs that create and write features into the feature store or generate training datasets. It requires a Spark environment, such as the one provided by the Hopsworks platform or Databricks. In Spark mode, HSFS provides bindings for both Python and JVM languages.
+
+- Python mode: for data science jobs that explore the features available in the feature store, generate training datasets and feed them into a training pipeline. Python mode requires only a Python interpreter and can be used from Python jobs and Jupyter kernels in Hopsworks, as well as from Amazon SageMaker and KubeFlow.
+
+The library automatically configures itself based on the environment in which it is run.
+
+You can read more about the Hopsworks Feature Store and its concepts [here](https://hopsworks.readthedocs.io).
+
+Getting Started
+---------------
+
+Instantiate a connection and get the project feature store handle:
+```python
+import hsfs
+
+connection = hsfs.connection()
+fs = connection.get_feature_store()
+```
+
+Create a new feature group:
+```python
+rain_fg = fs.create_feature_group("rain",
+                                  version=1,
+                                  description="Rain features",
+                                  primary_key=['date', 'location_id'],
+                                  online_enabled=True)
+
+rain_fg.save(dataframe)  # dataframe: a Spark or pandas DataFrame holding the rain features
+```
+
+Join features together (`temperature_fg` and `location_fg` are additional feature groups created or retrieved in the same way):
+```python
+feature_join = (rain_fg.select_all()
+                .join(temperature_fg.select_all(), ["date", "location_id"])
+                .join(location_fg.select_all()))
+
+feature_join.show(5)
+```
+
+Use the query object to create a training dataset:
+```python
+td = fs.create_training_dataset("training_dataset",
+                                version=1,
+                                data_format="tfrecords",
+                                description="A test training dataset saved in TFRecord format",
+                                splits={'train': 0.7, 'test': 0.2, 'validate': 0.1})
+
+td.save(feature_join)
+```
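+
+In Python mode you can also explore features that already exist in the feature store. As a minimal, illustrative sketch (reusing the `fs` handle and the `rain` feature group from above; `get_feature_group` and `select` are the read-side counterparts of the calls shown earlier):
+```python
+# retrieve an existing feature group and preview a few rows
+rain_fg = fs.get_feature_group("rain", version=1)
+rain_fg.show(5)
+
+# select a subset of its features and read them into a dataframe
+df = rain_fg.select(["date", "location_id"]).read()
+```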
+
+You can find more examples on how to use the library in our [hops-examples](https://github.com/logicalclocks/hops-examples) repository.
+
+Issues
+------
+
+Please report any issues using the [GitHub issue tracker](https://github.com/logicalclocks/feature-store-api/issues)
\ No newline at end of file

From f0b86562fe0153f509bcc3b16112ea85c1c4e3e3 Mon Sep 17 00:00:00 2001
From: Fabio Buso
Date: Thu, 6 Aug 2020 10:14:54 +0200
Subject: [PATCH 2/2] Add training dataset feed to readme

---
 README.md | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/README.md b/README.md
index 3613c573d0..58fecc3b8c 100644
--- a/README.md
+++ b/README.md
@@ -54,6 +54,12 @@ td = fs.create_training_dataset("training_dataset",
 td.save(feature_join)
 ```
 
+Feed the training dataset to a TensorFlow model:
+```python
+train_input_feeder = td.feed(target_name='label', split='train', is_training=True)
+train_input = train_input_feeder.tf_record_dataset()
+```
+
 You can find more examples on how to use the library in our [hops-examples](https://github.com/logicalclocks/hops-examples) repository.
 
 Issues