Add docs to explain on feature materialization behavior (#688)
* Update materializing-features.md

* Update materializing-features.md
xiaoyongzhu committed Oct 18, 2022
1 parent 83b79c9 commit b036898
Showing 1 changed file with 50 additions and 0 deletions.
50 changes: 50 additions & 0 deletions docs/concepts/materializing-features.md
@@ -161,3 +161,53 @@ More reference on the APIs:

- [MaterializationSettings API](https://feathr.readthedocs.io/en/latest/feathr.html#feathr.MaterializationSettings)
- [HdfsSink API](https://feathr.readthedocs.io/en/latest/feathr.html#feathr.HdfsSource)

## Expected behavior on Feature Materialization

When end users materialize features to a sink, what is the expected behavior?

The question looks straightforward, but it is not. When end users materialize features, Feathr expects that a given entity key (say `user_id`) has one or more features attached to it (say `user_total_gift_card_balance` and `user_purchase_in_last_week`). Feathr therefore performs two checks:

1. All features must share the same entity key (say `user_id`). You cannot materialize features for two different entity keys in the same materialization job, although you can do so in separate jobs. For example, `user_total_purchase` and `product_sold_in_last_week` cannot be materialized in the same Feathr materialization job.
2. All features must be "aggregated" features, i.e. features whose transformation is a `WindowAggTransformation`, such as `product_sold_in_last_week` or `user_latest_total_gift_card_balance`.
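
The two checks can be sketched in plain Python as follows. This is only an illustration of the rules, not Feathr's implementation: `FeatureSpec` and `check_materializable` are hypothetical names introduced here, and the `is_window_agg` flag stands in for "the feature is defined with a `WindowAggTransformation`".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FeatureSpec:
    """Hypothetical stand-in for a feature definition (not a Feathr class)."""
    name: str
    key: str             # entity key column, e.g. "user_id"
    is_window_agg: bool  # True if the feature uses a window aggregation


def check_materializable(features):
    """Raise ValueError if the features cannot share one materialization job."""
    # Check 1: every feature must use the same entity key.
    keys = {f.key for f in features}
    if len(keys) > 1:
        raise ValueError(f"features span multiple entity keys: {sorted(keys)}")
    # Check 2: every feature must be an aggregated feature.
    non_agg = [f.name for f in features if not f.is_window_agg]
    if non_agg:
        raise ValueError(f"features are not aggregated: {non_agg}")


user_features = [
    FeatureSpec("user_total_gift_card_balance", "user_id", True),
    FeatureSpec("user_purchase_in_last_week", "user_id", True),
]
check_materializable(user_features)  # passes: one key, all aggregated

# Mixing in a product-keyed feature violates the first check.
mixed = user_features + [
    FeatureSpec("product_sold_in_last_week", "product_id", True)
]
try:
    check_materializable(mixed)
except ValueError as err:
    print(err)
```

Splitting `mixed` into one job per entity key would satisfy the first check again.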

The first constraint is straightforward to explain: the features in a materialization job describe aspects of one kind of entity, such as a user. A feature such as `product_sold_in_last_week` describes a product, so it would not make sense for users.

The second constraint is a bit more interesting. Suppose you have defined `user_total_gift_card_balance`, and its value for a given user changes over time, say 40, 30, 20, 20 over the last 4 days, as in the original data below:

| UserId | user_total_gift_card_balance | Date |
| ------ | ---------------------------- | ---------- |
| 1 | 40 | 2022/01/01 |
| 1 | 30 | 2022/01/02 |
| 1 | 20 | 2022/01/03 |
| 1 | 20 | 2022/01/04 |
| 2 | 40 | 2022/01/01 |
| 2 | 30 | 2022/01/02 |
| 2 | 20 | 2022/01/03 |
| 2 | 20 | 2022/01/04 |
| 3 | 40 | 2022/01/01 |
| 3 | 30 | 2022/01/02 |
| 3 | 20 | 2022/01/03 |
| 3 | 20 | 2022/01/04 |

However, the materialized features have no dates associated with them, i.e. the materialized result should look something like this:

| UserId | user_total_gift_card_balance |
| ------ | ---------------------------- |
| 1 | ? |
| 2 | ? |
| 3 | ? |

When you ask Feathr to "materialize" `user_total_gift_card_balance`, only one value per user can be materialized, since the materialized feature has no date associated with it. So the problem is: for a given `user_id`, only one `user_total_gift_card_balance` value can be its feature. Which of the 4 values should be chosen? A random value? The latest value?

It might be natural to think that "we should materialize the latest value". That behavior is, by definition, an "aggregation" operation, since there are 4 values for a given `user_id` but only one of them is materialized and used. Feathr therefore asks you to state explicitly that you want to materialize the latest value (i.e. by using a [Point-in-time Join](./point-in-time-join.md)):

```python
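# UserId is assumed to be a TypedKey for the user_id column, defined elsewhere.
from feathr import FLOAT, Feature, WindowAggTransformation

feature = Feature(name="user_total_gift_card_balance",
                  key=UserId,
                  feature_type=FLOAT,
                  # Keep only the most recent value within the 7-day window.
                  transform=WindowAggTransformation(agg_expr="gift_card_balance",
                                                    agg_func="LATEST",
                                                    window="7d"))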
```
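
To make the effect of a "latest" aggregation concrete, here is a minimal plain-Python sketch applied to the sample table above. It does not use Feathr and does not model the `window` parameter; `latest_per_key` is a hypothetical helper that only shows how several dated rows per user collapse into one value per user.

```python
from datetime import date

# Sample data from the table above: (user_id, balance, date).
balances = [40, 30, 20, 20]  # values for 2022/01/01 .. 2022/01/04
rows = [
    (user_id, balance, date(2022, 1, day))
    for user_id in (1, 2, 3)
    for day, balance in enumerate(balances, start=1)
]


def latest_per_key(rows):
    """Return {user_id: balance}, keeping the balance with the newest date."""
    latest = {}
    for user_id, balance, day in rows:
        if user_id not in latest or day > latest[user_id][1]:
            latest[user_id] = (balance, day)
    return {user_id: balance for user_id, (balance, _) in latest.items()}


materialized = latest_per_key(rows)
print(materialized)  # {1: 20, 2: 20, 3: 20}
```

Each user ends up with the single value from 2022/01/04, which is what fills the `?` cells in the materialized table.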
