
FeatureStore abstraction definition #4534

Merged · 23 commits · Apr 29, 2022
Conversation

mananshah99
Contributor

Defines a FeatureStore abstraction and associated tests as a first step toward allowing independent scale-out of a graph's features and its edge_index. A roadmap for the remaining features and implementation details will follow shortly. More details:

  • Defines FeatureStore(MutableMapping) with basic CRUD operations as abstract methods as well as support for advanced indexing
  • Defines TensorAttr and AttrView as classes that aid in accessing, modifying, and viewing elements in a feature store
  • Defines CastMixin for straightforward TensorAttr creation
  • Defines a basic (incomplete) feature store implementation MyFeatureStore along with associated tests to showcase API and indexing functionality
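The shape of the abstraction described above can be sketched as follows. This is a minimal, dependency-free sketch: the class names (`FeatureStore`, `TensorAttr`) and the CRUD surface follow the PR description, but the exact signatures are illustrative, and plain Python objects stand in for `torch.Tensor`.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class TensorAttr:
    # Identifies a tensor in the store; which fields are required is
    # left to concrete implementations.
    group_name: Optional[str] = None
    attr_name: Optional[str] = None
    index: Any = None


class FeatureStore(ABC):
    # Public CRUD surface; concrete stores implement the underscored
    # hooks against their own backend.
    @abstractmethod
    def _put_tensor(self, tensor: Any, attr: TensorAttr) -> bool:
        ...

    @abstractmethod
    def _get_tensor(self, attr: TensorAttr) -> Optional[Any]:
        ...

    @abstractmethod
    def _remove_tensor(self, attr: TensorAttr) -> bool:
        ...

    def put_tensor(self, tensor: Any, attr: TensorAttr) -> bool:
        return self._put_tensor(tensor, attr)

    def get_tensor(self, attr: TensorAttr) -> Optional[Any]:
        return self._get_tensor(attr)

    def remove_tensor(self, attr: TensorAttr) -> bool:
        return self._remove_tensor(attr)
```

A concrete store then only fills in the three underscored hooks, e.g. with a dict for an in-memory implementation.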


codecov bot commented Apr 25, 2022

Codecov Report

Merging #4534 (23d4e58) into master (85fc32e) will increase coverage by 0.05%.
The diff coverage is 88.74%.

@@            Coverage Diff             @@
##           master    #4534      +/-   ##
==========================================
+ Coverage   82.71%   82.77%   +0.05%     
==========================================
  Files         313      315       +2     
  Lines       16361    16512     +151     
==========================================
+ Hits        13533    13667     +134     
- Misses       2828     2845      +17     
Impacted Files Coverage Δ
torch_geometric/utils/mixin.py 85.71% <85.71%> (ø)
torch_geometric/data/feature_store.py 88.80% <88.80%> (ø)
torch_geometric/typing.py 100.00% <100.00%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@mananshah99 mananshah99 requested a review from yaoyaowd April 25, 2022 21:16

@yaoyaowd yaoyaowd left a comment


I put two thoughts in the code. I think what we wrote is cool and Pythonic, but it is not easy to follow for someone reading this code for the first time.
I don't have a strong opinion on changing the style now, but I want to hear people's thoughts.


@rusty1s rusty1s left a comment


Thanks for the PR. The nesting of TensorAttr looks pretty cool. Please see my comments below (especially about some confusion I have regarding index). I am not really sure why we store index in the first place. It seems to be because we require TensorAttr to be fully specified. On the other hand, it doesn't make much sense to require it in case the user wants to access the full matrix. Originally, I would have thought that it makes sense to include the tensor inside TensorAttr, and move any index logic to TensorAttr instead:

class FeatureStore:
    def put_tensor(self, tensor, name, group):
        self._put_tensor(TensorAttr(tensor, name, group))

where TensorAttr manages any index logic:

class TensorAttr:
    def __getitem__(self, idx: IndexType):
        return self.tensor[idx]
    
    def __call__(self):
        return self.tensor

This would then result in an API similar to

x_full = store['x', 'paper']()
x_selected = store['x', 'paper'][index]

Also agree with @yaoyaowd that it is pretty hard to understand on first read (although BaseStorage has the same problem).
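As a toy, dict-backed sketch of this alternative (illustrative only: plain lists stand in for tensors, and the `(name, group)` key layout is an assumption of the sketch, not the PR's API):

```python
from dataclasses import dataclass
from typing import Any, Dict, Tuple


@dataclass
class TensorAttr:
    # In this alternative, the attribute carries the tensor itself and
    # owns all indexing logic.
    tensor: Any  # a torch.Tensor in PyG; plain lists work for the sketch
    name: str
    group: str

    def __getitem__(self, idx):
        return self.tensor[idx]

    def __call__(self):
        return self.tensor


class FeatureStore:
    def __init__(self):
        self._attrs: Dict[Tuple[str, str], TensorAttr] = {}

    def put_tensor(self, tensor, name, group):
        self._attrs[(name, group)] = TensorAttr(tensor, name, group)

    def __getitem__(self, key) -> TensorAttr:
        name, group = key
        return self._attrs[(name, group)]
```

With this, `store['x', 'paper']()` returns the full tensor and `store['x', 'paper'][index]` a selection, matching the API proposed above.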


rusty1s commented Apr 26, 2022

There exist two other alternatives for handling index, as far as I can tell:

  • Simply return the full matrix in case index is None. This way, store[attr_name, index] is required rather than doing store[attr_name][index].
  • Allow both : and tensors to model indices, e.g., store[attr_name, :], and store[attr_name][:]. I think this aligns well with any PyTorch/numpy logic.
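The second alternative boils down to normalizing whatever arrives in `__getitem__` onto one internal index representation. A small sketch (the helper name `normalize_index` is an assumption, not part of the PR):

```python
def normalize_index(index, num_rows):
    # Map all supported index forms onto one internal representation
    # (a list of row positions). A bare ':' arrives here as slice(None).
    if index is None:
        return list(range(num_rows))          # full matrix
    if isinstance(index, slice):
        return list(range(num_rows))[index]   # ':' or any other slice
    return list(index)                        # tensor / sequence of indices
```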


rusty1s commented Apr 26, 2022

Two additional thoughts:

  • I think the FeatureStore needs to provide the capability to determine which attributes in TensorAttr are required. For example, for in-memory stores, we probably want to make index optional such that store[group_name, attr_name] always returns the tensor. For torch_geometric.data.Data, we also may want to make group_name optional/disallow its usage.
  • FeatureStore may need to provide the capability to determine the order of attributes. For homogeneous data, we probably want to allow access via store[attr_name, index], while for heterogeneous data, access via store[group_name, attr_name, index] feels more natural to me (this is likely a P1).
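Both capabilities could be expressed as class-level metadata that concrete stores override. A hypothetical sketch (the attribute names `required_attrs` and `attr_order` and the `_to_attr` helper are assumptions for illustration, not the PR's API):

```python
from dataclasses import dataclass
from typing import Any, Optional, Tuple


@dataclass
class TensorAttr:
    group_name: Optional[str] = None
    attr_name: Optional[str] = None
    index: Any = None


class FeatureStore:
    # Concrete stores override these to declare which key fields are
    # required and how positional keys are interpreted.
    required_attrs: Tuple[str, ...] = ('group_name', 'attr_name')
    attr_order: Tuple[str, ...] = ('group_name', 'attr_name', 'index')

    def _to_attr(self, key) -> TensorAttr:
        key = key if isinstance(key, tuple) else (key,)
        attr = TensorAttr(**dict(zip(self.attr_order, key)))
        missing = [f for f in self.required_attrs if getattr(attr, f) is None]
        if missing:
            raise ValueError(f'Key is missing required attributes: {missing}')
        return attr


class HomogeneousStore(FeatureStore):
    # Homogeneous data: no group name; keys read as (attr_name, index).
    required_attrs = ('attr_name',)
    attr_order = ('attr_name', 'index')
```

Here `store[attr_name, index]` works on the homogeneous subclass while the base (heterogeneous) layout stays `(group_name, attr_name, index)`.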

@yaoyaowd

@rusty1s do the group_name and attr_name in your comment match node_type and tensor_type in the code?
If yes, I agree group_name is better, and it covers edge_type too.

I also feel that directly using store[node_type, tensor_type, index], store[node_type, tensor_type, :], and store[node_type, tensor_type] to refer to tensors is cleaner. In this way, maybe we don't even need TensorAttr.

And if necessary, we can define store.view(...) to get a view instead of a tensor.

I think it is better to simplify the interface now and iterate fast to add back the additional features we need.


mananshah99 commented Apr 26, 2022

@yaoyaowd @rusty1s thank you both for the great discussion. Let me summarize some thoughts and the resulting design changes.

  • Requiring an index. It seems clear that we do not want index to be required. The initial motivation here was to define a TensorAttr object as fully specified when index is present, but I understand the reasons to avoid this. While large distributed feature stores may choose to raise an exception if a tensor is queried without an index, as @rusty1s pointed out this should be the job of implementor classes, which should be able to set required attributes.
  • TensorAttr and AttrView. There seems to be some debate about TensorAttr and AttrView. My opinion is the following:
    • TensorAttr should define all parameters necessary to uniquely identify a tensor from the feature store. In particular, this includes parameters such as the index, tensor type, node type, etc. (of course, the required nature of these parameters is up to the feature store). It can be thought of as a protobuf or other serializable data format that can be sent to a store over-the-wire to perform put or get calls.
    • AttrView defines a view on the feature store (emphasizing view due to its additional _store parameter that references the feature store it points to) that allows for indexing. For this reason, I think it is more natural to implement indexing logic in AttrView and not TensorAttr. As @yaoyaowd pointed out, we can allow users to access an AttrView with a store.view(...) method that explicitly returns an AttrView.
  • Handling advanced indexing. There are a few thoughts on how we can implement advanced indexing in a cleaner way. I agree with the sentiment here, as the current set-up can be a bit hard to read and understand from a user perspective. My opinion is the following:
    • Fully-specified forms. We should be able to support the following fully-specified forms: store[node_type, tensor_type, index], store[node_type, tensor_type, index_slice], and store[node_type].tensor_type[index]. Index slices will automatically be converted to the internal index representation we use. If a form is fully-specified, a Tensor/None will be returned or an Exception will be thrown.
    • Partially-specified forms. In order to support the last format, which begins with a partially-specified form (store[node_type]) and adds further specification via __getattr__, we will need to have store[node_type] return an AttrView. More generally, we should have non-fully specified forms always return an AttrView. To explicitly perform a GET call with a non-fully specified form, a user will have to make use of __call__.
    • In summary, we will have the following API, where all indexing and __call__ will be handled by AttrView.
    # Fully-specified forms, produce a Tensor output
    store[node_type, tensor_type, index] 
    store[node_type, tensor_type, None]
    store[node_type, tensor_type, :]
    store[node_type].tensor_type[:]
    
    # Partially-specified forms, produce an AttrView object
    store[node_type]
    store[node_type].tensor_type
    
    # Partially-specified forms, when called, produce a Tensor output
    # from the `TensorAttr` that has been partially specified.
    store[node_type]()
    store[node_type].tensor_type()
  • Naming. There were a couple of minor points made about attribute naming. I think group_name makes sense, but tensor_type is more specific, whereas attr_name feels a bit too general and confusing. @rusty1s if you have any strong opinions here, happy to adjust.

I think this proposal will address many of the main design decisions we discussed above and result in a streamlined, Pythonic API. Please let me know if you have any comments or thoughts!
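Put together, the summarized API can be sketched as a toy in-memory store (dict-backed, with plain lists standing in for tensors; the names `put` and `_fetch` are illustrative, not the PR's API):

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class TensorAttr:
    group_name: Optional[str] = None
    attr_name: Optional[str] = None
    index: Any = None


class AttrView:
    # A view on the store that accumulates attributes until the key is
    # specified enough to fetch; __call__ fetches with the current key.
    def __init__(self, store, attr):
        self._store, self._attr = store, attr

    def __getattr__(self, name):
        if name.startswith('_'):
            raise AttributeError(name)
        attr = self.__dict__['_attr']
        return AttrView(self.__dict__['_store'],
                        TensorAttr(attr.group_name, name, attr.index))

    def __getitem__(self, index):
        return self._store._fetch(self._attr.group_name,
                                  self._attr.attr_name, index)

    def __call__(self):
        return self._store._fetch(self._attr.group_name,
                                  self._attr.attr_name, None)


class FeatureStore:
    def __init__(self):
        self._data = {}

    def put(self, tensor, group_name, attr_name):
        self._data[(group_name, attr_name)] = tensor

    def _fetch(self, group_name, attr_name, index):
        tensor = self._data[(group_name, attr_name)]
        if index is None:
            return tensor                     # full tensor
        if isinstance(index, slice):
            return tensor[index]              # ':' arrives as slice(None)
        return [tensor[i] for i in index]     # sequence of row indices

    def __getitem__(self, key):
        if not isinstance(key, tuple):
            # Partially specified: return a view for further chaining.
            return AttrView(self, TensorAttr(group_name=key))
        group_name, attr_name, index = key    # fully-specified 3-tuple
        return self._fetch(group_name, attr_name, index)
```

Fully-specified 3-tuple keys return tensors directly, while `store[group_name]` returns an AttrView that can be chained via `__getattr__`, indexed, or called.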


rusty1s commented Apr 27, 2022

I like it very much. Whenever something is not fully-specified, we return a View, and what qualifies as fully-specified depends on the FeatureStore implementation (where both group and index may be optional). Let me know when this PR is ready for another round of reviews :)

@rusty1s rusty1s closed this Apr 27, 2022
@rusty1s rusty1s reopened this Apr 27, 2022
@mananshah99

@rusty1s @yaoyaowd changes have been made to align with the discussion above. I've tried to add clear documentation for all of the builtins to improve code readability, and have also added tests that showcase functionality. @rusty1s, both points you listed above regarding the FeatureStore's interactions with TensorAttr (determining required nature of attributes and determining the order of attributes) have been implemented and are tested as well.

Let me know what you think!


@rusty1s rusty1s left a comment


This looks way better now. Thanks for updating. Left a bunch of comments nonetheless :P

In torch_geometric/data/feature_store.py:

        key = self._attr_cls.cast(key)
        self.put_tensor(value, key)

    def __getitem__(self, key: TensorAttr):
Member:
getattr and setattr equivalents?

Contributor Author:
I don't think a FeatureStore should implement getattr and setattr; imo, these should only be implemented for views on the store. I don't think it's particularly clean to have

store.group_name -> AttrView(group_name=group_name)

as this syntax seems more confusing to me than clarifying.

@rusty1s (Member), Apr 28, 2022:

Thought it makes sense to implement for stores without group names (like PyG data objects). We could require that the output is fully specified in case we allow it.

Contributor Author:

Isn't the only way for an output to be fully specified through getattr to allow for chaining, which necessitates that getattr can return an AttrView? That feels odd to me.

Member:

If TensorAttr sets both group_name and index to None by default, then store.{attr_name} should give you a tensor. I am okay with leaving this out for now.

@mananshah99

@rusty1s as always, thank you for the detailed comments :) Cleaned up the interface and builtins more. A couple general notes I wanted to share:

  • On UNSET fields in TensorAttr. I've spent some more time thinking about this, and I think we should proceed as follows.
    • We need some sort of tombstone in TensorAttr to indicate which attributes have been set; this is important both when checking for full specification and when implementing __getattr__ and __getitem__ in AttrView (which choose the attribute they will be setting based on the first UNSET attribute).
    • As a result, I think that all fields in TensorAttr should start as UNSET. If a field is not being used by a TensorAttr subclass, it can default that field to None and move it to the end of the __init__ list (there should be no modification of is_fully_specified, which just checks to see whether there are any UNSET values). Users may set fields however they like (with None or with any other value). It is up to the feature store implementation to handle these values properly.
    • Partially-specified views, when called, will set all UNSET values to None and pass this TensorAttr to the get_tensor method. This way, feature store implementors will never have to interact with UNSET values.
    • I think the resulting implementation is very clean and reduces the number of moving parts: all TensorAttr attributes start as UNSET, they can be set by the user (or default to None if a partially-specified view is called), and subclasses of TensorAttr just need to modify __init__ and the default values of attributes they are not using.
  • On subclassing TensorAttr. I have a slight bias towards enforcing that attr_class subclasses TensorAttr (instead of being any arbitrary class), due to the resulting simplicity of implementation (for both the user and the implementor). If we find the need to implement different TensorAttr classes for each feature store, this is of course supported (so long as the class has is_fully_specified and fully_specify methods), but I would avoid this for the time being.
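The UNSET tombstone described above can be sketched with a module-level sentinel. A minimal sketch (the sentinel spelling and method bodies are assumptions; only the names is_fully_specified and fully_specify come from the discussion):

```python
from dataclasses import dataclass, fields
from typing import Any


class _Unset:
    # Sentinel distinguishing "never set" from an explicit None.
    def __repr__(self):
        return 'UNSET'


UNSET = _Unset()


@dataclass
class TensorAttr:
    group_name: Any = UNSET
    attr_name: Any = UNSET
    index: Any = UNSET

    def is_set(self, key) -> bool:
        return getattr(self, key) is not UNSET

    def is_fully_specified(self) -> bool:
        return all(self.is_set(f.name) for f in fields(self))

    def fully_specify(self) -> 'TensorAttr':
        # Called on partially-specified views: any remaining UNSET
        # field collapses to None before reaching the store.
        for f in fields(self):
            if not self.is_set(f.name):
                setattr(self, f.name, None)
        return self
```

Note that an explicit None counts as set, so feature store implementors never see UNSET values.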


@rusty1s rusty1s left a comment


Thanks for all the updates :)

@rusty1s rusty1s merged commit 484b792 into master Apr 29, 2022
@rusty1s rusty1s deleted the feature_store branch April 29, 2022 12:17
rusty1s added a commit that referenced this pull request Nov 13, 2024
Removes `TensorAttr.fully_specify` which was originally added in #4534.

---------

Co-authored-by: rusty1s <matthias.fey@tu-dortmund.de>
mattjhayes3 pushed a commit to mattjhayes3/pytorch_geometric that referenced this pull request Dec 14, 2024
Removes `TensorAttr.fully_specify` which was originally added in pyg-team#4534.

---------

Co-authored-by: rusty1s <matthias.fey@tu-dortmund.de>