
FeatureStore abstraction definition #4534

Merged · 23 commits · Apr 29, 2022
Conversation

mananshah99
Contributor

Defines a FeatureStore abstraction and associated tests as a first step toward allowing independent scale-out of a graph's features and its edge_index. A roadmap for the remaining features and implementation details will follow shortly. More details:

  • Defines FeatureStore(MutableMapping) with basic CRUD operations as abstract methods as well as support for advanced indexing
  • Defines TensorAttr and AttrView as classes that aid in accessing, modifying, and viewing elements in a feature store
  • Defines CastMixin for straightforward TensorAttr creation
  • Defines a basic (incomplete) feature store implementation MyFeatureStore along with associated tests to showcase API and indexing functionality
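The shape of the abstraction described above can be sketched as follows. This is a minimal, dependency-free sketch: the class names (`FeatureStore`, `TensorAttr`) and the CRUD surface follow the PR description, but the exact signatures are illustrative, and plain Python objects stand in for `torch.Tensor`.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class TensorAttr:
    # Identifies a tensor in the store; which fields are required is
    # left to concrete implementations.
    group_name: Optional[str] = None
    attr_name: Optional[str] = None
    index: Any = None


class FeatureStore(ABC):
    # Public CRUD surface; concrete stores implement the underscored
    # hooks against their own backend.
    @abstractmethod
    def _put_tensor(self, tensor: Any, attr: TensorAttr) -> bool:
        ...

    @abstractmethod
    def _get_tensor(self, attr: TensorAttr) -> Optional[Any]:
        ...

    @abstractmethod
    def _remove_tensor(self, attr: TensorAttr) -> bool:
        ...

    def put_tensor(self, tensor: Any, attr: TensorAttr) -> bool:
        return self._put_tensor(tensor, attr)

    def get_tensor(self, attr: TensorAttr) -> Optional[Any]:
        return self._get_tensor(attr)

    def remove_tensor(self, attr: TensorAttr) -> bool:
        return self._remove_tensor(attr)
```

A concrete store then only fills in the three underscored hooks, e.g. with a dict for an in-memory implementation.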


codecov bot commented Apr 25, 2022

Codecov Report

Merging #4534 (23d4e58) into master (85fc32e) will increase coverage by 0.05%.
The diff coverage is 88.74%.

@@            Coverage Diff             @@
##           master    #4534      +/-   ##
==========================================
+ Coverage   82.71%   82.77%   +0.05%     
==========================================
  Files         313      315       +2     
  Lines       16361    16512     +151     
==========================================
+ Hits        13533    13667     +134     
- Misses       2828     2845      +17     
Impacted Files Coverage Δ
torch_geometric/utils/mixin.py 85.71% <85.71%> (ø)
torch_geometric/data/feature_store.py 88.80% <88.80%> (ø)
torch_geometric/typing.py 100.00% <100.00%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@mananshah99 mananshah99 requested a review from yaoyaowd April 25, 2022 21:16

@yaoyaowd yaoyaowd left a comment


I put two thoughts in the code. I think what we wrote is cool and Pythonic, but it is not easy to follow for someone reading this code for the first time.
I don't have a strong opinion on changing the style now, but I want to hear people's thoughts.


@rusty1s rusty1s left a comment


Thanks for the PR. The nesting of TensorAttr looks pretty cool. Please see my comments below (especially about some confusion I have regarding index). I am not really sure why we store index in the first place. It seems to be because we require TensorAttr to be fully specified. On the other hand, it doesn't make much sense to require it in case the user wants to access the full matrix. Originally, I would have thought that it makes sense to include the tensor inside TensorAttr, and move any index logic to TensorAttr instead:

class FeatureStore:
    def put_tensor(self, tensor, name, group):
        self._put_tensor(TensorAttr(tensor, name, group))

where TensorAttr manages any index logic:

class TensorAttr:
    def __getitem__(self, idx: IndexType):
        return self.tensor[idx]
    
    def __call__(self):
        return self.tensor

This would then result in an API similar to

x_full = store['x', 'paper']()
x_selected = store['x', 'paper'][index]

Also agree with @yaoyaowd that it is pretty hard to understand on first read (although BaseStorage has the same problem).
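As a toy, dict-backed sketch of this alternative (illustrative only: plain lists stand in for tensors, and the `(name, group)` key layout is an assumption of the sketch, not the PR's API):

```python
from dataclasses import dataclass
from typing import Any, Dict, Tuple


@dataclass
class TensorAttr:
    # In this alternative, the attribute carries the tensor itself and
    # owns all indexing logic.
    tensor: Any  # a torch.Tensor in PyG; plain lists work for the sketch
    name: str
    group: str

    def __getitem__(self, idx):
        return self.tensor[idx]

    def __call__(self):
        return self.tensor


class FeatureStore:
    def __init__(self):
        self._attrs: Dict[Tuple[str, str], TensorAttr] = {}

    def put_tensor(self, tensor, name, group):
        self._attrs[(name, group)] = TensorAttr(tensor, name, group)

    def __getitem__(self, key) -> TensorAttr:
        name, group = key
        return self._attrs[(name, group)]
```

With this, `store['x', 'paper']()` returns the full tensor and `store['x', 'paper'][index]` a selection, matching the API proposed above.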


rusty1s commented Apr 26, 2022

There exist two other alternatives for handling index, as far as I can tell:

  • Simply return the full matrix in case index is None. This way, store[attr_name, index] is required rather than doing store[attr_name][index].
  • Allow both : and tensors to model indices, e.g., store[attr_name, :], and store[attr_name][:]. I think this aligns well with any PyTorch/numpy logic.
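The second alternative boils down to normalizing whatever arrives in `__getitem__` onto one internal index representation. A small sketch (the helper name `normalize_index` is an assumption, not part of the PR):

```python
def normalize_index(index, num_rows):
    # Map all supported index forms onto one internal representation
    # (a list of row positions). A bare ':' arrives here as slice(None).
    if index is None:
        return list(range(num_rows))          # full matrix
    if isinstance(index, slice):
        return list(range(num_rows))[index]   # ':' or any other slice
    return list(index)                        # tensor / sequence of indices
```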


rusty1s commented Apr 26, 2022

Two additional thoughts:

  • I think the FeatureStore needs to provide the capability to determine which attributes in TensorAttr are required. For example, for in-memory stores, we probably want to make index optional such that store[group_name, attr_name] always returns the tensor. For torch_geometric.data.Data, we also may want to make group_name optional/disallow its usage.
  • FeatureStore may need to provide the capability to determine the order of attributes. For homogeneous data, we probably want to allow access via store[attr_name, index], while for heterogeneous data, access via store[group_name, attr_name, index] feels more natural to me (this is likely a P1).
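Both capabilities could be expressed as class-level metadata that concrete stores override. A hypothetical sketch (the attribute names `required_attrs` and `attr_order` and the `_to_attr` helper are assumptions for illustration, not the PR's API):

```python
from dataclasses import dataclass
from typing import Any, Optional, Tuple


@dataclass
class TensorAttr:
    group_name: Optional[str] = None
    attr_name: Optional[str] = None
    index: Any = None


class FeatureStore:
    # Concrete stores override these to declare which key fields are
    # required and how positional keys are interpreted.
    required_attrs: Tuple[str, ...] = ('group_name', 'attr_name')
    attr_order: Tuple[str, ...] = ('group_name', 'attr_name', 'index')

    def _to_attr(self, key) -> TensorAttr:
        key = key if isinstance(key, tuple) else (key,)
        attr = TensorAttr(**dict(zip(self.attr_order, key)))
        missing = [f for f in self.required_attrs if getattr(attr, f) is None]
        if missing:
            raise ValueError(f'Key is missing required attributes: {missing}')
        return attr


class HomogeneousStore(FeatureStore):
    # Homogeneous data: no group name; keys read as (attr_name, index).
    required_attrs = ('attr_name',)
    attr_order = ('attr_name', 'index')
```

Here `store[attr_name, index]` works on the homogeneous subclass while the base (heterogeneous) layout stays `(group_name, attr_name, index)`.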

@yaoyaowd

@rusty1s do the group_name and attr_name in your comment match node_type and tensor_type in the code?
If yes, I agree group_name is better, and it covers edge_type too.

I also feel that directly using store[node_type, tensor_type, index], store[node_type, tensor_type, :], and store[node_type, tensor_type] to refer to tensors is cleaner. In this way, maybe we don't even need TensorAttr.

And if necessary, we can define store.view(...) to get a view instead of a tensor.

I think it is better to simplify the interface now and iterate fast to add back the additional features we need.


mananshah99 commented Apr 26, 2022

@yaoyaowd @rusty1s thank you both for the great discussion. Let me summarize some thoughts and the resulting design changes.

  • Requiring an index. It seems clear that we do not want index to be required. The initial motivation here was to define a TensorAttr object as fully specified when index is present, but I understand the reasons to avoid this. While large distributed feature stores may choose to raise an exception if a tensor is queried without an index, as @rusty1s pointed out this should be the job of implementor classes, which should be able to set required attributes.
  • TensorAttr and AttrView. There seems to be some debate about TensorAttr and AttrView. My opinion is the following:
    • TensorAttr should define all parameters necessary to uniquely identify a tensor from the feature store. In particular, this includes parameters such as the index, tensor type, node type, etc. (of course, the required nature of these parameters is up to the feature store). It can be thought of as a protobuf or other serializable data format that can be sent to a store over-the-wire to perform put or get calls.
    • AttrView defines a view on the feature store (emphasizing view due to its additional _store parameter that references the feature store it points to) that allows for indexing. For this reason, I think it is more natural to implement indexing logic in AttrView and not TensorAttr. As @yaoyaowd pointed out, we can allow users to access an AttrView with a store.view(...) method that explicitly returns an AttrView.
  • Handling advanced indexing. There are a few thoughts on how we can implement advanced indexing in a cleaner way. I agree with the sentiment here, as the current set-up can be a bit hard to read and understand from a user perspective. My opinion is the following:
    • Fully-specified forms. We should be able to support the following fully-specified forms: store[node_type, tensor_type, index], store[node_type, tensor_type, index_slice], and store[node_type].tensor_type[index]. Index slices will automatically be converted to the internal index representation we use. If a form is fully-specified, a Tensor/None will be returned or an Exception will be thrown.
    • Partially-specified forms. In order to support the last format, which begins with a partially-specified form (store[node_type]) and adds further specification via __getattr__, we will need to have store[node_type] return an AttrView. More generally, we should have non-fully specified forms always return an AttrView. To explicitly perform a GET call with a non-fully specified form, a user will have to make use of __call__.
    • In summary, we will have the following API, where all indexing and __call__ will be handled by AttrView.
    # Fully-specified forms, produce a Tensor output
    store[node_type, tensor_type, index] 
    store[node_type, tensor_type, None]
    store[node_type, tensor_type, :]
    store[node_type].tensor_type[:]
    
    # Partially-specified forms, produce an AttrView object
    store[node_type]
    store[node_type].tensor_type
    
    # Partially-specified forms, when called, produce a Tensor output
    # from the `TensorAttr` that has been partially specified.
    store[node_type]()
    store[node_type].tensor_type()
  • Naming. There were a couple of minor points made about attribute naming. I think group_name makes sense, but tensor_type is more specific, whereas attr_name feels a bit too general and confusing. @rusty1s if you have any strong opinions here, happy to adjust.

I think this proposal will address many of the main design decisions we discussed above and result in a streamlined, Pythonic API. Please let me know if you have any comments or thoughts!
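Put together, the summarized API can be sketched as a toy in-memory store (dict-backed, with plain lists standing in for tensors; the names `put` and `_fetch` are illustrative, not the PR's API):

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class TensorAttr:
    group_name: Optional[str] = None
    attr_name: Optional[str] = None
    index: Any = None


class AttrView:
    # A view on the store that accumulates attributes until the key is
    # specified enough to fetch; __call__ fetches with the current key.
    def __init__(self, store, attr):
        self._store, self._attr = store, attr

    def __getattr__(self, name):
        if name.startswith('_'):
            raise AttributeError(name)
        attr = self.__dict__['_attr']
        return AttrView(self.__dict__['_store'],
                        TensorAttr(attr.group_name, name, attr.index))

    def __getitem__(self, index):
        return self._store._fetch(self._attr.group_name,
                                  self._attr.attr_name, index)

    def __call__(self):
        return self._store._fetch(self._attr.group_name,
                                  self._attr.attr_name, None)


class FeatureStore:
    def __init__(self):
        self._data = {}

    def put(self, tensor, group_name, attr_name):
        self._data[(group_name, attr_name)] = tensor

    def _fetch(self, group_name, attr_name, index):
        tensor = self._data[(group_name, attr_name)]
        if index is None:
            return tensor                     # full tensor
        if isinstance(index, slice):
            return tensor[index]              # ':' arrives as slice(None)
        return [tensor[i] for i in index]     # sequence of row indices

    def __getitem__(self, key):
        if not isinstance(key, tuple):
            # Partially specified: return a view for further chaining.
            return AttrView(self, TensorAttr(group_name=key))
        group_name, attr_name, index = key    # fully-specified 3-tuple
        return self._fetch(group_name, attr_name, index)
```

Fully-specified 3-tuple keys return tensors directly, while `store[group_name]` returns an AttrView that can be chained via `__getattr__`, indexed, or called.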


rusty1s commented Apr 27, 2022

I like it very much. Whenever something is not fully-specified, we return a View, and what qualifies as fully-specified depends on the FeatureStore implementation (where both group and index may be optional). Let me know when this PR is ready for another round of reviews :)

@rusty1s rusty1s closed this Apr 27, 2022
@rusty1s rusty1s reopened this Apr 27, 2022
@mananshah99

@rusty1s @yaoyaowd changes have been made to align with the discussion above. I've tried to add clear documentation for all of the builtins to improve code readability, and have also added tests that showcase functionality. @rusty1s, both points you listed above regarding the FeatureStore's interactions with TensorAttr (determining required nature of attributes and determining the order of attributes) have been implemented and are tested as well.

Let me know what you think!


@rusty1s rusty1s left a comment


This looks way better now. Thanks for updating. Left a bunch of comments nonetheless :P

In torch_geometric/data/feature_store.py:

        key = self._attr_cls.cast(key)
        self.put_tensor(value, key)

    def __getitem__(self, key: TensorAttr):
Member:
getattr and setattr equivalents?

Contributor Author:
I don't think a FeatureStore should implement getattr and setattr; imo, these should only be implemented for views on the store. I don't think it's particularly clean to have

store.group_name -> AttrView(group_name=group_name)

as this syntax seems more confusing to me than clarifying.

@rusty1s (Member), Apr 28, 2022:

Thought it makes sense to implement for stores without group names (like PyG data objects). We could require that the output is fully specified in case we allow it.

Contributor Author:

Isn't the only way for an output to be fully specified through getattr to allow for chaining, which necessitates that getattr can return an AttrView? That feels odd to me.

Member:

If TensorAttr sets both group_name and index to None by default, then store.{attr_name} should give you a tensor. I am okay with leaving this out for now.

@mananshah99

@rusty1s as always, thank you for the detailed comments :) Cleaned up the interface and builtins more. A couple general notes I wanted to share:

  • On UNSET fields in TensorAttr. I've spent some more time thinking about this, and I think we should proceed as follows.
    • We need some sort of tombstone in TensorAttr to indicate which attributes have been set; this is important both when checking for full specification and when implementing __getattr__ and __getitem__ in AttrView (which choose the attribute they will be setting based on the first UNSET attribute).
    • As a result, I think that all fields in TensorAttr should start as UNSET. If a field is not being used by a TensorAttr subclass, it can default that field to None and move it to the end of the __init__ list (there should be no modification of is_fully_specified, which just checks to see whether there are any UNSET values). Users may set fields however they like (with None or with any other value). It is up to the feature store implementation to handle these values properly.
    • Partially-specified views, when called, will set all UNSET values to None and pass this TensorAttr to the get_tensor method. This way, feature store implementors will never have to interact with UNSET values.
    • I think the resulting implementation is very clean and reduces the number of moving parts: all TensorAttr attributes start as UNSET, they can be set by the user (or default to None if a partially-specified view is called), and subclasses of TensorAttr just need to modify __init__ and the default values of attributes they are not using.
  • On subclassing TensorAttr. I have a slight bias towards enforcing that attr_class subclasses TensorAttr (instead of being any arbitrary class), due to the resulting simplicity of implementation (for both the user and the implementor). If we find the need to implement different TensorAttr classes for each feature store, this is of course supported (so long as the class has is_fully_specified and fully_specify methods), but I would avoid this for the time being.
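The UNSET tombstone described above can be sketched with a module-level sentinel. A minimal sketch (the sentinel spelling and method bodies are assumptions; only the names is_fully_specified and fully_specify come from the discussion):

```python
from dataclasses import dataclass, fields
from typing import Any


class _Unset:
    # Sentinel distinguishing "never set" from an explicit None.
    def __repr__(self):
        return 'UNSET'


UNSET = _Unset()


@dataclass
class TensorAttr:
    group_name: Any = UNSET
    attr_name: Any = UNSET
    index: Any = UNSET

    def is_set(self, key) -> bool:
        return getattr(self, key) is not UNSET

    def is_fully_specified(self) -> bool:
        return all(self.is_set(f.name) for f in fields(self))

    def fully_specify(self) -> 'TensorAttr':
        # Called on partially-specified views: any remaining UNSET
        # field collapses to None before reaching the store.
        for f in fields(self):
            if not self.is_set(f.name):
                setattr(self, f.name, None)
        return self
```

Note that an explicit None counts as set, so feature store implementors never see UNSET values.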


@rusty1s rusty1s left a comment


Thanks for all the updates :)

@rusty1s rusty1s merged commit 484b792 into master Apr 29, 2022
@rusty1s rusty1s deleted the feature_store branch April 29, 2022 12:17
rusty1s added a commit that referenced this pull request Nov 13, 2024
Removes `TensorAttr.fully_specify` which was originally added in #4534.

---------

Co-authored-by: rusty1s <matthias.fey@tu-dortmund.de>
mattjhayes3 pushed a commit to mattjhayes3/pytorch_geometric that referenced this pull request Dec 14, 2024
Removes `TensorAttr.fully_specify` which was originally added in pyg-team#4534.

---------

Co-authored-by: rusty1s <matthias.fey@tu-dortmund.de>