Add edm4hep::Tensor type for use in ML training and inference #388

Open. Wants to merge 4 commits into main.

Conversation

@veprbl (Contributor) commented Dec 4, 2024

BEGINRELEASENOTES

  • Added edm4hep::Tensor type for use in ML training and inference

ENDRELEASENOTES

This should help support ML in reconstruction frameworks and allow writing tensors to disk for training with conventional Python ML tools.

@tmadlener (Contributor) commented:

Hi @veprbl, thanks for this proposal and apologies for the long delay from our side.

We have discussed this proposal in today's EDM4hep meeting and we have a few questions regarding how you (plan to) use this. Presumably, there is already some experience with this from EIC? One of the main concerns that was raised today is that this is obviously extremely generic, and we were wondering whether it is maybe too generic.

  • What are the reasons for choosing float and int64_t as the available datatypes?
  • How do you ensure that shape and the corresponding data remain consistent?
  • Do you store any metadata in some form that describe the content of the tensor? If yes, how are you doing this at the moment?
  • Are you using this for anything else than storing the tensors to disk? Or do you foresee to link this to other datatypes at the moment?
  • Probably a big ask, but could this also work for pytorch?
  • Does ONNX support switching between column / row major layouts? If yes, should this be reflected in the type somehow?

@veprbl (Contributor, Author) commented Dec 20, 2024

> Hi @veprbl, thanks for this proposal and apologies for the long delay from our side.
>
> We have discussed this proposal in today's EDM4hep meeting and we have a few questions regarding how you (plan to) use this. Presumably, there is already some experience with this from EIC? One of the main concerns that was raised today is that this is obviously extremely generic, and we were wondering whether it is maybe too generic.

Hi @tmadlener, thanks for the very nice feedback. The type is introduced to enable ML workflows in our reconstruction pipeline, where we use immutable PODIO objects both to exchange data between algorithms and to store it on disk. We've already implemented this type in EDM4eic, and there is a reference implementation for ONNX inference in reconstruction, along with an automated training CI workflow (eic/EICrecon#1618). The type appears to have more general utility outside of ePIC/EIC software, so we are submitting it here to gather feedback from the wider community. Sharing this type would also help us make a better case for introducing optimizations in PODIO for this use case (e.g. zero-copy facilities).

> • What are the reasons for choosing float and int64_t as the available datatypes?

Indeed, several more scalar value types are possible. Following https://onnxruntime.ai/docs/api/c/group___global.html#gaec63cdda46c29b8183997f38930ce38e, one could natively add uint8_t, int8_t, uint16_t, int16_t, uint32_t, int32_t, uint64_t, int64_t, bool, string, and double. That would be 11 additional vector members.
Only two types were added because they already provide sufficient utility for describing typical feature and output types (e.g. probabilities). So far the philosophy has been to add a minimal set that works for most use cases and see which extensions are needed in practice, without committing to support what is not going to be used.
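For illustration, the two-type design could be sketched roughly like this (the member names are hypothetical stand-ins for the PODIO-generated accessors, and the integer codes follow the ONNX TensorProto data-type enum):

```cpp
#include <cstdint>
#include <vector>

// Rough sketch (hypothetical member names, not the actual generated class):
// the element type is a single integer code and the payload lives in one
// vector member per supported scalar type. Supporting a new scalar type
// means adding another vector member, hence the reluctance above.
struct TensorSketch {
  int32_t elementType = 0;        // 1 = float, 7 = int64_t (ONNX TensorProto codes)
  std::vector<int64_t> shape;     // dimensions, e.g. {2, 3} for a 2x3 matrix
  std::vector<float> floatData;   // filled when elementType == 1
  std::vector<int64_t> int64Data; // filled when elementType == 7
};
```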

> • How do you ensure that shape and the corresponding data remain consistent?

I have thought about this. My only idea is that a checkConsistency() member method could be added to allow validation at important points.
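A minimal sketch of that idea, written here as a free function for illustration (the actual method would live on the generated class): the product of the shape entries must match the number of stored elements.

```cpp
#include <cstdint>
#include <functional>
#include <numeric>
#include <vector>

// Sketch of the checkConsistency() idea: a tensor is consistent when the
// product of its shape entries equals the element count of the active
// data vector. An empty shape denotes a rank-0 tensor with one element.
bool checkConsistency(const std::vector<int64_t>& shape, std::size_t dataSize) {
  const int64_t expected = std::accumulate(
      shape.begin(), shape.end(), int64_t{1}, std::multiplies<int64_t>());
  return expected >= 0 && static_cast<std::size_t>(expected) == dataSize;
}
```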

> • Do you store any metadata in some form that describe the content of the tensor? If yes, how are you doing this at the moment?

Some thought was given to this, but I didn't see an invariant to check here.

> • Are you using this for anything else than storing the tensors to disk? Or do you foresee to link this to other datatypes at the moment?

For the ONNX use case, value types that are lists of tensors and maps (which are more like lists of 2-tuples) are also supported; those need not belong to EDM4hep. There is also a case for supporting a sparse tensor encoding, which would probably be useful to have in EDM4hep.

> • Probably a big ask, but could this also work for pytorch?

This is inspired by ONNX, but not tied to it. Torch and CatBoost models exported to ONNX were tested during the development of this. I haven't tried it, but I also don't see why inference with TorchScript/TF Lite would not work with this.

> • Does ONNX support switching between column / row major layouts? If yes, should this be reflected in the type somehow?

It doesn't do that explicitly. There appears to be support for named dimensions: https://onnxruntime.ai/docs/api/c/struct_ort_1_1detail_1_1_tensor_type_and_shape_info_impl.html
In the ONNX Runtime API documentation there probably aren't any references to "rows" and "columns", or even interpretations of those tensors with respect to the operations supported in ONNX. If you want this to look more like a NumPy array, a field could be added to indicate the ordering.
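As an illustration of what such an ordering field would disambiguate (this is not part of the proposal): the same multi-index maps to different flat offsets depending on the layout.

```cpp
#include <cstdint>
#include <vector>

// Row-major (C / NumPy default): the last dimension varies fastest.
int64_t flatIndexRowMajor(const std::vector<int64_t>& shape,
                          const std::vector<int64_t>& idx) {
  int64_t offset = 0;
  for (std::size_t d = 0; d < shape.size(); ++d)
    offset = offset * shape[d] + idx[d];
  return offset;
}

// Column-major (Fortran): the first dimension varies fastest.
int64_t flatIndexColMajor(const std::vector<int64_t>& shape,
                          const std::vector<int64_t>& idx) {
  int64_t offset = 0;
  for (std::size_t d = shape.size(); d-- > 0;)
    offset = offset * shape[d] + idx[d];
  return offset;
}
```

For a 2x3 tensor, the element at index (0, 1) sits at flat offset 1 in row-major order but at offset 2 in column-major order, so a consumer that guesses the wrong layout reads the wrong element.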

@veprbl (Contributor, Author) commented Dec 20, 2024

(answering a question from the meeting minutes)

> • Do we need metadata attached to this somehow? How do we know where things are in this tensor?
>   • E.g. shape parameters for Clusters

If naming is important, specifically in ONNX, one can modify the model to take more tensors (inputs are named); a concatenation ONNX operator would need to be inserted into the computation graph. More generally, ML feature representations are not always tables, so this may not be functionality that works generally in every framework.
