Add edm4hep::Tensor type for use in ML training and inference #388

Open. Wants to merge 4 commits into main.

Conversation

@veprbl (Contributor) commented Dec 4, 2024

BEGINRELEASENOTES

  • Added edm4hep::Tensor type for use in ML training and inference

ENDRELEASENOTES

This should help support ML in reconstruction frameworks and allow writing tensors to disk for training with conventional Python ML tools.

@tmadlener (Contributor) commented:

Hi @veprbl, thanks for this proposal and apologies for the long delay from our side.

We have discussed this proposal in today's EDM4hep meeting and we have a few questions regarding how you (plan to) use this. Presumably, there is already some experience with this from EIC? One of the main concerns that was raised today is that this is obviously extremely generic, and we were wondering whether it is maybe too generic.

  • What are the reasons for choosing float and int64_t as the available datatypes?
  • How do you ensure that shape and the corresponding data remain consistent?
  • Do you store any metadata in some form that describe the content of the tensor? If yes, how are you doing this at the moment?
  • Are you using this for anything else than storing the tensors to disk? Or do you foresee to link this to other datatypes at the moment?
  • Probably a big ask, but could this also work for pytorch?
  • Does ONNX support switching between column / row major layouts? If yes, should this be reflected in the type somehow?

@veprbl (Contributor, Author) commented Dec 20, 2024

> Hi @veprbl, thanks for this proposal and apologies for the long delay from our side.
>
> We have discussed this proposal in today's EDM4hep meeting and we have a few questions regarding how you (plan to) use this. Presumably, there is already some experience with this from EIC? One of the main concerns that was raised today is that this is obviously extremely generic, and we were wondering whether it is maybe too generic.

Hi @tmadlener, thanks for the very nice feedback. The type is introduced to enable ML workflows in our reconstruction pipeline, where we use immutable PODIO objects both to exchange data between algorithms and to store it on disk. We've already implemented this type in EDM4eic, and there is a reference implementation for ONNX inference in reconstruction, along with an automated training CI workflow (eic/EICrecon#1618). The type appears to have more general utility outside of ePIC/EIC software, so we are submitting it here to gather feedback from the wider community. Sharing this type would also help us make a better case for introducing optimizations in PODIO for this use case (e.g. zero-copy facilities).

> • What are the reasons for choosing float and int64_t as the available datatypes?

Indeed, several more scalar value types are possible. Following https://onnxruntime.ai/docs/api/c/group___global.html#gaec63cdda46c29b8183997f38930ce38e, one could natively add uint8_t, int8_t, uint16_t, int16_t, uint32_t, int32_t, uint64_t, int64_t, bool, string, and double. That would be 11 additional vector members.
Only two types were added because they already provide sufficient utility for describing typical feature and output types (e.g. probabilities). So far the philosophy has been to add a minimal set that works for most use cases and see which extensions are needed in practice, without committing to support what is not going to be used.
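For illustration, the two-type design could be sketched roughly like this (the member names are hypothetical stand-ins for the PODIO-generated accessors, and the integer codes follow the ONNX TensorProto data-type enum):

```cpp
#include <cstdint>
#include <vector>

// Rough sketch (hypothetical member names, not the actual generated class):
// the element type is a single integer code and the payload lives in one
// vector member per supported scalar type. Supporting a new scalar type
// means adding another vector member, hence the reluctance above.
struct TensorSketch {
  int32_t elementType = 0;        // 1 = float, 7 = int64_t (ONNX TensorProto codes)
  std::vector<int64_t> shape;     // dimensions, e.g. {2, 3} for a 2x3 matrix
  std::vector<float> floatData;   // filled when elementType == 1
  std::vector<int64_t> int64Data; // filled when elementType == 7
};
```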

> • How do you ensure that shape and the corresponding data remain consistent?

I have thought about this. My only idea is that a checkConsistency() member method could be added to allow validation at important points.
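A minimal sketch of that idea, written here as a free function for illustration (the actual method would live on the generated class): the product of the shape entries must match the number of stored elements.

```cpp
#include <cstdint>
#include <functional>
#include <numeric>
#include <vector>

// Sketch of the checkConsistency() idea: a tensor is consistent when the
// product of its shape entries equals the element count of the active
// data vector. An empty shape denotes a rank-0 tensor with one element.
bool checkConsistency(const std::vector<int64_t>& shape, std::size_t dataSize) {
  const int64_t expected = std::accumulate(
      shape.begin(), shape.end(), int64_t{1}, std::multiplies<int64_t>());
  return expected >= 0 && static_cast<std::size_t>(expected) == dataSize;
}
```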

> • Do you store any metadata in some form that describe the content of the tensor? If yes, how are you doing this at the moment?

Some thought was given to this, but I didn't see an invariant to check here.

> • Are you using this for anything else than storing the tensors to disk? Or do you foresee to link this to other datatypes at the moment?

For the ONNX use case, value types that are lists of tensors and maps (which are more like lists of 2-tuples) are also supported; those need not belong to EDM4hep. There is also a case for supporting a sparse tensor encoding, which would probably be useful to have in EDM4hep.

> • Probably a big ask, but could this also work for pytorch?

This is inspired by ONNX, but not tied to it. Torch and CatBoost models exported to ONNX were tested during the development of this. I haven't tried it, but I also don't see why inference with TorchScript/TF Lite would not work with this.

> • Does ONNX support switching between column / row major layouts? If yes, should this be reflected in the type somehow?

It doesn't do that explicitly. There appears to be support for named dimensions: https://onnxruntime.ai/docs/api/c/struct_ort_1_1detail_1_1_tensor_type_and_shape_info_impl.html
In the ONNX Runtime API documentation there probably aren't any references to "rows" and "columns", or even interpretations of those tensors with respect to the operations supported in ONNX. If you want this to look more like a NumPy array, a field could be added to indicate the ordering.
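As an illustration of what such an ordering field would disambiguate (this is not part of the proposal): the same multi-index maps to different flat offsets depending on the layout.

```cpp
#include <cstdint>
#include <vector>

// Row-major (C / NumPy default): the last dimension varies fastest.
int64_t flatIndexRowMajor(const std::vector<int64_t>& shape,
                          const std::vector<int64_t>& idx) {
  int64_t offset = 0;
  for (std::size_t d = 0; d < shape.size(); ++d)
    offset = offset * shape[d] + idx[d];
  return offset;
}

// Column-major (Fortran): the first dimension varies fastest.
int64_t flatIndexColMajor(const std::vector<int64_t>& shape,
                          const std::vector<int64_t>& idx) {
  int64_t offset = 0;
  for (std::size_t d = shape.size(); d-- > 0;)
    offset = offset * shape[d] + idx[d];
  return offset;
}
```

For a 2x3 tensor, the element at index (0, 1) sits at flat offset 1 in row-major order but at offset 2 in column-major order, so a consumer that guesses the wrong layout reads the wrong element.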

@veprbl (Contributor, Author) commented Dec 20, 2024

(answering a question from the meeting minutes)

> • Do we need metadata attached to this somehow? How do we know where things are in this tensor?
>   • E.g. shape parameters for Clusters

If naming is important, specifically in ONNX, one can modify the model to take more tensors (inputs are named); a concatenation ONNX operator would need to be inserted into the computation graph. More generally, ML feature representations are not always tables, so this may not be functionality that works generally in every framework.
