RDataFrame (awkward?) interface to input data to RootInteracive as a replacement of tree2Panda (& uproot) JIRA ATO-263 #243

Open
miranov25 opened this issue Aug 29, 2022 · 14 comments

Comments

@miranov25
Owner

The functionality of the tree->Draw queries (used internally in tree2Panda) and that of RDataFrame do not overlap. To be able to use complex C++ structures in O2, we have to switch to RDataFrame.

Related new functionality within scikit-hep: scikit-hep/awkward#1295

tree->Draw

  • (+) easy to query
  • (+) aliases for lazy evaluation
    -- (+) dependency trees created automatically
  • (-) caching not supported
  • (-) only simple variables supported
  • (-) slower, as a C interpreter is used

RDataFrame

  • (+) fast - JIT functions supported
  • (+) support for arbitrary C++ structures
  • (-) complex aliases not supported
  • (-) dependency trees to be provided by users
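
To illustrate the last point, here is a minimal sketch of how user-provided aliases could be ordered by their dependencies before being handed to RDataFrame. The alias table, the column names and the helper `ordered_defines` are hypothetical and not part of RootInteractive; the identifier matching is deliberately naive.

```python
import re
from graphlib import TopologicalSorter

# Hypothetical alias table (name -> C++ expression), in the spirit of TTree::SetAlias.
aliases = {
    "ptNorm": "pt / ptMax",
    "pt": "sqrt(pt2)",
    "pt2": "px*px + py*py",
}

# Naive identifier matcher; good enough for a sketch, not a C++ parser.
IDENT = re.compile(r"[A-Za-z_]\w*")

def ordered_defines(aliases):
    """Return (name, expression) pairs so that each alias follows the aliases it uses."""
    graph = {
        name: {tok for tok in IDENT.findall(expr) if tok in aliases and tok != name}
        for name, expr in aliases.items()
    }
    # TopologicalSorter expects node -> predecessors and raises CycleError on cycles.
    order = TopologicalSorter(graph).static_order()
    return [(name, aliases[name]) for name in order]

defines = ordered_defines(aliases)
for name, expr in defines:
    # With a real RDataFrame this would be: rdf = rdf.Define(name, expr)
    print(name, "=", expr)
```

Each pair could then be passed to `Define` in this order, so that users only declare the expressions and the dependency tree is derived from them.
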
@miranov25 miranov25 changed the title RDataFrame interface to input data to RootInteracive as a replacement of tree2Panda (& uproot) RDataFrame (awkward?) interface to input data to RootInteracive as a replacement of tree2Panda (& uproot) Aug 29, 2022
@ianna

ianna commented Sep 1, 2022

@miranov25 - FYI, Awkward <-> RDataFrame has been completed and fully integrated into the Awkward master branch. Please see a tutorial draft here.
Please let me know if anything is missing or does not work. Thanks!

@miranov25
Owner Author

Dear @ianna

Thank you for the link. I went through the tutorial briefly. I have an idea for further steps. We could simplify the creation of a reusable library with the functions, similar to what we did in the past with the declaration of TTree::SetAlias. Perhaps this is already possible in ROOT v6.26. If this works as we envisioned, we could offer our RootInteractive visualisation and interactive ND histogramming/aggregation package (+ ML) for general use.

Before offering it within scikit-hep, we need to improve the documentation and clearly state which parts of our package are experimental (ML) and which are stable. We considered PyHEP 2022 (https://indico.cern.ch/e/PyHEP2022), but we are too late for it. We can use the time now for a full round-trip integration of RDataFrame <-> awkward into our RootInteractive tool, including the possibility of client queries via joins.

If you are interested, I would like to ask for your advice and help. To get an idea of what the project is about, I would like to refer you to the RootInteractive tutorial (March 2022):
https://indico.cern.ch/event/1135398/

There are quite a few use cases, so I would like to refer to a more detailed example with a presentation, Jupyter notebook, dashboard and accompanying video.

The tutorial is six months old; in the meantime we have improved the speed, aiming at interactive detector physics and physics analysis:

  • Dashboard: https://indico.cern.ch/event/1135398/contributions/4950038/attachments/2474468/4245987/test_EffTrack.html
  • Notebook: https://indico.cern.ch/event/1135398/#preview:4265612

I plan to visit CERN in September. If you are interested, we can meet beforehand on Zoom or at CERN.

Marian

@ianna

ianna commented Sep 5, 2022 via email

@miranov25
Owner Author

Related JIRA ticket: testing with the "ToyMC" dEdx algorithm scan.

Tested in an RI notebook using:

  • [https://gitlab.cern.ch/alice-tpc-offline/alice-tpc-notes/-/blob/ab61ebe1148ebc7bd88c03f667aff0caa3b2b03e/JIRA/ATO-614/code/toydEdxSimul.C] 
  • [https://gitlab.cern.ch/alice-tpc-offline/alice-tpc-notes/-/blob/611d5048fced221189c232cdc99eb953ab6bf470/JIRA/ATO-614/RDFtoAwkward.ipynb] 

A non-trivial RDataFrame used for the dEdx algorithm optimization was translated to RDataFrame <-> awkward.
 

  • Check whether the dump works
    • OK
  • Check whether it works in parallel
    •  ROOT::EnableImplicitMT(32) - ???
    • Works very well; in this example with 32 parallel jobs:
CPU times: user 1min 44s, sys: 884 ms, total: 1min 45s
Wall time: 10.2 s
  • Check whether we can do ML training (RandomForest)
  • Check whether we can add the results of the ML prediction back to the DataFrame and export them to the tree
@miranov25
Owner Author

miranov25 commented Nov 11, 2022

RDataFrame <-> awkward questions:

Support Jupyter notebook and dashboard:

Open questions:

  • Optimization of the machine learning training. The awkward structure is not flat. Flattening and converting to pandas (with entry and subentry), in order to use it in ML training and prediction
    • Example use case for machine learning using awkward
    • awkward in -> ML -> awkward out -> RDataFrame Define
  • Event loop and eventual disk usage
    • Not clear at all whether this is possible
    • RDataFrame - processing per "event"
    • Predictions to be done per batch ...
  • If not possible - what are the other options?
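
A minimal sketch of the "awkward in -> ML -> awkward out" round trip asked about above, written with plain Python lists so that it runs without extra packages. The helpers `flatten`, `batches`, `unflatten` and the stand-in model `toy_predict` are hypothetical. Awkward Array has its own `ak.flatten`, `ak.num` and `ak.unflatten` that could take the place of the list code.

```python
from itertools import islice

def flatten(jagged):
    """Flatten variable-length lists into (entry, subentry, value) rows plus per-entry counts."""
    rows = [(entry, sub, value)
            for entry, values in enumerate(jagged)
            for sub, value in enumerate(values)]
    counts = [len(values) for values in jagged]
    return rows, counts

def batches(items, size):
    """Yield consecutive chunks of at most `size` items."""
    it = iter(items)
    while chunk := list(islice(it, size)):
        yield chunk

def unflatten(flat, counts):
    """Rebuild the jagged layout from a flat list and the per-entry counts."""
    out, start = [], 0
    for n in counts:
        out.append(flat[start:start + n])
        start += n
    return out

def toy_predict(values):
    """Stand-in for model.predict on one batch (hypothetical; doubles each value)."""
    return [2 * v for v in values]

jagged = [[1.0, 2.0, 3.0], [], [4.0, 5.0]]
rows, counts = flatten(jagged)

flat_pred = []
for chunk in batches([value for _, _, value in rows], size=2):
    flat_pred.extend(toy_predict(chunk))   # prediction per batch, not per event

predictions = unflatten(flat_pred, counts)
print(predictions)
```

The (entry, subentry) pairs play the role of the pandas multi-index, and the counts are all that is needed to restore the per-event structure after the batched prediction.
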

@miranov25
Owner Author

miranov25 commented Nov 11, 2022

Answers to questions:

@ianna

ianna commented Nov 11, 2022

Answers to questions:

Please, try:

>>> import awkward as ak
>>> from sklearn.ensemble import RandomForestClassifier
>>> clf = RandomForestClassifier(random_state=0)
>>> xarr = ak.Array([[1,2,3], [11,22,33]])
>>> yarr = ak.Array([0,1])
>>> clf.fit(xarr, yarr)
RandomForestClassifier(random_state=0)
>>> clf.predict(xarr)
array([0, 1])
>>> clf.predict([[4, 5, 6], [14, 15, 16]])
array([0, 1])
>>> 

@ianna

ianna commented Nov 11, 2022

Answers to questions:

https://awkward-array.readthedocs.io/en/latest/_auto/ak.pad_none.html

>>> xarr = ak.Array([[1, 2, 3], [11, 22]])  # assumed ragged input, so that padding has an effect
>>> xarr_padded = ak.pad_none(xarr, 3)
>>> xarr_padded
<Array [[1, 2, 3], [11, 22, None]] type='2 * var * ?int64'>
>>> xarr_padded_n = ak.fill_none(xarr_padded, 0)
>>> xarr_padded_n
<Array [[1, 2, 3], [11, 22, 0]] type='2 * var * int64'>

Please also see this answer to the question of whether there is a way to use existing ML libraries with Awkward Array.

@miranov25
Owner Author

Link to tutorials:
https://github.com/ianna/PyHep2022

@miranov25
Owner Author

Link to chat in scikit-hep - discussion about limitations and possible improvement - array of structures

miranov25 added a commit that referenced this issue Dec 17, 2022
@miranov25
Owner Author

RDataFrame column filter:

In [86]: filterRDFColumns?
Signature:
filterRDFColumns(
    rdf,
    selectList=['.*'],
    excludeList=[],
    selectTypeList=['.*'],
    excludeTypeList=['.*AliExternal.*'],
    verbose=0,
)
Docstring:
function to filter available columns in RDataFrame
:param rdf:               - input RDataFrame
:param selectList:        - columns to select (regExp)
:param excludeList:       - columns to reject (regExp)
:param selectTypeList     - types to accept   (regExp)
:param excludeTypeList:   - types to reject   (regExp)
:param verbose:           - verbosity 0x1 -print all status  0x2 - print selected  , 0x4 print rejected
:return:                    filtered list of columns

example:
filterRDFColumns(rdf1, ["param.*","delta","covar"],["part.",".*Refit.*"],[".*"],[""], verbose=1)
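
For readers without the package at hand, the selection logic described in the docstring can be sketched in plain Python. This is not the RootInteractive implementation: `filter_columns`, the column table and the use of `re.fullmatch` are assumptions made for illustration.

```python
import re

def filter_columns(columns, select=(".*",), exclude=(),
                   select_types=(".*",), exclude_types=()):
    """Return the column names that pass the name and type regular expressions.

    `columns` maps column name -> type name. Patterns are applied with
    re.fullmatch, which is an assumption of this sketch.
    """
    def matches(patterns, text):
        return any(re.fullmatch(p, text) for p in patterns)

    return [
        name for name, type_name in columns.items()
        if matches(select, name) and not matches(exclude, name)
        and matches(select_types, type_name) and not matches(exclude_types, type_name)
    ]

# Hypothetical column table; with a real RDataFrame it could presumably be
# built from rdf.GetColumnNames() and rdf.GetColumnType(name).
columns = {
    "paramX": "double",
    "delta": "float",
    "covar": "ROOT::VecOps::RVec<float>",
    "track": "AliExternalTrackParam",
    "paramRefit": "double",
}

selected = filter_columns(
    columns,
    select=["param.*", "delta", "covar"],
    exclude=[".*Refit.*"],
    exclude_types=[".*AliExternal.*"],
)
print(selected)
```

With `re.fullmatch`, an empty pattern such as `""` matches only an empty string, so it rejects nothing in practice; with `re.match` or `re.search` it would match every column, so the choice of matching function matters for the exclude lists.
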

@miranov25
Owner Author

Indices in the RDataFrame:

Finding the relation between array1 and array2:
indices of the closest values

array1 (N1) -> array2 (N2) - indices (N1) pointing to the closest values in array2
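
One possible way to compute such indices is sketched below, using a sort of array2 followed by a binary search for each value of array1. The function name `closest_indices` is hypothetical, array2 is assumed to be non-empty, and ties are resolved towards the smaller value.

```python
from bisect import bisect_left

def closest_indices(array1, array2):
    """For each value in array1, return the index (into array2) of the closest value."""
    # Sort array2 once, remembering the original positions.
    order = sorted(range(len(array2)), key=array2.__getitem__)
    sorted2 = [array2[i] for i in order]

    result = []
    for x in array1:
        pos = bisect_left(sorted2, x)
        # Only the neighbours on either side of the insertion point can be closest.
        candidates = [p for p in (pos - 1, pos) if 0 <= p < len(sorted2)]
        best = min(candidates, key=lambda p: abs(sorted2[p] - x))
        result.append(order[best])
    return result

indices = closest_indices([0.9, 5.2, 100.0], [10.0, 1.0, 5.0])
print(indices)
```

The result has length N1 and each element is a position in array2, so it can be stored as an extra column next to array1. The cost is O((N1 + N2) log N2) rather than the O(N1 * N2) of a brute-force scan.
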

@miranov25 miranov25 changed the title RDataFrame (awkward?) interface to input data to RootInteracive as a replacement of tree2Panda (& uproot) RDataFrame (awkward?) interface to input data to RootInteracive as a replacement of tree2Panda (& uproot) JIRA ATO-263 Dec 28, 2022