Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support multiple DynamicTableRegion when converting to a hierarchical dataframe #649

Open
5 tasks done
oruebel opened this issue Jul 19, 2021 · 3 comments
Open
5 tasks done
Labels
category: enhancement improvements of code or code behavior help wanted: good first issue request for community contributions that are good for new contributors priority: low alternative solution already working and/or relevant to only specific user(s)
Milestone

Comments

@oruebel
Copy link
Contributor

oruebel commented Jul 19, 2021

Problem: The function hdmf.common.hierarchicaltable.to_hierarchical_dataframe currently only supports resolution of one DynamicTableRegion column per DynamicTable that is linked to in the table hierarchy. I.e., the function follows and resolves the first DynamicTableRegion found in a table but if any given table contains additional DynamicTableRegion columns, then those will be converted as nested pandas.DataFrame objects.

Possible Solutions:

  1. hdmf.common.hierarchicaltable.to_hierarchical_dataframe should support resolution of multiple DynamicTableRegion columns for each given table.
  2. Add a new function that instead of working on DynamicTable would work on pandas.DataFrame objects and allow resolution of an arbitrary number of user-defined columns. There should then also be an option to automatically find columns that need resolution to allow resolution of all columns at once.

Possible Challenges

  • A challenge with option 1. is that this may require a hybrid approach where one operates in part on DynamicTable and in part on DataFrame
  • A challenge with option 2 is that we may need to associate additional information with the nested DataFrames that are being generated by DynamicTable when loading data, as we need to know, e.g., the name of tables, possibly the targets and structures of links etc.

Checklist

  • Have you ensured the feature or change was not already reported ?
  • Have you included a brief and descriptive title?
  • Have you included a clear description of the problem you are trying to solve?
  • Have you included a minimal code snippet that reproduces the issue you are encountering?
  • Have you checked our Contributing document?
@oruebel oruebel added category: enhancement improvements of code or code behavior help wanted: deep dive request for community contributions that will involve many parts of the code base help wanted: good first issue request for community contributions that are good for new contributors labels Jul 19, 2021
@oruebel
Copy link
Contributor Author

oruebel commented Jul 19, 2021

I added the "good first issue" label mainly because its an issue that is focused on a particular area (i.e., DynamicTable) and remains mostly on the surface (i.e., requires using the API and building new features on top). However, this is not necessarily a trivial issue as it requires some tricky data wrangling with lots of edge cases. That being said, it's a good issue for someone who wants to dive deeper into DynamicTable logic.

oruebel added a commit that referenced this issue Jul 21, 2021
Smplify interacting with DynamicTables that reference other tables via DynamicTableRegion, creating a collection of linked tables. In the ICEphys case this is a "simple" linear hierarchy of tables, but in principle a table may contain any number of DynamicTableRegion columns. This PR adds several functions to simplify introspection of linked DynamicTables and conversion to pandas DataFrames. 

- [X] Fix #646 by adding ``AlignedDynamicTable.get``
- [X] Fix #651 by updating ``AlignedDynamicTable.get`` to support slicing with  ``[int, (str, str)]``, ``[int, str, str]``, and ``[int, str]`` to select a single cell or row of a category table, repectively
- [X] Add ``AlignedDynamicTable.get_colnames(...)`` functions to allow us to keep compliance of the ``colnames`` property with ``DynamicTable`` while providing an easy way to get the full list of column names.
- [X] Set name of DataFrame in ``DynamicTable.to_dataframe()`` and ``DynamicTable.get`` 
- [X] Add helper functions to ``DynamicTable`` to deal with foreign columns:
    - [X] ``DynamicTable.get_foreign_columns`` to identify if the table contains ``DynamicTableRegion`` columns
    - [X] ``DynamicTable.has_foreign_columns`` to  identify which columns are``DynamicTableRegion`` columns 
    - [X] ``DynamicTable.get_linked_tables`` to retrieve all tables linked to either directly or indirectly from
      the current table via ``DynamicTableRegion``
- [x] Implement the same helper functions also for ``AlignedDynamicTable``
    - [x] ``DynamicTable.get_foreign_columns`` to identify if the table contains ``DynamicTableRegion`` columns
    - [X] ``DynamicTable.has_foreign_columns`` to  identify which columns are``DynamicTableRegion`` columns 
    - [x] ``DynamicTable.get_linked_tables`` to retrieve all tables linked to either directly or indirectly from
      the current table via ``DynamicTableRegion``
- [X] Add new module ``hdmf.common.hierarchicaltable`` with helper functions to facilitate conversion of linked tables to a single Pandas dataframe. 
     - [X] ``to_hierarchical_dataframe`` to merge linked tables into a single consolidated pandas DataFrame.
     - [X] ``drop_id_columns`` to remove "id" columns from a DataFrame.
     - [X] ``flatten_column_index`` to replace a ``pandas.MultiIndex`` with a regular ``pandas.Index``
- [x]  Add test for DyanmicTableRegion pointing to AlignedDynamicTable to check that the all columns are used
- [x] Add tests for hierarchicaltable.py for 
    - [X] to_hierarchical_dataframe
    - [x] drop_id_columns
    - [x]  flatten_column_index functions
- [X] File issue tickets for open TODO items for future PRs 
    - [X] ``to_hierarchical_dataframe`` should be updated to support resolution of more than one DynamicTableRegion column. See #649
    - [x] Add tutorial for DynamicTableRegion and how to use for linking to tables and for creating linked tables. See #648
@oruebel oruebel removed the help wanted: deep dive request for community contributions that will involve many parts of the code base label Sep 11, 2021
@rly rly added the priority: low alternative solution already working and/or relevant to only specific user(s) label Jan 12, 2024
@mavaylon1
Copy link
Contributor

@oruebel Could you take this? If not, could you make me an example to run and I will take it.

@mavaylon1 mavaylon1 added this to the Next Major Release - 4.0 milestone Apr 16, 2024
@oruebel
Copy link
Contributor Author

oruebel commented Apr 16, 2024

Let's see if I find time to work on this during the Dev Days. Otherwise, I'd leave it for Future for now when we actually need this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
category: enhancement improvements of code or code behavior help wanted: good first issue request for community contributions that are good for new contributors priority: low alternative solution already working and/or relevant to only specific user(s)
Projects
None yet
Development

No branches or pull requests

3 participants