Add tidy keyword to to_pandas? #405

Open
lilyminium opened this issue Nov 16, 2021 · 0 comments
lilyminium commented Nov 16, 2021

I was surprised that .to_pandas converts to a wide format in which each property type gets its own column with an imposed unit. I would have thought it more intuitive to convert to a tidier format, i.e.:

Instead of:

Index(['Id', 'Temperature (K)', 'Pressure (kPa)', 'Phase', 'N Components',
       'Component 1', 'Role 1', 'Mole Fraction 1', 'Exact Amount 1',
       'Component 2', 'Role 2', 'Mole Fraction 2', 'Exact Amount 2',
       'SolvationFreeEnergy Value (kJ / mol)',
       'SolvationFreeEnergy Uncertainty (kJ / mol)', 'Source'],
      dtype='object')

You could have:

Index(['Id', 'Temperature (K)', 'Pressure (kPa)', 'Phase', 'N Components',
       'Component 1', 'Role 1', 'Mole Fraction 1', 'Exact Amount 1',
       'Component 2', 'Role 2', 'Mole Fraction 2', 'Exact Amount 2',
       'Property type', 'Value', 'Value unit', 'Uncertainty', 'Uncertainty unit', 'Source'],
      dtype='object')

This would be more memory-efficient (edit: for mixed datasets), since NaNs would no longer take up a lot of space, and it would help with filtering by property type. When working directly with the dataframe, it would be much easier to see how many of each property type you have and to group by it.
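
To make the idea concrete, here is a rough sketch of the reshaping using plain pandas on a small stand-in for the wide frame. The `to_tidy` helper is hypothetical and not part of the library, the `Density` columns are an assumed second property type, and the numbers are made-up placeholders. It also assumes every property column follows the `<Type> Value (<unit>)` / `<Type> Uncertainty (<unit>)` naming shown above.

```python
import re

import numpy as np
import pandas as pd

# Matches wide-format property columns such as
# "SolvationFreeEnergy Value (kJ / mol)" (naming convention assumed).
PROPERTY_COLUMN = re.compile(
    r"^(?P<type>\w+) (?P<kind>Value|Uncertainty) \((?P<unit>.+)\)$"
)


def to_tidy(wide: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: one row per measurement instead of one column per property."""
    # Map property type -> {"Value": (column, unit), "Uncertainty": (column, unit)}.
    properties = {}
    for column in wide.columns:
        match = PROPERTY_COLUMN.match(column)
        if match:
            properties.setdefault(match["type"], {})[match["kind"]] = (
                column,
                match["unit"],
            )

    property_columns = {
        column for kinds in properties.values() for column, _ in kinds.values()
    }
    shared = [c for c in wide.columns if c not in property_columns]

    frames = []
    for property_type, kinds in properties.items():
        value_column, value_unit = kinds["Value"]
        # Keep only the rows that actually carry this property (drops the NaNs).
        rows = wide[wide[value_column].notna()]
        tidy = rows[shared].copy()
        tidy["Property type"] = property_type
        tidy["Value"] = rows[value_column]
        tidy["Value unit"] = value_unit
        if "Uncertainty" in kinds:
            uncertainty_column, uncertainty_unit = kinds["Uncertainty"]
            tidy["Uncertainty"] = rows[uncertainty_column]
            tidy["Uncertainty unit"] = uncertainty_unit
        frames.append(tidy)
    return pd.concat(frames, ignore_index=True)


# Placeholder data: a mixed dataset where each row has only one property set.
wide = pd.DataFrame(
    {
        "Id": ["a", "b", "c"],
        "Temperature (K)": [298.15, 298.15, 310.0],
        "Density Value (g / ml)": [0.997, np.nan, 0.993],
        "Density Uncertainty (g / ml)": [0.001, np.nan, 0.001],
        "SolvationFreeEnergy Value (kJ / mol)": [np.nan, -26.4, np.nan],
        "SolvationFreeEnergy Uncertainty (kJ / mol)": [np.nan, 0.8, np.nan],
        "Source": ["x", "y", "z"],
    }
)

tidy = to_tidy(wide)
# Counting and grouping by property type becomes a one-liner.
print(tidy.groupby("Property type").size())
```

In this sketch the shared columns (including Source) come first and the property columns are appended at the end, so the column order differs slightly from the one proposed above.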
