to/from_pandas does not roundtrip #365

Closed
lilyminium opened this issue Aug 24, 2021 · 0 comments · Fixed by #367

I cannot create a ThermoMLDataSet from a pandas dataframe that was created from a dataset.

>>> df = dataset.to_pandas()
>>> ThermoMLDataSet.from_pandas(df)
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
/var/folders/rv/j6lbln6j0kvb5svxj8wflc400000gn/T/ipykernel_34462/444742739.py in <module>
      1 df = dataset.to_pandas()
----> 2 ThermoMLDataSet.from_pandas(df)

~/anaconda3/envs/polymetrizer/lib/python3.9/site-packages/openff/evaluator/datasets/datasets.py in from_pandas(cls, data_frame)
    555         for match in property_header_matches:
    556 
--> 557             assert match
    558 
    559             property_type_string, property_unit_string = match.groups()

AssertionError: 

Diagnostics

It fails on the header ExcessMolarVolume Value (cm ** 3 / mol): the character class that matches the unit string does not include an asterisk, so re.match returns None and the assert trips.

>>> import re
>>> property_header_matches = {
            (header, re.match(r"^([a-zA-Z]+) Value \(([a-zA-Z0-9+-/\s]*)\)$", header))
            for header in df
            if header.find(" Value ") >= 0
        }
>>> property_header_matches
{('Density Value (g / ml)',
  <re.Match object; span=(0, 22), match='Density Value (g / ml)'>),
 ('DielectricConstant Value ()',
  <re.Match object; span=(0, 27), match='DielectricConstant Value ()'>),
 ('EnthalpyOfMixing Value (kJ / mol)',
  <re.Match object; span=(0, 33), match='EnthalpyOfMixing Value (kJ / mol)'>),
 ('EnthalpyOfVaporization Value (kJ / mol)',
  <re.Match object; span=(0, 39), match='EnthalpyOfVaporization Value (kJ / mol)'>),
 ('ExcessMolarVolume Value (cm ** 3 / mol)', None)}
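As a side note on why the asterisk is rejected: inside the character class, `+-/` is read as a range from `+` to `/`, not as three separate literals. The short check below is only an illustrative sketch using the standard `re` module; the names `unit_class` and `accepted` are made up for this example and are not part of openff-evaluator.

```python
import re

# The unit character class copied from the pattern in from_pandas.
# Inside the brackets, "+-/" is a range covering '+', ',', '-', '.', '/'.
unit_class = re.compile(r"[a-zA-Z0-9+-/\s]")

# Probe a handful of punctuation characters that can show up in unit strings.
candidates = "+,-./*^"
accepted = [c for c in candidates if unit_class.fullmatch(c)]
rejected = [c for c in candidates if not unit_class.fullmatch(c)]

print("accepted:", accepted)
print("rejected:", rejected)
```

The asterisk sits just below `+` in ASCII, so it falls outside the range, and so does the caret used for exponents.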

Suggestion

        property_header_matches = {
---            re.match(r"^([a-zA-Z]+) Value \(([a-zA-Z0-9+-/\s]*)\)$", header)
+++            re.match(r"^([a-zA-Z]+) Value \(([a-zA-Z0-9+*-/\s]*)\)$", header)
            for header in data_frame
            if header.find(" Value ") >= 0
        }

Or get rid of the check altogether, since new units will keep turning up. (I notice there is no allowance for exponent notation, for example, even though kJ/mol and kJ mol^-1 should be equivalent.)
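To compare the two options, here is a minimal sketch that runs the suggested pattern next to a more permissive one that accepts anything between the parentheses. The permissive pattern, the helper `parse_header`, and the `kJ mol^-1` header are assumptions made for illustration; they are not taken from openff-evaluator, and a real fix would presumably still hand the captured unit string to the library's own unit parsing.

```python
import re

# Pattern from the suggestion above: the class now contains "*-/", which is
# itself a range that happens to cover '*', '+', ',', '-', '.', '/'.
PROPOSED = re.compile(r"^([a-zA-Z]+) Value \(([a-zA-Z0-9+*-/\s]*)\)$")

# Hypothetical alternative: capture whatever is inside the parentheses and
# leave validation of the unit string to a later step.
PERMISSIVE = re.compile(r"^([a-zA-Z]+) Value \((.*)\)$")

headers = [
    "Density Value (g / ml)",
    "DielectricConstant Value ()",
    "ExcessMolarVolume Value (cm ** 3 / mol)",
    "EnthalpyOfMixing Value (kJ mol^-1)",  # made-up header with an exponent
]


def parse_header(pattern, header):
    """Return (property_type, unit_string), or None if the header does not match."""
    match = pattern.match(header)
    return match.groups() if match else None


proposed = {header: parse_header(PROPOSED, header) for header in headers}
permissive = {header: parse_header(PERMISSIVE, header) for header in headers}

for header in headers:
    print(header, "|", proposed[header], "|", permissive[header])
```

With these inputs the suggested pattern handles the `cm ** 3 / mol` header but still rejects the caret form, while the permissive pattern accepts both.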

SimonBoothroyd added a commit that referenced this issue Aug 25, 2021
SimonBoothroyd mentioned this issue Aug 25, 2021