pyarrow integration with emmet-core #1243

Open: wants to merge 41 commits into main

Conversation

tsmathis
Collaborator

@tsmathis tsmathis commented Jun 9, 2025

been through a few iterations of this at this point, happy enough with it for actual feedback now

scary line count deltas are due to having to rewrite some of the test files and the json getting flattened -_-

Contributor Checklist

  • I have broken down my PR scope into the following TODO tasks
    • get emmet document models parquet compatible
    • write a type introspection function for generating pyarrow types from all pydantic models and types defined in emmet-core (well, almost all of them, more below*)
    • write fully qualified types for all foreign (pmg, etc.) and internal types used in emmet pydantic models that are not self-describing
    • write supporting (de)serialization pydantic handlers where needed for
    • write tests for the core emmet models (used in the builders/api) asserting that round-tripping python(pydantic) -> arrow -> python(pydantic) doesn't mangle anything (except in special conditions, more below**)
  • I have run the tests locally and they passed.
  • I have added tests, or extended existing tests, to cover any new features or bugs fixed in this PR

Broad strokes:

  • all arrow related behavior is currently opt-in only (if pyarrow is installed I am assuming arrow related things are going to be happening)
  • the arrow type generation function is arrowize in emmet.core.arrow, similar in spirit to jsanitize
  • some helper decorators are defined in emmet.core.utils
    • skipping, overriding types, and auto type generation decorators
    • *models where I have no reference data or experience working with the model have been marked with @arrow_incompatible -> molecules, openff, etc.
  • specific type defs for pmg objects are in emmet.core.serialization_adapters
  • generic emmet type defs have been added to emmet.core.typing
  • **some models don't round trip well currently due to pmg related quirks in their .as_dict() methods, see here for one such example
    # can't assert doc == test_arrow_doc for two reasons:
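
A minimal sketch of the opt-in check and a marker decorator described in the list above. Only the `@arrow_incompatible` name and the `find_spec` import come from this PR; `ARROW_AVAILABLE`, the `__arrow_incompatible__` attribute, `supports_arrow`, and the two example classes are hypothetical, and the actual emmet implementation may work differently.

```python
from importlib.util import find_spec

# Opt-in: arrow behavior is enabled only when pyarrow can be found.
ARROW_AVAILABLE = find_spec("pyarrow") is not None


def arrow_incompatible(cls):
    """Mark a model class so arrow type generation skips it (hypothetical)."""
    cls.__arrow_incompatible__ = True
    return cls


def supports_arrow(cls) -> bool:
    """True when pyarrow is installed and the class itself is not marked."""
    # vars() looks only at the class's own namespace, so a subclass is not
    # skipped just because its parent was marked.
    return ARROW_AVAILABLE and not vars(cls).get("__arrow_incompatible__", False)


@arrow_incompatible
class MoleculeDoc:  # stand-in for a model without reference data
    pass


class MaterialsDoc:  # stand-in for a supported model
    pass
```

With this shape, code paths that build arrow types can call `supports_arrow(model)` first and fall back to the existing behavior when it returns False.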

I like the current completeness (complete in the sense of strictly conforming to the existing data models, whether internal or external) of this implementation. However, I dislike the amount of overhead and complexity I had to introduce in the form of all the custom (de)serialization functions and handlers, which are almost strictly applied to pmg objects.

Some notable examples:

/Users/tsmathis/miniconda3/envs/emmet-core-testing-311/lib/python3.11/site-packages/pymatgen/core/structure.py:2918: UserWarning: Not all sites have property charge. Missing values are set to None.
  return cls.from_sites(sites, charge=charge, properties=dct.get("properties"))
/Users/tsmathis/miniconda3/envs/emmet-core-testing-311/lib/python3.11/site-packages/pymatgen/core/structure.py:2918: UserWarning: Not all sites have property velocities. Missing values are set to None.
  return cls.from_sites(sites, charge=charge, properties=dct.get("properties"))
/Users/tsmathis/miniconda3/envs/emmet-core-testing-311/lib/python3.11/site-packages/pymatgen/core/structure.py:2918: UserWarning: Not all sites have property selective_dynamics. Missing values are set to None.
  return cls.from_sites(sites, charge=charge, properties=dct.get("properties"))
/Users/tsmathis/miniconda3/envs/emmet-core-testing-311/lib/python3.11/site-packages/pymatgen/core/structure.py:2918: UserWarning: Not all sites have property coordination_no. Missing values are set to None.
  return cls.from_sites(sites, charge=charge, properties=dct.get("properties"))
/Users/tsmathis/miniconda3/envs/emmet-core-testing-311/lib/python3.11/site-packages/pymatgen/core/structure.py:2918: UserWarning: Not all sites have property forces. Missing values are set to None.
  return cls.from_sites(sites, charge=charge, properties=dct.get("properties"))
/Users/tsmathis/miniconda3/envs/emmet-core-testing-311/lib/python3.11/site-packages/pymatgen/core/structure.py:2918: UserWarning: Not all sites have property magmom. Missing values are set to None.
  return cls.from_sites(sites, charge=charge, properties=dct.get("properties"))

Solving 1 is just a matter of removing the strict model equality checks in the tests. Solving 2 is more... sticky/tedious: it would need changes to pmg, or we would have to remove the pmg objects altogether, which we have discussed a bit.


The changes here are significant enough I would aim this at landing as the bump to 0.85.x, but I would want the following PRs gotten over the line as well to wrap all the big changes up together:
#1226
#1232


I am going to open an RFC as a Discussion for the merits/long term implications of having a tighter coupling of pyarrow with emmet, strictness in typing in emmet, and the pmg type issues and what we want to do. These sorts of things will have implications for atomate2 developers/users

@tsmathis
Collaborator Author

tsmathis commented Jun 9, 2025

@tschaume, @esoteric-ephemera, @kbuma

This one is a bit long, but there's no rush on getting this in. Would like to get it right first and foremost. Any review would be great.

Likely missed some stuff in the write up, happy to go deeper.

@tsmathis
Collaborator Author

tsmathis commented Jun 9, 2025

hmm let me get the conflicts resolved so the tests run

git history is going to be rewritten before the final merge anyways

@codecov-commenter

codecov-commenter commented Jun 9, 2025

Codecov Report

Attention: Patch coverage is 92.81729% with 103 lines in your changes missing coverage. Please review.

Project coverage is 88.89%. Comparing base (4bb6f72) to head (d8e7356).
Report is 7 commits behind head on main.

Files with missing lines                               Patch %  Lines
emmet-core/emmet/core/thermo.py                        65.78%   26 Missing ⚠️
...e/serialization_adapters/grain_boundary_adapter.py  40.74%   16 Missing ⚠️
...et/core/serialization_adapters/molecule_adapter.py  42.85%   16 Missing ⚠️
emmet-core/emmet/core/phonon.py                        91.25%   7 Missing ⚠️
...emmet/core/serialization_adapters/alloy_adapter.py  81.08%   7 Missing ⚠️
...met-builders/emmet/builders/vasp/task_validator.py  16.66%   5 Missing ⚠️
...re/serialization_adapters/phase_diagram_adapter.py  75.00%   4 Missing ⚠️
.../serialization_adapters/structure_graph_adapter.py  88.23%   4 Missing ⚠️
emmet-core/emmet/core/vasp/calculation.py              96.82%   4 Missing ⚠️
...zation_adapters/bandstructure_symm_line_adapter.py  78.57%   3 Missing ⚠️
... and 7 more
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1243      +/-   ##
==========================================
- Coverage   89.52%   88.89%   -0.63%     
==========================================
  Files         150      202      +52     
  Lines       15311    16903    +1592     
==========================================
+ Hits        13707    15026    +1319     
- Misses       1604     1877     +273     


@@ -4,6 +4,10 @@
"""

from importlib.metadata import PackageNotFoundError, version
from importlib.util import find_spec

core_path = __path__[0]
Contributor

It'd be nice if this could be removed or figured out in the unit test that uses it as it seems to be just used in that unit test.

Collaborator Author

yep agreed, changed here:

core_path = Path(__file__).parent.parent.joinpath("emmet/core")

Contributor

@kbuma kbuma left a comment

There are a lot of files in this PR where the only changes are:

  • upgrading to use the emmet.core.utils utcnow function (which uses the recommended version of getting the utc time rather than the version that is deprecated as of 3.12)
  • moving away from the pre-3.10 Optional declarations

If it's not time-consuming it may be good to create a PR for those changes and get them in first. That way it'd be easier to review the arrow specific stuff.

@@ -1,10 +1,12 @@
from pydantic import BaseModel, Field
from typing import Dict

from emmet.core.utils import utcnow
Collaborator Author

Yeah, I can do that.

utcnow is easy enough
And I think I was trying to just update the type annotations to post 3.10 conventions as I was working on each model. Can just take the plunge and fix up all of emmet for that, ha

Collaborator Author

@tsmathis tsmathis Jun 13, 2025

all of the noise is cleaned up now after rebasing on top of #1246

if typing.get_origin(obj) in (Mapping, dict):
    args = typing.get_args(obj)

    assert not isinstance(
Contributor

@kbuma kbuma Jun 9, 2025

should this be done for the values too? (or let both fall through to the checks that happen when arrowize is called below)

Collaborator Author

@tsmathis tsmathis Jun 9, 2025

Both could be let through and they would get caught below.

This check was meant to provide a more specific hint to help users more quickly find a certain mistake (the map key):
output with this check:

>>> arrowize(dict[str | int, float])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/tsmathis/dev/builders/emmet/emmet-core/emmet/core/arrow.py", line 80, in arrowize
    assert not isinstance(
           ^^^^^^^^^^^^^^^
AssertionError:
        Cannot construct arrow map type from: dict[str | int, float].
        Keys for maps must resolve to single primitive data type, not Union type: str | int

without this check:

>>> arrowize(dict[str | int, float])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/tsmathis/dev/builders/emmet/emmet-core/emmet/core/arrow.py", line 87, in arrowize
    return pa.map_(arrowize(args[0]), arrowize(args[1]))
                   ^^^^^^^^^^^^^^^^^
  File "/Users/tsmathis/dev/builders/emmet/emmet-core/emmet/core/arrow.py", line 118, in arrowize
    assert all(
           ^^^^
AssertionError:
        (De)Serialization of Union types is not supported in pyarrow currently,
        narrow the types of str | int to resolve to a single primitive 

(imo) without the hint it's more difficult to trace back where that error might have occurred

Collaborator Author

I am broadly open to changing all of this up though, so anything that doesn't make sense, could be clearer, etc. I can take a crack at it

Contributor

adding the check for clarity in error output makes sense. in that case I would suggest adding it for value types as well there.

tsmathis added 22 commits June 13, 2025 14:25
clean up class def to reflect current utility
+ utility decorators for customizing arrow type introspection behavior
arrow only additional context supported currently
…th arrow

+ add optimade to optional deps for testing - needed due to walking
emmet to import all pydantic models

bump pyarrow for 'maps_as_pydicts' kwarg
@tsmathis tsmathis force-pushed the arrow-compatibility branch from f440dac to 6373caa Compare June 13, 2025 21:27
tsmathis added 4 commits June 13, 2025 15:12
+ missing 'GGA' thermo type in ThermoType enum
field is now optional rather than defaulting to empty list
@mkhorton
Member

Amazing! Looking forward to trying this @tsmathis :)


4 participants