-
Notifications
You must be signed in to change notification settings - Fork 77
pyarrow integration with emmet-core #1243
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
@tschaume, @esoteric-ephemera, @kbuma This one is a bit long, but there's no rush on getting this in. Would like to get it right first and foremost. Any review would be great. Likely missed some stuff in the write up, happy to go deeper. |
hmm let me get the conflicts resolved so the tests run git history is going to be rewritten before the final merge anyways |
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #1243 +/- ##
==========================================
- Coverage 89.52% 88.89% -0.63%
==========================================
Files 150 202 +52
Lines 15311 16903 +1592
==========================================
+ Hits 13707 15026 +1319
- Misses 1604 1877 +273 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
emmet-core/emmet/core/__init__.py
Outdated
@@ -4,6 +4,10 @@ | |||
""" | |||
|
|||
from importlib.metadata import PackageNotFoundError, version | |||
from importlib.util import find_spec | |||
|
|||
core_path = __path__[0] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It'd be nice if this could be removed or figured out in the unit test that uses it as it seems to be just used in that unit test.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yep agreed, changed here:
emmet/emmet-core/tests/test_arrow.py
Line 20 in d8e7356
core_path = Path(__file__).parent.parent.joinpath("emmet/core") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are a lot of files in this PR where the only changes are:
- upgrading to use the emmet.core.utils utcnow function (which uses the recommended version of getting the utc time rather than the version that is deprecated as of 3.12)
- moving from using the pre 3.10 Optional declarations
If it's not time-consuming it may be good to create a PR for those changes and get them in first. That way it'd be easier to review the arrow specific stuff.
@@ -1,10 +1,12 @@ | |||
from pydantic import BaseModel, Field | |||
from typing import Dict | |||
|
|||
from emmet.core.utils import utcnow |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are a lot of files in this PR where the only changes are:
- upgrading to use the emmet.core.utils utcnow function (which uses the recommended version of getting the utc time rather than the version that is deprecated as of 3.12)
- moving from using the pre 3.10 Optional declarations
If it's not time-consuming it may be good to create a PR for those changes and get them in first. That way it'd be easier to review the arrow specific stuff.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I can do that.
utcnow is easy enough
And I think I was trying to just update the type annotations to post 3.10 conventions as I was working on each model. Can just take the plunge and fix up all of emmet for that, ha
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
all of the noise is cleaned up now after rebasing on top of #1246
if typing.get_origin(obj) in (Mapping, dict): | ||
args = typing.get_args(obj) | ||
|
||
assert not isinstance( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should this be done for the values too? (or let both fall through to the checks that happen when arrowize is called below)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Both could be let through and they would get caught below.
This check was meant to provide a more specific hint to help users more quickly find a certain mistake (the map key):
output with this check:
>>> arrowize(dict[str | int, float])
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/tsmathis/dev/builders/emmet/emmet-core/emmet/core/arrow.py", line 80, in arrowize
assert not isinstance(
^^^^^^^^^^^^^^^
AssertionError:
Cannot construct arrow map type from: dict[str | int, float].
Keys for maps must resolve to single primitive data type, not Union type: str | int
without this check:
>>> arrowize(dict[str | int, float])
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/tsmathis/dev/builders/emmet/emmet-core/emmet/core/arrow.py", line 87, in arrowize
return pa.map_(arrowize(args[0]), arrowize(args[1]))
^^^^^^^^^^^^^^^^^
File "/Users/tsmathis/dev/builders/emmet/emmet-core/emmet/core/arrow.py", line 118, in arrowize
assert all(
^^^^
AssertionError:
(De)Serialization of Union types is not supported in pyarrow currently,
narrow the types of str | int to resolve to a single primitive
(imo) without the hint it's more difficult to trace back where that error might have occurred
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am broadly open to changing all of this up though, so anything that doesn't make sense, could be clearer, etc. I can take a crack at it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
adding the check for clarity in error output makes sense. in that case I would suggest adding it for value types as well there.
clean up class def to reflect current utililty
+ utility decorators for customizing arrow type introspection behavior
arrow only additional context supported currently
…th arrow + add optimade to optional deps for testing - needed due to walking emmet to import all pydantic models bump pyarrow for 'maps_as_pydicts' kwarg
… to help with arrow roundtriping
bools -> TaskState enum strs
f440dac
to
6373caa
Compare
+ missing 'GGA' thermo type in ThermoType enum
field is now optional rather than defaulting to empty list
Amazing! Looking forward to trying this @tsmathis :) |
been through a few iterations of this at this point, happy enough with it for actual feedback now
scary line count deltas are due to having to rewrite some of the test files and the json getting flattened -_-
Contributor Checklist
Broad strokes:
arrowize
inemmet.core.arrow
, similar spirit tojsanitize
emmet.core.utils
@arrow_incompatible
-> molecules, openff, etc.emmet.core.serialization_adapters
emmet.core.typing
.as_dict()
methods, see here for one such exampleemmet/emmet-core/tests/test_electrodes.py
Line 155 in e545484
I like the current completeness (complete in the sense of strictly conforming to the existing data models, whether internal or external) of this implementation. However, I dislike the amount of overhead and complexity I had to introduce in the form of all the custom (de)serialization functions and handlers, which are almost strictly applied to pmg objects.
Some notable examples:
el_refs
field of a phase diagram, which is typed in pmg aslist[str, ComputedStructureEntry]
-> can't have heterogenous arrays in arrowemmet/emmet-core/emmet/core/thermo.py
Line 48 in d472a9b
energy_adjustments
field of any computed (structure) entry, which has a gnarly array of nested union variants and has to just be dumped to a string anywhere a computed entry shows up in a modelemmet/emmet-core/emmet/core/serialization_adapters/computed_entries_adapter.py
Line 87 in d472a9b
pop_empty_**()
handlersemmet/emmet-core/emmet/core/serialization_adapters/structure_adapter.py
Line 28 in d472a9b
which make the following behavior possible:
Structure
which I don't really see a way to silence (I guess it's actually the sites, but it stands) https://github.com/materialsproject/pymatgen/blob/98bdb66ed7ab454cb24ea4981529d72859962cdb/src/pymatgen/core/structure.py#L1263stderr
example for the warnings (for just one structure):Solving 1 is just removing the strict model equality checks in tests, solving 2. is more... sticky/tedious, would need changes to pmg, or we have to remove the pmg objects all together -> which we have discussed a bit
The changes here are significant enough I would aim this at landing as the bump to 0.85.x, but I would want the following PRs gotten over the line as well to wrap all the big changes up together:
#1226
#1232
I am going to open an RFC as a Discussion for the merits/long term implications of having a tighter coupling of pyarrow with emmet, strictness in typing in emmet, and the pmg type issues and what we want to do. These sorts of things will have implications for atomate2 developers/users