Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Extending serialization #255

Closed
wants to merge 21 commits into from
Closed

Extending serialization #255

wants to merge 21 commits into from

Conversation

anabiman
Copy link
Collaborator

@anabiman anabiman commented Mar 8, 2021

Description

  • Adds serialization/deserialization support for YAML to ProtoModel and the serialization module.
  • Modifies to_file method in Molecule for passing serialization arguments.

The general idea is to be able to do things like:
e.g.

mol.to_file(json_file, indent=4)  # prettify JSON
mol.to_file(yaml_file, canonical=True)  # serialization to YAML file in canonical format
mol = qcelemental.models.molecule.Molecule.from_file(yaml_file) 
stringified = mol.yaml()
mol = qcelemental.models.molecule.Molecule.from_data(
    stringified, dtype="yaml"
) 

Notes

  • ProtoModel.parse_raw() assumes encoding=json if the data type is str and encoding is unspecified. Related to Issue113. This method does not distinguish between a YAML and JSON str.
  • PR is compatible with the following parsers/emitters:
    • PyYAML
    • ruamel.yaml (supports YAML 1.2)
  • Keep track of the YAML version in the schema (relevant to Issue1478) ?

Changelog description

  • Additional serialization arguments can be passed to Molecule.to_file
  • Serialization/deserialization support for YAML

Status

  • Code base linted
  • Ready to go

@loriab
Copy link
Collaborator

loriab commented Mar 8, 2021

Neat, I'm fond of yaml, too!

Loading data into models from a yaml form is pretty local and seems fine by me. I'm a little hesitant about extending serialize. The existing json/json-ext/msgpack-ext all vary in how the numerical data is represented (plain vs binary). Would it be crippling yaml storage at all to serialize to json then convert to yaml? I worry about toml, etc.

@anabiman
Copy link
Collaborator Author

anabiman commented Mar 8, 2021

Would it be crippling yaml storage at all to serialize to json then convert to yaml? I worry about toml, etc.

That works too and would make things simpler, but it would restrict the YAML format to be strictly JSON (since YAML is a superset of JSON), and of course it is less efficient. Perhaps the efficiency is of minor significance for small molecules in QC, but I'm using qcelemental for large macromolecules (for molecular mechanics), so I'll do some bench marking and get back to this later.

@codecov
Copy link

codecov bot commented Apr 5, 2021

Codecov Report

Merging #255 (3916a91) into master (ff52f14) will decrease coverage by 0.68%.
The diff coverage is 48.86%.

@anabiman
Copy link
Collaborator Author

anabiman commented Apr 5, 2021

It seems JSON serialization is roughly faster by a factor of 20-50 compared to PyYAML & ruamel.yaml (installed via pip, version 5.4.1 & 0.17.2) for large data sets (1D numpy arrays storing random values), and it gets worse if serialization to JSON is involved as well (overhead mainly due to yaml_load, see code below).

The 1st table shows the results for direct YAML serialization while the 2nd includes the additional, intermediate step of JSON serialization, which makes process roughly ~ 3 times slower.

Direct serialization

Array length PyYAML ruamel.yaml JSON
4e5 0m10.640s 0m15.442s 0m0.469s
1e6 0m26.121s 0m37.502s 0m0.781s

Indirect serialization

Array length PyYAML ruamel.yaml JSON
4e5 0m30.333s 0m42.485s 0m0.458s
1e6 1m13.799s 1m51.725s 0m0.774s

Notes

  • I think the reason PyYAML and ruamel.yaml are much slower is because of the graph they construct, so the results might significantly change if one uses a fast emitter like rapidyaml. Also some of the slow performance is due to the underlying python implementation, so replacing SafeDumper with CSafeDumper makes a lot of difference (will follow up with another comment).
  • I don't see any real benefit in serializing to JSON before converting to YAML with the current implementation because the YAML safe dumper uses yaml_encode to encode numpy.ndarray (equivalent to JSONArrayEncoder). Binary representation in msgpack is unalterted and shouldn't be affected, unless I'm missing something @loriab ?

Code

if emit_json:
    stringified = json_dumps(data, **kwargs)
    data = yaml_load(stringified)
return yaml_dump(data, **kwargs)

@anabiman
Copy link
Collaborator Author

anabiman commented Apr 5, 2021

By replacing yaml.SafeDumper and ruamel.yaml.RoundTripDumper with CSafeDumper, we get about 5-6X speedup.

Direct serialization with CSafeDumper

Array length PyYAML ruamel.yaml JSON
4e5 0m2.125s 0m2.546s 0m0.428s
1e6 0m4.992s 0m6.348s 0m0.748s

Direct serialization (Python)

Array length PyYAML ruamel.yaml JSON
4e5 0m10.640s 0m15.442s 0m0.469s
1e6 0m26.121s 0m37.502s 0m0.781s

@loriab loriab added the wontfix This will not be worked on label Aug 26, 2021
@coltonbh
Copy link
Collaborator

coltonbh commented Mar 29, 2023

Hi @anabiman . I'm helping out with a little maintenance on the repo to address old issues and PRs. Thanks for your thoughts on the repo :) Could the following solution work for you? You only need one line of code to serialize any object to yaml (or your preferred serialization format).

import yaml

# Convert to YAML
yaml_data = yaml.dump(my_molecule.dict())

You can load your yaml file containing your Molecule (or any other QCSchema objects) with just one line of code too:

with open("my-yaml-data.yaml") as f:
    python_dict = yaml.load(f.read(), yaml.Loader)
    # Or cast to QCSchema obj in one line
    molecule = Molecule(**yaml.load(f.read(), yaml.Loader))

I'd like to avoid adding 200 lines of code to the library (that require testing, maintenance, etc.) to save a one-liner from the python built in library. I also want to be sensitive to use cases you have in mind in case I'm missing something here :) Thoughts?

@coltonbh
Copy link
Collaborator

I see this is already marked wontfix by @loriab . Closing issue. @anabiman please let me know if you have any questions regarding the solution I proposed above. Happy to help if there's something more bespoke you have in mind :)

@coltonbh coltonbh closed this Mar 29, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
wontfix This will not be worked on
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants