Extending serialization #255

anabiman · 2021-03-08T16:25:15Z

Description

Adds serialization/deserialization support for YAML to ProtoModel and the serialization module.
Modifies to_file method in Molecule for passing serialization arguments.

The general idea is to be able to do things like:
e.g.

mol.to_file(json_file, indent=4)  # prettify JSON
mol.to_file(yaml_file, canonical=True)  # serialization to YAML file in canonical format
mol = qcelemental.models.molecule.Molecule.from_file(yaml_file) 
stringified = mol.yaml()
mol = qcelemental.models.molecule.Molecule.from_data(
    stringified, dtype="yaml"
)

Notes

ProtoModel.parse_raw() assumes encoding=json if the data type is str and encoding is unspecified. Related to Issue113. This method does not distinguish between a YAML and JSON str.
PR is compatible with the following parsers/emitters:
- PyYAML
- ruamel.yaml (supports YAML 1.2)
Keep track of the YAML version in the schema (relevant to Issue1478) ?

Changelog description

Additional serialization arguments can be passed to Molecule.to_file
Serialization/deserialization support for YAML

Status

Code base linted
Ready to go

loriab · 2021-03-08T17:21:45Z

Neat, I'm fond of yaml, too!

Loading data into models from a yaml form is pretty local and seems fine by me. I'm a little hesitant about extending serialize. The existing json/json-ext/msgpack-ext all vary in how the numerical data is represented (plain vs binary). Would it be crippling yaml storage at all to serialize to json then convert to yaml? I worry about toml, etc.

anabiman · 2021-03-08T18:22:38Z

Would it be crippling yaml storage at all to serialize to json then convert to yaml? I worry about toml, etc.

That works too and would make things simpler, but it would restrict the YAML format to be strictly JSON (since YAML is a superset of JSON), and of course it is less efficient. Perhaps the efficiency is of minor significance for small molecules in QC, but I'm using qcelemental for large macromolecules (for molecular mechanics), so I'll do some bench marking and get back to this later.

codecov · 2021-04-05T12:45:21Z

Codecov Report

Merging #255 (3916a91) into master (ff52f14) will decrease coverage by 0.68%.
The diff coverage is 48.86%.

anabiman · 2021-04-05T20:51:48Z

It seems JSON serialization is roughly faster by a factor of 20-50 compared to PyYAML & ruamel.yaml (installed via pip, version 5.4.1 & 0.17.2) for large data sets (1D numpy arrays storing random values), and it gets worse if serialization to JSON is involved as well (overhead mainly due to yaml_load, see code below).

The 1st table shows the results for direct YAML serialization while the 2nd includes the additional, intermediate step of JSON serialization, which makes process roughly ~ 3 times slower.

Direct serialization

Array length	PyYAML	ruamel.yaml	JSON
4e5	0m10.640s	0m15.442s	0m0.469s
1e6	0m26.121s	0m37.502s	0m0.781s

Indirect serialization

Array length	PyYAML	ruamel.yaml	JSON
4e5	0m30.333s	0m42.485s	0m0.458s
1e6	1m13.799s	1m51.725s	0m0.774s

Notes

I think the reason PyYAML and ruamel.yaml are much slower is because of the graph they construct, so the results might significantly change if one uses a fast emitter like rapidyaml. Also some of the slow performance is due to the underlying python implementation, so replacing SafeDumper with CSafeDumper makes a lot of difference (will follow up with another comment).
I don't see any real benefit in serializing to JSON before converting to YAML with the current implementation because the YAML safe dumper uses yaml_encode to encode numpy.ndarray (equivalent to JSONArrayEncoder). Binary representation in msgpack is unalterted and shouldn't be affected, unless I'm missing something @loriab ?

Code

if emit_json:
    stringified = json_dumps(data, **kwargs)
    data = yaml_load(stringified)
return yaml_dump(data, **kwargs)

anabiman · 2021-04-05T21:06:50Z

By replacing yaml.SafeDumper and ruamel.yaml.RoundTripDumper with CSafeDumper, we get about 5-6X speedup.

Direct serialization with CSafeDumper

Array length	PyYAML	ruamel.yaml	JSON
4e5	0m2.125s	0m2.546s	0m0.428s
1e6	0m4.992s	0m6.348s	0m0.748s

Direct serialization (Python)

Array length	PyYAML	ruamel.yaml	JSON
4e5	0m10.640s	0m15.442s	0m0.469s
1e6	0m26.121s	0m37.502s	0m0.781s

coltonbh · 2023-03-29T05:46:23Z

Hi @anabiman . I'm helping out with a little maintenance on the repo to address old issues and PRs. Thanks for your thoughts on the repo :) Could the following solution work for you? You only need one line of code to serialize any object to yaml (or your preferred serialization format).

import yaml

# Convert to YAML
yaml_data = yaml.dump(my_molecule.dict())

You can load your yaml file containing your Molecule (or any other QCSchema objects) with just one line of code too:

with open("my-yaml-data.yaml") as f:
    python_dict = yaml.load(f.read(), yaml.Loader)
    # Or cast to QCSchema obj in one line
    molecule = Molecule(**yaml.load(f.read(), yaml.Loader))

I'd like to avoid adding 200 lines of code to the library (that require testing, maintenance, etc.) to save a one-liner from the python built in library. I also want to be sensitive to use cases you have in mind in case I'm missing something here :) Thoughts?

coltonbh · 2023-03-29T05:52:12Z

I see this is already marked wontfix by @loriab . Closing issue. @anabiman please let me know if you have any questions regarding the solution I proposed above. Happy to help if there's something more bespoke you have in mind :)

anabiman added 17 commits March 4, 2021 10:40

add YAML encoders & serialization in ProtoModel

478b6d4

support for Molecule reading/writing YAML files

23d591c

serialize and loads/dumps functions take kwargs

30de168

support for yaml safe load/dump

4c39af7

serialize_extensions includes yaml

04253e3

serialize handles yaml encoding

899b74e

use which_import for pyyaml

40ba4cf

add yaml_encode and safe_dump

636f789

use yaml_import

45e5ccf

import yaml_import

0e9c2de

add yaml_import function

f3e4c02

add support for ruamel.yaml

4aee3b6

add yaml to extras_require & format code

ed68cf6

update format

938a352

local yaml import

438c2ca

pass kwargs to constructors in Molecule.to_file

4038dee

improve docstring in yaml_encode

d789910

anabiman added 4 commits April 5, 2021 08:38

change yaml_dump func name to yaml_safe_dump

8497ad3

merge with MolSSI#254

d1a63b2

add support for parsing yaml files

f0a54e7

Merge branch 'master' into serialize

3916a91

loriab added the wontfix This will not be worked on label Aug 26, 2021

coltonbh closed this Mar 29, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Extending serialization #255

Extending serialization #255

anabiman commented Mar 8, 2021 •

edited

Loading

loriab commented Mar 8, 2021

anabiman commented Mar 8, 2021

codecov bot commented Apr 5, 2021 •

edited

Loading

anabiman commented Apr 5, 2021

anabiman commented Apr 5, 2021

coltonbh commented Mar 29, 2023 •

edited

Loading

coltonbh commented Mar 29, 2023

Extending serialization #255

Extending serialization #255

Conversation

anabiman commented Mar 8, 2021 • edited Loading

Description

Notes

Changelog description

Status

loriab commented Mar 8, 2021

anabiman commented Mar 8, 2021

codecov bot commented Apr 5, 2021 • edited Loading

Codecov Report

anabiman commented Apr 5, 2021

Direct serialization

Indirect serialization

Notes

Code

anabiman commented Apr 5, 2021

Direct serialization with CSafeDumper

Direct serialization (Python)

coltonbh commented Mar 29, 2023 • edited Loading

coltonbh commented Mar 29, 2023

anabiman commented Mar 8, 2021 •

edited

Loading

codecov bot commented Apr 5, 2021 •

edited

Loading

coltonbh commented Mar 29, 2023 •

edited

Loading