Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Additional write modes for WeldxFile #690

Closed
vhirtham opened this issue Jan 27, 2022 · 11 comments · Fixed by #697
Closed

Additional write modes for WeldxFile #690

vhirtham opened this issue Jan 27, 2022 · 11 comments · Fixed by #697
Labels
ASDF everything ASDF related (python + schemas) feature new API features file weldx file handling (WeldxFile) validation custom ASDF validators

Comments

@vhirtham
Copy link
Collaborator

Currently, we only have the modes r and for reading and rw for creating a new file if it doesn't exist or modifying a file if it does exist.
I suggest that we also add w to only write if the file doesn't exist and wf to overwrite an existing file.

Reason: When developing a file schema, the rw mode can have some undesired side effects. For example, you write a file, and afterward, you adjust the schema to be more strict and wouldn't allow the current data to be written. The next write attempt gives you the expected schema validation error. You correct the data and try to write it again, but you get the same validation error, even if the data seems valid. Why? Because WeldxFile reads the old file first, that was valid before the schema was updated. If you don't realize this, you have a rough time figuring out why your correct data still causes a validation error (or at least seems to).

So using w mode would raise a meaningful exception if the file already exists while wf (force) simply overwrites the old file without opening it.

@vhirtham vhirtham added ASDF everything ASDF related (python + schemas) feature new API features validation custom ASDF validators file weldx file handling (WeldxFile) labels Jan 27, 2022
@marscher
Copy link
Contributor

Thanks for bringing this up, but ain't that a real corner case?
I do not understand, why the third write (how did you actually do that, since there are several write mechanisms) would trigger the existing file to be read again.

If we consider implementing more file handling modes, I suggest we stick to the nomenclature of the builtin open function: https://docs.python.org/3/library/functions.html#open

@vhirtham
Copy link
Collaborator Author

I do not understand, why the third write (how did you actually do that, since there are several write mechanisms) would trigger the existing file to be read again.

If you restart the script, then the file will be read and validated again in "rw" mode before you get to the writing part. Due to the asdf caching, you can't even test schema modifications without restarting the script.

but ain't that a real corner case?

My case might be a corner case, but there might also be situations where a user doesn't want to modify a file and accidentally picks an existing name. In this case, an exception would be nice before it is too late ;)

If we consider implementing more file handling modes, I suggest we stick to the nomenclature of the builtin open function: https://docs.python.org/3/library/functions.html#open

👍

@marscher
Copy link
Contributor

If a file exists and the user does not want to modify it, he or she should have opened it with mode="r" (the default). So there is already some sanity checks about accidentally modifying files. What am I missing?

@vhirtham
Copy link
Collaborator Author

If a file exists and the user does not want to modify it, he or she should have opened it with mode="r" (the default). So there is already some sanity checks about accidentally modifying files. What am I missing?

That the user might want to write a new file without accidentally overwriting stuff in already existing files just because he forgot to increase a number or whatever else might be the reason he used an existing file name ;)

@CagtayFabry
Copy link
Member

I guess the problem comes down to this example:

# create a valid weldx file
with WeldxFile("debug.wx", tree={"a":42}, mode="rw") as wxfile:
    pass

# try creating the file again with new contents
with WeldxFile("debug.wx", tree={"b":19}, mode="rw", custom_schema="single_pass_weld-0.1.0") as wxfile:
    pass

>>> ValidationError: 'process' is a required property
>>> ...
>>> On instance:
>>>    {'a': 42, .... }

This might be a bit confusing since the validation error happens when reading the existing file. Maybe we can catch that case and raise a more specific error? To let the users know that the existing file is the problem (and they can delete it if they really want to create a file from scratch)

@marscher
Copy link
Contributor

marscher commented Feb 2, 2022

It is not clear weather the ValidationError is raised upon writing the new tree (b=19) or reading the existing data, which also violates the schema.
Anyhow we should add a truncation flag as in Python open for advanced users.

@CagtayFabry
Copy link
Member

It is not clear weather the ValidationError is raised upon writing the new tree (b=19) or reading the existing data, which also violates the schema. Anyhow we should add a truncation flag as in Python open for advanced users.

To me it looks like the ValidationError is created upon reading in the example since the data {'a': 42, is listed in the exception

But to be sure we should certainly run a test with valid data (or a different schema) when trying the second write operation

@marscher
Copy link
Contributor

marscher commented Feb 2, 2022

We could also specify, if the schema validation should happen upon opening or writing (or both).

@marscher
Copy link
Contributor

marscher commented Feb 3, 2022

In general I'm a great fan of "fail as early as possible", since it makes error tracking much easier. In this case this pattern just prohibits updating the file, if the schema is not being fulfilled. So besides of adding new file modes, I suggest to add a flag "early_validation", which is should be set to True by default (IMHO). This way it would be possible to load a non conforming files, update the tree and validate it later on during serialization/writing. I would also explicitly state, that this flag is most useful for schema development.

What do you think?

Anyhow I think this file modes is the one you are basically referring to achieve the overwrite the contents:

"w+" - open for writing, truncating the file first

@CagtayFabry
Copy link
Member

What do you think of the following approach:

we expand the allowed inputs for custom_schema to support a pair of schema files with this behavior

custom_schema = None # no validation (default)
custom_schema = "A" # validate against A on read and write (current behavior)
custom_schema = ["A", "B"] # validate with A on read, validate with B on write
custom_schema = [None, "B"] # no validation on read, validate with B on write
custom_schema = ["A", None] # validate with A on read, no validation on write

This would provide a neat wrapper when transforming/extracting data between different schemas

With this implementing w+ would be similar to [None, "B"]

@marscher
Copy link
Contributor

marscher commented Feb 4, 2022

That looks very useful. I will open up a PR for this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ASDF everything ASDF related (python + schemas) feature new API features file weldx file handling (WeldxFile) validation custom ASDF validators
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants