Add server config option to disable validation of outgoing data #1530

Merged
ml-evs merged 3 commits into master from ml-evs/add_validation_shortcut
Feb 27, 2023

@ml-evs ml-evs commented Feb 23, 2023

This PR adds the config option validate_api_response to the reference server, which is enabled by default. Disabling this will short-circuit the pydantic validation of outgoing data, which can be used to allow things like "X" in chemical formulae through the API. Currently there are no associated warnings raised in this case, but an intermediate setting could be added to do this (would still have the performance hit of validation, but this does not seem to be sizable).

The server tests are now run in the CI in both modes, but there are currently no tests confirming that disabling validation does indeed allow invalid data through (it does); setting up tests for this negative case would be more effort than I can afford atm.
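
For readers unfamiliar with the idea, the short-circuit described above can be sketched roughly as follows. This is a minimal illustration and not the code in this PR: `Structure`, `KNOWN_ELEMENTS` and `prepare_entries` are hypothetical names, and a plain dataclass stands in for the pydantic models the reference server actually uses. Only the option name `validate_api_response` comes from the PR itself.

```python
import re
from dataclasses import dataclass

# Deliberately tiny, illustrative set of element symbols (assumption, not the real list).
KNOWN_ELEMENTS = {"H", "C", "O", "Si", "Fe"}


@dataclass
class Structure:
    """Toy stand-in for a validated response model (the real server uses pydantic)."""

    id: str
    chemical_formula_reduced: str

    def __post_init__(self):
        # Reject any symbol that is not a known element, e.g. the placeholder "X".
        symbols = re.findall(r"[A-Z][a-z]?", self.chemical_formula_reduced)
        unknown = [s for s in symbols if s not in KNOWN_ELEMENTS]
        if unknown:
            raise ValueError(f"unknown element symbols: {unknown}")


def prepare_entries(raw_entries, validate_api_response=True):
    """Return validated models, or the raw dicts untouched when validation is off."""
    if not validate_api_response:
        # Short-circuit: no model is constructed, so nothing is checked.
        return list(raw_entries)
    return [Structure(**entry) for entry in raw_entries]
```

With the flag left on, an entry such as `{"id": "1", "chemical_formula_reduced": "SiX2"}` raises a `ValueError` in this sketch; with the flag off, the same dictionary is passed through unchanged.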

@ml-evs ml-evs marked this pull request as ready for review February 26, 2023 22:31

codecov bot commented Feb 26, 2023

Codecov Report

Merging #1530 (c1d9969) into master (c3ed95d) will increase coverage by 0.00%.
The diff coverage is 95.45%.

❗ Current head c1d9969 differs from pull request most recent head 6043131. Consider uploading reports for the commit 6043131 to get more accurate results

@@           Coverage Diff           @@
##           master    #1530   +/-   ##
=======================================
  Coverage   91.10%   91.10%           
=======================================
  Files          74       74           
  Lines        4519     4531   +12     
=======================================
+ Hits         4117     4128   +11     
- Misses        402      403    +1     
| Flag | Coverage Δ |
|---|---|
| project | 91.10% <95.45%> (+<0.01%) ⬆️ |
| validator | 90.99% <95.45%> (+<0.01%) ⬆️ |


| Impacted Files | Coverage Δ |
|---|---|
| optimade/server/mappers/entries.py | 98.21% <ø> (-0.90%) ⬇️ |
| ...made/server/entry_collections/entry_collections.py | 97.84% <90.00%> (+0.04%) ⬆️ |
| optimade/server/config.py | 93.61% <100.00%> (+0.06%) ⬆️ |
| optimade/server/routers/utils.py | 96.72% <100.00%> (+0.23%) ⬆️ |


ml-evs commented Feb 27, 2023

@JPBergsma I know you're busy with tutorials (too) so I think I'll double-check this, merge then release now so that @markus1978 can try it out.

@ml-evs ml-evs merged commit a22cddb into master Feb 27, 2023
@ml-evs ml-evs deleted the ml-evs/add_validation_shortcut branch February 27, 2023 11:56
@ml-evs ml-evs added the enhancement (New feature or request) and server (Issues pertaining to the example server implementation) labels Feb 27, 2023
Comment on lines 363 to 371
def deserialize(
    cls, results: Union[dict, Iterable[dict]]
) -> Union[List[EntryResource], EntryResource]:
    """Converts the raw database entries for this class into serialized models,
    mapping the data along the way.

    """
    if isinstance(results, dict):
        return cls.ENTRY_RESOURCE_CLASS(**cls.map_back(results))
Contributor

I think these lines are no longer needed for our implementation, now we always pass a list.

Suggested change
- def deserialize(
-     cls, results: Union[dict, Iterable[dict]]
- ) -> Union[List[EntryResource], EntryResource]:
-     """Converts the raw database entries for this class into serialized models,
-     mapping the data along the way.
-     """
-     if isinstance(results, dict):
-         return cls.ENTRY_RESOURCE_CLASS(**cls.map_back(results))
+ def deserialize(
+     cls, results: Iterable[dict]
+ ) -> List[EntryResource]:
+     """Converts the raw database entries for this class into serialized models,
+     mapping the data along the way.
+     """
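
As a rough illustration of what a list-only `deserialize` could look like once the single-dict branch is dropped, here is a self-contained sketch. `ToyMapper`, its `map_back` logic and the dataclass `EntryResource` are hypothetical stand-ins, not the classes from the repository; only the names `deserialize`, `map_back` and `ENTRY_RESOURCE_CLASS` are taken from the snippet under review.

```python
from dataclasses import dataclass
from typing import Any, Dict, Iterable, List


@dataclass
class EntryResource:
    """Hypothetical stand-in for the real entry resource model."""

    id: str
    attributes: Dict[str, Any]


class ToyMapper:
    ENTRY_RESOURCE_CLASS = EntryResource

    @classmethod
    def map_back(cls, doc: dict) -> dict:
        # Toy mapping (assumption): move the database "_id" to "id", nest the rest.
        doc = dict(doc)
        return {"id": str(doc.pop("_id")), "attributes": doc}

    @classmethod
    def deserialize(cls, results: Iterable[dict]) -> List[EntryResource]:
        """Convert raw database entries into models; always takes an iterable of dicts."""
        return [cls.ENTRY_RESOURCE_CLASS(**cls.map_back(doc)) for doc in results]
```

A caller holding a single document would wrap it in a list, e.g. `ToyMapper.deserialize([doc])[0]`, which keeps the return type uniform.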

Contributor

@JPBergsma JPBergsma left a comment

I still had a few remarks about this PR. Other than that, it looks like a good change to me.

Comment on lines +119 to +122
try:
    new_entry = new_entry.dict(exclude_unset=True, by_alias=True)  # type: ignore[union-attr]
except AttributeError:
    pass
Contributor

You should not use try and except here. Handling an exception is very slow, so you should only use it when failure is rare (< 1%).
When validation is turned off, however, new_entry is always a dictionary, so failure is not rare.
It is therefore better to do:

Suggested change
- try:
-     new_entry = new_entry.dict(exclude_unset=True, by_alias=True)  # type: ignore[union-attr]
- except AttributeError:
-     pass
+ if not isinstance(new_entry, dict):
+     new_entry = new_entry.dict(exclude_unset=True, by_alias=True)  # type: ignore[union-attr]

Member Author

I'm not sure this is so clear-cut; I just ran an artificial benchmark on a very simple pydantic model, comparing exception handling with isinstance checks.

With validate_api_response: true (our default), the exception-handling version is about 1% faster than the isinstance check; when passing raw dictionaries, the isinstance check is about 2% faster, i.e., not much changes either way. Both effects are dwarfed by the difference between using dicts and pydantic models, which is a factor of 20x.

I would rather avoid slowing down the "slower" method, i.e., I would keep using exception handling by default.

Contributor

If performance is important, the database will probably turn off validation to speed things up.
If performance is less important, the database will use a slower method and leave the validation on.
So I would argue it would be best to make the fastest method as fast as possible.

Member Author

Perhaps, though I'm not convinced that disabling validation provides any meaningful performance boost, and instead is just used to bypass some of the strict rules we have on databases where the effort is too much to apply them (e.g., NOMAD uses "X" in like 10 out of millions of chemical formulae, and trying to query them with validation on causes crashes).

Contributor

@JPBergsma JPBergsma Mar 7, 2023

I just did a quick try on my laptop with the test data, and it takes 25% longer to process the request with validation. So it is not a huge performance increase, but definitely noticeable.

Member Author

Wow, really? I tried via the validator and could only get 1-2% difference. I'll re-investigate if I get time.

Contributor

I meant that the total processing time of a request increases by 25% if I do the validation, compared to not validating.

I just did some more testing and it seems that the try/except block takes about 1.5 times longer to execute than the "if" statement. Using "if" saves about 2.25 µs per entry, which is smaller than I had expected. So for our example server we would save only 40 µs out of 0.2 s, i.e., about 0.02%.
So it is probably not worth continuing this discussion.
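
For anyone wanting to repeat this kind of measurement, a micro-benchmark along the following lines could be used. It is only a sketch under stated assumptions: `FakeModel`, `to_dict_try`, `to_dict_isinstance` and `benchmark` are hypothetical names, and `FakeModel` merely mimics a model object with a `.dict()` method rather than being a real pydantic model, so absolute timings will differ from those quoted in this thread.

```python
import timeit


class FakeModel:
    """Hypothetical stand-in for a pydantic model exposing a .dict() method."""

    def __init__(self, **fields):
        self._fields = fields

    def dict(self, **kwargs):
        # Keyword arguments such as exclude_unset/by_alias are accepted but ignored here.
        return dict(self._fields)


def to_dict_try(entry):
    # Variant from the PR: attempt the conversion, fall back if entry is already a dict.
    try:
        return entry.dict(exclude_unset=True, by_alias=True)
    except AttributeError:
        return entry


def to_dict_isinstance(entry):
    # Variant from the review suggestion: check the type first.
    if not isinstance(entry, dict):
        return entry.dict(exclude_unset=True, by_alias=True)
    return entry


def benchmark(entry, number=100_000):
    """Time both variants on the same entry; returns total seconds per variant."""
    return {
        "try/except": timeit.timeit(lambda: to_dict_try(entry), number=number),
        "isinstance": timeit.timeit(lambda: to_dict_isinstance(entry), number=number),
    }
```

Calling `benchmark({"id": "1"})` exercises the path where validation is off (raw dictionaries, so the exception is raised every time), while `benchmark(FakeModel(id="1"))` exercises the validated path. Results depend heavily on the machine and Python version, so no expected numbers are given here.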
