Add server config option to disable validation of outgoing data #1530
Codecov Report
@@           Coverage Diff           @@
##           master    #1530   +/-   ##
=======================================
  Coverage   91.10%   91.10%
=======================================
  Files          74       74
  Lines        4519     4531   +12
=======================================
+ Hits         4117     4128   +11
- Misses        402      403    +1
@JPBergsma I know you're busy with tutorials (too), so I think I'll double-check this, merge, and then release now so that @markus1978 can try it out.
    def deserialize(
        cls, results: Union[dict, Iterable[dict]]
    ) -> Union[List[EntryResource], EntryResource]:
        """Converts the raw database entries for this class into serialized models,
        mapping the data along the way.

        """
        if isinstance(results, dict):
            return cls.ENTRY_RESOURCE_CLASS(**cls.map_back(results))
I think these lines are no longer needed for our implementation, since we now always pass a list.
    - def deserialize(
    -     cls, results: Union[dict, Iterable[dict]]
    - ) -> Union[List[EntryResource], EntryResource]:
    -     """Converts the raw database entries for this class into serialized models,
    -     mapping the data along the way.
    -     """
    -     if isinstance(results, dict):
    -         return cls.ENTRY_RESOURCE_CLASS(**cls.map_back(results))
    + def deserialize(
    +     cls, results: Iterable[dict]
    + ) -> List[EntryResource]:
    +     """Converts the raw database entries for this class into serialized models,
    +     mapping the data along the way.
    +     """
I still had a few remarks about this PR. Other than that, it looks like a good change to me.
    try:
        new_entry = new_entry.dict(exclude_unset=True, by_alias=True)  # type: ignore[union-attr]
    except AttributeError:
        pass
You should not use try and except here. Handling a raised exception is very slow, so you should only use it when failure is rare (< 1%).
When validation is turned off, however, new_entry is always a dictionary, so failure is not rare.
It is therefore better to do:
    - try:
    -     new_entry = new_entry.dict(exclude_unset=True, by_alias=True)  # type: ignore[union-attr]
    - except AttributeError:
    -     pass
    + if not isinstance(new_entry, dict):
    +     new_entry = new_entry.dict(exclude_unset=True, by_alias=True)  # type: ignore[union-attr]
I'm not sure this is so clear-cut; I just made an artificial benchmark with a very simple pydantic model, comparing exception handling against isinstance checks.
With validate_api_response: true (our default), the exception-handling version is about 1% faster than the isinstance check, and with raw dictionaries the isinstance check is about 2% faster, i.e., not much changes either way. This is also dwarfed by the difference between using dicts and pydantic models, which is a factor of 20x.
I would rather avoid slowing down the "slower" method, which means keeping exception handling as the default.
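For reference, a micro-benchmark of this kind could look roughly like the sketch below. This is not the benchmark that was actually run: `FakeModel` is a hypothetical stand-in that only mimics the `.dict()` call of a pydantic model, and the helper names are made up for illustration. Absolute timings will vary by machine and Python version, so no numbers are quoted here.

```python
import timeit


class FakeModel:
    """Hypothetical stand-in for a pydantic model; it only mimics `.dict()`."""

    def __init__(self, **fields):
        self._fields = fields

    def dict(self, exclude_unset=True, by_alias=True):
        # The keyword arguments are accepted but ignored in this stand-in.
        return dict(self._fields)


def to_dict_try(entry):
    """Call `.dict()` and fall back to the entry itself if it has no such method."""
    try:
        return entry.dict(exclude_unset=True, by_alias=True)
    except AttributeError:
        return entry


def to_dict_isinstance(entry):
    """Check the type first and only call `.dict()` on non-dict entries."""
    if not isinstance(entry, dict):
        return entry.dict(exclude_unset=True, by_alias=True)
    return entry


def benchmark(number=100_000):
    """Time both variants for a model entry and for a raw dict entry."""
    raw = {"id": "1", "type": "structures"}
    model = FakeModel(**raw)
    results = {}
    for label, entry in (("model", model), ("dict", raw)):
        for func in (to_dict_try, to_dict_isinstance):
            elapsed = timeit.timeit(lambda: func(entry), number=number)
            results[(label, func.__name__)] = elapsed
    return results


if __name__ == "__main__":
    for (label, name), seconds in benchmark().items():
        print(f"{label:>5} {name:<20} {seconds:.4f} s")
```

The "dict" rows exercise the path taken when validation is off (the exception is raised every time in the try/except variant), and the "model" rows exercise the default path.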
If performance is important, the database will probably turn off validation to speed things up.
If performance is less important, the database will use a slower method and leave the validation on.
So I would argue that it would be best to make the fastest method as fast as possible.
Perhaps, though I'm not convinced that disabling validation provides any meaningful performance boost. I think it will instead just be used to bypass some of our strict rules on databases where applying them would take too much effort (e.g., NOMAD uses "X" in roughly 10 out of millions of chemical formulae, and trying to query them with validation on causes crashes).
I just did a quick try on my laptop with the test data, and it takes 25% longer to process a request with validation. So it is not a huge performance increase, but definitely noticeable.
Wow, really? I tried via the validator and could only get a 1-2% difference. I'll re-investigate if I get time.
I meant that the total processing time of a request increases by 25% if I do the validation, compared to not validating.
I just did some more testing, and it seems that the try/except block takes about 1.5 times longer to execute than the "if" statement. Using "if" saves about 2.25 µs per entry, which is less than I had expected. For our example server we would therefore only save 40 µs out of 0.2 s, i.e., only 0.02%.
So it is probably not worth continuing this discussion.
This PR adds the config option validate_api_response to the reference server, which is enabled by default. Disabling this will short-circuit the pydantic validation of outgoing data, which can be used to allow things like "X" in chemical formulae through the API. Currently there are no associated warnings raised in this case, but an intermediate setting could be added to do this (it would still have the performance hit of validation, but this does not seem to be sizable).
The server tests are now run in the CI in both modes, but there are currently no tests that the lack of validation does indeed allow invalid data through. It does, and setting up tests for this negative case would be more effort than I can afford at the moment.
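As a rough illustration of the short-circuit described above, the sketch below passes raw dictionaries straight through when the flag is off. It is not the code from this PR: `StructureResource`, `serialize`, and the formula rule are hypothetical and heavily simplified, and only the option name `validate_api_response` comes from the description.

```python
import re
from typing import Iterable, List, Union


class StructureResource:
    """Hypothetical stand-in for a validated resource model."""

    def __init__(self, chemical_formula: str):
        # Assumed, simplified rule: split the formula into element-like
        # symbols and reject the placeholder symbol "X".
        symbols = re.findall(r"[A-Z][a-z]?", chemical_formula)
        if "X" in symbols:
            raise ValueError(f"invalid chemical formula: {chemical_formula!r}")
        self.chemical_formula = chemical_formula


def serialize(
    results: Iterable[dict], validate_api_response: bool = True
) -> List[Union[StructureResource, dict]]:
    """Build validated models, or return the raw dicts when validation is off."""
    if validate_api_response:
        return [StructureResource(**entry) for entry in results]
    # Short-circuit: no model is constructed, so no validation error can occur.
    return list(results)
```

With the flag left at its default, an entry such as `{"chemical_formula": "XO2"}` raises a `ValueError` in this sketch; with `validate_api_response=False` the same entry is returned unchanged.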