The doc example for lazily printing error messages with iter_errors() uses Python's sorted. Calling sorted forces the whole generator to be evaluated in memory, which defeats the benefit of lazy evaluation if you only want to print the first N errors. The snippet below consumes an additional 2.5 GB of memory on my machine once the errors are sorted.
I would recommend removing the sorted call so users do not naively assume they are still getting the benefit of lazy evaluation. When error counts are small the memory usage is not noticeable, which leaves a dangerous blind spot if error counts increase unexpectedly in the future (e.g. for data that is nowhere near the schema spec). A sketch of a lazier alternative follows the snippet.
import jsonschema
import psutil

# Create a schema for a nested array and 1M mock examples that violate it
schema = {
    'type': 'array',
    'items': {
        'type': 'array',
        'minItems': 3,
        'maxItems': 3,
        'additionalItems': False,
        'items': {'type': 'integer'}}}
data = [{'a': 'b'} for _ in range(1000000)]

# Track memory throughout validation and error printing process
mem = {}
max_errors = 5

mem['pre_val'] = psutil.virtual_memory().used
validator = jsonschema.Draft7Validator(schema)
mem['post_val'] = psutil.virtual_memory().used

errors = validator.iter_errors(data)
mem['pre_iter'] = psutil.virtual_memory().used
for i, e in enumerate(errors):
    print(e.message)
    if i + 1 >= max_errors:
        break
mem['post_iter'] = psutil.virtual_memory().used

mem['pre_sort'] = psutil.virtual_memory().used
errors_sort = sorted(errors, key=lambda e: e.path)
mem['post_sort'] = psutil.virtual_memory().used

# Summarize usage
print(f'{len(errors_sort)} errors')
for k, v in mem.items():
    print(f'{k}: {v / 1000000:,.2f} MB')
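For what it's worth, here is a minimal sketch (not from jsonschema's docs; the toy schema, data, and max_errors value are placeholders) of two patterns that keep memory bounded: itertools.islice stops after the first N errors without draining the generator, and heapq.nsmallest still walks every error when an ordering is wanted but never holds more than N of them at once.

import heapq
import itertools

import jsonschema

# Placeholder schema and data for illustration only
schema = {'type': 'array', 'items': {'type': 'integer'}}
data = ['not an integer'] * 1000000
max_errors = 5

validator = jsonschema.Draft7Validator(schema)

# Lazy: print only the first N errors; the rest are never generated
for error in itertools.islice(validator.iter_errors(data), max_errors):
    print(error.message)

# Bounded "sort": keeps at most N errors in memory at any time, although it
# still iterates over every error the generator produces
for error in heapq.nsmallest(max_errors,
                             validator.iter_errors(data),
                             key=lambda e: tuple(e.path)):
    print(error.message)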