Validator.iter_errors() doc should not recommend sorted #757

@addisonklinke

Description

The doc example for lazily printing error messages with iter_errors() uses Python's sorted(). However, sorting forces the entire generator to be evaluated and held in memory, which defeats the benefit of lazy evaluation if you only want to print the first N errors. The snippet below consumes an additional 2.5 GB of memory on my machine once the errors are sorted.

I would recommend removing the sorted() call so users do not naively assume they are getting the benefit of lazy evaluation. When error counts are small, the extra memory usage is not noticeable, which leaves a dangerous blind spot if the error count grows unexpectedly in the future (e.g. for data that is nowhere near the schema spec).

import jsonschema
import psutil

# Create a schema for a nested array and 1M mock examples that violate it
schema = {
    'type': 'array',
    'items': {
        'type': 'array',
        'minItems': 3,
        'maxItems': 3,
        'additionalItems': False,
        'items': {'type': 'integer'}}}
data = [{'a': 'b'} for _ in range(1000000)]

# Track memory throughout validation and error printing process
# (psutil.virtual_memory().used is system-wide, so readings include other processes)
mem = {}
max_errors = 5
mem['pre_val'] = psutil.virtual_memory().used
validator = jsonschema.Draft7Validator(schema)
mem['post_val'] = psutil.virtual_memory().used
errors = validator.iter_errors(data)
mem['pre_iter'] = psutil.virtual_memory().used
for i, e in enumerate(errors):
    print(e.message)
    if i + 1 >= max_errors:
        break
mem['post_iter'] = psutil.virtual_memory().used
mem['pre_sort'] = psutil.virtual_memory().used
# sorted() must drain the rest of the generator into a list before it can return
errors_sort = sorted(errors, key=lambda e: e.path)
mem['post_sort'] = psutil.virtual_memory().used

# Summarize usage
print(f'{len(errors_sort)} errors')
for k, v in mem.items():
    print(f'{k}: {v / 1000000:,.2f} MB')
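
For comparison, here is a minimal sketch of two ways to get N errors without building the full list. It is not taken from the jsonschema docs: `first_n`, `smallest_n`, and `fake_errors` are hypothetical names, and `fake_errors` is only a stand-in generator for `validator.iter_errors(data)` so the example runs without jsonschema installed. `itertools.islice` stops pulling from the generator after N items, while `heapq.nsmallest` still walks every error but keeps only about N of them in memory at a time.

```python
import heapq
from itertools import islice


def first_n(errors, n):
    """Return the first n items, pulling no more than n from the iterator."""
    return list(islice(errors, n))


def smallest_n(errors, n, key):
    """Return the n smallest items by key, holding only about n in memory.

    This still walks the whole iterator, but never materializes it as a list.
    """
    return heapq.nsmallest(n, errors, key=key)


def fake_errors(count, produced):
    """Hypothetical stand-in for validator.iter_errors(data)."""
    for i in range(count):
        produced.append(i)  # record how many items were actually generated
        yield (count - i, f"error at index {i}")  # (sort key, message)


produced = []
head = first_n(fake_errors(1_000_000, produced), 5)
print(len(head), len(produced))  # only 5 items were generated

lowest = smallest_n(fake_errors(1_000, []), 3, key=lambda e: e[0])
print([msg for _, msg in lowest])
```

With real validation errors, the same helpers would presumably be called as `first_n(validator.iter_errors(data), max_errors)` or `smallest_n(validator.iter_errors(data), max_errors, key=lambda e: e.path)`; the second keeps the "sorted by path" ordering for the errors it returns without holding all of them at once.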
