Investigate possible memory leak issue using jsonschema.validate() #2826

Closed
kevin-bates opened this issue Jul 11, 2022 · 1 comment · Fixed by #2829

@kevin-bates
Member

In last week's Jupyter Server meeting it was brought to our attention that the jsonschema package can exhibit a memory leak when the validate method is repeatedly called to validate instances against a given schema. The leak isn't necessarily in the validation of the instance itself, but rather in the acquisition of the validator used to perform the validation. The validate method's documentation even states:

"If you know you have a valid schema already, especially if you intend to validate multiple instances with the same schema, you likely would prefer using the Validator.validate method directly on a specific validator (e.g. Draft7Validator.validate)."

Since Elyra validates all instances on retrieval (and save) and is intended to remain running for long periods of time, we should weigh how this impacts Elyra against the cost of addressing the issue.
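
For context, the two call patterns the documentation contrasts look roughly like this (the schema and instance below are illustrative placeholders, not Elyra's):

import jsonschema
from jsonschema import Draft7Validator

schema = {"type": "object", "properties": {"name": {"type": "string"}}}
instance = {"name": "example"}

# Per-call form: validate() checks the schema and constructs a new
# validator object on every invocation.
jsonschema.validate(instance, schema)

# Reusable form: build the validator once, then validate any number of
# instances against the same schema.
validator = Draft7Validator(schema)
validator.validate(instance)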

@kevin-bates
Member Author

I wrote a simple unit test to exercise the validate() method on the MetadataManager:

def test_manager_validation_performance():
    import os
    import psutil

    metadata_mgr = MetadataManager(schemaspace=METADATA_TEST_SCHEMASPACE)
    metadata_dict = {**valid_metadata_json}
    metadata = Metadata.from_dict(METADATA_TEST_SCHEMASPACE_ID, metadata_dict)

    process = psutil.Process(os.getpid())
    # Warm up so one-time allocations don't skew the measurement
    metadata_mgr.validate("perf_test", metadata)

    memory_start = process.memory_info()
    for _ in range(10000):
        metadata_mgr.validate("perf_test", metadata)

    memory_end = process.memory_info()
    # RSS delta across the 10,000 validations, in kilobytes
    diff = (memory_end.rss - memory_start.rss) / 1024
    print(f"Memory used: {diff:,} kb, Memory start: {memory_start.rss/1024/1024:,.3f} mb, "
          f"Memory end: {memory_end.rss/1024/1024:,.3f} mb.")

MetadataManager.validate() essentially gets the schema referenced by the instance, then calls jsonschema.validate() (roughly sketched after the results below). Here are results from two runs...

================ 1 passed, 10001 warnings in 103.34s (0:01:43) =================
Memory used: 1,860.0 kb, Memory start: 109.980 mb, Memory end: 111.797 mb.

================ 1 passed, 10001 warnings in 101.90s (0:01:41) =================
Memory used: 1,672.0 kb, Memory start: 109.609 mb, Memory end: 111.242 mb.
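
Roughly paraphrased, the current path looks like this (the schema lookup helper and to_dict() call are illustrative, not Elyra's exact code):

import jsonschema

def validate(self, name, metadata):
    # Fetch the schema referenced by this instance (hypothetical helper),
    # then defer to module-level jsonschema.validate(), which re-checks
    # the schema and builds a new validator on every call.
    schema = self._get_schema(metadata.schema_name)  # hypothetical
    jsonschema.validate(metadata.to_dict(), schema)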

When MetadataManager.validate() is updated to use a validator directly (Draft7Validator(schema, format_checker=draft7_format_checker).validate(metadata_dict)), the results become:

============================== 1 passed in 16.58s ==============================
Memory used: 92.0 kb, Memory start: 107.562 mb, Memory end: 107.652 mb.

============================== 1 passed in 16.09s ==============================
Memory used: 52.0 kb, Memory start: 108.203 mb, Memory end: 108.254 mb.

indicating a 20-30X decrease in memory usage and a 6X increase in performance.

I think we can improve things further by having the SchemaManager create and cache a validator instance for each schema, avoiding the construction of a new validator on every validation request.
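
As a rough sketch of that caching idea (the class and method names here are hypothetical, not the actual SchemaManager API):

from jsonschema import Draft7Validator, draft7_format_checker

class ValidatorCache:
    def __init__(self):
        self._validators = {}  # schema name -> cached Draft7Validator

    def get_validator(self, schema_name, schema):
        # Construct the validator once per schema; subsequent requests
        # reuse the cached instance, avoiding per-call construction.
        validator = self._validators.get(schema_name)
        if validator is None:
            validator = Draft7Validator(schema, format_checker=draft7_format_checker)
            self._validators[schema_name] = validator
        return validator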

Because Elyra is a long-lived application, and because these changes (so far) appear to be straightforward, I believe we should make them.
