Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix several accidentally quadratic functions #792

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

jonjohnsonjr
Copy link

Fixes #790

I am by no means a python expert, please let me know if there are more idiomatic ways to solve this.

After this change:

$ time pyspdxtools -i sbom.spdx.json
pyspdxtools -i sbom.spdx.json  1.57s user 0.06s system 95% cpu 1.708 total

Note that this document previously took almost 9 minutes to validate, so this is ~300x faster for that particular document.

Signed-off-by: Jon Johnson <jon.johnson@chainguard.dev>
@jspeed-meyers
Copy link
Contributor

cc @meretp

This patch grew out of the use of ntia-conformance-checker. Thanks, @meretp, for helping with the transition to the new tools-python version for ntia-conformance-checker a few months back.

Copy link
Member

@maxhbr maxhbr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that hashing does not work well with mutable objects. It can cause inconsistencies and misses on lookup.

Also: not all properties are included in the hashing which causes false positive collisions.

@@ -81,3 +81,6 @@ def __init__(
relationships = [] if relationships is None else relationships
extracted_licensing_info = [] if extracted_licensing_info is None else extracted_licensing_info
check_types_and_set_values(self, locals())

def __hash__(self):
return id(self)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

documents are not immutable and therefore hashing might run into inconsistencies. Most use-cases would probably expect a content dependent hash.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah that's a shame.

I wouldn't expect for the document to get mutated during validation. Would it make sense to have some frozen/immutable variants of document and relationship that could be passed around the validators?

Or do you have any ideas that would be a better approach to take?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree that a hash should only be implemented on immutable objects when it uses mutable properties of the object - most "useful" hashes do. This hash is not based on a mutable property which would make it safe and consistent although not very useful (The default implementation of hash is just id(self) >> 4). Also note that eq should always be defined when hash is [see here]. Why define hash here?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've used functools.lru_cache in the past to speed up slow loops; it watches the parameters and replaces a function call with a table lookup when that makes sense. The nice thing is that it can catch mutations and know to call the function again without calling it every single time. Not sure if that would help in this case; I'll try to take a look this weekend.

@@ -73,3 +73,6 @@ def __init__(
comment: Optional[str] = None,
):
check_types_and_set_values(self, locals())

def __hash__(self):
return hash("{} -> {} ({})".format(self.spdx_element_id, str(self.related_spdx_element_id), str(self.relationship_type)))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think our Relationships are mutable and that does not work well with hashing. This can cause inconsistencies when used in hashsets.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This class needs a definition of eq as well.

paulgibert and others added 2 commits February 14, 2024 09:30
…s in parsing to remove hash methods.

Signed-off-by: paulgibert <paulgibert98@gmail.com>
Signed-off-by: Jon Johnson <jon.johnson@chainguard.dev>
@jonjohnsonjr
Copy link
Author

I've update the PR with changes from @paulgibert that seem better than what I came up with :)

@jspeed-meyers
Copy link
Contributor

@maxhbr or @meretp -- Any thoughts on this other approach?

Copy link
Member

@maxhbr maxhbr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the astuple approach looks promising.


def cached_function(document: Document):
key = id(document)
if key in cache.keys():
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This returns an outdated cache, if the document was modified?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or in other words: does the cache key id(document) change, if I mutate some stuff inside the document?

Maybe a less global caching would be more appropriate. So that its scope is controlled in a way so that no mutation can happen in between.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think so. From the docs:

id(object)

    Return the “identity” of an object. This is an integer which is guaranteed to be unique and constant for this object during its lifetime.
Two objects with non-overlapping lifetimes may have the same [id()](https://docs.python.org/3/library/functions.html#id) value.
``

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that this is exactly the problem I am afraid of

>>> a = {1}
>>> id(a)
139842939239360
>>> a.add(2)
>>> a
{1, 2}
>>> id(a)
139842939239360

The object has the same ID and it does not change even if its content is mutated. So it is maybe not a good cache key.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can adjust the cache key to be based on the fields you want but once the object is added to the cache it will still not reflect changes. If the goal is to build a fast search over mutable objects, a more complex approach/data structure may be necessary for representing collections of Documents and Relationships.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that the library should allow for the following usecase:

  • Validation of the document
  • mutation of that document, or some of the contained elements
  • validation of the new state, without false values due to caching

@maxhbr
Copy link
Member

maxhbr commented Feb 15, 2024

Hey, maybe it would be worth to connect in a video call?

@jonjohnsonjr
Copy link
Author

How would you feel about something like this instead? #800

I suspect just changing those function signatures is probably not what we want to do (I assume it's a breaking change), but what if we kept the same signatures and introduced a parallel set of functions that takes a set instead of a list (list variant can just call set variant).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Slow for SBOMs with a large number of files + relationships
5 participants