Use paths in codebase #42

Merged
merged 29 commits into from
Jul 29, 2022

pombredanne
Member

This PR merges the latest skeleton and implements new internals for the resource Codebase and Resource classes.
The key change is dropping numeric resource ids in favor of a simpler map of path->Resource.
The not-yet-implemented part is focusing a codebase on the subset selected by the new Codebase(paths) argument.
Because of this, the PR is still a work in progress, but early feedback is welcome and needed.
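
To make the path->Resource idea concrete, here is a minimal sketch. `MiniResource` and `MiniCodebase` are hypothetical names invented for this illustration and are not the commoncode classes; the only thing taken from the description above is that resources are stored in a map keyed by path, with no numeric ids.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class MiniResource:
    """Hypothetical stand-in for a Resource identified only by its path."""
    path: str
    is_file: bool = True

    @property
    def name(self) -> str:
        # The name is the last path segment.
        return self.path.rpartition("/")[2]


class MiniCodebase:
    """Hypothetical codebase that stores resources in a {path: resource} map."""

    def __init__(self) -> None:
        self.resources_by_path: Dict[str, MiniResource] = {}

    def add(self, path: str, is_file: bool = True) -> MiniResource:
        resource = MiniResource(path=path, is_file=is_file)
        self.resources_by_path[path] = resource
        return resource

    def get(self, path: str) -> Optional[MiniResource]:
        return self.resources_by_path.get(path)

    def parent(self, resource: MiniResource) -> Optional[MiniResource]:
        # The parent path is everything before the last slash, so no
        # parent id is needed.
        parent_path, sep, _ = resource.path.rpartition("/")
        return self.get(parent_path) if sep else None

    def children(self, resource: MiniResource) -> List[MiniResource]:
        # Direct children share the parent path as a prefix and sit
        # exactly one segment deeper.
        prefix = resource.path + "/"
        found = [
            res for path, res in self.resources_by_path.items()
            if path.startswith(prefix) and "/" not in path[len(prefix):]
        ]
        # Files first, then directories, each group sorted by name.
        return sorted(found, key=lambda r: (not r.is_file, r.name.lower(), r.name))


codebase = MiniCodebase()
codebase.add("codebase", is_file=False)
codebase.add("codebase/src", is_file=False)
codebase.add("codebase/src/a.py")
codebase.add("codebase/README")

root = codebase.get("codebase")
child_paths = [child.path for child in codebase.children(root)]
```

In this sketch the parent and children of a resource are derived from path strings alone, which is the property that makes separate resource and parent ids unnecessary.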

JonoYang and others added 11 commits April 29, 2022 14:31
    * The variable `environment` is not used when fetching sdists

Signed-off-by: Jono Yang <jyang@nexb.com>
Ensure that site-package dir exists.
Other minor adjustments from a scancode-toolkit release

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
These were buggy in some corner cases.
They have been updated such that:
* --latest-version works.
* we can reliably fetch combinations of wheels and sdists for multiple
  OS combos at once
* we now support macOS universal wheels (for ARM CPUs)

Caching is now simpler: we have essentially a single file-based cache
under .cache. PyPI indexes are fetched and not cached, unless the new
--use-cached-index option is used, which can be useful when fetching many
third-party packages in a short timeframe.

The first PyPI repository in a list has precedence and we never fetch
from other repositories if we find wheels and sdists there. This avoids
pounding too much on the self-hosted repo.
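
The precedence rule above could be sketched roughly as follows. The function `find_distributions`, the `Lookup` type, and the two fake lookups are hypothetical and exist only for this illustration; the sketch also assumes that any non-empty result counts as "found", which the commit message does not spell out.

```python
from typing import Callable, List, Optional, Sequence, Tuple

# A lookup takes (name, version) and returns the distribution filenames found.
Lookup = Callable[[str, str], List[str]]


def find_distributions(
    name: str,
    version: str,
    repos: Sequence[Tuple[str, Lookup]],
) -> Tuple[Optional[str], List[str]]:
    """Return (repo name, filenames) from the first repo that has any match."""
    for repo_name, lookup in repos:
        found = lookup(name, version)
        if found:
            # Stop at the first repository with results: later
            # repositories are never queried.
            return repo_name, found
    return None, []


queried = []


def pypi_lookup(name: str, version: str) -> List[str]:
    queried.append("pypi")
    return [f"{name}-{version}.tar.gz"]


def self_hosted_lookup(name: str, version: str) -> List[str]:
    queried.append("self-hosted")
    return [f"{name}-{version}-py3-none-any.whl"]


repo, files = find_distributions(
    "foo", "1.0", [("pypi", pypi_lookup), ("self-hosted", self_hosted_lookup)]
)
```

Here `queried` ends up containing only the first repository, which is the point of the rule: the second repository is contacted only when the first one has nothing.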

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
This is much faster

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
Member

@JonoYang JonoYang left a comment

I'm running into errors when using the VirtualCodebase. I am passing in two JSON scans into VirtualCodebase to test the functionality we have where we can create a VirtualCodebase from multiple scans. I am able to create the VirtualCodebase but using the walk() method causes an exception to be raised:

Traceback (most recent call last):
  File "/home/jono/nexb/src/commoncode/src/commoncode/resource.py", line 1228, in children
    return sorted(children, key=_sorter)
  File "/home/jono/nexb/src/commoncode/src/commoncode/resource.py", line 1220, in <lambda>
    _sorter = lambda r: (r.has_children(), r.name.lower(), r.name)
AttributeError: 'NoneType' object has no attribute 'has_children'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/jono/nexb/src/commoncode/src/commoncode/resource.py", line 867, in walk
    for res in root.walk(self, topdown=topdown, ignored=ignored):
  File "/home/jono/nexb/src/commoncode/src/commoncode/resource.py", line 1196, in walk
    ignored=ignored,
  File "/home/jono/nexb/src/commoncode/src/commoncode/resource.py", line 1187, in walk
    for child in self.children(codebase):
  File "/home/jono/nexb/src/commoncode/src/commoncode/resource.py", line 1230, in children
    raise Exception(f'Cannot sort children: {children!r}:\n{children_paths!r}') from e
Exception: Cannot sort children: [None, None]:
['codebase/package', 'codebase/django-audit-tools-0.4.0']

I took a look at the paths of the VirtualCodebase I created by using the resources_by_path attribute, and I saw that the paths of the root Resources of the two scans I used do not start with virtual_root, whereas those of all the other Resources in the VirtualCodebase do. I think the exception occurs because these root Resources don't have virtual_root in their paths, so their children cannot be found because of the difference in path prefix.
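
A small sketch of how a prefix mismatch like this could produce the `[None, None]` seen in the traceback, assuming children are looked up by path in a plain dict. The dict contents and the `lookup_children` helper are hypothetical and only mirror the paths reported above; this is not the actual commoncode code.

```python
from typing import Dict, List


def lookup_children(resources_by_path: Dict[str, object], child_paths: List[str]) -> List[object]:
    """Return the resources for child_paths, failing loudly on unknown paths."""
    missing = [path for path in child_paths if path not in resources_by_path]
    if missing:
        raise KeyError(f"Paths not found in codebase: {missing!r}")
    return [resources_by_path[path] for path in child_paths]


# Hypothetical state: the scan root lacks the virtual_root prefix while
# its children were stored with it.
resources_by_path = {
    "codebase": "scan root without the virtual_root prefix",
    "virtual_root/codebase/package": "directory",
    "virtual_root/codebase/django-audit-tools-0.4.0": "directory",
}

# Child paths computed from the unprefixed root do not match any key.
child_paths = ["codebase/package", "codebase/django-audit-tools-0.4.0"]

# dict.get() hides the mismatch by returning None for each missing key.
silent = [resources_by_path.get(path) for path in child_paths]
```

With `dict.get()` the missing keys only surface later, when something calls a method on `None`; a strict lookup such as `lookup_children` would report the mismatched paths directly.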

JonoYang and others added 8 commits May 13, 2022 18:35
Signed-off-by: Jono Yang <jyang@nexb.com>
Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
These are no longer needed.

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
- When you create a VirtualCodebase with multiple scans, we now prefix each
  scan path with a codebase-1/, codebase-2/, etc. directory in addition to the
  "virtual_root" shared root directory. Otherwise file data was overwritten
  and inconsistent when the "files" of each location shared leading path
  segments.

- When you create a VirtualCodebase with more than one Resource, we now recreate
  the directory tree for any intermediary directory used in a path that is
  otherwise missing from the files path list.
  In particular this behaviour changed when you create a VirtualCodebase from
  a previous Codebase created with a "full_root" argument. Previously, the
  missing paths of a "full_root" Codebase were kept unchanged.
  Note that the VirtualCodebase has always ignored the "full_root" argument.

- The Resource has no rid (resource id) and no pid (parent id). Instead
  we now use internally a simpler mapping of {path: Resource} object.

- The Codebase and VirtualCodebase are now iterable. Iterating on a codebase
  is the same as a top-down walk.

- The Resource.path now never contains leading or trailing slash. We also
  normalize the path everywhere. In particular this behaviour is visible when
  you create a Codebase with a "full_root" argument. Previously, the paths of a
  "full_root" Codebase were prefixed with a slash "/".
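
The path handling described in the list above can be sketched as follows. The helpers `normalize_path`, `with_missing_directories` and `merge_scans` are hypothetical names for this illustration, not commoncode functions, and the sketch only covers the multiple-scan case.

```python
import posixpath
from typing import Iterable, List, Sequence


def normalize_path(path: str) -> str:
    """Return a POSIX path with no leading or trailing slash."""
    path = path.replace("\\", "/").strip("/")
    if not path:
        return ""
    return posixpath.normpath(path)


def with_missing_directories(paths: Iterable[str]) -> List[str]:
    """Return the sorted paths plus every ancestor directory they imply."""
    all_paths = set()
    for path in map(normalize_path, paths):
        # Walk up from the path to its top segment, adding each level.
        while path:
            all_paths.add(path)
            path = path.rpartition("/")[0]
    return sorted(all_paths)


def merge_scans(scans: Sequence[Sequence[str]], root: str = "virtual_root") -> List[str]:
    """Give each scan its own codebase-N directory under a shared root."""
    merged = []
    for index, scan_paths in enumerate(scans, 1):
        prefix = f"{root}/codebase-{index}"
        merged.extend(f"{prefix}/{normalize_path(path)}" for path in scan_paths)
    return with_missing_directories(merged)


merged_paths = merge_scans([["/codebase/a.py"], ["codebase/a.py/"]])
```

Both scans contain a `codebase/a.py` file, yet the merged list keeps them apart under `virtual_root/codebase-1` and `virtual_root/codebase-2`, and the intermediary directories are recreated even though neither scan listed them.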


Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
@pombredanne
Member Author

I'm running into errors when using the VirtualCodebase. I am passing in two JSON scans into VirtualCodebase to test the functionality we have where we can create a VirtualCodebase from multiple scans. I am able to create the VirtualCodebase but using the walk() method causes an exception to be raised:

@JonoYang the latest push is fixing this. And several other issues. Thanks!

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
This was removed in a previous commit.

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
Member

@AyanSinhaMahapatra AyanSinhaMahapatra left a comment

Looks great! Just a few nits.

memory. Beyond this number, Resource are saved on disk instead. -1 means
no memory is used and 0 means unlimited memory is used.

`max_depth` is the maximum depth of subdirectories to descend below and
``max_depth`` is the maximum depth of subdirectories to descend below and
including `location`.
"""

We could add a line about paths in the docstring.

Member

@JonoYang JonoYang left a comment

I think it looks okay. I like that you set the iter dunder on Codebase to yield its resources.
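
For readers following along, making a codebase iterable as a top-down walk could look roughly like this. `TinyCodebase` is a hypothetical class written for this illustration; it stores plain path strings rather than Resource objects and is not the commoncode implementation.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Optional


class TinyCodebase:
    """Hypothetical codebase of path strings with a walk() and __iter__."""

    def __init__(self, root: str, paths: Iterable[str]) -> None:
        self.root = root
        # Map each parent path to the sorted list of its direct children.
        self.children: Dict[str, List[str]] = defaultdict(list)
        for path in sorted(paths):
            parent = path.rpartition("/")[0]
            self.children[parent].append(path)

    def walk(self, topdown: bool = True, path: Optional[str] = None) -> Iterator[str]:
        path = self.root if path is None else path
        if topdown:
            yield path
        for child in self.children.get(path, []):
            yield from self.walk(topdown=topdown, path=child)
        if not topdown:
            yield path

    def __iter__(self) -> Iterator[str]:
        # Iterating on the codebase is the same as a top-down walk.
        return self.walk(topdown=True)


tiny = TinyCodebase(
    "codebase",
    ["codebase/src", "codebase/src/a.py", "codebase/README"],
)
top_down = list(tiny)
bottom_up = list(tiny.walk(topdown=False))
```

With `__iter__` delegating to `walk()`, a plain `for resource in codebase:` loop visits each directory before its contents.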

I feel that the solution we have for creating a codebase from multiple scans is a bit kludgy, but reasonable. Treating each scan as an individual codebase is easier than attempting to always merge scan directory structure/data together. I had not realized the implication of getting scancode to merge two directories named codebase from two different scans.

On a whim, I've updated commoncode in scancode and all the references to rid in scancode/cli.py will need to be updated.

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
The Codebase and VirtualCodebase no longer have "full_root" and
"strip_root" constructor arguments and attributes. These can still be
passed but they will be ignored. These were needed only for path output
and this is now where these arguments and code live.


- Resource.path is now always the plain path where the first segment
  is the last segment of the root location, e.g. the root filename.

- The Resource now has new "full_root_path" and "strip_root_path"
  properties that return the corresponding paths.

- The Resource.to_dict and the new Codebase.to_list both have new
  "full_root" and "strip_root" arguments

- The Resource.get_path() method accepts "full_root" and "strip_root"
  arguments.

- The Resource.create_child() method has been removed.

- The "Codebase.original_location" attribute has been removed.
  No known users of commoncode used this.
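
As an illustration of moving "full_root" and "strip_root" to path output time, here is a minimal sketch. The function `render_path` and its `root_location` parameter are hypothetical; the sketch assumes that "full_root" takes precedence when both flags are set and that stripping the root from the root path itself yields an empty string, neither of which is stated in the commit message.

```python
import posixpath


def render_path(
    path: str,
    root_location: str,
    full_root: bool = False,
    strip_root: bool = False,
) -> str:
    """Return the output form of a stored plain path.

    `path` starts with the last segment of `root_location`.
    """
    if full_root:
        # Prepend the parent directory of the root location, then drop
        # any leading or trailing slash.
        parent = posixpath.dirname(root_location.rstrip("/"))
        return posixpath.join(parent, path).strip("/")
    if strip_root:
        # Drop the first segment, which is the root name.
        return path.partition("/")[2]
    return path


plain = render_path("codebase/src/a.py", "/home/user/codebase")
full = render_path("codebase/src/a.py", "/home/user/codebase", full_root=True)
stripped = render_path("codebase/src/a.py", "/home/user/codebase", strip_root=True)
```

The stored path stays the same in all three cases; only the rendered output differs, which is why the codebase itself no longer needs to remember these options.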


Also format code and organize imports.

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
Signed-off-by: Jono Yang <jyang@nexb.com>
3 participants