Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

411 toolkit upgrade 2 #435

Merged
merged 49 commits into from
Jun 14, 2022
Merged

411 toolkit upgrade 2 #435

merged 49 commits into from
Jun 14, 2022

Conversation

JonoYang
Copy link
Member

@JonoYang JonoYang commented May 6, 2022

This PR updates scancode-toolkit in scancode.io, expected test results, as well as updating code that uses deprecated scancode-toolkit functions.

@JonoYang JonoYang force-pushed the 411-toolkit-upgrade-2 branch 2 times, most recently from f0bc99c to 58d4909 Compare May 6, 2022 16:23
@tdruez
Copy link
Contributor

tdruez commented May 9, 2022

@JonoYang can we make the new dependency system as a separate feature from the toolkit upgrade?

@JonoYang
Copy link
Member Author

JonoYang commented May 9, 2022

@tdruez No problem, I'll revert the model changes.

@JonoYang JonoYang force-pushed the 411-toolkit-upgrade-2 branch from d62f75c to 774c5b7 Compare May 9, 2022 17:41
@JonoYang
Copy link
Member Author

JonoYang commented May 10, 2022

@tdruez

I've regenerated the test data using the new command you created and I run into some problems where I would like your opinion on what should be done.

  1. ProjectCodebase and the lack of a root codebase directory

I run regen_test_data.py and the tests for ProjectCodebase all fail now because the ProjectCodebase code has no good way of determining the root of a scan. Previous scancode.io Resource creation behavior was to create the codebase directory Resource, and then every Resource under codebase. The paths of all Resources created starts with codebase. The ProjectCodebase object considers the root directory
to be the codebase directory (https://github.com/nexB/scancode.io/blob/main/scanpipe/pipes/codebase.py#L71) Since I regenerated the test fixtures, the codebase directory Resource is no longer created and Resource paths no longer start with codebase

I have some previous attempts where I worked on different solutions. In the branch starting-paths, we had the idea of considering the directories under codebase directory as individual roots. (https://github.com/nexB/scancode.io/blob/fa6cd0c597bcb53d449c0c119208f0f314955ea2/scanpipe/pipes/codebase.py#L81) Should I expand on the work I did there or should I remove ProjectCodebase entirely until we come up with something satisfactory?

In a scancode JSON output, all Packages detected in a scan are present in a top-level attribute named packages. Likewise, all detected Dependencies are placed in the Dependencies attribute. Multiple copies of the same package can be present in the packages field, if a particular package was detected multiple times in the same codebase. Each copy of this package will have a different package_uid. A package_uid is the purl of that package with a qualifier named uuid that is specific to the scancode run. e.g. pkg:pypi/django-audit-tools@0.4.0?uuid=9c19275c-c3fe-43dd-b6ec-a4f2bf65810f

For each Resource that is for a Package, the for_packages for those Resources will be populated with the package_uid of the Package they are for.

When I first modified the pipes code to handle the new top-level packages field from a scan, I just put the package_uid in the extra_data field. When I wanted to assign a Resource to a Package, I would then query for that package_uid in the DiscoveredPackage table (https://github.com/nexB/scancode.io/blob/411-toolkit-upgrade-2/scanpipe/pipes/scancode.py#L392).

I started testing the pipeline using the package django-audit-tools-0.4.0.tar.gz and I was getting errors when I tried to look up a package_uid from a Resource's for_packages field in the DiscoveredPackage table. The issue was because the same package was detected multiple times throughout the scan, so there were multiple instances of that package (with different package_uids), but only 1 version of that package was saved to the database (since I am assuming that we can only have unique instances of packages in the DiscoveredPackage table).

I decided to update how I stored the package_uid in extra_data by creating the field package_uids whose value is a list of package_uids. (https://github.com/nexB/scancode.io/blob/774c5b727b52f7d93601e016a9c3a05941ec3d98/scanpipe/pipes/__init__.py#L95)

The extra_data probably isn't the best place to keep the package_uids, so I'd like to know how we should handle package_uids in DiscoveredPackages. What is the best way to store package_uids on DiscoveredPackages?

The scan output asgiref-3.3.0_scan.json created by regen_test_data.py is slightly different than the scan output generated by scancode-toolkit:

  • No top-level dependencies field
  • the for_packages field on Resources is populated with purls instead of package_uids

This causes tests like https://github.com/nexB/scancode.io/blob/411-toolkit-upgrade-2/scanpipe/tests/test_pipes.py#L521 to fail since we expect the json to be in the sctk json output format.

Properly handling package_uids in (2) will probably help with this.

@tdruez
Copy link
Contributor

tdruez commented May 10, 2022

ProjectCodebase and the lack of a root codebase directory

The test data was generated from a time where the codebase/ prefix was stored in the DB. The test data should have been updated at the time of the code change.

Should I expand on the work I did there or should I remove ProjectCodebase entirely until we come up with something satisfactory?

Why not completing the #378 (review) PR instead? It seems that my review points were never addressed but it looks like we were close to a merge. Will this PR still relevant to solve this?

The extra_data probably isn't the best place to keep the package_uids

Probably not the best place, but I think the goal of this first PR is to make upgrade the toolkit with the least amount of models changes to get it out there asap. We should then have tickets and more PRs to work on the optimizations, such as new models and fields for Packages and Dependencies

The scan output asgiref-3.3.0_scan.json created by regen_test_data.py is slightly different than the scan output generated by scancode-toolkit:

We probably have to update the JSONResultsGenerator to include the new format and data.

@JonoYang
Copy link
Member Author

@tdruez I'll revisit #378 and see if I can get it to work with what I have in this branch.

Probably not the best place, but I think the goal of this first PR is to make upgrade the toolkit with the least amount of models changes to get it out there asap. We should then have tickets and more PRs to work on the optimizations, such as new models and fields for Packages and Dependencies

ack. I will create tickets for some of the further work that needs to be done.

@JonoYang
Copy link
Member Author

@tdruez

I tagged the ProjectCodebase tests with expectedFailure and regenerated the asgiref-3.3.0 test files using regen_test_data.py

I've added scanpipe/tests/data/asgiref-3.3.0_scancode_scan.json for use in test_scanpipe_pipes_scancode_create_codebase_resources_inject_policy (https://github.com/nexB/scancode.io/blob/411-toolkit-upgrade-2/scanpipe/tests/test_pipes.py#L553) because scanpipe/tests/data/asgiref-3.3.0_scan.json (generated by regen_test_data.py) is not exactly in the same format as what scancode-toolkit outputs.

@JonoYang JonoYang force-pushed the 411-toolkit-upgrade-2 branch from 081726b to b25c2c8 Compare May 10, 2022 22:50
@tdruez
Copy link
Contributor

tdruez commented May 11, 2022

While running pipelines on this branch:

module 'packagedcode.debian' has no attribute 'get_installed_packages'

Traceback:
  File "/app/scanpipe/pipelines/__init__.py", line 115, in execute
    step(self)
  File "/app/scanpipe/pipelines/docker.py", line 93, in collect_and_create_system_packages
    docker.scan_image_for_system_packages(self.project, image)
  File "/app/scanpipe/pipes/docker.py", line 166, in scan_image_for_system_packages
    for i, (purl, package, layer) in enumerate(installed_packages):
  File "/usr/local/lib/python3.9/site-packages/container_inspector/image.py", line 446, in get_installed_packages
    for purl, package in layer.get_installed_packages(packages_getter):
  File "/app/scanpipe/pipes/debian.py", line 31, in package_getter
    packages = debian.get_installed_packages(
  1. our test suite did not catch this.
  2. The packager getter get_installed_packages system was changed on the toolkit side, this needs to be upgraded on the ScanCode.io too

See aboutcode-org/scancode-toolkit@fbdba2e#diff-29c4686d506744c5a6ae71725c1fa392956829a0c6009c53d26536ef54e01858L57

The following pipe modules are impacted:

  • alpine
  • debian
  • rpm

@tdruez
Copy link
Contributor

tdruez commented May 11, 2022

@JonoYang
Copy link
Member Author

While running pipelines on this branch:

module 'packagedcode.debian' has no attribute 'get_installed_packages'

Traceback:
  File "/app/scanpipe/pipelines/__init__.py", line 115, in execute
    step(self)
  File "/app/scanpipe/pipelines/docker.py", line 93, in collect_and_create_system_packages
    docker.scan_image_for_system_packages(self.project, image)
  File "/app/scanpipe/pipes/docker.py", line 166, in scan_image_for_system_packages
    for i, (purl, package, layer) in enumerate(installed_packages):
  File "/usr/local/lib/python3.9/site-packages/container_inspector/image.py", line 446, in get_installed_packages
    for purl, package in layer.get_installed_packages(packages_getter):
  File "/app/scanpipe/pipes/debian.py", line 31, in package_getter
    packages = debian.get_installed_packages(
  1. our test suite did not catch this.
  2. The packager getter get_installed_packages system was changed on the toolkit side, this needs to be upgraded on the ScanCode.io too

See nexB/scancode-toolkit@fbdba2e#diff-29c4686d506744c5a6ae71725c1fa392956829a0c6009c53d26536ef54e01858L57

The following pipe modules are impacted:

  • alpine
  • debian
  • rpm

The get_installed_packages function has been removed from the alpine, debian/ubuntu, and rpm package handlers from packagedcode. There are now multiple "datafile handlers" that handle the different OS files, like the debian status files. The datafile handler then assemble all of the detected packages, dependencies, etc it finds. For example, this is the assemble method for the datafile handler that deals with installed Debian packages. (https://github.com/nexB/scancode-toolkit/blob/develop/src/packagedcode/debian.py#L223)

There doesn't seem to be an easy way to get the previous get_installed_packages functionality back from the new way we process package data. I think the logic from the assemble method I linked earlier is what we want, but we'll have to modify it to fit our model in scancode.io.

@JonoYang
Copy link
Member Author

@tdruez

I've brought in the old code that handled processing installed Debian package files and updated enough of it to get the pipeline to work. I had some issues with license detection, but it was resolved when I ran scancode --reindex-licenses.

@tdruez
Copy link
Contributor

tdruez commented May 12, 2022

I've brought in the old code that handled processing installed Debian package files and updated enough of it to get the pipeline to work.

I don't think we want to bring old toolkit code over in ScanCode.io, this defeat the purpose of a toolkit upgrade, but we should rather make use of the new code.

We had discussion today with @pombredanne to discuss this and he will provide solution on the toolkit side.

In the mean time, we should:

key_files_packages.extend(packages_data)
for package in packages:
package_data = DiscoveredPackageSerializer(package).data
if package_data not in key_files_packages:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems a bit much to test on a whole data structure such as package_data.
Could we use a more specific id instead?

JonoYang added a commit that referenced this pull request May 13, 2022
    * Avoid checking if package_data dictionary is already in the key_files_packages list
    * Keep track of package_uids instead

Signed-off-by: Jono Yang <jyang@nexb.com>
JonoYang added a commit that referenced this pull request May 13, 2022
Signed-off-by: Jono Yang <jyang@nexb.com>
tdruez and others added 12 commits May 17, 2022 15:58
Signed-off-by: Thomas Druez <tdruez@nexb.com>
Signed-off-by: Thomas Druez <tdruez@nexb.com>
Signed-off-by: Thomas Druez <tdruez@nexb.com>
Signed-off-by: Jono Yang <jyang@nexb.com>
    * Remove create_discovered_packages2 and create_codebase_resources2

Signed-off-by: Jono Yang <jyang@nexb.com>
    * Normalize package_uids before comparing results in tests
    * Update expected test results

Signed-off-by: Jono Yang <jyang@nexb.com>
    * Mark ProjectCodebase tests with expectedFailure
    * We will revisit ProjectCodebase and update it to fit our current models

Signed-off-by: Jono Yang <jyang@nexb.com>
    * We are using a scancode scan results for tests since asgiref-3.3.0_scan.json is not exactly the same format as scancode's json output

Signed-off-by: Jono Yang <jyang@nexb.com>
    * Update regen_test_data.py to generate asgiref-3.3.0_walk_test_fixtures.json

Signed-off-by: Jono Yang <jyang@nexb.com>
    * No need to explicity get license_clarity_score in make_results_summary()
    * Update expected test results

Signed-off-by: Jono Yang <jyang@nexb.com>
    * Add .vscode to .gitignore

Signed-off-by: Jono Yang <jyang@nexb.com>
JonoYang added 4 commits June 6, 2022 18:31
    * Check for existence of installed_file attribute before using it

Signed-off-by: Jono Yang <jyang@nexb.com>
    * Ensure both installed_file and codebase_resource have the same checksum field before comparing them

Signed-off-by: Jono Yang <jyang@nexb.com>
    * Update mappings_keys_by_fieldname
    * Look for package data in package_data field instead of packages in save_scan_package_results

Signed-off-by: Jono Yang <jyang@nexb.com>
    * Move get_installed_packages to rootfs.py
    * Use get_package_data instead of get_package_info
    * Rename all instances of packages to package_data when scanning for application packages
    * Update test docker images and test results
    * Add test for basic rootfs

Signed-off-by: Jono Yang <jyang@nexb.com>
Copy link
Contributor

@tdruez tdruez left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@JonoYang The code looks pretty good.
Do you have a list of what's left to do to complete this PR?

scanpipe/pipes/scancode.py Outdated Show resolved Hide resolved
scanpipe/pipes/__init__.py Outdated Show resolved Hide resolved
JonoYang added 3 commits June 9, 2022 10:42
Signed-off-by: Jono Yang <jyang@nexb.com>
    * Update expected test results

Signed-off-by: Jono Yang <jyang@nexb.com>
    * Update expected test results
    * Remove old ubuntu.tar

Signed-off-by: Jono Yang <jyang@nexb.com>
@JonoYang JonoYang force-pushed the 411-toolkit-upgrade-2 branch from faf6fba to 552bdb8 Compare June 9, 2022 21:21
JonoYang added 4 commits June 9, 2022 14:25
Signed-off-by: Jono Yang <jyang@nexb.com>
Signed-off-by: Jono Yang <jyang@nexb.com>
Signed-off-by: Jono Yang <jyang@nexb.com>
    * Add package_uid to test package data
    * Update expected test result

Signed-off-by: Jono Yang <jyang@nexb.com>
@JonoYang JonoYang force-pushed the 411-toolkit-upgrade-2 branch from feca62b to b3f4656 Compare June 9, 2022 22:34
@JonoYang
Copy link
Member Author

JonoYang commented Jun 9, 2022

@tdruez

The last few things I finished up were adding the package_uid field on the DiscoveredPackage model and replacing the ubuntu and redhat test Docker images with smaller images. Other than that, I need to merge aboutcode-org/scancode-toolkit#2974 and aboutcode-org/scancode-toolkit#2988 into the develop branch and then create a new release of scancode so we can test this branch out on staging.

JonoYang added 2 commits June 13, 2022 15:52
Signed-off-by: Jono Yang <jyang@nexb.com>
Signed-off-by: Jono Yang <jyang@nexb.com>
@JonoYang JonoYang force-pushed the 411-toolkit-upgrade-2 branch from 942e06d to 9fcec67 Compare June 13, 2022 23:40
JonoYang added 2 commits June 13, 2022 16:55
    * In the LoadInventory pipeline, create the DiscoveredPackages from a scan before creating the CodebaseResources

Signed-off-by: Jono Yang <jyang@nexb.com>
Signed-off-by: Jono Yang <jyang@nexb.com>
@JonoYang JonoYang force-pushed the 411-toolkit-upgrade-2 branch from bb44b4e to 784dbbc Compare June 14, 2022 00:13
tdruez added 4 commits June 14, 2022 13:07
Signed-off-by: Thomas Druez <tdruez@nexb.com>
Signed-off-by: Thomas Druez <tdruez@nexb.com>
Signed-off-by: Thomas Druez <tdruez@nexb.com>
Signed-off-by: Thomas Druez <tdruez@nexb.com>
@tdruez tdruez merged commit ba4695a into main Jun 14, 2022
@tdruez tdruez deleted the 411-toolkit-upgrade-2 branch June 14, 2022 14:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants