Capture and report package dependencies separate from actual packages #1237

Closed
pombredanne opened this issue Oct 30, 2018 · 18 comments

Comments

@pombredanne
Member

pombredanne commented Oct 30, 2018

Today, we collect dependencies from the package manifests, but these are mostly potential, first-level dependencies. In contrast we have several cases such as lockfiles where we have concrete dependencies but we do not have much in terms of package metadata.

So in order to support Godep, Gemfile lock, pip requirements.txt, etc. (and we do have parsers for several of these) we should have a new file-level attribute (and scanner) that deals exclusively with dependencies and nothing else. We could even go as far as decoupling this from the base --package scan and return dependencies only when requested.

They could still be returned as package.dependencies, as they are today, when found in a package manifest or when they can be clearly related to a manifest... or just reported under dependencies when they come from some lockfile, or always.
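
To make the "potential" versus "concrete" distinction tangible, here is a minimal sketch of what a dependency-only scan of a pip requirements file could return. The `Dependency` record, its `is_resolved` field and `parse_requirements` are hypothetical names for illustration, not existing ScanCode code, and the parsing is deliberately simplified (extras, URLs and environment markers are not handled).

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Dependency:
    """Hypothetical bare dependency record, with no package metadata."""
    name: str
    requirement: Optional[str]  # version constraint as written, if any
    is_resolved: bool           # True only for an exact "==" pin


# A project name followed by whatever specifier text comes after it.
_REQ = re.compile(r"^([A-Za-z0-9][A-Za-z0-9._-]*)\s*(.*)$")


def parse_requirements(text: str) -> List[Dependency]:
    deps = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop trailing comments
        if not line or line.startswith("-"):  # skip blanks and options such as -r or -e
            continue
        match = _REQ.match(line)
        if not match:
            continue
        name, spec = match.group(1), match.group(2).strip()
        # Assumption: only a single exact pin counts as a resolved dependency.
        pinned = spec.startswith("==") and "," not in spec and "*" not in spec
        deps.append(Dependency(name, spec or None, pinned))
    return deps
```

A pinned line such as `requests==2.20.0` would come back as resolved, while `flask>=1.0` or a bare `six` would be reported as potential dependencies.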

This needs some design and thinking of course.

Some of the dependency formats we miss or track:

@KinXer your input would be welcome since you reported #631
@DennisClark @tdruez @JonoYang @MaJuRG @sschuberth feedback welcomed too.

@sschuberth
Collaborator

Obviously, I'd be curious whether we could leverage ORT's Analyzer component for this in some way instead of re-implementing much of the same logic in Python within ScanCode... we could modify the Analyzer to report back dependencies in whatever format ScanCode requires. Currently, we write out dependency information in YAML (or JSON) that looks like https://github.com/heremaps/oss-review-toolkit/blob/master/scanner/src/funTest/assets/analyzer-result.yml.

@pombredanne
Member Author

Obviously, I'd be curious whether we could leverage ORT's Analyzer component for this in some way

That would be awesome indeed! Especially since the approaches complement each other nicely: ORT dynamically collects deps by running the package managers' proper commands, whereas ScanCode only does a static analysis of the manifests and runs nothing, so the two combined would cover all the use cases I can think of!

Just to be sure I get this right, the dependency section is this part here: https://github.com/heremaps/oss-review-toolkit/blob/0c42b9351edfcbdc699287f24c25f36f728e19ec/scanner/src/funTest/assets/analyzer-result.yml#L37 correct? and is fed by each analyzer such as here https://github.com/heremaps/oss-review-toolkit/blob/915cfb931297dfea1128c7f457de81dc92b2ae51/analyzer/src/main/kotlin/managers/Bower.kt#L63

And the main code is there https://github.com/heremaps/oss-review-toolkit/blob/a297595fce3763b0e30eec0ebbbe9abb420905d9/model/src/main/kotlin/Scope.kt#L28

And your PackageId is close enough to a package URL that the conversion will be 100% easy.
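
As a rough illustration of that conversion, here is a minimal sketch. It assumes the ORT identifier is a colon-separated `type:namespace:name:version` string, which is my reading of the analyzer output and not something confirmed in this thread. `ort_id_to_purl` is a hypothetical helper, and lowercasing the type is a naive mapping that will not hold for every package manager.

```python
from urllib.parse import quote


def ort_id_to_purl(identifier: str) -> str:
    """Convert an assumed 'type:namespace:name:version' id to a package URL."""
    parts = identifier.split(":")
    if len(parts) != 4:
        # Versions or names containing ":" are not handled by this sketch.
        raise ValueError(f"expected 4 colon-separated fields: {identifier!r}")
    pkg_type, namespace, name, version = parts

    purl = "pkg:" + pkg_type.lower()  # naive type mapping (assumption)
    if namespace:
        # Percent-encode each namespace segment separately.
        purl += "/" + "/".join(quote(seg, safe="") for seg in namespace.split("/"))
    purl += "/" + quote(name, safe="")
    if version:
        purl += "@" + quote(version, safe="")
    return purl
```

For example, `Maven:org.apache.commons:commons-lang3:3.9` would become `pkg:maven/org.apache.commons/commons-lang3@3.9`, and an id with an empty namespace such as `Bower::jquery:3.3.1` would become `pkg:bower/jquery@3.3.1`.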

@sschuberth
Collaborator

sschuberth commented Oct 30, 2018

the dependency section is this part here

Correct. We group dependencies by scope.

and is fed by each analyzer such as here

Right, except that line 63 you're quoting does not really contain any "feeding" code. Sticking to the example of Bower, probably a better line to quote is https://github.com/heremaps/oss-review-toolkit/blob/915cfb931297dfea1128c7f457de81dc92b2ae51/analyzer/src/main/kotlin/managers/Bower.kt#L105, which creates the actual Package entry, i.e. an entry such as the one starting at https://github.com/heremaps/oss-review-toolkit/blob/0c42b9351edfcbdc699287f24c25f36f728e19ec/scanner/src/funTest/assets/analyzer-result.yml#L78. The dependencies in the tree structure above are references to these packages.
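
To illustrate the idea of scoped dependency trees whose nodes only reference packages by id, here is a minimal sketch. The dict layout (`name`, `id`, nested `dependencies`) is an assumption modeled loosely on the YAML linked above, not a faithful copy of the ORT format, and `iter_scoped_deps` is a hypothetical helper.

```python
from typing import Iterator, List, Tuple


def iter_scoped_deps(scopes: List[dict]) -> Iterator[Tuple[str, str, int]]:
    """Yield (scope name, package id, depth) for each node, depth-first."""
    for scope in scopes:
        # Reverse so that popping from the stack keeps the declared order.
        stack = [(dep, 0) for dep in reversed(scope.get("dependencies", []))]
        while stack:
            node, depth = stack.pop()
            yield scope["name"], node["id"], depth
            children = node.get("dependencies", [])
            stack.extend((child, depth + 1) for child in reversed(children))


# Made-up data: the tree holds ids only, details live in a separate table.
scopes = [
    {"name": "dependencies", "dependencies": [
        {"id": "A", "dependencies": [{"id": "B"}]},
        {"id": "C"},
    ]},
]
packages = {"A": {"license": "MIT"}, "B": {"license": "BSD"}, "C": {"license": "ISC"}}

flat = [(scope, pkg_id, depth, packages[pkg_id]["license"])
        for scope, pkg_id, depth in iter_scoped_deps(scopes)]
```

Keeping the tree as references means a package that appears in several scopes is described only once.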

And your PackageId is close enough to a package URL that the conversion will be 100% easy.

Yes. Also see oss-review-toolkit/ort#20.

@pombredanne pombredanne added this to the v3.0 milestone Oct 31, 2018
@steven-esser
Contributor

@pombredanne I am in favour of this new scanner for certain package types (python especially).

@mjherzog
Member

+1

@pombredanne
Member Author

pombredanne commented Jun 7, 2019

I think that rather than just listing bare dependencies, there is something larger and more generic, which is the notion of a project or environment. In contrast with a package which is well defined with a name and version and lives in some package repository, a project can have some attributes of a package, but can have more and miss some.

From a higher level point of view, there are about four sets of data we can collect on a project or package:

  1. metadata such as a name and version, description, keywords, license, etc. npm package.json and Maven POMs are providing this,
  2. dependencies either potential or resolved (e.g. a lockfile) and possibly "scoped". A bare example is a Python requirements.txt file. npm package.json and Maven POMs are also providing this but not only this.
  3. build instructions that can be succinct in many cases and leverage conventions or can be full-fledged scripts. Makefile, CMake lists, ant, Grunt and many others fall in this category. npm package.json and Maven POMs are providing this, and so is a setup.py or .gemspec. A Visual Studio sln and .csproj and other IDE manifests would also be providing this.
  4. version control information such as a .git or .svn dir, ignore files, .gitmodules, files for the Android repo tool and similar.

Each of these four data pieces may exist or not. Their presence should dictate how we organize the normalized data returned from a scan.

metadata are essential to the definition of what we call a package. So IMHO when we have metadata and can determine a type, a name and possibly a version, we have a proper package that would be stored in a packages list.

Yet if there is no name (say a nameless private Composer package), this would no longer be a package (it cannot be published or consumed as such anywhere); it is only project/application-like.

dependencies are either for a package or a project. Their presence alone (without metadata) is the mark of a project. For instance we can infer from the presence of a requirements.txt file that we are in a Python project and we know its dependencies.

Like deps, build instructions alone are the mark of a project.

version control information is either for a package or a project.
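
The rules above can be sketched roughly as a small classifier. This is only an illustration of the reasoning, with hypothetical names (`ScannedData`, `classify`) and a simplification: version control data alone decides nothing, since it can belong to either a package or a project.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ScannedData:
    """Hypothetical bundle of the four sets of data collected for one location."""
    type: Optional[str] = None      # metadata: package type, e.g. "npm"
    name: Optional[str] = None      # metadata: package name
    version: Optional[str] = None   # metadata: optional version
    dependencies: List[str] = field(default_factory=list)
    build_instructions: List[str] = field(default_factory=list)
    vcs_url: Optional[str] = None   # version control information


def classify(data: ScannedData) -> str:
    # Metadata with a type and a name makes a proper package.
    if data.type and data.name:
        return "package"
    # A nameless manifest, bare dependencies or bare build instructions
    # are only the mark of a project.
    if data.type or data.dependencies or data.build_instructions:
        return "project"
    # Version control data alone is not enough to decide.
    return "unknown"
```

With this, an npm package.json with a name would be a package, while a lone requirements.txt or a nameless Composer manifest would be reported as a project.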

Therefore, I want to add a new data structure and scanner called either project or development_environment that would capture:

  • bare dependency declarations outside of a package manifest such as Gemfile and Gemfile.lock, Go deps, Python requirements, etc.
  • various attributes to describe the technology environment such as the build tools, version control system and revisions, submodules, IDE, etc.

We could also separate entirely the dependencies, project data and version control data but I feel like a single top attribute is likely enough. I am open to either way.

@mjherzog
Member

mjherzog commented Jun 7, 2019

project and development_environment may not be equivalent in the case of an upstream project and a different downstream development_environment which can look quite different.

@sschuberth
Collaborator

In contrast with a package which is well defined with a name and version and lives in some package repository, a project can have some attributes of a package, but can have more and miss some.

We think the same, which is why we have separate Project and Package classes in ORT, and they don't even inherit from each other although their properties are similar in large parts.

From a higher level point of view, there are about four sets of data we can collect on a project or package:

We have all of that except the build instructions in ORT.

Therefore, I want to add a new data structure and scanner called either project or development_environment

I'd prefer "project" for similarity to ORT 😉

@pombredanne
Member Author

@sschuberth thanks! we think along the same lines... in hindsight I wonder if project may not be a tad overloaded as a term?

@sschuberth
Collaborator

From a user / developer perspective, I believe that's simply what it is: a project. And we also simply couldn't think of a less overloaded but equally fitting term 😉

@mjherzog
Member

mjherzog commented Jul 3, 2019

As much as it would be good to align with ORT terminology wherever possible, I think that project is not a good term in this context because what we are trying to name is typically a subset of a project where the most common uses of the term "project" in our domain seem to be:

  1. An open source project
  2. A unit of work within a Development Environment (as used in Eclipse or similar IDE)

What we are trying to name is a (sub)set of files that are logically related by origin, license and function within a project (as defined above).

A package represents the case where this set of files is grouped together by the original project - whether in a package created by a package manager or something as simple as an archive.
The alternative case we are trying to address is a set of files in a codebase - i.e. the files contained within a Development "project". This would usually be a set of source files and the associated configuration and build files, but there is no clear rule about how they are organized within a codebase once they have been extracted from their original "package". The classic use case here is C/C++ code.
I think that the term "component" is the best option here (although I will agree that the term "component" is also overloaded).

These points are separate from how we might add the definition of a Development Environment which is a much broader idea that would typically cover many Development projects.

@sschuberth
Collaborator

What we are trying to name is a (sub)set of files that are logically related by origin, license and function within a project (as defined above).

That was not my understanding from reading the above text. What you describe here sounds a bit like what e.g. Gradle would call a "source set". But @pombredanne mentioned an additional attribute to capture: dependencies. As soon as dependencies come into the picture I find it less fitting to talk about sources / files, as usually dependencies are not managed / declared by the sources / files themselves, but by a higher-level concept / wrapper, i.e. the build system / package manager.

I think that the term "component" is the best option here (although I will agree that the term "component" is also overloaded).

I wouldn't object to "component", simply because, like you said, it pretty much fits all and anything 😉

@pombredanne
Member Author

pombredanne commented Aug 20, 2019

So here is where I will be going for now following the details of this conversation and #1237 (comment) :

  1. add (or rather add back since we had it before) package_manifest as a single file-level attribute that contains package manifest metadata. (Today this is returned as packages only and moved around to the package root.) See Add back file-level package manifest #1728

  2. keep packages as is as a file or directory level list of package manifest data aggregated from package_manifest to their root. Improve the manifest_path to be a full path and not a package root-relative path

  3. add dependencies as a list of deps for dependency-only data files and lockfiles such as Python requirements.txt. Deal with either potential or resolved, and possibly "scoped", deps as is done today in the packages (actually this will be the exact same data structure)

  4. add build_script as a file-level attribute that can have many of the same attributes as a package (including deps) but typically has no name and version. Makefile, CMake lists, ant, gradle, Grunt and many others fall in this category. For some files that may serve multiple purposes, prefer using a package_manifest instead: e.g. npm package.json, Maven POMs, setup.py or .gemspec are treated as manifests (with build instructions) but not build_script. A Visual Studio sln and .csproj and other IDE manifests would be build_script

  5. add version_control as a directory-level attribute to capture information such as a .git or .svn dir, .gitmodules, using an object with the same definition as ORT's vcs or SPDX vcs_url

The topic of projects is basically set aside by focusing instead on the file levels data first: project-like or component-like concepts are something that is derived from these files anyway and requires them first
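
As a rough sketch of what the five attributes above could look like in scan results, here is a made-up example for a tiny Python codebase. Every field value, and every nested key such as `purl`, `is_resolved` or `vcs_tool`, is an assumption for illustration; only the attribute names `package_manifest`, `packages`, `dependencies`, `build_script` and `version_control` come from the list above.

```python
from typing import Dict, List

# Hypothetical scan output: one record per file or directory.
SCAN: List[Dict] = [
    {"path": "app", "type": "directory",
     # aggregated from package_manifest to the package root
     "packages": [{"type": "pypi", "name": "app", "version": "1.0",
                   "manifest_path": "app/setup.py"}],
     # directory-level version control information
     "version_control": {"vcs_tool": "git", "vcs_url": None, "revision": None}},
    {"path": "app/setup.py", "type": "file",
     "package_manifest": {"type": "pypi", "name": "app", "version": "1.0"}},
    {"path": "app/requirements.txt", "type": "file",
     # dependency-only file: no name, no version, just deps
     "dependencies": [{"purl": "pkg:pypi/requests@2.20.0", "is_resolved": True}]},
    {"path": "app/Makefile", "type": "file",
     "build_script": {"type": "make", "dependencies": []}},
]


def dependency_only_files(scan: List[Dict]) -> List[str]:
    """Return paths that carry dependencies but no package manifest."""
    return [rec["path"] for rec in scan
            if rec.get("dependencies") and not rec.get("package_manifest")]
```

Here `dependency_only_files(SCAN)` would pick out `app/requirements.txt`, which is exactly the kind of file that has no home in a package-only model.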

@pombredanne
Member Author

So here is the latest on this. I am pushing this for comment in a branch:

  1. add manifests as a single file-level attribute that contains package manifest metadata. This is a list, like packages was, as a manifest may declare more than one package (e.g. an RPM spec file, some build script, etc). See also Add back file-level package manifest #1728 . This is essentially the same as, and replaces, the packages attribute, which is removed.

  2. add a root_path attribute to the manifest data structure that points to the root of a package. And remove the manifest_path attribute which was confusing. If we need to track which manifest a package data comes from that should not be inside the tracked data but outside.

  3. move dependencies from the manifests (formerly packages) data as a list of deps as a file-level attribute. This contains only dependency data. This is used both for package manifests, build scripts and lockfiles such as package-lock.json, Gemfile.lock, or Python requirements.txt.

  4. add is_build_script, is_package_manifest and is_ide_manifest as file-level boolean attributes:

  • Makefile, CMake lists, ant, gradle, Grunt and many others fall in the is_build_script category.
  • package.json, Maven POMs, setup.py or .gemspec are is_package_manifest
  • A Visual Studio sln and .csproj and other IDE manifests would be is_ide_manifest
  5. add version_control as a file or directory-level attribute to capture information such as a .git or .svn dir, .gitmodules, using an object with the same definition as AboutCode TK vcs data, ORT's vcs or SPDX vcs_url

pombredanne added a commit that referenced this issue Oct 3, 2019
Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
pombredanne added a commit that referenced this issue Oct 3, 2019
Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
@pombredanne
Member Author

And here is yet another updated proposal:

  1. keep packages as today as a file-level attribute that contains package manifest or dependency manifest or build script package-like metadata in a list. (e.g. an RPM spec file, some build script, etc). See also Add back file-level package manifest #1728 . The only change is that the packages attribute is NOT moved to its root location anymore, but stays with the manifest.

  2. add a root_path attribute to the manifest data structure that points to the root of a package. And remove the manifest_path attribute which was confusing. If we need to track which manifest a package data comes from that should not be inside the tracked data but outside.

  3. move dependencies from the manifests (formerly packages) data as a list of deps as a file-level attribute. This contains only dependency data. This is used both for package manifests, build scripts and lockfiles such as package-lock.json, Gemfile.lock, or Python requirements.txt.

  4. add is_build_script, is_package_manifest, is_ide_manifest and is_dependency_manifest as file-level boolean attributes:

  • Makefile, CMake lists, ant, gradle, Grunt and many others fall in the is_build_script category. In some cases they could be is_dependency_manifest and/or is_package_manifest too
  • package.json, Maven POMs, setup.py or .gemspec are is_package_manifest and typically would also be is_dependency_manifest
  • A Visual Studio sln and .csproj and other IDE manifests would be is_ide_manifest
  • A yarn.lock, package-lock.json, requirements.txt are is_dependency_manifest
  5. add is_private to the Package model when a package is either marked as such OR is not found in a public package registry.

  6. add version_control as a file or directory-level attribute to capture information such as a .git or .svn dir, .gitmodules, using an object with the same definition as AboutCode TK vcs data, ORT's vcs or SPDX vcs_url

@pombredanne
Member Author

ok one last round... back toward keeping things simple enough and making fewer changes:

  1. keep packages as today as a file-level attribute that contains package manifest or dependency manifest or build script package-like metadata in a list. (e.g. an RPM spec file, some build script, etc). See also Add back file-level package manifest #1728 . The only change is that the packages attribute is NOT moved to its root location anymore, but stays with the manifest.

  2. add a root_path attribute to the manifest data structure that points to the root of a package. And remove the manifest_path attribute which was confusing. If we need to track which manifest a package data comes from that should not be inside the tracked data but outside.

  3. add is_build_script, is_package_manifest, is_ide_manifest and is_dependency_manifest as boolean attributes of the Package data model:

  • Makefile, CMake lists, ant, gradle, Grunt and many others fall in the is_build_script category. In some cases they could be is_dependency_manifest and/or is_package_manifest too
  • package.json, Maven POMs, setup.py or .gemspec are is_package_manifest and typically would also be is_dependency_manifest
  • A Visual Studio sln and .csproj and other IDE manifests would be is_ide_manifest
  • A yarn.lock, package-lock.json, requirements.txt are is_dependency_manifest
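
A minimal sketch of how the four flags could be derived from a file name follows. The file-name patterns (`CMakeLists.txt`, `build.xml`, `build.gradle`, `Gruntfile.js`, `pom.xml`) are my assumptions for the tools named above, `ManifestFlags` and `flags_for` are hypothetical names, and treating every package manifest as a dependency manifest is a simplification of "typically".

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase
from pathlib import PurePosixPath


@dataclass
class ManifestFlags:
    is_build_script: bool = False
    is_package_manifest: bool = False
    is_ide_manifest: bool = False
    is_dependency_manifest: bool = False


# Assumed conventional file names for each category.
BUILD = ("Makefile", "CMakeLists.txt", "build.xml", "build.gradle", "Gruntfile.js")
PACKAGE = ("package.json", "pom.xml", "setup.py", "*.gemspec")
IDE = ("*.sln", "*.csproj")
DEPS = ("yarn.lock", "package-lock.json", "requirements.txt")


def flags_for(path: str) -> ManifestFlags:
    name = PurePosixPath(path).name

    def hit(patterns) -> bool:
        return any(fnmatchcase(name, pattern) for pattern in patterns)

    is_package = hit(PACKAGE)
    return ManifestFlags(
        is_build_script=hit(BUILD),
        is_package_manifest=is_package,
        is_ide_manifest=hit(IDE),
        # Simplification: package manifests are assumed to declare deps too.
        is_dependency_manifest=is_package or hit(DEPS),
    )
```

Because the flags are independent booleans rather than a single type, a file that serves several purposes can set more than one of them.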

These are for later:

  1. add is_private to the Package model when a package is either marked as such OR is not found in a public package registry.

  2. add version_control as a file or directory-level attribute to capture information such as a .git or .svn dir, .gitmodules, using an object with the same definition as AboutCode TK vcs data, ORT's vcs or SPDX vcs_url

@steven-esser
Contributor

@pombredanne I think in the context of consolidation and other summation techniques this makes a lot of sense.

@aboutcode-org aboutcode-org deleted a comment from sesser Nov 12, 2019
@aboutcode-org aboutcode-org deleted a comment from sesser Nov 12, 2019
viragumathe5 pushed a commit to viragumathe5/scancode-toolkit that referenced this issue Mar 13, 2020
Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
@pombredanne pombredanne removed this from the v3.3 milestone Sep 24, 2021
@pombredanne pombredanne changed the title Add new dependencies attribute and scanner, separate from actual packages Report package dependencies separate from actual packages Feb 2, 2022
@pombredanne pombredanne changed the title Report package dependencies separate from actual packages Capture and report package dependencies separate from actual packages Feb 2, 2022
@pombredanne
Member Author

This has been merged. Closing now!
