Capture and report package dependencies separate from actual packages #1237

pombredanne · 2018-10-30T20:13:43Z

Today, we collect dependencies from the package manifests, but these are mostly potential, first-level dependencies. In contrast we have several cases such as lockfiles where we have concrete dependencies but we do not have much in terms of package metadata.

So in order to support Godep, Gemfile lock, pip requirements.txt, etc. (and we do have parsers for several of these) we should have a new file-level attribute (and scanner) that deals exclusively with dependencies and nothing else. We could even go as far as decoupling this from the base --package scan and return dependencies only when requested.

They could still be returned as package.dependencies, as they are today, when found in a package manifest or when they can be clearly related to a manifest... or just reported under dependencies when they come from some lockfile, or always.
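To make the idea of dependency-only data concrete, here is a minimal sketch that reads a pip requirements.txt and returns bare dependency records with no package metadata. The `Dependency` record, its field names and `parse_requirements` are hypothetical and are not ScanCode's actual API; only simple `name<specifier>` lines are handled, and URL or option lines are skipped.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


# Hypothetical record for illustration; not ScanCode's actual data model.
@dataclass
class Dependency:
    name: str
    requirement: Optional[str]  # e.g. "==2.20.0" or ">=1.10"; None if unconstrained
    is_resolved: bool           # True only when pinned to one exact version


# A project name followed by an optional version specifier.
_REQ = re.compile(r"^([A-Za-z0-9][A-Za-z0-9._-]*)\s*(.*)$")


def parse_requirements(text: str) -> List[Dependency]:
    deps = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        # Skip blanks, pip options such as "-r other.txt", and URL requirements.
        if not line or line.startswith("-") or "://" in line:
            continue
        match = _REQ.match(line)
        if not match:
            continue
        name, spec = match.group(1), match.group(2).strip()
        pinned = spec.startswith("==") and "," not in spec and "*" not in spec
        deps.append(Dependency(name, spec or None, pinned))
    return deps
```

A pinned line such as `requests==2.20.0` yields a resolved dependency, while `six>=1.10` stays a potential one, which is the distinction between lockfile-like and manifest-like data discussed here.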

This needs some design and thinking of course.

Some of the dependency formats we miss or track:

Pypi Add dependencies for Pypi packages #653
RPMs Add dependencies to RPMs #649
NuGet Add dependencies to NuGet #648 (and these are rather complex)
.gitmodules Add support for parsing .gitmodules as packages/deps #681
C/C++ includes Add plugin to collect 'cpp_includes' data #1165
Gemfile/lock : https://github.com/nexB/scancode-toolkit/blob/develop/src/packagedcode/gemfile_lock.py
Godeps: https://github.com/nexB/scancode-toolkit/blob/develop/src/packagedcode/godeps.py
and of course all the many other ones we have: Maven, npm, composer; or do not have yet: Gradle, debian, FreeBSD, autotools, OSGi, SBT, etc.

@KinXer your input would be welcome since you reported #631
@DennisClark @tdruez @JonoYang @MaJuRG @sschuberth feedback welcomed too.


sschuberth · 2018-10-30T20:19:45Z

Obviously, I'd be curious whether we could leverage ORT's Analyzer component for this in some way instead of re-implementing much of the same logic in Python within ScanCode... we could modify the Analyzer to report back dependencies in whatever format ScanCode requires. Currently, we write out dependency information in YAML (or JSON) that looks like https://github.com/heremaps/oss-review-toolkit/blob/master/scanner/src/funTest/assets/analyzer-result.yml.

pombredanne · 2018-10-30T20:31:37Z

Obviously, I'd be curious whether we could leverage ORT's Analyzer component for this in some way

That would be awesome indeed! Especially since the approaches complement each other nicely: ORT dynamically collects deps by running the package manager's own commands, whereas ScanCode does only a static analysis of the manifests and runs nothing, so the two combined would cover all the use cases I can think of!

Just to be sure I get this right, the dependency section is this part here: https://github.com/heremaps/oss-review-toolkit/blob/0c42b9351edfcbdc699287f24c25f36f728e19ec/scanner/src/funTest/assets/analyzer-result.yml#L37 correct? And it is fed by each analyzer, such as here: https://github.com/heremaps/oss-review-toolkit/blob/915cfb931297dfea1128c7f457de81dc92b2ae51/analyzer/src/main/kotlin/managers/Bower.kt#L63

And the main code is there https://github.com/heremaps/oss-review-toolkit/blob/a297595fce3763b0e30eec0ebbbe9abb420905d9/model/src/main/kotlin/Scope.kt#L28

And your PackageId is close enough to a package URL that the conversion will be easy.
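As a rough sketch of that conversion: the function below assumes the ORT identifier is a colon-separated `type:namespace:name:version` string (an assumption on my part, to be checked against the ORT model) and turns it into a package URL string. `ort_id_to_purl` is a hypothetical helper; type names may not map one-to-one between the two tools, and versions containing a colon are not handled.

```python
from urllib.parse import quote


def ort_id_to_purl(identifier: str) -> str:
    """Convert an assumed "type:namespace:name:version" id to a purl string."""
    parts = identifier.split(":")
    if len(parts) != 4:
        raise ValueError(f"expected 4 colon-separated fields, got: {identifier!r}")
    pkg_type, namespace, name, version = parts
    if not pkg_type or not name:
        raise ValueError("type and name are required")

    # Assumption: lowercasing the type is enough to get the purl type.
    purl = "pkg:" + pkg_type.lower() + "/"
    if namespace:
        # Percent-encode characters such as "@" in an npm scope.
        purl += quote(namespace, safe="") + "/"
    purl += quote(name, safe="")
    if version:
        purl += "@" + quote(version, safe="")
    return purl
```

For instance, `Maven:org.apache.commons:commons-lang3:3.5` becomes `pkg:maven/org.apache.commons/commons-lang3@3.5`, and an empty namespace is simply dropped.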

sschuberth · 2018-10-30T20:43:06Z

the dependency section is this part here

Correct. We group dependencies by scope.

and is fed by each analyzer such as here

Right, except that line 63 you're quoting does not really contain any "feeding" code. Sticking to the example of Bower, probably a better line to quote is https://github.com/heremaps/oss-review-toolkit/blob/915cfb931297dfea1128c7f457de81dc92b2ae51/analyzer/src/main/kotlin/managers/Bower.kt#L105, which creates the actual Package entry, i.e. an entry such as starting at https://github.com/heremaps/oss-review-toolkit/blob/0c42b9351edfcbdc699287f24c25f36f728e19ec/scanner/src/funTest/assets/analyzer-result.yml#L78. The dependencies in the tree structure above are references to these packages.

And your PackageId is close enough to a package URL that the conversion will be easy.

Yes. Also see oss-review-toolkit/ort#20.

steven-esser · 2018-10-31T17:07:47Z

@pombredanne I am in favour of this new scanner for certain package types (python especially).

mjherzog · 2018-10-31T17:20:28Z

+1

pombredanne · 2019-06-07T14:14:13Z

I think that rather than just listing bare dependencies, there is something larger and more generic, which is the notion of a project or environment. In contrast with a package which is well defined with a name and version and lives in some package repository, a project can have some attributes of a package, but can have more and miss some.

From a higher level point of view, there are about four sets of data we can collect on a project or package:

metadata such as a name and version, description, keywords, license, etc. npm package.json and Maven POMs are providing this,
dependencies either potential or resolved (e.g. a lockfile) and possibly "scoped". A bare example is a Python requirements.txt file. npm package.json and Maven POMs are also providing this but not only this.
build instructions that can be succinct in many cases and leverage conventions or can be full-fledged scripts. Makefile, CMake lists, ant, Grunt and many others fall in this category. npm package.json and Maven POMs are providing this, and so is a setup.py or .gemspec. A Visual Studio sln and .csproj and other IDE manifests would also be providing this.
version control information such as a .git or .svn dir, ignore files, .gitmodules, files for the Android repo tool and similar.

Each of these four data pieces may exist or not. Their presence should dictate how we organize the normalized data returned from a scan.

metadata are essential to the definition of what we call a package. So IMHO when we have metadata and can determine a type, a name and possibly a version, we have a proper package that would be stored in a packages list.

Yet if there is no name (say a nameless private Composer package), this would no longer be a package (it cannot be published nor consumed as such anywhere) but only something project/application-like.

dependencies are either for a package or a project. Their presence alone (without metadata) is the mark of a project. For instance we can infer from the presence of a requirements.txt file that we are in a Python project and we know its dependencies.

Like deps, build instructions alone are the mark of a project.

version control information is either for a package or a project.
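The rules above can be sketched as a small decision function. The `Collected` record and `classify` are hypothetical names used only for illustration: a type plus a name makes a package, while a nameless manifest, bare dependencies or bare build instructions mark a project, and version control data alone decides nothing.

```python
from dataclasses import dataclass, field
from typing import List, Optional


# Hypothetical container for the four sets of data; illustration only.
@dataclass
class Collected:
    type: Optional[str] = None      # metadata: package type, e.g. "composer"
    name: Optional[str] = None      # metadata: package name
    version: Optional[str] = None   # metadata: optional version
    dependencies: List[str] = field(default_factory=list)
    build_files: List[str] = field(default_factory=list)
    vcs_url: Optional[str] = None   # version control information


def classify(data: Collected) -> str:
    """Return "package", "project" or "unknown" following the rules above."""
    if data.type and data.name:
        # A version is desirable but not required for a proper package.
        return "package"
    if data.type or data.dependencies or data.build_files:
        # Nameless manifest, bare deps or bare build instructions.
        return "project"
    # Version control data alone fits both a package and a project.
    return "unknown"
```

A requirements.txt on its own gives `classify(Collected(dependencies=["requests"]))`, which returns `"project"`.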

Therefore, I want to add a new data structure and scanner called either `project` or `development_environment` that would capture:

bare dependency declarations outside of a package manifest such as Gemfile and Gemfile.lock, Go deps, Python requirements, etc.
various attributes to describe the technology environment such as the build tools, version control system and revisions, submodules, IDE, etc.

We could also entirely separate the dependencies, project data and version control data, but I feel like a single top-level attribute is likely enough. I am open to either way.

mjherzog · 2019-06-07T16:25:59Z

project and development_environment may not be equivalent in the case of an upstream project and a different downstream development_environment which can look quite different.

sschuberth · 2019-07-01T11:48:08Z

In contrast with a package which is well defined with a name and version and lives in some package repository, a project can have some attributes of a package, but can have more and miss some.

We think the same, which is why we have separate Project and Package classes in ORT, and they don't even inherit from each other although their properties are similar in large parts.

From a higher level point of view, there are about four sets of data we can collect on a project or package:

We have all of that except the build instructions in ORT.

Therefore, I want to add a new data structure and scanner called either project or development_environment

I'd prefer "project" for similarity to ORT 😉

pombredanne · 2019-07-03T08:02:29Z

@sschuberth thanks! we think alike... in hindsight I wonder if project may not be a tad overloaded as a term?

sschuberth · 2019-07-03T08:04:51Z

From a user / developer perspective, I believe that's simply what it is: a project. And we also simply couldn't think of a less overloaded but equally fitting term 😉

mjherzog · 2019-07-03T17:26:55Z

As much as it would be good to align with ORT terminology wherever possible, I think that project is not a good term in this context because what we are trying to name is typically a subset of a project where the most common uses of the term "project" in our domain seem to be:

An open source project
A unit of work within a Development Environment (as used in Eclipse or similar IDE)

What we are trying to name is a (sub)set of files that are logically related by origin, license and function within a project (as defined above).

A package represents the case where this set of files is grouped together by the original project - whether in a package created by a package manager or something as simple as an archive.
The alternative case we are trying to address is a set of files in a codebase - i.e. the files contained within a Development "project". This would usually be a set of source files and the associated configuration and build files, but there is no clear rule about how they are organized within a codebase once they have been extracted from their original "package". The classic use case here is C/C++ code.
I think that the term "component" is the best option here (although I will agree that the term "component" is also overloaded).

These points are separate from how we might add the definition of a Development Environment which is a much broader idea that would typically cover many Development projects.

sschuberth · 2019-07-03T18:43:35Z

What we are trying to name is a (sub)set of files that are logically related by origin, license and function within a project (as defined above).

That was not my understanding from reading the above text. What you describe here sounds a bit like what e.g. Gradle would call a "source set". But @pombredanne mentioned an additional attribute to capture: dependencies. As soon as dependencies come into the picture I find it less fitting to talk about sources / files, as usually dependencies are not managed / declared by the sources / files themselves, but by a higher-level concept / wrapper, i.e. the build system / package manager.

I think that the term "component" is the best option here (although I will agree that the term "component" is also overloaded).

I wouldn't object to "component", simply because, like you said, it pretty much fits all and anything 😉

pombredanne · 2019-08-20T11:31:25Z

So here is where I will be going for now following the details of this conversation and #1237 (comment) :

add (or rather add back, since we had it before) package_manifest as a single file-level attribute that contains package manifest metadata. (Today this is returned as packages only and moved around to the package root.) See Add back file-level package manifest #1728
keep packages as is as a file or directory level list of package manifest data aggregated from package_manifest to their root. Improve the manifest_path to be a full path and not a package root-relative path
add dependencies as a list of deps for dependency-only data files and lockfiles such as Python requirements.txt. Deal with either potential or resolved, and possibly "scoped", deps as it is done today in the packages (actually this will be the exact same data structure)
add build_script as a file-level attribute that can have many of the same attributes as a package (including deps) but typically has no name and version. Makefile, CMake lists, ant, gradle, Grunt and many others fall in this category. For some files that may serve multiple purposes, prefer using a package_manifest instead: e.g. npm package.json, Maven POMs, setup.py or .gemspec are treated as manifests (with build instructions) but not as build_script. A Visual Studio sln and .csproj and other IDE manifests would be build_script
add version_control as a directory-level attribute to capture information such as a .git or .svn dir or .gitmodules, using an object with the same definition as ORT's vcs or SPDX vcs_url
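As an illustration of the version_control item, here is a minimal sketch that statically reads a .gitmodules file into a list of version control records. The `parse_gitmodules` helper and the `vcs_tool`, `path` and `vcs_url` keys are hypothetical and only loosely modeled on the vcs fields mentioned above; nothing here is an existing ScanCode function.

```python
import configparser
from typing import Dict, List


def parse_gitmodules(text: str) -> List[Dict[str, str]]:
    # .gitmodules is INI-like. Strip leading indentation so configparser
    # never treats an indented "key = value" line as a continuation line.
    cleaned = "\n".join(line.strip() for line in text.splitlines())
    parser = configparser.ConfigParser(interpolation=None, delimiters=("=",))
    parser.read_string(cleaned)

    submodules = []
    for section in parser.sections():
        # Sections look like: submodule "libs/foo"
        if not section.startswith("submodule"):
            continue
        submodules.append({
            "vcs_tool": "git",
            "path": parser.get(section, "path", fallback=""),
            "vcs_url": parser.get(section, "url", fallback=""),
        })
    return submodules
```

Nothing is executed or fetched here, which keeps this in line with the static-only approach of the scan.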

The topic of projects is basically set aside by focusing instead on the file-level data first: project-like or component-like concepts are something that is derived from these files anyway and requires them first

pombredanne · 2019-10-02T10:25:25Z

So here is the latest on this. I am pushing this for comment in a branch:

add manifests as a single file-level attribute that contains package manifest metadata. This will be renamed manifests. This is a list, like packages was, as a manifest may declare more than one package (e.g. an RPM spec file, some build script, etc). See also Add back file-level package manifest #1728 . This is essentially the same as, and replaces, the packages attribute, which is removed.
add a root_path attribute to the manifest data structure that points to the root of a package. And remove the manifest_path attribute which was confusing. If we need to track which manifest a package data comes from that should not be inside the tracked data but outside.
move dependencies from the manifests (formerly packages) data as a list of deps as a file-level attribute. This contains only dependency data. This is used both for package manifests, build scripts and lockfiles such as package-lock.json, Gemfile.lock, or Python requirements.txt.
add is_build_script, is_package_manifest and is_ide_manifest as file-level boolean attributes:

Makefile, CMake lists, ant, gradle, Grunt and many others fall in the is_build_script category.
package.json, Maven POMs, setup.py or .gemspec are is_package_manifest
A Visual Studio sln and .csproj and other IDE manifests would be is_ide_manifest

add version_control as a file or directory-level attribute to capture information such as a .git or .svn dir or .gitmodules, using an object with the same definition as AboutCode TK vcs data, ORT's vcs or SPDX vcs_url


pombredanne · 2019-10-03T13:35:20Z

And here is yet another updated proposal:

keep packages as today as a file-level attribute that contains package manifest or dependency manifest or build script package-like metadata in a list. (e.g. an RPM spec file, some build script, etc). See also Add back file-level package manifest #1728 . The only change is that the packages attribute is NOT moved to its root location anymore, but stays with the manifest.
add a root_path attribute to the manifest data structure that points to the root of a package. And remove the manifest_path attribute which was confusing. If we need to track which manifest a package data comes from that should not be inside the tracked data but outside.
move dependencies from the manifests (formerly packages) data as a list of deps as a file-level attribute. This contains only dependency data. This is used both for package manifests, build scripts and lockfiles such as package-lock.json, Gemfile.lock, or Python requirements.txt.
add is_build_script, is_package_manifest, is_ide_manifest and is_dependency_manifest as file-level boolean attributes:

Makefile, CMake lists, ant, gradle, Grunt and many others fall in the is_build_script category. In some cases they could be is_dependency_manifest and/or is_package_manifest too
package.json, Maven POMs, setup.py or .gemspec are is_package_manifest and typically would also be is_dependency_manifest
A Visual Studio sln and .csproj and other IDE manifests would be is_ide_manifest
A yarn.lock, package-lock.json, requirements.txt are is_dependency_manifest

add is_private to the Package model when a package is either marked as such OR is not found in a public package registry.
add version_control as a file or directory-level attribute to capture information such as a .git or .svn dir or .gitmodules, using an object with the same definition as AboutCode TK vcs data, ORT's vcs or SPDX vcs_url

pombredanne · 2019-10-03T15:53:47Z

ok one last round... back toward keeping things simple enough and making fewer changes:

keep packages as today as a file-level attribute that contains package manifest or dependency manifest or build script package-like metadata in a list. (e.g. an RPM spec file, some build script, etc). See also Add back file-level package manifest #1728 . The only change is that the packages attribute is NOT moved to its root location anymore, but stays with the manifest.
add a root_path attribute to the manifest data structure that points to the root of a package. And remove the manifest_path attribute which was confusing. If we need to track which manifest a package data comes from that should not be inside the tracked data but outside.
add is_build_script, is_package_manifest, is_ide_manifest and is_dependency_manifest as boolean attributes of the Package data model:

Makefile, CMake lists, ant, gradle, Grunt and many others fall in the is_build_script category. In some cases they could be is_dependency_manifest and/or is_package_manifest too
package.json, Maven POMs, setup.py or .gemspec are is_package_manifest and typically would also be is_dependency_manifest
A Visual Studio sln and .csproj and other IDE manifests would be is_ide_manifest
A yarn.lock, package-lock.json, requirements.txt are is_dependency_manifest
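The flag assignment listed above could look roughly like the sketch below. `manifest_flags` is a hypothetical helper, and the concrete file names chosen for the build tools (CMakeLists.txt, build.xml, build.gradle, Gruntfile.js) are my own mapping from the tool names in the list, not something ScanCode defines this way.

```python
import os
from typing import Dict


def manifest_flags(path: str) -> Dict[str, bool]:
    """Guess the four boolean flags from a file name alone (illustration only)."""
    lower = os.path.basename(path).lower()
    flags = {
        "is_build_script": False,
        "is_package_manifest": False,
        "is_ide_manifest": False,
        "is_dependency_manifest": False,
    }
    # Build scripts: Makefile, CMake lists, ant, gradle, Grunt.
    if lower in ("makefile", "cmakelists.txt", "build.xml", "build.gradle", "gruntfile.js"):
        flags["is_build_script"] = True
    # Package manifests typically also declare dependencies.
    if lower in ("package.json", "pom.xml", "setup.py") or lower.endswith(".gemspec"):
        flags["is_package_manifest"] = True
        flags["is_dependency_manifest"] = True
    # IDE manifests such as Visual Studio solutions and projects.
    if lower.endswith((".sln", ".csproj")):
        flags["is_ide_manifest"] = True
    # Lockfiles and bare requirement lists carry dependencies only.
    if lower in ("yarn.lock", "package-lock.json", "requirements.txt"):
        flags["is_dependency_manifest"] = True
    return flags
```

The flags are independent rather than mutually exclusive, so a file can be both a package manifest and a dependency manifest, as the list allows.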

These are for later:

add is_private to the Package model when a package is either marked as such OR is not found in a public package registry.
add version_control as a file or directory-level attribute to capture information such as a .git or .svn dir or .gitmodules, using an object with the same definition as AboutCode TK vcs data, ORT's vcs or SPDX vcs_url

steven-esser · 2019-10-03T21:30:14Z

@pombredanne I think in the context of consolidation and other summation techniques this makes a lot of sense.


pombredanne · 2022-04-14T15:33:01Z

This has been merged. Closing now!

pombredanne added new feature package scan labels Oct 30, 2018

pombredanne added this to the v3.0 milestone Oct 31, 2018

pombredanne added the Priority: high label Oct 31, 2018

pombredanne modified the milestones: v3.0, v3.1 Nov 4, 2018

pombredanne modified the milestones: v3.1 Documentation, documentation, documentation, v3.2 Feb 16, 2019

pombredanne mentioned this issue Jun 7, 2019

Group related files together (such as the files of a package, build scripts, etc) #1524

Closed

pombredanne mentioned this issue Aug 20, 2019

Address packages without a name in order to generate a valid purl #1514

Open

This was referenced Oct 2, 2019

manifest_path is mostly empty for packages #1718

Closed

Add back file-level package manifest #1728

Closed

pombredanne added a commit that referenced this issue Oct 3, 2019

Merge latest develop in 1728-package-manifests #1237

8a261dd

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>

pombredanne added a commit that referenced this issue Oct 3, 2019

Use packages attribute throughout #1237

53c89c0

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>

aboutcode-org deleted a comment from sesser Nov 12, 2019

viragumathe5 pushed a commit to viragumathe5/scancode-toolkit that referenced this issue Mar 13, 2020

Use packages attribute throughout aboutcode-org#1237

aa4550a

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>

pombredanne removed this from the v3.3 milestone Sep 24, 2021

pombredanne changed the title ~~Add new dependencies attribute and scanner, separate from actual packages~~ Report package dependencies separate from actual packages Feb 2, 2022

pombredanne added the dependencies label Feb 2, 2022

pombredanne changed the title ~~Report package dependencies separate from actual packages~~ Capture and report package dependencies separate from actual packages Feb 2, 2022

pombredanne mentioned this issue Feb 2, 2022

Fetch details remotely for Go dependencies. #2495

Open

pombredanne closed this as completed Apr 14, 2022
