Custom License Rules folder #2471

Closed
1 of 7 tasks
tardyp opened this issue Mar 31, 2021 · 8 comments

Comments

@tardyp
Contributor

tardyp commented Mar 31, 2021

Short Description

Our internal code has copyright headers that we would like to properly categorize.
We don't think it makes sense to upstream those rules, and we want to avoid forking scancode.

Thus we would like to add an option to scancode to provide a folder path that would contain custom .yml + RULE files.
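
As a rough illustration of what reading such a folder could involve, here is a minimal sketch that pairs each RULE text file with the .yml file of the same base name and reports files that lack a counterpart. The helper name `pair_rule_files` and the flat folder layout are assumptions made for this example; this is not scancode's actual loading code.

```python
from pathlib import Path


def pair_rule_files(folder):
    """Pair each *.RULE text file with the *.yml file of the same base name.

    Hypothetical helper, not part of scancode. Returns (pairs, orphans):
    pairs is a list of (rule_path, yml_path) tuples sorted by base name,
    orphans is a sorted list of files that are missing their counterpart.
    """
    folder = Path(folder)
    rules = {p.stem: p for p in folder.glob("*.RULE")}
    ymls = {p.stem: p for p in folder.glob("*.yml")}

    # Base names present on both sides form a complete rule.
    pairs = [(rules[name], ymls[name]) for name in sorted(rules.keys() & ymls.keys())]

    # Anything present on one side only is reported instead of silently ignored.
    orphans = sorted(
        [rules[name] for name in rules.keys() - ymls.keys()]
        + [ymls[name] for name in ymls.keys() - rules.keys()]
    )
    return pairs, orphans
```

Reporting orphans rather than skipping them would make a typo in a custom file name visible before the scan starts.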

Possible Labels

  • license scan

Select Category

  • Enhancement
  • Add License/Copyright
  • Scan Feature
  • Packaging
  • Documentation
  • Expand Support
  • Other

How This Feature will help you/your organization

This would help us to use scancode to categorize proprietary code we get from subcontractors

Possible Solution/Implementation Details

User would say

scancode -clip --custom_licenses=/path/to/licenses --custom_rules=/path/to/rules --json-pp - path/to/code

Can you help with this Feature

We are willing to provide a PR for this feature

@mjherzog
Member

This feature should be useful for many ScanCode users.

@pombredanne
Member

This makes sense. @richardfontana requested this feature in #480, and I reckon I have been slow to act as I was afraid of fragmenting the database of licenses. In hindsight, that (unexpressed) concern was unlikely to be a valid one.

Now if these are just a few proprietary licenses and headers, it could be well worth adding them to scancode anyway.

And to implement this feature here are some thoughts:

A) the base approach to get these extra rules into scancode:

  1. a directory that contains extra rules and that you could point to with some command line argument
  2. a "plugin" where we package extra licenses and rules in a Python package that can be installed locally as some private extra.

I am leaning towards 2, as otherwise this may be complicated to deploy.

B) how these rules and licenses would be consumed:

  1. they could be merged into the main scancode index
  2. they could be included in their own secondary index (with either A.1 or A.2) and the detection would run using this (or these) extra indexes either before or after the main index, and the matched results merged

I am not sure which is best.
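
To make the two options under B) concrete, here is a toy sketch. The "index" is a plain dict and matching is a substring test, both stand-ins chosen for this example; the names `MAIN_INDEX`, `CUSTOM_INDEX` and `detect` are hypothetical and none of this reflects how scancode's real index works.

```python
# Toy stand-ins for license indexes: license key -> rule text (assumed data).
MAIN_INDEX = {"mit": "permission is hereby granted, free of charge"}
CUSTOM_INDEX = {"acme-proprietary": "copyright acme corp. all rights reserved"}


def detect(text, index):
    """Toy matcher: return the sorted license keys whose rule text occurs in text."""
    lowered = text.lower()
    return sorted(key for key, rule_text in index.items() if rule_text in lowered)


def detect_with_merged_index(text):
    # Option B.1: the custom rules are merged into a single index up front.
    merged = {**MAIN_INDEX, **CUSTOM_INDEX}
    return detect(text, merged)


def detect_with_secondary_index(text):
    # Option B.2: each index is queried on its own and the matches are merged.
    matches = detect(text, CUSTOM_INDEX) + detect(text, MAIN_INDEX)
    return sorted(set(matches))
```

In this toy form both options return the same matches; the practical difference would be in caching and in whether the custom index can be queried on its own.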

@tardyp We could have a quick live session to iron out a path!

@tardyp
Contributor Author

tardyp commented Apr 1, 2021

I didn't see #480, as I only focused my search on the keyword RULES.

I like what I see there, especially the idea from @DennisClark to automatically create this custom folder based on Unknown License findings.

In my first scans with scancode, we ended up with a big pile of unknown licenses, which is normal, as we want to use scancode to make sure our proprietary software is not mixed up with open source, and that our devs use packaging techniques to compose software.

I spent some time yesterday experimenting with the source code of scancode, and indeed discovered the huge license library and the need to cache the index.

I am not sure there is really a use case where the number of custom licenses would be so big that they need to be cached as well. The cache module refactoring this would need seems quite scary to me.

What I like with a secondary index is that we could skip primary matching altogether if the secondary index match score is high enough.

This could open the path to a quick scan mode that we could put in the pre-commit CI.

@pombredanne
Member

In my first scans with scancode, we ended up with a big pile of unknown licenses, which is normal, as we want to use scancode to make sure our proprietary software is not mixed up with open source, and that our devs use packaging techniques to compose software.

FWIW, any incorrect detection is treated as a bug (so tickets are mucho welcome!) AND @AyanSinhaMahapatra 's https://github.com/nexB/scancode-analyzer/ is a new, emerging tool to spot and potentially fix these issues using multiple approaches including some ML.

I am not sure there is really a use case where the number of custom licenses would be so big that they need to be cached as well. The cache module refactoring this would need seems quite scary to me.

No worries there, it's not that complicated

What I like with a secondary index is that we could skip primary matching altogether if the secondary index match score is high enough.
This could open the path to a quick scan mode that we could put in the pre-commit CI.

Question: if you were to use a secondary index in your case, would you see an exclusive use of that index for a given scan run and not the main one? Or would you see the use of both at the same time?

@tardyp
Contributor Author

tardyp commented Apr 1, 2021

incorrect detection is treated as a bug

I am not saying it is incorrect detection: those are mostly files under our proprietary license, and I don't expect scancode to magically detect it.
We have our own SPDX identifiers, and scancode detects them as unknown-spdx; maybe that could be enhanced.

Question: if you were to use a secondary index in your case, would you see an exclusive use of that index for a given scan run and not the main one? Or would you see the use of both at the same time?

I would use both.

For each file, if the secondary index detects with a 100% score that this is our copyright, don't bother running the rest of the rules.
If a file does not match one of our proprietary licenses, we try to detect based on the primary database (and we can afford the 250ms per file this takes).
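
The per-file logic described above could look roughly like the sketch below. The `Match` class, the `scan_file` function and the two detector callables are hypothetical names introduced for this example; they are not scancode APIs, and the 100% threshold is simply the value mentioned in this comment.

```python
from dataclasses import dataclass


@dataclass
class Match:
    license_key: str
    score: float  # match score, from 0 to 100


def scan_file(text, secondary_detect, primary_detect, threshold=100.0):
    """Query the custom (secondary) detector first.

    The primary detector is only called when no secondary match reaches
    the threshold, so files fully covered by a proprietary license skip
    the expensive primary matching.
    """
    custom_matches = secondary_detect(text)
    if any(match.score >= threshold for match in custom_matches):
        return custom_matches
    # No confident custom match: keep any partial custom matches and
    # add whatever the primary detector finds.
    return custom_matches + primary_detect(text)
```

Passing the detectors in as callables keeps the sketch independent of how either index is actually built or cached.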

@codeakki

codeakki commented Apr 4, 2021

Hey, may I know how scancode-toolkit creates the dataset for the agent?

@tardyp
Contributor Author

tardyp commented Apr 4, 2021

Hi @codeakki ,
the dataset is stored inside the source code:
https://github.com/nexB/scancode-toolkit/tree/develop/src/licensedcode/data

@pombredanne
Member

I am closing this in favor of the older #480


4 participants