Scanning multiple directories scans too much #3452

Open
rspier opened this issue Jul 12, 2023 · 6 comments
@rspier

rspier commented Jul 12, 2023

Description

Please leave a brief description of the bug or feature request:

How To Reproduce

Tell us how to reproduce the issue.

We have a giant third_party/ directory. GIANT! Scanning one package works fine. But when I try to scan two at once, it scans things outside of those directories:

$ scancode -n 129 --copyright --license --package --json /tmp/out.json  --max-in-memory 0 third_party/curl third_party/zlib
Setup plugins...
Collect file inventory...

It appears to hang there, but strace shows that it's actually scanning things outside of the curl and zlib directories, which will take a long time.

System configuration

For bug reports, it really helps us to know:

  • What OS are you running on? (Windows/MacOS/Linux): Debian testing based system.
  • What version of scancode-toolkit was used to generate the scan file? 32.0.4
  • What installation method was used to install/run scancode? (pip/source download/other)

rspier added the bug label Jul 12, 2023
@pombredanne
Member

Ah, that's a flaw alright. When passing multiple input paths, I think that the current behaviour is to find the shared common root ancestor directory and "ignore" all parts that are not in the provided paths. That's a bad and stupid behaviour indeed.
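
To make the behaviour described above concrete, here is a minimal sketch, assuming the walk starts at the shared ancestor and filters afterwards. It is not scancode's actual code; `walk_common_root` is a hypothetical name used only for illustration.

```python
import os
import tempfile


def walk_common_root(input_paths):
    """Walk the shared ancestor of input_paths and filter afterwards.

    Returns (visited, kept): every directory the walk touched, and the
    subset that lies inside one of the requested inputs.
    Hypothetical helper, not part of scancode or commoncode.
    """
    inputs = [os.path.abspath(p) for p in input_paths]
    root = os.path.commonpath(inputs)
    visited, kept = [], []
    for dirpath, _dirnames, _filenames in os.walk(root):
        visited.append(dirpath)
        # The filter runs only after the directory has already been walked.
        if any(dirpath == i or dirpath.startswith(i + os.sep) for i in inputs):
            kept.append(dirpath)
    return visited, kept


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        base = os.path.join(tmp, "third_party")
        for name in ("curl", "zlib", "huge_a", "huge_b"):
            os.makedirs(os.path.join(base, name, "src"))
        visited, kept = walk_common_root(
            [os.path.join(base, "curl"), os.path.join(base, "zlib")]
        )
        print(len(visited), "directories walked,", len(kept), "kept")
```

In this toy tree the walk touches all nine directories under third_party/ even though only the four under curl and zlib are kept, which would match the symptom seen in strace if the real code works along these lines.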

@pombredanne
Member

@JonoYang @AyanSinhaMahapatra what do you think could be the way to improve this?

@AyanSinhaMahapatra
Member

@pombredanne there's the new paths you added to the Codebase model in aboutcode-org/commoncode#42. Instead of using the include plugin to handle multiple paths, can't we use this directly?
Looking into this more.

AyanSinhaMahapatra added a commit that referenced this issue Aug 3, 2023
Reference: #3452
Signed-off-by: Ayan Sinha Mahapatra <ayansmahapatra@gmail.com>
@rspier
Author

rspier commented Oct 20, 2023

I had some time to poke at this this afternoon, and it's not straightforward.

@pombredanne @AyanSinhaMahapatra Do you have any documentation on how paths is supposed to work? If I'm understanding properly, it's intended to be a set of subdirectories of the root (common_prefix) to filter to. On the surface, this seems more complicated than just iterating over multiple directories and concatenating the results. (So I'm trying to understand the rationale.)
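
For comparison, the "iterate over multiple directories and concatenate" idea could look roughly like the sketch below. This is only an illustration of the approach, not a proposed patch; `collect_files` is a hypothetical name, and the sketch ignores how results would map back onto a single-root Codebase.

```python
import os
import tempfile


def collect_files(input_paths):
    """Walk each input on its own and concatenate the results.

    Hypothetical helper, not part of scancode or commoncode.
    """
    collected = []
    seen = set()
    for path in input_paths:
        top = os.path.abspath(path)
        # Only `top` and its descendants are walked; siblings are never touched.
        for dirpath, _dirnames, filenames in os.walk(top):
            for name in sorted(filenames):
                full = os.path.join(dirpath, name)
                if full not in seen:  # guard against overlapping inputs
                    seen.add(full)
                    collected.append(full)
    return collected


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        base = os.path.join(tmp, "third_party")
        for pkg, fname in (("curl", "a.c"), ("zlib", "b.c"), ("huge", "c.c")):
            os.makedirs(os.path.join(base, pkg))
            with open(os.path.join(base, pkg, fname), "w") as fh:
                fh.write("/* placeholder */\n")
        files = collect_files(
            [os.path.join(base, "curl"), os.path.join(base, "zlib")]
        )
        print(sorted(os.path.basename(f) for f in files))
```

The cost of the walk here scales with the size of the requested inputs rather than with the size of their shared ancestor.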

It also looks like this isn't fully wired up yet. I started with commit 822cc91, and started working through failures. There seem to be some mismatched assumptions about absolute vs relative paths and representation.

I went looking for tests for _create_resources_from_paths (which I think is where the main issues are), but there aren't any that look quite like what I'm looking for. (Although there are some for Codebase).

Anyway, wanted to reach out before I went any deeper...

Thanks-

@pombredanne
Member

> On the surface, this seems more complicated than just iterating over multiple directories and concatenating the results. (So I'm trying to understand the rationale.)

That's an inherited technical wart and debt. The original design was to say that a scan would always have a single root directory.

@AyanSinhaMahapatra
Member

Related: aboutcode-org/commoncode#35
