Report detailed Git provenance information for matches #16
Comments
Hello, I am very curious. Do you already know how you are going to use Git features to find this kind of information? Thanks!
Hi @Coruscant11. Good question!

Nosey Parker scans Git history not by looking at commit metadata, but by simply scanning all blobs within the Git repository. This allows fast scanning of the content, but doesn't provide any metadata such as the pathname for a blob or the commit(s) that introduced it. This makes the reporting of findings rather unfriendly to humans.

So the problem that needs to be solved is this: for a given Git blob, determine the set of

The information needed to solve this problem is present only in indirect form in a Git repository: pathnames of blobs are encoded in a DAG of tree objects, and commit ancestry is encoded in a DAG of commit objects. The problem can be posed as a graph problem.

I've implemented this twice before in other non-public codebases, both times solving the above problem in one big operation for all blobs in a Git repo. This does indeed end up being expensive on certain large repositories, likely more expensive than the actual scanning step. In Nosey Parker, instead of solving this

For a quick hack, there are ways to determine this information using Git command-line tools, like this. But that sort of approach is slow.

I'm planning to work on this next, probably in the next couple of weeks.
Thank you for your excellent answer! It relates to some questions I had for #4. I was a little worried about the details of the missing commits. But I think I'll do exactly as the

I was just wondering about what you mean by "more expensive than the actual scanning step". Take a scan of the Linux kernel, for example, which takes, let's say, two minutes. If we want to find all the information about commits and pathnames, how long could it take? My guess is that it could take a few hours, but I do not know if I am mistaken.

This seems to be a very difficult problem, good luck! Don't hesitate to let me know if I can help you.
In one older prototype implementation I have, it takes about 6 minutes on my laptop to find that information for 500 blobs. I suspect that code could be made a few times faster, also. So, a few times slower than actually scanning, but certainly not hours.
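A stopgap until this feature is implemented: a little shell script, found on Stack Overflow, that can tell which commits in a repo contain a given blob:

```sh
#!/bin/sh
# Usage: <script> <blob-id> [git log options...]
obj_name="$1"
shift
# List each commit with its root tree, abbreviated hash, and subject.
git log "$@" --pretty=tformat:'%T %h %s' \
| while read -r tree commit subject ; do
    # Report the commit if the blob id appears anywhere in its tree.
    if git ls-tree -r "$tree" | grep -q "$obj_name" ; then
        echo "$commit" "$subject"
    fi
done
```

I named this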
This could be easily adapted to also print the pathname of the object.
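One possible adaptation of the script above that also prints the pathname, as a rough sketch rather than a tested tool. The function names `match_blob_paths` and `find_blob` are invented for illustration. It assumes the usual `git ls-tree -r` line layout (mode, type, object id, then a tab and the path) and matches the blob id as a prefix of the object id field instead of grepping the whole line.

```sh
#!/bin/sh
# Read `git ls-tree -r` output on stdin and print the path of every
# entry whose object id starts with the id given as $1.
match_blob_paths() {
    awk -v id="$1" -F '\t' '{
        split($1, f, " ")              # f[1]=mode, f[2]=type, f[3]=object id
        if (index(f[3], id) == 1) print $2
    }'
}

# Print "<commit> <path> <subject>" for every commit whose tree
# contains the blob. Extra arguments are passed through to `git log`.
find_blob() {
    obj_name="$1"
    shift
    git log "$@" --pretty=tformat:'%T %h %s' |
    while read -r tree commit subject; do
        git ls-tree -r "$tree" | match_blob_paths "$obj_name" |
        while IFS= read -r path; do
            printf '%s %s %s\n' "$commit" "$path" "$subject"
        done
    done
}
```

After sourcing the file inside a repository, `find_blob <blob-id> --all` would list matches across all refs. Like the original, this walks the full tree of every commit, so it stays slow on large histories.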
Nice!
This commit adds blob metadata to Nosey Parker.

The scan command now collects and records some basic metadata about blobs (size in bytes, guessed mime type, guessed charset). The guessed metadata is based on path names, and at present only works on plain file inputs and not blobs found in Git history (see #16).

If Nosey Parker is built with the libmagic feature, blob metadata is collected and recorded using an additional content-based mechanism that uses libmagic, which collects this information even for blobs found in Git history that do not have pathnames. This feature slows scanning by roughly 6-10x and requires additional system-installed libraries to build, and so is not enabled by default.

When scanning, by default, the metadata is collected and recorded only for blobs that have rule matches within them. The collection of blob metadata can be controlled slightly by the new `--record-all-blobs <BOOL>` command-line option; a true value causes all discovered blobs to have metadata collected and recorded, not just those with rule matches.

The report command makes use of the newly collected metadata. In all output formats, the metadata is included.

Additionally in this pull request: the performance of scanning on certain match-heavy workloads has been improved by as much as 2x. This was achieved by using fewer sqlite transactions in the datastore implementation.
@Coruscant11 I'm working on adding native support to Nosey Parker to collect pathname and commit information for blobs. Hoping to merge that back soon. In the meantime, I have discovered a different workaround for determining this information from
This seems to work faster than the shell script I posted above.
Awesome! I knew about this workaround, and I think it works in a way pretty similar to what you did in the past. I have to check what's happening under the hood.

For me, this feature should be optional and enabled on user demand, because one of the main advantages of Nosey Parker, its performance, would be drastically reduced (resolving blob provenance is very slow in huge repositories).

Pretty curious to hear your opinion on this 😄
I have a pull request I'm actively working on (#66) that efficiently collects a bunch of additional metadata for all blobs. This includes the commit(s) that introduced the blob, the pathname it first appeared with, commit timestamps, messages, etc.

This metadata is computed using a novel (?) algorithm combining Kahn's algorithm for topological graph traversal with a priority queue to minimize memory use. In most cases the total runtime overhead of computing this information is small, perhaps 15%. This is probably many orders of magnitude faster than running the

On large/unusual repos I have tried (Linux and Homebrew Core) the overhead is much more noticeable, but still usable. In either case, the PR includes command-line options to disable metadata collection, and avoid its overhead, if desired.

Support for computing this information natively will be merged back into Nosey Parker's
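For comparison, first-seen information of this kind can be approximated with stock Git commands. The sketch below is not the algorithm from #66: it swaps in `git rev-list --topo-order` for the traversal and `git diff-tree` for detecting added files, and the function names `added_blobs` and `first_seen` are made up for illustration. It misses blobs introduced only by merge commits, and it reports a single commit per blob even when several commits introduced it independently.

```sh
#!/bin/sh
# Read raw `git diff-tree` lines on stdin and print "<blob id><TAB><path>"
# for each entry with status A (file added).
added_blobs() {
    awk -F '\t' '{
        split($1, f, " ")      # f[4]=new object id, f[5]=status letter
        if (f[5] == "A") print f[4] "\t" $2
    }'
}

# Print "<blob id> <commit> <path>" for the first commit, in
# parents-before-children order, that adds each blob.
first_seen() {
    git rev-list --topo-order --reverse "${1:-HEAD}" |
    while read -r commit; do
        git diff-tree -r --root --no-commit-id --raw --no-abbrev "$commit" |
        added_blobs |
        while IFS="$(printf '\t')" read -r blob path; do
            printf '%s %s %s\n' "$blob" "$commit" "$path"
        done
    done |
    awk '!seen[$1]++'          # keep only the first line per blob id
}
```

After sourcing the file inside a repository, `first_seen HEAD` lists one line per blob. It runs one `git diff-tree` process per commit, so it should be expected to be far slower than a native implementation.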
The `scan` command now collects additional metadata about blobs found within Git repositories. Specifically, for each blob found in Git repository history, the set of commits where it was introduced and the accompanying pathname for the blob is collected. This is enabled by default, but can be controlled using the new `--git-blob-provenance={first-seen,minimal}` parameter. Fixes #16.

There are several other improvements and new features in this commit:

* Add a new rule to detect Amazon ARN strings
* Fix a typo in the Okta API Key rule
* Update dependencies
Like #15, this was also demoed at Black Hat EU 2022.
When a match is found within a blob in a Git repository, detailed provenance information should be reported, including:
With all this information, it is possible to generate permalinks to GitHub for matches.