Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rework input enumeration #216

Merged
merged 5 commits into from
Aug 22, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
14 changes: 8 additions & 6 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -9,8 +9,11 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
## Unreleased

### Changes
- Inputs are now enumerated incrementally as scanning proceeds rather than done in an initial batch step ([#216](https://github.com/praetorian-inc/noseyparker/pull/216)).
This reduces peak memory use and CPU time 10-20%, particularly in environments with slow I/O.
A consequence of this change is that the total amount of data to scan is not known until it has actually been scanned, and so the scanning progress bar no longer shows a completion percentage.

- When scanning, the progress bar for cloning Git repositories now includes the current repository URL ([#212](https://github.com/praetorian-inc/noseyparker/pull/212)).
- When cloning Git repositories while scanning, the progress bar for now includes the current repository URL ([#212](https://github.com/praetorian-inc/noseyparker/pull/212)).

- When scanning, automatically cloned Git repositories are now recorded with the path given on the command line instead of the canonicalized path ([#212](https://github.com/praetorian-inc/noseyparker/pull/212)).
This makes datastores slightly more portable across different environments, such as within a Docker container and on the host machine, as relative paths can now be recorded.
Expand All @@ -20,11 +23,14 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
- The built-in support for enumerating and interacting with GitHub is now a compile time-selectable feature that is enabled by default ([#213](https://github.com/praetorian-inc/noseyparker/pull/213)).
This makes it possible to build a slimmer release for environments where GitHub functionality is unused.

### Fixes

- The `Google OAuth Credentials` rule has been revised to avoid runtime errors about an empty capture group.


## [v0.19.0](https://github.com/praetorian-inc/noseyparker/releases/v0.19.0) (2024-07-30)

### Additions

- The `scan` and `github repos list` commands offer a new `--github-repo-type={all,source,fork}` option to select a subset of repositories ([#204](https://github.com/praetorian-inc/noseyparker/pull/204)).

- A category mechanism is now provided for rules ([#208](https://github.com/praetorian-inc/noseyparker/pull/208)).
Expand All @@ -42,7 +48,6 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
The category information is included in output in the `rules list` command.

### Changes

- The `scan` and `github repos list` commands now only consider non-forked repositories by default ([#204](https://github.com/praetorian-inc/noseyparker/pull/204)).
This behavior can be reverted to the previous behavior using the `--github-repo-type=all` option.

Expand Down Expand Up @@ -84,7 +89,6 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
## [v0.18.1](https://github.com/praetorian-inc/noseyparker/releases/v0.18.1) (2024-07-12)

### Fixes

- Nosey Parker no longer crashes upon startup when running in environments with less than 4 GiB of RAM ([#202](https://github.com/praetorian-inc/noseyparker/pull/202)).

- The `Base64-PEM-Encoded Private Key` rule has been refined to reduce false positives and avoid a rare performance pitfall.
Expand All @@ -93,7 +97,6 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
## [v0.18.0](https://github.com/praetorian-inc/noseyparker/releases/v0.18.0) (2024-06-27)

### Additions

- The README now includes several animated GIFs that demonstrate simple example use cases ([#154](https://github.com/praetorian-inc/noseyparker/pull/154)).

- The `report` command now offers a new `--finding-status=STATUS` filtering option ([#162](https://github.com/praetorian-inc/noseyparker/pull/162)).
Expand Down Expand Up @@ -151,7 +154,6 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
This helps avoid complete loss of scan results in the rare event of a crash.

### Fixes

- A rare crash when parsing malformed Git commit timestamps has been fixed by updating the `gix-date` dependency ([#185](https://github.com/praetorian-inc/noseyparker/pull/185)).

- Upon `noseyparker` startup, if resource limits cannot be adjusted, instead of crashing, a warning is printed and the process attempts to continue ([#170](https://github.com/praetorian-inc/noseyparker/issues/170)).
Expand Down
11 changes: 6 additions & 5 deletions Cargo.lock

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

2 changes: 1 addition & 1 deletion crates/input-enumerator/Cargo.toml
Original file line number Diff line number Diff line change
Expand Up @@ -13,11 +13,11 @@ publish.workspace = true
anyhow = { version = "1.0" }
bstr = { version = "1.0", features = ["serde"] }
bstring-serde = { path = "../bstring-serde" }
crossbeam-channel = "0.5"
fixedbitset = "0.5"
gix = { version = "0.64", features = ["max-performance", "serde"] }
ignore = "0.4"
petgraph = "0.6"
progress = { path = "../progress" }
roaring = "0.10"
schemars = { version = "0.8" }
serde = { version = "1.0", features = ["derive"] }
Expand Down
23 changes: 10 additions & 13 deletions crates/input-enumerator/src/git_metadata_graph.rs
Original file line number Diff line number Diff line change
Expand Up @@ -13,7 +13,6 @@ use std::time::Instant;
use tracing::{debug, error, warn};

use crate::bstring_table::BStringTable;
use progress::Progress;

type Symbol = crate::bstring_table::Symbol<u32>;

Expand Down Expand Up @@ -310,7 +309,7 @@ pub struct RepoMetadata {
}

impl GitMetadataGraph {
pub fn repo_metadata(&self, progress: &Progress) -> Result<Vec<RepoMetadata>> {
pub fn get_repo_metadata(&self) -> Result<Vec<RepoMetadata>> {
let t1 = Instant::now();
let symbols = &self.symbols;
let cg = &self.commits;
Expand Down Expand Up @@ -511,17 +510,15 @@ impl GitMetadataGraph {
assert_eq!(num_commits_visited, num_commits);
assert_eq!(visited_commits.len(), num_commits);

progress.suspend(|| {
debug!(
"{num_commits_visited} commits visited; \
{max_frontier_size} max entries in frontier; \
{max_live_seen_sets} max live seen sets; \
{num_trees_introduced} trees introduced; \
{num_blobs_introduced} blobs introduced; \
{:.6}s",
t1.elapsed().as_secs_f64()
);
});
debug!(
"{num_commits_visited} commits visited; \
{max_frontier_size} max entries in frontier; \
{max_live_seen_sets} max live seen sets; \
{num_trees_introduced} trees introduced; \
{num_blobs_introduced} blobs introduced; \
{:.6}s",
t1.elapsed().as_secs_f64()
);

// Massage intermediate accumulated results into output format
let commit_metadata = cg
Expand Down
62 changes: 24 additions & 38 deletions crates/input-enumerator/src/git_repo_enumerator.rs
Original file line number Diff line number Diff line change
Expand Up @@ -3,15 +3,14 @@ use gix::{hashtable::HashMap, ObjectId, OdbHandle, Repository};
use ignore::gitignore::Gitignore;
use smallvec::SmallVec;
use std::path::{Path, PathBuf};
use std::time::Instant;
// use std::time::Instant;
use tracing::{error_span, warn};
use tracing::{debug, debug_span, warn};

use crate::blob_appearance::{BlobAppearance, BlobAppearanceSet};
use crate::git_commit_metadata::CommitMetadata;
use crate::git_metadata_graph::GitMetadataGraph;

use progress::Progress;

// -------------------------------------------------------------------------------------------------
// implementation helpers
// -------------------------------------------------------------------------------------------------
Expand Down Expand Up @@ -43,26 +42,27 @@ pub struct ObjectCounts {
}

// TODO: measure how helpful or pointless it is to count the objects in advance
// FIXME: if keeping the pre-counting step, add some new kind of progress indicator
fn count_git_objects(odb: &OdbHandle, progress: &Progress) -> Result<ObjectCounts> {
fn count_git_objects(odb: &OdbHandle) -> Result<ObjectCounts> {
let t1 = Instant::now();

use gix::object::Kind;
use gix::odb::store::iter::Ordering;
use gix::prelude::*;

let mut num_commits = 0;
let mut num_trees = 0;
let mut num_blobs = 0;
let mut num_objects = 0;

for oid in odb
.iter()
.context("Failed to iterate object database")?
.with_ordering(Ordering::PackAscendingOffsetThenLooseLexicographical)
{
let oid = unwrap_or_continue!(oid, |e| {
progress.suspend(|| warn!("Failed to read object id: {e}"))
});
num_objects += 1;
let oid = unwrap_or_continue!(oid, |e| { warn!("Failed to read object id: {e}") });
let hdr = unwrap_or_continue!(odb.header(oid), |e| {
progress.suspend(|| warn!("Failed to read object header for {oid}: {e}"))
warn!("Failed to read object header for {oid}: {e}")
});
match hdr.kind() {
Kind::Commit => num_commits += 1,
Expand All @@ -71,6 +71,9 @@ fn count_git_objects(odb: &OdbHandle, progress: &Progress) -> Result<ObjectCount
Kind::Tag => {}
}
}

debug!("Counted {num_objects} objects in {:.6}s", t1.elapsed().as_secs_f64());

Ok(ObjectCounts {
num_commits,
num_trees,
Expand Down Expand Up @@ -129,25 +132,18 @@ impl<'a> GitRepoWithMetadataEnumerator<'a> {
}
}

pub fn run(&self, progress: &mut Progress) -> Result<GitRepoResult> {
pub fn run(&self) -> Result<GitRepoResult> {
let t1 = Instant::now();

use gix::object::Kind;
use gix::odb::store::iter::Ordering;
use gix::prelude::*;

let _span = error_span!("enumerate_git_with_metadata", "{}", self.path.display()).entered();

macro_rules! warn {
($($arg:expr),*) => {
progress.suspend(|| {
tracing::warn!($($arg),*);
})
}
}
let _span = debug_span!("enumerate_git_with_metadata", "{}", self.path.display()).entered();

let odb = &self.repo.objects;

// TODO: measure how helpful or pointless it is to count the objects in advance
// FIXME: if keeping the pre-counting step, add some new kind of progress indicator

// First count the objects to figure out how big to allocate data structures.
// We're assuming that the repository doesn't change in the meantime.
Expand All @@ -156,7 +152,7 @@ impl<'a> GitRepoWithMetadataEnumerator<'a> {
num_commits,
num_trees,
num_blobs,
} = count_git_objects(odb, progress)?;
} = count_git_objects(odb)?;

let mut blobs: Vec<(ObjectId, u64)> = Vec::with_capacity(num_blobs);
let mut metadata_graph = GitMetadataGraph::with_capacity(num_commits, num_trees, num_blobs);
Expand All @@ -183,7 +179,6 @@ impl<'a> GitRepoWithMetadataEnumerator<'a> {
let obj_size = hdr.size();
metadata_graph.get_blob_idx(oid);
blobs.push((oid, obj_size));
progress.inc(obj_size);
}

Kind::Commit => {
Expand Down Expand Up @@ -244,8 +239,10 @@ impl<'a> GitRepoWithMetadataEnumerator<'a> {
}
}

debug!("Built metadata graph in {:.6}s", t1.elapsed().as_secs_f64());

let path = self.path.to_owned();
match metadata_graph.repo_metadata(progress) {
match metadata_graph.get_repo_metadata() {
Err(e) => {
warn!("Failed to compute reachable blobs; ignoring metadata: {e}");
let blobs = blobs
Expand Down Expand Up @@ -281,10 +278,8 @@ impl<'a> GitRepoWithMetadataEnumerator<'a> {
// Build blobs result set.
//
// Apply any path-based ignore rules to blobs here, like the filesystem enumerator,
// filtering out blobs that have paths to ignore.
//
// Note that the behavior of ignoring blobs from Git repositories may be
// surprising.
// filtering out blobs that have paths to ignore. Note that the behavior of
// ignoring blobs from Git repositories may be surprising:
//
// A blob may appear within a Git repository under many different paths.
// Nosey Parker doesn't compute the *entire* set of paths that each blob
Expand Down Expand Up @@ -375,20 +370,12 @@ impl<'a> GitRepoEnumerator<'a> {
Self { path, repo }
}

pub fn run(&self, progress: &mut Progress) -> Result<GitRepoResult> {
pub fn run(&self) -> Result<GitRepoResult> {
use gix::object::Kind;
use gix::odb::store::iter::Ordering;
use gix::prelude::*;

let _span = error_span!("enumerate_git", "{}", self.path.display()).entered();

macro_rules! warn {
($($arg:expr),*) => {
progress.suspend(|| {
tracing::warn!($($arg),*);
})
}
}
let _span = debug_span!("enumerate_git", "{}", self.path.display()).entered();

let odb = &self.repo.objects;

Expand All @@ -406,7 +393,6 @@ impl<'a> GitRepoEnumerator<'a> {
if hdr.kind() == Kind::Blob {
let obj_size = hdr.size();
blobs.push((oid, obj_size));
progress.inc(obj_size);
}
}

Expand Down
Loading