Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize Pack for deep, ignored directories #50

Merged
merged 2 commits into from
Dec 4, 2023

Conversation

brandonc
Copy link
Contributor

@brandonc brandonc commented Nov 23, 2023

We see circumstances where monorepos containing many deep subdirectories with broad .terraformignore rules take a lot of time to pack because, before now, the rules had to be exhaustively checked for rule exclusions. This change adds an optimization for .terraformignore where, if no exclusions are present, ignored directories can be skipped during the tree walk that pack takes. By "if no exclusions are present", I mean when absolutely no exclusion rules of any kind follow a matching rule. For rules where this is the case, the rule is said to be 'dominant' in the result. Dominant rules can be certain that they won't be undone by following rules because all the following rules ignore more files.

I couldn't expand the definition of 'Dominant' because globbing, character classes, and non-absolute rules conspire to make it difficult to know whether a particular exclusion rule would countermand a particular previous rule.

Benchmark Results

I created a directory containing 200,000 files within 2,000 random subdirectories with a maximum depth of 5. I then set up a .terraformignore file to ignore all subdirectories unless they started with the digit 9:

/subdir_1*/**/*
/subdir_2*/**/*
/subdir_3*/**/*
/subdir_4*/**/*
/subdir_5*/**/*
/subdir_6*/**/*
/subdir_7*/**/*
/subdir_8*/**/*

These rules exclude 97% of the possible files in the directory. Before this optimization, go-slug would descend into each subdirectory, regardless of whether it was possible to reach a match. This operation took about 30 seconds on my computer. With this change, the operation took about 2 seconds, because all the subdirectories within the top level could be skipped since there were no possible exclusions that follow. It should be noted that if I added any exclusion rule whatsoever to the end, the optimization is defeated and the operation returns to the original duration.

Performance comparison with and without a trailing exclusion clause:

$ go test -bench=. -count 1
goos: darwin
goarch: amd64
pkg: github.com/brandonc/deepdir/bench
cpu: Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
BenchmarkPackGeneratedFiles-16    	       1	27741266300 ns/op
--- BENCH: BenchmarkPackGeneratedFiles-16
    bench_test.go:32: Packed 6176 files to 163 bytes
PASS
ok  	github.com/brandonc/deepdir/bench	27.890s

With no exclusions as described above:

go test -bench=. -count 1
goos: darwin
goarch: amd64
pkg: github.com/brandonc/deepdir/bench
cpu: Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
BenchmarkPackGeneratedFiles-16    	       1	1600486149 ns/op
--- BENCH: BenchmarkPackGeneratedFiles-16
    bench_test.go:32: Packed 6073 files to 145 bytes
    bench_test.go:24: Removed "/var/folders/mk/509hnk695wz03fbt21q__h_m0000gq/T/benchslug1905125818"
PASS
ok  	github.com/brandonc/deepdir/bench	1.748s

Closes #19

@brandonc
Copy link
Contributor Author

@apparentlymart while I was pondering this change, I found a potentially problematic aspect of sourcebundle, where matching directories are deleted regardless of potential exclusion rules that follow it. I left a comment in the code.

@brandonc brandonc force-pushed the brandonc/maybe_skip_pack_descent branch from 14b57d6 to ff9113c Compare November 24, 2023 13:27
@apparentlymart
Copy link
Contributor

I guess it's not really clear to me what it means to ignore a parent directory but to "un-ignore" something in that directory.

I guess you're interpreting it as "ignore everything except what matches the exclusions", which I can see as a rule that makes sense when building a tar because the directory tree gets flattened anyway and so we can presumably just skip over the tar entries that don't match the rule, and trust that the extracting program will still make some sense of it by filling in what was missing? 🤔

Seems trickier when trying to apply the ignore rules to a physical directory tree on disk by deleting, since in that case we're interpreting the rules "backwards", describing what to keep rather than what to skip. However, the source bundler stuff is all experimental right now anyway, so I think the comment you added here is sufficient for now and we can have a deeper think about what to do with this in a separate PR later.

Copy link
Member

@nfagerlund nfagerlund left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All right, this optimization makes logical sense, and I especially appreciate how cautiously scoped it is.

I have not smoke-tested it, but it looks good and your tests are convincing. Let's go for it.

When ignore rules match a directory, and certainly no exclusion rules would include subdirectories, short circuit further descent into that directory.

The optimization supports a certain case where no exclusion rules of any kind follow a matching rule. For rules where this is the case, the rule is said to be 'dominant' in the result. Dominant rules can be certain that they won't be undone by additional rules because all the following rules ignore more files.

I couldn't expand the definition of Dominant because globbing, character classes, and non-absolute rules conspire to make it difficult to know whether a particular exclusion rule would countermand a particular previous rule.
Exclusion rules modified with the word "exclusion" to represent the opposite (negated exclusion, or "inclusion") was a very misleading name. I renamed it throughout as "negation" and "negated"

This does not modify any public package API
@brandonc brandonc force-pushed the brandonc/maybe_skip_pack_descent branch from 10d01bb to 101bfff Compare December 4, 2023 16:51
Copy link
Member

@nfagerlund nfagerlund left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like it even better after that rename.

@brandonc brandonc merged commit 4f09527 into main Dec 4, 2023
9 checks passed
@brandonc brandonc deleted the brandonc/maybe_skip_pack_descent branch December 4, 2023 21:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Avoid entering excluded directories
3 participants