Optimize Pack for deep, ignored directories #50
Conversation
@apparentlymart while I was pondering this change, I found a potentially problematic aspect of sourcebundle, where matching directories are deleted regardless of any exclusion rules that follow them. I left a comment in the code.
14b57d6 to ff9113c
I guess it's not really clear to me what it means to ignore a parent directory but to "un-ignore" something in that directory. I guess you're interpreting it as "ignore everything except what matches the exclusions", which I can see as a rule that makes sense when building a tar, because the directory tree gets flattened anyway and so we can presumably just skip over the tar entries that don't match the rule and trust that the extracting program will still make some sense of it by filling in what was missing? 🤔
Seems trickier when trying to apply the ignore rules to a physical directory tree on disk by deleting, since in that case we're interpreting the rules "backwards", describing what to keep rather than what to skip. However, the source bundler stuff is all experimental right now anyway, so I think the comment you added here is sufficient for now, and we can have a deeper think about what to do with this in a separate PR later.
All right, this optimization makes logical sense, and I especially appreciate how cautiously scoped it is.
I have not smoke-tested it, but it looks good and your tests are convincing. Let's go for it.
When ignore rules match a directory, and it is certain that no exclusion rules could re-include anything beneath it, short-circuit further descent into that directory. The optimization covers the specific case where no exclusion rules of any kind follow a matching rule. A rule for which this holds is said to be 'dominant' in the result: a dominant rule can be certain that it won't be undone by additional rules, because every rule that follows it only ignores more files. I couldn't expand the definition of 'dominant' beyond this, because globbing, character classes, and non-absolute rules conspire to make it difficult to know whether a particular exclusion rule would countermand a particular previous rule.
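As a rough illustration of that dominance idea (a sketch, not go-slug's actual rule type or implementation; the `rule` struct and `markDominant` helper here are hypothetical), dominance can be computed in a single backwards pass over the parsed rules: a rule is dominant only if no negated rule appears anywhere after it.

```go
package main

import "fmt"

// rule is a hypothetical, simplified view of a parsed .terraformignore rule.
type rule struct {
	pattern  string
	negated  bool // a "!" rule that re-includes previously ignored paths
	dominant bool // true when no negated rule follows this one
}

// markDominant walks the rule list backwards: a rule is dominant only if no
// negated rule appears anywhere after it, so a directory it matches can never
// be re-included by a later rule.
func markDominant(rules []rule) {
	sawNegation := false
	for i := len(rules) - 1; i >= 0; i-- {
		rules[i].dominant = !rules[i].negated && !sawNegation
		if rules[i].negated {
			sawNegation = true
		}
	}
}

func main() {
	rules := []rule{
		{pattern: "vendor/"},                      // not dominant: a negation follows it
		{pattern: "!vendor/keep/", negated: true}, // the negation itself
		{pattern: "*.log"},                        // dominant: nothing after it negates
	}
	markDominant(rules)
	for _, r := range rules {
		fmt.Printf("%-14s dominant=%v\n", r.pattern, r.dominant)
	}
}
```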
Using the word "exclusion" for rules that represent the opposite (a negated exclusion, i.e. an "inclusion") was a very misleading name, so I renamed it throughout to "negation"/"negated". This does not modify any public package API.
10d01bb to 101bfff
I like it even better after that rename.
We see circumstances where monorepos containing many deep subdirectories, combined with broad .terraformignore rules, take a long time to pack because, until now, the rules had to be exhaustively checked for exclusions on every entry. This change adds an optimization for .terraformignore: when no exclusions are present, ignored directories can be skipped entirely during the tree walk that Pack performs. By "when no exclusions are present", I mean that no exclusion rules of any kind follow a matching rule. A rule for which this holds is said to be 'dominant' in the result: a dominant rule can be certain that it won't be undone by following rules, because every rule that follows it only ignores more files.
I couldn't expand the definition of 'dominant' beyond this because globbing, character classes, and non-absolute rules conspire to make it difficult to know whether a particular exclusion rule would countermand a particular previous rule.
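A rough sketch of how that short-circuit plays out during the tree walk (this is not go-slug's actual code; `isDominantIgnoredDir` and `visit` are hypothetical stand-ins for the real rule matching and per-entry packing work): when a directory is matched by a dominant ignore rule, the walker skips the whole subtree instead of descending into it.

```go
package main

import (
	"io/fs"
	"path/filepath"
)

// packWalk sketches the short-circuit: isDominantIgnoredDir is a hypothetical
// predicate reporting whether a directory is matched by an ignore rule with no
// exclusion (negation) rules after it, and visit stands in for the work the
// real packer does per entry.
func packWalk(root string, isDominantIgnoredDir func(string) bool, visit func(string) error) error {
	return filepath.WalkDir(root, func(path string, d fs.DirEntry, err error) error {
		if err != nil {
			return err
		}
		if d.IsDir() && path != root && isDominantIgnoredDir(path) {
			// No later rule can re-include anything beneath this directory,
			// so skip the whole subtree instead of checking every entry.
			return fs.SkipDir
		}
		// Otherwise the full rule set still has to be applied to this entry.
		return visit(path)
	})
}

func main() {
	// Toy usage: skip any directory literally named "node_modules".
	_ = packWalk(".", func(p string) bool {
		return filepath.Base(p) == "node_modules"
	}, func(p string) error { return nil })
}
```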
Benchmark Results
I created a directory containing 200,000 files within 2,000 random subdirectories with a maximum depth of 5. I then set up a .terraformignore file to ignore all subdirectories unless they started with the digit 9:
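The exact ruleset isn't reproduced above. One way to express "ignore every top-level directory whose name doesn't start with 9" without any negation rules (so that every matching rule stays dominant) might look something like the following; this is a plausible reconstruction, not the file actually used in the benchmark:

```
# Ignore top-level directories starting with anything other than "9".
# No "!" rules follow, so each matching rule is dominant and matched
# directories can be skipped without descending into them.
/[0-8]*/
/[a-z]*/
```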
These rules exclude 97% of the possible files in the directory. Before this optimization, go-slug would descend into every subdirectory regardless of whether anything inside could still end up included. That operation took about 30 seconds on my computer. With this change, it took about 2 seconds, because every top-level subdirectory could be skipped outright: no exclusion rules followed the matching rules. It should be noted that if I appended any exclusion rule whatsoever, the optimization was defeated and the operation returned to its original duration.
Performance comparison with and without a trailing exclusion clause:
With no exclusions as described above:
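The original timing output isn't reproduced here. As a rough sketch of how the comparison can be re-run (the path testdata/bigtree is a placeholder for the generated tree, and go-slug's Pack(src, w, dereference) signature is assumed): run it once with the negation-free .terraformignore in place, then again after appending a trailing "!" rule, and compare the two durations.

```go
package main

import (
	"fmt"
	"io"
	"time"

	slug "github.com/hashicorp/go-slug"
)

func main() {
	// Time a single Pack of the generated tree; .terraformignore in the
	// source directory is applied by Pack itself.
	start := time.Now()
	if _, err := slug.Pack("testdata/bigtree", io.Discard, true); err != nil {
		panic(err)
	}
	fmt.Printf("Pack took %s\n", time.Since(start))
}
```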
Closes #19