Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Preparations for v1.0 Launch #355

Merged
merged 131 commits into from
Feb 25, 2021
Merged

Preparations for v1.0 Launch #355

merged 131 commits into from
Feb 25, 2021

Conversation

domanchi
Copy link
Contributor

@domanchi domanchi commented Nov 8, 2020

Preface

I've been getting several questions about when this branch will be launched. I too am excited for this change, however, these are the features that need to be worked on before we can deploy this:

There are also some features that I think will be fantastic to add, but can wait till v1.1. These include:

  • Adding integration with gibberish-detector to reduce false positives with HighEntropyStrings
  • Adding --slim functionality to create baselines with the objective of minimal pre-commit modifications (removing line numbers, and generated_at key)
  • Refactor KeywordDetector so that keywords can be surfaced through filters, rather than built into the plugin itself

We're almost there!

Disclaimer

This PR is ridiculously long. Don't even bother. Seriously. The best way of checking this is to checkout my branch, play around with the tool, and read the source straight (rather than looking at this change log).

I really did try with keeping a nice git commit history. However, after some time, I realized that I was going to do fundamental architecture changes, and it will just be faster if each commit did not have to be backwards compatible by themselves. Due to this, the git history is more like mini-milestones towards this change for a better future.

My plan is that when we think the new version is ready for production, we'll just merge this PR. While you're reading the code, if you have any comments on it, create new issues and assign it to me -- I'll fix up the changes and submit a smaller, cleaner PR that bases off this new branch (essentially, a "second" master).

This PR therefore, is an attempt at providing more context, and detailing the changes I made at a high level, to make reading the changes a little more digestible.

Summary

This branch is a re-imagined architecture of the initial tool. After several years of using this tool in a variety of different workflows, as well as reading about several different pain points felt by the community, I redesigned this tool to be modular-by-design, and easily extensible.

On the successful merge of this branch, we will release version 1.0.0. Needless to say, this is a breaking change.

This new design introduces several new elements, including:

  • Global settings object, to eliminate all the pass-through parameters needed to be initialized for false positive identification
  • Modular filters, to provide a consistent (clean) approach of tuning out false positives
  • Dependency injection framework for better designed code
  • File scanners (e.g. config files, YAML files) now are file parsers (i.e. transformers) that convert special files into a compatible format for line-based scanning

I will make sure to reflect this in the CHANGELOG upon successful merge.

NOTE: There are still minor features to implement, but this should be in a pretty stable state to start playing with it.

Overview of Changes

New Features

  • Added NpmDetector and AzureStorageKeyDetector.

User-Facing Changes

I tried not to write additional features during this PR, since I wanted to swap out the underlying architecture, before continuing to add more things. However, these are the following changes I made to the interface of this tool:

  • Removed individual plugin disablement flags (e.g. --no-base64-string-scan). This list just became unnecessarily long, especially with the community adding many more plugins than initially expected (which is fantastic news!). Hopefully, the community continues to contribute, and it won't be at the expense of usability.

  • Added --disabled-plugins flag, so that you can list the names you want to disable. e.g.

    Before: detect-secrets scan --no-base64-string-scan --no-slack-scan
    After: detect-secrets scan --disabled-plugins Base64HighEntropyString,SlackDetector

  • Added --list-all-plugins flag, so that you can quickly list all plugins used in a scan.

  • Changed --use-all-plugins to --force-use-all-plugins, as I felt the latter more accurately described what was being done.

  • Changed detect-secrets scan --update <baseline> to detect-secrets scan --baseline <baseline> to keep it in sync with the pre-commit hook usage.

  • Changed detect-secrets audit --display-results to detect-secrets audit --stats. Also, --stats can be used with the audit --diff mode.

  • Changed --custom-plugins to --plugin

  • Stopped support for py35. I really wanted f-strings, and looks like it's already hit EOL (source)?

Baseline Changes

The details of the baseline changes can be found under detect_secrets.core.upgrades.*.

  • Removal of exclude, word_list and custom_plugin_paths (one-off keys)
  • Addition of filters_used to list all filters used to eliminate false positives
  • Renamed base64_limit and hex_limit to just limit

Developer Velocity

Documentation

OMG, so much better documentation! Check out docs/* for yourself.

Organization

/detect_secrets               # This is where the main code lives
    /audit                    # Powers `detect-secrets audit`
    /core                     # Powers the detect-secrets engine
        /upgrades             # For version bumps that modify the baseline, instructions
                              # on how to apply automated upgrades are found here

    /filters                  # Functions that allow for filtering of false positives
    /plugins                  # All plugins live here, modularized.
    /transformers             # Converters that transform special file formats into line proxies
    /util                     # utility functionality
    main.py                   # Entrypoint for console use
    pre_commit_hook.py        # Entrypoint for pre-commit hook
    settings.py               # Global settings object

Yay to not having crazy long files, and actually separating them out into easy-to-understand pieces (ironic for this PR, I know).

Decoupled SecretsCollection and baseline logic

Perhaps one of the most fundamental changes about this design is the introduction of the global Settings object, in detect_secrets.settings. This module is responsible for one thing: the serialization (and deserialization) of the engine's settings (aka. which plugins and filters are used). In pulling this out into it's own globally accessible module, we can avoid the ever-growing pass-through parameter list that had polluted this codebase (e.g. source)

After the configuration logic was separated, we could drastically simplify our plugin initialization code (found in detect_secrets.core.plugins.initialize). Be sure to check out the wild differences between the former version (invocation, definition) and the current version (invocation, definition). This allows us to move it out of setting them up in usage.py, and allows other developers to much more easily configure their plugins as part of scripts, as compared to being forced to rewrite parts of main.py.

This brings me to a huge paradigm shift: SecretsCollection is merely a container of secrets. Previously, the code was written in a way that the baseline was essentially a serialized SecretsCollection, which made it difficult to reason about. Now, the baseline is a combination of Settings and secrets (the interface of which is SecretsCollection) -- and this allows us to do things like detect_secrets.core.baseline.load.

Serialized, customizable heuristics to filter false positives

Scanning for secrets should not be a complicated task, however, the former code had evolved to making it so. Part of the reason is that as we've developed more ways of filtering out false positives, we've had to implement that logic in several different parts of the code (like, why the heck does a file parser need to know the regex used to exclude files?)

A better solution leverages dependency injection (DI) to supply parameters to filters on demand. For example, detect_secrets.filters.heuristic.is_sequential_string takes in the secret value, and returns True if the secret should be skipped. By aggregating all such configured filters (see detect_secrets.core.scan.get_filters), we can pass in any and all values that the filters may be looking for to make a decision (source).

Check out the currently implemented "out-of-the-box" filters in detect_secrets.filters.*

File Transformers

The thing about filters is that in its current design, they work best with line-based secrets. However, as you know, HighEntropyStrings is the special snowflake that handles YAML and config files differently. How should we resolve this discrepancy?

After playing around with several ideas, I decided that we could have a custom parser for the file, and convert the file into "proxy" lines -- that is, the lines are not an exact match to the original content, but they do allow us to run our line-based plugins on them. With this new change, all plugins are now line-based, and special files will be transformed into lines. Not only does that allow them to take advantage of line-based filters, but now all plugins don't need to re-implement their own parser for the special files!

This was a complicated change -- check out detect_secrets.transformers.* for more details.

Upgrade Infrastructure

With these changes, I realize that the baseline format cannot be depended upon to remain static. Rather, we needed a formalized method to upgrade baselines, that does a better job than merge_baseline. Thus, I created a framework for formalizing the changes made to baseline formats across versions, inspired by Django/Pyramid (I don't remember which).

detect_secrets.core.baseline.upgrade(...) will take your baseline version, and run it through all the necessary changes so that its format is compatible with the latest version. In doing so, it makes it easier for users to upgrade, as well as building in support for audits between versions.

Cleaner Tests

A good coding style heuristic is to structure your code so that it's easily testable. And from the looks of things, our core tests were really ugly and hard to test. I gutted a lot of tests, and opted for cleaner, straight-forward yet equally effective tests (which was easier to do, because of the better designed architecture).

Here's to hoping that these improvements make it much easier to work on this codebase in the future, and allow the community to continue working to make this tool useful.

Aaron Loo and others added 8 commits November 9, 2020 09:51
Co-authored-by: Dariusz Porowski <Dariusz.Porowski@microsoft.com>
Co-authored-by: Dariusz Porowski <Dariusz.Porowski@microsoft.com>
Co-authored-by: Dariusz Porowski <Dariusz.Porowski@microsoft.com>
Co-authored-by: Dariusz Porowski <Dariusz.Porowski@microsoft.com>
@domanchi domanchi marked this pull request as ready for review February 25, 2021 00:16
@domanchi
Copy link
Contributor Author

All individual commits have been reviewed, and I have verified that this passes tests locally (since we currently don't have a working CI for this).

We're finally going live!

@domanchi domanchi merged commit f7b8b03 into master Feb 25, 2021
daparm added a commit to daparm/code-with-engineering-playbook that referenced this pull request Mar 16, 2021
Minor change in detect-secrets:
The Parameter "scan --update <baseline>" has changed to "scan --baseline <baseline>" in the v1.0 

Reference: 
Yelp/detect-secrets#355
At header "User-Facing Changes"
TessFerrandez pushed a commit to microsoft/code-with-engineering-playbook that referenced this pull request Mar 19, 2021
Minor change in detect-secrets:
The Parameter "scan --update <baseline>" has changed to "scan --baseline <baseline>" in the v1.0 

Reference: 
Yelp/detect-secrets#355
At header "User-Facing Changes"
jimmyhlee94 pushed a commit to jimmyhlee94/detect-secrets that referenced this pull request Aug 19, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants