Scan a repository
- Install the dependencies (possibly using a virtualenv)
- Instantiate the client (either Postgres or SQLite)

```python
from credentialdigger import PgClient
c = PgClient(dbhost='xxx.xxx.xxx.xxx',
             dbport=NUM,
             dbname='mydbname',
             dbuser='myusername',
             dbpassword='mypassword')
```

or

```python
from credentialdigger import SqliteClient
c = SqliteClient(path='/path/to/data.db')
```
- [OPTIONAL] Add the repository

```python
c.add_repo(url='https://github.com/user/repo')
```
- Launch the scan of the repo

```python
new_discoveries = c.scan(repo_url=REPO_URL,
                         category=CATEGORY,
                         models=MODELS,
                         force=FORCE,
                         local_repo=LOCAL_REPO,
                         similarity=SIMILARITY,
                         git_username=GIT_USERNAME,
                         git_token=GIT_TOKEN,
                         debug=DEBUG)
```
- REPO_URL: the url of the repo we want to scan.
- CATEGORY: the category of rules to be used for the scan. If no category is selected, the scanner uses all the rules currently stored in the database.
- MODELS: a list of models we want to apply to auto-classify false positives (the models are applied in cascade, sequentially). If no models are specified, none is used. Refer to the Models page to know more about models.
- FORCE: True if we want to force a complete scan of the repository. Otherwise, if the repository has already been scanned, only the new commits are considered.
- LOCAL_REPO: if True, get the repository from a local directory instead of the web.
- GENERATOR: True if we want to generate an adapted extractor for the snippet model. This only works if the `SnippetModel` [DEPRECATED IN v4.4] is in MODELS, and if there are still discoveries to classify when the `SnippetModel` is reached.
- SIMILARITY: if True, build the embedding model, compute and store embeddings of all discoveries, and allow for automatic update of similar discoveries.
- DEBUG: True if we want visual feedback (progress bars) while the scan is in progress, False otherwise (the default).
- GIT_USERNAME: the git username to be used to authenticate (only enforced if `git_token` is also set).
- GIT_TOKEN: git personal access token used to authenticate (needed for private repos or for git servers where authentication is mandatory).
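For reference, a minimal invocation might look like the sketch below. It is illustrative only: it assumes an SQLite client like the one instantiated above, a public repository, and default values for the remaining parameters; the path and URL are placeholders.

```python
from credentialdigger import SqliteClient

# Illustrative sketch: scan a public repository with the default settings
# (all rule categories, no ML models). Path and URL are placeholders.
c = SqliteClient(path='/path/to/data.db')
new_discoveries = c.scan(repo_url='https://github.com/user/repo',
                         debug=True)
print(f'{len(new_discoveries)} discoveries inserted as new')
```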
`new_discoveries` is a list of the ids of the discoveries that have automatically been inserted into the db as `new`. If MODELS is set, the discoveries classified as false positives are automatically updated in the db (as `false_positive`) without user intervention, and do not appear in `new_discoveries`.
The discoveries in `new_discoveries` are supposed to be analyzed manually by the user, who will then change their state as needed.
```python
for disc_id in new_discoveries:
    this_discovery = c.get_discovery(disc_id)
    # Analyze it
    # Change its state (if needed)
    c.update(disc_id, 'new state')
```
Refer to States for the states supported by the system.
Credential Digger also provides a method to scan the whole repository at a given point in time, i.e., either at a specific commit id or at the last commit of a specific branch. A subsequent `scan_snapshot` on the same repository (but at a different commit id) will only take into consideration the diff between the new snapshot and the previously scanned one (unless the `force` parameter is set to `True`). Moreover, when scanning a snapshot, the `last_scan` timestamp of the repo is set to the timestamp of the commit id of the chosen snapshot instead of the date of the scan. This way, users have a (more useful) indication of the coverage of a repo.
After instantiating the client, this scan can be run as follows:
```python
new_discoveries = c.scan_snapshot(repo_url=REPO_URL,
                                  branch_or_commit=BRANCH_OR_COMMIT,
                                  category=CATEGORY,
                                  models=MODELS,
                                  force=FORCE,
                                  similarity=SIMILARITY,
                                  git_username=GIT_USERNAME,
                                  git_token=GIT_TOKEN,
                                  debug=DEBUG,
                                  max_depth=MAX_DEPTH,
                                  ignore_list=IGNORE_LIST)
```
- CATEGORY, MODELS, SIMILARITY, GIT_USERNAME, GIT_TOKEN, and DEBUG work the same as in the `scan` method.
- BRANCH_OR_COMMIT: the branch name or the commit id at which the whole repository will be scanned.
- MAX_DEPTH: the maximum depth to which the subdirectory tree is traversed. A negative value does not affect the scan. The default value is `-1`.
- IGNORE_LIST: a list of paths to ignore during the scan.
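As a concrete illustration, the sketch below scans the snapshot at the tip of a branch while skipping a test directory. It assumes the client `c` instantiated earlier; the repository URL, branch name, and ignore path are placeholders, and the remaining parameters keep their defaults.

```python
# Illustrative sketch: scan the snapshot at the head of the 'main' branch,
# ignoring a test directory (URL, branch, and path are placeholders).
new_discoveries = c.scan_snapshot(repo_url='https://github.com/user/repo',
                                  branch_or_commit='main',
                                  ignore_list=['tests/'],
                                  debug=True)
```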
Credential Digger also provides a method to scan a pull request, i.e., all the new lines of code introduced in the commits that are part of a pull request. The scan of a pull request requires that the repo has never been scanned before (so as not to clash with the definition of "diff" used by `scan` and `scan_snapshot`).
After instantiating the client, this scan mode can be run as follows:
```python
new_discoveries = c.scan_pull_request(repo_url=REPO_URL,
                                      pr_number=PR_NUMBER,
                                      api_endpoint=API_ENDPOINT,
                                      category=CATEGORY,
                                      models=MODELS,
                                      force=FORCE,
                                      similarity=SIMILARITY,
                                      git_token=GIT_TOKEN,
                                      debug=DEBUG)
```
- CATEGORY, MODELS, SIMILARITY, GIT_TOKEN, and DEBUG work the same as in the `scan` method.
- PR_NUMBER: the id of the pull request that will be scanned.
- API_ENDPOINT: the GitHub endpoint the pull request has been opened on. The default value is `https://api.github.com`, i.e., the public `github.com` platform.
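For instance, a pull request scan might look like the sketch below. It assumes the client `c` instantiated earlier; the repository URL, pull request number, and token are placeholders, and the default `github.com` API endpoint is used.

```python
# Illustrative sketch: scan pull request #42 of a repository on github.com
# (URL, PR number, and token are placeholders).
new_discoveries = c.scan_pull_request(repo_url='https://github.com/user/repo',
                                      pr_number=42,
                                      git_token='<personal access token>',
                                      debug=True)
```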
Credential Digger also provides a method to scan all the repositories belonging to a user.
After instantiating the client, this scan can be run as follows:
```python
new_repos_discoveries = c.scan_user(username=GITHUB_USERNAME,
                                    category=CATEGORY,
                                    models=MODELS,
                                    similarity=SIMILARITY,
                                    debug=DEBUG,
                                    git_token=GIT_TOKEN,
                                    api_endpoint=API_ENDPOINT,
                                    forks=FORKS)
```
- CATEGORY, MODELS, SIMILARITY, GIT_TOKEN, and DEBUG work the same as in the `scan` method.
- GITHUB_USERNAME: the username as it appears on GitHub. All the repositories in this account will be considered for the scan. Please note that this parameter is different from GIT_USERNAME in `scan`.
- FORKS: True if we also want to scan forked repositories, False otherwise (the default).
- API_ENDPOINT: the api endpoint of the git server. If not set, the `github.com` api endpoint, i.e., `https://api.github.com`, is used.
`new_repos_discoveries` is a dictionary whose keys are the urls of the repositories scanned and whose value, for each repository, is the list of ids of discoveries that have automatically been inserted into the db as new (i.e., the return value of the `scan` function).
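Since the result is keyed by repository url, it can be post-processed per repository, for example as in the sketch below. It assumes the client `c` instantiated earlier; the username and token are placeholders.

```python
# Illustrative sketch: scan all repositories of a user and report how many
# new discoveries were found in each one (username and token are placeholders).
new_repos_discoveries = c.scan_user(username='example-user',
                                    git_token='<personal access token>',
                                    debug=True)
for repo_url, discovery_ids in new_repos_discoveries.items():
    print(f'{repo_url}: {len(discovery_ids)} new discoveries')
```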