This project was created to help you scan your AWS S3 buckets for PII data. The app leverages the scale and pricing model of AWS services, so you only pay for what you use.
When deployed, it scans your AWS S3 bucket (you can also set prefixes to limit the scan to specific paths), detects file types automatically, and extracts possible PII using regular expressions.
The scan summary is loaded into your Elasticsearch cluster, where you can build Kibana dashboards to report your DLP exposure.
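For context, here is a minimal sketch of the kind of prefix-limited listing the scanner performs, using the AWS SDK for Java v2 (the SDK choice and the bucket/prefix values are illustrative assumptions, not the project's actual code):

```java
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.ListObjectsV2Request;
import software.amazon.awssdk.services.s3.model.S3Object;

public class PrefixListing {
    public static void main(String[] args) {
        // Hypothetical values; in Project Matt these come from the
        // MY_S3_BUCKET and MY_S3_PREFIX environment variables.
        String bucket = "my-data-bucket";
        String prefix = "exports/2024/";

        try (S3Client s3 = S3Client.create()) {
            ListObjectsV2Request request = ListObjectsV2Request.builder()
                    .bucket(bucket)
                    .prefix(prefix) // limits the scan to keys under this path
                    .build();

            // The paginator transparently follows continuation tokens
            s3.listObjectsV2Paginator(request).contents().stream()
                    .map(S3Object::key)
                    .forEach(System.out::println);
        }
    }
}
```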
- Regex: The app currently detects some key European personal data patterns. You can fork the project and add more regex patterns (see the classifier sketch after this list for the general shape). You can read more on the available classifiers here.
- Keyword Matching: Currently in development and not yet released, because a large amount of domain expertise is required on this topic.
- Convolutional Neural Networks: In active development; to be released with the next major update. The project will use CNNs to detect sensitive or PII words in scanned files.
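For illustration, here is a minimal sketch of what a regex classifier can look like. The pattern below is a simplified IBAN matcher written for this example; it is not one of the project's actual classifiers:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class IbanClassifier {
    // Simplified IBAN shape: 2-letter country code, 2 check digits,
    // then 11-30 alphanumeric BBAN characters. Illustrative only; a real
    // classifier would also verify the mod-97 checksum to cut false positives.
    private static final Pattern IBAN =
            Pattern.compile("\\b[A-Z]{2}\\d{2}[A-Z0-9]{11,30}\\b");

    public static void main(String[] args) {
        String text = "Payment to DE89370400440532013000 confirmed.";
        Matcher m = IBAN.matcher(text);
        while (m.find()) {
            System.out.println("Possible IBAN: " + m.group());
        }
    }
}
```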
Project Matt uses Apache Tika under the hood for file parsing. Hence, all file formats
supported out-of-the-box by Apache Tika are supported - including media files.
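As a quick illustration of the underlying approach, here is a minimal sketch using Tika's facade API (the file path is a placeholder, and Project Matt's actual parsing pipeline may differ):

```java
import java.io.File;
import org.apache.tika.Tika;

public class TikaExtract {
    public static void main(String[] args) throws Exception {
        Tika tika = new Tika();
        File file = new File("sample.pdf"); // placeholder path

        // Tika auto-detects the MIME type, then applies the matching parser
        System.out.println("Detected type: " + tika.detect(file));
        System.out.println(tika.parseToString(file));
    }
}
```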
Update: Reading Parquet files is now supported.
All compression formats supported by Apache Tika are handled as well.
An AWS CloudFormation template that deploys the JAR as an AWS Batch job is available.
NOTE: You can only scan S3 buckets in the same region where your template is deployed.
- Elasticsearch cluster with HTTPS enabled: used to store scan reports
- Kibana: for dashboards and visualizations
- Redis: used to maintain application state; keeps track of the last scanned files and other application metadata
You will need to set some environment variables, which are configured via the CloudFormation template. They include:
- `ES_HOST`: Elasticsearch host URL (uses HTTPS client)
- `ES_USERNAME`: Elasticsearch username (if HTTP auth is enabled)
- `ES_PASSWD`: Elasticsearch password (if HTTP auth is enabled)
- `MY_S3_BUCKET`: AWS S3 bucket to scan
- `MY_S3_PREFIX`: AWS S3 prefix to scan (must be in the S3 bucket)
- `REDIS_HOST`: Redis host URL
- `REDIS_USERNAME`: Redis username (if auth is enabled)
- `REDIS_PASSWD`: Redis password (if auth is enabled)
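As a sketch of how these variables might be consumed inside the app (the `require` helper is hypothetical, not the project's actual configuration code):

```java
public class Config {
    // Hypothetical helper: fail fast when a required variable is missing
    static String require(String name) {
        String value = System.getenv(name);
        if (value == null || value.isEmpty()) {
            throw new IllegalStateException("Missing environment variable: " + name);
        }
        return value;
    }

    public static void main(String[] args) {
        String esHost = require("ES_HOST");
        String bucket = require("MY_S3_BUCKET");
        // Optional values (auth may be disabled) simply fall back to null
        String esUser = System.getenv("ES_USERNAME");
        System.out.println("Scanning " + bucket + ", reporting to " + esHost);
    }
}
```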
By default, the maximum number of AWS S3 objects scanned per job is set to 2000, which we consider a sensible default for performance.
Project Matt only performs S3 GET requests; at roughly $0.0004 per 1,000 GET requests, a default job of 2,000 objects costs about $0.0008 in S3 charges per execution. For AWS Batch, you pay only for the instance type you select when deploying the template; by default, the template also uses spot instances to save cost.
The cost estimate does not cover supporting infrastructure such as the Elasticsearch and Redis instances.
Usage is licensed under the MIT License.