Data Linter

Summary

This code accompanies the NIPS 2017 ML Systems Workshop paper/poster, "The Data Linter: Lightweight, Automated Sanity Checking for ML Data Sets."

The Data Linter identifies potential issues (lints) in your ML training data.

Using the Data Linter

Prerequisites

You'll need the following installed to use the Data Linter:

Data Linter Demo

The easiest way to see how to use the Data Linter is to follow the demo instructions found in demo/README.md.

Running the Data Linter

Running the Data Linter requires the following steps:

1. Encoding your data in TFRecord format.
2. Generating summary statistics for those data, using Facets.
3. Running the Data Linter.
4. Using the Lint Explorer to produce the lint results.

Creating Data in the TFRecord Format

To see how to convert CSV files to the TFRecord format, look at the example code in demo/convert_to_tfrecord.py.
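
As a rough illustration of what such a conversion involves, the sketch below reads a CSV file with a header row and writes one tf.train.Example per row. It is not the code in demo/convert_to_tfrecord.py. The function names (row_to_typed_features, to_example, convert_csv_to_tfrecord) are hypothetical, and the sketch assumes a TensorFlow version that provides tf.io.TFRecordWriter. Types are guessed per value, so a column that mixes integers and decimals would end up with mixed feature types; a real converter would fix the type per column.

```python
import csv


def row_to_typed_features(row):
    """Map one CSV row (a dict of strings) to {name: (kind, value)}.

    kind is 'int', 'float', or 'bytes', guessed from the text of each value.
    Assumes every row has the same columns as the header.
    """
    typed = {}
    for name, raw in row.items():
        text = (raw or "").strip()
        try:
            typed[name] = ("int", int(text))
            continue
        except ValueError:
            pass
        try:
            typed[name] = ("float", float(text))
            continue
        except ValueError:
            pass
        # Anything that is not numeric is stored as UTF-8 bytes.
        typed[name] = ("bytes", text.encode("utf-8"))
    return typed


def to_example(typed, tf):
    """Build a tf.train.Example from the output of row_to_typed_features."""
    feature = {}
    for name, (kind, value) in typed.items():
        if kind == "int":
            feature[name] = tf.train.Feature(
                int64_list=tf.train.Int64List(value=[value]))
        elif kind == "float":
            feature[name] = tf.train.Feature(
                float_list=tf.train.FloatList(value=[value]))
        else:
            feature[name] = tf.train.Feature(
                bytes_list=tf.train.BytesList(value=[value]))
    return tf.train.Example(features=tf.train.Features(feature=feature))


def convert_csv_to_tfrecord(csv_path, tfrecord_path):
    """Write each CSV row as one serialized Example in a TFRecord file."""
    # Imported here so the helpers above can be used without TensorFlow.
    import tensorflow as tf

    with open(csv_path, newline="") as f, \
            tf.io.TFRecordWriter(tfrecord_path) as writer:
        for row in csv.DictReader(f):
            example = to_example(row_to_typed_features(row), tf)
            writer.write(example.SerializeToString())
```

With this sketch, convert_csv_to_tfrecord("adult.csv", "/tmp/adult.tfrecords") would produce a file at the dataset path used in the demo example; the input file name adult.csv is only a placeholder.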

Summarizing Your Data Using Facets

To see how to generate summary statistics for your data, see the example code in demo/summarize_data.py.

Executing the Data Linter

Once you have both the data and summary statistics, you can run the Data Linter as follows:

python data_linter_main.py --dataset_path PATH_TO_TFRECORDS \
  --stats_path PATH_TO_FACETS_SUMMARIES --results_path PATH_FOR_SAVING_RESULTS

For example, if you follow the instructions in the demo folder, you'll invoke the Data Linter like this:

python data_linter_main.py --dataset_path /tmp/adult.tfrecords \
  --stats_path /tmp/adult_summary.bin \
  --results_path /tmp/datalinter/results/lint_results.bin

Viewing Results with the Lint Explorer

After the Data Linter is done examining your data, you can view the results using this command:

python lint_explorer_main.py --results_path PATH_TO_RESULTS

For example:

python lint_explorer_main.py --results_path \
  /tmp/datalinter/results/lint_results.bin
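
If you run these two steps often, a small wrapper can chain them. The sketch below is one possible way to do that, assuming it is run from the repository root with python on the PATH. The helper names (linter_command, explorer_command, run_pipeline) are hypothetical and not part of this distribution; only the script names and flags come from the commands above.

```python
import subprocess


def linter_command(dataset_path, stats_path, results_path):
    """Return the argument list for the data_linter_main.py invocation."""
    return [
        "python", "data_linter_main.py",
        "--dataset_path", dataset_path,
        "--stats_path", stats_path,
        "--results_path", results_path,
    ]


def explorer_command(results_path):
    """Return the argument list for the lint_explorer_main.py invocation."""
    return ["python", "lint_explorer_main.py", "--results_path", results_path]


def run_pipeline(dataset_path, stats_path, results_path):
    """Run the Data Linter, then show its results with the Lint Explorer.

    check=True raises CalledProcessError if either step exits with a
    non-zero status, so the explorer never runs on missing results.
    """
    subprocess.run(
        linter_command(dataset_path, stats_path, results_path), check=True)
    subprocess.run(explorer_command(results_path), check=True)
```

For the demo paths, this would be called as run_pipeline("/tmp/adult.tfrecords", "/tmp/adult_summary.bin", "/tmp/datalinter/results/lint_results.bin").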

Notes

The code makes use of Google's protobuf format. The protos are defined in protos/.

To make it easier to run the code, we include protobuf definitions from TensorFlow and Facets in this distribution.

Support

This is not an official Google project. This project will not be supported or maintained, and we will not accept any pull requests.

Authors

The Data Linter was created by Nick Hynes (nhynes@berkeley.edu) during an internship at Google with Michael Terry (michaelterry@google.com).
