[DONT AUTOMERGE]: Generate Docs for 0.7.7 (#451)
* Update: updated the documentation for 0.7.6

* Fix: made a small fix in the README

* Update: updated index.html with latest version and README.md with more instruction details

* Resolved README issues

* updated readme

* Empty Commit

* Generated the 0.7.7 docs
micdavis authored and JGSweets committed Oct 4, 2022
1 parent 9599de1 commit dc4c415
Showing 254 changed files with 63,780 additions and 1 deletion.
Binary file added docs/0.7.7/doctrees/API.doctree
Binary file added docs/0.7.7/doctrees/data_labeling.doctree
Binary file added docs/0.7.7/doctrees/data_reader.doctree
Binary file added docs/0.7.7/doctrees/data_readers.doctree
Binary file added docs/0.7.7/doctrees/dataprofiler.doctree
Binary file added docs/0.7.7/doctrees/dataprofiler.labelers.doctree
Binary file added docs/0.7.7/doctrees/dataprofiler.reports.doctree
Binary file added docs/0.7.7/doctrees/dataprofiler.version.doctree
Binary file added docs/0.7.7/doctrees/environment.pickle
Binary file added docs/0.7.7/doctrees/examples.doctree
Binary file added docs/0.7.7/doctrees/graphs.doctree
Binary file added docs/0.7.7/doctrees/index.doctree
Binary file added docs/0.7.7/doctrees/install.doctree
Binary file added docs/0.7.7/doctrees/labeler.doctree
Binary file added docs/0.7.7/doctrees/modules.doctree
438 changes: 438 additions & 0 deletions docs/0.7.7/doctrees/nbsphinx/add_new_model_to_data_labeler.ipynb


641 changes: 641 additions & 0 deletions docs/0.7.7/doctrees/nbsphinx/data_reader.ipynb


650 changes: 650 additions & 0 deletions docs/0.7.7/doctrees/nbsphinx/labeler.ipynb


470 changes: 470 additions & 0 deletions docs/0.7.7/doctrees/nbsphinx/overview.ipynb


577 changes: 577 additions & 0 deletions docs/0.7.7/doctrees/nbsphinx/profiler_example.ipynb


444 changes: 444 additions & 0 deletions docs/0.7.7/doctrees/nbsphinx/regex_labeler_from_scratch.ipynb


405 changes: 405 additions & 0 deletions docs/0.7.7/doctrees/nbsphinx/unstructured_profiler_example.ipynb


Binary file added docs/0.7.7/doctrees/overview.doctree
Binary file added docs/0.7.7/doctrees/profiler.doctree
Binary file added docs/0.7.7/doctrees/profiler_example.doctree
4 changes: 4 additions & 0 deletions docs/0.7.7/html/.buildinfo
@@ -0,0 +1,4 @@
# Sphinx build info version 1
# This file hashes the configuration used when building these files. When it is not found, a full rebuild will be done.
config: 069dd74c21791bbcf1aac658337c995c
tags: 645f666f9bcd5a90fca523b33c5a78b7
284 changes: 284 additions & 0 deletions docs/0.7.7/html/API.html


Binary file added docs/0.7.7/html/_images/DL-Flowchart.png
Binary file added docs/0.7.7/html/_images/histogram_example_0.png
Binary file added docs/0.7.7/html/_images/histogram_example_1.png
Binary file added docs/0.7.7/html/_images/histogram_example_2.png
16 changes: 16 additions & 0 deletions docs/0.7.7/html/_sources/API.rst.txt
@@ -0,0 +1,16 @@
.. _API:

API
***

The API is split into 4 main components: Profilers, Labelers, Data Readers, and
Validators.

.. toctree::
    :maxdepth: 1
    :caption: Contents:

    dataprofiler.data_readers
    dataprofiler.profilers
    dataprofiler.labelers
    dataprofiler.validators
3 changes: 3 additions & 0 deletions docs/0.7.7/html/_sources/add_new_model_to_data_labeler.nblink.txt
@@ -0,0 +1,3 @@
{
    "path": "../../feature_branch/examples/add_new_model_to_data_labeler.ipynb"
}
365 changes: 365 additions & 0 deletions docs/0.7.7/html/_sources/data_labeling.rst.txt
@@ -0,0 +1,365 @@
.. _data_labeling:

Labeler (Sensitive Data)
************************

In this library, the term *data labeling* refers to entity recognition.

Built into the data profiler is a classifier which evaluates the complex data types of the dataset.
For structured data, it determines the complex data type of each column. When
running the data profile, the default data labeling model built into the
library is used. However, the data labeler also allows users to train their
own data labeler.

*Data Labels* are determined per cell for structured data (column/row when
the *profiler* is used) or at the character level for unstructured data. The
default labels are listed below, followed by a snippet for inspecting the
label set of a loaded labeler.

* UNKNOWN
* ADDRESS
* BAN (bank account number, 10-18 digits)
* CREDIT_CARD
* EMAIL_ADDRESS
* UUID
* HASH_OR_KEY (md5, sha1, sha256, random hash, etc.)
* IPV4
* IPV6
* MAC_ADDRESS
* PERSON
* PHONE_NUMBER
* SSN
* URL
* US_STATE
* DRIVERS_LICENSE
* DATE
* TIME
* DATETIME
* INTEGER
* FLOAT
* QUANTITY
* ORDINAL
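
For a quick check of what a loaded labeler can detect, its label set can be
printed directly. This is a minimal sketch which assumes the `labels` and
`label_mapping` attributes of the default structured labeler:

.. code-block:: python

    import dataprofiler as dp

    # load the default structured labeler and inspect its label set
    data_labeler = dp.DataLabeler(labeler_type='structured')
    print(data_labeler.labels)         # list of label names
    print(data_labeler.label_mapping)  # mapping of label -> model output index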


Identify Entities in Structured Data
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Make predictions and identify labels:

.. code-block:: python

    import dataprofiler as dp

    # load data and data labeler
    data = dp.Data("your_data.csv")
    data_labeler = dp.DataLabeler(labeler_type='structured')

    # make predictions and get labels per cell
    predictions = data_labeler.predict(data)

Identify Entities in Unstructured Data
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Predict which class characters belong to in unstructured text:

.. code-block:: python

    import dataprofiler as dp

    data_labeler = dp.DataLabeler(labeler_type='unstructured')

    # Example sample string; must be in an array (multiple samples can be passed)
    sample = ["Help\tJohn Macklemore\tneeds\tfood.\tPlease\tCall\t555-301-1234."
              "\tHis\tssn\tis\tnot\t334-97-1234. I'm a BAN: 000043219499392912.\n"]

    # Predict which class each character belongs to
    model_predictions = data_labeler.predict(
        sample, predict_options=dict(show_confidences=True))

    # Predictions / confidences are at the character level
    final_results = model_predictions["pred"]
    final_confidences = model_predictions["conf"]

It's also possible to change the output format, for example to one similar to
the **SpaCy** format:

.. code-block:: python

    import dataprofiler as dp

    data_labeler = dp.DataLabeler(labeler_type='unstructured', trainable=True)

    # Example sample string; must be in an array (multiple samples can be passed)
    sample = ["Help\tJohn Macklemore\tneeds\tfood.\tPlease\tCall\t555-301-1234."
              "\tHis\tssn\tis\tnot\t334-97-1234. I'm a BAN: 000043219499392912.\n"]

    # Set the output to the NER format (start position, end position, label)
    data_labeler.set_params(
        {'postprocessor': {'output_format': 'ner', 'use_word_level_argmax': True}}
    )

    results = data_labeler.predict(sample)
    print(results)

Train a New Data Labeler
~~~~~~~~~~~~~~~~~~~~~~~~

Mechanism for training your own data labeler on your own set of structured
(tabular) data:

.. code-block:: python

    import dataprofiler as dp

    # Will need one column with a default label of UNKNOWN
    data = dp.Data("your_file.csv")

    data_labeler = dp.train_structured_labeler(
        data=data,
        save_dirpath="/path/to/save/labeler",
        epochs=2
    )

    data_labeler.save_to_disk("my/save/path")  # Saves the data labeler for reuse

Load an Existing Data Labeler
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Mechanism for loading an existing data labeler:

.. code-block:: python

    import dataprofiler as dp

    data_labeler = dp.DataLabeler(
        labeler_type='structured', dirpath="/path/to/my/labeler")

    # get information about the parameters/inputs/output formats for the DataLabeler
    data_labeler.help()

Extending a Data Labeler with Transfer Learning
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Extending or changing the labels of a data labeler with transfer learning:

Note: By default, **a loaded labeler will not be trainable**. In order to load a
trainable DataLabeler, the user must set `trainable=True` or load a labeler
using the `TrainableDataLabeler` class.

The following illustrates how to change the labels:

.. code-block:: python

    import dataprofiler as dp

    labels = ['label1', 'label2', ...]  # new label set can also be an encoding dict
    data = dp.Data("your_file.csv")     # contains data with new labels

    # load default structured Data Labeler w/ trainable set to True
    data_labeler = dp.DataLabeler(labeler_type='structured', trainable=True)

    # this will use transfer learning to retrain the data labeler on your new
    # dataset and labels.
    # NOTE: data must be in an acceptable format for the preprocessor to interpret.
    #       please refer to the preprocessor/model for the expected data format.
    #       Currently, the DataLabeler cannot take in Tabular data, but requires
    #       data to be ingested with two columns [X, y] where X is the samples and
    #       y is the labels.
    model_results = data_labeler.fit(x=data['samples'], y=data['labels'],
                                     validation_split=0.2, epochs=2, labels=labels)

    # final_results, final_confidences are a list of results for each epoch
    epoch_id = 0
    final_results = model_results[epoch_id]["pred"]
    final_confidences = model_results[epoch_id]["conf"]

The following illustrates how to extend the labels:

.. code-block:: python

    import dataprofiler as dp

    new_labels = ['label1', 'label2', ...]
    data = dp.Data("your_file.csv")  # contains data with new labels

    # load default structured Data Labeler w/ trainable set to True
    data_labeler = dp.DataLabeler(labeler_type='structured', trainable=True)

    # this will maintain current labels and model weights, but extend the model's
    # labels
    for label in new_labels:
        data_labeler.add_label(label)

    # NOTE: a user can also add a label which maps to the same index as an
    #       existing label:
    #       data_labeler.add_label(label, same_as='<label_name>')

    # For a trainable model, the user must then train the model to be able to
    # continue using the labeler since the model's graph has likely changed.
    # NOTE: data must be in an acceptable format for the preprocessor to interpret.
    #       please refer to the preprocessor/model for the expected data format.
    #       Currently, the DataLabeler cannot take in Tabular data, but requires
    #       data to be ingested with two columns [X, y] where X is the samples and
    #       y is the labels.
    model_results = data_labeler.fit(x=data['samples'], y=data['labels'],
                                     validation_split=0.2, epochs=2)

    # final_results, final_confidences are a list of results for each epoch
    epoch_id = 0
    final_results = model_results[epoch_id]["pred"]
    final_confidences = model_results[epoch_id]["conf"]

Changing pipeline parameters:

.. code-block:: python

    import dataprofiler as dp

    # load default Data Labeler
    data_labeler = dp.DataLabeler(labeler_type='structured')

    # change parameters of a specific component
    data_labeler.preprocessor.set_params({'param1': 'value1'})

    # change multiple simultaneously
    data_labeler.set_params({
        'preprocessor':  {'param1': 'value1'},
        'model':         {'param2': 'value2'},
        'postprocessor': {'param3': 'value3'}
    })

Build Your Own Data Labeler
===========================

The DataLabeler has 3 main components: preprocessor, model, and postprocessor.
To create your own DataLabeler, each component must either be created from
scratch or reused from an existing labeler.

Given a set of the 3 components, you can construct your own DataLabeler:

.. code-block:: python

    from dataprofiler.labelers.base_data_labeler import BaseDataLabeler, \
                                                        TrainableDataLabeler
    from dataprofiler.labelers.character_level_cnn_model import CharacterLevelCnnModel
    from dataprofiler.labelers.data_processing import \
        StructCharPreprocessor, StructCharPostprocessor

    # load a non-trainable data labeler
    model = CharacterLevelCnnModel(...)
    preprocessor = StructCharPreprocessor(...)
    postprocessor = StructCharPostprocessor(...)

    data_labeler = BaseDataLabeler.load_with_components(
        preprocessor=preprocessor, model=model, postprocessor=postprocessor)

    # check for basic compatibility between the processors and the model
    data_labeler.check_pipeline()

    # load a trainable data labeler
    data_labeler = TrainableDataLabeler.load_with_components(
        preprocessor=preprocessor, model=model, postprocessor=postprocessor)

    # check for basic compatibility between the processors and the model
    data_labeler.check_pipeline()

Specific components of an existing labeler can also be swapped out:

.. code-block:: python

    import dataprofiler as dp
    from dataprofiler.labelers.character_level_cnn_model import \
        CharacterLevelCnnModel
    from dataprofiler.labelers.data_processing import \
        StructCharPreprocessor, StructCharPostprocessor

    model = CharacterLevelCnnModel(...)
    preprocessor = StructCharPreprocessor(...)
    postprocessor = StructCharPostprocessor(...)

    data_labeler = dp.DataLabeler(labeler_type='structured')
    data_labeler.set_preprocessor(preprocessor)
    data_labeler.set_model(model)
    data_labeler.set_postprocessor(postprocessor)

    # check for basic compatibility between the processors and the model
    data_labeler.check_pipeline()

Model Component
~~~~~~~~~~~~~~~

In order to create your own model component for data labeling, you can utilize
the `BaseModel` class from `dataprofiler.labelers.base_model` and override the
abstract class methods.

Reviewing `CharacterLevelCnnModel` from
`dataprofiler.labelers.character_level_cnn_model` illustrates the functions
which need an override; a minimal skeleton follows the list below.

#. `__init__`: specifying default parameters and calling base `__init__`
#. `_validate_parameters`: validating parameters given by user during setting
#. `_need_to_reconstruct_model`: flag for when to reconstruct a model (i.e.
   parameters change or labels change require a model reconstruction)
#. `_construct_model`: initial construction of the model given the parameters
#. `_reconstruct_model`: updates model architecture for new label set while
   maintaining current model weights
#. `fit`: mechanism for the model to learn given training data
#. `predict`: mechanism for model to make predictions on data
#. `details`: prints a summary of the model construction
#. `save_to_disk`: saves model and model parameters to disk
#. `load_from_disk`: loads model given a path on disk
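
As a rough sketch, the overrides above might be laid out as follows. The
signatures here are illustrative approximations rather than the exact abstract
interface, so consult `BaseModel` before implementing; `max_length` is a
hypothetical parameter used purely for illustration.

.. code-block:: python

    from dataprofiler.labelers.base_model import BaseModel

    class MyCustomModel(BaseModel):
        """Illustrative skeleton of a custom model component."""

        def __init__(self, label_mapping=None, parameters=None):
            # specify defaults, then call the base __init__
            parameters = parameters if parameters is not None else {}
            parameters.setdefault('max_length', 3400)  # hypothetical parameter
            super().__init__(label_mapping, parameters)

        def _validate_parameters(self, parameters):
            # raise a ValueError describing any invalid user-supplied parameter
            ...

        def _need_to_reconstruct_model(self):
            # return True when parameter or label changes require reconstruction
            ...

        def _construct_model(self):
            # build the initial model from the stored parameters
            ...

        def _reconstruct_model(self):
            # update the architecture for a new label set, keeping current weights
            ...

        def fit(self, train_data, val_data=None, **kwargs):
            # mechanism for the model to learn given training data
            ...

        def predict(self, data, **kwargs):
            # return predictions (and optionally confidences) for the data
            ...

        def details(self):
            # print a summary of the model construction
            ...

        def save_to_disk(self, dirpath):
            # save the model and its parameters to disk
            ...

        @classmethod
        def load_from_disk(cls, dirpath):
            # load a model given a path on disk
            ...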


Preprocessor Component
~~~~~~~~~~~~~~~~~~~~~~

In order to create your own preprocessor component for data labeling, you can
utilize the `BaseDataPreprocessor` class
from `dataprofiler.labelers.data_processing` and override the abstract class
methods.

Reviewing `StructCharPreprocessor` from
`dataprofiler.labelers.data_processing` illustrates the functions which
need an override; a skeleton follows the list below.

#. `__init__`: passing parameters to the base class and executing any
   extraneous calculations to be saved as parameters
#. `_validate_parameters`: validating parameters given by user during
   setting
#. `process`: takes in the user data and converts it into a digestible,
   iterable format for the model
#. `set_params` (optional): if a parameter requires processing before setting,
   a user can override this function to assist with setting the parameter
#. `_save_processor` (optional): if a parameter is not JSON serializable, a
   user can override this function to assist in saving the processor and its
   parameters
#. `load_from_disk` (optional): if any parameters are not JSON serializable, a
   user can override this function to assist in loading the processor
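
A skeleton in the same illustrative style (again, the signatures are
approximations; consult `BaseDataPreprocessor` for the exact interface):

.. code-block:: python

    from dataprofiler.labelers.data_processing import BaseDataPreprocessor

    class MyPreprocessor(BaseDataPreprocessor):
        """Illustrative skeleton of a custom preprocessor component."""

        def __init__(self, **parameters):
            # execute any extraneous calculations, then defer to the base class
            super().__init__(**parameters)

        def _validate_parameters(self, parameters):
            # raise a ValueError describing any invalid user-supplied parameter
            ...

        def process(self, data, labels=None, label_mapping=None, batch_size=32):
            # convert user data (and labels, when training) into a digestible,
            # iterable format for the model
            ...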

Postprocessor Component
~~~~~~~~~~~~~~~~~~~~~~~

The postprocessor is nearly identical to the preprocessor, except that it
processes the output of the model before it is returned to the user. In order
to create your own postprocessor component for data labeling, you can utilize
the `BaseDataPostprocessor` class from `dataprofiler.labelers.data_processing`
and override the abstract class methods.

Reviewing `StructCharPostprocessor` from
`dataprofiler.labelers.data_processing` illustrates the functions which
need an override; a skeleton follows the list below.

#. `__init__`: passing parameters to the base class and executing any
   extraneous calculations to be saved as parameters
#. `_validate_parameters`: validating parameters given by user during
   setting
#. `process`: takes in the output of the model and processes it for output to
   the user
#. `set_params` (optional): if a parameter requires processing before setting,
   a user can override this function to assist with setting the parameter
#. `_save_processor` (optional): if a parameter is not JSON serializable, a
   user can override this function to assist in saving the processor and its
   parameters
#. `load_from_disk` (optional): if any parameters are not JSON serializable, a
   user can override this function to assist in loading the processor
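
A matching illustrative skeleton for the postprocessor (signatures are again
approximations; consult `BaseDataPostprocessor` for the exact interface):

.. code-block:: python

    from dataprofiler.labelers.data_processing import BaseDataPostprocessor

    class MyPostprocessor(BaseDataPostprocessor):
        """Illustrative skeleton of a custom postprocessor component."""

        def __init__(self, **parameters):
            # execute any extraneous calculations, then defer to the base class
            super().__init__(**parameters)

        def _validate_parameters(self, parameters):
            # raise a ValueError describing any invalid user-supplied parameter
            ...

        def process(self, data, results, label_mapping):
            # reshape the raw model output (e.g. character-level predictions)
            # into the format returned to the user
            ...
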
3 changes: 3 additions & 0 deletions docs/0.7.7/html/_sources/data_reader.nblink.txt
@@ -0,0 +1,3 @@
{
    "path": "../../feature_branch/examples/data_readers.ipynb"
}