V0.6.0 #330

Merged 7 commits on Jul 13, 2021
Binary file added docs/0.6.0/doctrees/API.doctree
Binary file added docs/0.6.0/doctrees/data_labeling.doctree
Binary file added docs/0.6.0/doctrees/data_reader.doctree
Binary file added docs/0.6.0/doctrees/data_readers.doctree
Binary file added docs/0.6.0/doctrees/dataprofiler.doctree
Binary file added docs/0.6.0/doctrees/dataprofiler.labelers.doctree
Binary file added docs/0.6.0/doctrees/dataprofiler.settings.doctree
Binary file added docs/0.6.0/doctrees/dataprofiler.version.doctree
Binary file added docs/0.6.0/doctrees/environment.pickle
Binary file added docs/0.6.0/doctrees/examples.doctree
Binary file added docs/0.6.0/doctrees/index.doctree
Binary file added docs/0.6.0/doctrees/install.doctree
Binary file added docs/0.6.0/doctrees/labeler.doctree
Binary file added docs/0.6.0/doctrees/modules.doctree
Additional binary files added (not shown).
438 changes: 438 additions & 0 deletions docs/0.6.0/doctrees/nbsphinx/add_new_model_to_data_labeler.ipynb

Large diffs are not rendered by default.

621 changes: 621 additions & 0 deletions docs/0.6.0/doctrees/nbsphinx/data_reader.ipynb

Large diffs are not rendered by default.

622 changes: 622 additions & 0 deletions docs/0.6.0/doctrees/nbsphinx/labeler.ipynb

Large diffs are not rendered by default.

463 changes: 463 additions & 0 deletions docs/0.6.0/doctrees/nbsphinx/overview.ipynb

Large diffs are not rendered by default.

451 changes: 451 additions & 0 deletions docs/0.6.0/doctrees/nbsphinx/profiler_example.ipynb

Large diffs are not rendered by default.

388 changes: 388 additions & 0 deletions docs/0.6.0/doctrees/nbsphinx/unstructured_profiler_example.ipynb
@@ -0,0 +1,388 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "f37ca393",
"metadata": {},
"source": [
"# Unstructured Profilers"
]
},
{
"cell_type": "markdown",
"id": "ff9bd095",
"metadata": {},
"source": [
"**Data profiling** - *is the process of examining a dataset and collecting statistical or informational summaries about said dataset.*\n",
"\n",
"The Profiler class inside the DataProfiler is designed to generate *data profiles* via the Profiler class, which ingests either a Data class or a Pandas DataFrame. \n",
"\n",
"Currently, the Data class supports loading the following file formats:\n",
"\n",
"* Any delimited (CSV, TSV, etc.)\n",
"* JSON object\n",
"* Avro\n",
"* Parquet\n",
"* Text files\n",
"* Pandas Series/Dataframe\n",
"\n",
"Once the data is loaded, the Profiler can calculate statistics and predict the entities (via the Labeler) of every column (csv) or key-value (JSON) store as well as dataset wide information, such as the number of nulls, duplicates, etc.\n",
"\n",
"This example will look at specifically the unstructured data types for unstructured profiling. This means that only text files, lists of strings, single column pandas dataframes/series, or DataProfile Data objects in string format will work with the unstructured profiler. "
]
},
{
"cell_type": "markdown",
"id": "de58b9c4",
"metadata": {},
"source": [
"## Reporting"
]
},
{
"cell_type": "markdown",
"id": "8001185a",
"metadata": {},
"source": [
"One of the primary purposes of the Profiler are to quickly identify what is in the dataset. This can be useful for analyzing a dataset prior to use or determining which columns could be useful for a given purpose.\n",
"\n",
"In terms of reporting, there are multiple reporting options:\n",
"\n",
"* **Pretty**: Floats are rounded to four decimal places, and lists are shortened.\n",
"* **Compact**: Similar to pretty, but removes detailed statistics\n",
"* **Serializable**: Output is json serializable and not prettified\n",
"* **Flat**: Nested Output is returned as a flattened dictionary\n",
"\n",
"The **Pretty** and **Compact** reports are the two most commonly used reports and includes `global_stats` and `data_stats` for the given dataset. `global_stats` contains overall properties of the data such as samples used and file encoding. `data_stats` contains specific properties and statistics for each text sample.\n",
"\n",
"For unstructured profiles, the report looks like this:\n",
"\n",
"```\n",
"\"global_stats\": {\n",
" \"samples_used\": int,\n",
" \"empty_line_count\": int,\n",
" \"file_type\": string,\n",
" \"encoding\": string\n",
"},\n",
"\"data_stats\": {\n",
" \"data_label\": {\n",
" \"entity_counts\": {\n",
" \"word_level\": dict(int),\n",
" \"true_char_level\": dict(int),\n",
" \"postprocess_char_level\": dict(int)\n",
" },\n",
" \"times\": dict(float)\n",
" },\n",
" \"statistics\": {\n",
" \"vocab\": list(char),\n",
" \"words\": list(string),\n",
" \"word_count\": dict(int),\n",
" \"times\": dict(float)\n",
" }\n",
"}\n",
"```"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5fcb5447",
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"import sys\n",
"import json\n",
"sys.path.insert(0, '..')\n",
"import dataprofiler as dp\n",
"\n",
"data_path = \"../dataprofiler/tests/data\""
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a7fc2df6",
"metadata": {},
"outputs": [],
"source": [
"data = dp.Data(os.path.join(data_path, \"txt/discussion_reddit.txt\"))\n",
"profile = dp.Profiler(data)\n",
"\n",
"report = profile.report(report_options={\"output_format\": \"pretty\"})\n",
"print(json.dumps(report, indent=4))"
]
},
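{
"cell_type": "markdown",
"id": "3b9d1c2e",
"metadata": {},
"source": [
"The report is a plain Python dictionary, so individual fields can also be pulled out directly. The lookups below are a minimal sketch that assumes the report follows the unstructured schema shown above (`global_stats` and `data_stats` keys); adjust the keys if your report differs."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5e8f0a17",
"metadata": {},
"outputs": [],
"source": [
"# Minimal sketch: access individual fields using the keys from the schema above\n",
"print(report[\"global_stats\"][\"samples_used\"])\n",
"print(report[\"data_stats\"][\"statistics\"][\"word_count\"])"
]
},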
{
"cell_type": "markdown",
"id": "4d183992",
"metadata": {},
"source": [
"## Profiler Type"
]
},
{
"cell_type": "markdown",
"id": "d7ec39d2",
"metadata": {},
"source": [
"It should be noted, in addition to reading the input data from text files, DataProfiler allows the input data as a pandas dataframe, a pandas series, a list, and Data objects (when an unstructured format is selected) if the Profiler is explicitly chosen as unstructured."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "29737f25",
"metadata": {},
"outputs": [],
"source": [
"# run data profiler and get the report\n",
"import pandas as pd\n",
"data = dp.Data(os.path.join(data_path, \"csv/SchoolDataSmall.csv\"), options={\"data_format\": \"records\"})\n",
"profile = dp.Profiler(data, profiler_type='unstructured')\n",
"\n",
"report = profile.report(report_options={\"output_format\":\"pretty\"})\n",
"print(json.dumps(report, indent=4))"
]
},
{
"cell_type": "markdown",
"id": "fe02ad64",
"metadata": {},
"source": [
"## Profiler options"
]
},
{
"cell_type": "markdown",
"id": "40804cc9",
"metadata": {},
"source": [
"The DataProfiler has the ability to turn on and off components as needed. This is accomplished via the `ProfilerOptions` class.\n",
"\n",
"For example, if a user doesn't require vocab count information they may desire to turn off the word count functionality.\n",
"\n",
"Below, let's remove the vocab count and set the stop words. \n",
"\n",
"Full list of options in the Profiler section of the [DataProfiler documentation](https://capitalone.github.io/DataProfiler)."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9d25d899",
"metadata": {},
"outputs": [],
"source": [
"data = dp.Data(os.path.join(data_path, \"txt/discussion_reddit.txt\"))\n",
"\n",
"profile_options = dp.ProfilerOptions()\n",
"\n",
"# Setting multiple options via set\n",
"profile_options.set({ \"*.vocab.is_enabled\": False, \"*.is_case_sensitive\": True })\n",
"\n",
"# Set options via directly setting them\n",
"profile_options.unstructured_options.text.stop_words = [\"These\", \"are\", \"stop\", \"words\"]\n",
"\n",
"profile = dp.Profiler(data, options=profile_options)\n",
"report = profile.report(report_options={\"output_format\": \"compact\"})\n",
"\n",
"# Print the report\n",
"print(json.dumps(report, indent=4))"
]
},
{
"cell_type": "markdown",
"id": "2052415a",
"metadata": {},
"source": [
"## Updating Profiles"
]
},
{
"cell_type": "markdown",
"id": "7e02f746",
"metadata": {},
"source": [
"Beyond just profiling, one of the unique aspects of the DataProfiler is the ability to update the profiles. To update appropriately, the schema (columns / keys) must match appropriately."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7ab8022f",
"metadata": {},
"outputs": [],
"source": [
"# Load and profile a CSV file\n",
"data = dp.Data(os.path.join(data_path, \"txt/sentence-3x.txt\"))\n",
"profile = dp.Profiler(data)\n",
"\n",
"# Update the profile with new data:\n",
"new_data = dp.Data(os.path.join(data_path, \"txt/sentence-3x.txt\"))\n",
"profile.update_profile(new_data)\n",
"\n",
"# Take a peek at the data\n",
"print(data.data)\n",
"print(new_data.data)\n",
"\n",
"# Report the compact version of the profile\n",
"report = profile.report(report_options={\"output_format\": \"compact\"})\n",
"print(json.dumps(report, indent=4))"
]
},
{
"cell_type": "markdown",
"id": "66ec6dc5",
"metadata": {},
"source": [
"## Merging Profiles"
]
},
{
"cell_type": "markdown",
"id": "e2265fe9",
"metadata": {},
"source": [
"Merging profiles are an alternative method for updating profiles. Particularly, multiple profiles can be generated seperately, then added together with a simple `+` command: `profile3 = profile1 + profile2`"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "cc68ca07",
"metadata": {},
"outputs": [],
"source": [
"# Load a CSV file with a schema\n",
"data1 = dp.Data(os.path.join(data_path, \"txt/sentence-3x.txt\"))\n",
"profile1 = dp.Profiler(data1)\n",
"\n",
"# Load another CSV file with the same schema\n",
"data2 = dp.Data(os.path.join(data_path, \"txt/sentence-3x.txt\"))\n",
"profile2 = dp.Profiler(data2)\n",
"\n",
"# Merge the profiles\n",
"profile3 = profile1 + profile2\n",
"\n",
"# Report the compact version of the profile\n",
"report = profile3.report(report_options={\"output_format\":\"compact\"})\n",
"print(json.dumps(report, indent=4))"
]
},
{
"cell_type": "markdown",
"id": "7ea07dc6",
"metadata": {},
"source": [
"As you can see, the `update_profile` function and the `+` operator function similarly. The reason the `+` operator is important is that it's possible to *save and load profiles*, which we cover next."
]
},
{
"cell_type": "markdown",
"id": "30868000",
"metadata": {},
"source": [
"## Saving and Loading a Profile"
]
},
{
"cell_type": "markdown",
"id": "f2858072",
"metadata": {},
"source": [
"Not only can the Profiler create and update profiles, it's also possible to save, load then manipulate profiles."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2ad9ca57",
"metadata": {},
"outputs": [],
"source": [
"# Load data\n",
"data = dp.Data(os.path.join(data_path, \"txt/sentence-3x.txt\"))\n",
"\n",
"# Generate a profile\n",
"profile = dp.Profiler(data)\n",
"\n",
"# Save a profile to disk for later (saves as pickle file)\n",
"profile.save(filepath=\"my_profile.pkl\")\n",
"\n",
"# Load a profile from disk\n",
"loaded_profile = dp.Profiler.load(\"my_profile.pkl\")\n",
"\n",
"# Report the compact version of the profile\n",
"report = profile.report(report_options={\"output_format\":\"compact\"})\n",
"print(json.dumps(report, indent=4))"
]
},
{
"cell_type": "markdown",
"id": "8f9859c2",
"metadata": {},
"source": [
"With the ability to save and load profiles, profiles can be generated via multiple machines then merged. Further, profiles can be stored and later used in applications such as change point detection, synthetic data generation, and more. "
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3571f2d0",
"metadata": {},
"outputs": [],
"source": [
"# Load a multiple files via the Data class\n",
"filenames = [\"txt/sentence-3x.txt\",\n",
" \"txt/sentence.txt\"]\n",
"data_objects = []\n",
"for filename in filenames:\n",
" data_objects.append(dp.Data(os.path.join(data_path, filename)))\n",
"\n",
"print(data_objects)\n",
"# Generate and save profiles\n",
"for i in range(len(data_objects)):\n",
" profile = dp.Profiler(data_objects[i])\n",
" report = profile.report(report_options={\"output_format\":\"compact\"})\n",
" print(json.dumps(report, indent=4))\n",
" profile.save(filepath=\"data-\"+str(i)+\".pkl\")\n",
"\n",
"\n",
"# Load profiles and add them together\n",
"profile = None\n",
"for i in range(len(data_objects)):\n",
" if profile is None:\n",
" profile = dp.Profiler.load(\"data-\"+str(i)+\".pkl\")\n",
" else:\n",
" profile += dp.Profiler.load(\"data-\"+str(i)+\".pkl\")\n",
"\n",
"\n",
"# Report the compact version of the profile\n",
"report = profile.report(report_options={\"output_format\":\"compact\"})\n",
"print(json.dumps(report, indent=4))"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.7"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
Binary file added docs/0.6.0/doctrees/overview.doctree
Binary file not shown.
Binary file added docs/0.6.0/doctrees/profiler.doctree
Binary file not shown.
Binary file added docs/0.6.0/doctrees/profiler_example.doctree
Binary file not shown.
Binary file not shown.
4 changes: 4 additions & 0 deletions docs/0.6.0/html/.buildinfo
@@ -0,0 +1,4 @@
# Sphinx build info version 1
# This file hashes the configuration used when building these files. When it is not found, a full rebuild will be done.
config: 0332489bea7efb50b4338f0ab6b4ff05
tags: 645f666f9bcd5a90fca523b33c5a78b7