Updating profile example to have unstructured #252

Merged
merged 7 commits into from
Jun 8, 2021
Changes from 2 commits
6 changes: 3 additions & 3 deletions examples/data_readers.ipynb
@@ -272,7 +272,7 @@
"id": "e2285f19-9b34-4484-beaa-79df890b2825",
"metadata": {},
"source": [
"# A deeper dive into `CSVData`\n",
"## A deeper dive into `CSVData`\n",
Contributor Author
This has needed to be changed for a while. The auto-documentation gets messed up when the same example contains two top-level (`#`) headings.
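
For anyone who wants to catch this before the docs build, a check along these lines might help. It is only a sketch: it assumes the notebooks are nbformat 4 JSON (as the diffs here suggest), and the helper names `top_level_headings` and `check_notebook` are hypothetical, not part of DataProfiler or of whatever tool generates the documentation.

```python
import json


def top_level_headings(notebook: dict) -> list:
    """Return the text of every level-1 (`# `) heading in the markdown cells."""
    headings = []
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "markdown":
            continue
        source = cell.get("source", "")
        # nbformat 4 stores `source` as either one string or a list of strings.
        text = "".join(source) if isinstance(source, list) else source
        in_fence = False
        for line in text.splitlines():
            # Skip fenced blocks so that `# comment` lines in code are not counted.
            if line.lstrip().startswith("```"):
                in_fence = not in_fence
                continue
            if not in_fence and line.startswith("# "):
                headings.append(line[2:].strip())
    return headings


def check_notebook(path: str) -> list:
    """Load a notebook file (hypothetical helper) and list its level-1 headings."""
    with open(path, encoding="utf-8") as fh:
        return top_level_headings(json.load(fh))
```

If `check_notebook("examples/data_readers.ipynb")` returns more than one entry, the notebook still has a second top-level heading to demote to `##`.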

"\n",
"The rest of this tutorial will focus on how to use the data reader class: `CSVData`. The `CSVData` class is used for reading delimited data. Delimited data are datasets which have their columns specified by a specific character, commonly the `,`. E.g. from the `diamonds.csv` dataset:\n",
"```\n",
@@ -599,9 +599,9 @@
],
"metadata": {
"kernelspec": {
"display_name": "data_profiler",
"display_name": "Python 3",
"language": "python",
"name": "data_profiler"
"name": "python3"
},
"language_info": {
"codemirror_mode": {
38 changes: 38 additions & 0 deletions examples/intro_data_profiler.ipynb
@@ -218,6 +218,44 @@
"print(json.dumps(report, indent=4))"
]
},
{
"cell_type": "markdown",
"id": "84a06312",
"metadata": {},
"source": [
"## Structured Profiler vs. Unstructured Profiler"
]
},
{
"cell_type": "markdown",
"id": "4c0ea925",
"metadata": {},
"source": [
"The profiler will infer what type of statistics to generate (structured or unstructured) based on the input. However, you can also specify the profile type explicitly. Here is an example of the profiler explicitly using the structured profile and the unstructured profile."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5f4565d8",
"metadata": {},
"outputs": [],
"source": [
"# Using the structured profiler\n",
"data = dp.Data(os.path.join(data_path, \"csv/aws_honeypot_marx_geo.csv\"))\n",
"profile = dp.Profiler(data, profiler_type='structured')\n",
"\n",
"report = profile.report(report_options={\"output_format\": \"pretty\"})\n",
"print(json.dumps(report, indent=4))\n",
"\n",
"# Using the unstructured profiler\n",
"my_dataframe = pd.DataFrame([[\"Sample1\"],[\"Sample2\"],[\"Sample3\"]], columns=[\"Text_Samples\"])\n",
"profile = dp.Profiler(my_dataframe, profiler_type='unstructured')\n",
"\n",
"report = profile.report(report_options={\"output_format\":\"pretty\"})\n",
"print(json.dumps(report, indent=4))"
]
},
{
"cell_type": "markdown",
"id": "b16648ba",
51 changes: 39 additions & 12 deletions examples/profilers.ipynb → examples/structured_profilers.ipynb
@@ -5,7 +5,7 @@
"id": "f37ca393",
"metadata": {},
"source": [
"# DataProfiler - Profilers"
"# Structured Profilers"
]
},
{
@@ -25,7 +25,9 @@
"* Parquet\n",
"* Text files\n",
"\n",
"Once the data is loaded, the Profiler can calculate statistics and predict the entities (via the Labeler) of every column (csv) or key-value (JSON) store as well as dataset wide information, such as the number of nulls, duplicates, etc."
"Once the data is loaded, the Profiler can calculate statistics and predict the entities (via the Labeler) of every column (csv) or key-value (JSON) store as well as dataset wide information, such as the number of nulls, duplicates, etc.\n",
"\n",
"This example looks specifically at the structured data types used for structured profiling."
]
},
{
@@ -50,9 +52,9 @@
"* **Serializable**: Output is json serializable and not prettified\n",
"* **Flat**: Nested Output is returned as a flattened dictionary\n",
"\n",
"The **Pretty** and **Compact** reports are the two most commonly used reports and includes `global_stats` and `data_stats` for the given dataset. \n",
"The **Pretty** and **Compact** reports are the two most commonly used reports and include `global_stats` and `data_stats` for the given dataset. `global_stats` contains overall properties of the data such as number of rows/columns, null ratio, and duplicate ratio. `data_stats` contains specific properties and statistics for each column such as min, max, mean, variance, etc.\n",
"\n",
"`global_stats` contains overall properties of the data such as number of rows/columns, null ratio, duplicate ratio:\n",
"For structured profiles, the report looks like this:\n",
"\n",
"```\n",
"\"global_stats\": {\n",
@@ -65,13 +67,7 @@
" \"duplicate_row_count\": int,\n",
" \"file_type\": string,\n",
" \"encoding\": string,\n",
"}\n",
"```\n",
"\n",
"`data_stats` contains specific properties and statistics for each column such as min, max, mean, variance, etc.\n",
"\n",
"\n",
"```\n",
"},\n",
"\"data_stats\": {\n",
" <column name>: {\n",
" \"column_name\": string,\n",
@@ -181,6 +177,37 @@
"print(json.dumps(report, indent=4))"
]
},
{
"cell_type": "markdown",
"id": "241f6e3e",
"metadata": {},
"source": [
"# Profiler Type"
]
},
{
"cell_type": "markdown",
"id": "5b20879b",
"metadata": {},
"source": [
"The profiler will infer what type of statistics to generate (structured or unstructured) based on the input. However, you can also specify the profile type explicitly. Here is an example of the profiler explicitly using the structured profile."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "dc44eb47",
"metadata": {},
"outputs": [],
"source": [
"data = dp.Data(os.path.join(data_path, \"csv/aws_honeypot_marx_geo.csv\"))\n",
"profile = dp.Profiler(data, profiler_type='structured')\n",
"\n",
"# print the report using json to prettify.\n",
"report = profile.report(report_options={\"output_format\": \"pretty\"})\n",
"print(json.dumps(report, indent=4))"
]
},
{
"cell_type": "markdown",
"id": "fe02ad64",
@@ -415,7 +442,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.9"
"version": "3.8.7"
}
},
"nbformat": 4,