capitalone · JGSweets · Jun 8, 2021 · Jun 1, 2021 · Jun 2, 2021 · Jun 7, 2021
@@ -272,7 +272,7 @@
    "id": "e2285f19-9b34-4484-beaa-79df890b2825",
    "metadata": {},
    "source": [
-    "# A deeper dive into `CSVData`\n",
+    "## A deeper dive into `CSVData`\n",
     "\n",
     "The rest of this tutorial will focus on how to use the data reader class: `CSVData`. The `CSVData` class is used for reading delimited data. Delimited data are datasets which have their columns specified by a specific character, commonly the `,`. E.g. from the `diamonds.csv` dataset:\n",
     "```\n",
@@ -599,9 +599,9 @@
  ],
  "metadata": {
   "kernelspec": {
-   "display_name": "data_profiler",
+   "display_name": "Python 3",
    "language": "python",
-   "name": "data_profiler"
+   "name": "python3"
   },
   "language_info": {
    "codemirror_mode": {

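The `CSVData` markdown cell in the notebook above describes delimited data as columns separated by a specific character, commonly `,`. A minimal sketch of that idea, using only Python's standard `csv` module rather than `CSVData` itself, might look like the following. The helper `read_delimited` and the sample rows are hypothetical and are not taken from `diamonds.csv` or from the DataProfiler API.

```python
import csv
import io


def read_delimited(text, delimiter=","):
    """Parse delimited text into a header list and a list of row dicts.

    Hypothetical helper for illustration; it is not part of DataProfiler.
    """
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    header = next(reader)  # assumes the first line holds the column names
    return header, [dict(zip(header, row)) for row in reader]


# Made-up sample data: the same record, separated by "," and by "|".
comma_text = "carat,cut,price\n0.23,Ideal,326\n"
pipe_text = "carat|cut|price\n0.23|Ideal|326\n"

header, rows = read_delimited(comma_text)
pipe_header, pipe_rows = read_delimited(pipe_text, delimiter="|")

print(header)
print(rows)
print(pipe_rows == rows)
```

Only the delimiter differs between the two inputs; once it is known, both parse to the same columns and values.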
@@ -218,6 +218,44 @@
     "print(json.dumps(report, indent=4))"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "id": "84a06312",
+   "metadata": {},
+   "source": [
+    "## Structured Profiler vs. Unstructured Profiler"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "4c0ea925",
+   "metadata": {},
+   "source": [
+    "The profiler infers which type of statistics to generate (structured or unstructured) from the input. However, you can also specify the profiler type explicitly. The example below explicitly requests the structured profiler and then the unstructured profiler."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "5f4565d8",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Using the structured profiler\n",
+    "data = dp.Data(os.path.join(data_path, \"csv/aws_honeypot_marx_geo.csv\"))\n",
+    "profile = dp.Profiler(data, profiler_type='structured')\n",
+    "\n",
+    "report = profile.report(report_options={\"output_format\": \"pretty\"})\n",
+    "print(json.dumps(report, indent=4))\n",
+    "\n",
+    "# Using the unstructured profiler\n",
+    "my_dataframe = pd.DataFrame([[\"Sample1\"], [\"Sample2\"], [\"Sample3\"]], columns=[\"Text_Samples\"])\n",
+    "profile = dp.Profiler(my_dataframe, profiler_type='unstructured')\n",
+    "\n",
+    "report = profile.report(report_options={\"output_format\": \"pretty\"})\n",
+    "print(json.dumps(report, indent=4))"
+   ]
+  },
   {
    "cell_type": "markdown",
    "id": "b16648ba",

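The reports printed in these notebooks contain a `global_stats` section (dataset-wide properties) and a `data_stats` section (per-column statistics). As a rough illustration of what such fields mean, here is a standard-library sketch that computes a few comparable values for a tiny in-memory CSV. It is not DataProfiler's implementation: `sketch_report`, `to_floats`, the sample rows, and every key name except `duplicate_row_count` and `column_name` are assumptions made for this example.

```python
import csv
import io
import json
import statistics

# Made-up sample rows: the third row duplicates the first, one cell is empty.
SAMPLE_CSV = (
    "carat,cut,price\n"
    "0.23,Ideal,326\n"
    "0.21,Premium,326\n"
    "0.23,Ideal,326\n"
    "0.29,,334\n"
)


def to_floats(values):
    """Return the values as floats, or None if any value is not numeric."""
    try:
        return [float(value) for value in values]
    except ValueError:
        return None


def sketch_report(csv_text):
    """Build a toy report with dataset-wide and per-column statistics."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, body = rows[0], rows[1:]  # assumes a header row and rectangular data
    cells = [cell for row in body for cell in row]

    global_stats = {
        "row_count": len(body),
        "column_count": len(header),
        # Share of empty cells; treating "" as null is an assumption.
        "null_ratio": cells.count("") / len(cells) if cells else 0.0,
        "duplicate_row_count": len(body) - len({tuple(row) for row in body}),
    }

    data_stats = {}
    for index, name in enumerate(header):
        present = [row[index] for row in body if row[index] != ""]
        stats = {"column_name": name, "null_count": len(body) - len(present)}
        numbers = to_floats(present)
        # Numeric statistics only for fully numeric columns with 2+ values.
        if numbers is not None and len(numbers) >= 2:
            stats["min"] = min(numbers)
            stats["max"] = max(numbers)
            stats["mean"] = statistics.mean(numbers)
            stats["variance"] = statistics.variance(numbers)  # sample variance
        data_stats[name] = stats

    return {"global_stats": global_stats, "data_stats": data_stats}


report = sketch_report(SAMPLE_CSV)
print(json.dumps(report, indent=4))
```

The split mirrors the report layout described in the notebooks: one dictionary for the dataset as a whole and one entry per column, with numeric statistics present only where they make sense.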
@@ -5,7 +5,7 @@
    "id": "f37ca393",
    "metadata": {},
    "source": [
-    "# DataProfiler - Profilers"
+    "# Structured Profilers"
    ]
   },
   {
@@ -25,7 +25,9 @@
     "* Parquet\n",
     "* Text files\n",
     "\n",
-    "Once the data is loaded, the Profiler can calculate statistics and predict the entities (via the Labeler) of every column (csv) or key-value (JSON) store as well as dataset wide information, such as the number of nulls, duplicates, etc."
+    "Once the data is loaded, the Profiler can calculate statistics and predict the entities (via the Labeler) of every column (csv) or key-value (JSON) store as well as dataset wide information, such as the number of nulls, duplicates, etc.\n",
+    "\n",
+    "This example looks specifically at structured data types for structured profiling."
    ]
   },
   {
@@ -50,9 +52,9 @@
     "* **Serializable**: Output is json serializable and not prettified\n",
     "* **Flat**: Nested Output is returned as a flattened dictionary\n",
     "\n",
-    "The **Pretty** and **Compact** reports are the two most commonly used reports and includes `global_stats` and `data_stats` for the given dataset. \n",
+    "The **Pretty** and **Compact** reports are the two most commonly used reports and include `global_stats` and `data_stats` for the given dataset. `global_stats` contains overall properties of the data, such as the number of rows and columns, null ratio, and duplicate ratio. `data_stats` contains specific properties and statistics for each column, such as min, max, mean, and variance.\n",
     "\n",
-    "`global_stats` contains overall properties of the data such as number of rows/columns, null ratio, duplicate ratio:\n",
+    "For structured profiles, the report looks like this:\n",
     "\n",
     "```\n",
     "\"global_stats\": {\n",
@@ -65,13 +67,7 @@
     "    \"duplicate_row_count\": int,\n",
     "    \"file_type\": string,\n",
     "    \"encoding\": string,\n",
-    "}\n",
-    "```\n",
-    "\n",
-    "`data_stats` contains specific properties and statistics for each column such as min, max, mean, variance, etc.\n",
-    "\n",
-    "\n",
-    "```\n",
+    "},\n",
     "\"data_stats\": {\n",
     "    <column name>: {\n",
     "        \"column_name\": string,\n",
@@ -181,6 +177,37 @@
     "print(json.dumps(report, indent=4))"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "id": "241f6e3e",
+   "metadata": {},
+   "source": [
+    "# Profiler Type"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "5b20879b",
+   "metadata": {},
+   "source": [
+    "The profiler infers which type of statistics to generate (structured or unstructured) from the input. However, you can also specify the profiler type explicitly. The example below explicitly requests the structured profiler."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "dc44eb47",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "data = dp.Data(os.path.join(data_path, \"csv/aws_honeypot_marx_geo.csv\"))\n",
+    "profile = dp.Profiler(data, profiler_type='structured')\n",
+    "\n",
+    "# print the report using json to prettify.\n",
+    "report = profile.report(report_options={\"output_format\": \"pretty\"})\n",
+    "print(json.dumps(report, indent=4))"
+   ]
+  },
   {
    "cell_type": "markdown",
    "id": "fe02ad64",
@@ -415,7 +442,7 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
-   "version": "3.6.9"
+   "version": "3.8.7"
   }
  },
  "nbformat": 4,