Table profile added #168

Merged 35 commits on Mar 20, 2023

Commits (35)

b9e4027
table profile added
yafimvo Feb 27, 2023
0fa3532
lint
yafimvo Feb 27, 2023
eca6957
test fixed
yafimvo Feb 27, 2023
a400a03
lint
yafimvo Feb 27, 2023
7041081
autopolars property added to config
yafimvo Feb 27, 2023
8ab801f
Merge branch 'master' of https://github.com/yafimvo/jupysql into 66_p…
yafimvo Feb 27, 2023
b1ea6e4
Merge branch 'master' of https://github.com/yafimvo/jupysql into 66_p…
yafimvo Feb 27, 2023
9ccd1cc
save report added
yafimvo Feb 27, 2023
17f4d70
Merge branch 'master' of https://github.com/yafimvo/jupysql into 66_p…
yafimvo Mar 5, 2023
9a0dc82
percentile_disc added, schema added, docs updated
yafimvo Mar 5, 2023
56e3d2e
numpy added to setup
yafimvo Mar 5, 2023
896973a
Merge branch 'master' of https://github.com/yafimvo/jupysql into 66_p…
yafimvo Mar 6, 2023
431d2fb
np removed, run_raw added, queries updated, test fixed
yafimvo Mar 7, 2023
ab56ba6
Merge branch 'master' of https://github.com/yafimvo/jupysql into 66_p…
yafimvo Mar 7, 2023
fafa533
test fixed
yafimvo Mar 7, 2023
122f106
config.autolimit check fixed
yafimvo Mar 8, 2023
83b9dd3
integration tests added
yafimvo Mar 8, 2023
4d0f84d
integration tests fixed
yafimvo Mar 8, 2023
105aa3d
lint
yafimvo Mar 8, 2023
8e4aac3
index removed from integration tests
yafimvo Mar 8, 2023
829352d
postgres, mysql and maria excluded from profile test
yafimvo Mar 8, 2023
823cc61
lint
yafimvo Mar 8, 2023
29492a1
postgresql fixed
yafimvo Mar 9, 2023
abeb44a
postgresql nan values fixed
yafimvo Mar 9, 2023
a8517d2
Merge branch 'master' into 66_profile
yafimvo Mar 13, 2023
606b9bb
rebase
yafimvo Mar 14, 2023
ea81d9e
naming changed
yafimvo Mar 14, 2023
a4c5618
Merge branch '66_profile' of https://github.com/yafimvo/jupysql into …
yafimvo Mar 14, 2023
f88a053
Merge branch 'master' into 66_profile
yafimvo Mar 16, 2023
46b6455
rebase
yafimvo Mar 16, 2023
1f5bea0
sqlalchemy downgraded to 1
yafimvo Mar 16, 2023
6f1aaef
Merge branch '66_profile' of https://github.com/yafimvo/jupysql into …
yafimvo Mar 16, 2023
a0398f1
config removed from raw_run
yafimvo Mar 19, 2023
2a4af61
Merge branch 'master' of https://github.com/yafimvo/jupysql into 66_p…
yafimvo Mar 19, 2023
b600218
Merge branch 'master' of https://github.com/yafimvo/jupysql into 66_p…
yafimvo Mar 20, 2023
2 changes: 1 addition & 1 deletion CHANGELOG.md
@@ -1,7 +1,7 @@
# CHANGELOG

## 0.7.0dev

* [Feature] Adds `%sqlcmd profile` (#66)
* [API Change] Deprecates old SQL parametrization: `$var`, `:var`, and `{var}` in favor of `{{var}}`
* [Feature] Adds sql magic test to list of possible magics to test datasets

1 change: 1 addition & 0 deletions doc/_toc.yml
@@ -14,6 +14,7 @@ parts:
      - file: user-guide/tables-columns
      - file: plot-legacy
      - file: user-guide/template
      - file: user-guide/data-profiling

  - caption: Integrations
    chapters:
157 changes: 157 additions & 0 deletions doc/user-guide/data-profiling.md
@@ -0,0 +1,157 @@
---
jupytext:
  text_representation:
    extension: .md
    format_name: myst
    format_version: 0.13
    jupytext_version: 1.14.4
kernelspec:
  display_name: Python 3 (ipykernel)
  language: python
  name: python3
---

# Data profiling

When dealing with a new dataset, practitioners need to build a comprehensive understanding of the data quickly. Exploring and summarizing a dataset to extract valuable insights, however, can be a time-consuming process. `%sqlcmd profile` offers an easy way to generate statistics and descriptive information, enabling practitioners to quickly gain a deeper understanding of the data.

Available statistics:

* The count of non-empty values
* The number of unique values
* The top (most frequent) value
* The frequency of the top value
* The mean, standard deviation, min, and max values
* The 25%, 50%, and 75% percentiles of your data
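
Under the hood, these statistics come from plain SQL aggregates. The sketch below is simplified from the queries this PR adds in `src/sql/inspect.py`; `my_table` and `my_column` are placeholder names, and `stddev_pop`/`percentile_disc` run only on DuckDB:

```sql
-- count, uniqueness, and range for a single column
SELECT MIN(my_column) AS min,
       MAX(my_column) AS max,
       COUNT(DISTINCT my_column) AS unique_count,
       COUNT(my_column) AS count
FROM my_table
WHERE my_column IS NOT NULL;

-- spread: population standard deviation and discrete percentiles (DuckDB only)
SELECT stddev_pop(my_column) AS std,
       percentile_disc(0.25) WITHIN GROUP (ORDER BY my_column) AS p25,
       percentile_disc(0.50) WITHIN GROUP (ORDER BY my_column) AS p50,
       percentile_disc(0.75) WITHIN GROUP (ORDER BY my_column) AS p75
FROM my_table;
```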

## Examples

### DuckDB

In this example, we'll profile a sample dataset of historical NYC taxi data using DuckDB. However, the code used here is compatible with all major databases.

Download the data

```{code-cell} ipython3
from pathlib import Path
from urllib.request import urlretrieve

url = "https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2021-01.parquet"

if not Path("yellow_tripdata_2021-01.parquet").is_file():
    urlretrieve(url, "yellow_tripdata_2021-01.parquet")
```

Setup

```{note}
This example requires duckdb-engine: `pip install duckdb-engine`
```

Load the extension and connect to an in-memory DuckDB database:

```{code-cell} ipython3
%load_ext sql
```

```{code-cell} ipython3
%sql duckdb://
```

Profile the table

```{code-cell} ipython3
%sqlcmd profile --table "yellow_tripdata_2021-01.parquet"
```

### SQLite

We can easily explore large SQLite databases using DuckDB.

```{code-cell} ipython3
:tags: [hide-output]

import urllib.request
from pathlib import Path

if not Path("example.db").is_file():
    url = "https://raw.githubusercontent.com/lerocha/chinook-database/master/ChinookDatabase/DataSources/Chinook_Sqlite.sqlite"  # noqa
    urllib.request.urlretrieve(url, "example.db")
```


```{code-cell} ipython3
:tags: [hide-output]

%%sql duckdb:///
INSTALL 'sqlite_scanner';
LOAD 'sqlite_scanner';
CALL sqlite_attach('example.db');
```

```{code-cell} ipython3
%sqlcmd profile -t track
```

### Saving report as HTML

To save the generated report as an HTML file, use the `--output`/`-o` argument followed by the desired file name.

```{code-cell} ipython3
:tags: [hide-output]

%sqlcmd profile -t track --output my-report.html
```

```{code-cell} ipython3
from IPython.display import HTML
HTML("my-report.html")
```

### Use schemas

To profile a specific table when tables are spread across different schemas, we can use the `--schema`/`-s` argument.

```{code-cell} ipython3
:tags: [hide-output]

import sqlite3

with sqlite3.connect("a.db") as conn:
conn.execute("CREATE TABLE my_numbers (number FLOAT)")
conn.execute("INSERT INTO my_numbers VALUES (1)")
conn.execute("INSERT INTO my_numbers VALUES (2)")
conn.execute("INSERT INTO my_numbers VALUES (3)")
```

```{code-cell} ipython3
:tags: [hide-output]

%%sql
ATTACH DATABASE 'a.db' AS a_schema
```

```{code-cell} ipython3
:tags: [hide-output]

import sqlite3

with sqlite3.connect("b.db") as conn:
conn.execute("CREATE TABLE my_numbers (number FLOAT)")
conn.execute("INSERT INTO my_numbers VALUES (11)")
conn.execute("INSERT INTO my_numbers VALUES (22)")
conn.execute("INSERT INTO my_numbers VALUES (33)")
```

```{code-cell} ipython3
:tags: [hide-output]

%%sql
ATTACH DATABASE 'b.db' AS b_schema
```

Let's profile the `my_numbers` table from `b_schema`:

```{code-cell} ipython3
%sqlcmd profile --table my_numbers --schema b_schema
```
2 changes: 1 addition & 1 deletion setup.py
@@ -24,7 +24,7 @@
"sqlglot",
"jinja2",
"ploomber-core>=0.2.4",
'importlib-metadata;python_version<"3.8"',
'importlib-metadata;python_version<"3.8"'
]

DEV = [
176 changes: 175 additions & 1 deletion src/sql/inspect.py
@@ -1,9 +1,11 @@
from sqlalchemy import inspect
from prettytable import PrettyTable
from ploomber_core.exceptions import modify_exceptions

from sql.connection import Connection
from sql.telemetry import telemetry
import sql.run
import math
from sql.util import convert_to_scientific


def _get_inspector(conn):
@@ -73,6 +75,167 @@ def __init__(self, name, schema, conn=None) -> None:
        self._table_txt = self._table.get_string()


@modify_exceptions
class TableDescription(DatabaseInspection):
    """
    Generates descriptive statistics.

    Descriptive statistics are:

    Count - Number of all non-None values

    Mean - Mean of the values

    Max - Maximum of the values in the object

    Min - Minimum of the values in the object

    STD - Standard deviation of the observations

    25th, 50th and 75th percentiles

    Unique - Number of non-None unique values

    Top - The most frequent value

    Freq - Frequency of the top value

    """

    def __init__(self, table_name, schema=None) -> None:
        if schema:
            table_name = f"{schema}.{table_name}"

        columns = sql.run.raw_run(
            Connection.current, f"SELECT * FROM {table_name} WHERE 1=0"
        ).keys()

        table_stats = {}
        columns_to_include_in_report = set()

        for column in columns:
            table_stats[column] = {}

            # Note: index is a reserved word in sqlite
            try:
                result_col_freq_values = sql.run.raw_run(
                    Connection.current,
                    f"""SELECT DISTINCT {column} as top,
                    COUNT({column}) as frequency FROM {table_name}
                    GROUP BY {column} ORDER BY COUNT({column}) DESC""",
                ).fetchall()

                table_stats[column]["freq"] = result_col_freq_values[0][1]
                table_stats[column]["top"] = result_col_freq_values[0][0]

                columns_to_include_in_report.update(["freq", "top"])

            except Exception:
                pass

            try:
                # get all non-None values: min, max, unique count, and count
                result_value_values = sql.run.raw_run(
                    Connection.current,
                    f"""
                    SELECT MIN({column}) AS min,
                    MAX({column}) AS max,
                    COUNT(DISTINCT {column}) AS unique_count,
                    COUNT({column}) AS count
                    FROM {table_name}
                    WHERE {column} IS NOT NULL
                    """,
                ).fetchall()

                table_stats[column]["min"] = result_value_values[0][0]
                table_stats[column]["max"] = result_value_values[0][1]
                table_stats[column]["unique"] = result_value_values[0][2]
                table_stats[column]["count"] = result_value_values[0][3]

                columns_to_include_in_report.update(["count", "unique", "min", "max"])

            except Exception:
                pass

            try:
                results_avg = sql.run.raw_run(
                    Connection.current,
                    f"""
                    SELECT AVG({column}) AS avg
                    FROM {table_name}
                    WHERE {column} IS NOT NULL
                    """,
                ).fetchall()

                table_stats[column]["mean"] = float(results_avg[0][0])
                columns_to_include_in_report.update(["mean"])

            except Exception:
                table_stats[column]["mean"] = math.nan

            # These keys are numeric and work only on DuckDB
            special_numeric_keys = ["std", "25%", "50%", "75%"]

            try:
                # Note: stddev_pop and percentile_disc work only on DuckDB
                result = sql.run.raw_run(
                    Connection.current,
                    f"""
                    SELECT
                    stddev_pop({column}) as key_std,
                    percentile_disc(0.25) WITHIN GROUP
                    (ORDER BY {column}) as key_25,
                    percentile_disc(0.50) WITHIN GROUP
                    (ORDER BY {column}) as key_50,
                    percentile_disc(0.75) WITHIN GROUP
                    (ORDER BY {column}) as key_75
                    FROM {table_name}
                    """,
                ).fetchall()

                for i, key in enumerate(special_numeric_keys):
                    table_stats[column][key] = float(result[0][i])

                columns_to_include_in_report.update(special_numeric_keys)

            except TypeError:
                # for non-numeric values
                for key in special_numeric_keys:
                    table_stats[column][key] = math.nan

            except Exception as e:
                # We tried to apply a numeric function to a non-numeric
                # value (e.g. a DateTime) and the SQL command/function
                # (e.g. stddev_pop) failed, so we skip these cell stats.
                if (
                    "duckdb.BinderException" in str(e)
                    or "add explicit type casts" in str(e)
                ):
                    for key in special_numeric_keys:
                        table_stats[column][key] = math.nan

        self._table = PrettyTable()
        self._table.field_names = [" "] + list(table_stats.keys())

        rows = list(columns_to_include_in_report)
        rows.sort(reverse=True)
        for row in rows:
            values = [row]
            for column in table_stats:
                if row in table_stats[column]:
                    value = table_stats[column][row]
                else:
                    value = ""
                value = convert_to_scientific(value)
                values.append(value)

            self._table.add_row(values)

        self._table_html = self._table.get_html_string()
        self._table_txt = self._table.get_string()


@telemetry.log_call()
def get_table_names(schema=None):
"""Get table names for a given connection"""
Expand All @@ -83,3 +246,14 @@ def get_table_names(schema=None):
def get_columns(name, schema=None):
"""Get column names for a given connection"""
return Columns(name, schema)


@telemetry.log_call()
def get_table_statistics(name, schema=None):
"""Get table statistics for a given connection.

For all data types the results will include `count`, `mean`, `std`, `min`
`max`, `25`, `50` and `75` percentiles. It will also include `unique`, `top`
and `freq` statistics.
"""
return TableDescription(name, schema=schema)
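
For reference, a minimal usage sketch of the new helper, assuming an active connection (e.g. one opened with `%sql duckdb://`); `_table_txt` and `_table_html` are the internal renderings stored by `TableDescription`, as shown in the diff above:

```python
# hypothetical usage sketch, not part of this PR
from sql.inspect import get_table_statistics

stats = get_table_statistics("track")  # or: get_table_statistics("my_numbers", schema="b_schema")
print(stats._table_txt)  # plain-text PrettyTable; stats._table_html holds the HTML string
```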