Table profile added #168

Merged 35 commits on Mar 20, 2023

Commits (35)

b9e4027
table profile added
yafimvo Feb 27, 2023
0fa3532
lint
yafimvo Feb 27, 2023
eca6957
test fixed
yafimvo Feb 27, 2023
a400a03
lint
yafimvo Feb 27, 2023
7041081
autopolars property added to config
yafimvo Feb 27, 2023
8ab801f
Merge branch 'master' of https://github.com/yafimvo/jupysql into 66_p…
yafimvo Feb 27, 2023
b1ea6e4
Merge branch 'master' of https://github.com/yafimvo/jupysql into 66_p…
yafimvo Feb 27, 2023
9ccd1cc
save report added
yafimvo Feb 27, 2023
17f4d70
Merge branch 'master' of https://github.com/yafimvo/jupysql into 66_p…
yafimvo Mar 5, 2023
9a0dc82
percentile_disc added, schema added, docs updated
yafimvo Mar 5, 2023
56e3d2e
numpy added to setup
yafimvo Mar 5, 2023
896973a
Merge branch 'master' of https://github.com/yafimvo/jupysql into 66_p…
yafimvo Mar 6, 2023
431d2fb
np removed, run_raw added, queries updated, test fixed
yafimvo Mar 7, 2023
ab56ba6
Merge branch 'master' of https://github.com/yafimvo/jupysql into 66_p…
yafimvo Mar 7, 2023
fafa533
test fixed
yafimvo Mar 7, 2023
122f106
config.autolimit check fixed
yafimvo Mar 8, 2023
83b9dd3
integration tests added
yafimvo Mar 8, 2023
4d0f84d
integration tests fixed
yafimvo Mar 8, 2023
105aa3d
lint
yafimvo Mar 8, 2023
8e4aac3
index removed from integration tests
yafimvo Mar 8, 2023
829352d
postgres, mysql and maria excluded from profile test
yafimvo Mar 8, 2023
823cc61
lint
yafimvo Mar 8, 2023
29492a1
postgresql fixed
yafimvo Mar 9, 2023
abeb44a
postgresql nan values fixed
yafimvo Mar 9, 2023
a8517d2
Merge branch 'master' into 66_profile
yafimvo Mar 13, 2023
606b9bb
rebase
yafimvo Mar 14, 2023
ea81d9e
naming changed
yafimvo Mar 14, 2023
a4c5618
Merge branch '66_profile' of https://github.com/yafimvo/jupysql into …
yafimvo Mar 14, 2023
f88a053
Merge branch 'master' into 66_profile
yafimvo Mar 16, 2023
46b6455
rebase
yafimvo Mar 16, 2023
1f5bea0
sqlalchemy downgraded to 1
yafimvo Mar 16, 2023
6f1aaef
Merge branch '66_profile' of https://github.com/yafimvo/jupysql into …
yafimvo Mar 16, 2023
a0398f1
config removed from raw_run
yafimvo Mar 19, 2023
2a4af61
Merge branch 'master' of https://github.com/yafimvo/jupysql into 66_p…
yafimvo Mar 19, 2023
b600218
Merge branch 'master' of https://github.com/yafimvo/jupysql into 66_p…
yafimvo Mar 20, 2023
2 changes: 1 addition & 1 deletion CHANGELOG.md
@@ -1,7 +1,7 @@
# CHANGELOG

## 0.7.0dev

* [Feature] Adds `%sqlcmd profile` (#66)
* [API Change] Deprecates old SQL parametrization: `$var`, `:var`, and `{var}` in favor of `{{var}}`
* [Feature] Adds sql magic test to list of possible magics to test datasets

1 change: 1 addition & 0 deletions doc/_toc.yml
@@ -14,6 +14,7 @@ parts:
      - file: user-guide/tables-columns
      - file: plot-legacy
      - file: user-guide/template
      - file: user-guide/data-profiling

  - caption: Integrations
    chapters:
157 changes: 157 additions & 0 deletions doc/user-guide/data-profiling.md
@@ -0,0 +1,157 @@
---
jupytext:
  text_representation:
    extension: .md
    format_name: myst
    format_version: 0.13
    jupytext_version: 1.14.4
kernelspec:
  display_name: Python 3 (ipykernel)
  language: python
  name: python3
---

# Data profiling

When dealing with a new dataset, practitioners need to build a comprehensive understanding of the data quickly. Exploring and summarizing a dataset to extract valuable insights, however, can be a time-consuming process. `%sqlcmd profile` offers an easy way to generate statistics and descriptive information, enabling practitioners to quickly gain a deeper understanding of the data.

Available statistics:

* The count of non-empty values
* The number of unique values
* The top (most frequent) value
* The frequency of the top value
* The mean, standard deviation, min, and max values
* The 25%, 50%, and 75% percentiles of your data
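
Under the hood, these statistics come from plain SQL aggregates. The sketch below is simplified from the queries this PR adds in `src/sql/inspect.py`; `my_table` and `my_column` are placeholder names, and `stddev_pop`/`percentile_disc` run only on DuckDB:

```sql
-- count, uniqueness, and range for a single column
SELECT MIN(my_column) AS min,
       MAX(my_column) AS max,
       COUNT(DISTINCT my_column) AS unique_count,
       COUNT(my_column) AS count
FROM my_table
WHERE my_column IS NOT NULL;

-- spread: population standard deviation and discrete percentiles (DuckDB only)
SELECT stddev_pop(my_column) AS std,
       percentile_disc(0.25) WITHIN GROUP (ORDER BY my_column) AS p25,
       percentile_disc(0.50) WITHIN GROUP (ORDER BY my_column) AS p50,
       percentile_disc(0.75) WITHIN GROUP (ORDER BY my_column) AS p75
FROM my_table;
```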

## Examples

### DuckDB

In this example, we'll profile a sample dataset of historical NYC taxi data using DuckDB. However, the code used here is compatible with all major databases.

Download the data

```{code-cell} ipython3
from pathlib import Path
from urllib.request import urlretrieve

url = "https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2021-01.parquet"

if not Path("yellow_tripdata_2021-01.parquet").is_file():
    urlretrieve(url, "yellow_tripdata_2021-01.parquet")
```

Setup

```{note}
This example requires duckdb-engine: `pip install duckdb-engine`
```

Load the extension and connect to an in-memory DuckDB database:

```{code-cell} ipython3
%load_ext sql
```

```{code-cell} ipython3
%sql duckdb://
```

Profile the table

```{code-cell} ipython3
%sqlcmd profile --table "yellow_tripdata_2021-01.parquet"
```

### SQLite

We can easily explore large SQLite databases using DuckDB.

```{code-cell} ipython3
:tags: [hide-output]

import urllib.request
from pathlib import Path

if not Path("example.db").is_file():
    url = "https://raw.githubusercontent.com/lerocha/chinook-database/master/ChinookDatabase/DataSources/Chinook_Sqlite.sqlite"  # noqa
    urllib.request.urlretrieve(url, "example.db")
```


```{code-cell} ipython3
:tags: [hide-output]

%%sql duckdb:///
INSTALL 'sqlite_scanner';
LOAD 'sqlite_scanner';
CALL sqlite_attach('example.db');
```

```{code-cell} ipython3
%sqlcmd profile -t track
```

### Saving report as HTML

To save the generated report as an HTML file, use the `--output`/`-o` argument followed by the desired file name.

```{code-cell} ipython3
:tags: [hide-output]

%sqlcmd profile -t track --output my-report.html
```

```{code-cell} ipython3
from IPython.display import HTML
HTML("my-report.html")
```

### Use schemas

To profile a specific table when tables are spread across different schemas, we can use the `--schema`/`-s` argument.

```{code-cell} ipython3
:tags: [hide-output]

import sqlite3

with sqlite3.connect("a.db") as conn:
conn.execute("CREATE TABLE my_numbers (number FLOAT)")
conn.execute("INSERT INTO my_numbers VALUES (1)")
conn.execute("INSERT INTO my_numbers VALUES (2)")
conn.execute("INSERT INTO my_numbers VALUES (3)")
```

```{code-cell} ipython3
:tags: [hide-output]

%%sql
ATTACH DATABASE 'a.db' AS a_schema
```

```{code-cell} ipython3
:tags: [hide-output]

import sqlite3

with sqlite3.connect("b.db") as conn:
conn.execute("CREATE TABLE my_numbers (number FLOAT)")
conn.execute("INSERT INTO my_numbers VALUES (11)")
conn.execute("INSERT INTO my_numbers VALUES (22)")
conn.execute("INSERT INTO my_numbers VALUES (33)")
```

```{code-cell} ipython3
:tags: [hide-output]

%%sql
ATTACH DATABASE 'b.db' AS b_schema
```

Let's profile the `my_numbers` table from `b_schema`:

```{code-cell} ipython3
%sqlcmd profile --table my_numbers --schema b_schema
```
2 changes: 1 addition & 1 deletion setup.py
@@ -24,7 +24,7 @@
"sqlglot",
"jinja2",
"ploomber-core>=0.2.4",
'importlib-metadata;python_version<"3.8"',
'importlib-metadata;python_version<"3.8"'
]

DEV = [
176 changes: 175 additions & 1 deletion src/sql/inspect.py
@@ -1,9 +1,11 @@
from sqlalchemy import inspect
from prettytable import PrettyTable
from ploomber_core.exceptions import modify_exceptions

from sql.connection import Connection
from sql.telemetry import telemetry
import sql.run
import math
from sql.util import convert_to_scientific


def _get_inspector(conn):
@@ -73,6 +75,167 @@ def __init__(self, name, schema, conn=None) -> None:
        self._table_txt = self._table.get_string()


@modify_exceptions
class TableDescription(DatabaseInspection):
    """
    Generates descriptive statistics.

    Descriptive statistics are:

    Count - Number of all non-None values

    Mean - Mean of the values

    Max - Maximum of the values in the object

    Min - Minimum of the values in the object

    STD - Standard deviation of the observations

    25th, 50th and 75th percentiles

    Unique - Number of non-None unique values

    Top - The most frequent value

    Freq - Frequency of the top value

    """

    def __init__(self, table_name, schema=None) -> None:
        if schema:
            table_name = f"{schema}.{table_name}"

        columns = sql.run.raw_run(
            Connection.current, f"SELECT * FROM {table_name} WHERE 1=0"
        ).keys()

        table_stats = {}
        columns_to_include_in_report = set()

        for column in columns:
            table_stats[column] = {}

            # Note: index is a reserved word in sqlite
            try:
                result_col_freq_values = sql.run.raw_run(
                    Connection.current,
                    f"""SELECT DISTINCT {column} as top,
                    COUNT({column}) as frequency FROM {table_name}
                    GROUP BY {column} ORDER BY COUNT({column}) DESC""",
                ).fetchall()

                table_stats[column]["freq"] = result_col_freq_values[0][1]
                table_stats[column]["top"] = result_col_freq_values[0][0]

                columns_to_include_in_report.update(["freq", "top"])

            except Exception:
                pass

            try:
                # get all non-None values: min, max, unique count, and count
                result_value_values = sql.run.raw_run(
                    Connection.current,
                    f"""
                    SELECT MIN({column}) AS min,
                    MAX({column}) AS max,
                    COUNT(DISTINCT {column}) AS unique_count,
                    COUNT({column}) AS count
                    FROM {table_name}
                    WHERE {column} IS NOT NULL
                    """,
                ).fetchall()

                table_stats[column]["min"] = result_value_values[0][0]
                table_stats[column]["max"] = result_value_values[0][1]
                table_stats[column]["unique"] = result_value_values[0][2]
                table_stats[column]["count"] = result_value_values[0][3]

                columns_to_include_in_report.update(["count", "unique", "min", "max"])

            except Exception:
                pass

            try:
                results_avg = sql.run.raw_run(
                    Connection.current,
                    f"""
                    SELECT AVG({column}) AS avg
                    FROM {table_name}
                    WHERE {column} IS NOT NULL
                    """,
                ).fetchall()

                table_stats[column]["mean"] = float(results_avg[0][0])
                columns_to_include_in_report.update(["mean"])

            except Exception:
                table_stats[column]["mean"] = math.nan

            # These keys are numeric and work only on DuckDB
            special_numeric_keys = ["std", "25%", "50%", "75%"]

            try:
                # Note: stddev_pop and percentile_disc work only on DuckDB
                result = sql.run.raw_run(
                    Connection.current,
                    f"""
                    SELECT
                    stddev_pop({column}) as key_std,
                    percentile_disc(0.25) WITHIN GROUP
                    (ORDER BY {column}) as key_25,
                    percentile_disc(0.50) WITHIN GROUP
                    (ORDER BY {column}) as key_50,
                    percentile_disc(0.75) WITHIN GROUP
                    (ORDER BY {column}) as key_75
                    FROM {table_name}
                    """,
                ).fetchall()

                for i, key in enumerate(special_numeric_keys):
                    table_stats[column][key] = float(result[0][i])

                columns_to_include_in_report.update(special_numeric_keys)

            except TypeError:
                # for non-numeric values
                for key in special_numeric_keys:
                    table_stats[column][key] = math.nan

            except Exception as e:
                # We tried to apply a numeric function to a non-numeric
                # value (e.g. a DateTime) and the SQL command/function
                # (e.g. stddev_pop) failed, so we skip these cell stats.
                if (
                    "duckdb.BinderException" in str(e)
                    or "add explicit type casts" in str(e)
                ):
                    for key in special_numeric_keys:
                        table_stats[column][key] = math.nan

        self._table = PrettyTable()
        self._table.field_names = [" "] + list(table_stats.keys())

        rows = list(columns_to_include_in_report)
        rows.sort(reverse=True)
        for row in rows:
            values = [row]
            for column in table_stats:
                if row in table_stats[column]:
                    value = table_stats[column][row]
                else:
                    value = ""
                value = convert_to_scientific(value)
                values.append(value)

            self._table.add_row(values)

        self._table_html = self._table.get_html_string()
        self._table_txt = self._table.get_string()


@telemetry.log_call()
def get_table_names(schema=None):
"""Get table names for a given connection"""
Expand All @@ -83,3 +246,14 @@ def get_table_names(schema=None):
def get_columns(name, schema=None):
"""Get column names for a given connection"""
return Columns(name, schema)


@telemetry.log_call()
def get_table_statistics(name, schema=None):
"""Get table statistics for a given connection.

For all data types the results will include `count`, `mean`, `std`, `min`
`max`, `25`, `50` and `75` percentiles. It will also include `unique`, `top`
and `freq` statistics.
"""
return TableDescription(name, schema=schema)
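
For reference, a minimal usage sketch of the new helper, assuming an active connection (e.g. one opened with `%sql duckdb://`); `_table_txt` and `_table_html` are the internal renderings stored by `TableDescription`, as shown in the diff above:

```python
# hypothetical usage sketch, not part of this PR
from sql.inspect import get_table_statistics

stats = get_table_statistics("track")  # or: get_table_statistics("my_numbers", schema="b_schema")
print(stats._table_txt)  # plain-text PrettyTable; stats._table_html holds the HTML string
```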