---
jupytext:
  notebook_metadata_filter: myst
  text_representation:
    extension: .md
    format_name: myst
    format_version: 0.13
    jupytext_version: 1.14.7
kernelspec:
  display_name: Python 3 (ipykernel)
  language: python
  name: python3
myst:
  html_meta:
    description lang=en: Use DuckDB from Jupyter using JupySQL
    keywords: jupyter, sql, jupysql, duckdb, plotting
    property=og:locale: en_US
---

# DuckDB

JupySQL also supports DuckDB via a native connection (no SQLAlchemy needed); to learn more, see [the tutorial](../integrations/duckdb-native.md). To learn the differences between the two connection types, [click here](../tutorials/duckdb-native-sqlalchemy.md).

JupySQL integrates with DuckDB so you can run SQL queries in a Jupyter notebook. Jump into any section to learn more!

+++

## Pre-requisites for `.csv` file

```{code-cell} ipython3
%pip install jupysql duckdb duckdb-engine --quiet
%load_ext sql
%sql duckdb://
```

## Load sample data

+++

Get a sample `.csv` file:

```{code-cell} ipython3
from urllib.request import urlretrieve

_ = urlretrieve(
    "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/penguins.csv",
    "penguins.csv",
)
```

## Query

+++

The data from the `.csv` file must first be registered as a table for it to appear in the table listing:

```{code-cell} ipython3
%%sql
CREATE TABLE penguins AS SELECT * FROM penguins.csv
```

With the table created in the cell above, it now shows up when listing tables:

```{code-cell} ipython3
%sqlcmd tables
```

List columns in the `penguins` table:

```{code-cell} ipython3
%sqlcmd columns -t penguins
```
Query the `.csv` file directly:

```{code-cell} ipython3
%%sql
SELECT *
FROM penguins.csv
LIMIT 3
```

Aggregate by species:

```{code-cell} ipython3
%%sql
SELECT species, COUNT(*) AS count
FROM penguins.csv
GROUP BY species
ORDER BY count DESC
```

## Plotting

Store the aggregation result in a variable:

```{code-cell} ipython3
%%sql species_count <<
SELECT species, COUNT(*) AS count
FROM penguins.csv
GROUP BY species
ORDER BY count DESC
```

Then plot it:

```{code-cell} ipython3
ax = species_count.bar()
# customize plot (this is a matplotlib Axes object)
_ = ax.set_title("Num of penguins by species")
```
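Since `.bar()` returns a standard matplotlib `Axes`, any matplotlib customization applies: labels, figure size, saving to disk, and so on. A self-contained sketch with stand-in numbers (in the notebook you would start from `species_count.bar()` instead of building the bars by hand):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, safe outside Jupyter
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.bar(["Adelie", "Gentoo", "Chinstrap"], [152, 124, 68])  # stand-in counts
ax.set_title("Num of penguins by species")
ax.set_ylabel("Count")
fig.savefig("species_count.png", dpi=150)
```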

## Pre-requisites for `.parquet` file

```{code-cell} ipython3
%pip install jupysql duckdb duckdb-engine pyarrow --quiet
%load_ext sql
%sql duckdb://
```

## Load sample data

+++

Get a sample `.parquet` file:

```{code-cell} ipython3
from urllib.request import urlretrieve

_ = urlretrieve(
    "https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2021-01.parquet",
    "yellow_tripdata_2021-01.parquet",
)
```

## Query

+++

Likewise, to list the data from a `.parquet` file as a table, it must first be registered:

```{code-cell} ipython3
%%sql
CREATE TABLE tripdata AS SELECT * FROM "yellow_tripdata_2021-01.parquet"
```

The table now appears in the listing:

```{code-cell} ipython3
%sqlcmd tables
```

List columns in the `tripdata` table:

```{code-cell} ipython3
%sqlcmd columns -t tripdata
```
Query the `.parquet` file directly:

```{code-cell} ipython3
%%sql
SELECT tpep_pickup_datetime, tpep_dropoff_datetime, passenger_count
FROM "yellow_tripdata_2021-01.parquet"
LIMIT 3
```

Compute the average trip distance per passenger count:

```{code-cell} ipython3
%%sql
SELECT
    passenger_count, AVG(trip_distance) AS avg_trip_distance
FROM "yellow_tripdata_2021-01.parquet"
GROUP BY passenger_count
ORDER BY passenger_count ASC
```

## Plotting

Store the aggregation result in a variable:

```{code-cell} ipython3
%%sql avg_trip_distance <<
SELECT
    passenger_count, AVG(trip_distance) AS avg_trip_distance
FROM "yellow_tripdata_2021-01.parquet"
GROUP BY passenger_count
ORDER BY passenger_count ASC
```

Then plot it:

```{code-cell} ipython3
ax = avg_trip_distance.plot()
# customize plot (this is a matplotlib Axes object)
_ = ax.set_title("Avg trip distance by num of passengers")
```

## Plotting large datasets

This section demonstrates how to efficiently plot large datasets with DuckDB and JupySQL without exhausting the machine's memory: `%sqlplot` performs all aggregations in DuckDB.

Let's install the required package:

```{code-cell} ipython3
%pip install jupysql duckdb duckdb-engine pyarrow --quiet
```

Now, we download sample data: NYC taxi data split into 3 Parquet files:

```{code-cell} ipython3
from pathlib import Path
from urllib.request import urlretrieve

N_MONTHS = 3

# https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page
for i in range(1, N_MONTHS + 1):
    filename = f"yellow_tripdata_2021-{str(i).zfill(2)}.parquet"
    if not Path(filename).is_file():
        print(f"Downloading: {filename}")
        url = f"https://d37ci6vzurychx.cloudfront.net/trip-data/{filename}"
        urlretrieve(url, filename)
```

In total, this contains more than 4.6M observations:

```{code-cell} ipython3
%%sql
SELECT count(*) FROM 'yellow_tripdata_2021-*.parquet'
```

Let's use JupySQL to get a histogram of `trip_distance` across all the files:

```{code-cell} ipython3
%sqlplot histogram --table yellow_tripdata_2021-*.parquet --column trip_distance --bins 50
```

We have some outliers; let's find the 99th percentile:

```{code-cell} ipython3
%%sql
SELECT percentile_disc(0.99) WITHIN GROUP (ORDER BY trip_distance)
FROM 'yellow_tripdata_2021-*.parquet'
```

We now write a query to remove everything above that number:

```{code-cell} ipython3
%%sql --save no_outliers --no-execute
SELECT trip_distance
FROM 'yellow_tripdata_2021-*.parquet'
WHERE trip_distance < 18.93
```

Histogram without the outliers:

```{code-cell} ipython3
%sqlplot histogram --table no_outliers --column trip_distance --bins 50
```

And a boxplot:

```{code-cell} ipython3
%sqlplot boxplot --table no_outliers --column trip_distance
```

## Querying existing dataframes

```{code-cell} ipython3
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("duckdb:///:memory:")
df = pd.DataFrame({"x": range(100)})
%sql engine
```
If you're using DuckDB 1.1.0 or higher, you must run this before querying a data frame:

~~~sql
%sql SET python_scan_all_frames=true
~~~
```{code-cell} ipython3
%%sql
SELECT *
FROM df
WHERE x > 95
```

## Passing parameters to connection

```{code-cell} ipython3
from sqlalchemy import create_engine

some_engine = create_engine(
    "duckdb:///:memory:",
    connect_args={
        "preload_extensions": [],
    },
)
%sql some_engine
```