add duckdb support #1398

Merged
merged 18 commits into from
Sep 25, 2024
Binary file modified doc/assets/diagram.png
1,174 changes: 367 additions & 807 deletions doc/assets/diagram.svg
19 changes: 19 additions & 0 deletions doc/index.md
@@ -101,6 +101,7 @@ alt: Works with GeoPandas
align: center
---
:::

:::{tab-item} Polars
```python
import polars
@@ -116,6 +117,24 @@ align: center
---
:::

:::{tab-item} DuckDB
```python
import duckdb
import hvplot.duckdb
from bokeh.sampledata.autompg import autompg_clean as df

df_duckdb = duckdb.from_df(df)
table = df_duckdb.aggregate('origin, mfr, avg(mpg) AS mpg', 'origin, mfr').order('mpg DESC').limit(5)
table.hvplot.barh('mfr', 'mpg', by='origin', stacked=True)
```
```{image} ./_static/home/pandas.gif
---
alt: Works with DuckDB
align: center
---
```

:::
:::{tab-item} Intake
```python
import hvplot.intake
117 changes: 108 additions & 9 deletions doc/user_guide/Integrations.ipynb
@@ -254,19 +254,13 @@
},
{
"cell_type": "markdown",
"id": "a46e377e-729a-4f99-b5d3-83b0736cb8a3",
"id": "7474a792-2cfd-4139-a1cd-872f913fa07b",
"metadata": {},
"source": [
":::{note}\n",
"Added in version `0.9.0`.\n",
":::"
]
},
{
"cell_type": "markdown",
"id": "7474a792-2cfd-4139-a1cd-872f913fa07b",
"metadata": {},
"source": [
":::\n",
"\n",
":::{important}\n",
"While other data sources like `Pandas` or `Dask` have built-in support in HoloViews, as of version 1.17.1 this is not yet the case for `Polars`. You can track this [issue](https://github.com/holoviz/holoviews/issues/5939) to follow the evolution of this feature in HoloViews. Internally hvPlot simply selects the columns that contribute to the plot and casts them to a Pandas object using Polars' `.to_pandas()` method.\n",
":::"
@@ -327,6 +321,111 @@
"df_polars['A'].hvplot.line(height=150)"
]
},
{
"cell_type": "markdown",
"id": "efc2f45e",
"metadata": {},
"source": [
"#### DuckDB"
]
},
{
"cell_type": "markdown",
"id": "db91860c",
"metadata": {},
"source": [
":::{note}\n",
"Added in version `0.11.0`.\n",
":::"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "0d6460d0",
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"df_pandas = pd.DataFrame(np.random.randn(1000, 4), columns=list('ABCD')).cumsum()\n",
"df_pandas.head(2)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "21638d45",
"metadata": {},
"outputs": [],
"source": [
"import hvplot.duckdb # noqa \n",
"import duckdb\n",
"\n",
"connection = duckdb.connect(':memory:')\n",
"relation = duckdb.from_df(df_pandas, connection=connection)\n",
"relation.to_view(\"example_view\");"
]
},
{
"cell_type": "markdown",
"id": "40b56f16",
"metadata": {},
"source": [
"`.hvplot()` supports [DuckDB](https://duckdb.org/docs/api/python/overview.html) `DuckDBPyRelation` and `DuckDBPyConnection` objects."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f588e3fe",
"metadata": {},
"outputs": [],
"source": [
"relation.hvplot.line(y=['A', 'B', 'C', 'D'], height=150)"
]
},
{
"cell_type": "markdown",
"id": "68a47856",
"metadata": {},
"source": [
"Plotting from a `DuckDBPyRelation` is more efficient because hvPlot selects only the columns needed for the plot inside DuckDB before converting the data to a `pd.DataFrame`. With a `DuckDBPyConnection`, the full result is converted first and subset afterwards.\n",
"\n",
"Prefer `connection.sql()`, which returns a `DuckDBPyRelation`, over `connection.execute()`, which returns a `DuckDBPyConnection`."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "214c60ee",
"metadata": {},
"outputs": [],
"source": [
"sql_expr = \"SELECT * FROM example_view WHERE A > 0 AND B > 0\"\n",
"connection.sql(sql_expr).hvplot.line(y=['A', 'B'], hover_cols=[\"C\"], height=150) # subsets A, B, C"
]
},
{
"cell_type": "markdown",
"id": "2a2f61d4",
"metadata": {},
"source": [
"Alternatively, you can directly subset the desired columns in the SQL expression."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5ce25c3d",
"metadata": {},
"outputs": [],
"source": [
"sql_expr = \"SELECT A, B, C FROM example_view WHERE A > 0 AND B > 0\"\n",
"connection.execute(sql_expr).hvplot.line(y=['A', 'B'], hover_cols=[\"C\"], height=150)"
]
},
{
"cell_type": "markdown",
"id": "25a6e724-6a84-4bff-9108-ac71dcfa9116",
1 change: 1 addition & 0 deletions doc/user_guide/Introduction.ipynb
@@ -15,6 +15,7 @@
"\n",
"* [Pandas](https://pandas.pydata.org): DataFrame, Series (columnar/tabular data)\n",
"* [Rapids cuDF](https://docs.rapids.ai/api/cudf/stable/): GPU DataFrame, Series (columnar/tabular data)\n",
"* [DuckDB](https://www.duckdb.org/): DuckDB is a fast in-process analytical database\n",
"* [Polars](https://www.pola.rs/): Polars is a fast DataFrame library/in-memory query engine (columnar/tabular data)\n",
"* [Dask](https://www.dask.org): DataFrame, Series (distributed/out of core arrays and columnar data)\n",
"* [XArray](https://xarray.pydata.org): Dataset, DataArray (labelled multidimensional arrays)\n",
1 change: 1 addition & 0 deletions envs/py3.10-tests.yaml
@@ -21,6 +21,7 @@ dependencies:
- dask
- dask>=2021.3.0
- datashader>=0.6.5
- duckdb
- fiona
- fugue
- fugue-sql-antlr>=0.2.0
1 change: 1 addition & 0 deletions envs/py3.11-docs.yaml
@@ -20,6 +20,7 @@ dependencies:
- colorcet>=2
- dask>=2021.3.0
- datashader>=0.6.5
- duckdb
- fiona
- fugue
- fugue-sql-antlr>=0.2.0
1 change: 1 addition & 0 deletions envs/py3.11-tests.yaml
@@ -21,6 +21,7 @@ dependencies:
- dask
- dask>=2021.3.0
- datashader>=0.6.5
- duckdb
- fiona
- fugue
- fugue-sql-antlr>=0.2.0
1 change: 1 addition & 0 deletions envs/py3.12-tests.yaml
@@ -21,6 +21,7 @@ dependencies:
- dask
- dask>=2021.3.0
- datashader>=0.6.5
- duckdb
- fiona
- fugue
- fugue-sql-antlr>=0.2.0
1 change: 1 addition & 0 deletions envs/py3.9-tests.yaml
@@ -20,6 +20,7 @@ dependencies:
- dask
- dask>=2021.3.0
- datashader>=0.6.5
- duckdb
- fiona
- fugue
- fugue-sql-antlr>=0.2.0
4 changes: 4 additions & 0 deletions hvplot/converter.py
@@ -55,6 +55,7 @@
    is_tabular,
    is_series,
    is_dask,
    is_duckdb,
    is_intake,
    is_cudf,
    is_streamz,
@@ -1088,6 +1089,9 @@ def _process_data(
        elif is_dask(data):
            datatype = 'dask'
            self.data = data.persist() if persist else data
        elif is_duckdb(data):
            datatype = 'duckdb'
            self.data = data
        elif is_cudf(data):
            datatype = 'cudf'
            self.data = data
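The `is_duckdb` helper used in `_process_data` is imported from `hvplot/util.py`, which this diff does not show. As a rough illustration only, a check of this kind can be written so that DuckDB is never imported for unrelated objects. The name `looks_like_duckdb` is hypothetical, and the handling of a leading underscore in the module name is an assumption about how the compiled extension may be named; this is not hvPlot's actual implementation.

```python
import sys


def looks_like_duckdb(data):
    """Hypothetical stand-in for a check like ``is_duckdb``; not hvPlot's code."""
    # Top-level package that defines the object's type, e.g. 'duckdb' or 'pandas'.
    # Assumption: a compiled extension may expose itself as '_duckdb'.
    top_level = (type(data).__module__ or '').split('.')[0].lstrip('_')
    if top_level != 'duckdb':
        # Cheap rejection: nothing is imported for unrelated objects.
        return False
    # The object came from DuckDB, so the package is normally already loaded.
    duckdb = sys.modules.get('duckdb')
    if duckdb is None:
        return False
    return isinstance(data, (duckdb.DuckDBPyRelation, duckdb.DuckDBPyConnection))
```

A check written this way keeps DuckDB an optional dependency: objects from other libraries are rejected by name before any DuckDB attribute is touched.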
27 changes: 27 additions & 0 deletions hvplot/duckdb.py
@@ -0,0 +1,27 @@
"""Adds the `.hvplot` method to duckdb.DuckDBPyRelation and duckdb.DuckDBPyConnection"""


def patch(name='hvplot', interactive='interactive', extension='bokeh', logo=False):
    from hvplot.plotting.core import hvPlotTabularDuckDB
    from . import post_patch, _module_extensions

    if 'hvplot.duckdb' not in _module_extensions:
        try:
            import duckdb
        except ImportError:
            raise ImportError(
                'Could not patch plotting API onto DuckDB. DuckDB could not be imported.'
            )

        # Patching for DuckDBPyRelation and DuckDBPyConnection
        _patch_duckdb_plot = lambda self: hvPlotTabularDuckDB(self)  # noqa: E731
        _patch_duckdb_plot.__doc__ = hvPlotTabularDuckDB.__call__.__doc__
        plot_prop_duckdb = property(_patch_duckdb_plot)
        setattr(duckdb.DuckDBPyRelation, name, plot_prop_duckdb)
        setattr(duckdb.DuckDBPyConnection, name, plot_prop_duckdb)
        _module_extensions.add('hvplot.duckdb')

    post_patch(extension, logo)


patch()
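The patching technique used in `patch()` can be tried without DuckDB installed. The sketch below attaches a property to an existing class so that every instance gains an accessor. `Relation` and `Plotter` are hypothetical stand-ins for `duckdb.DuckDBPyRelation` and `hvPlotTabularDuckDB`, and `patch_class` is an illustrative name rather than anything in hvPlot.

```python
class Relation:
    """Hypothetical stand-in for a third-party class such as DuckDBPyRelation."""

    def __init__(self, rows):
        self.rows = rows


class Plotter:
    """Hypothetical accessor; plays the role of hvPlotTabularDuckDB."""

    def __init__(self, data):
        self._data = data

    def __call__(self, kind='line'):
        # A real accessor would build a plot; this one only describes it.
        return f'{kind} plot of {len(self._data.rows)} rows'


def patch_class(cls, name='hvplot'):
    # The property creates a new accessor bound to the instance on each access.
    setattr(cls, name, property(lambda self: Plotter(self)))


patch_class(Relation)
print(Relation([1, 2, 3]).hvplot(kind='bar'))  # prints "bar plot of 3 rows"
```

Because the property lives on the class, instances created before the patch pick up the accessor as well.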
8 changes: 7 additions & 1 deletion hvplot/plotting/__init__.py
@@ -1,5 +1,5 @@
import holoviews as hv
from ..util import with_hv_extension, is_polars
from ..util import with_hv_extension, is_duckdb, is_polars

from .core import hvPlot, hvPlotTabular # noqa

@@ -11,6 +11,7 @@

@with_hv_extension
def plot(data, kind, **kwargs):
    print(data)
    # drop reuse_plot
    kwargs.pop('reuse_plot', None)

@@ -34,6 +35,11 @@ def plot(data, kind, **kwargs):
        from .core import hvPlotTabularPolars

        return hvPlotTabularPolars(data)(kind=kind, **no_none_kwargs)

    elif is_duckdb(data):
        from .core import hvPlotTabularDuckDB

        return hvPlotTabularDuckDB(data)(kind=kind, **no_none_kwargs)
    return hvPlotTabular(data)(kind=kind, **no_none_kwargs)
83 changes: 83 additions & 0 deletions hvplot/plotting/core.py
@@ -1864,6 +1864,89 @@ def labels(self, x=None, y=None, text=None, **kwds):
        return self(x, y, text=text, kind='labels', **kwds)


class hvPlotTabularDuckDB(hvPlotTabular):
    def _get_converter(self, x=None, y=None, kind=None, **kwds):
        import duckdb
        from duckdb.typing import (
            BIGINT,
            FLOAT,
            DOUBLE,
            INTEGER,
            SMALLINT,
            TINYINT,
            UBIGINT,
            UINTEGER,
            USMALLINT,
            UTINYINT,
            HUGEINT,
        )

        params = dict(self._metadata, **kwds)
        x = x or params.pop('x', None)
        y = y or params.pop('y', None)
        kind = kind or params.pop('kind', None)

        # Handle DuckDB Relation and Connection objects
        if isinstance(self._data, (duckdb.DuckDBPyConnection, duckdb.DuckDBPyRelation)):
            if isinstance(self._data, duckdb.DuckDBPyConnection):
                data = self._data.df()
            else:
                data = self._data

            if params.get('hover_cols') != 'all':
                data_columns = data.columns
                possible_columns = [
                    [v] if isinstance(v, str) else v
                    for v in params.values()
                    if isinstance(v, (str, list))
                ]

                columns = (set(data_columns) & set(itertools.chain(*possible_columns))) or {
                    data_columns[0]
                }
                if y is None:
                    # When y is not specified HoloViewsConverter finds all the numeric
                    # columns and use them as y values (see _process_chart_y). We need
                    # to include these columns too.

                    if isinstance(data, duckdb.DuckDBPyRelation):
                        numeric_columns = data.select_types(
                            [
                                BIGINT,
                                FLOAT,
                                DOUBLE,
                                INTEGER,
                                SMALLINT,
                                TINYINT,
                                UBIGINT,
                                UINTEGER,
                                USMALLINT,
                                UTINYINT,
                                HUGEINT,
                            ]
                        ).columns
                    else:
                        numeric_columns = data.select_dtypes(include='number').columns
                    columns |= set(numeric_columns)
                xs = x if is_list_like(x) else (x,)
                ys = y if is_list_like(y) else (y,)
                columns |= {*xs, *ys}
                columns.discard(None)

                if isinstance(data, duckdb.DuckDBPyRelation):
                    columns = sorted(columns, key=lambda c: data_columns.index(c))
                    data = data.select(*columns).to_df()
                else:
                    columns = sorted(columns, key=lambda c: data.columns.get_loc(c))
                    data = data[list(columns)]
        else:
            raise ValueError(
                'Only duckdb.DuckDBPyConnection and duckdb.DuckDBPyRelation are supported'
            )

        return HoloViewsConverter(data, x, y, kind=kind, **params)


class hvPlotTabularPolars(hvPlotTabular):
    def _get_converter(self, x=None, y=None, kind=None, **kwds):
        import polars as pl
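The column-selection steps in `hvPlotTabularDuckDB._get_converter` can be followed in isolation with plain Python. The sketch below is a simplified restatement rather than the code from the PR: `select_columns` is a hypothetical helper, the numeric columns are passed in as a list instead of being queried from DuckDB, and `isinstance(..., (list, tuple))` stands in for `is_list_like`.

```python
import itertools


def select_columns(data_columns, x=None, y=None, numeric_columns=(), **params):
    """Hypothetical helper returning the column names a plot call needs."""
    # Any string or list option may name a column (by, hover_cols, ...).
    candidates = [
        [v] if isinstance(v, str) else v
        for v in params.values()
        if isinstance(v, (str, list))
    ]
    # Keep only real column names; fall back to the first column if none match.
    columns = (set(data_columns) & set(itertools.chain(*candidates))) or {data_columns[0]}
    if y is None:
        # Without an explicit y, every numeric column may be used as a y value.
        columns |= set(numeric_columns)
    xs = x if isinstance(x, (list, tuple)) else (x,)
    ys = y if isinstance(y, (list, tuple)) else (y,)
    columns |= {*xs, *ys}
    columns.discard(None)
    # Restore the original column order before selecting.
    return sorted(columns, key=data_columns.index)


print(select_columns(['A', 'B', 'C', 'D'], y=['A', 'B'], hover_cols=['C'], height=150))
# prints ['A', 'B', 'C']
```

This mirrors the notebook call `hvplot.line(y=['A', 'B'], hover_cols=["C"], height=150)`, where only columns A, B and C are selected before the data is converted to a DataFrame.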