This is experimental and unstable.
This is a proof of concept for executing duckdb_wasm from a Pyodide kernel. This unlocks a few paths for using DuckDB, such as PyScript and JupyterLite.
Note: the project should probably be called Pyoduckwasm or something like that; it started with JupyterLite as the end goal.
import micropip
await micropip.install('pandas')
await micropip.install('jupylite-duckdb')
import jupylite_duckdb as jd
conn = await jd.connect()
r1 = await jd.query("pragma version", conn)
r2 = await jd.query("create or replace table xyz as select * from 'https://raw.githubusercontent.com/Teradata/kylo/master/samples/sample-data/parquet/userdata2.parquet'", conn)
r3 = await jd.query("select gender, count(*) as c from xyz group by gender", conn)
print(r1)
print(r2)
print(r3)
JupyterLite Code Console REPL
Note: reloading seems somewhat unreliable with Pyodide. A hard reload (Ctrl+F5) works more reliably.
Limitations:
- API: only the equivalents of duckdb.connect() and duckdb.query() are implemented.
- DataFrames are not (yet) registered in the DuckDB database.
- Data is copied from the duckdb_wasm Arrow result to a Python list[dict], and then to a DataFrame. PyArrow is not (yet) available in Pyodide.
- It takes about a minute to run the JupyterLite examples. Most of that time is spent before any DuckDB code runs. A custom Pyodide build could shave some of it off, but PyScript is much faster.
- JupyterLite was unreliable with page reloads; I ended up having to clear the cache a lot.
- Not thrilled with PyScript removing top-level await; will probably just auto-wrap it (like IPython's %autoawait).
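The auto-wrapping idea in the last item could look roughly like the sketch below. It is not this project's implementation: `run_cell` and `fake_query` are hypothetical names, and the sketch relies on the standard `ast.PyCF_ALLOW_TOP_LEVEL_AWAIT` compile flag. It is run here with `asyncio.run` under plain CPython; inside Pyodide an event loop is already running, so the coroutine would need to be scheduled on that loop instead.

```python
import ast
import asyncio
import inspect


async def run_cell(source: str, namespace: dict) -> None:
    """Execute source, allowing `await` at the top level (hypothetical helper)."""
    code = compile(source, "<cell>", "exec", flags=ast.PyCF_ALLOW_TOP_LEVEL_AWAIT)
    result = eval(code, namespace)
    # With the flag set, eval() returns a coroutine only when the source
    # actually used top-level await; plain code has already run by now.
    if inspect.iscoroutine(result):
        await result


async def fake_query(sql: str) -> str:
    # Stand-in for an async call such as jd.query(); not the real API.
    await asyncio.sleep(0)
    return f"ran: {sql}"


namespace = {"fake_query": fake_query}
asyncio.run(run_cell("r = await fake_query('select 1')", namespace))
asyncio.run(run_cell("x = 1 + 1", namespace))
print(namespace["r"], namespace["x"])
```

Compiling with the flag keeps assignments in the shared namespace, which a textual wrap into an `async def` would turn into function locals.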
Python wrapper to run DuckDB_WASM within JupyterLite with a Pyodide kernel. See the notebooks for examples of running this within JupyterLite.
Following the example of magic_duckdb, there's an initial proof of concept of a DuckDB cell magic for JupyterLite. See Magic Example
import micropip
await micropip.install('pandas')
await micropip.install('jupylite-duckdb')
import jupylite_duckdb as jd
conn = await jd.connect()
r1 = await jd.query("pragma version", conn)
r2 = await jd.query("create or replace table xyz as select * from 'https://raw.githubusercontent.com/Teradata/kylo/master/samples/sample-data/parquet/userdata2.parquet'", conn)
r3 = await jd.query("select gender, count(*) as c from xyz group by gender", conn)
print(r1)
print(r2)
print(r3)
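As noted under the limitations, results currently pass through a Python list[dict] on their way to a DataFrame. The sketch below only illustrates that intermediate shape and the row-to-column pivot that building a DataFrame implies. The sample values and the `rows_to_columns` helper are made up for illustration and are not part of jupylite_duckdb; it also assumes every row has the same keys.

```python
# Rows as they might look after converting an Arrow result to Python
# objects; the values are invented for illustration.
rows = [
    {"gender": "Female", "c": 3},
    {"gender": "Male", "c": 2},
]


def rows_to_columns(rows):
    """Pivot list[dict] rows into a dict of column lists (hypothetical helper)."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            # Assumes uniform keys across rows; otherwise columns misalign.
            columns.setdefault(name, []).append(value)
    return columns


columns = rows_to_columns(rows)
print(columns)
```

With pandas installed, `pandas.DataFrame(rows)` accepts the same list[dict] directly. Every value is copied at least once along the way, which is the cost that zero-copy Arrow exchange would avoid.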
- Implement a proof of concept version of dataframe registration
- Evaluate startup time reduction, perhaps custom pyodide build
- Handling errors: detect and display errors in Jupyter; too much stuff is buried in the console, such as CORS errors
- Invalidate the pip browser cache (as/if needed); annoying for development purposes
- Think through the async/await/transform_cell approach and whether there's a better solution.
- Zero-copy data exchange (js/duckdb arrow -> python/dataframe and python/df -> js/duckdb): blocked by PyArrow support
- If you're adding local .py files, use importlib.invalidate_caches(). Even then, imports were flaky.
- Careful with caching: %pip install will pull from the browser cache. I had to clear it frequently within dev tools.
- To clear local storage, which is annoyingly persistent, see https://superuser.com/questions/519628/clear-html5-local-storage-on-a-specific-page
- %autoawait, which is enabled by default, is part of why this works in notebooks. The %%dql cell magic patches transform_cell to push an await into the cell transformation: https://ipython.readthedocs.io/en/stable/interactive/autoawait.html
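The local-module note above can be sketched as follows, using only the standard library. The module name `local_helper` and the temporary directory are placeholders. The sketch runs under plain CPython and does not reproduce the flakiness reported for the browser environment.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Create a module file at runtime, as happens when a .py file is added
# to the kernel's file system after the interpreter has started.
module_dir = Path(tempfile.mkdtemp())
(module_dir / "local_helper.py").write_text("def greet():\n    return 'hello'\n")

sys.path.insert(0, str(module_dir))

# Import finders cache directory listings; clear them so the new file is seen.
importlib.invalidate_caches()

local_helper = importlib.import_module("local_helper")
print(local_helper.greet())
```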