iqmo-org/jupylite_duckdb

Experimental

This is experimental and unstable.

Pyodide + DuckDB

This is a proof of concept for executing duckdb_wasm from a Pyodide kernel. It opens up a few paths for using DuckDB in the browser, such as PyScript and JupyterLite.

** The project should probably be called Pyoduckwasm or something like that... it started with JupyterLite as the end goal.

Demonstration:

  • Static PyScript Example

  • PyScript REPL

  • pyodide console

    import micropip
    # Install pandas and the wrapper package into the Pyodide environment.
    await micropip.install('pandas')
    await micropip.install('jupylite-duckdb')
    import jupylite_duckdb as jd
    # Open a connection, then run queries against it.
    conn = await jd.connect()
    r1 = await jd.query("pragma version", conn)
    # Load a remote Parquet file into a table, then aggregate it.
    r2 = await jd.query("create or replace table xyz as select * from 'https://raw.githubusercontent.com/Teradata/kylo/master/samples/sample-data/parquet/userdata2.parquet'", conn)
    r3 = await jd.query("select gender, count(*) as c from xyz group by gender", conn)
    print(r1)
    print(r2)
    print(r3)
    
  • JupyterLite

  • JupyterLite Code Console REPL

Note: reloading seems somewhat unreliable with pyodide. CTRL-F5 works more reliably.

Limitations:

  • API: only the equivalents of duckdb.connect() and duckdb.query() are implemented.
  • DataFrames are not (yet) registered in the DuckDB database.
  • Data is copied twice: from the duckdb_wasm Arrow result to a Python list[dict], and then to a DataFrame. PyArrow is not (yet) available in Pyodide.
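
The row-wise copy described in the last limitation can be sketched in plain Python. This is only an illustration of the data shape, not the package's actual code: `columns_to_records` is a hypothetical helper, and the sample values are made up.

```python
from typing import Any


def columns_to_records(columns: dict[str, list[Any]]) -> list[dict[str, Any]]:
    """Turn a columnar result into one dict per row (a full copy of the data)."""
    lengths = {len(values) for values in columns.values()}
    if len(lengths) > 1:
        raise ValueError("all columns must have the same length")
    names = list(columns)
    # zip(*columns.values()) walks the columns in lockstep, yielding one row at a time.
    return [dict(zip(names, row)) for row in zip(*columns.values())]


# Stand-in for a columnar query result; the values are invented for illustration.
result = {"gender": ["Female", "Male"], "c": [3, 2]}
records = columns_to_records(result)
print(records)

# With pandas installed, the second copy would then be:
#   import pandas as pd
#   df = pd.DataFrame(records)
```

Every value is touched once per copy, which is why zero-copy exchange through Arrow would be preferable once PyArrow is available.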

Observations:

  • It takes about a minute to run the JupyterLite examples. Most of this time is spent before any DuckDB code runs. Some of it could be shaved off with a custom Pyodide build, but PyScript is much faster.
  • JupyterLite was unreliable with page reloads; I ended up having to clear the cache a lot.
  • Not thrilled with PyScript removing top-level await... will probably just auto-wrap it (like IPython's %autoawait).
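
One way the auto-wrapping mentioned in the last observation could work is to compile the code with Python's `ast.PyCF_ALLOW_TOP_LEVEL_AWAIT` flag and await the result. The sketch below uses only the standard library; `run_cell` is a hypothetical name, and this is not necessarily how this project, PyScript, or IPython does it.

```python
import ast
import asyncio
import inspect


async def run_cell(source: str, namespace: dict) -> dict:
    """Execute source that may contain top-level await inside namespace."""
    code = compile(source, "<cell>", "exec", flags=ast.PyCF_ALLOW_TOP_LEVEL_AWAIT)
    # If the source used await at top level, eval() returns a coroutine
    # instead of running the code to completion.
    result = eval(code, namespace)
    if inspect.iscoroutine(result):
        await result
    return namespace


# Outside a browser there is no running event loop, so asyncio.run() drives it.
cell = "await asyncio.sleep(0)\nanswer = 41 + 1"
ns = asyncio.run(run_cell(cell, {"asyncio": asyncio}))
print(ns["answer"])
```

Inside an environment that already has a running event loop, the coroutine returned by `run_cell` would be awaited or scheduled on that loop rather than passed to `asyncio.run()`.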

Demonstration

Code Console REPL Example

jupyterlite_duckdb_wasm

Python wrapper to run DuckDB_WASM within JupyterLite with a Pyodide kernel. See the notebooks for examples of running this within JupyterLite.

Cell Magic %%dql

Following the example of magic_duckdb, there's an initial proof of concept of a DuckDB cell magic for JupyterLite. See Magic Example

Various Issues, Todos and Ideas

  • Implement a proof of concept version of dataframe registration
  • Evaluate startup time reduction, perhaps custom pyodide build
  • Handling errors: detect and display errors in Jupyter. Too much stuff is buried in the browser console, such as CORS errors.
  • invalidate pip browser cache (as/if needed); annoying for development purposes
  • think through async/await/transform_cell approach and whether there's a better solution.
  • Zero copy data exchange (js/duckdb arrow -> python/dataframe and python/df -> js/duckdb): Blocked by Pyarrow support
  • If you're adding local .py files, use importlib.invalidate_caches(). Even then, imports were flaky.
  • Careful with caching... %pip install will pull from the browser cache. I had to clear it frequently within dev tools.
  • To clear local storage, which is annoyingly persistent, see https://superuser.com/questions/519628/clear-html5-local-storage-on-a-specific-page
  • %autoawait, which is enabled by default, is part of why this works in notebooks. The %%dql cell magic patches transform_cell to push an await into the cell transformation: https://ipython.readthedocs.io/en/stable/interactive/autoawait.html
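
For the note about local .py files, here is a minimal sketch of the importlib.invalidate_caches() pattern using only the standard library. The module name `local_helper_demo` and the temporary directory are made up for illustration, and the flakiness observed in JupyterLite may have causes this sketch does not cover.

```python
import importlib
import importlib.util
import sys
import tempfile
from pathlib import Path

MODULE_NAME = "local_helper_demo"  # hypothetical module name

# A scratch directory standing in for wherever the local .py files live.
workdir = Path(tempfile.mkdtemp())
sys.path.insert(0, str(workdir))

# Looking the module up before it exists lets the import system cache
# the (still empty) directory listing.
was_missing = importlib.util.find_spec(MODULE_NAME) is None

# Create the module after that lookup, as happens when a file is added mid-session.
(workdir / f"{MODULE_NAME}.py").write_text("VALUE = 1\n")

# Drop the cached directory listings so the new file is seen by the next import.
importlib.invalidate_caches()
helper = importlib.import_module(MODULE_NAME)
print(was_missing, helper.VALUE)
```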

About

duckdb_wasm in jupyterlite & pyodide
