Skip to content
This repository has been archived by the owner on Nov 2, 2023. It is now read-only.

Commit

Permalink
Polish Database construction part of tutorial
Browse files Browse the repository at this point in the history
  • Loading branch information
JakobGM committed Dec 27, 2022
1 parent 2ab9292 commit 6dff76d
Show file tree
Hide file tree
Showing 4 changed files with 538 additions and 24 deletions.
1 change: 1 addition & 0 deletions docs/conf.py
Original file line number Diff line number Diff line change
Expand Up @@ -11,6 +11,7 @@
"sphinx.ext.autosummary",
"sphinx.ext.napoleon",
"sphinx_autodoc_typehints",
"sphinx_toolbox.collapse",
"sphinxcontrib.mermaid",
]
autodoc_member_order = "bysource"
Expand Down
67 changes: 52 additions & 15 deletions docs/tutorial/database-integration.rst
Original file line number Diff line number Diff line change
Expand Up @@ -14,8 +14,7 @@ Integrating Polars With an SQL Database Using Efficient Caching
dw-->|cached Patito<br>integration|ds

Many data-driven applications involve data that must be retrieved from a database, either a remote data warehouse solution such as Databricks or Snowflake, or a local database such as DuckDB or SQLite3.
Patito offers a database-agnostic API to query such sources in an efficient way by utilizing intelligent caching.

Patito offers a database-agnostic API to query such sources, returning the result as a polars DataFrame, while offering intelligent query caching on top.
By the end of this tutorial you will be able to write data ingestion logic that looks like this:

.. code::
Expand All @@ -35,9 +34,9 @@ The wrapped ``users`` function will now construct, execute, cache, and return th
The cache for ``users(country="NO")`` will be stored independently from ``users(country="US")``, and so on.
This, among with other functionality that will be explained later, allows you to integrate your local data pipeline with your remote database in an effortless way.

The following tutorial will explain how to construct a :class:`patito.Database` object which provides Patito with the required context to execute SQL queries on your database of choice.
The following tutorial will explain how to construct a :class:`patito.Database` object which provides Patito with the required context to execute SQL queries against your database of choice.
In turn :func:`patito.Database.query` can be used to execute SQL query strings directly and :func:`patito.Database.as_query` can be used to wrap functions that *produce* SQL query strings.
The latter decorator turns functions into :class:`patito.Database.Query <patito.database.DatabaseQuery>` objects which act very much like the original functions, only that it actually executes and returns the result as a DataFrame when invoked.
The latter decorator turns functions into :class:`patito.Database.Query <patito.database.DatabaseQuery>` objects which act very much like the original functions, only that they actually execute the constructed queries and return the results as DataFrames when invoked.
The ``Query`` object also has :ref:`additional methods <QueryMethods>` for managing the query caches and more.

This tutorial will take a relatively opinionated approach to how to organize your code.
Expand All @@ -46,14 +45,29 @@ For less opinionated documentation, see the referenced classes and methods above
.. contents:: Table of Contents
:local:

Setup
-----

The following tutorial will depend on ``patito`` having been installed with the ``caching`` extension group:

.. code::
pip install patito[caching]
Code samples in this tutorial will use `DuckDB <https://duckdb.org/>`_, but you should be able to replace it with your database of choice as you follow along:

.. code::
pip install duckdb
Construct a ``patito.Database`` Object
--------------------------------------

To begin, we need to provide Patito with the tools required query your database of choice.
To begin we need to provide Patito with the tools required to query your database of choice.
First we must implement a *query handler*, a function that takes a query string as its first argument, executes the query, and returns the result of the query in the form an Arrow table.

We are going to use DuckDB as our example in this tutorial.
Another example using SQLite3 is provided in the example section of the reference documentation for :class:`patito.Database`.
We are going to use DuckDB as our detailed example in this tutorial, but example code for other databases, including SQLite3, is provided at the end of this section.
We start by creating a ``db.py`` module in the root of our application, and implement ``db.connection`` as a way to connect to a DuckDB instance.

.. code-block::
Expand Down Expand Up @@ -82,7 +96,7 @@ We can use this new function in order to implement our query handler.
return connection.cursor().query(query).arrow()
Notice how the first argument of ``query_handler`` is the query string to be executed, as required by Patito, but the ``name`` keyword is specific to our database of choice.
It is now simple for us to create a :class:`patito.Database` object by providing ``db.query_handler`` to its constructor.
It is now simple for us to create a :class:`patito.Database` object by providing ``db.query_handler``:

.. code-block::
:caption: **db.py** -- pt.Database
Expand All @@ -103,8 +117,19 @@ It is now simple for us to create a :class:`patito.Database` object by providing
my_database = pt.Database(query_handler=query_handler)
Additional arguments can be provided to the ``Database`` constructor, for example a custom cache directory, and are documented :ref:`here <Database.__init__>`.
Additional arguments can be provided to the ``Database`` constructor, for example a custom cache directory.
These additional parameters are documented :ref:`here <Database.__init__>`.
Documentation for constructing query handlers and :class:`patito.Database` objects for other databases is provided in the collapsable sections below:

.. collapse:: SQLite3

See "Examples" section of :class:`patito.Database`.

.. collapse:: Other

You are welcome to create `a GitHub issue <https://github.com/kolonialno/patito/issues/new>`_ if you need help integrating with you specific database of choice.

|
Querying the Database Directly
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Expand All @@ -122,27 +147,38 @@ The ``db`` module is now complete and we should be able to use it in order to ex
╞═════╪═════╡
│ 1 ┆ 2 │
└─────┴─────┘
<polars.LazyFrame object at 0x13571D310>
The query result is provided in the form of a polars ``DataFrame`` object.
Additional parameters can be provided to :func:`patito.Database.query` as described :ref:`here <Database.query>`.
An example for how to cache the result of a specific query and return it as a ``polars.LazyFrame`` goes as follow:
As an example, the query result can be provided as a ``polars.LazyFrame`` by specifying ``lazy=True``.

.. code-block::
>>> from db import my_database
>>> my_database.query("select 1 as a, 2 as b", lazy=True, cache=True)
>>> my_database.query("select 1 as a, 2 as b", lazy=True)
<polars.LazyFrame object at 0x13571D310>
Any *additional* keyword arguments provided to :func:`patito.Database.query` are forwarded to the original query handler, so the following will execute the query against a hypothetical DuckDB database stored at ``./my.db``:
Any *additional* keyword arguments provided to :func:`patito.Database.query` are forwarded directly to the original query handler, so the following will execute the query against the database stored in ``my.db``:

.. code-block::
>>> my_database.query("select * from my_table", name="my.db")
Query Data from Database as a DataFrame
---------------------------------------
.. mermaid::
:caption: Delegation of parameters provided to :func:`patito.Database.query`.
:align: center

%%{init: {'theme': 'base', 'themeVariables': { 'primaryColor': '#FFF5E6', 'secondaryColor': '#FFF5E6' }}}%%
graph LR;
input["<code>Database.query(query, lazy, cache, ttl, model, **kwargs)</code>"]
query[patito.Query]
query_handler[Database query handler]
input-->|<code>lazy, cache, ttl, model</code>|query
input-->|<code>query, **kwargs</code>|query_handler
Wrapping Query-Producing Functions
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Let's assume that you have a project named ``user-analyzer`` which analyzes users.
The associated python package should therefore be named ``user_analyzer``.
Expand Down Expand Up @@ -173,6 +209,7 @@ This is clearly not enough, the ``fetch.users`` function only returns a query st
def users():
return "select * from d_users"
Polars vs. Pandas
~~~~~~~~~~~~~~~~~

Expand Down
Loading

0 comments on commit 6dff76d

Please sign in to comment.