Add summary of each backend to docs #1385

Merged
merged 5 commits into from
Jul 3, 2023
Merged
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
docs/topic_guides/backends.md

Splink is a Python library. It implements all data linking computations by generating SQL and submitting the SQL statements to a backend of the user's choosing for execution.

For smaller input datasets of up to 1-2 million records, users can link data in Python on their laptop using the DuckDB backend. This is the recommended approach because the DuckDB backend is installed automatically when the user installs Splink using `pip install splink`. No additional configuration is needed.

Linking larger datasets involves highly computationally intensive calculations and generates datasets which are too large to be processed on a standard laptop. For these scenarios, we recommend using one of Splink's big data backends - currently Spark or AWS Athena. When these backends are used, the SQL generated by Splink is sent to the chosen backend for execution.

The Splink code you write is almost identical between backends, so it's straightforward to migrate between backends. Often, it's a good idea to start working using DuckDB on a sample of data, because it will produce results very quickly. When you're comfortable with your model, you may wish to migrate to a big data backend to estimate/predict on the full dataset.

## Choosing a backend

### Considerations when choosing a SQL backend for Splink
When choosing which backend to use for getting started with Splink, there are a number of factors to consider:

- the size of the datasets
- the amount of configuration required
- access to specific (sometimes proprietary) platforms
- the backend-specific features offered by Splink
- the level of support and active development offered by Splink

Below is a short summary of each of the backends available in Splink.

### :simple-duckdb: DuckDB

DuckDB is recommended for smaller datasets (up to 1-2 million records) and is our primary recommendation for getting started with Splink. It is fast, easy to set up, can be run on any device with Python installed, and is installed automatically with Splink via `pip install splink`. DuckDB has complete coverage for the functions in the Splink [comparison libraries](../comparison_level_library.md) and, as one of the Splink development team's core backends, is actively maintained with features being added regularly.

See the DuckDB [deduplication example notebook](../demos/example_deduplicate_50k_synthetic.ipynb) to get a better idea of how Splink works with DuckDB.
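
To illustrate how little setup is needed, here is a minimal sketch of creating a DuckDB linker on a toy pandas dataframe. It assumes the Splink 3 API (e.g. `DuckDBLinker` and the `splink.duckdb` comparison library); the data and settings are placeholders rather than a realistic model.

```python
import pandas as pd

from splink.duckdb.linker import DuckDBLinker
import splink.duckdb.comparison_library as cl

# Toy input data - in practice this would be your own dataframe
df = pd.DataFrame(
    {
        "unique_id": [1, 2, 3],
        "first_name": ["lucas", "lukas", "sara"],
        "surname": ["smith", "smith", "jones"],
    }
)

# Minimal settings - a real model needs blocking rules and comparisons
# chosen to suit your data
settings = {
    "link_type": "dedupe_only",
    "blocking_rules_to_generate_predictions": ["l.surname = r.surname"],
    "comparisons": [
        cl.levenshtein_at_thresholds("first_name", 2),
        cl.exact_match("surname"),
    ],
}

# No cluster, database server or extra configuration required
linker = DuckDBLinker(df, settings)
```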

### :simple-apachespark: Spark

Spark is a system for distributed computing which is great for large datasets (10-100+ million records). It is more involved in terms of configuration, requiring more boilerplate code than DuckDB. Spark has complete coverage for the functions in the Splink [comparison libraries](../comparison_level_library.md) and, as one of the Splink development team's core backends, is actively maintained with features being added regularly.

If you are working in Databricks, the Spark backend is recommended. However, as the Splink development team does not have access to a Databricks environment, there will be instances where we are unable to provide support.

See the Spark [deduplication example notebook](../demos/example_simple_pyspark.ipynb) to get a better idea of how Splink works with Spark.
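
To illustrate the additional boilerplate, here is a sketch of setting up the Spark backend, assuming the Splink 3 API (`SparkLinker` and the `splink.spark` comparison library); the Spark session, data and settings are placeholders.

```python
from pyspark.sql import SparkSession

from splink.spark.linker import SparkLinker
import splink.spark.comparison_library as cl

# Obtaining a Spark session is itself part of the boilerplate; on a real
# cluster this would carry executor, memory and partitioning configuration
spark = SparkSession.builder.appName("splink_example").getOrCreate()

df = spark.createDataFrame(
    [(1, "smith"), (2, "smith"), (3, "jones")],
    ["unique_id", "surname"],
)

settings = {
    "link_type": "dedupe_only",
    "blocking_rules_to_generate_predictions": ["l.surname = r.surname"],
    "comparisons": [cl.exact_match("surname")],
}

# SparkLinker also accepts Spark-specific tuning arguments (e.g. for
# repartitioning and breaking lineage) - see the Splink docs for details
linker = SparkLinker(df, settings)
```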

### :simple-amazonaws: Athena

Athena is a big data SQL backend provided on AWS which is great for large datasets (10+ million records). It requires access to a live AWS account and, as a persistent database, requires some additional management of the tables created by Splink. Athena has reasonable, but not complete, coverage for the functions in the Splink [comparison libraries](../comparison_level_library.md), with gaps in fuzzy string matching functionality due to the lack of some string functions in Athena's underlying SQL engine, [Presto](https://prestodb.io/docs/current/). At this time, the Athena backend is not being actively used by the Splink development team, so it receives minimal levels of support.

In addition, from a development perspective, the necessity for an AWS connection makes testing Athena code more difficult, so there may be occasional bugs that would normally be caught by our testing framework.

See the Athena [deduplication example notebook](../demos/athena_deduplicate_50k_synthetic.ipynb) to get a better idea of how Splink works with Athena.
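
The sketch below illustrates the extra AWS wiring involved. It is indicative only: the argument names (`boto3_session`, `output_database`, `output_bucket`) reflect our recollection of the Splink 3 `AthenaLinker` API, and the region, database, table and bucket names are placeholders.

```python
import boto3

from splink.athena.linker import AthenaLinker
import splink.athena.comparison_library as cl

# A live AWS session is required; region and credentials are placeholders
boto3_session = boto3.Session(region_name="eu-west-1")

settings = {
    "link_type": "dedupe_only",
    "blocking_rules_to_generate_predictions": ["l.surname = r.surname"],
    "comparisons": [cl.exact_match("surname")],
}

# Splink writes its intermediate and output tables to the database and
# bucket below, which you are responsible for managing and cleaning up
linker = AthenaLinker(
    input_table_or_tables="my_database.my_input_table",
    settings_dict=settings,
    boto3_session=boto3_session,
    output_database="my_splink_database",
    output_bucket="my-splink-bucket",
)
```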

### :simple-sqlite: SQLite

SQLite is similar to DuckDB in that it is, generally, more suited to smaller datasets. While not as performant as DuckDB, SQLite is simple to set up and can even be run directly in a Jupyter notebook. SQLite has reasonable, but not complete, coverage for the functions in the Splink [comparison libraries](../comparison_level_library.md), with gaps in array and date comparisons. Fuzzy string matching, while not native to SQLite, is available via Python UDFs, which has some [performance implications](#additional-information-for-specific-backends). SQLite is not actively being used by the Splink team, so it receives minimal levels of support.
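
For illustration, here is a sketch of setting up the SQLite backend, including registering a Python UDF for fuzzy string matching. It assumes the Splink 3 API (`SQLiteLinker` with a `connection` argument) and that the `rapidfuzz` package is installed; the table name is a placeholder.

```python
import sqlite3

from rapidfuzz.distance.Levenshtein import distance

from splink.sqlite.linker import SQLiteLinker
import splink.sqlite.comparison_library as cl

con = sqlite3.connect(":memory:")

# SQLite has no native fuzzy string matching, so register a Python UDF;
# this is the source of the performance caveat mentioned above
con.create_function("levenshtein", 2, distance)

# Assumes your data has already been loaded into a table called
# "input_table", e.g. via pandas' DataFrame.to_sql
settings = {
    "link_type": "dedupe_only",
    "comparisons": [cl.exact_match("surname")],
}

linker = SQLiteLinker("input_table", settings, connection=con)
```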

### :simple-postgresql: PostgreSQL

PostgreSQL support is a relatively new addition to Splink, so we have not fully tested its performance or what size of datasets can be processed. The Postgres backend requires a Postgres database, so it is recommended to use this backend only if you are working with a pre-existing Postgres database. Postgres has reasonable, but not complete, coverage for the functions in the Splink [comparison libraries](../comparison_level_library.md), with gaps in fuzzy string matching functionality due to the lack of some string functions in Postgres. At this time, the Postgres backend is not being actively used by the Splink development team, so it receives minimal levels of support.
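
A sketch of pointing Splink at a pre-existing Postgres database is shown below. It is indicative only: the `engine` argument name reflects our recollection of the Splink 3 `PostgresLinker` API, and the connection string and table name are placeholders.

```python
from sqlalchemy import create_engine

from splink.postgres.linker import PostgresLinker
import splink.postgres.comparison_library as cl

# Placeholder connection string for a pre-existing Postgres database
engine = create_engine("postgresql+psycopg2://user:password@host:5432/db")

settings = {
    "link_type": "dedupe_only",
    "comparisons": [cl.exact_match("surname")],
}

# "input_table" is assumed to already exist in the database
linker = PostgresLinker("input_table", settings, engine=engine)
```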


## Using your chosen backend

Import the linker from the backend of your choosing, and the backend-specific comparison libraries.

Once you have initialised the `linker` object, there is no difference in the subsequent code between backends.

Note, however, that not all comparison functions are available in all backends.
There are tables detailing the available functions for each backend on
the [comparison library API page](../comparison_library.html) and the [comparison level library API page](../comparison_level_library.html).

=== ":simple-duckdb: DuckDB"

    ```python
    from splink.duckdb.linker import DuckDBLinker
    import splink.duckdb.comparison_library as cl
    import splink.duckdb.comparison_level_library as cll

    # your_args: your input dataframe(s) and a settings dictionary
    linker = DuckDBLinker(your_args)
    ```

## Additional Information for specific backends

### :simple-sqlite: SQLite
