Initial schema and testing (#3)
* sql linting/formatting with precommit

* dbmate testing with docker

* update pg docker to have pg_partman extension

* move pg Dockerfile and add pgTap for testing

* pg docker updates

- add pg_prove
- add dbmate
- refactor Dockerfile
- mount project in pg container for tools
- turn off pg password requirement

* base schema in place with (minimal) tests

* locking fixes

* split thread insert and update into two triggers

* add database readme

* Updating readme ref

* changes based on PR feedback

* dbmate-tupac: a dbmate wrapper for better schemas (#5)

* dbmate-tupac: a dbmate wrapper for better schemas

I have a problem with dbmate's schema handling. It will dump the entire
database "schema only" into a file, and expects that file to be useful
for reference. But much of the file can end up reflecting changes caused by
things outside the current project's concerns.

For example, enabling an extension that creates functions or tables in the
database will cause those to show up in the schema file. The same goes for
things like partitions that change over time: even if nothing changes in the
migrations, the dumped schema can show differences.

The "schema only" dump also excludes data that is effectively part of
the schema itself. For example, the `thread_state` table defines the set
of states that can be used for the `thread.status` field, and those rows
enforce the set of possible `thread.status` values. In effect, these
"data" rows are as important as the info in a "schema only" dump for
ensuring the database has the information it needs to work as expected.

As a result, it seems that the schema handling of dbmate does not give
us what we want, so I created this wrapper implementing a new `verify`
command that I think makes more sense. It does the following:

  - creates the database from the reference schema
  - dumps the database
  - drops the database
  - creates the database and applies all migrations
  - dumps the database
  - drops the database
  - diffs the dumps

If the diff shows no changes then the migrations and the reference
schema are aligned.

Functionally, this means that dbmate commands no longer dump the schema
into `schema.sql`, and we end up maintaining that file ourselves as the
true schema the migrations must match.

This means that some changes will need to be made both in a migration
and in the reference schema. This is additional work. But it gives us a
way to actually enforce that the schema is an effective reference for
what we want to have in the database. In other words, this gives us a
reference schema that is not subject to the limitations of the "schema
only" dump supported by dbmate.

* add test command to dbmate-tupac

* fixup pg dockerfile so dbmate is dbmate-tupac

* update database readme

* update database readme with additional context

---------

Co-authored-by: Matt Bialas <mbialas@element84.com>
jkeifer and MattBialas authored Apr 20, 2023
1 parent 5a51fa4 commit 3eff2b7
Showing 14 changed files with 1,743 additions and 231 deletions.
7 changes: 7 additions & 0 deletions .env
@@ -0,0 +1,7 @@
DATABASE_HOST="localhost"
DATABASE_PORT="5432"
DATABASE_USER="postgres"
DATABASE_PASS="password"
DATABASE_NAME="swoop"
DBMATE_MIGRATIONS_TABLE="swoop.schema_migrations"
DATABASE_URL="postgres://${DATABASE_USER}:${DATABASE_PASS}@${DATABASE_HOST}:${DATABASE_PORT}/${DATABASE_NAME}?sslmode=disable"
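`DATABASE_URL` composes the other variables via shell parameter expansion, so the file must be read by a shell (e.g. `source .env`) with those assignments in scope. A quick sanity check of the expansion, with values copied from the file above:

```shell
# reproduce the .env expansion to confirm the composed URL
DATABASE_HOST="localhost"
DATABASE_PORT="5432"
DATABASE_USER="postgres"
DATABASE_PASS="password"
DATABASE_NAME="swoop"
DATABASE_URL="postgres://${DATABASE_USER}:${DATABASE_PASS}@${DATABASE_HOST}:${DATABASE_PORT}/${DATABASE_NAME}?sslmode=disable"
echo "$DATABASE_URL"
# → postgres://postgres:password@localhost:5432/swoop?sslmode=disable
```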
13 changes: 13 additions & 0 deletions .pre-commit-config.yaml
@@ -49,6 +49,19 @@ repos:
language: system
types: [text]
stages: [commit, push, manual]
- id: sqlfluff-fix
name: sqlfluff-fix
# Set a couple of default flags:
# - `--force` to disable confirmation
# - `--show-lint-violations` shows issues to not require running `sqlfluff lint`
# - `--processes 0` to use maximum parallelism
# By default, this hook applies all rules.
entry: sqlfluff fix --force --show-lint-violations --processes 0
language: python
description: "Fixes sql lint errors with `SQLFluff`"
types: [sql]
require_serial: true
exclude: ^db/schema.sql
# Currently disabled due to incomplete typing
#- id: mypy
# name: mypy
20 changes: 20 additions & 0 deletions .sqlfluff
@@ -0,0 +1,20 @@
[sqlfluff]
dialect = postgres
exclude_rules = ST07

[sqlfluff:indentation]
tab_space_size = 2
indented_joins = True

[sqlfluff:rules:capitalisation.identifiers]
extended_capitalisation_policy = lower

[sqlfluff:rules:capitalisation.functions]
extended_capitalisation_policy = lower

[sqlfluff:rules:capitalisation.types]
extended_capitalisation_policy = lower

[sqlfluff:rules:layout.long_lines]
ignore_comment_lines = True
ignore_comment_clauses = True
19 changes: 1 addition & 18 deletions README.md
@@ -10,24 +10,7 @@ brew install dbmate

## Database Setup / Migrations

The DB schema and migrations are managed by [Dbmate](https://github.com/amacneil/dbmate#commands).

Existing migrations can be found in: `/db/migrations/`
<br><br>
### Database setup:

Create a `.env` file (specifying a `user`, `password`, `port`):
```
touch .env
echo "DATABASE_URL=\"postgres://{user}:{password}@127.0.0.1:{port}/swoop?sslmode=disable\"" >> .env2
```

Create the database and tables:
```
dbmate up
```
<br>
Instructions for this can be found in the Database README, found at: `/db/README.md`

## Environment Setup and Testing

32 changes: 32 additions & 0 deletions db/Dockerfile
@@ -0,0 +1,32 @@
FROM postgres:15

# May not be ideal, I think it might be upgrading the pg version
# and inflating the image size, but it works for testing.

# install build deps and pg_partman
RUN set -x && \
apt-get update && \
apt-get install -y postgresql-15-partman curl make patch perl && \
apt-get clean -y && \
rm -r /var/lib/apt/lists/*

# install pg_prove
RUN cpan TAP::Parser::SourceHandler::pgTAP


# install dbmate and wrapper
COPY dbmate /usr/local/bin/dbmate
RUN set -x && \
DBMATE="/usr/local/bin/_dbmate" && \
# map `uname -m` output to dbmate's release arch names (amd64/arm64)
ARCH="$(case "$(uname -m)" in aarch64) echo "arm64" ;; x86_64) echo "amd64" ;; *) uname -m ;; esac)" && \
curl -fsSL -o "$DBMATE" https://github.com/amacneil/dbmate/releases/download/v2.2.0/dbmate-linux-$ARCH && \
chmod +x "$DBMATE"

# install pgtap
RUN set -x && \
tmp="$(mktemp -d)" && \
trap "rm -rf '$tmp'" EXIT && \
cd "$tmp" && \
curl -fsSL https://github.com/theory/pgtap/archive/refs/tags/v1.2.0.tar.gz -o pgtap.tar.gz && \
tar -xzf pgtap.tar.gz --strip-components 1 && \
make install
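dbmate's Linux release binaries are named with Go-style architectures (`amd64`, `arm64`), while `uname -m` reports `x86_64` or `aarch64`, so the Dockerfile needs a translation step. A small sketch of that mapping as a standalone helper (assuming dbmate publishes `dbmate-linux-amd64` and `dbmate-linux-arm64` assets, as its releases page shows):

```shell
# Map `uname -m` output to dbmate's Go-style release architecture names.
arch_to_dbmate() {
  case "$1" in
    x86_64 | amd64)  echo "amd64" ;;
    aarch64 | arm64) echo "arm64" ;;
    *) echo "$1" ;;  # pass anything else through; an unknown arch fails at download
  esac
}

amd="$(arch_to_dbmate x86_64)"
arm="$(arch_to_dbmate aarch64)"
echo "$amd $arm"
# → amd64 arm64
```

Passing unknown values through (rather than defaulting) keeps a bad architecture visible as a loud 404 at `curl` time instead of silently installing the wrong binary.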
171 changes: 171 additions & 0 deletions db/README.md
@@ -0,0 +1,171 @@
# `swoop-db`

The swoop database schema is managed via `dbmate`. `dbmate` will apply all
migration files found in the `./migrations` directory. We wrap the [upstream
dbmate tool](https://github.com/amacneil/dbmate) with a bash script
(`./dbmate`) to enforce a custom workflow where `./schema.sql` is not a "schema
only" dump made after applying migrations but is instead a "manually-created
reference schema" not generated by any `dbmate` commands.

The `dbmate` help is overridden by the wrapper script and includes details
about what is overridden and how. The main things of note:

* `./schema.sql` is not automatically dumped by `dbmate` commands
* `create` has an option to load a schema into the database
* `verify` is a new command that will check differences between `./schema.sql`
and the schema generated by applying all of the migration files
* `test` is a new command (specific to swoop) to run the database tests

The `dbmate` wrapper can be run from the local OS against the postgres docker
container, but is typically easier to run within the container itself. See
details below.

### What is a "schema"?

Some have suggested that "schema" is an overloaded term, so it makes sense to
define more precisely what it means here.

In the context of postgres tooling like `pg_dump`, a schema is the structure of
the database, including any types, functions, or procedures (among,
potentially, many other things). With this definition it is perhaps easier to
say what a schema does not include: table rows. Thus, a "schema" in this view
is everything but the data.

This perspective ignores the fact that some data may actually be part of the
structure of the database and required for proper database operations. Such
data should be considered different than data inserted by applications using
the database, and would therefore be considered an aspect of the schema we need
to track.

Moreover, some "schema" is actually generated by other commands, such that
the set of sql commands required to reproduce a given "schema" may be
much more limited than the result of running those commands. In a sense this
is like any build process: a minimal set of source artifacts produces much
more output than went in once built into the output artifacts. In such a case,
the accepted best practice is to track only that minimal set of inputs, as the
rest can be regenerated at build time.

A clear example of this could be something like enabling a database extension.
The command to do so would be something like `CREATE EXTENSION
<extension_name>;`. When running a schema-only dump with `pg_dump`, all tables,
types, and other non-data items created in the database from that command would
be present in the output. But from our perspective as application developers, we
don't really care what the extension created, we just care to know we need to
run `CREATE EXTENSION <extension_name>;`. Therefore we should only track that
single command in our schema.

In these ways the schema we track in `./schema.sql` differs from what one
gets by running `pg_dump` with a schema-only export, and from what `dbmate`
would dump into that file. But the content we end up with in `./schema.sql`
is much more useful for our purposes.

### Why wrap `dbmate`?

The above difference in schema definitions is the reason. For more on the idea
and intent behind the dbmate wrapper, see [this dbmate
discussion](https://github.com/amacneil/dbmate/discussions/433). The choice to
use a wrapper here is simply a pragmatic one; long-term, either merging this
behavior into `dbmate` or creating a dedicated cli tool that uses `dbmate` as
a library would be preferred.

### What does it mean if the schema and migrations are out of sync?

The `./schema.sql` file represents what we want a new instance of the database
to look like, whereas the migrations capture what operations need to be done
to update an existing database from an older schema state to a new one.
Therefore, when we want to make changes to `./schema.sql`, we need
corresponding migration(s) to update existing databases with the older state.

Or, if we approach this from the other way around: if we make a migration to
make changes to existing databases, we also need to update the `./schema.sql`
in a corresponding manner.

In the event that changes are made to `./schema.sql` without a migration also
making those changes, or if we have a migration and fail to update
`./schema.sql`, then the schema and the migrations are out of sync. The
`verify` command added by the `dbmate` wrapper is used to detect this condition
and will provide a diff to help resolve any inconsistencies.

## Extensions

`swoop-db` makes use of two postgres extensions:

* `pg_partman`: an automated table partition manager
* `pgtap`: a postgres-native testing framework

## Database testing with docker

`./Dockerfile` defines the build steps for a database test container. The
container includes the requisite postgres extensions and any other required
utilities like `dbmate` and `pg_prove`. Because the Dockerfile builds an image
with all the database dependencies pinned to fixed versions, using docker with
that image is strongly recommended for all testing, to help guarantee
consistency between developers (running postgres another way is fine if
desired, but requires that the necessary extensions and utilities are
installed and that the connection information is correctly configured for the
tooling).

To make using the docker container more convenient, a `docker-compose.yml` file
is provided in the project root. The repo contents are mounted as `/swoop`
inside the container to help facilitate database operations and testing using
the included utilities. For example, to bring up the database and run the
tests:

```shell
# load the .env vars
source .env

# bring up the database container in the background
# --build forces rebuild of the container in case changes have been made
# -V recreates any volumes instead of reusing data
# -d runs the composed services in detached mode rather than in the foreground
docker compose up --build -V -d

# create the database and apply all migrations
docker compose exec postgres dbmate up

# run the database tests
docker compose exec postgres dbmate test db/tests/

# connect to the database with psql
docker compose exec postgres psql -U postgres swoop
```

To verify the schema and migrations match:

```shell
# drop an existing database to start clean
docker compose exec postgres dbmate drop

# run the verification; any diff indicates schema/migrations out-of-sync
docker compose exec postgres dbmate verify
```

To stop the `compose`d container(s):

```shell
docker compose down
```

### Adding a migration

Use `dbmate` if needing to create a new migration file:

```shell
docker compose exec postgres dbmate new <migration_name>
```

### Adding database tests

Database tests should be added as `.sql` files in the `./tests` directory.
Follow the pattern of the existing test files. It's best to keep each file
short and focused with a descriptive name. For more about the `pgtap` test
framework see [the documentation](https://pgtap.org/documentation.html).

## pre-commit hooks related to the database

We use `sqlfluff` for linting sql. See the root `.sqlfluff` config file and the
command defined in the `.pre-commit-config.yaml` for more information. Note
that the tool is a bit slow and somewhat inaccurate at times; it is better than
nothing but we should not hesitate to replace it with a better option if one
becomes available.