Initial schema and testing (#3)
* sql linting/formatting with precommit

* dbmate testing with docker

* update pg docker to have pg_partman extension

* move pg Dockerfile and add pgTap for testing

* pg docker updates

- add pg_prove
- add dbmate
- refactor Dockerfile
- mount project in pg container for tools
- turn off pg password requirement

* base schema in place with (minimal) tests

* locking fixes

* split thread insert and update into two triggers

* add database readme

* Updating readme ref

* changes based on PR feedback

* dbmate-tupac: a dbmate wrapper for better schemas (#5)

* dbmate-tupac: a dbmate wrapper for better schemas

I have a problem with dbmate's schema handling. It will dump the entire
database "schema only" into a file, and expects that file to be useful
for reference. But much of the file can end up reflecting changes caused by
things outside the current project's concerns.

For example, enabling an extension that creates functions or tables in the
database will cause those to show up in the schema file. The same goes for
things like partitions that change over time: even if nothing changes in the
migrations, the dumped schema can show differences.

The "schema only" dump also excludes data that is effectively part of
the schema itself. For example, the `thread_state` table defines the set
of states that can be used for the `thread.status` field, and those rows
enforce the set of possible `thread.status` values. In effect, these
"data" rows are as important as the info in a "schema only" dump for
ensuring the database has the information it needs to work as expected.

As a result, it seems that the schema handling of dbmate does not give
us what we want, so I created this wrapper implementing a new `verify`
command that I think makes more sense. It does the following:

  - creates the database from the reference schema
  - dumps the database
  - drops the database
  - creates the database and applies all migrations
  - dumps the database
  - drops the database
  - diffs the dumps

If the diff shows no changes then the migrations and the reference
schema are aligned.

Functionally, this means that dbmate commands no longer dump the schema
into `schema.sql`, and we end up maintaining that file ourselves as the
true schema the migrations must match.

This means that some changes will need to be made both in a migration
and in the reference schema. This is additional work. But it gives us a
way to actually enforce that the schema is an effective reference for
what we want to have in the database. In other words, this gives us a
reference schema that is not subject to the limitations of the "schema
only" dump supported by dbmate.

* add test command to dbmate-tupac

* fixup pg dockerfile so dbmate is dbmate-tupac

* update database readme

* update database readme with additional context

---------

Co-authored-by: Matt Bialas <mbialas@element84.com>
jkeifer and MattBialas authored Apr 20, 2023
1 parent 5a51fa4 commit 3eff2b7
Showing 14 changed files with 1,743 additions and 231 deletions.
7 changes: 7 additions & 0 deletions .env
@@ -0,0 +1,7 @@
DATABASE_HOST="localhost"
DATABASE_PORT="5432"
DATABASE_USER="postgres"
DATABASE_PASS="password"
DATABASE_NAME="swoop"
DBMATE_MIGRATIONS_TABLE="swoop.schema_migrations"
DATABASE_URL="postgres://${DATABASE_USER}:${DATABASE_PASS}@${DATABASE_HOST}:${DATABASE_PORT}/${DATABASE_NAME}?sslmode=disable"
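`DATABASE_URL` composes the other variables via shell parameter expansion, so the file must be read by a shell (e.g. `source .env`) with those assignments in scope. A quick sanity check of the expansion, with values copied from the file above:

```shell
# reproduce the .env expansion to confirm the composed URL
DATABASE_HOST="localhost"
DATABASE_PORT="5432"
DATABASE_USER="postgres"
DATABASE_PASS="password"
DATABASE_NAME="swoop"
DATABASE_URL="postgres://${DATABASE_USER}:${DATABASE_PASS}@${DATABASE_HOST}:${DATABASE_PORT}/${DATABASE_NAME}?sslmode=disable"
echo "$DATABASE_URL"
# → postgres://postgres:password@localhost:5432/swoop?sslmode=disable
```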
13 changes: 13 additions & 0 deletions .pre-commit-config.yaml
@@ -49,6 +49,19 @@ repos:
language: system
types: [text]
stages: [commit, push, manual]
- id: sqlfluff-fix
name: sqlfluff-fix
# Set a couple of default flags:
# - `--force` to disable confirmation
# - `--show-lint-violations` shows issues to not require running `sqlfluff lint`
# - `--processes 0` to use maximum parallelism
# By default, this hook applies all rules.
entry: sqlfluff fix --force --show-lint-violations --processes 0
language: python
description: "Fixes sql lint errors with `SQLFluff`"
types: [sql]
require_serial: true
exclude: ^db/schema.sql
# Currently disabled due to incomplete typing
#- id: mypy
# name: mypy
20 changes: 20 additions & 0 deletions .sqlfluff
@@ -0,0 +1,20 @@
[sqlfluff]
dialect = postgres
exclude_rules = ST07

[sqlfluff:indentation]
tab_space_size = 2
indented_joins = True

[sqlfluff:rules:capitalisation.identifiers]
extended_capitalisation_policy = lower

[sqlfluff:rules:capitalisation.functions]
extended_capitalisation_policy = lower

[sqlfluff:rules:capitalisation.types]
extended_capitalisation_policy = lower

[sqlfluff:rules:layout.long_lines]
ignore_comment_lines = True
ignore_comment_clauses = True
19 changes: 1 addition & 18 deletions README.md
@@ -10,24 +10,7 @@ brew install dbmate

## Database Setup / Migrations

The DB schema and migrations are managed by [Dbmate](https://github.com/amacneil/dbmate#commands).

Existing migrations can be found in: `/db/migrations/`
<br><br>
### Database setup:

Create a `.env` file (specifying a `user`, `password`, `port`):
```
touch .env
echo "DATABASE_URL=\"postgres://{user}:{password}@127.0.0.1:{port}/swoop?sslmode=disable\"" >> .env2
```

Create the database and tables:
```
dbmate up
```
<br>
Instructions for this can be found in the Database README, found at: `/db/README.md`

## Environment Setup and Testing

32 changes: 32 additions & 0 deletions db/Dockerfile
@@ -0,0 +1,32 @@
FROM postgres:15

# May not be ideal, I think it might be upgrading the pg version
# and inflating the image size, but it works for testing.

# install build deps and pg_partman
RUN set -x && \
apt-get update && \
apt-get install -y postgresql-15-partman curl make patch perl && \
apt-get clean -y && \
rm -r /var/lib/apt/lists/*

# install pg_prove
RUN cpan TAP::Parser::SourceHandler::pgTAP


# install dbmate and wrapper
COPY dbmate /usr/local/bin/dbmate
RUN set -x && \
DBMATE="/usr/local/bin/_dbmate" && \
# map `uname -m` output to dbmate's release arch names (amd64/arm64)
ARCH="$(case "$(uname -m)" in aarch64) echo "arm64" ;; x86_64) echo "amd64" ;; *) uname -m ;; esac)" && \
curl -fsSL -o "$DBMATE" https://github.com/amacneil/dbmate/releases/download/v2.2.0/dbmate-linux-$ARCH && \
chmod +x "$DBMATE"

# install pgtap
RUN set -x && \
tmp="$(mktemp -d)" && \
trap "rm -rf '$tmp'" EXIT && \
cd "$tmp" && \
curl -fsSL https://github.com/theory/pgtap/archive/refs/tags/v1.2.0.tar.gz -o pgtap.tar.gz && \
tar -xzf pgtap.tar.gz --strip-components 1 && \
make install
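dbmate's Linux release binaries are named with Go-style architectures (`amd64`, `arm64`), while `uname -m` reports `x86_64` or `aarch64`, so the Dockerfile needs a translation step. A small sketch of that mapping as a standalone helper (assuming dbmate publishes `dbmate-linux-amd64` and `dbmate-linux-arm64` assets, as its releases page shows):

```shell
# Map `uname -m` output to dbmate's Go-style release architecture names.
arch_to_dbmate() {
  case "$1" in
    x86_64 | amd64)  echo "amd64" ;;
    aarch64 | arm64) echo "arm64" ;;
    *) echo "$1" ;;  # pass anything else through; an unknown arch fails at download
  esac
}

amd="$(arch_to_dbmate x86_64)"
arm="$(arch_to_dbmate aarch64)"
echo "$amd $arm"
# → amd64 arm64
```

Passing unknown values through (rather than defaulting) keeps a bad architecture visible as a loud 404 at `curl` time instead of silently installing the wrong binary.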
171 changes: 171 additions & 0 deletions db/README.md
@@ -0,0 +1,171 @@
# `swoop-db`

The swoop database schema is managed via `dbmate`. `dbmate` will apply all
migration files found in the `./migrations` directory. We wrap the [upstream
dbmate tool](https://github.com/amacneil/dbmate) with a bash script
(`./dbmate`) to enforce a custom workflow where `./schema.sql` is not a "schema
only" dump made after applying migrations but is instead a "manually-created
reference schema" not generated by any `dbmate` commands.

The `dbmate` help is overridden by the wrapper script and includes details
about what is overridden and how. The main things of note:

* `./schema.sql` is not automatically dumped by `dbmate` commands
* `create` has an option to load a schema into the database
* `verify` is a new command that will check differences between `./schema.sql`
and the schema generated by applying all of the migration files
* `test` is a new command (specific to swoop) to run the database tests

The `dbmate` wrapper can be run from the local OS against the postgres docker
container, but is typically easier to run within the container itself. See
details below.

### What is a "schema"?

Some have suggested that "schema" is an overloaded term, so it makes sense to
define more precisely what it means here.

In the context of postgres tooling like `pg_dump`, a schema is the structure of
the database, including any types, functions, or procedures (among,
potentially, many other things). With this definition it is perhaps easier to
say what a schema does not include: table rows. Thus, a "schema" in this view
is everything but the data.

This perspective ignores the fact that some data may actually be part of the
structure of the database and required for proper database operations. Such
data should be considered different than data inserted by applications using
the database, and would therefore be considered an aspect of the schema we need
to track.

Moreover, some "schema" is actually generated by other commands, such that
the set of sql commands required to reproduce a given "schema" may be
much more limited than the result of running those commands. In a sense this
is like any build process: a minimal set of source artifacts produces much
more output than went in once built into the output artifacts. In such a case,
the accepted best practice is to track only that minimal set of inputs, as the
rest can be regenerated at build time.

A clear example of this could be something like enabling a database extension.
The command to do so would be something like `CREATE EXTENSION
<extension_name>;`. When running a schema-only dump with `pg_dump`, all tables,
types, and other non-data items created in the database from that command would
be present in the output. But from our perspective as application developers, we
don't really care what the extension created, we just care to know we need to
run `CREATE EXTENSION <extension_name>;`. Therefore we should only track that
single command in our schema.

In these ways the schema we track in `./schema.sql` differs from what one
gets by running `pg_dump` with a schema-only export, and from what `dbmate`
would dump into that file. But the content we end up with in `./schema.sql`
is much more useful for our purposes.

### Why wrap `dbmate`?

The above difference in schema definitions is the reason. For more on the idea
and intent behind the dbmate wrapper, see [this dbmate
discussion](https://github.com/amacneil/dbmate/discussions/433). The choice to
use a wrapper here is simply a pragmatic one; long-term, either merging this
behavior into `dbmate` or creating a dedicated cli tool that uses `dbmate` as
a library would be preferred.

### What does it mean if the schema and migrations are out of sync?

The `./schema.sql` file represents what we want a new instance of the database
to look like, whereas the migrations capture what operations need to be done
to update an existing database from an older schema state to a new one.
Therefore, when we want to make changes to `./schema.sql`, we need
corresponding migration(s) to update existing databases with the older state.

Or, if we approach this from the other way around: if we make a migration to
make changes to existing databases, we also need to update the `./schema.sql`
in a corresponding manner.

In the event that changes are made to `./schema.sql` without a migration also
making those changes, or if we have a migration and fail to update
`./schema.sql`, then the schema and the migrations are out of sync. The
`verify` command added by the `dbmate` wrapper is used to detect this condition
and will provide a diff to help resolve any inconsistencies.

## Extensions

`swoop-db` makes use of two postgres extensions:

* `pg_partman`: an automated table partition manager
* `pgtap`: a postgres-native testing framework

## Database testing with docker

`./Dockerfile` defines the build steps for a database test container. The
container includes the requisite postgres extensions and any other required
utilities like `dbmate` and `pg_prove`. Because the Dockerfile builds an image
with all the database dependencies pinned to fixed versions, using docker with
that image is strongly recommended for all testing, to help guarantee
consistency between developers (running postgres another way is fine if
desired, but requires that the necessary extensions and utilities are
installed and that the connection information is correctly configured for the
tooling).

To make using the docker container more convenient, a `docker-compose.yml` file
is provided in the project root. The repo contents are mounted as `/swoop`
inside the container to help facilitate database operations and testing using
the included utilities. For example, to bring up the database and run the
tests:

```shell
# load the .env vars
source .env

# bring up the database container in the background
# --build forces rebuild of the container in case changes have been made
# -V recreates any volumes instead of reusing data
# -d runs the composed services in detached mode rather than in the foreground
docker compose up --build -V -d

# create the database and apply all migrations
docker compose exec postgres dbmate up

# run the database tests
docker compose exec postgres dbmate test db/tests/

# connect to the database with psql
docker compose exec postgres psql -U postgres swoop
```

To verify the schema and migrations match:

```shell
# drop an existing database to start clean
docker compose exec postgres dbmate drop

# run the verification; any diff indicates schema/migrations out-of-sync
docker compose exec postgres dbmate verify
```

To stop the `compose`d container(s):

```shell
docker compose down
```

### Adding a migration

Use `dbmate` if needing to create a new migration file:

```shell
docker compose exec postgres dbmate new <migration_name>
```

### Adding database tests

Database tests should be added as `.sql` files in the `./tests` directory.
Follow the pattern of the existing test files. It's best to keep each file
short and focused with a descriptive name. For more about the `pgtap` test
framework see [the documentation](https://pgtap.org/documentation.html).

## pre-commit hooks related to the database

We use `sqlfluff` for linting sql. See the root `.sqlfluff` config file and the
command defined in the `.pre-commit-config.yaml` for more information. Note
that the tool is a bit slow and somewhat inaccurate at times; it is better than
nothing but we should not hesitate to replace it with a better option if one
becomes available.