
Feature/coturn april update #365

Merged: 18 commits from feature/coturn-april-update into master on Apr 22, 2020
Conversation

@gfodor (Contributor) commented Apr 22, 2020

This PR deals first with some issues related to shutdown:

First, the graceful connection draining logic broke with Phoenix 1.4 due to unsupported behavior in the the_end dependency. We now use a fork that has the necessary one-line fix. (The upstream project seems dead.)

Next, when we start reticulum on Hubs Cloud nodes, it starts from the burned-in AMI version. However, if there is a newer version, Habitat will SIGTERM that process and then install and run the new version. This is problematic when a new system is coming online, since the process may receive the SIGTERM during database migrations and leave the database in an inconsistent state. (Ecto properly runs each migration in a transaction unless we opt out of it, but it sadly does not update the schema versions table within that same transaction, so a poorly timed SIGTERM can leave the database in a state where a migration could be attempted a second time, which breaks things.)
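
For reference, opting out of that per-migration transaction looks like this in Ecto (a minimal sketch; the module and index below are illustrative, not part of this PR):

    defmodule Ret.Repo.Migrations.ExampleConcurrentIndex do
      use Ecto.Migration

      # Opt out of the wrapping transaction (and the migration lock), e.g. for
      # CREATE INDEX CONCURRENTLY, which Postgres refuses to run inside a transaction.
      @disable_ddl_transaction true
      @disable_migration_lock true

      def change do
        create index(:hubs, [:hub_sid], concurrently: true)
      end
    end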

This adds a new SIGTERM handler which is temporarily registered while migrations run during application startup. If the signal is seen during this window, it is ignored and the application shuts down once the migrations have finished. If no signal was seen, the default signal handler is re-registered after migrations complete, so there are no long-term side effects from this temporary override.
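
A minimal sketch of the shape such a handler can take (the actual Ret.DelayStopSignalHandler in this PR may differ in its details):

    defmodule Ret.DelayStopSignalHandler do
      # A :gen_event handler swapped onto :erl_signal_server before migrations
      # run. It swallows SIGTERM and remembers it; allow_stop/0 restores the
      # default handler and honors any deferred stop.
      @behaviour :gen_event

      def install do
        :ok =
          :gen_event.swap_sup_handler(
            :erl_signal_server,
            {:erl_signal_handler, []},
            {__MODULE__, []}
          )
      end

      def allow_stop do
        stop_requested = :gen_event.call(:erl_signal_server, __MODULE__, :stop_requested?)

        :ok =
          :gen_event.swap_sup_handler(
            :erl_signal_server,
            {__MODULE__, []},
            {:erl_signal_handler, []}
          )

        if stop_requested, do: :init.stop()
        :ok
      end

      @impl true
      def init(_), do: {:ok, %{stop_requested: false}}

      @impl true
      def handle_event(:sigterm, state) do
        # Ignore SIGTERM for now; allow_stop/0 will act on it after migrations.
        {:ok, %{state | stop_requested: true}}
      end

      def handle_event(_event, state), do: {:ok, state}

      @impl true
      def handle_call(:stop_requested?, state), do: {:ok, state.stop_requested, state}
    end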

Also in Hubs Cloud, another common failure: because repo migrations on AWS run over a transaction-scoped pgbouncer connection, it was easy for the search_path to be lost during migrations (which forced us to be explicit about ret0 everywhere it was not being injected automatically by the migration runner). This is now fixed: there is a separate SessionLockRepo that can be used to run SQL over a session-scoped (instead of transaction-scoped) pgbouncer connection wherever it is necessary to retain connection settings across several statements. Migrations now use it.
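
Roughly, the idea looks like this (the hostname, port, and config values below are illustrative assumptions, not the PR's actual config):

    # A second repo, identical to Ret.Repo except that its connection goes
    # through a session-pooled pgbouncer listener, so settings such as
    # search_path persist across statements.
    defmodule Ret.SessionLockRepo do
      use Ecto.Repo, otp_app: :ret, adapter: Ecto.Adapters.Postgres
    end

    # In config (hypothetical values):
    config :ret, Ret.SessionLockRepo,
      hostname: "pgbouncer.internal",
      port: 6433, # a listener with pool_mode = session
      database: "ret_production",
      pool_size: 1 # migrations run serially; one session is enough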

Drops the unneeded coturn user from our migration chain - it ended up being unnecessary and clunky to deal with in Hubs Cloud. Since this migration already ran on some Hubs Cloud servers (without TURN), that user exists on those systems, but it serves no purpose given that TURN is not enabled there.

The TURN secrets are now rotated on startup; otherwise there is a catch-22: the cron won't add the first secret right away, and it is also skipped if nobody is connected, so there would be no way to get the initial secret.
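
A sketch of what a forced rotation has to accomplish (the coturn.turn_secret table is modeled on coturn's standard SQL schema; the inserted_at column, realm value, and presence check are hypothetical, and the real Ret.Coturn.rotate_secrets may differ):

    defmodule Ret.CoturnRotationSketch do
      # force: rotate even when nobody is connected, e.g. on first boot, so the
      # first client can always be issued TURN credentials.
      def rotate_secrets(force \\ false, repo \\ Ret.Repo) do
        if force or anyone_connected?() do
          secret = Base.encode32(:crypto.strong_rand_bytes(20), padding: false)

          repo.query!(
            "INSERT INTO coturn.turn_secret (realm, value, inserted_at) VALUES ($1, $2, now())",
            ["turn.example.com", secret]
          )

          # Keep the previous secret around so already-issued, HMAC-based
          # credentials remain valid for a grace period.
          repo.query!("""
          DELETE FROM coturn.turn_secret
          WHERE (realm, value) NOT IN (
            SELECT realm, value FROM coturn.turn_secret
            ORDER BY inserted_at DESC LIMIT 2
          )
          """)
        end

        :ok
      end

      # Placeholder; the real check would consult presence/session state.
      defp anyone_connected?, do: false
    end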

Drops the transport field from reticulum's TURN response; it is now assumed that TURN is available over both UDP/DTLS and TCP/TLS.
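
For reference, a sketch of time-limited credential generation under coturn's REST-auth convention, where the username is an expiry timestamp and the credential is base64(HMAC-SHA1(shared_secret, username)); the module name, response shape, and TTL here are illustrative:

    defmodule Ret.TurnCredentialsSketch do
      @ttl_seconds 12 * 60 * 60

      def credentials(shared_secret, host) do
        username = Integer.to_string(System.os_time(:second) + @ttl_seconds)
        credential = Base.encode64(:crypto.mac(:hmac, :sha, shared_secret, username))

        %{
          enabled: true,
          username: username,
          credential: credential,
          # No transport field: clients assume both UDP/DTLS and TCP/TLS work.
          host: host
        }
      end
    end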

@gfodor requested a review from brianpeiris on Apr 22, 2020 03:42
@gfodor merged commit 22c2e4b into master on Apr 22, 2020
@gfodor deleted the feature/coturn-april-update branch on Apr 22, 2020 03:58
    try do
      Ecto.Migrator.run(Ret.SessionLockRepo, priv_path, :up, all: true, prefix: "ret0")
    after
      Ret.DelayStopSignalHandler.allow_stop()
    end
Contributor:
I believe we won't return repos_pids if there's an error in the try. Should we also stop_repos in the after clause?

Contributor (Author):
after is like finally: if the try body throws, this will still throw, and hence stop the application from starting altogether (since this runs in the application start process).
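
For illustration:

    try do
      raise "boom"
    after
      # Runs unconditionally, like finally, but the RuntimeError still
      # propagates to the caller afterwards.
      IO.puts("cleanup ran")
    end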

    if repos_pids do
      # Ensure there are some TURN secrets in the database, so that if system is idle
      # the cron isn't indefinitely skipped and nobody can join rooms.
      Ret.Coturn.rotate_secrets(true, Ret.SessionLockRepo)
Contributor:
Why is rotate_secrets dependent on repos_pids?

Contributor (Author):
It's a bit roundabout: repos_pids is nil when the mix env is test, and we don't want to rotate the secrets on startup in tests. I can refactor, even though it's redundant.

Contributor:
Gotcha. Maybe just add this explanation to the comments.

    :ok =
      :gen_event.swap_sup_handler(
        :erl_signal_server,
        {Ret.SignalHandler, []},
Contributor:
What is Ret.SignalHandler?

Contributor (Author):
😱
Typo. I tested this in production and it seemed to work because migrations were not stomped, but that was obviously timing related. Nice catch. Will need to re-do all the integration tests.
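
Presumably the intended call swaps out OTP's default handler, which is registered as :erl_signal_handler on :erl_signal_server (a sketch, assuming that is the fix):

    :ok =
      :gen_event.swap_sup_handler(
        :erl_signal_server,
        {:erl_signal_handler, []},
        {Ret.DelayStopSignalHandler, []}
      )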

""")

execute("grant usage on schema coturn to coturn;")
execute("grant all privileges on all tables in schema coturn to coturn;")
Contributor:
Don't we still have to grant access to the coturn schema to some user?

Contributor (Author), Apr 22, 2020:
No - the relevant user is now just the superuser postgres. We have the same security contract for coturn as we do for ret's db access: both are internal servers accessing the db as the superuser. The fine-grained security policy we do have is for postgrest, which is a SQL proxy to the whole database, so its security is managed via postgres access control whitelisting and a dedicated user.
