
Feature/coturn april update #365

Merged: 18 commits from feature/coturn-april-update into master on Apr 22, 2020
Conversation

@gfodor (Contributor) commented Apr 22, 2020

This PR deals first with some issues related to shutdown:

First, the graceful connection draining logic broke with Phoenix 1.4 due to unsupported behavior in the the_end dependency. We now use a fork that has the necessary one-line fix. (The upstream project seems dead.)

Next, when we start reticulum on Hubs Cloud nodes, it starts from the burned-in AMI version. However, if there is a newer version, Habitat will SIGTERM that process and then install and run the new version. This is problematic when a new system is coming online, since the process may receive the SIGTERM during database migrations and leave the database in an inconsistent state. (Ecto properly runs each migration in a transaction unless we opt out of it, but it sadly does not update the schema versions table within that same transaction, so a poorly timed SIGTERM can leave the database in a state where a migration could be attempted a second time, which breaks things.)
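
For reference, opting out of that per-migration transaction looks like this in Ecto (a minimal sketch; the module and index below are illustrative, not part of this PR):

    defmodule Ret.Repo.Migrations.ExampleConcurrentIndex do
      use Ecto.Migration

      # Opt out of the wrapping transaction (and the migration lock), e.g. for
      # CREATE INDEX CONCURRENTLY, which Postgres refuses to run inside a transaction.
      @disable_ddl_transaction true
      @disable_migration_lock true

      def change do
        create index(:hubs, [:hub_sid], concurrently: true)
      end
    end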

This adds a new SIGTERM handler which is temporarily registered while migrations run during application startup. If the signal is seen during this window, it is ignored and the application shuts down once the migrations have finished. If no signal was seen, the default signal handler is re-registered after migrations complete, so there are no long-term side effects from this temporary override.
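
A minimal sketch of the shape such a handler can take (the actual Ret.DelayStopSignalHandler in this PR may differ in its details):

    defmodule Ret.DelayStopSignalHandler do
      # A :gen_event handler swapped onto :erl_signal_server before migrations
      # run. It swallows SIGTERM and remembers it; allow_stop/0 restores the
      # default handler and honors any deferred stop.
      @behaviour :gen_event

      def install do
        :ok =
          :gen_event.swap_sup_handler(
            :erl_signal_server,
            {:erl_signal_handler, []},
            {__MODULE__, []}
          )
      end

      def allow_stop do
        stop_requested = :gen_event.call(:erl_signal_server, __MODULE__, :stop_requested?)

        :ok =
          :gen_event.swap_sup_handler(
            :erl_signal_server,
            {__MODULE__, []},
            {:erl_signal_handler, []}
          )

        if stop_requested, do: :init.stop()
        :ok
      end

      @impl true
      def init(_), do: {:ok, %{stop_requested: false}}

      @impl true
      def handle_event(:sigterm, state) do
        # Ignore SIGTERM for now; allow_stop/0 will act on it after migrations.
        {:ok, %{state | stop_requested: true}}
      end

      def handle_event(_event, state), do: {:ok, state}

      @impl true
      def handle_call(:stop_requested?, state), do: {:ok, state.stop_requested, state}
    end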

Also in Hubs Cloud, another common failure: because repo migrations on AWS run over a transaction-scoped pgbouncer connection, it was easy for the search_path to be lost during migrations (which forced us to be explicit about ret0 everywhere it was not being injected automatically by the migration runner). This is now fixed: there is a separate SessionLockRepo that can be used to run SQL over a session-scoped (instead of transaction-scoped) pgbouncer connection wherever it is necessary to retain connection settings across several statements. Migrations now use it.
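
Roughly, the idea looks like this (the hostname, port, and config values below are illustrative assumptions, not the PR's actual config):

    # A second repo, identical to Ret.Repo except that its connection goes
    # through a session-pooled pgbouncer listener, so settings such as
    # search_path persist across statements.
    defmodule Ret.SessionLockRepo do
      use Ecto.Repo, otp_app: :ret, adapter: Ecto.Adapters.Postgres
    end

    # In config (hypothetical values):
    config :ret, Ret.SessionLockRepo,
      hostname: "pgbouncer.internal",
      port: 6433, # a listener with pool_mode = session
      database: "ret_production",
      pool_size: 1 # migrations run serially; one session is enough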

Drops the unneeded coturn user from our migration chain - it ended up being unnecessary and clunky to deal with in Hubs Cloud. Since this migration already ran on some Hubs Cloud servers (without TURN), that user exists on those systems, but it serves no purpose given that TURN is not enabled there.

The TURN secrets are now rotated on startup; otherwise there is a catch-22: the cron won't add the first secret right away, and it is also skipped if nobody is connected, so there would be no way to get the initial secret.
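
A sketch of what a forced rotation has to accomplish (the coturn.turn_secret table is modeled on coturn's standard SQL schema; the inserted_at column, realm value, and presence check are hypothetical, and the real Ret.Coturn.rotate_secrets may differ):

    defmodule Ret.CoturnRotationSketch do
      # force: rotate even when nobody is connected, e.g. on first boot, so the
      # first client can always be issued TURN credentials.
      def rotate_secrets(force \\ false, repo \\ Ret.Repo) do
        if force or anyone_connected?() do
          secret = Base.encode32(:crypto.strong_rand_bytes(20), padding: false)

          repo.query!(
            "INSERT INTO coturn.turn_secret (realm, value, inserted_at) VALUES ($1, $2, now())",
            ["turn.example.com", secret]
          )

          # Keep the previous secret around so already-issued, HMAC-based
          # credentials remain valid for a grace period.
          repo.query!("""
          DELETE FROM coturn.turn_secret
          WHERE (realm, value) NOT IN (
            SELECT realm, value FROM coturn.turn_secret
            ORDER BY inserted_at DESC LIMIT 2
          )
          """)
        end

        :ok
      end

      # Placeholder; the real check would consult presence/session state.
      defp anyone_connected?, do: false
    end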

Drops the transport field from reticulum's TURN response; it is now assumed that TURN is available over both UDP/DTLS and TCP/TLS.
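
For reference, a sketch of time-limited credential generation under coturn's REST-auth convention, where the username is an expiry timestamp and the credential is base64(HMAC-SHA1(shared_secret, username)); the module name, response shape, and TTL here are illustrative:

    defmodule Ret.TurnCredentialsSketch do
      @ttl_seconds 12 * 60 * 60

      def credentials(shared_secret, host) do
        username = Integer.to_string(System.os_time(:second) + @ttl_seconds)
        credential = Base.encode64(:crypto.mac(:hmac, :sha, shared_secret, username))

        %{
          enabled: true,
          username: username,
          credential: credential,
          # No transport field: clients assume both UDP/DTLS and TCP/TLS work.
          host: host
        }
      end
    end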

@gfodor requested a review from brianpeiris on Apr 22, 2020 03:42
@gfodor merged commit 22c2e4b into master on Apr 22, 2020
@gfodor deleted the feature/coturn-april-update branch on Apr 22, 2020 03:58
    try do
      Ecto.Migrator.run(Ret.SessionLockRepo, priv_path, :up, all: true, prefix: "ret0")
    after
      Ret.DelayStopSignalHandler.allow_stop()
    end
Contributor:
I believe we won't return repos_pids if there's an error in the try. Should we also stop_repos in the after clause?

Contributor (Author):
after is like finally: if the try body throws, this will still throw, and hence stop the application from starting altogether (since this runs in the application start process).
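
For illustration:

    try do
      raise "boom"
    after
      # Runs unconditionally, like finally, but the RuntimeError still
      # propagates to the caller afterwards.
      IO.puts("cleanup ran")
    end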

    if repos_pids do
      # Ensure there are some TURN secrets in the database, so that if system is idle
      # the cron isn't indefinitely skipped and nobody can join rooms.
      Ret.Coturn.rotate_secrets(true, Ret.SessionLockRepo)
Contributor:
Why is rotate_secrets dependent on repos_pids?

Contributor (Author):
It's a bit roundabout: repos_pids is nil when the mix env is test, and we don't want to rotate the secrets on startup in tests. I can refactor, even though it's redundant.

Contributor:
Gotcha. Maybe just add this explanation to the comments.

    :ok =
      :gen_event.swap_sup_handler(
        :erl_signal_server,
        {Ret.SignalHandler, []},
Contributor:
What is Ret.SignalHandler?

Contributor (Author):
😱
Typo. I tested this in production and it seemed to work because migrations were not stomped, but that was obviously timing related. Nice catch. Will need to re-do all the integration tests.
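
Presumably the intended call swaps out OTP's default handler, which is registered as :erl_signal_handler on :erl_signal_server (a sketch, assuming that is the fix):

    :ok =
      :gen_event.swap_sup_handler(
        :erl_signal_server,
        {:erl_signal_handler, []},
        {Ret.DelayStopSignalHandler, []}
      )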

""")

execute("grant usage on schema coturn to coturn;")
execute("grant all privileges on all tables in schema coturn to coturn;")
Contributor:
Don't we still have to grant access to the coturn schema to some user?

Contributor (Author), Apr 22, 2020:
No - the relevant user is now just the superuser postgres. We have the same security contract for coturn as we do for ret's db access: both are internal servers accessing the db as the superuser. The fine-grained security policy we do have is for postgrest, which is a SQL proxy to the whole database, so its security is managed via postgres access control whitelisting and a dedicated user.
