Feature/coturn april update #365
Conversation
```elixir
try do
  Ecto.Migrator.run(Ret.SessionLockRepo, priv_path, :up, all: true, prefix: "ret0")
after
  Ret.DelayStopSignalHandler.allow_stop()
end
```
I believe we won't return `repos_pids` if there's an error in the `try`. Should we also `stop_repos` in the `after` clause?
`after` is like `finally`: if the `try` body throws, this will still throw once the `after` clause has run, and hence stop starting the application altogether (since this is the application start process).
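A minimal, illustrative example of the semantics being described: Elixir's `after` clause runs on both success and failure, but it does not swallow the exception (this sketch is not code from the PR).

```elixir
# Illustrative only: `after` runs even when the `try` body raises,
# but the exception still propagates to the caller afterwards.
message =
  try do
    try do
      raise "boom"
    after
      # Cleanup (like allow_stop/0 in the diff above) runs here even on error...
      IO.puts("after clause ran")
    end
  rescue
    # ...but the exception is not swallowed; it reaches this outer rescue.
    e in RuntimeError -> e.message
  end

IO.puts(message)  # prints "boom"
```

So in the application start process, a raise during migrations would still abort startup even though the `after` clause ran.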
```elixir
if repos_pids do
  # Ensure there are some TURN secrets in the database, so that if system is idle
  # the cron isn't indefinitely skipped and nobody can join rooms.
  Ret.Coturn.rotate_secrets(true, Ret.SessionLockRepo)
```
Why is `rotate_secrets` dependent on `repos_pids`?
It's a bit roundabout: `repos_pids` is nil when the Mix env is test, and we don't want to rotate the secrets on startup in test. I can refactor even though it's redundant.
Gotcha. Maybe just add this explanation to the comments.
```elixir
:ok =
  :gen_event.swap_sup_handler(
    :erl_signal_server,
    {Ret.SignalHandler, []},
```
What is `Ret.SignalHandler`?
😱
Typo. I tested this in production and it seemed to work because migrations were not stomped, but that was obviously timing related. Nice catch. Will need to re-do all the integration tests.
```elixir
""")

execute("grant usage on schema coturn to coturn;")
execute("grant all privileges on all tables in schema coturn to coturn;")
```
Don't we still have to grant access to the coturn schema to some user?
No: the relevant user is now just the superuser `postgres`. We have the same security contract for coturn as we do for ret's database access: both are internal servers accessing the database as the superuser. The fine-grained security policy we do have is for postgrest, which is a SQL proxy to the whole database, so its security is managed via Postgres access-control whitelisting and a dedicated user.
This PR first deals with some issues related to shutdown:
First, the graceful connection draining logic broke with Phoenix 1.4 due to unsupported behavior in the `the_end` dependency. We now use a fork that has the necessary one-line change to fix it. (The upstream project seems dead.)
Next, when we start reticulum on Hubs Cloud nodes, it starts from the burned-in AMI version. However, if there is a new version, Habitat will SIGTERM that process and then install and run the new version. This is problematic when a new system is coming online, since it may receive the SIGTERM during database migrations and leave the database in an inconsistent state. (Ecto properly runs each migration in a transaction unless we opt out, but it sadly does not update the schema versions table in that same transaction, so a poorly timed SIGTERM can leave the database in a state where a migration could be attempted a second time, which breaks things.)
This adds a new SIGTERM handler which is temporarily registered during application startup while migrations run. If the signal is seen during this window, it is deferred, and the application shuts down after the migrations have finished. If no signal was seen, the default signal handler is re-registered once migrations finish, so there are no long-term side effects of this temporary override during startup.
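The handler swap described above might look roughly like the following. This is a hypothetical sketch, not the PR's actual `Ret.DelayStopSignalHandler`; the module name, state shape, and `allow_stop` call protocol are assumptions based on the diff.

```elixir
# Hypothetical sketch of a :gen_event handler that defers SIGTERM while
# migrations run. Swapped in before migrations with something like:
#
#   :ok = :gen_event.swap_sup_handler(
#           :erl_signal_server,
#           {:erl_signal_handler, []},
#           {DelayStopSignalHandler, []})
#
defmodule DelayStopSignalHandler do
  @behaviour :gen_event

  @impl true
  def init(_args), do: {:ok, %{stop_requested: false}}

  # Record, but do not act on, a SIGTERM received mid-migration.
  @impl true
  def handle_event(:sigterm, state), do: {:ok, %{state | stop_requested: true}}
  def handle_event(_other, state), do: {:ok, state}

  # Called once migrations finish: stop now if a signal arrived during
  # the window; otherwise the caller restores the default handler.
  @impl true
  def handle_call(:allow_stop, %{stop_requested: true} = state) do
    :init.stop()
    {:ok, :ok, state}
  end

  def handle_call(:allow_stop, state), do: {:ok, :ok, state}

  @impl true
  def handle_info(_msg, state), do: {:ok, state}
end
```

The key property is that the handler replaces Erlang's default `:erl_signal_handler` (which stops the node on SIGTERM) only for the duration of the migration, so the deferral cannot leak into normal operation.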
Also, in Hubs Cloud, another common failure was that because the repo migration on AWS happens in the context of a transaction-scoped pgbouncer connection, it was easy for the `search_path` to be lost during migrations (which led to us having to be explicit about `ret0` everywhere it was not being injected by the migration runner automatically.) This is now fixed: there is a separate `SessionLockRepo` that can be used to run SQL in the context of a session-scoped (instead of transaction-scoped) pgbouncer connection, where it is necessary to retain connection environment variables across several statements. This is now done for migrations.

Drops the unneeded `coturn` user from our migration chain; this ended up being unnecessary and clunky to deal with in Hubs Cloud. Since this migration already ran on some Hubs Cloud servers (without TURN), that user exists on those systems, but it doesn't have any purpose given that TURN is not on those servers.

The TURN secrets are now rotated on startup, because otherwise there is a catch-22: the cron won't add the first secret right away, and it is also skipped if nobody is connected, so there's no way to get the initial secret.
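The `SessionLockRepo` arrangement could be configured along these lines. This is a sketch under stated assumptions: the hostnames, ports, and the convention of exposing session pooling on a separate pgbouncer port are all hypothetical, not taken from reticulum's actual config.

```elixir
# config/config.exs (sketch, assumed values): two Ecto repos backed by the
# same database, reached through differently pooled pgbouncer endpoints.
config :ret, Ret.Repo,
  hostname: "pgbouncer.internal",   # assumed host
  port: 5432,                       # transaction-pooled: connection state
  database: "ret_production"        # (e.g. search_path) may not survive

config :ret, Ret.SessionLockRepo,
  hostname: "pgbouncer.internal",
  port: 5433,                       # session-pooled: SET search_path etc.
  database: "ret_production",       # persists across statements
  pool_size: 1

# Migrations then run against the session-scoped repo, as in the diff above:
# Ecto.Migrator.run(Ret.SessionLockRepo, priv_path, :up, all: true, prefix: "ret0")
```

The design point is simply that multi-statement work which depends on connection-level state (like a migration prefix) needs a session-scoped connection, while ordinary request traffic can stay on the cheaper transaction-scoped pool.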
Drops the `transport` field from the TURN response from reticulum; it's now assumed that TURN is available over both UDP/DTLS and TCP/TLS.