Improve handling of network related timeouts #481
Conversation
This looks like a great contribution, very nice.
How did you test it during development?
It seems like a really good idea to write this down somewhere. I am also a little unclear on when a timeout will cause a connection to die and when it won't.
Do you know who's responsible for defining that one-second period? Is that tiny_tds? FreeTDS? Someone else?
I think your current solution seems reasonable. What do you mean by "normal db processing", though?
I had a scratch script that would use:

IO.write('/etc/pf.conf', File.open('/etc/pf.conf') { |f| f.read.gsub(/#block drop/, "block drop") })
system('pfctl -f /etc/pf.conf > /dev/null 2>&1')

Something like that isn't going to be cross-platform and required a little bit of manual setup. If we can come up with a reliable set of commands to drop/restore a network connection across platforms, then we can likely write a test for this case. I doubt it's desirable to drop the network across the whole system just to run a test, though, so it would ideally work for just the test host.
I'll take an initial stab at drafting something for the readme.
Initially I couldn't find it, but it looks like FreeTDS defines this at 1 second: https://github.com/FreeTDS/freetds/blob/cfa161498673d91a963d173561cbc99835da1cfa/src/tds/net.c#L622, which means it should be consistent.
I mean cases where the timeout is triggered while waiting for results from the db and the network connection is up, e.g. the timeout is configured for 1 second but the SQL takes longer than that to complete.
SQL Server is running in a container, right? Does pausing the container after creating a connection have the same effect? Or maybe we can do something like this to pause the Docker network temporarily: https://stackoverflow.com/a/56218103
We can also conditionally skip the test based on env vars, etc. I think it would be nice to have something even if it's silly and shells out to docker.
Great suggestion. Pausing the container does have a similar effect. I've gone ahead and refactored an existing test we were previously skipping altogether. If you have a better suggestion for the skip logic, I'd be happy to incorporate it.
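For anyone following along, here's a rough sketch of what a docker-pause based test could look like. This is not the actual test from this PR; the container name, credentials, timings, and skip logic are all assumptions for illustration only:

```ruby
require "minitest/autorun"
require "tiny_tds"

class NetworkTimeoutTest < Minitest::Test
  # Assumed container name; adjust to match your local setup.
  CONTAINER = "sqlserver".freeze

  def test_timeout_respected_while_container_is_paused
    # Skip when docker (or the container) isn't available, e.g. on hosts without it.
    skip "docker container not available" unless system("docker inspect #{CONTAINER} > /dev/null 2>&1")

    client = TinyTds::Client.new(
      host: "localhost", username: "sa", password: "Testing123@@",
      timeout: 2, login_timeout: 5
    )

    system("docker pause #{CONTAINER}")
    started = Time.now
    # With the network effectively dead, the configured timeout should still be honored.
    assert_raises(TinyTds::Error) { client.execute("waitfor delay '00:00:30'").do }
    assert_in_delta 2, Time.now - started, 2, "error should be raised near the configured timeout"
  ensure
    system("docker unpause #{CONTAINER}")
  end
end
```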
These changes are now available to test in 2.1.4.pre
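For anyone wanting to try it, something like this in a Gemfile should be enough to pull in the pre-release (the version string is the one mentioned above):

```ruby
# Gemfile: opt in to the pre-release build for testing
gem "tiny_tds", "2.1.4.pre"
```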
This is probably fine for what it is; however, we still have some timeout-related problems. For example, when I run this scratch script and kill SQL Server, the main thread appears to hang forever instead of raising a timeout error.
require 'tiny_tds'
def w(*args)
  puts [Thread.current.object_id, *args].join ' '
end

t = Thread.new do
  i = 0
  loop do
    i += 1
    w i
    sleep(1)
  end
end

c = TinyTds::Client.new({
  host: 'localhost',
  username: 'sa',
  password: 'Testing123@@',
  timeout: 5,
  login_timeout: 5,
  message_handler: ->(e) { w e },
})

w TinyTds::VERSION

sql = <<~SQL
  create or alter procedure tinytds_TestPrintWithError
  as
  begin
    raiserror('raiserror', 1, 1) with nowait
    waitfor delay '00:01:00'
  end
SQL

w c.execute(sql).to_a
w c.execute('exec tinytds_TestPrintWithError').to_a

t.exit
I'm not sure if this particular problem should be addressed by this PR, or if we should just remember it's still a problem and tackle it separately.
One suggestion for how to test this is to use Toxiproxy. It's a tool for simulating different kinds of network failure. You can boot a Toxiproxy container, connect to mssql through it, and then kill the connection, add jitter or latency, etc.
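For what it's worth, a rough sketch of how that could look with the toxiproxy Ruby gem. The proxy name, ports, credentials, and toxics below are assumptions for illustration, not something taken from this PR:

```ruby
require "toxiproxy"
require "tiny_tds"

# Assumes a Toxiproxy server is running on its default port and that
# SQL Server is listening on localhost:1433; both are assumptions.
Toxiproxy.populate([
  { name: "mssql", listen: "127.0.0.1:22000", upstream: "127.0.0.1:1433" }
])

# Connect through the proxy instead of hitting SQL Server directly.
client = TinyTds::Client.new(
  host: "127.0.0.1", port: 22_000,
  username: "sa", password: "Testing123@@", timeout: 5
)

# Simulate heavy latency on responses from the server.
Toxiproxy[:mssql].downstream(:latency, latency: 10_000).apply do
  client.execute("select 1").to_a rescue puts "timed out as expected"
end

# Simulate the connection dropping entirely.
Toxiproxy[:mssql].down do
  client.execute("select 1").to_a rescue puts "timed out as expected"
end
```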
I was unable to reproduce using your scratch script. I used Toxiproxy to test network failures (slow close, timeout, and down) as well as manually pausing/stopping my SQL Server Docker container. In all instances I hit a timeout at the expected point (at 6 seconds).
These changes look good to me. We've been testing 2.1.4-pre2 in production for a week or so without any ill effect from this PR. Can you squash/edit your commits down to whatever's appropriate for master history before merging?
With a working network connection:
* command batch is called on a non-GVL thread
* tinytds_err_handler is called with the timeout error and returns INT_TIMEOUT
* dbcancel is called and the command batch returns
* nogvl_cleanup is called and the timeout error is raised

This is all great. The timeout is hit, the db connection lives, and an error is thrown.

With a network failure:
* command batch is called on a non-GVL thread
* tinytds_err_handler is called with the timeout error and returns INT_TIMEOUT
* dbcancel is called and does not succeed; the command batch never returns
* nogvl_cleanup is not called

This is not great. dbcancel does not succeed because of the network failure, so the command batch does not return until the underlying network timeout on the OS is hit, and TinyTds doesn't throw an error in the expected timeout window.

To fix this, we set a flag when a timeout is encountered and use dbsetinterrupt to check that flag periodically while waiting on a read from the server. Once the flag is set, the interrupt will send INT_CANCEL, causing the pending command batch to return early. This means nogvl_cleanup will be called and the timeout error raised. This shouldn't have any effect in "normal" timeout conditions because dbcancel will actually succeed and cause the normal flow before the interrupt can be called/handled. This is good because in these situations we still want the dbproc to remain in working condition.
Summary
In cases of network failure with a command batch running, configured timeouts were not being respected. This meant that the time elapsed before a `TinyTds::Error` was thrown would extend well beyond the configured timeout (#445). The reason for this is that `dbcancel` does not work under these conditions. The `tinytds_err_handler` is called when the timeout is reached; we return `INT_TIMEOUT` and attempt to cancel the command batch using `dbcancel`. That call fails, which means we continue to wait for the command batch to return even when subsequent timeout errors are reported.

Ideally, to fix this we would detect when `dbcancel` fails and return `INT_CANCEL` in those circumstances. Unfortunately, we can't do this given that `dbcancel` will always return `SUCCEED`.

As an alternative, when a timeout occurs, we set a flag and use that to return `INT_CANCEL` on the next timeout error we encounter. This allows the pending command batch to return so that `nogvl_cleanup` can raise the Ruby exception. This works well except that two timeout periods elapse before `INT_CANCEL` is sent: one reported timeout error to set the flag and another to act on it. To get around this we can use `dbsetinterrupt` to check the flag periodically while waiting on a read from the server, return `INT_CANCEL` sooner, and raise the timeout exception.

This shouldn't have any effect on normal timeout conditions, because `dbcancel` will actually succeed and cause the normal flow before the interrupt can be called/handled. This is good because in these situations we still want the db process to remain in working condition.

Fixes #445.
Notes
* Returning `INT_CANCEL` means that the db process will be killed. It's not ideal that after some timeout errors the connection will still be usable, while for others it will not. This is how things behaved prior to this PR, but I wanted to call it out. Is this worth adding as a note in the readme?
* I also explored other approaches involving `rb_thread_call_without_gvl` and raising the error directly from the error handler using `rb_thread_call_with_gvl`, but I couldn't get either solution working because I couldn't differentiate between timeouts caused by normal db processing and those caused by network-related issues.