
Improve handling of network related timeouts #481

Merged

Conversation

bvogelzang
Contributor

Summary

In cases of network failure while a command batch is running, configured timeouts were not being respected. This meant that the time elapsed before a TinyTds::Error was raised could extend well beyond the configured timeout (#445).

The reason is that dbcancel does not work under these conditions. The tinytds_err_handler is called when the timeout is reached; we return INT_TIMEOUT and attempt to cancel the command batch using dbcancel. That call fails, which means we continue to wait for the command batch to return even as subsequent timeout errors are reported.

Ideally, to fix this we would detect whether dbcancel fails and return INT_CANCEL in those circumstances. Unfortunately, we can't do this, because dbcancel will always return SUCCEED.

As an alternative, when a timeout occurs we set a flag and use it to return INT_CANCEL on the next timeout error we encounter. This allows the pending command batch to return so that nogvl_cleanup can raise the Ruby exception. This works well, except that two timeout periods will elapse before INT_CANCEL is sent: one reported timeout error to set the flag and another to act on it. To get around this, we can use dbsetinterrupt to check the flag periodically while waiting on a read from the server, return INT_CANCEL sooner, and raise the timeout exception.

This shouldn't have any effect on normal timeout conditions, because dbcancel will actually succeed and trigger the normal flow before the interrupt can be called/handled. This is good because in those situations we still want the db process to remain in working condition.

Fixes #445.
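The flag handshake described above can be modeled in a few lines of Ruby. This is an illustrative sketch of the control flow only, not the actual C code in ext/tiny_tds/client.c; the class and method names are made up, and INT_TIMEOUT/INT_CANCEL stand in for the DB-Library constants of the same name.

```ruby
# Illustrative model of the timeout-flag handshake -- not TinyTds API.
class TimeoutFlow
  INT_TIMEOUT = :int_timeout
  INT_CANCEL  = :int_cancel

  def initialize
    @timing_out = false
  end

  # Models tinytds_err_handler receiving a timeout error.
  def handle_timeout_error
    return INT_CANCEL if @timing_out # a later timeout: stop waiting on dbcancel
    @timing_out = true               # first timeout: flag it and let dbcancel try
    INT_TIMEOUT
  end

  # Models the dbsetinterrupt check that runs while blocked on a read,
  # letting us send INT_CANCEL before a second timeout period elapses.
  def interrupt_check
    @timing_out ? INT_CANCEL : nil
  end
end
```

When dbcancel succeeds (i.e. the network is up), the command batch returns before the interrupt check ever matters, so only the INT_TIMEOUT path runs and the connection stays usable.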

Notes

  • I can't come up with a good way to write a test for this scenario. I would love suggestions on how we might approach this.
  • Returning INT_CANCEL means that the db process will be killed. It's not ideal that after some timeout errors the connection will still be usable, while for others it will not. This is how things behaved prior to this PR but wanted to call it out. Is this worth adding as a note in the readme?
  • In my testing, the interrupt was called once per second. This means the timeout exception will be raised 1 second after hitting the configured timeout. I have no idea what the interrupt interval is in other environments so it is possible that this could vary.
  • Other solutions I considered were trying to kill the thread spawned by rb_thread_call_without_gvl and raising the error directly from the error handler using rb_thread_call_with_gvl but I couldn't get either solution working due to the fact that I couldn't differentiate between timeouts caused by normal db processing and those caused by network related issues.

@aharpervc
Contributor

This looks like a great contribution, very nice.

  • I can't come up with a good way to write a test for this scenario. I would love suggestions on how we might approach this.

How did you test it during development?

  • Returning INT_CANCEL means that the db process will be killed. It's not ideal that after some timeout errors the connection will still be usable, while for others it will not. This is how things behaved prior to this PR but wanted to call it out. Is this worth adding as a note in the readme?

It seems like a really good idea to write this down somewhere. I am also a little unclear when a timeout will cause a connection to die and when it won't.

  • In my testing, the interrupt was called once per second. This means the timeout exception will be raised 1 second after hitting the configured timeout. I have no idea what the interrupt interval is in other environments so it is possible that this could vary.

Do you know who's responsible for defining that one second period? Is that tiny tds? freetds? someone else?

  • Other solutions I considered were trying to kill the thread spawned by rb_thread_call_without_gvl and raising the error directly from the error handler using rb_thread_call_with_gvl but I couldn't get either solution working due to the fact that I couldn't differentiate between timeouts caused by normal db processing and those caused by network related issues.

I think your current solution seems reasonable. What do you mean "normal db processing" though?

@bvogelzang
Contributor Author

How did you test it during development?

I had a scratch script that would use pfctl to drop the connection.

# Uncomment the "block drop" rule in pf.conf, then reload it to cut the network
IO.write('/etc/pf.conf', File.open('/etc/pf.conf') { |f| f.read.gsub(/#block drop/, "block drop") })
system('pfctl -f /etc/pf.conf > /dev/null 2>&1')

Something like that isn't cross platform and required a bit of manual setup. If we can come up with a reliable set of commands to drop/restore a network connection across platforms, then we can likely write a test for this case. I doubt it's desirable to drop the network across the whole system just to run a test, though, so ideally it would affect only the test host.

It seems like a really good idea to write this down somewhere. I am also a little unclear when a timeout will cause a connection to die and when it won't.

I'll take an initial stab at drafting something for the readme.

Do you know who's responsible for defining that one second period? Is that tiny tds? freetds? someone else?

Initially I couldn't find it, but it looks like freetds defines this at 1 second: https://github.com/FreeTDS/freetds/blob/cfa161498673d91a963d173561cbc99835da1cfa/src/tds/net.c#L622, so it should be consistent across environments.

What do you mean "normal db processing" though?

I mean in cases where the timeout is triggered while waiting for results from the db and a network connection is up. e.g. the timeout is configured for 1 second but sql waitfor delay 00:00:02 causes a timeout.

@aharpervc
Contributor

If we can come up with a reliable set of commands to drop/restore a network connection across platforms then we can likely write a test for this case.

SQL Server is running in a container, right? Does pausing the container after creating a connection have the same effect? Or maybe we can do something like this to pause the docker network temporarily: https://stackoverflow.com/a/56218103

We can also conditionally skip the test based on env vars, etc. I think it would be nice to have something even if it's silly and shells out to docker.
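A rough sketch of what that shelling-out helper could look like. The helper name, the env var, and the injectable runner are all made-up for illustration (they are not the helpers this PR actually adds); injecting the runner keeps the helper exercisable without Docker installed.

```ruby
# Hypothetical test helper: pause the SQL Server container for the duration
# of a block, then always unpause it, even if the block raises.
def with_container_paused(name, runner: ->(cmd) { system("docker #{cmd}") })
  runner.call("pause #{name}")
  yield
ensure
  runner.call("unpause #{name}") # resume even on failure
end

# In a test it might be used like this (TINYTDS_DOCKER_CONTAINER is a
# made-up env var for the conditional skip):
#
#   skip("needs a dockerized SQL Server") unless ENV["TINYTDS_DOCKER_CONTAINER"]
#   with_container_paused(ENV["TINYTDS_DOCKER_CONTAINER"]) do
#     assert_raises(TinyTds::Error) { client.execute("select 1").each }
#   end
```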

@bvogelzang
Contributor Author

Great suggestion. Pausing the container does have a similar effect. I've gone ahead and refactored an existing test we were previously skipping altogether. If you have a better suggestion for the skip logic I'd be happy to incorporate it.

@bvogelzang bvogelzang force-pushed the network-timeout-handling branch from 501156d to 08f6bed Compare January 19, 2021 17:42
@aharpervc aharpervc mentioned this pull request Jan 19, 2021
@bvogelzang
Contributor Author

These changes are now available to test in 2.1.4.pre

@aharpervc
Contributor

These changes are now available to test in 2.1.4.pre

This is probably fine for what it is; however, we still have some timeout-related problems. For example, when I run this scratch script and kill sql server, the main thread appears to hang forever instead of raising a timeout error.

  1. run a sql server container
  2. run this scratch script
  3. immediately stop the sql server container
  4. counter thread runs forever, main thread hangs instead of timeout
require 'tiny_tds'

def w(*args)
  puts [Thread.current.object_id, *args].join ' '
end

t = Thread.new do
  i = 0
  loop do
    i += 1
    w i
    sleep(1)
  end
end

c = TinyTds::Client.new({
  host: 'localhost',
  username: 'sa',
  password: 'Testing123@@',

  timeout: 5,
  login_timeout: 5,

  message_handler: ->(e) { w e },
})

w TinyTds::VERSION

sql = <<~SQL
create or alter procedure tinytds_TestPrintWithError
as
begin
    raiserror('raiserror', 1, 1) with nowait
    waitfor delay '00:01:00'
end
SQL

w c.execute(sql).to_a
w c.execute('exec tinytds_TestPrintWithError').to_a

t.exit
$ bundle exec ruby .\scratch.rb
23804380 1
4649200 2.1.4.pre
4649200
23804380 2
23804380 3
23804380 4
23804380 5
23804380 6
23804380 7
23804380 8
23804380 9
23804380 10
23804380 11
23804380 12
23804380 13
23804380 14
23804380 15
(...snip...)

I'm not sure if this particular problem should be addressed by this PR, or we should just remember it's still a problem and tackle it separately.

@sixfeetover

One suggestion for how to test this is to use ToxiProxy, a tool for simulating different kinds of network failure. You can boot a ToxiProxy container, connect to mssql through it, and then kill the connection, add jitter or latency, etc.
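For reference, a sketch of the toxiproxy-cli invocations a test setup might issue. The proxy name, ports, and the helper itself are assumptions for illustration; the flags follow toxiproxy-cli's documented interface, and nothing here is TinyTds API.

```ruby
# Hypothetical helper that builds toxiproxy-cli command lines for a test
# harness. Ports and the proxy name are arbitrary.
def toxiproxy_commands(proxy: "sqlserver", listen: "localhost:21433", upstream: "localhost:1433")
  {
    # Tests connect to `listen`; toxiproxy forwards traffic to the real server.
    create:  "toxiproxy-cli create -l #{listen} -u #{upstream} #{proxy}",
    # timeout=0 drops the connection's data indefinitely -- a "dead network".
    down:    "toxiproxy-cli toxic add -t timeout -a timeout=0 #{proxy}",
    # Add two seconds of latency instead of a hard failure.
    latency: "toxiproxy-cli toxic add -t latency -a latency=2000 #{proxy}",
    delete:  "toxiproxy-cli delete #{proxy}"
  }
end
```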

@bvogelzang
Copy link
Contributor Author

I was unable to reproduce using your scratch script. I used ToxiProxy to test network failures (slow close, timeout, and down) as well as manually pausing/stopping my sqlserver docker container. In all instances I hit a timeout at the expected point (at 6 seconds).

@bvogelzang bvogelzang force-pushed the network-timeout-handling branch 2 times, most recently from 8fc51de to 91fec1c Compare March 17, 2021 17:51
@aharpervc
Copy link
Contributor

These changes look good to me. We've been testing 2.1.4-pre2 in production for a week or so without any ill effect from this PR.

Can you squash/edit your commits down to whatever's appropriate for master history before merging?

@bvogelzang bvogelzang force-pushed the network-timeout-handling branch from 91fec1c to 05a6b29 Compare April 16, 2021 15:52
@bvogelzang bvogelzang force-pushed the network-timeout-handling branch from 05a6b29 to bc5fae8 Compare May 4, 2021 17:37
bvogelzang added 2 commits May 4, 2021 17:13
With a working network connection:
* command batch is called on non gvl thread
* tinytds_err_handler is called with timeout error and returns INT_TIMEOUT
* dbcancel is called and command batch returns
* nogvl_cleanup is called and timeout error is raised

This is all great. The timeout is hit, the db connection lives, and an error is thrown.

With a network failure:
* command batch is called on non gvl thread
* tinytds_err_handler is called with timeout error and returns INT_TIMEOUT
* dbcancel is called and does not succeed; the command batch never returns
* nogvl_cleanup is not called

This is not great. dbcancel does not succeed because of the network failure, and the command batch does not return until the underlying OS network timeout is hit. TinyTds doesn't throw an error within the expected timeout window.

To fix this, we set a flag when a timeout is encountered. We use dbsetinterrupt to check this flag periodically while waiting on a read from the server. Once the flag is set, the interrupt will send INT_CANCEL, causing the pending command batch to return early. This means nogvl_cleanup will be called and raise the timeout error.

This shouldn't have any effect in "normal" timeout conditions, because dbcancel will actually succeed and cause the normal flow before the interrupt can be called/handled. This is good because in these situations we still want the dbproc to remain in working condition.
@bvogelzang bvogelzang force-pushed the network-timeout-handling branch from bc5fae8 to 6466b78 Compare May 4, 2021 22:13
@bvogelzang bvogelzang merged commit 2b7d7ab into rails-sqlserver:master May 5, 2021
Successfully merging this pull request may close these issues.

Timeout does not work when tcp packets are dropped