
Improve handling of network related timeouts #481

Merged

Conversation

bvogelzang
Contributor

Summary

In cases of network failure while a command batch is running, configured timeouts were not being respected. This meant that the time elapsed before a TinyTds::Error was raised could extend well beyond the configured timeout (#445).

The reason is that dbcancel does not work under these conditions. The tinytds_err_handler is called when the timeout is reached; we return INT_TIMEOUT and attempt to cancel the command batch using dbcancel. That call fails, which means we continue to wait for the command batch to return even as subsequent timeout errors are reported.

Ideally, to fix this we would detect whether dbcancel fails and return INT_CANCEL in those circumstances. Unfortunately, we can't do this, because dbcancel will always return SUCCEED.

As an alternative, when a timeout occurs we set a flag and use it to return INT_CANCEL on the next timeout error we encounter. This allows the pending command batch to return so that nogvl_cleanup can raise the Ruby exception. This works well, except that two timeout periods will elapse before INT_CANCEL is sent: one reported timeout error to set the flag and another to act on it. To get around this, we can use dbsetinterrupt to check the flag periodically while waiting on a read from the server, return INT_CANCEL sooner, and raise the timeout exception.

This shouldn't have any effect on normal timeout conditions, because dbcancel will actually succeed and trigger the normal flow before the interrupt can be called/handled. This is good because in those situations we still want the db process to remain in working condition.

Fixes #445.
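The flag handshake described above can be modeled in a few lines of Ruby. This is an illustrative sketch of the control flow only, not the actual C code in ext/tiny_tds/client.c; the class and method names are made up, and INT_TIMEOUT/INT_CANCEL stand in for the DB-Library constants of the same name.

```ruby
# Illustrative model of the timeout-flag handshake -- not TinyTds API.
class TimeoutFlow
  INT_TIMEOUT = :int_timeout
  INT_CANCEL  = :int_cancel

  def initialize
    @timing_out = false
  end

  # Models tinytds_err_handler receiving a timeout error.
  def handle_timeout_error
    return INT_CANCEL if @timing_out # a later timeout: stop waiting on dbcancel
    @timing_out = true               # first timeout: flag it and let dbcancel try
    INT_TIMEOUT
  end

  # Models the dbsetinterrupt check that runs while blocked on a read,
  # letting us send INT_CANCEL before a second timeout period elapses.
  def interrupt_check
    @timing_out ? INT_CANCEL : nil
  end
end
```

When dbcancel succeeds (i.e. the network is up), the command batch returns before the interrupt check ever matters, so only the INT_TIMEOUT path runs and the connection stays usable.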

Notes

  • I can't come up with a good way to write a test for this scenario. I would love suggestions on how we might approach this.
  • Returning INT_CANCEL means that the db process will be killed. It's not ideal that after some timeout errors the connection will still be usable, while for others it will not. This is how things behaved prior to this PR but wanted to call it out. Is this worth adding as a note in the readme?
  • In my testing, the interrupt was called once per second. This means the timeout exception will be raised 1 second after hitting the configured timeout. I have no idea what the interrupt interval is in other environments so it is possible that this could vary.
  • Other solutions I considered were trying to kill the thread spawned by rb_thread_call_without_gvl and raising the error directly from the error handler using rb_thread_call_with_gvl but I couldn't get either solution working due to the fact that I couldn't differentiate between timeouts caused by normal db processing and those caused by network related issues.

@aharpervc
Contributor

This looks like a great contribution, very nice.

  • I can't come up with a good way to write a test for this scenario. I would love suggestions on how we might approach this.

How did you test it during development?

  • Returning INT_CANCEL means that the db process will be killed. It's not ideal that after some timeout errors the connection will still be usable, while for others it will not. This is how things behaved prior to this PR but wanted to call it out. Is this worth adding as a note in the readme?

It seems like a really good idea to write this down somewhere. I am also a little unclear when a timeout will cause a connection to die and when it won't.

  • In my testing, the interrupt was called once per second. This means the timeout exception will be raised 1 second after hitting the configured timeout. I have no idea what the interrupt interval is in other environments so it is possible that this could vary.

Do you know who's responsible for defining that one second period? Is that tiny tds? freetds? someone else?

  • Other solutions I considered were trying to kill the thread spawned by rb_thread_call_without_gvl and raising the error directly from the error handler using rb_thread_call_with_gvl but I couldn't get either solution working due to the fact that I couldn't differentiate between timeouts caused by normal db processing and those caused by network related issues.

I think your current solution seems reasonable. What do you mean "normal db processing" though?

@bvogelzang
Contributor Author

How did you test it during development?

I had a scratch script that would use pfctl to drop the connection.

# Uncomment the "block drop" rule in pf.conf, then reload it to cut the network
IO.write('/etc/pf.conf', File.open('/etc/pf.conf') { |f| f.read.gsub(/#block drop/, "block drop") })
system('pfctl -f /etc/pf.conf > /dev/null 2>&1')

Something like that isn't cross platform and required a bit of manual setup. If we can come up with a reliable set of commands to drop/restore a network connection across platforms, then we can likely write a test for this case. I doubt it's desirable to drop the network across the whole system just to run a test, though, so ideally it would affect only the test host.

It seems like a really good idea to write this down somewhere. I am also a little unclear when a timeout will cause a connection to die and when it won't.

I'll take an initial stab at drafting something for the readme.

Do you know who's responsible for defining that one second period? Is that tiny tds? freetds? someone else?

Initially I couldn't find it, but it looks like freetds defines this at 1 second: https://github.com/FreeTDS/freetds/blob/cfa161498673d91a963d173561cbc99835da1cfa/src/tds/net.c#L622, so it should be consistent across environments.

What do you mean "normal db processing" though?

I mean in cases where the timeout is triggered while waiting for results from the db and a network connection is up. e.g. the timeout is configured for 1 second but sql waitfor delay 00:00:02 causes a timeout.

@aharpervc
Contributor

If we can come up with a reliable set of commands to drop/restore a network connection across platforms then we can likely write a test for this case.

SQL Server is running in a container, right? Does pausing the container after creating a connection have the same effect? Or maybe we can do something like this to pause the docker network temporarily: https://stackoverflow.com/a/56218103

We can also conditionally skip the test based on env vars, etc. I think it would be nice to have something even if it's silly and shells out to docker.
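A rough sketch of what that shelling-out helper could look like. The helper name, the env var, and the injectable runner are all made-up for illustration (they are not the helpers this PR actually adds); injecting the runner keeps the helper exercisable without Docker installed.

```ruby
# Hypothetical test helper: pause the SQL Server container for the duration
# of a block, then always unpause it, even if the block raises.
def with_container_paused(name, runner: ->(cmd) { system("docker #{cmd}") })
  runner.call("pause #{name}")
  yield
ensure
  runner.call("unpause #{name}") # resume even on failure
end

# In a test it might be used like this (TINYTDS_DOCKER_CONTAINER is a
# made-up env var for the conditional skip):
#
#   skip("needs a dockerized SQL Server") unless ENV["TINYTDS_DOCKER_CONTAINER"]
#   with_container_paused(ENV["TINYTDS_DOCKER_CONTAINER"]) do
#     assert_raises(TinyTds::Error) { client.execute("select 1").each }
#   end
```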

@bvogelzang
Contributor Author

Great suggestion. Pausing the container does have a similar effect. I've gone ahead and refactored an existing test we were previously skipping altogether. If you have a better suggestion for the skip logic I'd be happy to incorporate it.

@bvogelzang bvogelzang force-pushed the network-timeout-handling branch from 501156d to 08f6bed Compare January 19, 2021 17:42
@aharpervc aharpervc mentioned this pull request Jan 19, 2021
@bvogelzang
Contributor Author

These changes are now available to test in 2.1.4.pre

@aharpervc
Contributor

These changes are now available to test in 2.1.4.pre

This is probably fine for what it is; however, we still have some timeout-related problems. For example, when I run this scratch script and kill sql server, the main thread appears to hang forever instead of raising a timeout error.

  1. run a sql server container
  2. run this scratch script
  3. immediately stop the sql server container
  4. counter thread runs forever, main thread hangs instead of timeout
require 'tiny_tds'

def w(*args)
  puts [Thread.current.object_id, *args].join ' '
end

t = Thread.new do
  i = 0
  loop do
    i += 1
    w i
    sleep(1)
  end
end

c = TinyTds::Client.new({
  host: 'localhost',
  username: 'sa',
  password: 'Testing123@@',

  timeout: 5,
  login_timeout: 5,

  message_handler: ->(e) { w e },
})

w TinyTds::VERSION

sql = <<~SQL
create or alter procedure tinytds_TestPrintWithError
as
begin
    raiserror('raiserror', 1, 1) with nowait
    waitfor delay '00:01:00'
end
SQL

w c.execute(sql).to_a
w c.execute('exec tinytds_TestPrintWithError').to_a

t.exit
$ bundle exec ruby .\scratch.rb
23804380 1
4649200 2.1.4.pre
4649200
23804380 2
23804380 3
23804380 4
23804380 5
23804380 6
23804380 7
23804380 8
23804380 9
23804380 10
23804380 11
23804380 12
23804380 13
23804380 14
23804380 15
(...snip...)

I'm not sure if this particular problem should be addressed by this PR, or we should just remember it's still a problem and tackle it separately.

@sixfeetover

One suggestion for how to test this is to use ToxiProxy, a tool for simulating different kinds of network failure. You can boot a ToxiProxy container, connect to mssql through it, and then kill the connection, add jitter or latency, etc.
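For reference, a sketch of the toxiproxy-cli invocations a test setup might issue. The proxy name, ports, and the helper itself are assumptions for illustration; the flags follow toxiproxy-cli's documented interface, and nothing here is TinyTds API.

```ruby
# Hypothetical helper that builds toxiproxy-cli command lines for a test
# harness. Ports and the proxy name are arbitrary.
def toxiproxy_commands(proxy: "sqlserver", listen: "localhost:21433", upstream: "localhost:1433")
  {
    # Tests connect to `listen`; toxiproxy forwards traffic to the real server.
    create:  "toxiproxy-cli create -l #{listen} -u #{upstream} #{proxy}",
    # timeout=0 drops the connection's data indefinitely -- a "dead network".
    down:    "toxiproxy-cli toxic add -t timeout -a timeout=0 #{proxy}",
    # Add two seconds of latency instead of a hard failure.
    latency: "toxiproxy-cli toxic add -t latency -a latency=2000 #{proxy}",
    delete:  "toxiproxy-cli delete #{proxy}"
  }
end
```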

@bvogelzang
Copy link
Contributor Author

I was unable to reproduce using your scratch script. I used ToxiProxy to test network failures (slow close, timeout, and down) as well as manually pausing/stopping my sqlserver docker container. In all instances I hit a timeout at the expected point (at 6 seconds).

@bvogelzang bvogelzang force-pushed the network-timeout-handling branch 2 times, most recently from 8fc51de to 91fec1c Compare March 17, 2021 17:51
@aharpervc
Copy link
Contributor

These changes look good to me. We've been testing 2.1.4-pre2 in production for a week or so without any ill effect from this PR.

Can you squash/edit your commits down to whatever's appropriate for master history before merging?

@bvogelzang bvogelzang force-pushed the network-timeout-handling branch from 91fec1c to 05a6b29 Compare April 16, 2021 15:52
@bvogelzang bvogelzang force-pushed the network-timeout-handling branch from 05a6b29 to bc5fae8 Compare May 4, 2021 17:37
bvogelzang added 2 commits May 4, 2021 17:13
With a working network connection:
* command batch is called on non gvl thread
* tinytds_err_handler is called with timeout error and returns INT_TIMEOUT
* dbcancel is called and command batch returns
* nogvl_cleanup is called and timeout error is raised

This is all great. The timeout is hit, the db connection lives, and an error is thrown.

With a network failure:
* command batch is called on non gvl thread
* tinytds_err_handler is called with timeout error and returns INT_TIMEOUT
* dbcancel is called and does not succeed; the command batch never returns
* nogvl_cleanup is not called

This is not great. dbcancel does not succeed because of the network failure, and the command batch does not return until the underlying OS network timeout is hit. TinyTds doesn't throw an error within the expected timeout window.

To fix this, we set a flag when a timeout is encountered. We use dbsetinterrupt to check this flag periodically while waiting on a read from the server. Once the flag is set, the interrupt will send INT_CANCEL, causing the pending command batch to return early. This means nogvl_cleanup will be called and raise the timeout error.

This shouldn't have any effect in "normal" timeout conditions, because dbcancel will actually succeed and cause the normal flow before the interrupt can be called/handled. This is good because in these situations we still want the dbproc to remain in working condition.
@bvogelzang bvogelzang force-pushed the network-timeout-handling branch from bc5fae8 to 6466b78 Compare May 4, 2021 22:13
@bvogelzang bvogelzang merged commit 2b7d7ab into rails-sqlserver:master May 5, 2021
Successfully merging this pull request may close these issues.

Timeout does not work when tcp packets are dropped