Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(Queue Mode): handle OS signals and RSpec internal wants_to_quit and rspec_is_quitting states to stop consuming tests from the Queue API when the CI node is terminated #207

Merged
merged 10 commits into from
Jul 26, 2023

Conversation

ArturT
Copy link
Member

@ArturT ArturT commented Jun 23, 2023

problem

When you use AWS Spot Instances, and your CI node is terminated, then it could happen that the RSpec process is somehow stopped and it does not execute tests, but the knapsack_pro gem keeps running and trying to fetch more tests from Queue API and assigns it to RSpec but RSpec does not execute it. Knapsack Pro thinks that you executed tests, and it asks for more tests from Queue API. This leads to allocating too many tests to the terminated CI node.

When the CI node is retried, it will rerun the tests that were assigned in the first place to the terminated CI node index. The retried CI node would get too many tests which lead to very slow tests execution on the CI node index.

solution

  • Let's handle the OS signals like TERM signal and stop consuming more tests from Queue API when it happens.

  • handle RSpec internal wants_to_quit and rspec_is_quitting states to stop consuming tests from the Queue API when the CI node is terminated

story

https://trello.com/c/1UJnbWQG

how to reproduce the problem and fix on the local project

Run tests with bin script bin/knapsack_pro_queue_rspec_record_first_run.

You can add a slow test example to one of the spec files like spec/rake_tasks/dummy_rake_spec.rb so that your tests would run slow enough and you will be able to perform all the following steps.

it 'a slow test example' do
  puts 'SLOW '*40
  sleep 10
end

Now when you run bin/knapsack_pro_queue_rspec_record_first_run please run also in another terminal:

watch -n 3 'ps aux|grep rspec'

You will see a PID 29939 of the process, look for something like, please note rspec_go in the process name.

artur            29939   0.0  0.3 409492640 171088 s023  S+    5:17PM   0:01.65 ruby /Users/artur/.rvm/rubies/ruby-3.2.2/bin/rake knapsack_pro:queue:rspec_go[--format d]

Then run:

kill -s TERM 29939

After you do it, we expect to see the batch of tests fetched from Queue API to be completed and then you see an exception Knapsack Pro process was terminated!.

Please note in the Knapsack Pro API logs that there is no more requests to Queue API to fetch more tests because after the batch of tests completed we raised the exception and stopped consuming tests from Queue API. Knapsack Pro can handle the TERM signal gracefully.

@ArturT ArturT added the bug label Jun 23, 2023
@ArturT ArturT changed the title fix(Queue Mode): handle TERM signal and stop consuming tests from Queue Mode fix(Queue Mode): handle TERM signal and stop consuming tests from the Queue API Jun 23, 2023
@ArturT ArturT changed the title fix(Queue Mode): handle TERM signal and stop consuming tests from the Queue API fix(Queue Mode): handle OS signals and RSpec internal wants_to_quit and rspec_is_quitting state to stop consuming tests from the Queue API when the CI node is terminated Jul 26, 2023
@ArturT ArturT changed the title fix(Queue Mode): handle OS signals and RSpec internal wants_to_quit and rspec_is_quitting state to stop consuming tests from the Queue API when the CI node is terminated fix(Queue Mode): handle OS signals and RSpec internal wants_to_quit and rspec_is_quitting states to stop consuming tests from the Queue API when the CI node is terminated Jul 26, 2023
@ArturT ArturT merged commit c910c8b into master Jul 26, 2023
@ArturT ArturT deleted the handle-signals branch July 26, 2023 12:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant