Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Please try v7.0.10 #571

Closed
mhenrixon opened this issue Jan 22, 2021 · 73 comments
Closed

Please try v7.0.10 #571

mhenrixon opened this issue Jan 22, 2021 · 73 comments

Comments

@mhenrixon
Copy link
Owner

I have worked really hard for over a year on stabilizing and improving the gem.

Unfortunately, I couldn't really come to terms with the code and the tests in version 6. I kindly ask that any person who has an issue with version 6 considers an upgrade to >= v7.0.1. That way, it will be a lot easier for me to help you.

Being a sole maintainer with a wife, two kids, other hobbies, and a full-time job makes it really difficult to support the same code in two versions. Since I put so much effort into v7, I suggest you help me make it even better by at least trying it out and reporting if you still have the same problem.

With fewer issues on version six to read and think about, I can put all my focus on version seven! That will quickly turn version seven into our best gem-friend.

Sorry to all you v6 people out there who still have problems. Many people have reported that v7 works better for them, and it does have some much-needed improvements; orphaned locks cleanup for one.

To upgrade, you need to make some changes:

Configuration

I'd say this is a pretty sane configuration for most people starting with version 7.

# frozen_string_literal: true

require "sidekiq"
require "sidekiq-unique-jobs"

SidekiqUniqueJobs.configure do |config|
  config.debug_lua       = false # true for debugging
  config.lock_info       = false # true for debugging
  config.max_history     = 1000  # keeps n number of changelog entries
  config.reaper          = :ruby # also :lua but that will lock while cleaning
  config.reaper_count    = 1000  # Reap maximum this many orphaned locks    
  config.reaper_interval = 10    # Reap every 10 seconds
  config.reaper_timeout  = 5     # Give the reaper 5 seonds to finish
end

Sidekiq.configure_client do |config|
  config.redis = { url: ENV["REDIS_URL"], driver: :hiredis }

  config.client_middleware do |chain|
    chain.add SidekiqUniqueJobs::Middleware::Client
  end
end

Sidekiq.configure_server do |config|
  config.redis = { url: ENV["REDIS_URL"], driver: :hiredis }

  config.server_middleware do |chain|
    chain.add SidekiqUniqueJobs::Middleware::Server
  end

  config.client_middleware do |chain|
    chain.add SidekiqUniqueJobs::Middleware::Client
  end

  # NOTE: This is really important!!!!
  #
  #  1. Starts the reaper background process/
  #  2. Adds a death_handler that cleans up locks when jobs die.
  #  3. Automatically upgrades all locks from v6; you don't need to worry.
  #
  SidekiqUniqueJobs::Server.configure(config)
end

If you are using other sidekiq middleware, I suggest you read the section Other Sidekiq gems for examples of which order the middleware needs to be added.

The order of said middleware is super important, so don't skip on this, and please let me know if I got something wrong to update the README.

I have done my best to support version six, but I am at my wit's end. I didn't rewrite the codebase just for fun. I did it to be able to fix people's problems. Now has come the time to try to push people over to the next major version.

Help me help you by upgrading!

@bekicot
Copy link

bekicot commented Jan 26, 2021

Do hiredis is a requirement?

@mhenrixon
Copy link
Owner Author

Hi @bekicot, not a requirement, definitely recommended if you are on regular ruby and are able to use C extensions as it does speed up things a little.

I use it for every project that deals with redis and sidekiq

@collimarco
Copy link

Hello,

first of all thanks for this great gem.

However I see some problems after the upgrade from v6... In particular if I run rspec in our application I get hundreds of messages like this:

WARN: Timed out after 0s while waiting for primed token (digest: uniquejobs:..., job_id:...)

This is one of the workers that produce the error:

class DestroyUserJob
  include Sidekiq::Worker

  sidekiq_options queue: :low, lock: :until_executed

  def perform(user_id)
    user = User.find user_id
    user.custom_destroy
  end
end

Then in the tests:

it 'schedules the job only once' do
  expect {
    DestroyUserJob.perform_async uid
    DestroyUserJob.perform_async uid
    DestroyUserJob.perform_async uid
  }.to change(DestroyUserJob.jobs, :size).by(1)
end
  • If I call DestroyUserJob.perform_async 1 time, I don't get any warning...
  • If I call it 3 times, like in this example, I get 2 warnings...
  • If I call it 5 time, I get 4 warning... etc.

What does it mean? What should I do?

@mhenrixon
Copy link
Owner Author

Hi @collimarco,

Please see the section about testing: https://github.com/mhenrixon/sidekiq-unique-jobs/#testing I recommend you leave the testing of uniqueness to the gem maintainer. There are ways to test uniqueness but it is complex and unreliable.

@collimarco
Copy link

@mhenrixon Thanks for the reply! The test for uniqueness here is just an example... I can even remove that test from our test suite, however I would not solve the problem. We have tens (or hundreds) of WARN like that above in very different tests : I took one that was simple to reproduce for you.

Basically every test that involves background workers in some way raises that errors. I'm not talking about testing uniqueness, I am talking about tests that deal in some way with background workers.

All our tests have been passing for years using SidekiqUniqueJobs v6... While today, if I update to v7, everything raises that WARN (which is totally obscure to me). What does it mean? Any idea on how to fix it?

@collimarco
Copy link

Example:

# this is a perfectly valid test, but with v7 the test raises warnings
it 'tests something' do
  ExampleUniqueJob.perform_async
  # ...
  ExampleUniqueJob.perform_async
  # ...
end

@ArturT
Copy link
Contributor

ArturT commented Feb 11, 2021

Update after using below config since last night:

config.reaper_count    = 1000
config.reaper_interval = 600  
config.reaper_timeout  = 595
  • number of threads seems to be stable ~10-12. I don't see leaking threads anymore.
  • I still see timeout from time to time in the output:
2021-02-10T23:41:15.171569+00:00 app[worker.1]: pid=3 tid=6cdmz WARN: (2021-02-10 23:41:15 +0000) Execution timed out
2021-02-11T00:01:10.458276+00:00 app[worker.1]: pid=3 tid=7igen WARN: (2021-02-11 00:01:10 +0000) Execution timed out

I also see the reaper removes some orphans so that's good.

2021-02-11T12:29:07.320673+00:00 app[worker.1]: pid=3 tid=77rr uniquejobs=reaper INFO: Deleting batch with 64 digests
2021-02-11T12:41:14.266990+00:00 app[worker.1]: pid=3 tid=77rr uniquejobs=reaper INFO: Deleting batch with 67 digests
2021-02-11T12:53:51.259274+00:00 app[worker.1]: pid=3 tid=77rr uniquejobs=reaper INFO: Deleting batch with 123 digests
2021-02-11T13:11:23.906011+00:00 app[worker.1]: pid=3 tid=77rr uniquejobs=reaper INFO: Deleting batch with 733 digests

I don't see the impact on Redis memory so running reaper every 10 minutes is good enough in my case. I would consider the problem solved just by using reaper every 10 minutes and using reaper_timeout close to that value.

I'd recommend other users who use Redis intensively (~3k ops/s) to be aware of how slow reaper can be and adjust their config to avoid leaking threads when the reaper can't complete work within reaper_timeout.

Thanks @mhenrixon for help with investigating the issue.

@mhenrixon
Copy link
Owner Author

mhenrixon commented Feb 11, 2021

No, thank you @ArturT you did most of the investigation. I am so happy you managed to find a workaround until I get the underlying issue completely fixed.

I will also see about improving the performance here a bit. What I had in mind was to use futures and zip them at the end so at least the checks don't have to run in parallel.

Even if it wasn't perfect, as long as no one is negatively affected I am super pleased with the release overall. Having someone like you test it will help me get v7 in top shape. Thank you for taking a leap of faith and trusting.

Ultimately, after fixing the issues you found with the reaper, I still have to make sure that reaping is not needed anymore but I wanted to buy myself some time to really dig into the details.

  • Fix the thread leakage due to bug with the TimerTask (honestly, TimerTask should never be allowed to spawn more than the one thread it is started with).
  • Improve the performance of the reaper as much as possible. A more efficient reaper will avoid any types of problems with configuration.
  • Gather some intelligence of the performance of the locking mechanism during high load and improve that as much as possible.
  • Do an even better job at removing the need for the reaper, it should run as rarely as possible, perhaps once a day or something. There shouldn't be more need than that for reaping. It should be for edge cases when the sidekiq server process was killed abruptly without the possibility of a cleanup.

When that is done, I'll see about adding some more features:

  • One of the things that I'd really like to see is unique batches/units of work :)
  • Another thing I'd like to see is an improved fetcher that filters jobs that can't be processed because another job has the lock already.
  • Third thing would be to somehow also provide a nicer way of bulk pushing jobs back on failure.

@mhenrixon
Copy link
Owner Author

@ArturT I tried to fix the problem in #576. As you can see in bin/profiling I can no longer replicate the problem but before merging I would really appreciate some help checking the PR out if it might work for you.

@DAmatheson
Copy link

👋🏻 It seems like the Locks and Changelog web pages have some functionality problems.

  • Paging doesn't seem to work on either page
  • Filtering on Changelogs doesn't seem to work, I tried searching with values in any of the columns and got no results
  • Filtering gets reset when you try to use the paging controls

@ArturT
Copy link
Contributor

ArturT commented Feb 24, 2021

Filtering gets reset when you try to use the paging controls

I can confirm this issue. When you click the Next 100 button then filter is reset back to *.

Filtering on Changelogs doesn't seem to work, I tried searching with values in any of the columns and got no results

Have you tried *my-string*. It seems to return results then.

@mhenrixon
Copy link
Owner Author

Thank for the report @DAmatheson and @ArturT I will see about getting that fixed this weekend.

@mhenrixon
Copy link
Owner Author

Merp! Been a crazy week, kids sick and such. Hopefully finding a minute during this weekend.

@mhenrixon
Copy link
Owner Author

@DAmatheson @ArturT I have something almost ready now: #584

@mhenrixon
Copy link
Owner Author

@DAmatheson @ArturT it has been released as v7.0.7

@roman-melnyk
Copy link

hi @mhenrixon !
doing some performance testing on 7.0.9 version I found one strange issue, may be you could explain what it can be:
I created a simple job which does nothing

  sidekiq_options(
    lock: :until_and_while_executing,
    on_conflict: {client: :log, server: :reschedule},
    lock_ttl: nil,
    lock_timeout: 0
  )

  def perform(a)
  end

and then I've put 10_000 jobs in a queue

10_000.times { |i| TestJob.perform_async i }

after starting Sidekiq I got a lot of messages like

... WARN: Timed out after 0s while waiting for primed token

by monitoring jobs execution in sidekiq web, I noticed, that some jobs were executed, whereas another were rescheduled, so that total count of processed jobs was ~15_000 (each second job was rescheduled?)

do I miss something in configuration? thanks

@mhenrixon
Copy link
Owner Author

@roman-melnyk this would be expected. Waiting for zero seconds means that when the job is not able to achieve a lock it will timeout after the number of seconds you configured it for. #598 commented on the logs being kind of verbose and wanted to have the ability to silence them.

What I'd like to do is to add hooks or lifecycle events to the gem so that people can do what they want with this type of information. That way, I don't have to log the warnings but can leave it to the user to act if they feel it is worth acting on.

In this case, it should probably not be a warning because in all actuality the gem is doing what it is supposed to.

Does that help at all? Any suggestions for how to make it more clear? Perhaps reducing the log level or merging the related PR to allow silencing the information until I have time to add the hooks?

@roman-melnyk
Copy link

@mhenrixon thanks for reply!
seems I was confused by lock_ttl - when it's nil, then this code tries to receive a value in a very short period of time

    def primed_async(conn)
      return yield if Concurrent::Promises
                      .future(conn) { |red_con| pop_queued(red_con) }
                      .value(drift(config.pttl) / 1000) # Important to reduce time spent waiting

      warn_about_timeout
    end

it's 0.002 s
so, it looks like it definitely should be bigger - but how much?
if I set lock_ttl to 2 (s), then drift(config.pttl) / 1000 is 0.022 s - which is enough (locally) not to get timeout at all on 20 threads
could you provide suggestion what this value should be? also, could you clarify why we set the dependency on TTL with time in which we expect rpoplpush be performed? shoudn't it correlate with config.timeout instead?

@roman-melnyk
Copy link

just noticed - in this case I'm getting a lot of Couldn't unlock digest:

@roman-melnyk
Copy link

roman-melnyk commented May 6, 2021

so, what is seems worked for me - keep lock_ttl to nil, lock_timeout to 0, and change this to

      return yield if Concurrent::Promises
                      .future(conn) { |red_con| pop_queued(red_con) }
                      .value(0.05) # testing value

then I get no unlock warnings, no timeouts

@mhenrixon
Copy link
Owner Author

Thanks for the details @roman-melnyk I'll take another stab at this again.

Just a little too late for me right now.

@mhenrixon mhenrixon changed the title Please try v7.0.1 Please try v7.0.10 May 10, 2021
@mhenrixon
Copy link
Owner Author

@roman-melnyk I believe v7.0.10 should do the trick. The problem was that the concurrent promise might have waited shorter than the pop_queued.

@ragesoss
Copy link

Hi! Thanks for all the work you've put into v7. I'm very excited about the reaper feature, which should let me get rid of some custom code for handling orphan locks.

I'm trying out the version 7 upgrade. One thing I was a little confused about is lock_timeout, which appears as a configuration value for the gem, where the example configuration block sets it with a value of nil — and also as a worker option, where it says the default is 0 and that nil is dangerous. Is nil in the example configuration not the default? (I initially assumed that all the values shown in that configuration were the defaults.)

Also, in my test suite, I have several cases where version 7 now throws a warning like this:

2021-05-13T22:10:20.831Z pid=3828 tid=78s uniquejobs=client until_executed=uniquejobs:b8f23a84a3985876802688e4f373b460 WARN: Timed out after 0s while waiting for primed token (digest: uniquejobs:b8f23a84a3985876802688e4f373b460, job_id: 07b0d2609282a5448421c9a7)

For version 6, I was removing the uniquejobs client middleware from the Sidekiq chain in the spec helper. I tried removing that block, and I also tried mirroring all the initializer changes, like this:

# Peform all Sidekiq worker tasks immediately during testing
require 'sidekiq/testing'
require 'sidekiq-unique-jobs'
Sidekiq::Testing.inline!
Sidekiq.logger.level = Logger::WARN
Sidekiq.configure_server do |config|
  config.client_middleware do |chain|
    chain.remove SidekiqUniqueJobs::Middleware::Client
  end

  config.server_middleware do |chain|
    chain.remove SidekiqUniqueJobs::Middleware::Server
  end
end

Sidekiq.configure_client do |config|
  config.client_middleware do |chain|
    chain.remove SidekiqUniqueJobs::Middleware::Client
  end
end

In all cases, I still get that timeout warning. If I set a higher-than-zero value for lock_timeout, it will wait that long before throwing the warning. Any advice on how to make this play nicely with Sidekiq::Testing.inline!?

@mhenrixon
Copy link
Owner Author

@ragesoss there is a section in the readme about testing.

SidekiqUniqueJobs.configure do |config|
  config.enable = !Rails.env.test?
end

For example.

@roman-melnyk
Copy link

@ragesoss or, if you need to disable it only in some particular specs

    around do |spec|
      enabled = SidekiqUniqueJobs.config.enabled
      SidekiqUniqueJobs.config.enabled = false
      spec.run
      SidekiqUniqueJobs.config.enabled = enabled
    end

@mhenrixon
Copy link
Owner Author

@roman-melnyk that looks overly complicated :) I'll post a much better solution in a few minutes.

@mhenrixon
Copy link
Owner Author

around do |example|
  SidekiqUniqueJobs.disable!(&example)
end

That's how you do it @roman-melnyk

@mhenrixon
Copy link
Owner Author

You can also use it in your test

it "works" do 
  SidekiqUniqueJobs.disable! do 
     # run my code that verifies assumptions
  end
end 

@ragesoss
Copy link

@ragesoss there is a section in the readme about testing.

SidekiqUniqueJobs.configure do |config|
  config.enable = !Rails.env.test?
end

For example.

Thank you! I saw this section in the README, but I didn't understand the context because it's in the 'Uniqueness' subsection of Testing and the link to the Sidekiq Enterprise thing is broken for me. It might help to put this at the top of testing section and explain the recommendation (turn it off for testing, just like Sidekiq Enterprise) explicitly.

@ragesoss
Copy link

Reporting back... Here's what my upgrade looked like: WikiEducationFoundation/WikiEduDashboard@daa4b88

It's been running in production this week without any problems.

Thanks!

@mhenrixon
Copy link
Owner Author

@ragesoss that's awesome man! Thank you for reporting back and a massive thank you for the link to the PR. That's super helpful for other people who upgrade.

@mhenrixon mhenrixon unpinned this issue Jun 29, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

8 participants