-
Notifications
You must be signed in to change notification settings - Fork 132
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Performance Issues with Postgress #348
Comments
Hi, Please share additional info:
|
Hi, Sure
We have in GCP a few servers. Each server could have specific list of queues. Independent database for Hangfire as service in GCP. Database CPU - 100%, memory - around 50%. We can add additional instances fast. However, we have that issue when we have even 1 server. 13 secs latency for an index scan. And it's slowly degrading. We have 60k jobs in the queue and hangfire cannot start processing more than 5-10 jobs. Now we are not even creating jobs, just trying to process existing. So, technically we cannot add additional servers for horizontal scaling, because DB is dying. Do you have any ideas about what are we doing wrong? |
Try to |
That is not the desired behavior. We have small jobs and all servers could process those. So, technically increasing the pooling interval (15 seconds now) will mean that workers are just doing nothing sometimes. Or am I wrong? |
Technically, yes; in practice, depending on the server which enqueues the job, it can be picked up by a worker in the same server instance immediately because of Partial screenshot of the execution plan has little value. Can you provide the actual query plan from PG in JSON format instead? I can't open any of these images in full size, personally (404). EDIT: Also, what is behind |
Generated a million rows in the table. Here are the results. Even with a milion rows, query runs in sub-1s. |
That is my expectation. :) Will try to share the JSON plan later. I cannot it extract from GCP so simply. UseTagsWithPostgreSql enables Tags functionality by creating an independent table. It doesn't affect the initial table and selecting from jobqueue. OK, we will try to increase pooling interval. |
I'd suggest trying manual vacuum on the table, if GCP allows that in some way. We've had some issues with autovacuum not doing the job so well in our business apps. |
Tried, didn't help :( |
So, the reason is 100% related to amount of records in jobqueu. That subquery is the issue
Here is the plan for the query https://explain.dalibo.com/plan/ga35182eddag2f6h# As I understand we have problem because of |
https://postgrespro.com/list/thread-id/2505440 Here a similar problem is discussed. I quickly scanned through, and one recommendation is to introduce jitter lag to polling threads to decrease contention. Maybe there is something better, need to read until the end. |
Checked that thread. Interesting, but I don’t understand what can we do. Will Increasing pooling interval help us? Honestly I don’t understand how to use Postgres in production now. We are doing everything by best practices, however additional workers fully destroying performance of job processing. Why nobody else has that issue? Or nobody is using Postgres in production? |
Plenty of production applications use both PostgeSQL and this package. Each environment has different capabilities, bandwidth, hardware and so on, so each environment can behave differently. I personally have no experience with GCP so I can't really say anything about it. |
Yeah, just topic above is not related to hangfire and the current package, but Postgres. That is what I don’t understand why we have that issue but nobody else. What are we doing differently?:) I do see that issue related to processing “for update skip locked” in many threads. Any ideas what can we do with that? |
Oh, yeah, it's not about Hangfire.PostgreSql storage provider, but it is about PostgreSQL itself, which the provider does use. I have to admit I've not dug into the performance/indexing/etc mechanisms in PG as I had no need for it. If GCP has a limited set of operations that you can perform on the database directly (again, no idea how GCP works), my suggestion is to perhaps try Redis provider. At least according to Hangfire docs it has a huge throughput increase compared to SQL (makes sense, since Redis is in-memory cache). |
I believe you don’t understand the severity of that issue. Topic above is not related to GCP but yes, to something specific inside of Postgres. Until we understand what is that and how to mitigate it everyone who is using Hangfire with current package under risk of unpredictable performance degradation. It’s time bomb. It could happen once you try to add additional workers, or maybe not. Until we know what is that - production risks are too high for everyone. |
I agree; to an extent. That could happen with or without this package, as the linked mailing list shows. Without a reproduction there's virtually no way to see what can or can not happen. I understand how you feel, being the only one which reported the issue in this repo (I'm not saying you're the only one that's impacted; just about reporting). Unfortunately, you're the one that has a reproduction and means to see whether any changes to the queries are valid or not. Since it's an OSS package, you can submit a pull request if there's a possible fix. |
@vadim-kupovykh I may have some free time this weekend to try come up with something. I have only 2 ideas so far:
|
From what I understand, it's really about many workers trying to handle the same table, as in essence I'm not convinced there's a good way to just drop the ordering, even if configurable. All jobs must be picked up in order they were created. But this gave me an idea - maybe we lack the ordered index for this particular query? |
Probably not. At least, I see these 3 indexes that I think should work: create index ix_hangfire_jobqueue_jobidandqueue
on jobqueue (jobid, queue);
create index ix_hangfire_jobqueue_queueandfetchedat
on jobqueue (queue, fetchedat);
create index jobqueue_queue_fetchat_jobid
on jobqueue (queue, fetchedat, jobid); Requirements
Can you point me where it is documented that Hangfire-compatible storage must ensure ordered job pickup? I just checked MSSQL server implementation, and they do NOT order the jobs on deque (see https://github.com/HangfireIO/Hangfire/blob/main/src/Hangfire.SqlServer/SqlServerJobQueue.cs#L191-L201) Here's their SQL query: set nocount on;set xact_abort on;set tran isolation level read committed;
update top (1) JQ
set FetchedAt = GETUTCDATE()
output INSERTED.Id, INSERTED.JobId, INSERTED.Queue, INSERTED.FetchedAt
from [{_storage.SchemaName}].JobQueue JQ with (forceseek, readpast, updlock, rowlock)
where Queue in @queues and
(FetchedAt is null or FetchedAt < DATEADD(second, @timeoutSs, GETUTCDATE())); Ordering question has been discussed here HangfireIO/Hangfire#398 , and it seems that the correct way to maintain job dependencies is using job continuation. Usage scenariosI think that it really depends on the use case of each consumer whether ordering is important or not. Introducing an option to avoid sorting can be considered a feature that gives users even more control over their processing pipeline. PerformanceEnsuring ordering is really putting much stress on performance.
See the container CPU usage chart from docker. My setup
Suggested next steps@azygis I stand firm on my suggestion with a config flag that removes ordering in exchange for more performance. @vadim-kupovykh do you think this tradeoff would be applicable for your case, at least? Though, it may require additional tweaks on Dashboard and other parts of the library. On the dashboard, I saw strange artifacts when page with queue jobs would show "No jobs in queue" occasionally during active processing. Useful readingI also read this article. It is basically saying that https://www.2ndquadrant.com/en/blog/what-is-select-skip-locked-for-in-postgresql-9-5/ Source code of testing appusing Hangfire;
using Hangfire.PostgreSql;
string[] queues = { "alpha", "beta", "gamma", "test" };
bool generateJobs = true; //Toggle to switch between creating new jobs and processing them
var builder = WebApplication.CreateBuilder(args);
builder.Services.AddHangfire(configuration => {
configuration.UsePostgreSqlStorage(options => {
options.UseNpgsqlConnection("User ID=postgres;Password=mysecretpassword;Host=localhost;Port=5432;Database=hangfire-postgres");
});
});
if (!generateJobs)
{
builder.Services.AddHangfireServer(options => {
options.Queues = queues;
options.WorkerCount = 60;
});
}
var app = builder.Build();
if (generateJobs)
{
var client = app.Services.GetRequiredService<IBackgroundJobClientV2>();
foreach (string queue in queues)
{
Task.Run(() => {
for (int i = 0; i < 30_000; i++)
{
client.Enqueue<Jobs>(queue, jobs => jobs.Delayed());
}
});
}
}
app.UseHangfireDashboard("");
app.Run();
public class Jobs
{
public async Task Delayed()
{
}
} Toggling job orderingI compiled Hangfire.Postgres locally, and basically removed
|
What if you specify index order? Currently it's a default, which is ASC NULLS LAST. Query does a complete 180 with NULLS FIRST.
Something like that?
I cannot. It's a mistake done by many, and since the package has been with this ordering for quite some time, it can become a breaking change for some people. Of course, with a default of "keep ordering" it's not a breaking change though. |
@azygis brilliant idea! On my test data it works as fast as w/o ordering. |
…Ls in "fetchedat" column first for better de-queueing performance
…Ls in "fetchedat" column first for better de-queueing performance
Create index for jobqueue table that puts NULLs in "fetchedat" column first for better de-queueing performance Signed-off-by: Dmitry Vychikov <dmitry.vychikov@dsr-corporation.com>
Guys, you are amazing.:) going to test with ordered index! Will update you later. |
Lovely! Then we don't need any conditionals, I suppose? Would you be able to look at other indexes? If order is needed for any of them? Edit: @vadim-kupovykh yeah it's kinda safe to do live. When a new version is deployed, the index will just get recreated. |
Yes, we don't. I tried to test two cases with regards to ordering:
I didn't see any significant difference between these two. Both cases run faster with updated index. At least, we did not make it worse.
I might take a look next weekend. Right now I'm not even sure if other indexes for Maybe it's time to do some clean ups. |
@dmitry-vychikov I've checked on my side with just applying index order - so, we don't have that unexpected degradation anymore. The query is executing ten times faster. Please merge into master, we are ready to update ASAP on all our environments. We are continuing to monitor that the issue will not reproduce, but here is the latency before applying index order and after: I hope we will not have degradation anymore. @dmitry-vychikov, @azygis thank you guys for the fast solution! |
…formance [#348] Improve dequeue performance
Published as 1.20.8. |
Hi guys,
Here is query we see in Postgress logs
We have issues with performance on the Postgress side, sometimes execution up to 8 seconds of that query. We don't have so many jobs in the queue, up to 20K. Any ideas about what it can be?
The text was updated successfully, but these errors were encountered: