No worker instances after upgrade #18189
Comments
I think you need to give us more logs. From the tiny amount of logging you have given it looks like the problem lies with there being something invalid in your task-level queue, though why that would be the case I am uncertain. I also need to see your app.ini to check your queue configuration.
Unfortunately the logs contain sensitive information (employee and customer names are part of repository names). There are no warnings or errors in the logs. The error I've quoted above happened once and then vanished. The logs from the upgrade have already been rotated away. :( I'm happy to post logs for specific actions, if that helps.
Why have you set [queue] BLOCK_TIMEOUT=0? Does removing this help?
The reasoning as to why this might be the cause of the problem is that on 1.15 Gitea does not start with any workers by default: with BLOCK_TIMEOUT=0 no blocking of the queue will be detected, and thus no boosting will occur. This is mentioned in the breaking notes for Gitea 1.15.0.
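For illustration, a minimal sketch of the relevant part of an app.ini, assuming the [queue] keys from the config cheat sheet (values and defaults may differ slightly between versions):

    [queue]
    ; BLOCK_TIMEOUT = 0 disables blockage detection, so the pool never boosts.
    ; Combined with the 1.15 default of WORKERS = 0, this leaves the queues
    ; with no workers at all.
    BLOCK_TIMEOUT = 0

    ; Removing the line, or setting it back to a positive duration (e.g. 1s),
    ; lets Gitea detect a blocked queue and boost workers again:
    ; BLOCK_TIMEOUT = 1s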
Most of the configuration has been generated by Gitea. Could anything have triggered this setting?
It actually does. From the documentation it seems that this inhibited the creation of new workers.
Thanks for pointing this out! However, I would not have realized this even after reading the change log.
@zeripath Thank you for the quick help! This means I can remove the work-around now.
Or it could be part of the doctor tool. I ran all the tests and they said things are fine - maybe one test could check that the configuration will lead to viable workers, so that an administrator doesn't need to know too much about the internal workings of Gitea. |
Nope, this will have been added by hand. It's really not the kind of setting we'd be setting to 0 by default, as it turns the scaling worker pool into a non-scaling worker pool. I wouldn't have written the scaling pool in the first place if I thought that most people would not need it, but the option exists because I know that some people will need to prevent scaling for certain deployments.
The breaking notice clearly refers to queues and definitely states that the default worker count is now 0. (The config cheat sheet similarly says this.) The only reason to set BLOCK_TIMEOUT=0 is to deliberately prevent the pool from scaling.
I never expected that anyone would be setting this value who did not have a really deep understanding of what it was doing, and certainly not without overriding the default WORKERS number manually themselves. There are a lot of ways you can add settings that will cause a fundamentally broken configuration (for most circumstances) - it's difficult to detect or predict every way in which things will be broken. We try to provide a decent set of defaults to keep things reasonable and to predict how people will be affected by changes in those defaults, but honestly I missed this one.
In the end, though, quite a lot of information was already provided to figure this out: you diagnosed that the issue was that there were no workers, there was a warning in the breaking notices about changes to the default number of workers, you looked at the monitoring page and discovered that there were no workers (where it would also have said that the block timeout was 0), the cheat sheet tells you the default number of workers has changed to 0, and you had your app.ini. While we could provide more information, it's not as if there wasn't a lot of information already there. Now, what can be done?
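A sketch of the kind of configuration described here - an intentionally non-scaling pool - again assuming the [queue] keys from the cheat sheet, with illustrative values:

    [queue]
    ; Deliberately disable blockage detection and boosting ...
    BLOCK_TIMEOUT = 0
    ; ... but then the worker count must be set explicitly rather than
    ; relying on the 1.15 default of 0.
    WORKERS = 2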
Thank you for the long post! Please take my remarks not so much as a critique of this very nice project, but as user/administrator feedback to help improve the experience. We checked again and it is unclear why the option is there, but upgrades on three other instances didn't show any such issues. One thing came up during discussion: the upgrade from 1.12 to 1.15 was considered a minor upgrade, yet the meaning of a configuration variable changed, which should have triggered a major upgrade. I do not want to go too deep into that rabbit hole. A suggestion was that a new variable would have been better than changing the meaning of an existing variable. (It is another discussion whether configuration is part of the API.)
Please consider a user (or administrator) of this software who does not know about the Gitea or Go internals. Up until this issue I simply was not aware that queues are used at all, because that knowledge was never needed. It is an implementation detail that, for our setup, never needed to be tweaked.
I only figured this out because, luckily, I am able to read the source code (I have never developed with Go, though) and have enough experience with similar systems to understand the structure. The logs do not reveal that queues are used.
I agree that during normal operations this would not be noticed. When normal operations stopped, I checked the logs thoroughly and would definitely have seen this warning. I also looked at the queue configuration in the UI, but it is formatted as JSON and I did not relate it to the ini-style configuration file. If the log entry is easy to achieve, it would be a nice safeguard. Otherwise I believe that this issue and discussion might be enough, hoping that it will show up in search results for other administrators with a similar problem. Please keep up the good work! :)
Gitea Version
1.15.9
Git Version
2.30.2 (from container)
Operating System
Alpine 3.13 (from container)
How are you running Gitea?
I'm using the proposed docker-compose configuration.
Initial version: 1.12 (.3?)
Upgrade to: 1.15.7
Database
SQLite
Can you reproduce the bug on the Gitea demo site?
No
Log Gist
No response
Description
After upgrading Gitea to 1.15.9 everything seemed well, but over time we experienced weird effects:
A restart of the container would now and then fix one of the issues, but only momentarily. I would see log messages like:
2022/01/04 19:10:48 ...ue/queue_bytefifo.go:224:doPop() [E] level: task-level Failed to unmarshal with error: readObjectStart: expect { or n, but found [, error found in #1 byte of ...|[{"PusherID|..., bigger context ...|[{"PusherID":1,"PusherName":"t","RepoUserName":"t|...
2022/01/04 19:10:48 ...ue/queue_bytefifo.go:224:doPop() [E] level: task-level Failed to unmarshal with error: readObjectStart: expect { or n, but found ", error found in #1 byte of ...|"429"|..., bigger context ...|"429"|...
These, however, disappeared after this one instance.
All these problems do not occur on a fresh instance, but we want to keep our data. (...)
Unfortunately we could also not just go back, because we realized the problems too late for any of the backups to be viable.
I did the following during debugging:
- run the doctor checks (including recreate-table) and fix warnings on all checks
In the end I realized that there are no workers (Site Administration → Monitoring). My initial guess was that these workers are instantiated automatically if there are tasks in the queue. Later I checked the queue configurations and clicked "Add Flush Workers" with three findings.
As a workaround I have added a group with one worker to each Queue. I could not find these worker groups in the database though.
There is still the problem described in #17204, but it has not happened on new PRs yet.
Is there a problem during upgrade that leads to a situation where no workers exist?
Screenshots
No response