
No worker instances after upgrade #18189


Closed
penguineer opened this issue Jan 5, 2022 · 12 comments

@penguineer

penguineer commented Jan 5, 2022

Gitea Version

1.15.9

Git Version

2.30.2 (from container)

Operating System

Alpine 3.13 (from container)

How are you running Gitea?

I'm using the proposed docker-compose configuration.
Initial version: 1.12 (.3?)
Upgrade to: 1.15.7

Database

SQLite

Can you reproduce the bug on the Gitea demo site?

No

Log Gist

No response

Description

After upgrading Gitea to 1.15.9 everything seemed fine, but over time we experienced weird effects:

  • New repositories would always show the "new" page.
  • Pull Requests were fine after being created, but would not update, or would show "missing fork information".
  • Overall the repositories (git interaction) worked, but the UI was not in sync.

A restart of the container would now and then fix one of the issues, but only momentarily. I would see log messages like:
2022/01/04 19:10:48 ...ue/queue_bytefifo.go:224:doPop() [E] level: task-level Failed to unmarshal with error: readObjectStart: expect { or n, but found [, error found in #1 byte of ...|[{"PusherID|..., bigger context ...|[{"PusherID":1,"PusherName":"t","RepoUserName":"t|...
2022/01/04 19:10:48 ...ue/queue_bytefifo.go:224:doPop() [E] level: task-level Failed to unmarshal with error: readObjectStart: expect { or n, but found ", error found in #1 byte of ...|"429"|..., bigger context ...|"429"|...
These, however, disappeared after this one instance.

All these problems do not occur on a fresh instance, but we want to keep our data. (...)
Unfortunately we could not simply go back either, because we realized the problems too late for any of the backups to be viable.

I did the following during debugging:

  • Verify that hooks are called
  • Use the doctor tool to migrate the database (recreate-table) and fix warnings on all checks

In the end I realized that there are no workers (Site Administration → Monitoring). My initial guess was that these workers are instantiated automatically if there are tasks in the queue. Later I checked the queue configurations and clicked "Add Flush Workers" with three findings:

  • I get a 500 page; the log says that there are nil references and some i18n keys cannot be resolved. (Reproducible by instantiating a flush worker with tasks in the queue.)
  • Complaints about non-existing branches, which made sense because we resolved some merges by hand. (Work needed to continue.)
  • Suddenly there were updates in the UI.

As a workaround I have added a group with one worker to each queue. I could not find these worker groups in the database, though.

There is still the problem described in #17204, but it has not happened on new PRs yet.

Is there a problem during upgrade that leads to a situation where no workers exist?

Screenshots

No response

@zeripath
Contributor

zeripath commented Jan 5, 2022

I think you need to give us more logs.

From the tiny amount of logging you have given, it looks like the problem lies with there being something invalid in your task-level queue; why that would be the case I am uncertain.

I also need to see your app.ini to check your queue configuration.

@penguineer
Author

app.ini looks like this (values in <> have been replaced):

APP_NAME = <name>
RUN_MODE = prod
RUN_USER = git

[repository]
ROOT = /data/git/repositories

[repository.upload]
TEMP_PATH = /data/gitea/uploads

[server]
APP_DATA_PATH    = /data/gitea
SSH_DOMAIN       = <domain>
HTTP_PORT        = 3000
ROOT_URL         = <url>
START_SSH_SERVER = false
DISABLE_SSH      = false
SSH_PORT         = 2222
DOMAIN           = <domain>
LFS_START_SERVER = true
LFS_CONTENT_PATH = /data/gitea/lfs
LFS_JWT_SECRET   = <secret>
OFFLINE_MODE     = false

[database]
PATH     = /data/gitea/gitea.db
DB_TYPE  = sqlite3
HOST     = localhost:3306
NAME     = gitea
USER     = root
PASSWD   =
SSL_MODE = disable
;LOG_SQL  = false

[session]
PROVIDER_CONFIG = /data/gitea/sessions
PROVIDER        = file

[picture]
AVATAR_UPLOAD_PATH      = /data/gitea/avatars
DISABLE_GRAVATAR        = false
ENABLE_FEDERATED_AVATAR = true

[attachment]
PATH = /data/gitea/attachments

[log]
ROOT_PATH = /data/gitea/log
MODE      = console ;file
LEVEL     = debug

[security]
INSTALL_LOCK   = true
SECRET_KEY     = <key>
INTERNAL_TOKEN = <token>

[service]
DISABLE_REGISTRATION              = true
REQUIRE_SIGNIN_VIEW               = false
REGISTER_EMAIL_CONFIRM            = false
ENABLE_NOTIFY_MAIL                = true
ALLOW_ONLY_EXTERNAL_REGISTRATION  = false
ENABLE_CAPTCHA                    = false
DEFAULT_KEEP_EMAIL_PRIVATE        = false
DEFAULT_ALLOW_CREATE_ORGANIZATION = true
DEFAULT_ENABLE_TIMETRACKING       = true
NO_REPLY_ADDRESS                  = <address>

[mailer]
ENABLED     = true
HOST        = <host:port>
USER        = <user>
PASSWD      = <passwd>
FROM        = <from>
MAILER_TYPE = smtp

[openid]
ENABLE_OPENID_SIGNIN = true
ENABLE_OPENID_SIGNUP = false

[oauth2]
JWT_SECRET = <secret>

[queue]
BLOCK_TIMEOUT=0

@penguineer
Author

Unfortunately the logs contain sensitive information (employee and customer names are part of the repository names). There are no warnings or errors in the logs. The error I've quoted above happened once and then vanished.

The logs from the upgrade are already gone due to log rotation. :(

I'm happy to post logs for specific actions, if that helps.

@penguineer
Author

penguineer commented Jan 5, 2022

After adding a worker manually, the configuration for the queue push_update-channel now looks like this:
[screenshot: queue configuration page for push_update-channel showing the manually added worker group]

It looks like this for every queue.

@zeripath
Contributor

zeripath commented Jan 5, 2022

Why have you set:

[queue]
BLOCK_TIMEOUT=0

?

Does removing this help?

@zeripath
Contributor

zeripath commented Jan 5, 2022

The reasoning as to why this might be the cause of the problem is that on 1.15 Gitea does not start with any workers by default:

Starting workers are now 0 with boost workers at 1. If you have explicitly set BOOST_WORKERS = 0 you will need to explicitly set WORKERS to at least 1. Administrators of busy sites should tune their WORKERS, BOOST_WORKERS, & DATADIR parameters as needed.

With BLOCK_TIMEOUT=0 no blocking of the queue will be detected and thus no boosting will occur.
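
For example, as a minimal sketch (not taken from the reporter's instance): if you deliberately keep BLOCK_TIMEOUT=0 to prevent scaling, you also need to configure at least one permanent worker explicitly in app.ini, roughly:

[queue]
BLOCK_TIMEOUT = 0  ; keep the pool from scaling
WORKERS       = 1  ; explicitly start one worker, since the 1.15 default is 0

Conversely, removing the BLOCK_TIMEOUT=0 line lets blocking be detected again, so the default boost worker can be started.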

This is mentioned in the breaking notes for Gitea 1.15.0:

https://blog.gitea.io/2021/08/gitea-1.15.0-is-released/#-change-default-queue-settings-to-be-low-go-routines-15964httpsgithubcomgo-giteagiteapull15964

@penguineer
Author

penguineer commented Jan 5, 2022

Why have you set:

[queue]
BLOCK_TIMEOUT=0

?

Most of the configuration has been generated by Gitea. Could anything have triggered this setting?

Does removing this help?

It actually does.

From the documentation it seems that this inhibited the creation of new workers.
Maybe it makes sense to log a warning when this setting is detected?
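
For reference, the fix on our side amounted to simply dropping that line, roughly:

[queue]
;BLOCK_TIMEOUT = 0  ; removed; with blocking detection back, the default boost worker is started again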

@penguineer
Author

penguineer commented Jan 5, 2022

The reasoning as to why this might be the cause of the problem is that on 1.15 Gitea does not start with any workers by default:

Starting workers are now 0 with boost workers at 1. If you have explicitly set BOOST_WORKERS = 0 you will need to explicitly set WORKERS to at least 1. Administrators of busy sites should tune their WORKERS, BOOST_WORKERS, & DATADIR parameters as needed.

With BLOCK_TIMEOUT=0 no blocking of the queue will be detected and thus no boosting will occur.

This is mentioned in the breaking notes for Gitea 1.15.0:

https://blog.gitea.io/2021/08/gitea-1.15.0-is-released/#-change-default-queue-settings-to-be-low-go-routines-15964httpsgithubcomgo-giteagiteapull15964

Thanks for pointing this out!

However, I would not have realized this even when reading this change log.

@penguineer
Author

@zeripath Thank you for the quick help! This means I can remove the work-around now.

@penguineer
Author

From the documentation it seems that this inhibited the creation of new workers. Maybe it makes sense to log a warning when this setting is detected?

Or it could be part of the doctor tool. I ran all the tests and they said things were fine; maybe one test could check that the configuration will lead to viable workers, so that an administrator doesn't need to know too much about the internal workings of Gitea.

@zeripath
Contributor

zeripath commented Jan 5, 2022

Most of the configuration has been generated by gitea. Could anything have triggered this setting?

Nope, this will have been added by hand. It's really not the kind of setting we'd be setting to 0 by default, as it turns the scaling worker pool into a non-scaling worker pool. I wouldn't have written the scaling pool in the first place if I thought that most people would not need it, but the option exists because I know that some people will need to prevent scaling for certain deployments.

However, I would not have realized this even when reading this change log.

The breaking notice clearly refers to queues and definitely states that the default worker count is now 0. (The config cheat sheet similarly says this.) The only reason to set BLOCK_TIMEOUT=0 is to prevent worker scaling, and it should be clear that changing the default WORKERS to 0 would conflict with that. If I'd predicted that people were out there setting BLOCK_TIMEOUT=0 without adjusting WORKERS, I'd have mentioned it explicitly there, but I didn't. However, in the end it's your configuration: you're responsible for looking at your configuration and checking how the breaking notices interact with it.

Maybe it makes sense to print a warning when this setting is detected?

I never expected that anyone would be setting this value who did not have a really deep understanding of what it was doing. Certainly not without overriding the default WORKERS number manually themselves.

There are a lot of ways you can add settings that will cause a fundamentally broken configuration (for most circumstances); it's difficult to detect or predict every way in which things will be broken. We try to provide a decent set of defaults to keep things reasonable and to predict how people will be affected by changes in those defaults, but honestly I missed this one: setting BLOCK_TIMEOUT=0 alone, without any other queue settings, is not something that was predicted.

In the end, though, quite a lot of information was already provided to figure this out: you diagnosed that the issue was that there were no workers; there was a warning in the breaking notices about changes to the default number of workers; you looked at the monitoring page and discovered that there were no workers, and it would also have shown that the block timeout was 0; the cheat sheet tells you the default number of workers has changed to 0; and you had your app.ini. While we could provide more information, it's not as if there wasn't a lot of information already there.

Now, what can be done?

  • Whilst I think it would be reasonable to log something if WORKERS=0 and BLOCK_TIMEOUT=0, this would be at best WARN-level logging, and I don't think that would actually be noticed. It is, however, a simple thing and one that could be coupled with checking for MAX_WORKERS=0 and BOOST_WORKERS=0 too.
  • Similarly, I wouldn't be against trying to add some kind of configuration sanity check to doctor, but this would be very difficult to get right and we'd be spending a lot of time predicting how people can foot-gun themselves, which honestly we do a lot of already.
  • There's actually a valid use for running Gitea with absolutely no queue workers, but we don't handle that configuration well at the moment (you'll lose items from the queue, which is not good). So perhaps when that is sorted we can add detection at that point.

@penguineer
Author

Thank you for the long post!

Please take my remarks not so much as a critique of this very nice project, but as user/administrator feedback to help improve the experience.

We checked again and it is unclear why the option is there, but upgrades of three other instances didn't show any such issues. One thing came up during discussion: the upgrade from 1.12 to 1.15 was considered a minor upgrade, yet the meaning of a configuration variable changed, which should have triggered a major upgrade. I do not want to go too deep into that rabbit hole. A suggestion was that a new variable would have been better than changing the meaning of an existing one. (It is another discussion whether configuration is part of the API.)

The breaking notice clearly refers to queues and definitely states that the default worker count is now 0.

Please consider a user (or administrator) of this software who does not know about Gitea or Go internals. Up until this issue I simply was not aware that queues are used, as that knowledge was not needed. It is an implementation detail that, for our setup, never needed to be tweaked.

In the end though quite a lot of information was already provided to figure this out

I only figured this out because, luckily, I am able to read the source code (I have never developed with Go, though) and have enough experience with similar systems to understand the structure. The logs do not reveal that queues are used, so this was not something I could have learned from the log files. The changelog entry also would not have made sense to me before this analysis.

Whilst I think it would be reasonable to log something if WORKERS=0 and BLOCK_TIMEOUT=0, this would be a best a WARN level logging, I don't think that would actually be noticed. It is however a simple thing and one that could be coupled with checking for MAX_WORKERS=0 and BOOST_WORKERS=0 too.

I agree that during normal operations this would not be noticed. But when normal operations stopped, I checked the logs thoroughly and would definitely have seen such a warning.

I also looked at the queue configurations in the UI, but they are formatted as JSON and I did not relate them to the ini-style configuration file.

If the log entry is easy to achieve, it would be a nice safeguard. Otherwise I believe that this issue and discussion might be enough, hoping that it will show up in search results for other administrators with a similar problem.

Please keep up the good work! :)

@go-gitea go-gitea locked and limited conversation to collaborators Apr 28, 2022