influxd (v2) keeps crashing after 24-48 hours. #19904
Comments
Further investigation results: if I start the container without influxd (by overriding the Docker startup command), open a shell in it, and then start influxd manually from the CLI inside the container, it only starts successfully on every 2nd to 4th try. Just run "influxd" and it fails; run it again and it works. Once it has started, everything is fine: the DB is there, the data is there, all good.
Unclear what the next steps are here. Close?
Though the DB seemed fine in the beginning, this turned out not to be true. After several restarts the DB had first lost data; some restarts later it was completely corrupt (it no longer saved any new data, was not able to load tag values, ...). Influx started crashing more often, up to every 2-4 hours, and soon it was unusable. I downgraded to Influx v1.8.3, deleted the DB and started from scratch (on the same infrastructure; I only changed the Influx version in my container). That worked fine without any problems for 2 days. For testing I did a forced restart of the container. After that, influxd 1.8.3 says
but after several seconds it seems to magically recover somehow
I can only guess, but to me it looks like Influx is losing files on the storage during the restart of the container. Apparently 2.0 just crashes in this case, while 1.8 can somehow recover from it. For example, it complains "/var/lib/influxdb/wal/_internal/monitor/1/_00003.wal: no such file or directory", and the folder /var/lib/influxdb/wal/_internal/monitor/1/ is simply empty. I have no idea why it loses files. Other containers on the same (Azure) server do not have any problems with data loss or corruption. In any case, the logs make it clear that 1.8 is also missing some data/files. So it is probably not a problem specific to 2.0, except that 2.0 is not able to recover from it, which makes the impact bigger than in 1.8.
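One way to check the "files disappear across a restart" guess is to compare a file listing taken after influxd has stopped with one taken before it is started again. The following is only a rough sketch: `snapshot` and `compare` are hypothetical helper names, not part of InfluxDB, and it assumes influxd is not running during either snapshot (a running instance removes and rewrites WAL and TSM files on its own, which would show up as false positives).

```python
import os


def snapshot(root):
    """Map each file path (relative to root) to its size in bytes."""
    files = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            rel = os.path.relpath(full, root)
            try:
                files[rel] = os.path.getsize(full)
            except OSError:
                # The file vanished or is unreadable while we were walking.
                files[rel] = None
    return files


def compare(before, after):
    """Return (missing, shrunk): files that disappeared or got smaller."""
    missing = sorted(set(before) - set(after))
    shrunk = sorted(
        path
        for path in set(before) & set(after)
        if before[path] is not None
        and after[path] is not None
        and after[path] < before[path]
    )
    return missing, shrunk
```

For example, `before = snapshot("/var/lib/influxdb")` after stopping influxd, then `compare(before, snapshot("/var/lib/influxdb"))` once the container is back up but before influxd is started. Any entry in `missing` would point at the storage layer rather than at influxd itself.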
I think I found the root cause of the problem: hosting influxd in Azure AppService is simply not possible. The only option available for persistent storage on Azure AppService Linux (which we use to host the influxd instance) is Azure Files. Azure Files is not fully POSIX-compliant, especially when it comes to file locking. There are workarounds that enable some database systems like "sqlite" to work on that storage anyway, see e.g. https://github.com/MicrosoftDocs/azure-docs/issues/47130 That works well in our case (e.g. when using Grafana). Still, it is unsupported by MS and cannot solve every problem. In our case it has never been a problem so far, which is why I did not even know about it. I can only guess that influxd has a problem with the missing POSIX compliance of the storage and that this causes the problems. I have no idea how that leads to missing files, but other DBs also seem to have problems with data corruption in this setup. Since 09/2020 Azure has had a new POSIX-compliant storage offering for Azure Files (called Azure Files NFS). But that is still in preview and, at least until now, only mounts its volumes read-only into AppService.
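To see whether a given mount honours POSIX advisory locks at all, a small probe like the one below could be run inside the container against the data directory. This is a sketch under assumptions: `probe_locking` is a hypothetical helper, it only tests `fcntl`-style locks between two processes, and it says nothing about which locking or mmap calls influxd actually relies on. It needs a Unix-like system and write access to the directory.

```python
import fcntl
import os
import subprocess
import sys

# Child process: try to take the same exclusive lock the parent already holds.
_CHILD = """
import fcntl, sys
with open(sys.argv[1], "r+b") as f:
    try:
        fcntl.lockf(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError:
        sys.exit(1)  # refused: the parent's lock is being honoured
sys.exit(0)          # granted although the parent still holds the lock
"""


def probe_locking(directory):
    """Return 'enforced', 'not enforced', 'unsupported' or 'error'."""
    path = os.path.join(directory, ".lock-probe")
    with open(path, "w+b") as f:
        f.write(b"x")
        f.flush()
        try:
            fcntl.lockf(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except OSError:
            result = "unsupported"  # the filesystem rejected the lock call
        else:
            child = subprocess.run([sys.executable, "-c", _CHILD, path])
            if child.returncode == 1:
                result = "enforced"
            elif child.returncode == 0:
                result = "not enforced"
            else:
                result = "error"
    os.remove(path)
    return result
```

Calling `probe_locking("/var/lib/influxdb")` on the Azure Files mount and on a local path such as `/tmp` would show whether the two behave differently. A result other than "enforced" on the mount would be consistent with the locking limitation described above, though it would not by itself explain the missing files.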
I am running InfluxDB 2 OSS from the current docker-image from quay.io/influxdb/influxdb:2.0.0-rc (created 2020-10-29T22:29:51.965000138Z) on Azure AppService. The folder with the data (/var/lib/influxdb inside the container) is mapped to Azure Blob Storage.
Everything is running as it should. Performance is fine, queries work, the API is working. All good.
After roughly 24-48 hours InfluxD just stops responding (freeze). Container is then auto-restarted by Azure's Health Checker.
During the restarts influxd crashes with different error-messages like
unexpected fault address 0x7fca735fb000
(~1 second after "Open store (start)")
or
2020-11-05T07:45:17.147727243Z ts=2020-11-05T07:45:17.147623Z lvl=info msg="Open store (start)" log_id=0QHqiI_W000 service=storage-engine op_name=tsdb_open op_event=start
2020-11-05T07:45:17.709247550Z Error: readdirent: no such file or directory
or
unexpected fault address 0x7f3bbfe1e008
All of that is preventing influxd from starting. The only "fix" I found so far to get the thing running again is to delete everything in /var/lib/influxdb/. After that (with the same settings/container) Influx starts fine (creates a new DB) and is working like it should again. At least for 1-2 days.
We use the same host/storage for other containers (e.g. Grafana). All of them have been running fine for ages.
Environment info:
Config:
INFLUXD_BOLT_PATH=/var/lib/influxdb2/influxd.bolt
INFLUXD_CONFIG_PATH=/var/lib/influxdb2/config
INFLUXD_ENGINE_PATH=/var/lib/influxdb2/engine
INFLUXD_LOG_LEVEL=debug
INFLUXD_SECRET_STORE=bolt
INFLUXD_STORE=bolt
INFLUXDB_REPORTING_DISABLED=true
Logs:
Full Crash-Logs at https://bacom-my.sharepoint.com/:t:/g/personal/alexb_ba-com_net1/EVGWgfqPBl5IrBhTXW1N1jIB2RUrOxPppm6kaOM-T6vrtg?e=UTdVia