influxd (v2) keeps crashing after 24-48 hours. #19904

Closed
bacomalex opened this issue Nov 5, 2020 · 4 comments
bacomalex commented Nov 5, 2020

I am running InfluxDB 2 OSS on Azure AppService, using the current Docker image from quay.io/influxdb/influxdb:2.0.0-rc (created 2020-10-29T22:29:51.965000138Z). The data folder (/var/lib/influxdb inside the container) is mapped to Azure Blob Storage.

Everything is running as it should. Performance is fine, queries work, the API is working. All good.

After roughly 24-48 hours influxd just stops responding (freezes). The container is then auto-restarted by Azure's health checker.

During these restarts influxd crashes with varying error messages, such as

unexpected fault address 0x7fca735fb000
(~1 second after "Open store (start)")

or

2020-11-05T07:45:17.147727243Z ts=2020-11-05T07:45:17.147623Z lvl=info msg="Open store (start)" log_id=0QHqiI_W000 service=storage-engine op_name=tsdb_open op_event=start
2020-11-05T07:45:17.709247550Z Error: readdirent: no such file or directory

or

unexpected fault address 0x7f3bbfe1e008

All of this prevents influxd from starting. The only "fix" I have found so far to get it running again is to delete everything in /var/lib/influxdb/. After that (with the same settings/container) InfluxDB starts fine (creates a new DB) and works as it should again - at least for 1-2 days.
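For reference, the reset roughly looks like this (just a sketch of the procedure; the container name is a placeholder, and on AppService the stop/start is of course done through Azure rather than docker):

# Stop the container, wipe the data directory on the mapped storage, restart.
docker stop influxdb                  # "influxdb" is a placeholder container name
rm -rf /var/lib/influxdb/*            # data path as seen inside the container / on the mapped share
docker start influxdb                 # influxd then comes up and creates a fresh database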

We use the same host/storage for other containers (e.g. Grafana). All of them have been running fine for ages.

Environment info:

  • System info: Linux 5.4.0-1026-azure x86_64
  • InfluxDB version: InfluxDB 2.0.0-rc.3 (git: f46a3bd) build_date: 2020-10-29T22:17:55Z

Config:
INFLUXD_BOLT_PATH=/var/lib/influxdb2/influxd.bolt
INFLUXD_CONFIG_PATH=/var/lib/influxdb2/config
INFLUXD_ENGINE_PATH=/var/lib/influxdb2/engine
INFLUXD_LOG_LEVEL=debug
INFLUXD_SECRET_STORE=bolt
INFLUXD_STORE=bolt
INFLUXDB_REPORTING_DISABLED=true
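For anyone trying to reproduce this outside AppService, a roughly equivalent local setup would look like this (sketch only; the host path for the mounted share is a placeholder):

docker run -d --name influxdb \
  -p 8086:8086 \
  -v /mnt/azure-files/influxdb:/var/lib/influxdb2 \
  -e INFLUXD_BOLT_PATH=/var/lib/influxdb2/influxd.bolt \
  -e INFLUXD_CONFIG_PATH=/var/lib/influxdb2/config \
  -e INFLUXD_ENGINE_PATH=/var/lib/influxdb2/engine \
  -e INFLUXD_LOG_LEVEL=debug \
  -e INFLUXD_SECRET_STORE=bolt \
  -e INFLUXD_STORE=bolt \
  -e INFLUXDB_REPORTING_DISABLED=true \
  quay.io/influxdb/influxdb:2.0.0-rc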

Logs:
Full Crash-Logs at https://bacom-my.sharepoint.com/:t:/g/personal/alexb_ba-com_net1/EVGWgfqPBl5IrBhTXW1N1jIB2RUrOxPppm6kaOM-T6vrtg?e=UTdVia

bacomalex (Author) commented:

Further investigation results:
If influxd does not start (with various error messages as described above), the database is not corrupt.

If I start the container without influxd (by overriding the Docker startup command), shell into it and then start influxd manually from the CLI (inside the container), it works on every 2nd-4th try.

So just start "influxd" - it fails. Just do it again - it works. If it works everything is fine. DB is there, data is there - all good.
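Roughly what that looks like (on AppService the startup command is overridden in the site settings; locally the equivalent would be something like the sketch below - container name and host path are placeholders, and the env vars from the config above are omitted for brevity):

# Start the container idle instead of launching influxd directly.
docker run -d --name influxdb-debug --entrypoint sleep \
  -v /mnt/azure-files/influxdb:/var/lib/influxdb2 \
  quay.io/influxdb/influxdb:2.0.0-rc infinity

# Shell in and start influxd by hand; repeat if it fails.
docker exec -it influxdb-debug /bin/bash
influxd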

timhallinflux (Contributor) commented:

Unclear what the next steps are here? Close?

bacomalex (Author) commented:

Though the DB seemed fine in the beginning, this turned out not to be true. After several restarts the DB had lost data; some restarts later it was completely corrupt (did not save any new data any more, was not able to load tag values, ...). The crash frequency increased to every 2-4 hours. Soon it was unusable.

I downgraded to InfluxDB v1.8.3, deleted the DB and started from scratch (on the same infrastructure - I just changed the InfluxDB version in my container). That worked fine without any problems for 2 days.

For testing I did a forced restart of the container. After that, influxd 1.8.3 reports

2020-11-11T07:47:16.983175779Z ts=2020-11-11T07:47:16.983030Z lvl=info msg="InfluxDB starting" log_id=0QP_DFTl000 version=1.8.3 branch=1.8 commit=563e6c3d1a7a2790763c6289501095dbec19244e
2020-11-11T07:47:16.983195279Z ts=2020-11-11T07:47:16.983057Z lvl=info msg="Go runtime" log_id=0QP_DFTl000 version=go1.13.8 maxprocs=2
2020-11-11T07:47:22.958995280Z ts=2020-11-11T07:47:22.958690Z lvl=info msg="Using data dir" log_id=0QP_DFTl000 service=store path=/var/lib/influxdb/data
2020-11-11T07:47:22.968696715Z ts=2020-11-11T07:47:22.967830Z lvl=info msg="Compaction settings" log_id=0QP_DFTl000 service=store max_concurrent_compactions=1 throughput_bytes_per_second=50331648 throughput_bytes_per_second_burst=50331648
2020-11-11T07:47:22.968722816Z ts=2020-11-11T07:47:22.967871Z lvl=info msg="Open store (start)" log_id=0QP_DFTl000 service=store trace_id=0QP_Dbql000 op_name=tsdb_open op_event=start
2020-11-11T07:47:24.045756329Z ts=2020-11-11T07:47:24.045568Z lvl=info msg="Opened file" log_id=0QP_DFTl000 engine=tsm1 service=filestore path=/var/lib/influxdb/data/_internal/monitor/1/000000001-000000001.tsm id=0 duration=33.480ms
2020-11-11T07:47:24.061829353Z ts=2020-11-11T07:47:24.061589Z lvl=info msg="Reading file" log_id=0QP_DFTl000 engine=tsm1 service=cacheloader path=/var/lib/influxdb/wal/_internal/monitor/3/_00001.wal size=10486244
2020-11-11T07:47:24.118090838Z ts=2020-11-11T07:47:24.116908Z lvl=info msg="Failed to open shard" log_id=0QP_DFTl000 service=store trace_id=0QP_Dbql000 op_name=tsdb_open db_shard_id=1 error="[shard 1] open /var/lib/influxdb/wal/_internal/monitor/1/_00003.wal: no such file or directory"
2020-11-11T07:47:24.919481892Z ts=2020-11-11T07:47:24.918714Z lvl=info msg="Failed to open shard" log_id=0QP_DFTl000 service=store trace_id=0QP_Dbql000 op_name=tsdb_open db_shard_id=2 error="[shard 2] error opening file /var/lib/influxdb/data/olmero/autogen/2/000000001-000000001.tsm: open /var/lib/influxdb/data/olmero/autogen/2/000000001-000000001.tsm: no such file or directory"
2020-11-11T07:47:24.984226610Z ts=2020-11-11T07:47:24.984036Z lvl=info msg="Reading file" log_id=0QP_DFTl000 engine=tsm1 service=cacheloader path=/var/lib/influxdb/wal/_internal/monitor/3/_00002.wal size=3546096
2020-11-11T07:47:25.369511572Z ts=2020-11-11T07:47:25.369259Z lvl=info msg="Opened shard" log_id=0QP_DFTl000 service=store trace_id=0QP_Dbql000 op_name=tsdb_open index_version=inmem path=/var/lib/influxdb/data/_internal/monitor/3 duration=1730.008ms
2020-11-11T07:47:25.370931692Z ts=2020-11-11T07:47:25.369339Z lvl=info msg="Open store (end)" log_id=0QP_DFTl000 service=store trace_id=0QP_Dbql000 op_name=tsdb_open op_event=end op_elapsed=2401.470ms
2020-11-11T07:47:25.370961793Z ts=2020-11-11T07:47:25.369371Z lvl=info msg="Opened service" log_id=0QP_DFTl000 service=subscriber
2020-11-11T07:47:25.370970193Z ts=2020-11-11T07:47:25.369402Z lvl=info msg="Starting monitor service" log_id=0QP_DFTl000 service=monitor
2020-11-11T07:47:25.370976893Z ts=2020-11-11T07:47:25.369411Z lvl=info msg="Registered diagnostics client" log_id=0QP_DFTl000 service=monitor name=build
2020-11-11T07:47:25.370998893Z ts=2020-11-11T07:47:25.369418Z lvl=info msg="Registered diagnostics client" log_id=0QP_DFTl000 service=monitor name=runtime
2020-11-11T07:47:25.371005294Z ts=2020-11-11T07:47:25.369423Z lvl=info msg="Registered diagnostics client" log_id=0QP_DFTl000 service=monitor name=network
2020-11-11T07:47:25.371011794Z ts=2020-11-11T07:47:25.369491Z lvl=info msg="Registered diagnostics client" log_id=0QP_DFTl000 service=monitor name=system
2020-11-11T07:47:25.371017894Z ts=2020-11-11T07:47:25.369520Z lvl=info msg="Starting precreation service" log_id=0QP_DFTl000 service=shard-precreation check_interval=10m advance_period=30m
2020-11-11T07:47:25.371024194Z ts=2020-11-11T07:47:25.369538Z lvl=info msg="Starting snapshot service" log_id=0QP_DFTl000 service=snapshot
2020-11-11T07:47:25.371030294Z ts=2020-11-11T07:47:25.369556Z lvl=info msg="Starting continuous query service" log_id=0QP_DFTl000 service=continuous_querier
2020-11-11T07:47:25.371036594Z ts=2020-11-11T07:47:25.369565Z lvl=info msg="Starting HTTP service" log_id=0QP_DFTl000 service=httpd authentication=true
2020-11-11T07:47:25.371042994Z ts=2020-11-11T07:47:25.369572Z lvl=info msg="opened HTTP access log" log_id=0QP_DFTl000 service=httpd path=stderr
2020-11-11T07:47:25.371050294Z ts=2020-11-11T07:47:25.370026Z lvl=info msg="Storing statistics" log_id=0QP_DFTl000 service=monitor db_instance=_internal db_rp=monitor interval=10s
2020-11-11T07:47:25.372493915Z ts=2020-11-11T07:47:25.370603Z lvl=info msg="Listening on HTTP" log_id=0QP_DFTl000 service=httpd addr=[::]:8086 https=false
2020-11-11T07:47:25.372521415Z ts=2020-11-11T07:47:25.370629Z lvl=info msg="Starting retention policy enforcement service" log_id=0QP_DFTl000 service=retention check_interval=30m
2020-11-11T07:47:25.372529015Z ts=2020-11-11T07:47:25.370702Z lvl=info msg="Listening for signals" log_id=0QP_DFTl000
2020-11-11T07:47:28.332056158Z ts=2020-11-11T07:47:28.331836Z lvl=info msg="Write failed" log_id=0QP_DFTl000 service=write shard=2 error="[shard 2] error opening file /var/lib/influxdb/data/olmero/autogen/2/000000001-000000001.tsm: open /var/lib/influxdb/data/olmero/autogen/2/000000001-000000001.tsm: no such file or directory"
2020-11-11T07:47:28.332589566Z [httpd] 195.141.126.44:36386, 10.1.5.5:32056,169.254.131.1 - writer [11/Nov/2020:07:47:28 +0000] "POST /write?db=olmero HTTP/1.1" 500 202 "-" "Telegraf/1.16.1 Go/1.15.2" 255d65f3-23f2-11eb-8001-466081c6d930 245399
2020-11-11T07:47:28.332608766Z ts=2020-11-11T07:47:28.332210Z lvl=error msg="[500] - \"[shard 2] error opening file /var/lib/influxdb/data/olmero/autogen/2/000000001-000000001.tsm: open /var/lib/influxdb/data/olmero/autogen/2/000000001-000000001.tsm: no such file or directory\"" log_id=0QP_DFTl000 service=httpd
2020-11-11T07:47:30.732024855Z ts=2020-11-11T07:47:30.731764Z lvl=info msg="Write failed" log_id=0QP_DFTl000 service=write shard=2 error="[shard 2] error opening file /var/lib/influxdb/data/olmero/autogen/2/000000001-000000001.tsm: open /var/lib/influxdb/data/olmero/autogen/2/000000001-000000001.tsm: no such file or directory"
2020-11-11T07:47:30.733035469Z [httpd] 195.141.126.61:34888, 10.1.5.7:17400,169.254.131.1 - writer [11/Nov/2020:07:47:30 +0000] "POST /write?db=olmero HTTP/1.1" 500 202 "-" "Telegraf/1.16.1 Go/1.15.2" 26d5672c-23f2-11eb-8002-466081c6d930 181010
2020-11-11T07:47:30.733059270Z ts=2020-11-11T07:47:30.732007Z lvl=error msg="[500] - \"[shard 2] error opening file /var/lib/influxdb/data/olmero/autogen/2/000000001-000000001.tsm: open /var/lib/influxdb/data/olmero/autogen/2/000000001-000000001.tsm: no such file or directory\"" log_id=0QP_DFTl000 service=httpd
2020-11-11T07:47:32.038808964Z [httpd] 195.141.126.44:50804, 10.1.5.5:31000,169.254.131.1 - - [11/Nov/2020:07:47:32 +0000] "POST /api/v2/write?bucket=olmero&org=OLMeRO HTTP/1.1" 401 55 "-" "Telegraf/1.16.0 Go/1.15.2" 27b85bd3-23f2-11eb-8004-466081c6d930 68
2020-11-11T07:47:32.040173483Z [httpd] 195.141.126.44:50686, 10.1.5.7:17346,169.254.131.1 - - [11/Nov/2020:07:47:32 +0000] "POST /api/v2/write?bucket=olmero&org=OLMeRO HTTP/1.1" 401 55 "-" "Telegraf/1.16.0 Go/1.15.2" 27b889a1-23f2-11eb-8007-466081c6d930 60
2020-11-11T07:47:32.209677184Z ts=2020-11-11T07:47:32.209269Z lvl=info msg="Write failed" log_id=0QP_DFTl000 service=write shard=2 error="[shard 2] error opening file /var/lib/influxdb/data/olmero/autogen/2/000000001-000000001.tsm: open /var/lib/influxdb/data/olmero/autogen/2/000000001-000000001.tsm: no such file or directory"

but after a while (a couple of minutes, going by the timestamps) it seems to magically recover somehow

2020-11-11T07:50:08.307345121Z ts=2020-11-11T07:50:08.306909Z lvl=info msg="Opened file" log_id=0QP_DFTl000 engine=tsm1 service=filestore path=/var/lib/influxdb/data/olmero/autogen/2/000000010-000000002.tsm id=0 duration=154.986ms
2020-11-11T07:50:08.395021860Z ts=2020-11-11T07:50:08.394767Z lvl=info msg="Opened file" log_id=0QP_DFTl000 engine=tsm1 service=filestore path=/var/lib/influxdb/data/olmero/autogen/2/000000018-000000002.tsm id=1 duration=236.617ms
2020-11-11T07:50:08.446682832Z ts=2020-11-11T07:50:08.443514Z lvl=info msg="Opened file" log_id=0QP_DFTl000 engine=tsm1 service=filestore path=/var/lib/influxdb/data/olmero/autogen/2/000000026-000000002.tsm id=2 duration=136.521ms
2020-11-11T07:50:08.557156467Z ts=2020-11-11T07:50:08.556777Z lvl=info msg="Opened file" log_id=0QP_DFTl000 engine=tsm1 service=filestore path=/var/lib/influxdb/data/olmero/autogen/2/000000028-000000001.tsm id=4 duration=161.922ms
2020-11-11T07:50:08.583519610Z ts=2020-11-11T07:50:08.583258Z lvl=info msg="Opened file" log_id=0QP_DFTl000 engine=tsm1 service=filestore path=/var/lib/influxdb/data/olmero/autogen/2/000000029-000000001.tsm id=5 duration=139.674ms
2020-11-11T07:50:08.675814109Z ts=2020-11-11T07:50:08.675435Z lvl=info msg="Opened file" log_id=0QP_DFTl000 engine=tsm1 service=filestore path=/var/lib/influxdb/data/olmero/autogen/2/000000027-000000001.tsm id=3 duration=118.588ms
2020-11-11T07:50:08.741955768Z ts=2020-11-11T07:50:08.741668Z lvl=info msg="Reading file" log_id=0QP_DFTl000 engine=tsm1 service=cacheloader path=/var/lib/influxdb/wal/olmero/autogen/2/_00097.wal size=4923302
2020-11-11T07:50:09.038688524Z [httpd] 10.1.5.7:17900,169.254.131.1 - - [11/Nov/2020:07:50:09 +0000] "GET /ping HTTP/1.1" 204 0 "-" "-" 854cac04-23f2-11eb-80aa-466081c6d930 75
2020-11-11T07:50:09.836118386Z [httpd] 195.141.126.44:36386, 10.1.5.5:32056,169.254.131.1 - writer [11/Nov/2020:07:50:08 +0000] "POST /write?db=olmero HTTP/1.1" 204 0 "-" "Telegraf/1.16.1 Go/1.15.2" 84b32cee-23f2-11eb-80a9-466081c6d930 1798780
2020-11-11T07:50:10.525020937Z [httpd] 195.141.126.61:34888, 10.1.5.7:17400,169.254.131.1 - writer [11/Nov/2020:07:50:10 +0000] "POST /write?db=olmero HTTP/1.1" 204 0 "-" "Telegraf/1.16.1 Go/1.15.2" 862ca87f-23f2-11eb-80ab-466081c6d930 18498

I can only guess, but to me it looks like InfluxDB is losing files on the storage during the restart of the container. Apparently 2.0 just crashes in this case, while 1.8 can somehow recover from it.

E.g. it claims "/var/lib/influxdb/wal/_internal/monitor/1/_00003.wal: no such file or directory". The folder /var/lib/influxdb/wal/_internal/monitor/1/ is just empty.

No idea why it loses files. Other containers on the same (Azure) server do not have any problems with data loss or corruption.

Anyway, from the logs it seems obvious that 1.8 is also missing some data/files. So it is probably not a problem specific to 2.0 - except that 2.0 is not able to recover from it, which makes the impact bigger than in 1.8.
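To double-check that files really disappear from the storage (and are not just temporarily unreadable), one could snapshot the file list before a forced restart and diff it afterwards. A rough sketch (paths as in the 1.8 setup above; the manifest is written onto the share itself so it survives the restart):

# Before the restart: record a manifest of all files on the mapped storage.
find /var/lib/influxdb -type f | sort > /var/lib/influxdb/manifest-before.txt

# ...force-restart the container via Azure...

# After the restart: compare the current state against the manifest.
find /var/lib/influxdb -type f | sort > /tmp/manifest-after.txt
diff /var/lib/influxdb/manifest-before.txt /tmp/manifest-after.txt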

bacomalex (Author) commented:

I guess I found the root cause of the problem. I think hosting influxd on Azure AppService is simply not possible.

The only option available for persistent storage on Azure AppService Linux (which we use to host the influxd instance) is Azure Files. As described e.g. in

https://social.msdn.microsoft.com/Forums/sqlserver/en-US/0e19f5f8-1767-41c9-b3cf-b6aaa770433d/azure-files-storage-posix-compliant?forum=windowsazuredata

Azure Files is not fully POSIX-compliant - especially when it comes to file-locking.
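A quick way to see whether advisory locking works at all on such a mount is a plain flock test. Just a sketch (the mount path is a placeholder, I have not run this on AppService):

# Shell 1: take an exclusive lock on a file on the share and hold it for a minute.
flock -x /mnt/azure-files/locktest -c 'sleep 60'

# Shell 2: try to take the same lock without blocking.
# With working locks this fails while shell 1 holds the lock;
# on storage without proper locking it may succeed immediately.
flock -xn /mnt/azure-files/locktest -c 'echo "got the lock"'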

There are workarounds that enable some database systems like SQLite to work on that storage anyway, see e.g.

https://github.com/MicrosoftDocs/azure-docs/issues/47130

That works well in our case (e.g. for Grafana). Still, it is unsupported by MS, see e.g.

https://social.msdn.microsoft.com/Forums/vstudio/en-US/9e650466-2e97-454e-9096-dfb3243bb1d6/linux-and-sqlite?forum=windowsazurewebsitespreview

and cannot solve every problem, as e.g. seen in

kubernetes/kubernetes#59755

In our case this has never been a problem so far, therefore I was not even aware of it.

I can only guess that influxd has a problem with the missing POSIX compliance of the storage and that this causes the issues. No idea how that leads to missing files, but other DBs also seem to have problems with data corruption in this setup.

Since 09/2020 Azure has a new POSIX-compliant storage offering for Azure Files (called Azure Files NFS). But that is still in preview and - at least for now - only mounts its volumes read-only into AppService.
