[BUG] Backup stopped working after upgrade to 1.3.2 #8677

Closed

skladd opened this issue Aug 9, 2017 · 30 comments

@skladd (Contributor) commented Aug 9, 2017

After yesterday's upgrade to version 1.3.2 (from 1.3.1 via Debian repo), backup fails with this error message:

influxd backup -database telegraf /tmp/influxbackup

2017/08/09 10:05:21 backing up db=telegraf since 0001-01-01 00:00:00 +0000 UTC
2017/08/09 10:05:21 backing up metastore to /tmp/influxbackup/meta.01
2017/08/09 10:05:21 backing up db=telegraf rp=autogen shard=169 to /tmp/influxbackup/telegraf.autogen.00169.00 since 0001-01-01 00:00:00 +0000 UTC
2017/08/09 10:05:21 Download shard 169 failed copy backup to file: err=<nil>, n=0.  Retrying (0)...
2017/08/09 10:05:22 Download shard 169 failed copy backup to file: err=<nil>, n=0.  Retrying (1)...
2017/08/09 10:05:23 Download shard 169 failed copy backup to file: err=<nil>, n=0.  Retrying (2)...
2017/08/09 10:05:24 Download shard 169 failed copy backup to file: err=<nil>, n=0.  Retrying (3)...
2017/08/09 10:05:25 Download shard 169 failed copy backup to file: err=<nil>, n=0.  Retrying (4)...
2017/08/09 10:05:26 Download shard 169 failed copy backup to file: err=<nil>, n=0.  Retrying (5)...
2017/08/09 10:05:27 Download shard 169 failed copy backup to file: err=<nil>, n=0.  Retrying (6)...
2017/08/09 10:05:28 Download shard 169 failed copy backup to file: err=<nil>, n=0.  Retrying (7)...
2017/08/09 10:05:29 Download shard 169 failed copy backup to file: err=<nil>, n=0.  Retrying (8)...
2017/08/09 10:05:30 Download shard 169 failed copy backup to file: err=<nil>, n=0.  Retrying (9)...
2017/08/09 10:05:31 backup failed: copy backup to file: err=<nil>, n=0
backup: copy backup to file: err=<nil>, n=0

Version: 1.3.2
OS: Debian jessie on x86_64
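
An editorial note on the shape of this log: the client retries the shard download up to 10 times, one second apart, and a nil error with zero bytes copied still counts as a failure. A schematic of that loop in Go (illustrative only; downloadShard and the constants here are stand-ins, not the actual backup client code):

package main

import (
	"fmt"
	"time"
)

// downloadShard is a placeholder for the real shard download; it reports
// the bytes copied and any transport error. Here it always returns an
// empty result, reproducing the err=<nil>, n=0 symptom.
func downloadShard(id int) (n int64, err error) { return 0, nil }

func main() {
	const maxRetries = 10
	id := 169
	for i := 0; i < maxRetries; i++ {
		n, err := downloadShard(id)
		if err == nil && n > 0 {
			fmt.Printf("shard %d backed up: %d bytes\n", id, n)
			return
		}
		// A nil error with zero bytes copied still counts as a failure.
		fmt.Printf("Download shard %d failed copy backup to file: err=%v, n=%d.  Retrying (%d)...\n", id, err, n, i)
		time.Sleep(time.Second)
	}
	fmt.Println("backup failed: copy backup to file: err=<nil>, n=0")
}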

@skladd (Contributor, Author) commented Aug 9, 2017

On another server, where InfluxDB was installed much later (and thus started from a more recent version), backup still works after the upgrade.

@skladd (Contributor, Author) commented Aug 10, 2017

I was able to perform the backup successfully after downgrading to 1.3.1-1 (from a .deb file luckily still in the apt cache) and manually restarting the influxdb daemon.

The failing influxdb instance was originally installed back in 2016, version 1.0.0:

Start-Date: 2016-09-29  16:17:27
Commandline: apt-get install influxdb
Install: influxdb:amd64 (1.0.0-1)

@duvidcz commented Aug 10, 2017

I have the same problem with the Windows version 1.3.2-1. I can see the following error:

[I] 2017-08-10T10:29:06Z Snapshot for path \var\lib\influxdb\data\DB1\autogen\3 written in 3.0002ms engine=tsm1
[I] 2017-08-10T10:29:06Z snapshots disabled service=snapshot
[I] 2017-08-10T10:29:08Z error writing snapshot from compactor: snapshots disabled engine=tsm1

@codylewandowski commented Aug 10, 2017

We have ~10 Influx instances that we just upgraded to 1.3.2-1 today, and we are now seeing this same issue.

@marcofl commented Aug 11, 2017

same issue here:

2017/08/11 15:42:54 backing up db=icinga2 since 2017-08-10 13:42:54 +0000 UTC
2017/08/11 15:42:54 backing up metastore to /srv/storage/influxdb-prebackup/tmp.o0B6CT8vC4/meta.00
2017/08/11 15:42:54 backing up db=icinga2 rp=6weeks shard=471 to /srv/storage/influxdb-prebackup/tmp.o0B6CT8vC4/icinga2.6weeks.00471.00 since 2017-08-10 13:42:54 +0000 UTC
2017/08/11 15:42:54 Download shard 471 failed copy backup to file: err=<nil>, n=0.  Retrying (0)...
2017/08/11 15:42:55 Download shard 471 failed copy backup to file: err=<nil>, n=0.  Retrying (1)...
2017/08/11 15:42:56 Download shard 471 failed copy backup to file: err=<nil>, n=0.  Retrying (2)...
2017/08/11 15:42:57 Download shard 471 failed copy backup to file: err=<nil>, n=0.  Retrying (3)...
2017/08/11 15:42:58 Download shard 471 failed copy backup to file: err=<nil>, n=0.  Retrying (4)...
2017/08/11 15:42:59 Download shard 471 failed copy backup to file: err=<nil>, n=0.  Retrying (5)...
2017/08/11 15:43:00 Download shard 471 failed copy backup to file: err=<nil>, n=0.  Retrying (6)...
2017/08/11 15:43:01 Download shard 471 failed copy backup to file: err=<nil>, n=0.  Retrying (7)...
2017/08/11 15:43:02 Download shard 471 failed copy backup to file: err=<nil>, n=0.  Retrying (8)...
2017/08/11 15:43:03 Download shard 471 failed copy backup to file: err=<nil>, n=0.  Retrying (9)...
2017/08/11 15:43:04 backup failed: copy backup to file: err=<nil>, n=0
backup: copy backup to file: err=<nil>, n=0

Downgrading to 1.3.1 solved the issue.

It looks like the fix from https://github.com/influxdata/influxdb/pull/8378/commits somehow got reverted.

dgnorton self-assigned this Aug 15, 2017
dgnorton added a commit that referenced this issue Aug 16, 2017
dgnorton added a commit that referenced this issue Aug 16, 2017
[backport] fix #8677: check for snapshot size == 0
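
For context, the commit title above points at the root cause: the backup endpoint could serve a shard snapshot whose reported size was zero, which is exactly why clients saw err=<nil>, n=0. A minimal sketch of that guard in Go (the type and function names here are illustrative, not InfluxDB's actual code):

package main

import (
	"errors"
	"fmt"
)

// Snapshot is a stand-in for the engine's shard snapshot handle.
type Snapshot struct {
	Size int64 // total bytes the snapshot would stream
}

// ErrEmptySnapshot tells the caller to retry rather than write a
// zero-byte backup file.
var ErrEmptySnapshot = errors.New("snapshot size is 0")

// validateSnapshot mirrors the idea in "check for snapshot size == 0":
// refuse to serve a snapshot that would produce an empty backup stream.
func validateSnapshot(s *Snapshot) error {
	if s == nil || s.Size == 0 {
		return ErrEmptySnapshot
	}
	return nil
}

func main() {
	if err := validateSnapshot(&Snapshot{Size: 0}); err != nil {
		fmt.Println("backup rejected:", err) // the client can then retry
	}
}
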
@skladd (Contributor, Author) commented Aug 18, 2017

Works for me again as of the latest nightly: 1.4.0~n201708170800-0.
Thank you!

@skladd (Contributor, Author) commented Aug 21, 2017

Apparently the patch did not make it into 1.3.3.

@davidbru commented

I can confirm that it's also not working for me in 1.3.3.

@dgnorton (Contributor) commented

It will be in 1.3.4.

@ayush-sharma commented

This is broken in 1.3.3.

The metastore backup is fine, but database backups produce errors. I can't get it to work on Mac or Ubuntu. I can also confirm that the backup directories have correct permissions, so that doesn't seem to be the cause.

2017/08/23 20:24:14 backing up db=test since 0001-01-01 00:00:00 +0000 UTC
2017/08/23 20:24:14 backing up metastore to /tmp/JIuZnHpa/test/meta.00
2017/08/23 20:24:14 backing up db=test rp=autogen shard=2 to /tmp/JIuZnHpa/test/test.autogen.00002.00 since 0001-01-01 00:00:00 +0000 UTC
2017/08/23 20:24:14 Download shard 2 failed copy backup to file: err=<nil>, n=0.  Retrying (0)...
2017/08/23 20:24:15 Download shard 2 failed copy backup to file: err=<nil>, n=0.  Retrying (1)...
2017/08/23 20:24:16 Download shard 2 failed copy backup to file: err=<nil>, n=0.  Retrying (2)...
2017/08/23 20:24:17 Download shard 2 failed copy backup to file: err=<nil>, n=0.  Retrying (3)...
2017/08/23 20:24:18 Download shard 2 failed copy backup to file: err=<nil>, n=0.  Retrying (4)...
2017/08/23 20:24:19 Download shard 2 failed copy backup to file: err=<nil>, n=0.  Retrying (5)...
2017/08/23 20:24:20 Download shard 2 failed copy backup to file: err=<nil>, n=0.  Retrying (6)...
2017/08/23 20:24:21 Download shard 2 failed copy backup to file: err=<nil>, n=0.  Retrying (7)...
2017/08/23 20:24:22 Download shard 2 failed copy backup to file: err=<nil>, n=0.  Retrying (8)...
2017/08/23 20:24:23 Download shard 2 failed copy backup to file: err=<nil>, n=0.  Retrying (9)...
2017/08/23 20:24:24 backup failed: copy backup to file: err=<nil>, n=0
backup: copy backup to file: err=<nil>, n=0

@skladd (Contributor, Author) commented Aug 24, 2017

1.3.4 is out, backup works again. Thanks!

@jgysel commented Sep 8, 2017

I still observe it on 1.3.5, although less often. Shard 1784 is the one that currently gets many updates:

# /opt/influxdb/bin/influxd backup -host influxdb-xxx-1:7750 -since 1970-01-01T00:00:00Z -database input /shared/backup/influxdb/93931/2017-09-08_0205/input
2017/09/08 02:05:02 backing up metastore to /shared/backup/influxdb/93931/2017-09-08_0205/input/meta.00
2017/09/08 02:05:02 backing up db=input rp=default shard=1764 to /shared/backup/influxdb/93931/2017-09-08_0205/input/input.default.01764.00 since 1970-01-01 00:00:00 +0000 UTC
2017/09/08 02:05:22 backing up db=input rp=default shard=1769 to /shared/backup/influxdb/93931/2017-09-08_0205/input/input.default.01769.00 since 1970-01-01 00:00:00 +0000 UTC
2017/09/08 02:05:46 backing up db=input rp=default shard=1774 to /shared/backup/influxdb/93931/2017-09-08_0205/input/input.default.01774.00 since 1970-01-01 00:00:00 +0000 UTC
2017/09/08 02:06:34 backing up db=input rp=default shard=1779 to /shared/backup/influxdb/93931/2017-09-08_0205/input/input.default.01779.00 since 1970-01-01 00:00:00 +0000 UTC
2017/09/08 02:07:29 backing up db=input rp=default shard=1784 to /shared/backup/influxdb/93931/2017-09-08_0205/input/input.default.01784.00 since 1970-01-01 00:00:00 +0000 UTC
2017/09/08 02:07:29 Download shard 1784 failed copy backup to file: err=<nil>, n=0.  Retrying (0)...
2017/09/08 02:07:30 Download shard 1784 failed copy backup to file: err=<nil>, n=0.  Retrying (1)...
2017/09/08 02:07:31 Download shard 1784 failed copy backup to file: err=<nil>, n=0.  Retrying (2)...
2017/09/08 02:07:32 Download shard 1784 failed copy backup to file: err=<nil>, n=0.  Retrying (3)...
2017/09/08 02:07:33 Download shard 1784 failed copy backup to file: err=<nil>, n=0.  Retrying (4)...
2017/09/08 02:07:34 Download shard 1784 failed copy backup to file: err=<nil>, n=0.  Retrying (5)...
2017/09/08 02:07:35 Download shard 1784 failed copy backup to file: err=<nil>, n=0.  Retrying (6)...
2017/09/08 02:07:36 Download shard 1784 failed copy backup to file: err=<nil>, n=0.  Retrying (7)...
2017/09/08 02:07:37 Download shard 1784 failed copy backup to file: err=<nil>, n=0.  Retrying (8)...
2017/09/08 02:07:38 Download shard 1784 failed copy backup to file: err=<nil>, n=0.  Retrying (9)...
2017/09/08 02:07:39 backup failed: copy backup to file: err=<nil>, n=0
backup: copy backup to file: err=<nil>, n=0

@jigarshahindia commented Jan 9, 2018

influx --version
InfluxDB shell version: 1.4.2

Same issue

influxd backup -database test /tmp/backup

2018/01/09 08:27:20 backing up metastore to /tmp/backup/meta.02
2018/01/09 08:27:20 backing up db=test rp=autogen shard=13 to /tmp/backup/test.autogen.00013.01 since 0001-01-01 00:00:00 +0000 UTC
2018/01/09 08:27:20 backing up db=test rp=autogen shard=21 to /tmp/backup/test.autogen.00021.01 since 0001-01-01 00:00:00 +0000 UTC
2018/01/09 08:27:20 backing up db=test rp=autogen shard=29 to /tmp/backup/test.autogen.00029.01 since 0001-01-01 00:00:00 +0000 UTC
2018/01/09 08:27:20 backing up db=test rp=autogen shard=37 to /tmp/backup/test.autogen.00037.01 since 0001-01-01 00:00:00 +0000 UTC
2018/01/09 08:27:20 backing up db=test rp=autogen shard=45 to /tmp/backup/test.autogen.00045.01 since 0001-01-01 00:00:00 +0000 UTC
2018/01/09 08:27:20 backing up db=test rp=autogen shard=53 to /tmp/backup/test.autogen.00053.01 since 0001-01-01 00:00:00 +0000 UTC
2018/01/09 08:27:21 backing up db=test rp=autogen shard=61 to /tmp/backup/test.autogen.00061.00 since 0001-01-01 00:00:00 +0000 UTC
2018/01/09 08:27:21 Download shard 61 failed copy backup to file: err=<nil>, n=0.  Retrying (0)...
2018/01/09 08:27:22 Download shard 61 failed copy backup to file: err=<nil>, n=0.  Retrying (1)...
2018/01/09 08:27:23 Download shard 61 failed copy backup to file: err=<nil>, n=0.  Retrying (2)...
2018/01/09 08:27:24 Download shard 61 failed copy backup to file: err=<nil>, n=0.  Retrying (3)...
2018/01/09 08:27:25 Download shard 61 failed copy backup to file: err=<nil>, n=0.  Retrying (4)...
2018/01/09 08:27:26 Download shard 61 failed copy backup to file: err=<nil>, n=0.  Retrying (5)...
2018/01/09 08:27:27 Download shard 61 failed copy backup to file: err=<nil>, n=0.  Retrying (6)...
2018/01/09 08:27:28 Download shard 61 failed copy backup to file: err=<nil>, n=0.  Retrying (7)...
2018/01/09 08:27:29 Download shard 61 failed copy backup to file: err=<nil>, n=0.  Retrying (8)...
2018/01/09 08:27:30 Download shard 61 failed copy backup to file: err=<nil>, n=0.  Retrying (9)...
2018/01/09 08:27:31 backup failed: copy backup to file: err=<nil>, n=0
backup: copy backup to file: err=<nil>, n=0

@benceszikora commented

I am getting the same on 1.5.0. Are there any updates on this?

2018/03/23 05:01:23 Download shard 27 failed copy backup to file: err=<nil>, n=0.  Retrying (0)...                                                                        
2018/03/23 05:01:24 Download shard 27 failed copy backup to file: err=<nil>, n=0.  Retrying (1)...                                                                        
2018/03/23 05:01:25 Download shard 27 failed copy backup to file: err=<nil>, n=0.  Retrying (2)...                                                                        
2018/03/23 05:01:26 Download shard 27 failed copy backup to file: err=<nil>, n=0.  Retrying (3)...                                                                        
2018/03/23 05:01:27 Download shard 27 failed copy backup to file: err=<nil>, n=0.  Retrying (4)...                                                                        
2018/03/23 05:01:28 Download shard 27 failed copy backup to file: err=<nil>, n=0.  Retrying (5)...                                                                        
2018/03/23 05:01:29 Download shard 27 failed copy backup to file: err=<nil>, n=0.  Retrying (6)...                                                                        
2018/03/23 05:01:30 Download shard 27 failed copy backup to file: err=<nil>, n=0.  Retrying (7)...                                                                        
2018/03/23 05:01:31 Download shard 27 failed copy backup to file: err=<nil>, n=0.  Retrying (8)...                                                                        
2018/03/23 05:01:32 Download shard 27 failed copy backup to file: err=<nil>, n=0.  Retrying (9)...                                                                        
2018/03/23 05:01:33 backup failed: copy backup to file: err=<nil>, n=0               
backup: copy backup to file: err=<nil>, n=0                                          

@rbetts (Contributor) commented Mar 26, 2018

@dgnorton Can you work on reproducing this, David?

@dgnorton (Contributor) commented

@iliketosneeze this looks like #9618. In your case, is the influxd binary 1.5.0 and the running instance (the source) an older version of influxd?

@benceszikora commented

@dgnorton We are using 1.5.0 for both the running instance and the backup.

@dgnorton (Contributor) commented

@iliketosneeze is TSI enabled?

@amerenda commented

@dgnorton I'm having this issue on influxd-1.5.0-1 with TSI enabled.

@benceszikora commented

@dgnorton Yes, we have TSI enabled as well.

@dgnorton (Contributor) commented

I have been able to reproduce this. In the repro, it happens fairly often but not every time.

Setup using AWS

  • One m4.10xlarge instance with 100GB gp2 storage
    • ssh into InfluxDB instance and update config to use tsi1 index
      • sudo systemctl stop influxdb.service
      • sudo sed -i.bak 's/# bind-address = "127.0.0.1:8088"/bind-address = ":8088"/' /etc/influxdb/influxdb.conf
      • sudo sed -i.bak 's/# index-version = "inmem"/index-version = "tsi1"/' /etc/influxdb/influxdb.conf
      • sudo systemctl start influxdb.service
  • One m4.10xlarge instance running the inch tool to generate high cardinality load.
  • Once inch is generating load, ssh into the InfluxDB instance and start creating backups.
    • rm bak/* && influxd backup -database stress bak/

dgnorton added this to the 1.5.2 milestone Mar 28, 2018
@aanthony1243 (Contributor) commented

@dgnorton the above repro seems to occur if another snapshot is being taken at the same time, as in the case of periodic compactions. I've seen it fail out after 10 attempts; I've also seen it fail 5 or 6 times and then succeed. We could consider this a resource/locking error, but regardless, this scenario doesn't seem to be a blocker.

@iliketosneeze does the problem happen consistently (like 20/50/100% of the time), or more sporadically? Could you share the influxd logs from the time of the backup?

@benceszikora commented

@aanthony1243 It's not 100%, but it happens very often. It also seems to depend on the size of the db/shards; smaller ones don't seem to do it as often.
I have uploaded the logs from a time when the backups were failing: https://gist.github.com/iliketosneeze/bc37b54f03219bba69266779185c9f61

@aanthony1243 (Contributor) commented

@iliketosneeze I don't see any of the errors in your gist that I saw when reproducing above, but the symptoms are still there. It happens more frequently on larger DBs/shards because there is resource competition when taking shard snapshots; if another process holds the resource for too long, the backup will fail. We're looking into the root cause now.
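
To make the contention described above concrete: if a backup snapshot arrives while another process (such as a compaction) holds the shared resource, the request comes back empty, which the client reports as err=<nil>, n=0 and retries. A toy Go sketch of that interaction (an illustration of the failure mode only, not InfluxDB's actual locking):

package main

import (
	"fmt"
	"sync"
	"time"
)

// engine is a toy stand-in for a storage engine where backup snapshots
// and compaction snapshots compete for the same resource.
type engine struct {
	mu sync.Mutex // guards snapshot creation
}

// trySnapshot returns the snapshot size in bytes, or 0 if the resource
// is busy -- the empty result the backup client logs as err=<nil>, n=0.
func (e *engine) trySnapshot() int64 {
	if !e.mu.TryLock() {
		return 0
	}
	defer e.mu.Unlock()
	return 4096 // pretend we wrote a snapshot
}

func main() {
	e := &engine{}
	go func() { // simulated compaction holding the resource for a while
		e.mu.Lock()
		time.Sleep(50 * time.Millisecond)
		e.mu.Unlock()
	}()
	time.Sleep(time.Millisecond) // let the "compaction" grab the lock first
	for i := 0; i < 10; i++ {
		if n := e.trySnapshot(); n > 0 {
			fmt.Printf("snapshot succeeded on attempt %d: %d bytes\n", i, n)
			return
		}
		fmt.Printf("snapshot busy, retrying (%d)...\n", i)
		time.Sleep(20 * time.Millisecond)
	}
}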

@aanthony1243 (Contributor) commented

@iliketosneeze we've adjusted the backup to use exponential backoff, giving the server more time to free resources. It's out in the just-released 1.5.2. If you continue to see frequent occurrences of this after upgrading, please open a new issue and we will follow up.
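
For reference, a minimal Go sketch of the kind of exponential backoff described here, with jitter and a 2-second floor. The constants are guesses chosen to resemble the "Waiting 2s ... Waiting 43.477s" progression visible in the 1.5.3 log later in this thread, not the values InfluxDB actually uses:

package main

import (
	"fmt"
	"math/rand"
	"time"
)

// backoffDelay returns the wait before retry attempt i: exponential
// growth with jitter, floored at 2s.
func backoffDelay(i int) time.Duration {
	d := 50 * time.Millisecond * time.Duration(1<<uint(i)) // 2^i growth
	d += time.Duration(rand.Int63n(int64(d) + 1))          // up to 100% jitter
	if d < 2*time.Second {
		d = 2 * time.Second // floor: early retries all wait "2s"
	}
	return d
}

func main() {
	for i := 0; i < 10; i++ {
		fmt.Printf("retry %d: waiting %v\n", i, backoffDelay(i))
	}
}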

@sjlongland commented

I hate to comment on a closed bug, but it appears this is still a problem in later versions:

root@863b34bc159b:/# influxd version
InfluxDB v1.5.3 (git: 1.5 89e084a80fb1e0bf5e7d38038e3367f821fdf3d7)
root@863b34bc159b:/# influxd backup -portable -database historian /tmp/fullbackup/data
2018/09/27 05:17:02 backing up metastore to /tmp/fullbackup/data/meta.00
2018/09/27 05:17:02 backing up db=historian
2018/09/27 05:17:02 backing up db=historian rp=autogen shard=2 to /tmp/fullbackup/data/historian.autogen.00002.00 since 0001-01-01T00:00:00Z
2018/09/27 05:17:02 Download shard 2 failed copy backup to file: err=<nil>, n=0.  Waiting 2s and retrying (0)...
2018/09/27 05:17:04 Download shard 2 failed copy backup to file: err=<nil>, n=0.  Waiting 2s and retrying (1)...
2018/09/27 05:17:06 Download shard 2 failed copy backup to file: err=<nil>, n=0.  Waiting 2s and retrying (2)...
2018/09/27 05:17:08 Download shard 2 failed copy backup to file: err=<nil>, n=0.  Waiting 2s and retrying (3)...
2018/09/27 05:17:10 Download shard 2 failed copy backup to file: err=<nil>, n=0.  Waiting 2s and retrying (4)...
2018/09/27 05:17:12 Download shard 2 failed copy backup to file: err=<nil>, n=0.  Waiting 2s and retrying (5)...
2018/09/27 05:17:14 Download shard 2 failed copy backup to file: err=<nil>, n=0.  Waiting 3.01s and retrying (6)...
2018/09/27 05:17:17 Download shard 2 failed copy backup to file: err=<nil>, n=0.  Waiting 11.441s and retrying (7)...
2018/09/27 05:17:28 Download shard 2 failed copy backup to file: err=<nil>, n=0.  Waiting 43.477s and retrying (8)...
2018/09/27 05:18:12 Download shard 2 failed copy backup to file: err=<nil>, n=0.  Waiting 2m45.216s and retrying (9)...
2018/09/27 05:20:57 backup failed: copy backup to file: err=<nil>, n=0
backup: copy backup to file: err=<nil>, n=0

We need the backup because we have to migrate things around in the Docker container InfluxDB is installed in. Long story short: due to a typo in docker-compose.yml, our volume is mounted at one directory while InfluxDB is configured to use another directory entirely, so running docker rm ${CONTAINER} will blow away the data. We need to back it up so we can fix docker-compose.yml and get InfluxDB storing its data in that volume as intended.

So far, I have one instance running 1.3.2 and one running 1.5.3, and both exhibit the same problem: the backup fails because it can't copy files, for no apparent reason, if I understand err=<nil>, n=0 correctly.

What's the safest way to back up these instances if influxd backup fails?

@aanthony1243 (Contributor) commented

It sounds like you are prepared to tolerate some downtime. It should be safe to stop influxd, move the entire influxdb directory to a new location, and then update your InfluxDB config to use that new location.

@sjlongland commented

This is true, and that is safe enough, but what about backups? We'll ultimately want to bump these up to the latest release before long (there are a lot of features in newer InfluxDB), but doing so without a good backup is a risky proposition.

In the meantime, we need to be able to back up that instance without disruption. We do these backups daily; restarting InfluxDB every 24 hours will likely not be welcome. How do we safely back up the data without shutting down InfluxDB?

@aanthony1243 (Contributor) commented

I misunderstood your issue. You can check the InfluxDB logs on the server side to see whether an error is logged when nothing is returned on the connection. If you find something, perhaps we should open a new issue and continue from there?

@samvruggink commented

I have the same issue. I found that whenever I mount my volume under /var/lib/influxdb/, the backup fails, and when I restart my container it can no longer read the data.

When I mount the volume under /var/lib/backup instead (where backup is my own folder), the backup works. I have no clue how to fix this. I'm using Microsoft Azure with container instances and File Shares.
