
Backup issue when continuously inserting rows. #256

Closed
goelsatyam opened this issue Mar 15, 2019 · 15 comments · Fixed by #260


goelsatyam commented Mar 15, 2019

While running insert queries against the MySQL cluster, I tried to take a backup in parallel. For 2 GB of data the backup size was 43 MB, and for 200 GB of data the backup size was 33 MB. When I tried to restore the cluster from the backup file, the pod got stuck with STATUS: Init:1/2.


AMecea commented Mar 15, 2019

Hi @goelsatyam, thanks for reporting this!

Can you please give me the logs of init container by running:

kubectl logs <restore-cluster-name>-mysql-0 -c clone-mysql

Also, the logs of the xtrabackup tool from the pod the backup was taken from will be useful:

kubectl logs <original-cluster-name>-mysql-<slave index> -c sidecar

Note: make sure to strip any sensitive data from the logs if needed.

Thanks in advance!


goelsatyam commented Mar 15, 2019

Logs of the init container:

Create rclone.conf file.
Create google-credentials.json file.
2019-03-15T09:34:48.604Z	INFO	sidecar	environment is not set	{"key": "MYSQL_REPLICATION_USER"}
2019-03-15T09:34:48.604Z	INFO	sidecar	environment is not set	{"key": "MYSQL_REPLICATION_PASSWORD"}
2019-03-15T09:34:48.604Z	INFO	sidecar	environment is not set	{"key": "MYSQL_METRICS_EXPORTER_USER"}
2019-03-15T09:34:48.604Z	INFO	sidecar	environment is not set	{"key": "MYSQL_METRICS_EXPORTER_PASSWORD"}
2019-03-15T09:34:48.604Z	INFO	sidecar	environment is not set	{"key": "MYSQL_ORC_TOPOLOGY_USER"}
2019-03-15T09:34:48.604Z	INFO	sidecar	environment is not set	{"key": "MYSQL_ORC_TOPOLOGY_PASSWORD"}
2019-03-15T09:34:48.604Z	INFO	sidecar	clonning command	{"host": "cluster-test-b-2-mysql-0"}
2019-03-15T09:34:48.614Z	WARN	sidecar	failed to obtain master from orchestrator, go for default master	{"master": "cluster-test-b-2-mysql-0.cluster-test-b-2-mysql-nodes.default"}
2019-03-15T09:34:48.614Z	INFO	sidecar	cloning from bucket	{"bucket": "s3:dataplatform-sql-backup/backup/cluster-test-2-2019-03-15T09:29:32.xbackup.gz"}
encryption: using gcrypt 1.7.6-beta
2019/03/15 09:34:48 DEBUG : rclone: Version "v1.46" starting with parameters ["rclone" "-vv" "--config=/etc/rclone.conf" "cat" "s3:dataplatform-sql-backup/backup/cluster-test-2-2019-03-15T09:29:32.xbackup.gz"]
2019/03/15 09:34:48 DEBUG : Using config file from "/etc/rclone.conf"
2019/03/15 09:34:49 DEBUG : pacer: Reducing sleep to 0s
xbstream: Can't create/write to file './xtrabackup_logfile' (Errcode: 17 - File exists)
xbstream: failed to create file.


AMecea commented Mar 15, 2019

I think those logs are not from the first failure; they are from a later attempt, after an earlier failure left some files behind, so they don't reflect the real issue. Maybe the operator should run a cleanup before doing a backup restore (but that is a different issue).

@goelsatyam

I think the problem is in taking the backup. When I took the backup without running any queries in parallel, the backup size was 0.35 GB for 2 GB of data and 36 GB for 200 GB of data, and in both cases I was able to successfully restore the cluster from the backup file.


AMecea commented Mar 15, 2019

That may be a problem with xtrabackup. I will try to reproduce it.

Do you still have the logs of the container the backup was taken from, so I can examine them?

@goelsatyam

Logs of the container where the backup was taken:

Create rclone.conf file.
Create google-credentials.json file.
2019-03-15T11:49:01.566Z INFO sidecar environment is not set {"key": "MY_CLUSTER_NAME"}
2019-03-15T11:49:01.566Z INFO sidecar environment is not set {"key": "MY_NAMESPACE"}
2019-03-15T11:49:01.566Z INFO sidecar environment is not set {"key": "MY_SERVICE_NAME"}
2019-03-15T11:49:01.566Z INFO sidecar environment is not set {"key": "INIT_BUCKET_URI"}
2019-03-15T11:49:01.566Z INFO sidecar environment is not set {"key": "ORCHESTRATOR_URI"}
2019-03-15T11:49:01.566Z INFO sidecar environment is not set {"key": "MYSQL_REPLICATION_USER"}
2019-03-15T11:49:01.566Z INFO sidecar environment is not set {"key": "MYSQL_REPLICATION_PASSWORD"}
2019-03-15T11:49:01.566Z INFO sidecar environment is not set {"key": "MYSQL_METRICS_EXPORTER_USER"}
2019-03-15T11:49:01.566Z INFO sidecar environment is not set {"key": "MYSQL_METRICS_EXPORTER_PASSWORD"}
2019-03-15T11:49:01.566Z INFO sidecar environment is not set {"key": "MYSQL_ORC_TOPOLOGY_USER"}
2019-03-15T11:49:01.566Z INFO sidecar environment is not set {"key": "MYSQL_ORC_TOPOLOGY_PASSWORD"}
2019-03-15T11:49:01.566Z INFO sidecar take a backup {"host": "cluster-test-2-mysql-1.cluster-test-2-mysql-nodes.default", "bucket": "s3://bucket-url/cluster-test-2-2019-03-15T11:48:50.xbackup.gz"}
2019-03-15T11:49:01.566Z INFO sidecar initialize a backup {"host": "cluster-test-2-mysql-1.cluster-test-2-mysql-nodes.default", "endpoint": "/xbackup"}

AMecea added this to the 0.2.x milestone Mar 18, 2019
@goelsatyam

@AMecea The table in which I was inserting the rows didn't have a primary key. Can this cause the problem?

@goelsatyam

@AMecea I'm getting the same error when I try it with a table that has a primary key.


AMecea commented Mar 18, 2019

Do you have a script to reproduce this? I need to reproduce it so I can inspect the logs closely and tweak some parameters to find a solution. I also want to write an e2e test for this case.


goelsatyam commented Mar 19, 2019

You can use this Python script:

import time
import sys
import mysql.connector

def insert():
    # Build a batch of 100,000 rows and insert them in a single transaction.
    sql = "INSERT INTO user (firstname, lastname, address) VALUES (%s, %s, %s)"
    val = []
    for i in range(100000):
        val.append(("l_" + str(i), "f_" + str(i), "a_" + str(i)))

    mydb = mysql.connector.connect(
        host="master_cluster_IP",
        user="root",
        passwd="root_password",
        database="test_data"
    )
    mycursor = mydb.cursor()
    mycursor.executemany(sql, val)
    mydb.commit()
    mydb.close()

total_time = time.time()
for i in range(int(sys.argv[1])):
    start_time = time.time()
    insert()
    print("time taken for " + str(i * 100000) + "-" + str((i + 1) * 100000) + ": " + str(time.time() - start_time))
print("total time: " + str(time.time() - total_time))

Run the script with: python app.py 100

I ran this script from two different shells at the same time.
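
For reference, the script assumes a test_data.user table already exists. The exact schema wasn't posted, so the column types (and the optional primary key; per the comments above the error reproduces with or without one) below are assumptions. A minimal sketch to create it with the same connector:

import mysql.connector

# Connection parameters are the same placeholders used in the script above.
mydb = mysql.connector.connect(
    host="master_cluster_IP",
    user="root",
    passwd="root_password",
    database="test_data"
)
mycursor = mydb.cursor()
mycursor.execute(
    "CREATE TABLE IF NOT EXISTS user ("
    "  id INT AUTO_INCREMENT PRIMARY KEY,"   # optional; the issue reproduces without it too
    "  firstname VARCHAR(64),"
    "  lastname VARCHAR(64),"
    "  address VARCHAR(64)"
    ")"
)
mydb.close()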

calind modified the milestones: 0.2.x, 0.2.7 Mar 19, 2019

AMecea commented Mar 19, 2019

Thanks for the script; I ran it and was able to reproduce the error.

The logs that I was looking for are those:

xtrabackup: error: log block numbers mismatch:
xtrabackup: error: expected log block no. 567753, but got no. 1092033 from the log file.
xtrabackup: error: it looks like InnoDB log has wrapped around before xtrabackup could process all records due to either log copying being too slow, or  log files being too small.
xtrabackup: Error: xtrabackup_copy_logfile() failed.
2019-03-19T14:27:18.199Z	ERROR	sidecar	failed waiting for xtrabackup to finish	{"error": "exit status 1"}

Because there are too many writes and the InnoDB log files are too small, the log data is overwritten before xtrabackup finishes copying it. What you can do is increase innodb_log_file_size (by default it is computed from the requested memory, which defaults to 1G).
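
To see what value the cluster is currently running with, here is a quick sketch reusing the connector and placeholder credentials from the reproduction script above (raising the value itself is done through the cluster's MySQL configuration; see the operator docs for where custom my.cnf settings go):

import mysql.connector

mydb = mysql.connector.connect(
    host="master_cluster_IP",
    user="root",
    passwd="root_password"
)
mycursor = mydb.cursor()
# innodb_log_file_size is reported in bytes
mycursor.execute("SHOW VARIABLES LIKE 'innodb_log_file_size'")
name, value = mycursor.fetchone()
print(name + " = " + str(int(value) // (1024 * 1024)) + " MB")
mydb.close()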

This is a valid bug: the backup should fail, but instead it succeeds.

AMecea added the bug label Mar 19, 2019
@goelsatyam

@AMecea Where can I see the xtrabackup logs?


AMecea commented Mar 20, 2019

The xtrabackup logs are in the sidecar container of the node the backup was taken from; you can see them by running:

kubectl logs <original-cluster-name>-mysql-0 -c sidecar

@goelsatyam

If I get an unexpectedly high number of insert requests and innodb_log_file_size is too small for the load the server is getting, then my backup will fail. Can we lock the slave pod so that we can at least take a backup of the data already present at that time?


AMecea commented Mar 21, 2019

Yes, we can do that: we can issue a STOP SLAVE SQL_THREAD query before doing a backup. This logic could live behind a feature flag on the MysqlCluster resource (like a backupMode field). But this is a different issue, so please open a new one. I will be happy to review and merge a PR that does that 😄
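
To illustrate, this is roughly what the suggested pause would look like at the SQL level, sketched with mysql.connector against the replica being backed up (host and credentials are placeholders; the operator would run the equivalent statements internally):

import mysql.connector

replica = mysql.connector.connect(
    host="replica_pod_IP",      # placeholder: the slave the backup is taken from
    user="root",
    passwd="root_password"
)
cursor = replica.cursor()

# Pause the SQL thread so no new writes are applied while the backup runs;
# the IO thread keeps pulling the relay log, so no events are lost.
cursor.execute("STOP SLAVE SQL_THREAD")
try:
    pass  # trigger / wait for the backup here
finally:
    cursor.execute("START SLAVE SQL_THREAD")
replica.close()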

PR #260 will fix the issue of corrupt backups being reported with a success status.
