
Restore from S3 compatible API? #69

Closed
morgo opened this issue Sep 14, 2018 · 15 comments · Fixed by #361
Labels: difficulty/2-medium (Medium-difficulty issue), feature/accepted, priority/P1 (High priority issue, must be solved before next release)


morgo commented Sep 14, 2018

A feature request for your roadmap:

Would it be possible to restore directly from a mydumper backup stored in S3? In most cloud deployments this is where user backups will be stored (the S3 API is implemented by many other object stores).


Value

Value description

Support restore to TiDB via S3.

Value score

  • (TBD) / 5

Workload estimation

  • (TBD)

Time

GanttStart: 2020-07-27
GanttDue: 2020-09-04
GanttProgress: 100%

kennytm (Collaborator) commented Sep 14, 2018

(Now also tracked in Jira as TOOL-362)

We plan to support importing from Zip, FTP, and maybe HDFS in v1.1 by the end of 2018. Adding S3 API support is not hard if the data source can be abstracted as a VFS or something similar.
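
For illustration, here is a minimal Go sketch of what such a VFS-style abstraction could look like. The `ExternalStorage` name and its methods are assumptions for this sketch, not Lightning's actual API:

```go
package storage

import "context"

// ExternalStorage is a hypothetical abstraction over data sources:
// a local directory, an S3-compatible object store, FTP, and so on.
// Lightning would discover and read data files through this interface
// instead of going to the OS filesystem directly.
type ExternalStorage interface {
	// WalkDir calls fn once per data file under the storage root.
	WalkDir(ctx context.Context, fn func(path string, size int64) error) error
	// ReadFile reads a whole (small) file, e.g. a schema SQL file.
	ReadFile(ctx context.Context, path string) ([]byte, error)
	// Open returns a reader over one data file for streaming parsing.
	Open(ctx context.Context, path string) (ReadSeekCloser, error)
}

// ReadSeekCloser bundles what a chunked parser needs: sequential
// reads plus seeking, so large files can be split and read in parallel.
type ReadSeekCloser interface {
	Read(p []byte) (n int, err error)
	Seek(offset int64, whence int) (int64, error)
	Close() error
}
```

A local-directory backend would wrap os.Open and directory walking; an S3 backend would map the same calls onto object listing and ranged GET requests.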

gregwebs commented:

There are several S3 FUSE projects. I don't think it should be terribly difficult to make a VFS adapter (it probably depends on error-handling complexity). There are already Apache VFS adapters.

gregwebs commented:

We are starting to use go-cloud for cloud support. It also supports the filesystem as a backend.
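
As a sketch of that direction, reading one dump file through go-cloud's blob package might look like this (using today's gocloud.dev import paths; the package has since moved from github.com/google/go-cloud, and the bucket URL and object key below are placeholders):

```go
package main

import (
	"context"
	"io"
	"log"
	"os"

	"gocloud.dev/blob"
	_ "gocloud.dev/blob/s3blob" // registers the s3:// URL scheme
)

func main() {
	ctx := context.Background()

	// The same OpenBucket call works for S3, GCS, Azure, or a local
	// directory, depending on the URL scheme and registered drivers.
	bucket, err := blob.OpenBucket(ctx, "s3://my-backup-bucket?region=us-east-1")
	if err != nil {
		log.Fatal(err)
	}
	defer bucket.Close()

	// Stream one mydumper data file without downloading it first.
	r, err := bucket.NewReader(ctx, "dump/mydb.mytable.0001.sql", nil)
	if err != nil {
		log.Fatal(err)
	}
	defer r.Close()

	if _, err := io.Copy(os.Stdout, r); err != nil { // stand-in for a parser
		log.Fatal(err)
	}
}
```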

kennytm (Collaborator) commented Oct 18, 2018

Nice! We could use github.com/google/go-cloud/blob/* for actual cloud storage, but we can't use fileblob due to this restriction:

> Blob names must only contain alphanumeric characters, slashes, periods, spaces, underscores, and dashes.

I've seen at least one customer give tables Chinese names, and mydumper does not escape them in the filename, making this implementation unusable for us.

gregwebs commented:

Yeah, it looks like fileblob is just meant for testing purposes anyway, so two pathways (cloud or file) would still be needed.

tennix (Member) commented Jul 3, 2020

Is there any update on this? To be more cloud-native, we need to support restoring from S3 storage. Although we have a workaround in tidb-operator that uses rclone to download all backup files locally and then feed them to tidb-lightning, it's time-consuming and not user-friendly (users cannot easily determine how large a PV is required).

kennytm added the priority/P1 label and removed the priority/P2 label on Jul 3, 2020
overvenus (Member) commented:

For the AWS Aurora scenario, Aurora exports data in CSV format, partitioned into multiple files. It's worth taking into consideration. https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/AuroraMySQL.Integrating.SaveIntoS3.html#AuroraMySQL.Integrating.SaveIntoS3.Grant

kennytm (Collaborator) commented Jul 27, 2020

🤔 Partitioning into multiple files isn't a problem (it is actually desired). The problem is that the file name does not end in *.csv:

s3-region://bucket-name/file-prefix.part_00000
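
For illustration, a tiny Go check showing why extension-based format detection comes up empty on these names (the second path mirrors the example above):

```go
package main

import (
	"fmt"
	"path/filepath"
)

func main() {
	// mydumper-style name: the extension identifies the format.
	fmt.Println(filepath.Ext("mydb.mytable.0001.csv")) // ".csv"

	// Aurora export name: the "extension" is just the partition
	// suffix, so matching against ".csv"/".sql" finds nothing.
	fmt.Println(filepath.Ext("file-prefix.part_00000")) // ".part_00000"
}
```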

IANTHEREAL added the feature/accepted label and removed the feature-request label on Jul 27, 2020
overvenus (Member) commented:

Can Lightning choose the file format based on its content?

Also, Aurora can export data in TEXT format, and the naming is the same as for the CSV format.

kennytm (Collaborator) commented Jul 27, 2020

It can, but I don't trust Lightning to do so 🙂.

Perhaps we need RFC 5 anyway.

glorv (Contributor) commented Jul 27, 2020

> Also, Aurora can export data in TEXT format, and the naming is the same as for the CSV format.

It seems the TEXT format is like the CSV format, but it uses TAB as the delimiter and writes the raw value of each field. So it's not a very good format: e.g. if a string field contains a TAB, it's hard to distinguish which TABs are delimiters and which are field values (see the illustration below). We should recommend that customers use CSV instead of TEXT.
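
A minimal illustration of that ambiguity, assuming fields are written raw with no escaping as described above:

```go
package main

import (
	"fmt"
	"strings"
)

func main() {
	// Intended row: two fields, the first containing a literal TAB.
	// With raw values, the serialized row is byte-identical to a
	// three-field row, so the reader recovers the wrong field count.
	row := "a\tb" + "\t" + "c" // meant as ["a\tb", "c"]
	fields := strings.Split(row, "\t")
	fmt.Println(len(fields), fields) // 3 [a b c], not the intended 2
}
```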

glorv (Contributor) commented Jul 27, 2020

> It can, but I don't trust Lightning to do so 🙂.
>
> Perhaps we need RFC 5 anyway.

Shall we provide an option that lets users explicitly set the input file format to CSV or SQL when the file names don't end with those extensions?
Or we could support the special naming pattern of S3 file partitions, like part_0000 (see the sketch below). In the future, if we support reading from compressed files, we should also support partitioned compressed files like schema.table.csv.001.tar.gz.
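
For illustration only, a sketch of what user-supplied routing rules could look like in Go; the rule shape and the Aurora pattern are assumptions for this sketch, not a committed design:

```go
package main

import (
	"fmt"
	"regexp"
)

// fileRule maps a file-name pattern to a source format. Built-in rules
// cover mydumper names; users could append rules for exports such as
// Aurora's part_NNNNN files.
type fileRule struct {
	pattern *regexp.Regexp
	format  string // "csv" or "sql"
}

var rules = []fileRule{
	{regexp.MustCompile(`\.csv$`), "csv"},
	{regexp.MustCompile(`\.sql$`), "sql"},
	// Hypothetical user-defined rule for Aurora partition names.
	{regexp.MustCompile(`\.part_\d+$`), "csv"},
}

// formatOf returns the format of the first matching rule, if any.
func formatOf(name string) (string, bool) {
	for _, r := range rules {
		if r.pattern.MatchString(name) {
			return r.format, true
		}
	}
	return "", false
}

func main() {
	fmt.Println(formatOf("mydb.mytable.0001.csv"))  // csv true
	fmt.Println(formatOf("file-prefix.part_00000")) // csv true
}
```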

glorv (Contributor) commented Jul 27, 2020

> It can, but I don't trust Lightning to do so 🙂.
>
> Perhaps we need RFC 5 anyway.

It seems that if we want to support partitioned files in S3 buckets or partitioned compressed files, RFC 5 needs to be updated. And I'm afraid that if the routing rules are complex, it will be hard to teach users to use this feature.

overvenus (Member) commented:

For Aurora partitioned dumps, Lightning could read the Aurora dump manifest directly.
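
For illustration, a sketch of reading such a manifest in Go; the JSON shape (an "entries" list of S3 URLs) is assumed from the AWS documentation linked above, and only the fields needed here are declared:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// manifest mirrors the assumed shape of an Aurora S3 export manifest:
// one entry per exported data file.
type manifest struct {
	Entries []struct {
		URL       string `json:"url"`
		Mandatory bool   `json:"mandatory"`
	} `json:"entries"`
}

func main() {
	raw := []byte(`{
		"entries": [
			{"url": "s3://bucket/file-prefix.part_00000", "mandatory": true},
			{"url": "s3://bucket/file-prefix.part_00001", "mandatory": true}
		]
	}`)

	var m manifest
	if err := json.Unmarshal(raw, &m); err != nil {
		panic(err)
	}
	for _, e := range m.Entries {
		fmt.Println(e.URL) // the exact files to import, in order
	}
}
```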

kennytm (Collaborator) commented Jul 27, 2020

That is a large departure from the existing model (walking the directory to discover files), and the existing model does work (if you don't scatter the data source around multiple irrelevant places), so I regard manifest file support as low priority.

scsldb added this to the 4.0.10 milestone on Oct 20, 2020