Update Iceberg metadata in case of DR #5779

Closed
asheeshgarg opened this issue Sep 16, 2022 · 9 comments

@asheeshgarg

Query engine: Spark

Question:

Let's say we have a DR situation where we would like to update the Iceberg metadata for data copied to a DR location. Since S3 bucket names are globally unique, we will have different bucket names in the DR location.
How can we rewrite the metadata so that it points to the correct DR S3 location? Is there any util or Spark procedure for this?

@asheeshgarg
Author

Just saw this thread: #1617

@asheeshgarg
Author

Can we use the migrate_table procedure for this, specifying an S3 path that points to the destination location?
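
For reference, a minimal sketch of invoking the Spark procedure this comment seems to refer to (Iceberg's procedure is named migrate; the table db.sample is a placeholder, and the Hive-backed spark_catalog setup is an assumption). Per the Iceberg docs, migrate converts an existing Hive/Spark table to Iceberg in place rather than taking a destination path:

spark-sql \
  --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
  --conf spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkSessionCatalog \
  --conf spark.sql.catalog.spark_catalog.type=hive \
  -e "CALL spark_catalog.system.migrate('db.sample')"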

@singhpk234
Contributor

@asheeshgarg, will S3 access points for Iceberg work for your use case?

@asheeshgarg
Author

asheeshgarg commented Sep 19, 2022

@singhpk234 S3 access points are still region specific; access point ARNs use the format arn:aws:s3:region:account-id:accesspoint/resource.
When we enable
--conf spark.sql.catalog.test.s3.access-points.my-bucket1=arn:aws:s3::123456789012:accesspoint:mfzwi23gnjvgw.mrap
--conf spark.sql.catalog.test.s3.access-points.my-bucket2=arn:aws:s3::123456789012:accesspoint:mfzwi23gnjvgw.mrap
does it use my-bucket1 while writing the data in metadata, which we could then map to a specific bucket in case of DR?

@singhpk234
Contributor

@asheeshgarg

The metadata files will still point to my-bucket1 (the actual S3 path), but when Iceberg makes S3 requests (GET + PUT) the my-bucket1 path will be replaced by the access point. The access point then takes care of replication across the buckets configured behind it and chooses the best available low-latency bucket.
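
For reference, a minimal sketch of a Spark catalog configured this way (the catalog name test, the Hive catalog type, the bucket name, and the access-point ARN are placeholders carried over from the comments above):

# Table locations inside metadata files keep referring to s3://my-bucket1/...;
# S3FileIO substitutes the mapped access-point ARN only when it issues
# GET/PUT requests against S3.
spark-sql \
  --conf spark.sql.catalog.test=org.apache.iceberg.spark.SparkCatalog \
  --conf spark.sql.catalog.test.type=hive \
  --conf spark.sql.catalog.test.io-impl=org.apache.iceberg.aws.s3.S3FileIO \
  --conf spark.sql.catalog.test.s3.access-points.my-bucket1=arn:aws:s3::123456789012:accesspoint:mfzwi23gnjvgw.mrap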

@asheeshgarg
Author

asheeshgarg commented Sep 19, 2022

@singhpk234 so just to understand it correctly: we would define two buckets for cross-region,
--conf spark.sql.catalog.test.s3.access-points.my-bucket1=arn:aws-region1
--conf spark.sql.catalog.test.s3.access-points.my-bucket2=arn:aws:s3-region2
and Iceberg takes care of replicating it across regions.
Region1 metadata will be replaced by the my-bucket1 actual S3 pointer in the metadata.
Region2 metadata will be replaced by the my-bucket2 actual S3 pointer in the metadata.
Then we just need to start the metastore in the new region and it will work. Is this the correct understanding?

@singhpk234
Contributor

singhpk234 commented Sep 20, 2022

Yes, if you map both buckets (present in different regions) to a multi-region access point.

You can refer to this Slack thread as well, where this idea originated: https://apache-iceberg.slack.com/archives/C025PH0G1D4/p1645066803099319

Region1 metadata will be replaced by the my-bucket1 actual S3 pointer in the metadata.
Region2 metadata will be replaced by the my-bucket2 actual S3 pointer in the metadata.

No. Let's say your table path is under my-bucket1; then both my-bucket1 in region1 and my-bucket2 in region2 will have paths referring to my-bucket1 inside the metadata files. It is only at the time of the S3 call (GET / PUT) that the my-bucket1 reference is replaced with the multi-region access point.

Now, if you use a multi-region access point pointing to my-bucket1 and my-bucket2, it acts as a proxy and single global hostname between the two and internally routes each request to the location with the lowest latency.

More about multi-region access points here: https://aws.amazon.com/s3/features/multi-region-access-points/
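
Putting the pieces together for the DR scenario in this thread, a minimal sketch under these assumptions: the two buckets are kept in sync behind one multi-region access point (replication is configured on the AWS side, outside Iceberg), the catalog name test and Hive catalog type are placeholders, and the ARN is the placeholder used in the comments above:

# The same configuration is used in both regions. Table locations in the
# metadata files stay as s3://my-bucket1/..., so the replicated metadata works
# unchanged once the catalog/metastore is brought up in the DR region;
# S3FileIO routes GET/PUT requests through the multi-region access point.
spark-sql \
  --conf spark.sql.catalog.test=org.apache.iceberg.spark.SparkCatalog \
  --conf spark.sql.catalog.test.type=hive \
  --conf spark.sql.catalog.test.io-impl=org.apache.iceberg.aws.s3.S3FileIO \
  --conf spark.sql.catalog.test.s3.access-points.my-bucket1=arn:aws:s3::123456789012:accesspoint:mfzwi23gnjvgw.mrap \
  --conf spark.sql.catalog.test.s3.access-points.my-bucket2=arn:aws:s3::123456789012:accesspoint:mfzwi23gnjvgw.mrap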

@github-actions

This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.

github-actions bot added the stale label Mar 20, 2023
@github-actions

github-actions bot commented Apr 6, 2023

This issue has been closed because it has not received any activity in the last 14 days since being marked as 'stale'

github-actions bot closed this as not planned Apr 6, 2023