S3 Bucket as data_dir #1319
👋 Thanks for opening your first issue here! Please fill out the template with as much detail as possible. We appreciate that you took the time to contribute!
Potential solution

To enable the usage of S3 buckets as data directories for HyP3 input files, we need to modify the existing codebase to recognize and handle S3 URLs. This involves integrating AWS S3 interaction capabilities using the boto3 library.

How to implement

1. Add a helper that detects s3:// URLs in input paths.
2. Add a helper that downloads the referenced objects to local storage with boto3.
3. Apply these helpers in src/mintpy/load_data.py, src/mintpy/prep_hyp3.py, and src/mintpy/defaults/auto_path.py so S3 paths are handled before the existing loading logic runs.

By following these steps, the codebase will be extended to support S3 bucket URLs as input paths, allowing for more flexible data storage options in cloud environments.

Files used for this task:

Changes on src/mintpy/load_data.py

To enable the usage of S3 bucket URLs as data directories in the load_data.py module, detect S3 paths and download the referenced files to a temporary local directory before loading them.
Here's a basic outline of how you might implement these changes:

import boto3
import os
import tempfile
def is_s3_url(path):
    return path.startswith('s3://')

def download_from_s3(s3_url, local_dir):
    s3 = boto3.client('s3')
    bucket_name, key = s3_url.replace("s3://", "").split("/", 1)
    local_path = os.path.join(local_dir, os.path.basename(key))
    s3.download_file(bucket_name, key, local_path)
    return local_path
def load_data_from_path(path):
    if is_s3_url(path):
        with tempfile.TemporaryDirectory() as temp_dir:
            local_path = download_from_s3(path, temp_dir)
            # Proceed with loading data from local_path, inside this block:
            # the temporary directory (and the downloaded file) is removed
            # as soon as the block exits.
            pass
    else:
        # Proceed with loading data from the local filesystem.
        pass
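For example, with a hypothetical bucket and product name:

load_data_from_path('s3://my-bucket/hyp3/S1AA_20230101_20230113_unw_phase.tif')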
By following these steps, you can extend the functionality of the load_data.py module to accept S3 bucket URLs alongside local paths.

Changes on src/mintpy/prep_hyp3.py

To modify the prep_hyp3.py module so it can prepare HyP3 products referenced by S3 URLs, download each remote file to local storage before its metadata is read.

Here is a concrete proposal for the changes:

import datetime as dt
import os
import boto3
from urllib.parse import urlparse
from mintpy.constants import SPEED_OF_LIGHT
from mintpy.objects import sensor
from mintpy.utils import readfile, utils1 as ut, writefile
# Initialize S3 client
s3_client = boto3.client('s3')
def is_s3_url(url):
    return url.startswith('s3://')

def download_from_s3(s3_url, local_path):
    parsed_url = urlparse(s3_url)
    bucket = parsed_url.netloc
    key = parsed_url.path.lstrip('/')
    s3_client.download_file(bucket, key, local_path)
def add_hyp3_metadata(fname, meta, is_ifg=True):
    # Check if the file is on S3; if so, fetch a local copy first
    if is_s3_url(fname):
        local_fname = '/tmp/' + os.path.basename(fname)
        download_from_s3(fname, local_fname)
        fname = local_fname
    # Existing logic follows...
    job_id = '_'.join(os.path.basename(fname).split('_')[:8])
    # ... rest of the function remains unchanged
def prep_hyp3(inps):
    inps.file = ut.get_file_list(inps.file, abspath=True)
    for fname in inps.file:
        is_ifg = any(x in fname for x in ['unw_phase', 'corr'])
        meta = readfile.read_gdal_vrt(fname)
        meta = add_hyp3_metadata(fname, meta, is_ifg=is_ifg)
        rsc_file = fname + '.rsc'
        writefile.write_roipac_rsc(meta, out_file=rsc_file)
    return

Key Points:

This approach allows the existing logic to remain mostly unchanged while adding support for S3 URLs.

Changes on src/mintpy/defaults/auto_path.py

To extend the path configuration utilities in auto_path.py, make the template parser aware of s3:// values so they are preserved verbatim and can be flagged for special handling downstream.
Here's a conceptual example of how you might start modifying the read_str2dict function:

def read_str2dict(inString, delimiter='=', print_msg=False):
    '''Read multiple lines of string into dict
    Based on mintpy.utils.readfile.read_template()
    '''
    strDict = {}
    lines = inString.split('\n')
    for line in lines:
        c = [i.strip() for i in line.strip().split(delimiter, 1)]
        if len(c) < 2 or line.startswith(('%', '#')):
            continue
        else:
            key = c[0]
            value = str.replace(c[1], '\n', '').split("#")[0].strip()
            if value != '':
                # Check if the value is an S3 URL
                if value.startswith('s3://'):
                    # Handle S3 URL (e.g., store it differently or mark it for special processing)
                    strDict[key] = value
                else:
                    strDict[key] = value

    # set 'None' to None
    for key, value in strDict.items():
        if value.lower() == 'none':
            strDict[key] = None
    return strDict

This is a starting point, and you'll need to integrate this with the rest of the codebase, ensuring that any S3-specific logic is appropriately handled in other parts of the application.
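A quick demonstration of how the parser would treat an S3 value (the template keys and bucket below are made up for illustration):

template = '''
mintpy.load.unwFile = s3://my-bucket/hyp3/*unw_phase.tif  # interferogram stack
mintpy.troposphericDelay.method = none
'''

opts = read_str2dict(template)
# opts['mintpy.load.unwFile']             -> 's3://my-bucket/hyp3/*unw_phase.tif'
# opts['mintpy.troposphericDelay.method'] -> None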
When you say you want to use "S3 buckets as data directories for hyp3 input files", do you mean staging the HyP3 inputs from an S3 bucket onto local disk before processing, or having MintPy read from and write to S3 buckets directly (or something else?) The former is a smaller change which saves a little space; the latter would require a very large rewrite for most of MintPy to read from and write to S3 buckets directly.
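As a point of reference for the "direct" option: since MintPy already reads HyP3 products through GDAL, one existing mechanism for reading rasters straight out of S3 is GDAL's /vsis3/ virtual filesystem. A minimal sketch, assuming GDAL was built with S3 support and AWS credentials are available in the environment (the bucket and key are made up):

from osgeo import gdal

# s3://bucket/key maps to the /vsis3/bucket/key virtual path
s3_path = '/vsis3/my-bucket/hyp3/S1AA_example_unw_phase.tif'

ds = gdal.Open(s3_path)  # opens the raster without downloading it first
if ds is not None:
    data = ds.GetRasterBand(1).ReadAsArray()  # pixels fetched via HTTP range requests
    print(ds.RasterXSize, ds.RasterYSize)

Writing back to S3 is the harder part: /vsis3/ supports sequential writes for some formats, but MintPy's HDF5-based outputs would still need to be written locally and uploaded afterwards.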
The idea would be the latter, since it would cut a lot of the costs involved in storage. I took a look at the code, and indeed there isn't an easy way of adding an S3Path to MintPy right away.
Do you mean expensive for long-term storage, or for any use at all? Based on a quick estimate, that storage works out to roughly $0.30 a day. Are you seeing different prices, or were you picturing another, more expensive use case?
Our intended use case is to deploy over multiple AOIs globally. Right now, using a temporal baseline of 37 days, 10x2 looks, and acquiring data from 2017 to 2025, we are getting roughly 700 pairs per AOI (burst), resulting in approximately 80 GB of input data plus MintPy results per burst. The biggest problem around storage costs is that we are leveraging AWS Batch / AWS Processing Jobs to execute AOIs in parallel to scale the workflow. When executing in parallel, I need to spin up multiple machines/workers, each with dedicated storage space. I had a much larger number in mind than $0.30 a day, but I'll run some tests with this workflow and report back on costs here. Thanks a lot for the great discussion @scottstanie
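For a rough sense of scale (assuming current us-east-1 list prices, which vary by region and over time): gp3 EBS costs about $0.08 per GB-month, so a 100 GB working volume is roughly

100 GB × $0.08/GB-month ≈ $8/month ≈ $0.27/day per worker,

which matches the ~$0.30/day figure above; the total then scales linearly with the number of parallel workers that each need their own volume.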
@mthsdiniz-usp for HyP3, we use AWS Batch as well. We leverage the SSDs included on board some EC2 instance types for local storage when processing, then just upload the results to S3 at the end (we've had good luck with that setup). Feel free to ping me if you want to chat about how we've got things set up and share experiences.
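A minimal sketch of that "process locally, upload at the end" pattern with boto3 (the bucket name and paths are made up):

import os
import boto3

def upload_results(results_dir, bucket, prefix):
    """Upload every file under results_dir to s3://bucket/prefix/."""
    s3 = boto3.client('s3')
    for root, _dirs, files in os.walk(results_dir):
        for name in files:
            local_path = os.path.join(root, name)
            key = f'{prefix}/{os.path.relpath(local_path, results_dir)}'
            s3.upload_file(local_path, bucket, key)

# e.g., after MintPy finishes on the worker's scratch SSD:
# upload_results('/scratch/mintpy', 'my-results-bucket', 'aoi-042/mintpy')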
Enable the usage of S3 buckets as data directories for HyP3 input files.
Is your feature request related to a problem? Please describe
Describe the solution you'd like
Usage of S3 bucket URLs as data input directories for HyP3 files.
Describe alternatives you have considered
Additional context
Depending on temporal baseline the total storage size can be near 100 GB. When working on a cloud environment, storing files in S3 is cheaper than keeping files on EBS / EFS storage.
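For reference (us-east-1 list prices at the time of writing; they vary by region): S3 Standard is about $0.023/GB-month, gp3 EBS about $0.08/GB-month, and EFS Standard about $0.30/GB-month, so for ~100 GB per burst that is roughly $2.30 vs $8 vs $30 per month, before counting duplication across parallel workers.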
Are you willing to help implement and maintain this feature?