S3 Bucket as data_dir #1319

Open · 1 of 2 tasks
mthsdiniz opened this issue Feb 7, 2025 · 7 comments

@mthsdiniz

Enable the usage of S3 buckets as data directories for HyP3 input files.

Is your feature request related to a problem? Please describe

Describe the solution you'd like
Usage of S3 bucket URLs as data input directories for HyP3 files.

Describe alternatives you have considered

Additional context
Depending on the temporal baseline, the total storage size can approach 100 GB. When working in a cloud environment, storing files in S3 is cheaper than keeping them on EBS / EFS storage.

Are you willing to help implement and maintain this feature?

  • Yes
  • No

welcome bot commented Feb 7, 2025

👋 Thanks for opening your first issue here! Please fill out the template with as much detail as possible. We appreciate that you took the time to contribute!
Make sure you read our contributing guidelines.


codeautopilot bot commented Feb 7, 2025

Potential solution

To enable the usage of S3 buckets as data directories for HyP3 input files, we need to modify the existing codebase to recognize and handle S3 URLs. This involves integrating AWS S3 interaction capabilities via the boto3 library, which allows us to download or stream files from S3. The file handling logic is updated across multiple files so that data can be accessed seamlessly whether it is stored locally or in an S3 bucket.

How to implement

  1. Install Boto3: Ensure that the boto3 library is installed in your environment. This library is essential for interacting with AWS S3.

  2. Modify src/mintpy/load_data.py:

    • Add functions to identify S3 URLs and download files from S3.
    • Update the data loading logic to handle S3 URLs by downloading files to a temporary directory.
    import boto3
    import os
    import tempfile
    
    def is_s3_url(path):
        return path.startswith('s3://')
    
    def download_from_s3(s3_url, local_dir):
        s3 = boto3.client('s3')
        bucket_name, key = s3_url.replace("s3://", "").split("/", 1)
        local_path = os.path.join(local_dir, os.path.basename(key))
        s3.download_file(bucket_name, key, local_path)
        return local_path
    
    def load_data_from_path(path):
        if is_s3_url(path):
            with tempfile.TemporaryDirectory() as temp_dir:
                local_path = download_from_s3(path, temp_dir)
                # Proceed with loading data from local_path here, inside the
                # "with" block, since the temporary directory is deleted on exit
                pass
        else:
            # Proceed with loading data from the local filesystem
            pass
  3. Modify src/mintpy/prep_hyp3.py:

    • Import boto3 and add functions to handle S3 URLs.
    • Update the add_hyp3_metadata function to download files from S3 if necessary.
    import os
    import boto3
    from urllib.parse import urlparse
    
    s3_client = boto3.client('s3')
    
    def is_s3_url(url):
        return url.startswith('s3://')
    
    def download_from_s3(s3_url, local_path):
        parsed_url = urlparse(s3_url)
        bucket = parsed_url.netloc
        key = parsed_url.path.lstrip('/')
        s3_client.download_file(bucket, key, local_path)
    
    def add_hyp3_metadata(fname, meta, is_ifg=True):
        if is_s3_url(fname):
            local_fname = '/tmp/' + os.path.basename(fname)
            download_from_s3(fname, local_fname)
            fname = local_fname
    
        job_id = '_'.join(os.path.basename(fname).split('_')[:8])
        # ... rest of the function remains unchanged
    
    def prep_hyp3(inps):
        inps.file = ut.get_file_list(inps.file, abspath=True)
    
        for fname in inps.file:
            is_ifg = any([x in fname for x in ['unw_phase','corr']])
            meta = readfile.read_gdal_vrt(fname)
            meta = add_hyp3_metadata(fname, meta, is_ifg=is_ifg)
    
            rsc_file = fname+'.rsc'
            writefile.write_roipac_rsc(meta, out_file=rsc_file)
    
        return
  4. Modify src/mintpy/defaults/auto_path.py:

    • Update the read_str2dict function to recognize S3 URLs and handle them appropriately.
    def read_str2dict(inString, delimiter='=', print_msg=False):
        strDict = {}
        lines = inString.split('\n')
        for line in lines:
            c = [i.strip() for i in line.strip().split(delimiter, 1)]
            if len(c) < 2 or line.startswith(('%', '#')):
                continue
            else:
                key = c[0]
                value = str.replace(c[1], '\n', '').split("#")[0].strip()
                if value != '':
                    if value.startswith('s3://'):
                        # Keep the S3 URL as-is; mark it here if special handling is needed later
                        strDict[key] = value
                    else:
                        strDict[key] = value
    
        for key, value in strDict.items():
            if value.lower() == 'none':
                strDict[key] = None
        return strDict
  5. Testing: Ensure that the changes are tested with various configurations to verify that S3 URLs are correctly recognized and processed. This includes testing with both local and S3 paths to ensure seamless integration.
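
For the testing step above, here is a minimal pytest sketch; it assumes the is_s3_url helper ends up in mintpy.load_data, so adjust the import to wherever the helper actually lives.

import pytest

from mintpy.load_data import is_s3_url  # assumed location of the helper

@pytest.mark.parametrize('path, expected', [
    ('s3://my-bucket/ifgs/unw_phase.tif', True),   # S3 URL
    ('/data/hyp3/unw_phase.tif', False),           # absolute local path
    ('hyp3/unw_phase.tif', False),                 # relative local path
])
def test_is_s3_url(path, expected):
    assert is_s3_url(path) == expected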

By following these steps, the codebase will be extended to support S3 bucket URLs as input paths, allowing for more flexible data storage options in cloud environments.


Files used for this task:

Changes on src/mintpy/load_data.py

To enable the usage of S3 bucket URLs as data directories in the src/mintpy/load_data.py file, you can follow these steps:

  1. Install Boto3: Ensure that the boto3 library is installed in your environment. This library is essential for interacting with AWS S3.

  2. Modify Data Loading Logic: Update the functions responsible for loading data to handle S3 URLs. This involves checking if the input path is an S3 URL and using boto3 to download or stream the data.

  3. Identify S3 URLs: You can identify S3 URLs by checking if the path starts with s3://.

  4. Download or Stream Data: Use boto3 to access the S3 bucket. You can either download the files to a temporary directory or stream them directly if your application supports it.

  5. Update File Handling: Ensure that all file handling logic can work with files downloaded from S3. This might involve using temporary files or in-memory file-like objects.

Here's a basic outline of how you might implement these changes:

import boto3
import os
import tempfile

def is_s3_url(path):
    return path.startswith('s3://')

def download_from_s3(s3_url, local_dir):
    s3 = boto3.client('s3')
    bucket_name, key = s3_url.replace("s3://", "").split("/", 1)
    local_path = os.path.join(local_dir, os.path.basename(key))
    s3.download_file(bucket_name, key, local_path)
    return local_path

def load_data_from_path(path):
    if is_s3_url(path):
        with tempfile.TemporaryDirectory() as temp_dir:
            local_path = download_from_s3(path, temp_dir)
            # Proceed with loading data from local_path here, inside the
            # "with" block, since the temporary directory is deleted on exit
            pass
    else:
        # Proceed with loading data from the local filesystem
        pass

  6. Integrate with Existing Logic: Integrate the above logic into the existing data loading functions, ensuring that all parts of the code that handle file paths are updated to support S3 URLs.

By following these steps, you can extend the functionality of the load_data.py file to support S3 bucket URLs as input paths for data loading.
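
As an alternative to downloading, here is a minimal sketch of streaming an object into memory instead (assuming the file is small enough to hold in RAM; the bucket and key in the example are hypothetical):

import io
import boto3

def open_s3_object(s3_url):
    # Return a file-like object holding the S3 object's bytes in memory.
    bucket_name, key = s3_url.replace('s3://', '').split('/', 1)
    s3 = boto3.client('s3')
    body = s3.get_object(Bucket=bucket_name, Key=key)['Body']
    return io.BytesIO(body.read())

# Example (hypothetical bucket/key):
# fileobj = open_s3_object('s3://my-hyp3-bucket/pair/unw_phase.tif')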

Changes on src/mintpy/prep_hyp3.py

To modify the src/mintpy/prep_hyp3.py file to support S3 bucket URLs as input paths for HyP3 metadata preparation, you can follow these steps:

  1. Import Necessary Libraries: You will need to use boto3 to interact with S3. Ensure that boto3 is imported at the beginning of the file.

  2. Modify File Handling: Update the file handling logic to check if the input path is an S3 URL. If it is, use boto3 to download the file locally or read it directly from S3.

  3. Update add_hyp3_metadata Function: Modify the function to handle S3 paths. This may involve checking if the fname is an S3 URL and then using boto3 to access the metadata file.

Here is a concrete proposal for the changes:

import datetime as dt
import os
import boto3
from urllib.parse import urlparse

from mintpy.constants import SPEED_OF_LIGHT
from mintpy.objects import sensor
from mintpy.utils import readfile, utils1 as ut, writefile

# Initialize S3 client
s3_client = boto3.client('s3')

def is_s3_url(url):
    return url.startswith('s3://')

def download_from_s3(s3_url, local_path):
    parsed_url = urlparse(s3_url)
    bucket = parsed_url.netloc
    key = parsed_url.path.lstrip('/')
    s3_client.download_file(bucket, key, local_path)

def add_hyp3_metadata(fname, meta, is_ifg=True):
    # Check if the file is on S3
    if is_s3_url(fname):
        local_fname = '/tmp/' + os.path.basename(fname)
        download_from_s3(fname, local_fname)
        fname = local_fname

    # Existing logic follows...
    job_id = '_'.join(os.path.basename(fname).split('_')[:8])
    # ... rest of the function remains unchanged

def prep_hyp3(inps):
    inps.file = ut.get_file_list(inps.file, abspath=True)

    for fname in inps.file:
        is_ifg = any([x in fname for x in ['unw_phase','corr']])
        meta = readfile.read_gdal_vrt(fname)
        meta = add_hyp3_metadata(fname, meta, is_ifg=is_ifg)

        rsc_file = fname+'.rsc'
        writefile.write_roipac_rsc(meta, out_file=rsc_file)

    return

Key Points:

  • S3 URL Handling: The is_s3_url function checks if a path is an S3 URL. If it is, the download_from_s3 function downloads the file to a temporary local path.
  • Temporary Local Storage: Files from S3 are downloaded to a temporary directory (/tmp/) for processing.
  • Boto3 Client: A boto3 client is used to interact with S3, which requires AWS credentials to be configured in the environment where this script runs.

This approach allows the existing logic to remain mostly unchanged while adding support for S3 URLs.
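
On the credentials point above, a minimal sketch of the usual configuration options (the profile name below is hypothetical):

import boto3

# Option 1: rely on the default credential chain (the environment variables
# AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY / AWS_DEFAULT_REGION,
# ~/.aws/credentials, or an IAM role attached to the EC2 / Batch instance).
s3_client = boto3.client('s3')

# Option 2: select a specific named profile from ~/.aws/credentials.
session = boto3.Session(profile_name='hyp3-processing')  # hypothetical profile name
s3_client = session.client('s3')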

Changes on src/mintpy/defaults/auto_path.py

To extend the path configuration utilities in src/mintpy/defaults/auto_path.py to recognize S3 URLs, you can follow these steps:

  1. Identify S3 URL Patterns: Determine how S3 URLs will be specified in the configuration. Typically, they follow the pattern s3://bucket-name/path/to/object.

  2. Modify read_str2dict Function: Update the read_str2dict function to recognize and handle S3 URLs. You might want to add logic to check if a value starts with s3:// and handle it accordingly.

  3. Update get_auto_path Function: Modify the get_auto_path function to process S3 URLs. This might involve checking if any paths in the auto_path_dict are S3 URLs and ensuring they are handled correctly.

  4. Add S3 Handling Logic: If necessary, add logic to handle S3 paths, such as downloading files from S3 to a local directory or directly accessing them using a library like boto3.

  5. Testing: Ensure that the changes are tested with various configurations to verify that S3 URLs are correctly recognized and processed.

Here's a conceptual example of how you might start modifying the read_str2dict function:

def read_str2dict(inString, delimiter='=', print_msg=False):
    '''Read multiple lines of string into dict
    Based on mintpy.utils.readfile.read_template()
    '''
    strDict = {}
    lines = inString.split('\n')
    for line in lines:
        c = [i.strip() for i in line.strip().split(delimiter, 1)]
        if len(c) < 2 or line.startswith(('%', '#')):
            continue
        else:
            key = c[0]
            value = str.replace(c[1], '\n', '').split("#")[0].strip()
            if value != '':
                # Check if the value is an S3 URL
                if value.startswith('s3://'):
                    # Handle S3 URL (e.g., store it differently or mark it for special processing)
                    strDict[key] = value
                else:
                    strDict[key] = value

    # set 'None' to None
    for key, value in strDict.items():
        if value.lower() == 'none':
            strDict[key] = None
    return strDict
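
As a usage sketch of the function above, note that S3 values simply pass through as plain strings; the data_dir key and bucket name here are hypothetical:

template_str = '''
data_dir  = s3://my-hyp3-bucket/interferograms   # hypothetical S3 location
ref_date  = none
'''
cfg = read_str2dict(template_str)
print(cfg)
# {'data_dir': 's3://my-hyp3-bucket/interferograms', 'ref_date': None}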

This is a starting point, and you'll need to integrate this with the rest of the codebase, ensuring that any S3-specific logic is appropriately handled in other parts of the application.

This comment was generated by AI. Information provided may be incorrect.


@scottstanie
Contributor

When you say you want to use "S3 buckets as data directories for hyp3 input files", do you mean

  • avoid downloading the Hyp3 geotiffs before creating the local mintpy HDF5 files
  • use s3 for all storage instead of the local mintpy HDF5 files

(or something else?)

The former is a smaller change which saves a little space; the latter would require a very large rewrite for most of Mintpy to read from and write to S3 buckets directly.

@mthsdiniz-usp

mthsdiniz-usp commented Feb 7, 2025

> When you say you want to use "S3 buckets as data directories for hyp3 input files", do you mean
>
>   • avoid downloading the Hyp3 geotiffs before creating the local mintpy HDF5 files
>   • use s3 for all storage instead of the local mintpy HDF5 files
>
> (or something else?)
>
> The former is a smaller change which saves a little space; the latter would require a very large rewrite for most of Mintpy to read from and write to S3 buckets directly.

The idea would be to have the latter, since it would greatly reduce storage costs. I took a look at the code, and indeed there isn't an easy way of implementing an S3Path in MintPy right away.

@scottstanie
Contributor

Do you mean expensive for long term storage? Or for any use at all?

Based on
https://aws.amazon.com/ebs/pricing/
Having a 100 GB block for 24 hours of processing would be about $0.30.
Since most MintPy processing takes less time than that, it doesn't seem too expensive to run MintPy normally after provisioning a large disk, then saving the HDF5 files to S3 afterwards.
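
For reference, a back-of-the-envelope version of that estimate, assuming a gp3 rate of roughly $0.08 per GB-month (check the pricing page for your region):

volume_gb = 100
price_per_gb_month = 0.08   # USD, assumed gp3 rate
hours = 24
cost = volume_gb * price_per_gb_month * hours / (30 * 24)
print(f"~${cost:.2f} for {hours} hours")   # ~$0.27 for 24 hours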

Are you seeing different prices? Or were you picturing another more expensive use case?

@mthsdiniz-usp

Our intended use case is to deploy over multiple AOIs globally. Right now, using a temporal baseline of 37 days, 10x2 looks, and acquiring data from 2017 to 2025, we are getting roughly 700 pairs per AOI (burst), resulting in approximately 80 GB of input data plus MintPy results per burst.

The biggest problem around storage costs is that we are leveraging AWS Batch / AWS Processing Jobs to execute AOIs in parallel to scale the workflow. When executing in parallel, I need to spin up multiple machines / workers with dedicated storage space.

I had a much larger number than $0.30 a day in mind, but I'll run some tests with this workflow and report back on costs here.

Thanks a lot for the great discussion @scottstanie

@jhkennedy
Collaborator

@mthsdiniz-usp for HyP3, we use AWS Batch as well. We leverage the SSDs included on board some EC2 instances for local storage when processing, then just upload the results to S3 at the end (we've had good luck with r6id instances), so we're not really paying for storage outside of S3. And if you're storing the products for a while, using S3 Intelligent-Tiering usually results in cost savings.
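
For illustration, a minimal sketch of that upload-at-the-end pattern (the bucket name and paths are hypothetical, and a POSIX-style path layout is assumed, as on Batch/Linux):

import os
import boto3

def upload_results(local_dir, bucket, prefix):
    # Upload every file under local_dir (e.g. the mintpy work directory on the
    # instance-store SSD) to s3://bucket/prefix/ once processing is done.
    s3 = boto3.client('s3')
    for root, _, files in os.walk(local_dir):
        for name in files:
            local_path = os.path.join(root, name)
            rel = os.path.relpath(local_path, local_dir)
            s3.upload_file(local_path, bucket, f'{prefix}/{rel}')

# upload_results('mintpy', 'my-results-bucket', 'AOI_001/mintpy')  # hypothetical names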

Feel free to ping me if you want to chat about how we've got things set up and share experiences.
