
More frictionless S3 direct access #431

Open

abarciauskas-bgse opened this issue Jan 19, 2024 · 8 comments

@abarciauskas-bgse (Contributor) commented Jan 19, 2024

earthaccess allows for filtering datasets by cloud_hosted, discovering the S3 links using data_links(access="direct"), and even downloading. But I'm not able to use earthaccess to open the data directly from S3 on the VEDA JupyterHub. Could this be because the VEDA JupyterHub is associated with a role for Earthdata cloud access?

Right now this is how the code executes:

first_result = earthaccess.search_data(
    short_name='MUR-JPL-L4-GLOB-v4.1',
    cloud_hosted=True,
    count=1
)
# Granules found: 7899

direct_link = first_result[0].data_links(access="direct")
direct_link
# ['s3://podaac-ops-cumulus-protected/MUR-JPL-L4-GLOB-v4.1/20020601090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc']

earthaccess.open(direct_link)
# We cannot open S3 links when we are not in-region, try using HTTPS links

earthaccess responds that it can't open the dataset, even though this code was run in-region. Since I'm using the VEDA hub with direct access, I can fall back to xarray + s3fs to open the link, but having earthaccess.open work for direct access would be a good addition for in-region users who are not on a NASA-managed hub like VEDA.

Ideally, this search and open would be like:

first_result = earthaccess.search_data(
    short_name='MUR-JPL-L4-GLOB-v4.1',
    cloud_hosted=True,
    count=1,
    access="direct"
)
# Granules found: 7899

earthaccess.open(first_result) # opens the data directly from S3

This is very much the example from the README (minus the access="direct" parameter), but, at least in the VEDA JupyterHub, results and .open are using an HTTPFileSystem, not S3.

Perhaps the issue is that earthaccess doesn't recognize that the code is being run in-region?

Apologies if I missed something about how the library is supposed to work!

@abarciauskas-bgse (Contributor, Author) commented:

Also wondering if #424 relates to this; I'll need to dig into that a bit more to understand whether it will help.

@sharkinsspatial commented:

@abarciauskas-bgse Currently, with access="direct", earthaccess internally attempts to initialize an s3fs instance with its internal EDL auth chain, and it is unaware of the execution context's assumed role.

def get_s3fs_session(
    self,
    daac: Optional[str] = None,
    concept_id: Optional[str] = None,
    provider: Optional[str] = None,
    endpoint: Optional[str] = None,
) -> s3fs.S3FileSystem:
    """
    Returns a s3fs instance for a given cloud provider / DAAC

    Parameters:
        daac: any of the DAACs, e.g. NSIDC, PODAAC
        provider: a data provider if we know them, e.g. PODAAC -> POCLOUD
        endpoint: pass the URL for the credentials directly
    Returns:
        a s3fs file instance
    """
    if self.auth is None:
        raise ValueError(
            "A valid Earthdata login instance is required to retrieve S3 credentials"
        )
    if not any([concept_id, daac, provider, endpoint]):
        raise ValueError(
            "At least one of the concept_id, daac, provider or endpoint "
            "parameters must be specified. "
        )
    if concept_id is not None:
        provider = self._derive_concept_provider(concept_id)
    # Get existing S3 credentials if we already have them
    location = (
        daac,
        provider,
        endpoint,
    )  # Identifier for where to get S3 credentials from
    need_new_creds = False
    try:
        dt_init, creds = self._s3_credentials[location]
    except KeyError:
        need_new_creds = True
    else:
        # If cached credentials are expired, invalidate the cache
        delta = datetime.datetime.now() - dt_init
        if round(delta.seconds / 60, 2) > 55:
            need_new_creds = True
            self._s3_credentials.pop(location)
    if need_new_creds:
        # Don't have existing valid S3 credentials, so get new ones
        now = datetime.datetime.now()
        if endpoint is not None:
            creds = self.auth.get_s3_credentials(endpoint=endpoint)
        elif daac is not None:
            creds = self.auth.get_s3_credentials(daac=daac)
        elif provider is not None:
            creds = self.auth.get_s3_credentials(provider=provider)
        # Include new credentials in the cache
        self._s3_credentials[location] = now, creds
    return s3fs.S3FileSystem(
        key=creds["accessKeyId"],
        secret=creds["secretAccessKey"],
        token=creds["sessionToken"],
    )
Our case is fairly exceptional (the general public will never have access to an IAM role with direct DAAC bucket access 😸), so as we discussed, we instead need an "escape hatch" in earthaccess that allows initializing this s3fs instance with a profile or IAM metadata option. This is unrelated to #424, as the same-region requirement is only enforced for temporary access tokens generated by a DAAC's Cumulus s3_credentials endpoint. Let's circle up next week and we can kick off a PR for the IAM "escape hatch", since this will be pretty clutch functionality for improving the VEDA JupyterHub user experience 👍
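A minimal sketch of what such an escape hatch might look like, assuming it simply bypasses the EDL credential flow and defers to botocore's default credential chain. The helper names here are hypothetical, not part of earthaccess:

```python
from typing import Any, Dict, Optional


def iam_s3fs_kwargs(profile: Optional[str] = None) -> Dict[str, Any]:
    """Build s3fs.S3FileSystem kwargs that use the ambient IAM identity.

    With anon=False and no explicit key/secret/token, s3fs defers to
    botocore's default credential chain: environment variables, a shared
    config profile, or instance metadata (i.e. the assumed role).
    """
    kwargs: Dict[str, Any] = {"anon": False}
    if profile is not None:
        # s3fs accepts a `profile` argument naming a shared-config profile
        kwargs["profile"] = profile
    return kwargs


def get_iam_s3fs_session(profile: Optional[str] = None):
    import s3fs  # imported lazily so the sketch has no hard dependency

    return s3fs.S3FileSystem(**iam_s3fs_kwargs(profile))
```

On a hub whose pods already carry a role with DAAC bucket access, such a session could read the buckets without ever calling an s3_credentials endpoint.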

@abarciauskas-bgse (Contributor, Author) commented:

Thanks @sharkinsspatial, that all makes sense to me.

@abarciauskas-bgse (Contributor, Author) commented:

@luzpaz - @sharkinsspatial and I discussed a proposal for how to implement S3 access using the IAM role instead of S3 credentials, thus bypassing all of the Earthdata Login methods.

The API we imagine is:

import earthaccess

earthaccess.login(strategy="iam")

An option strategy == "iam" would be added to https://github.com/nsidc/earthaccess/blob/main/earthaccess/auth.py#L65.

I think we need to add an attribute to the Auth class, but I'm open to suggestions on how to maintain the state of use_iam. My idea is to modify the Auth.__init__ function (https://github.com/nsidc/earthaccess/blob/main/earthaccess/auth.py#L56) to include a use_iam attribute initialized as self.use_iam = False; passing strategy="iam" to earthaccess.login would then set auth.use_iam = True on the Auth instance.

This use_iam attribute could be used to bypass the authentication check in https://github.com/nsidc/earthaccess/blob/main/earthaccess/store.py#L101 (or not, in which case we would just get a warning, which may confuse people). More critically, get_s3fs_session would include, right before the rest of the function code (https://github.com/nsidc/earthaccess/blob/main/earthaccess/store.py#L215):

if self.auth.use_iam:
    return s3fs.S3FileSystem(anon=False)
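To make the proposal concrete, here is a simplified sketch (not the actual earthaccess classes) of how the use_iam flag could be threaded from login through to get_s3fs_session:

```python
# Simplified sketch of the proposed wiring; class and method bodies are
# illustrative, not the real earthaccess implementation.
class Auth:
    def __init__(self) -> None:
        self.authenticated = False
        self.use_iam = False  # proposed new attribute, defaults to False

    def login(self, strategy: str = "netrc") -> "Auth":
        if strategy == "iam":
            # Skip Earthdata Login entirely; trust the execution
            # context's assumed IAM role instead.
            self.use_iam = True
            self.authenticated = True
        else:
            ...  # existing netrc / interactive / environment strategies
        return self


class Store:
    def __init__(self, auth: Auth) -> None:
        self.auth = auth

    def get_s3fs_session(self, **kwargs):
        if self.auth.use_iam:
            import s3fs  # lazy import keeps this sketch dependency-free

            # anon=False with no explicit keys lets botocore resolve
            # credentials from the ambient IAM role.
            return s3fs.S3FileSystem(anon=False)
        ...  # fall through to the existing EDL credential flow
```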

Let us know what you think.

@betolink (Member) commented:

Hi @abarciauskas-bgse, I like the earthaccess.login(strategy="IAM") idea, but I'm not sure the s3fs session issue is related to this. I just ran this code in the Openscapes hub and it worked as expected, returning a list with an S3 file-like object:

first_result = earthaccess.search_data(
    short_name='MUR-JPL-L4-GLOB-v4.1',
    cloud_hosted=True,
    count=1
)
# Granules found: 7899
fileset = earthaccess.open(first_result)
fileset
# [<File-like object S3FileSystem, podaac-ops-cumulus-protected/MUR-JPL-L4-GLOB-v4.1/20020601090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc>]

If we use the links directly, we need to tell earthaccess which provider it should use so it can grab the credentials from a dictionary. This should be more dynamic; in the near future earthaccess should infer which credential endpoint it needs to use.

results = earthaccess.search_data(
    short_name='MUR-JPL-L4-GLOB-v4.1',
    cloud_hosted=True,
    count=3
)
# if this collection had more than one file per granule we'll have to flatten the list instead of grabbing the first link
links = [g.data_links(access="direct")[0] for g in results]

fileset = earthaccess.open(links, provider="POCLOUD")
fileset
# [<File-like object S3FileSystem, podaac-ops-cumulus-protected/MUR-JPL-L4-GLOB-v4.1/20020601090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc>,
#  <File-like object S3FileSystem, podaac-ops-cumulus-protected/MUR-JPL-L4-GLOB-v4.1/20020602090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc>,
#  <File-like object S3FileSystem, podaac-ops-cumulus-protected/MUR-JPL-L4-GLOB-v4.1/20020603090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc>]
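The dictionary lookup mentioned above could eventually be replaced by inferring the provider from the link itself. A hypothetical sketch (the bucket-to-provider mapping below is illustrative, not earthaccess's real table):

```python
from typing import Optional
from urllib.parse import urlparse

# Illustrative mapping from known DAAC bucket names to CMR providers;
# a real implementation would derive this from CMR metadata.
BUCKET_TO_PROVIDER = {
    "podaac-ops-cumulus-protected": "POCLOUD",
}


def infer_provider(s3_url: str) -> Optional[str]:
    """Guess the credential provider from an s3:// link's bucket name."""
    bucket = urlparse(s3_url).netloc
    return BUCKET_TO_PROVIDER.get(bucket)


url = ("s3://podaac-ops-cumulus-protected/MUR-JPL-L4-GLOB-v4.1/"
       "20020601090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc")
print(infer_provider(url))  # POCLOUD
```

With such a lookup, earthaccess.open(links) could select the right s3_credentials endpoint without the explicit provider= argument.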

I'm curious whether you get the same results; the Openscapes hub also has an assumed role in the environment, but it may be configured differently from what VEDA is doing. In any case, I think earthaccess.login(strategy="IAM") is a valid and useful feature to have.

@abarciauskas-bgse (Contributor, Author) commented:

Thank you @betolink! I'll put the IAM strategy implementation on our todo list, but happy if someone on your team gets to it first 😄

@abarciauskas-bgse (Contributor, Author) commented:

I'm not sure if #444 solves the error I reported above but I'll take another look when I get a chance 👍🏽

@abarciauskas-bgse (Contributor, Author) commented:

Just noting that an earthaccess upgrade (to v0.8.2) in the VEDA hub resolves the issue of .open(results) not using S3 links for direct access.
