
More frictionless S3 direct access #431

Open

abarciauskas-bgse opened this issue Jan 19, 2024 · 8 comments

@abarciauskas-bgse (Contributor) commented Jan 19, 2024

earthaccess allows for filtering datasets by cloud_hosted, discovering the S3 links using data_links(access="direct"), and even downloading. But I'm not able to use earthaccess to open the data directly from S3 on the VEDA JupyterHub. Could this be because the VEDA JupyterHub is associated with a role for Earthdata cloud access?

Right now this is how the code executes:

first_result = earthaccess.search_data(
    short_name='MUR-JPL-L4-GLOB-v4.1',
    cloud_hosted=True,
    count=1
)
# Granules found: 7899

direct_link = first_result[0].data_links(access="direct")
direct_link
# ['s3://podaac-ops-cumulus-protected/MUR-JPL-L4-GLOB-v4.1/20020601090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc']

earthaccess.open(direct_link)
# We cannot open S3 links when we are not in-region, try using HTTPS links

earthaccess responds that it can't open the dataset, even though this code was run in-region. Since I'm using the VEDA hub with direct access, I can fall back to xarray + s3fs to open the link, but having earthaccess.open work for direct access would be a good addition for in-region users who are not on a NASA-managed hub like VEDA.

Ideally, this search and open would be like:

first_result = earthaccess.search_data(
    short_name='MUR-JPL-L4-GLOB-v4.1',
    cloud_hosted=True,
    count=1,
    access="direct"
)
# Granules found: 7899

earthaccess.open(first_result) # opens the data directly from S3

This is very much the example from the README (minus the access="direct" parameter), but, at least in the VEDA JupyterHub, results and .open are using an HTTPFileSystem, not S3.

Perhaps the issue is that earthaccess doesn't recognize that the code is being run in-region?

Apologies if I missed something about how the library is supposed to work!

@abarciauskas-bgse (Contributor, Author) commented:

Also wondering if #424 relates to this; I'll need to dig into that a bit more to understand whether it will help.

@sharkinsspatial commented:

@abarciauskas-bgse Currently, with access="direct", earthaccess internally attempts to initialize an s3fs instance with its internal EDL auth chain, and it is unaware of the execution context's assumed role.

def get_s3fs_session(
    self,
    daac: Optional[str] = None,
    concept_id: Optional[str] = None,
    provider: Optional[str] = None,
    endpoint: Optional[str] = None,
) -> s3fs.S3FileSystem:
    """
    Returns a s3fs instance for a given cloud provider / DAAC

    Parameters:
        daac: any of the DAACs, e.g. NSIDC, PODAAC
        provider: a data provider if we know them, e.g. PODAAC -> POCLOUD
        endpoint: pass the URL for the credentials directly
    Returns:
        a s3fs file instance
    """
    if self.auth is None:
        raise ValueError(
            "A valid Earthdata login instance is required to retrieve S3 credentials"
        )
    if not any([concept_id, daac, provider, endpoint]):
        raise ValueError(
            "At least one of the concept_id, daac, provider or endpoint "
            "parameters must be specified. "
        )
    if concept_id is not None:
        provider = self._derive_concept_provider(concept_id)
    # Get existing S3 credentials if we already have them
    location = (
        daac,
        provider,
        endpoint,
    )  # Identifier for where to get S3 credentials from
    need_new_creds = False
    try:
        dt_init, creds = self._s3_credentials[location]
    except KeyError:
        need_new_creds = True
    else:
        # If cached credentials are expired, invalidate the cache
        delta = datetime.datetime.now() - dt_init
        if round(delta.seconds / 60, 2) > 55:
            need_new_creds = True
            self._s3_credentials.pop(location)
    if need_new_creds:
        # Don't have existing valid S3 credentials, so get new ones
        now = datetime.datetime.now()
        if endpoint is not None:
            creds = self.auth.get_s3_credentials(endpoint=endpoint)
        elif daac is not None:
            creds = self.auth.get_s3_credentials(daac=daac)
        elif provider is not None:
            creds = self.auth.get_s3_credentials(provider=provider)
        # Include new credentials in the cache
        self._s3_credentials[location] = now, creds
    return s3fs.S3FileSystem(
        key=creds["accessKeyId"],
        secret=creds["secretAccessKey"],
        token=creds["sessionToken"],
    )
Our case is fairly exceptional (the general public will never have access to an IAM role with direct DAAC bucket access 😸), so as we discussed, we instead need an "escape hatch" in earthaccess that allows initializing this s3fs instance with a profile or IAM metadata option. This is unrelated to #424, as the same-region requirement is only enforced for temporary access tokens generated by a DAAC's Cumulus s3_credentials endpoint. Let's circle up next week and we can kick off a PR for the IAM "escape hatch", since this will be pretty clutch functionality for improving the VEDA JupyterHub user experience 👍
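A minimal sketch of what such an escape hatch might look like, assuming it simply bypasses the EDL credential flow and defers to botocore's default credential chain. The helper names here are hypothetical, not part of earthaccess:

```python
from typing import Any, Dict, Optional


def iam_s3fs_kwargs(profile: Optional[str] = None) -> Dict[str, Any]:
    """Build s3fs.S3FileSystem kwargs that use the ambient IAM identity.

    With anon=False and no explicit key/secret/token, s3fs defers to
    botocore's default credential chain: environment variables, a shared
    config profile, or instance metadata (i.e. the assumed role).
    """
    kwargs: Dict[str, Any] = {"anon": False}
    if profile is not None:
        # s3fs accepts a `profile` argument naming a shared-config profile
        kwargs["profile"] = profile
    return kwargs


def get_iam_s3fs_session(profile: Optional[str] = None):
    import s3fs  # imported lazily so the sketch has no hard dependency

    return s3fs.S3FileSystem(**iam_s3fs_kwargs(profile))
```

On a hub whose pods already carry a role with DAAC bucket access, such a session could read the buckets without ever calling an s3_credentials endpoint.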

@abarciauskas-bgse (Contributor, Author) commented:

Thanks @sharkinsspatial, that all makes sense to me.

@abarciauskas-bgse (Contributor, Author) commented:

@luzpaz - @sharkinsspatial and I discussed a proposal for how to implement S3 access using the IAM role instead of S3 credentials, thus bypassing all of the Earthdata Login methods.

The API we imagine is:

import earthaccess

earthaccess.login(strategy="iam")

An option strategy == "iam" would be added to https://github.com/nsidc/earthaccess/blob/main/earthaccess/auth.py#L65.

I think we need to add an attribute to the Auth class, but I'm open to suggestions on how to maintain the state of use_iam. My idea is to modify the Auth.__init__ function (https://github.com/nsidc/earthaccess/blob/main/earthaccess/auth.py#L56) to include a use_iam attribute initialized as self.use_iam = False; passing strategy="iam" to earthaccess.login would then set auth.use_iam = True on the Auth instance.

This use_iam attribute could be used to bypass the authentication check in https://github.com/nsidc/earthaccess/blob/main/earthaccess/store.py#L101 (or not, in which case we would just get a warning, which may confuse people). More critically, get_s3fs_session would include, right before the rest of the function code (https://github.com/nsidc/earthaccess/blob/main/earthaccess/store.py#L215):

if self.auth.use_iam:
    return s3fs.S3FileSystem(anon=False)
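To make the proposal concrete, here is a simplified sketch (not the actual earthaccess classes) of how the use_iam flag could be threaded from login through to get_s3fs_session:

```python
# Simplified sketch of the proposed wiring; class and method bodies are
# illustrative, not the real earthaccess implementation.
class Auth:
    def __init__(self) -> None:
        self.authenticated = False
        self.use_iam = False  # proposed new attribute, defaults to False

    def login(self, strategy: str = "netrc") -> "Auth":
        if strategy == "iam":
            # Skip Earthdata Login entirely; trust the execution
            # context's assumed IAM role instead.
            self.use_iam = True
            self.authenticated = True
        else:
            ...  # existing netrc / interactive / environment strategies
        return self


class Store:
    def __init__(self, auth: Auth) -> None:
        self.auth = auth

    def get_s3fs_session(self, **kwargs):
        if self.auth.use_iam:
            import s3fs  # lazy import keeps this sketch dependency-free

            # anon=False with no explicit keys lets botocore resolve
            # credentials from the ambient IAM role.
            return s3fs.S3FileSystem(anon=False)
        ...  # fall through to the existing EDL credential flow
```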

Let us know what you think.

@betolink (Member) commented:

Hi @abarciauskas-bgse, I like the earthaccess.login(strategy="IAM") idea, but I'm not sure the s3fs session issue is related to this. I just ran this code in the Openscapes hub and it worked as expected, returning a list with an S3 file-like object:

first_result = earthaccess.search_data(
    short_name='MUR-JPL-L4-GLOB-v4.1',
    cloud_hosted=True,
    count=1
)
# Granules found: 7899
fileset = earthaccess.open(first_result)
fileset
# [<File-like object S3FileSystem, podaac-ops-cumulus-protected/MUR-JPL-L4-GLOB-v4.1/20020601090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc>]

If we use the links directly, we need to tell earthaccess which provider it should use so it can grab the credentials from a dictionary. This should be more dynamic; in the near future earthaccess should infer which credential endpoint it needs to use.

results = earthaccess.search_data(
    short_name='MUR-JPL-L4-GLOB-v4.1',
    cloud_hosted=True,
    count=3
)
# if this collection had more than one file per granule we'll have to flatten the list instead of grabbing the first link
links = [g.data_links(access="direct")[0] for g in results]

fileset = earthaccess.open(links, provider="POCLOUD")
fileset
# [<File-like object S3FileSystem, podaac-ops-cumulus-protected/MUR-JPL-L4-GLOB-v4.1/20020601090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc>,
#  <File-like object S3FileSystem, podaac-ops-cumulus-protected/MUR-JPL-L4-GLOB-v4.1/20020602090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc>,
#  <File-like object S3FileSystem, podaac-ops-cumulus-protected/MUR-JPL-L4-GLOB-v4.1/20020603090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc>]
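The dictionary lookup mentioned above could eventually be replaced by inferring the provider from the link itself. A hypothetical sketch (the bucket-to-provider mapping below is illustrative, not earthaccess's real table):

```python
from typing import Optional
from urllib.parse import urlparse

# Illustrative mapping from known DAAC bucket names to CMR providers;
# a real implementation would derive this from CMR metadata.
BUCKET_TO_PROVIDER = {
    "podaac-ops-cumulus-protected": "POCLOUD",
}


def infer_provider(s3_url: str) -> Optional[str]:
    """Guess the credential provider from an s3:// link's bucket name."""
    bucket = urlparse(s3_url).netloc
    return BUCKET_TO_PROVIDER.get(bucket)


url = ("s3://podaac-ops-cumulus-protected/MUR-JPL-L4-GLOB-v4.1/"
       "20020601090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc")
print(infer_provider(url))  # POCLOUD
```

With such a lookup, earthaccess.open(links) could select the right s3_credentials endpoint without the explicit provider= argument.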

I'm curious whether you get the same results; the Openscapes hub also has an assumed role in the environment, but it may be configured differently from what VEDA is doing. In any case, I think earthaccess.login(strategy="IAM") is a valid and useful feature to have.

@abarciauskas-bgse (Contributor, Author) commented:

Thank you @betolink! I'll put the IAM strategy implementation on our todo list, but happy if someone on your team gets to it first 😄

@abarciauskas-bgse (Contributor, Author) commented:

I'm not sure if #444 solves the error I reported above but I'll take another look when I get a chance 👍🏽

@abarciauskas-bgse (Contributor, Author) commented:

Just noting that an earthaccess upgrade (to v0.8.2) in the VEDA hub resolves the issue of .open(results) not using S3 links for direct access.
