Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

download while on AWS creates one file #562

Closed
itcarroll opened this issue May 8, 2024 · 11 comments · Fixed by #570
Closed

download while on AWS creates one file #562

itcarroll opened this issue May 8, 2024 · 11 comments · Fixed by #570
Labels
bug Something isn't working

Comments

@itcarroll
Copy link
Collaborator

The earthaccess.download function appears to be (over)writing all data received to a single file, rather than to separate files in the provided directory, when working on AWS. This example uses earthaccess 0.9.0 on cryointhecloud.

import earthaccess

earthaccess.login()
results = earthaccess.search_data(short_name="PACE_OCI_L2_AOP_NRT", count=2)
earthaccess.download(results, "test")
!ls -l test
Granules found: 4148
 Getting 2 granules, approx download size: 0.07 GB
Accessing cloud dataset using dataset endpoint credentials: https://obdaac-tea.earthdatacloud.nasa.gov/s3credentials
Downloaded: test/PACE_OCI.20240409T203648.L2.OC_AOP.V1_0_0.NRT.nc
Downloaded: test/PACE_OCI.20240409T204148.L2.OC_AOP.V1_0_0.NRT.nc
-rw-r--r-- 1 jovyan jovyan 9516948 May  8 21:04 test

Instead of a "test" folder, I get one "test" file.

Expected behavior would be either to raise "STAHP! Wut R u DoINg?" or just download the files.

@chuckwondo
Copy link
Collaborator

chuckwondo commented May 8, 2024

@itcarroll, I suspect this is not specific to cryointhecloud, nor AWS in general.

You likely need to first create the target directory (I don't know if it will be created for you, which may be the problem). Delete the erroneous file, create the test directory, then try again.

If that still doesn't work, then in addition, specify "test/" instead of simply "test" in your call to download.

Please report back your findings. Whatever your findings, this appears to be behavior that we must correct (and likely clearly document as well).

@chuckwondo chuckwondo added the bug Something isn't working label May 8, 2024
@itcarroll
Copy link
Collaborator Author

Oh, I should have mentioned that the function works as documented, even creating the directory if it doesn't exist, without direct access.

@chuckwondo
Copy link
Collaborator

Oh, I should have mentioned that the function works as documented, even creating the directory if it doesn't exist, without direct access.

Oh, very interesting. Would you mind trying my suggestions above, just to see if/how things are affected in that environment, just so we have more information about how things work/don't work?

@mfisher87
Copy link
Member

Oh, I should have mentioned that the function works as documented, even creating the directory if it doesn't exist, without direct access.

👀 Good catch!!!

@itcarroll
Copy link
Collaborator Author

Oh, I should have mentioned that the function works as documented, even creating the directory if it doesn't exist, without direct access.

Oh, very interesting. Would you mind trying my suggestions above, just to see if/how things are affected in that environment, just so we have more information about how things work/don't work?

Creating the folder first causes expected behavior (i.e. multiple files in the folder). Using "test/" as the argument is equivalent to "test".

It's possible, based on fsspec's get() documentation, that a fix could be changing store.py#L574 to

s3_fs.get(file, str(local_path) + "/")

although that's a bit flimsy looking.

@chuckwondo
Copy link
Collaborator

Oh, I should have mentioned that the function works as documented, even creating the directory if it doesn't exist, without direct access.

Oh, very interesting. Would you mind trying my suggestions above, just to see if/how things are affected in that environment, just so we have more information about how things work/don't work?

Creating the folder first causes expected behavior (i.e. multiple files in the folder). Using "test/" as the argument is equivalent to "test".

It's possible, based on fsspec's get() documentation, that a fix could be changing store.py#L574 to

s3_fs.get(file, str(local_path) + "/")

although that's a bit flimsy looking.

Cool. Would you mind trying one more test? Delete the files in the test dir and also remove the dir again, so it doesn't exist. Then try downloading with "test/" as the argument. I want to see if including the trailing slash will then cause fsspec to create the directory for you.

@itcarroll
Copy link
Collaborator Author

If a "test" folder exists, download works with "test" and "test/". If there is nothing called "test", one file called "test" is created for either argument.

Conversion to pathlib.Path and back to str seems like it eliminates the slash: str(pathlib.Path("test/")) is just test.

@chuckwondo
Copy link
Collaborator

chuckwondo commented May 13, 2024

After a bit of digging into fsspec, I see the problem, which boils down to this: either the local (destination) directory must exist (and be a directory) OR the specified path must include a trailing slash. Otherwise, the destination is interpreted as a file rather than a directory, thus resulting in the reported behavior in this issue.

@chuckwondo
Copy link
Collaborator

@itcarroll, I don't have a CryoCloud account (I just submitted a request), but I did attempt to reproduce the error in the MAAP environment.

For reference, I noticed this in your reported output:

 Getting 2 granules, approx download size: 0.07 GB
Accessing cloud dataset using dataset endpoint credentials: https://obdaac-tea.earthdatacloud.nasa.gov/s3credentials
Downloaded: test/PACE_OCI.20240409T203648.L2.OC_AOP.V1_0_0.NRT.nc
Downloaded: test/PACE_OCI.20240409T204148.L2.OC_AOP.V1_0_0.NRT.nc
-rw-r--r-- 1 jovyan jovyan 9516948 May  8 21:04 test

Here's the output that I got:

 Getting 2 granules, approx download size: 0.07 GB
QUEUEING TASKS | : 100%|█████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 4344.18it/s]
PROCESSING TASKS | : 100%|█████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:03<00:00,  1.97s/it]
COLLECTING RESULTS | : 100%|████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 59074.70it/s]
['test/PACE_OCI.20240409T203648.L2.OC_AOP.V1_0_0.NRT.nc', 'test/PACE_OCI.20240409T204148.L2.OC_AOP.V1_0_0.NRT.nc']

That resulted in the expected test dir being created, and the 2 files downloaded into that dir.

However, I noticed that my output was different than yours, and it turns out that when I ran the code, earthaccess did not correctly determine that I was "in region," thus did not use the S3 download URLs, and instead used the HTTPS URLs.

Therefore, I removed the test dir, and did this:

earthaccess.__store__.in_region = True

Then I ran the download again, and this time I reproduced the behavior you reported:

 Getting 2 granules, approx download size: 0.07 GB
Accessing cloud dataset using dataset endpoint credentials: https://obdaac-tea.earthdatacloud.nasa.gov/s3credentials
Downloaded: test/PACE_OCI.20240409T203648.L2.OC_AOP.V1_0_0.NRT.nc
Downloaded: test/PACE_OCI.20240409T204148.L2.OC_AOP.V1_0_0.NRT.nc
[PosixPath('test/PACE_OCI.20240409T203648.L2.OC_AOP.V1_0_0.NRT.nc'), PosixPath('test/PACE_OCI.20240409T204148.L2.OC_AOP.V1_0_0.NRT.nc')]

This resulted in a single file named test containing the contents of the 2nd file.

The difference is that when earthaccess thinks it's not running "in region," it takes a different path in the code, and calls the Store._download_onprem_granules method, and that method explicitly creates the local directory (test in this case) before downloading the files, whereas the "in region" code path does not create the directory beforehand.

chuckwondo added a commit to chuckwondo/earthaccess that referenced this issue May 13, 2024
chuckwondo added a commit to chuckwondo/earthaccess that referenced this issue May 13, 2024
@itcarroll
Copy link
Collaborator Author

That all checks out with my understanding, and I like the explicit fix. Thanks!

I don't know what the MAAP is, but perhaps you are also hitting #444.

@chuckwondo
Copy link
Collaborator

That all checks out with my understanding, and I like the explicit fix. Thanks!

I don't know what the MAAP is, but perhaps you are also hitting #444.

Sorry, FYI, here's MAAP: https://maap-project.org/

I suspect I'm indeed hitting #444.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
Status: Done
Development

Successfully merging a pull request may close this issue.

3 participants