Problem Accessing NASA Ocean Color Data with Earth Access tools #563

Closed
zfasnacht opened this issue May 10, 2024 · 14 comments · Fixed by #620
Labels
type: bug Something isn't working type: question A question needs to be answered to proceed

Comments

@zfasnacht

zfasnacht commented May 10, 2024

I'm having problems reading ocean color files with earthaccess. The code below correctly finds 55 granules and opens 10 of them, but when h5py tries to open one of the resulting file objects it raises:

OSError: Unable to open file (file signature not found)

Example file

<File-like object HTTPFileSystem, https://oceandata.sci.gsfc.nasa.gov/cmr/getfile/JPSS1_VIIRS.20240227T000000.L2.OC.nc>

I've downloaded this file with the https link provided and the file downloads/opens with no trouble. Any idea what causes this problem?

import earthaccess
import h5py
import numpy as np

# Authenticate with Earthdata Login
earthaccess.login(persist=True)

# Search for VIIRS (JPSS-1) Level-2 ocean color granules
results = earthaccess.search_data(
    short_name='VIIRSJ1_L2_OC',
    version='R2022.0',
    cloud_hosted=True,
    temporal=("2024-02-27 00:00:00", "2024-02-27 23:59:00"),
    count=10,
    bounding_box=(-180, 0, 0, 90),
)

# Open the granules as remote file-like objects, then hand each one to h5py
files = earthaccess.open(results)
for filename in files:
    print(filename)
    f = h5py.File(filename, 'r')
@mfisher87
Collaborator

Can you attach the full traceback? I think the error is occurring when you pass an fsspec file handle to h5py.

@zfasnacht
Author

zfasnacht commented May 10, 2024

Granules found: 55
Opening 10 granules, approx size: 0.64 GB
QUEUEING TASKS | : 100%|██████████| 10/10 [00:00<00:00, 510.64it/s]
PROCESSING TASKS | : 100%|██████████| 10/10 [00:07<00:00,  1.34it/s]
COLLECTING RESULTS | : 100%|██████████| 10/10 [00:00<00:00, 100582.83it/s]
<File-like object HTTPFileSystem, https://oceandata.sci.gsfc.nasa.gov/cmr/getfile/JPSS1_VIIRS.20240227T000000.L2.OC.nc>
Traceback (most recent call last):
  File "/run/cephfs/ACPS_Scratch/zfasnach/outgoing/earthaccess_test.py", line 21, in <module>
    f = h5py.File(filename,'r')
  File "/home/zfasnach/miniconda3/lib/python3.9/site-packages/h5py/_hl/files.py", line 442, in __init__
    fid = make_fid(name, mode, userblock_size,
  File "/home/zfasnach/miniconda3/lib/python3.9/site-packages/h5py/_hl/files.py", line 195, in make_fid
    fid = h5f.open(name, flags, fapl=fapl)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "h5py/h5f.pyx", line 96, in h5py.h5f.open
OSError: Unable to open file (file signature not found)
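
For anyone debugging the same message: HDF5 (and therefore netCDF-4) files begin with a fixed 8-byte signature, and "file signature not found" means the library did not see those bytes where it looked. A minimal sketch of a check on any seekable file-like object follows. The helper name `looks_like_hdf5` is hypothetical, and on a remote handle the read itself may fail if the server refuses partial reads.

```python
import io

# The 8-byte signature that starts an HDF5 file with no user block.
# Files with a user block carry it at offset 512, 1024, ... instead;
# this sketch only checks offset 0.
HDF5_SIGNATURE = b"\x89HDF\r\n\x1a\n"


def looks_like_hdf5(fileobj):
    """Return True if the stream starts with the HDF5 signature."""
    position = fileobj.tell()          # remember where the caller was
    fileobj.seek(0)
    header = fileobj.read(len(HDF5_SIGNATURE))
    fileobj.seek(position)             # restore the original position
    return header == HDF5_SIGNATURE


# Local demonstration with in-memory data (no network involved).
print(looks_like_hdf5(io.BytesIO(HDF5_SIGNATURE + b"payload")))
print(looks_like_hdf5(io.BytesIO(b"<html>not an HDF5 file</html>")))
```

If the check returns False on a handle from `earthaccess.open`, printing the first few dozen bytes usually shows what the server actually sent, for example an HTML error page.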

@zfasnacht
Author

The interesting part is that if I do the same thing for a different file, such as a TEMPO L1b file, the same code runs smoothly, so the problem appears to be specific to the ocean color files.

@mfisher87 mfisher87 added the type: question A question needs to be answered to proceed label May 10, 2024
@betolink
Member

It seems this server, https://oceandata.sci.gsfc.nasa.gov/, is not accepting HTTP range requests. I guess for now the only workaround is to download the files, @zfasnacht.
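
A rough sketch of that download workaround, assuming `results` comes from `earthaccess.search_data` as in the original report. `earthaccess.download` is the library's download call; the wrapper name `download_and_open` and the default directory are hypothetical.

```python
from pathlib import Path


def download_and_open(results, local_dir="ocean_color_data"):
    """Download granules to disk, then open each local copy with h5py.

    Hypothetical helper: it sidesteps range requests entirely by
    fetching whole files before reading them.
    """
    # Imported lazily so the sketch can be read without both packages installed.
    import earthaccess
    import h5py

    target = Path(local_dir)
    target.mkdir(parents=True, exist_ok=True)

    # earthaccess.download fetches the files and returns their local paths.
    local_paths = earthaccess.download(results, str(target))

    # Opening a local path avoids the remote file handle altogether.
    return [h5py.File(str(path), "r") for path in local_paths]
```

After `earthaccess.login()` and a search, `download_and_open(results)` would return open `h5py.File` objects backed by local copies. The cost is disk space and transfer time for every granule, which is what streaming was meant to avoid.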

@zfasnacht
Author

Any idea who to contact about getting this changed? Is it a general DAAC issue or one specific to the ocean DAAC?

@betolink
Member

I think it will affect any data being served from here; maybe @itcarroll knows who to contact to verify this.
This should work if we have a .netrc in place:

curl  --range 0-999 -vL https://oceandata.sci.gsfc.nasa.gov/getfile/JPSS1_VIIRS.20240227T051801.L2.OC.nc

Instead I'm getting

* [HTTP/2] [1] OPENED stream for https://oceandata.sci.gsfc.nasa.gov/getfile/JPSS1_VIIRS.20240227T051801.L2.OC.nc
* [HTTP/2] [1] [:method: GET]
* [HTTP/2] [1] [:scheme: https]
* [HTTP/2] [1] [:authority: oceandata.sci.gsfc.nasa.gov]
* [HTTP/2] [1] [:path: /getfile/JPSS1_VIIRS.20240227T051801.L2.OC.nc]
* [HTTP/2] [1] [range: bytes=0-20]
* [HTTP/2] [1] [user-agent: curl/8.7.1]
* [HTTP/2] [1] [accept: */*]
> GET /getfile/JPSS1_VIIRS.20240227T051801.L2.OC.nc HTTP/2
> Host: oceandata.sci.gsfc.nasa.gov
> Range: bytes=0-20
> User-Agent: curl/8.7.1
> Accept: */*
> 
* Request completely sent off
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
* old SSL session ID is stale, removing
< HTTP/2 416 
< server: nginx
< date: Fri, 10 May 2024 20:15:46 GMT
< content-type: text/html
< x-frame-options: SAMEORIGIN
< 
<html>
<head><title>416 Requested Range Not Satisfiable</title></head>
<body>
<center><h1>416 Requested Range Not Satisfiable</h1></center>
<hr><center>nginx</center>
</body>
</html>
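
The same probe can be sketched in Python with only the standard library. A server that honors byte ranges answers 206 Partial Content, one that ignores the header answers 200 with the full body, and the response above is the 416 case. The function names are hypothetical, and this sketch does not handle Earthdata Login, so against the DAAC server it may see a login redirect or an authorization error rather than the file.

```python
import urllib.error
import urllib.request


def classify_range_response(status):
    """Map the status of a ranged GET to a short verdict."""
    if status == 206:
        return "supported"   # partial content returned
    if status == 200:
        return "ignored"     # full body returned, Range header had no effect
    if status == 416:
        return "rejected"    # Requested Range Not Satisfiable
    return "unknown"


def probe_range_support(url, first=0, last=999, timeout=30):
    """Send one GET with a Range header and classify the reply."""
    request = urllib.request.Request(
        url, headers={"Range": f"bytes={first}-{last}"}
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return classify_range_response(response.status)
    except urllib.error.HTTPError as err:
        # urllib raises for 4xx/5xx; the status code is still informative.
        return classify_range_response(err.code)
```

Remote file handles that read in chunks depend on the "supported" case, which is consistent with h5py failing on this server while a full download works.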

@itcarroll
Collaborator

The VIIRSJ1_L2_OC collection is not in the Earthdata Cloud, and it appears correct that the DAAC's on-prem server is not accepting byte range requests at the getfile endpoint. It was an outdated configuration, and I thought they had changed it last year. I will ask and report back.

Question for @betolink, though: why is the cloud_hosted=True filter not having the expected effect?

@zfasnacht
Author

@itcarroll so what's the best way to access the MODIS/VIIRS L2 OC data? Still through FTP? Is there a way to access it with the earthdata tools without copying files locally?

@chuckwondo
Collaborator

> The VIIRSJ1_L2_OC collection is not in the Earthdata Cloud, and it appears correct that the DAAC's on-prem server is not accepting byte range requests at the getfile endpoint. It was an outdated configuration, and I thought they had changed it last year. I will ask and report back.
>
> Question for @betolink, though: why is the cloud_hosted=True filter not having the expected effect?

@itcarroll, cloud_hosted is working as expected. The code in the email notification is the point of confusion. I believe the OP edited the code in the issue after the notification. If you run the code in the email, you get 0 results because cloud_hosted is working as expected.

@zfasnacht
Author

zfasnacht commented May 10, 2024

@chuckwondo not quite, I edited the code because I accidentally changed 'VIIRSJ1_L2_OC' to 'MODISA_L2_OC'. cloud_hosted is not working as expected

>>> import earthaccess
>>> import h5py
>>> import numpy as np
>>> from netCDF4 import Dataset
>>> 
>>> earthaccess.login(persist=True)
<earthaccess.auth.Auth object at 0x1477be9b2670>
>>> results = earthaccess.search_data(short_name =  'VIIRSJ1_L2_OC_NRT',cloud_hosted=True,temporal=("2024-02-27 00:00:00", "2024-02-27 23:59:00"),count=10)
Granules found: 142
>>> files = earthaccess.open(results)
Opening 10 granules, approx size: 0.76 GB
QUEUEING TASKS | : 100%|██████████| 10/10 [00:00<00:00, 494.01it/s]
PROCESSING TASKS | :   0%|          | 0/10 [00:00<?, ?it/s]
PROCESSING TASKS | : 100%|██████████| 10/10 [00:07<00:00,  1.33it/s]
COLLECTING RESULTS | : 100%|██████████| 10/10 [00:00<00:00, 85250.08it/s]
>>> 

@betolink
Member

Weird, I ran the code with VIIRSJ1_L2_OC and I get results back regardless of cloud_hosted. I wonder if this may be a metadata-related issue (or a bug in earthaccess); I'll take a look. I know that for NSIDC datasets it works as expected, e.g.

results = earthaccess.search_data(
    short_name='ATL06',
    # cloud_hosted=False,
    temporal=("2023-02-01 00:00:00", "2024-02-27 23:59:00"),
    bounding_box=(10, 0, 20, 90),
    count=1,
)

# returns ~4k granules but half are from the cloud hosted collections

results = earthaccess.search_data(
    short_name='ATL06',
    cloud_hosted=True,
    temporal=("2023-02-01 00:00:00", "2024-02-27 23:59:00"),
    bounding_box=(10, 0, 20, 90),
    count=10,
)

# returns only the cloud hosted granules.
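
To separate a metadata problem from an earthaccess bug, one option is to ask CMR directly whether a collection matches the cloud-hosted filter, bypassing earthaccess. The sketch below uses only the standard library and assumes the public CMR collection search endpoint with its `short_name` and `cloud_hosted` parameters. The function names are hypothetical.

```python
import json
import urllib.parse
import urllib.request

CMR_COLLECTIONS_URL = "https://cmr.earthdata.nasa.gov/search/collections.json"


def build_collection_query(short_name, cloud_hosted=True):
    """Build a CMR collection search URL filtered on cloud hosting."""
    params = {
        "short_name": short_name,
        "cloud_hosted": str(cloud_hosted).lower(),  # CMR expects true/false
    }
    return f"{CMR_COLLECTIONS_URL}?{urllib.parse.urlencode(params)}"


def count_entries(payload):
    """Count collections in a CMR JSON response (feed -> entry list)."""
    return len(payload.get("feed", {}).get("entry", []))


def count_matching_collections(short_name, cloud_hosted=True, timeout=30):
    """Query CMR and return how many collections match the filter."""
    url = build_collection_query(short_name, cloud_hosted)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return count_entries(json.load(response))
```

If `count_matching_collections("VIIRSJ1_L2_OC")` came back as zero while the earthaccess search with `cloud_hosted=True` still returned granules, that would suggest the filter is being dropped before the granule query rather than the metadata being wrong.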

@chuckwondo
Collaborator

> Weird, I ran the code with VIIRSJ1_L2_OC and I get results back regardless of cloud_hosted. I wonder if this may be a metadata-related issue (or a bug in earthaccess); I'll take a look. I know that for NSIDC datasets it works as expected, e.g.

I believe it's a bug.

@mfisher87 mfisher87 added the type: bug Something isn't working label May 21, 2024
@mfisher87 mfisher87 changed the title Problem Accessing NASA Ocean Color Data with Earth Access tools cloud_hosted flag not working as expected May 21, 2024
@mfisher87
Collaborator

mfisher87 commented May 21, 2024

If anyone can improve the issue title further that would be appreciated :) EDIT: Realized we have a separate issue.

Should we move this to a discussion? I think #565 covers the work we need to do?

@mfisher87 mfisher87 changed the title cloud_hosted flag not working as expected Problem Accessing NASA Ocean Color Data with Earth Access tools May 21, 2024
@itcarroll
Collaborator

Leaving aside the cloud_hosted argument issue, I'd suggest we close this as "won't fix" with a redirection to fsspec. I've opened a discussion on the source of the trouble. In short, there's a miscommunication of sorts between the DAAC server and fsspec about handling requests for parts of a file. I don't believe we should change anything in earthaccess. Sound okay to you, @zfasnacht?

@github-project-automation github-project-automation bot moved this from 🆕 New to ✅ Done in earthaccess project Sep 23, 2024
@itcarroll itcarroll closed this as not planned Sep 24, 2024