SAMOS Recipe failing #120

Open
SBS-EREHM opened this issue Mar 4, 2022 · 3 comments

SBS-EREHM commented Mar 4, 2022

I used the example recipe and changed make_url. It generates the proper URLs, which do download the NetCDF data when opened in a browser, e.g.:
Index({DimIndex(name='time', index=0, sequence_len=2, operation=<CombineOp.CONCAT: 2>)})
http://tds.coaps.fsu.edu/thredds/fileServer/samos/data/research/ZCYL5/2021/ZCYL5_20210101v30001.nc
Index({DimIndex(name='time', index=1, sequence_len=2, operation=<CombineOp.CONCAT: 2>)})
http://tds.coaps.fsu.edu/thredds/fileServer/samos/data/research/ZCYL5/2021/ZCYL5_20210102v30001.nc

(Template URL: see here and copy/paste #3 (HTTPServer).)

When the recipe runs, it seems to access the first file, but then fails with errors like:
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/tmp3zzmk1oe/ri5TAkGy/.zmetadata'

# ---
# jupyter:
#   jupytext:
#     text_representation:
#       extension: .py
#       format_name: light
#       format_version: '1.5'
#       jupytext_version: 1.13.7
#   kernelspec:
#     display_name: Python 3 (ipykernel)
#     language: python
#     name: python3
# ---

# +
# start coding here
# -

# ## Format Function

# +
# def make_url(time):
#     yyyy = time.strftime('%Y')
#     yyyymmdd = time.strftime('%Y%m%d')
#     return (
#         'https://coastwatch.noaa.gov/pub/socd/lsa/rads/sla/daily/dt'
#         f'/{yyyy}/rads_global_dt_sla_{yyyymmdd}_001.nc'
#     )

# http://tds.coaps.fsu.edu/thredds/fileServer/samos/data/research/ZCYL5/2021/ZCYL5_20210101v30001.nc

def make_url(time):
    yyyy = time.strftime('%Y')
    yyyymmdd = time.strftime('%Y%m%d')
    return (
        f'http://tds.coaps.fsu.edu/thredds/fileServer/samos/data/research/ZCYL5/{yyyy}/ZCYL5_{yyyymmdd}v30001.nc'
    )

# +
#dates[0]

# +
#make_url(dates[0])
# -





# ## Combine Dimension

# +
import pandas as pd

dates = pd.date_range('2021-01-01', '2021-01-02', freq='D')
# show all dates in the range
dates[:]

make_url(dates[0])

# +
from pangeo_forge_recipes.patterns import ConcatDim

# only one day in each file --> nitems_per_file=1
time_concat_dim = ConcatDim("time", dates, nitems_per_file=1)
time_concat_dim
# -

# ## FilePattern

# +
from pangeo_forge_recipes.patterns import FilePattern

pattern = FilePattern(make_url, time_concat_dim)
pattern

# + [markdown] tags=[]
# ### Iterate through FilePattern
# -

for index, url in pattern.items():
    print(index)
    print(url)
    # stop after the last date in the range (2021-01-02)
    if '20210102' in url:
        break



# ## Create Recipe Object

# +
from pangeo_forge_recipes.recipes import XarrayZarrRecipe

recipe = XarrayZarrRecipe(pattern, inputs_per_chunk=20)
recipe
# -

# ## Setup Logging

# +
from pangeo_forge_recipes.recipes import setup_logging

setup_logging()
# -

# ## Prune (built in smaller copy)

recipe_pruned = recipe.copy_pruned()  # Removes all but first two items

# +
## Run!
# -

run_function = recipe_pruned.to_function()
run_function()

# ## Check the output 

import xarray as xr
recipe_pruned.storage_config

# (NOT YET MODIFIED FOR SAMOS variables) 
sla_zarr = xr.open_zarr(recipe.target_mapper, consolidated=True)
sla_zarr

sla_zarr['sla'].isel(time=1).plot(robust=True)

SBS-EREHM commented Mar 4, 2022

More details:
Each file holds one day of data at 1-minute resolution. However, the time dimension of the SAMOS NetCDF files is in "minutes since 1-1-1980 00:00 UTC" (see the ncdump output below). Note that the number of values in this dimension can vary from file to file (1439, 1440, etc.), so the ConcatDim() and XarrayZarrRecipe() calls had to change.

While I can convert the dates[] array to this 1980 epoch, I don't know how to tell ConcatDim to use this epoch as the index dimension of the NetCDF file. The following doesn't work:

import numpy as np 
epoch1980 = pd.to_datetime(dates - pd.Timestamp("1980-01-01") + pd.Timestamp("1970-01-01")).values.astype(np.int64)/60/10**9
print(epoch1980[:])
time_concat_dim = ConcatDim("time", epoch1980)
pattern = FilePattern(make_url, time_concat_dim)
recipe = XarrayZarrRecipe(pattern, inputs_per_chunk=20, target_chunks={'time':200})

array([21565440., 21566880.]) ## This epoch calculation is correct.
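
As a cross-check of the epoch arithmetic above, the same conversion can be sketched with only the standard library. The names `SAMOS_EPOCH`, `minutes_since_1980`, and `from_minutes_since_1980` are hypothetical helpers introduced for illustration; the epoch itself is taken from the `time:units` attribute in the ncdump output.

```python
from datetime import datetime, timedelta

# Epoch taken from time:units = "minutes since 1-1-1980 00:00 UTC"
SAMOS_EPOCH = datetime(1980, 1, 1)


def minutes_since_1980(moment):
    """Whole minutes elapsed between the SAMOS epoch and `moment`."""
    return int((moment - SAMOS_EPOCH) / timedelta(minutes=1))


def from_minutes_since_1980(minutes):
    """Inverse conversion: minutes since the SAMOS epoch to a datetime."""
    return SAMOS_EPOCH + timedelta(minutes=minutes)


print(minutes_since_1980(datetime(2021, 1, 1)))  # 21565440
print(minutes_since_1980(datetime(2021, 1, 2)))  # 21566880
print(from_minutes_since_1980(21566879))         # 2021-01-01 23:59:00
```

The two printed integers agree with the array shown above, and the last value matches the upper end of `time:actual_range` for the 2021-01-01 file.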

(base) ERICs-MBP-2:3VbRfPbW ericrehm$ ncdump ~/Downloads/ZCYL5_20210101v30001.nc
netcdf ZCYL5_20210101v30001 {
dimensions:
    time = UNLIMITED ; // (1439 currently)
    f_string = 35 ;
    h_string = 236 ;
    h_num = 50 ;
variables:
    int time(time) ;
        time:long_name = "time" ;
        time:units = "minutes since 1-1-1980 00:00 UTC" ;
        time:original_units = "hhmmss UTC" ;
        time:data_interval = 60 ;
        time:observation_type = "measured" ;
        time:actual_range = 21565440, 21566879 ;
        time:qcindex = 1 ;
        time:metadata_retrieved_from = "ZCYL5_20210101v10001.nc" ;
    float lat(time) ;

[skip other variables]

data:

time = 21565440, 21565441, 21565442, 21565443, 21565444, 21565445, 21565446,
21565447, 21565448, 21565449, 21565450, 21565451, 21565452, 21565453,
21565454, 21565455, 21565456, 21565457, 21565458, 21565459, 21565460,
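
The remark above that the number of time values varies from file to file (1439, 1440, etc.) matters for concatenation. Each file's starting position along the combined time axis is the running sum of the lengths before it, which a single constant items-per-file value cannot express. The sketch below is a plain-Python illustration of that arithmetic, not a description of how pangeo-forge-recipes computes positions internally. `start_offsets` and `fixed_offsets` are hypothetical helper names, and the second length (1440) is an illustrative value taken from the range quoted above.

```python
from itertools import accumulate


def start_offsets(lengths):
    """Start index of each file along the concatenated dimension."""
    return [0, *accumulate(lengths)][:-1]


def fixed_offsets(n_files, items_per_file):
    """Start indices if every file is assumed to hold the same number of items."""
    return [i * items_per_file for i in range(n_files)]


lengths = [1439, 1440]  # first value from the ncdump header; second is illustrative

print(start_offsets(lengths))             # [0, 1439]
print(fixed_offsets(len(lengths), 1440))  # [0, 1440]
print(sum(lengths))                       # 2879
```

The two offset lists disagree as soon as one file is shorter than the assumed constant, which is why a fixed per-file length only fits inputs that are all the same size.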

Member

cisaacstern commented May 10, 2022

@reint-fischer and @SBS-EREHM, as promised in pangeo-forge/pangeo-forge-recipes#315 (comment), here's a summary of my progress debugging your recipe:

  • The initial problem, FileNotFoundError: [Errno 2] No such file or directory: '/tmp/tmp3zzmk1oe/ri5TAkGy/.zmetadata', pointed to the fact that the Zarr store was not being properly initialized. It turned out this was because your input files appear to be in NetCDF3 format (not NetCDF4) and therefore cannot be opened with our default xarray backend ("h5netcdf"). While we do support passing explicit xarray_open_kwargs (which can include overrides of the default backend), we thought this particular situation was likely common enough to warrant a specific feature. This was implemented in pangeo-forge-recipes#322 ("Implement FilePattern.file_type") and released with pangeo-forge-recipes==0.8.3. Now, if we run the code provided in pangeo-forge-recipes#315 (comment) with 0.8.3 (which is currently available via pip and on conda), we get a much more descriptive error:

    OSError: Unable to open file /tmp/tmpn8uqqe5a/xrGOjIl4/777be2b9214151be7e2c4f211c36a334-http_tds.coaps.fsu.edu_thredds_fileserver_samos_data_research_zcyl5_2021_zcyl5_20210101v30001.nc with `{engine: h5netcdf}`, which was set automatically based on the fact that `FilePattern.file_type` is using the default value of 'netcdf4'. It seems likely that this input file is in NetCDF3 format. If that is the case, please re-instantiate your `FilePattern` with `FilePattern(..., file_type="netcdf3")`.
  • If we follow the suggestion made in this error message, and specify FilePattern(..., file_type="netcdf3"), pangeo-forge-recipes then automatically selects the correct xarray backend for opening the source files, and we move past the Zarr initialization problem. However, during execution we then hit a new error:

    ValueError: Invalid dtype for data variable: <xarray.DataArray 'flag' (time: 2879)> dask.array<concatenate, shape=(2879,), dtype=|S35, chunksize=(1440,), chunktype=numpy.ndarray> ... dtype must be a subtype of number, datetime, bool, a fixed sized string, a fixed size unicode string or an object
  • It turns out this is an upstream issue in xarray, which has to do with the fact that one of your data variables has a dtype of "|S35". Because this is a blocker for your recipe, I've proposed a solution to this problem in pydata/xarray#6476 ("Fix zarr append dtype checks"). We're still waiting on review of that PR; hopefully it will be merged in time to be included in the next xarray release. In the meantime, I've confirmed that installing xarray from that PR branch allows us to move past this error:

    pip install -U "git+https://github.com/cisaacstern/xarray.git@zarr-append-fix"
  • With pangeo-forge-recipes==0.8.3 and xarray installed from that PR branch, we're very close to being able to run this recipe. To make this work, I did find a few changes to be necessary to the kwargs passed to both ConcatDim and XarrayZarrRecipe. In brief, ConcatDim.nitems_per_file should only be passed if your input files are all the same length in the concatenation dimension, which it appears yours are not. And the value you'd initially set on XarrayZarrRecipe.inputs_per_chunk seemed like it would result in the target Zarr chunks being too large, so I adjusted the chunk sizing with the alternative target_chunks kwarg. The complete diff of the recipe code as you'd initially provided it, and the way I was ultimately able to run it, is as follows:

      import pandas as pd
      import xarray as xr
      from pangeo_forge_recipes.patterns import ConcatDim, FilePattern
      from pangeo_forge_recipes.recipes import XarrayZarrRecipe, setup_logging
    
      def make_url(time):
          year=time.strftime('%Y')
          year_month_day = time.strftime('%Y%m%d')
          return(f'http://tds.coaps.fsu.edu/thredds/fileServer/samos/data/research/ZCYL5/{year}/ZCYL5_{year_month_day}v30001.nc')
    
      dates = pd.date_range('2021-01-01','2021-01-03', freq='D')
    
      time_concat_dim = ConcatDim(
          "time",
          dates,
    -     nitems_per_file=1,
      )
    
      pattern = FilePattern(
          make_url,
          time_concat_dim,
    +     file_type="netcdf3",
      )
    
      recipe = XarrayZarrRecipe(
          pattern,
    -     inputs_per_chunk=30,
    +     target_chunks={"time": 4},
      )
    
      setup_logging()
      recipe_pruned = recipe.copy_pruned()
      run_function = recipe_pruned.to_function()
      run_function()
    
  • And here's the copy-and-pastable code (without the diff) which works for me:

    import pandas as pd
    import xarray as xr
    from pangeo_forge_recipes.patterns import ConcatDim, FilePattern
    from pangeo_forge_recipes.recipes import XarrayZarrRecipe, setup_logging
    
    def make_url(time):
        year=time.strftime('%Y')
        year_month_day = time.strftime('%Y%m%d')
        return(f'http://tds.coaps.fsu.edu/thredds/fileServer/samos/data/research/ZCYL5/{year}/ZCYL5_{year_month_day}v30001.nc')
    
    dates = pd.date_range('2021-01-01','2021-01-03', freq='D')
    
    time_concat_dim = ConcatDim(
        "time",
        dates,
    )
    
    pattern = FilePattern(
        make_url,
        time_concat_dim,
        file_type="netcdf3",
    )
    
    recipe = XarrayZarrRecipe(
        pattern,
        target_chunks={"time": 4},
    )
    
    setup_logging()
    recipe_pruned = recipe.copy_pruned()
    run_function = recipe_pruned.to_function()
    run_function()
  • After running this code, I can then open the test dataset as follows:

    import xarray as xr
    ds = xr.open_zarr(recipe.target_mapper, consolidated=True)
    print(ds)
    <xarray.Dataset>
    Dimensions:      (time: 2879, h_num: 50)
    Coordinates:
      * time         (time) datetime64[ns] 2021-01-01 ... 2021-01-02T23:59:00
    Dimensions without coordinates: h_num
    Data variables: (12/38)
        CNDC         (time) float32 dask.array<chunksize=(4,), meta=np.ndarray>
        DIR          (time) float32 dask.array<chunksize=(4,), meta=np.ndarray>
        DIR2         (time) float32 dask.array<chunksize=(4,), meta=np.ndarray>
        DIR3         (time) float32 dask.array<chunksize=(4,), meta=np.ndarray>
        P            (time) float32 dask.array<chunksize=(4,), meta=np.ndarray>
        P2           (time) float32 dask.array<chunksize=(4,), meta=np.ndarray>
        ...           ...
        date         (time) int32 dask.array<chunksize=(4,), meta=np.ndarray>
        flag         (time) |S35 dask.array<chunksize=(4,), meta=np.ndarray>
        history      (h_num) |S236 dask.array<chunksize=(50,), meta=np.ndarray>
        lat          (time) float32 dask.array<chunksize=(4,), meta=np.ndarray>
        lon          (time) float32 dask.array<chunksize=(4,), meta=np.ndarray>
        time_of_day  (time) int32 dask.array<chunksize=(4,), meta=np.ndarray>
    Attributes: (12/22)
        Cruise_id:                   Cruise_id undefined for now
        Data_modification_date:      01/12/2021 13:21:16 EST
        EXPOCODE:                    EXPOCODE undefined for now
        ID:                          ZCYL5
        IMO:                         007928677
        Metadata_modification_date:  01/12/2021 13:21:16 EST
        ...                          ...
        platform:                    unknown at this time
        platform_version:            unknown at this time
        receipt_order:               01
        site:                        FALKOR
        start_date_time:             2021/01/01 -- 00:00 UTC
        title:                       FALKOR Meteorological Data
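
On the NetCDF3-versus-NetCDF4 question from the first bullet above: one way to check a downloaded file before choosing a `file_type` is to look at its leading bytes. NetCDF4 files are HDF5 containers and normally start with the HDF5 signature, while NetCDF classic files start with `CDF` followed by a version byte. The following is a minimal stdlib sketch under those assumptions; `classify_header` and `sniff_netcdf_format` are hypothetical helpers, not part of pangeo-forge-recipes, and the returned strings simply mirror the `file_type` values named in the error message.

```python
# Leading bytes ("magic numbers") of the two on-disk container formats.
HDF5_MAGIC = b"\x89HDF\r\n\x1a\n"          # NetCDF4 files are HDF5 files
CLASSIC_VERSIONS = (b"\x01", b"\x02")      # classic and 64-bit-offset variants


def classify_header(head):
    """Guess 'netcdf3', 'netcdf4', or 'unknown' from a file's first bytes."""
    if head.startswith(HDF5_MAGIC):
        return "netcdf4"
    if head[:3] == b"CDF" and head[3:4] in CLASSIC_VERSIONS:
        return "netcdf3"
    # Anything else (including HDF5 files with a user block) is left undecided.
    return "unknown"


def sniff_netcdf_format(path):
    """Read the first 8 bytes of a local file and classify them."""
    with open(path, "rb") as f:
        return classify_header(f.read(8))


print(classify_header(b"CDF\x01\x00\x00\x00\x00"))  # netcdf3
print(classify_header(HDF5_MAGIC))                   # netcdf4
```

Running `sniff_netcdf_format` on a locally downloaded copy of one of the SAMOS files would be a quick way to confirm which `file_type` to pass to `FilePattern`.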
    

@cisaacstern
Member

Update: pydata/xarray#6476 was merged today, so following the next Xarray release, there should not be any other blockers for this recipe.
