
cache specification

Sandy Antunes edited this page Jan 22, 2024 · 25 revisions

Use Cases

What is the default behavior? What is the intent? What is the difference between performance caching and 'save local'?

Use Case I: Refresh the cache if server data has possibly changed (based on the server's 'last-modified').

Use Case II: I need replicable data. Do not refresh the cache unless I tell you to.

Use Case III: Poor connection. If there is no internet connection, use the cache but warn.

Use Case IV or compromise: 'Aggression' setting?

Use Case V: ???

Idea: Cache includes a separate 'Aggression' setting ranging from 'never' to 'only on update' to 'always refresh'.
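
A minimal Python sketch of how such a setting might drive the refresh decision for Use Cases I to III. The names `Aggression` and `should_refresh` are hypothetical, and so are the choices to fill an empty cache even under 'never' and to refresh when the server gives no 'last-modified' value; none of this is decided in the notes above.

```python
from enum import Enum


class Aggression(Enum):
    NEVER = "never"                    # Use Case II: replicable data
    ONLY_ON_UPDATE = "only on update"  # Use Case I: refresh if server data is newer
    ALWAYS = "always refresh"


def should_refresh(policy, cached_mtime, server_last_modified, online=True):
    """Return (refresh, warning).

    cached_mtime and server_last_modified are epoch seconds, or None when
    unknown (no cached file, or no Last-Modified header from the server).
    """
    if not online:
        # Use Case III: no connection, so fall back to the cache and warn.
        if cached_mtime is None:
            return False, "no connection and nothing in cache"
        return False, "no connection: using cached data, which may be stale"
    if cached_mtime is None:
        # Assumption: an empty cache is always filled, even under NEVER.
        return True, None
    if policy is Aggression.NEVER:
        return False, None
    if policy is Aggression.ALWAYS:
        return True, None
    if server_last_modified is None:
        # Assumption: without Last-Modified we cannot tell, so refresh.
        return True, None
    return server_last_modified > cached_mtime, None
```

For example, `should_refresh(Aggression.ONLY_ON_UPDATE, 100.0, 200.0)` asks for a refresh because the server copy is newer than the cached one.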

TODO

  1. Java CLI interface code (Nobes)
  2. Review doc (All)
  3. Unit tests (All)
  4. Extract existing Java code (Jeremy)

Recent notes

Usage

java -jar hapi-cache.jar --url "https://server/hapi/data?dataset=...&parameters=...&start=...&stop=...&format={csv,bin}"

java -jar hapi-cache.jar \
     --server "https://server/hapi" --dataset=... --parameters=... --start=... --stop=... --format={csv,bin}

The response is csv or binary according to format. When used as a client, the default behavior is to use the HTTP headers plus the existing cache to decide how to return data (use the cache or make a new request). When used as a server, the default is to use file timestamps (or the HTTP headers of the back-end server if used in pass-through mode).
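
One way the client-side decision could work is standard HTTP revalidation: send the validators stored with the cached response and reuse the cache when the server answers 304 Not Modified. The sketch below assumes the cached headers are available as a plain dict and that the server sends `Last-Modified` or `ETag` at all, which not every HAPI server may do. The function names are hypothetical.

```python
def conditional_headers(cached_headers):
    """Build revalidation request headers from headers saved with a cached response."""
    # HTTP header names are case-insensitive, so normalize before lookup.
    lowered = {name.lower(): value for name, value in cached_headers.items()}
    request_headers = {}
    if "last-modified" in lowered:
        request_headers["If-Modified-Since"] = lowered["last-modified"]
    if "etag" in lowered:
        request_headers["If-None-Match"] = lowered["etag"]
    return request_headers


def cache_is_current(status_code):
    """A 304 Not Modified response means the cached copy can be returned as-is."""
    return status_code == 304
```

If `conditional_headers` returns an empty dict, the server gave no validators and the client has to fall back on another rule, such as an expiry time.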

Other options:

--cache-dir DIR
--write-cache {T,F} (write cache if not there)
--read-cache {T,F} (use cache if there)
--expire-after N{y,d,h,m,s} (use this word? Do not use the cache if it was written more than N{y,d,h,m,s} ago; this is a feature of the Python `requests_cache` library; default is never)
--cache-exact (only cache the exact request; will lead to fewer cache hits, but a fast cache response if the exact request is made again)
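
A sketch of how the `--expire-after` value might be interpreted. It assumes that `m` means minutes and that `y` means 365 days; the notes above do not settle either point. The function names are hypothetical.

```python
import re
from datetime import datetime, timedelta

# Assumption: 'm' is minutes and 'y' is a fixed 365-day year.
_UNIT_SECONDS = {"y": 365 * 86400, "d": 86400, "h": 3600, "m": 60, "s": 1}


def parse_expire_after(value):
    """Convert a string such as '2d' or '90s' to a timedelta."""
    match = re.fullmatch(r"(\d+)([ydhms])", value)
    if match is None:
        raise ValueError(f"expected N followed by one of y, d, h, m, s: {value!r}")
    count, unit = match.groups()
    return timedelta(seconds=int(count) * _UNIT_SECONDS[unit])


def is_expired(written_at, now, expire_after):
    """True if the cache entry is older than expire_after (None means never expire)."""
    if expire_after is None:
        return False
    return now - written_at > expire_after
```

For example, an entry written three days ago is expired under `parse_expire_after("2d")` and never expired under the default of `None`.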

Notes

Some code that implements this is located in https://github.com/hapi-server/cache-tools

Issues:

  • Should metadata (HTTP headers) be cached as well?
  • Should the scientist be able to lock the cache so that updates will not occur?

The following is a description of a recommended directory and file schema for programs that cache HAPI data.

HAPI_DATA should be the environment variable indicating the HAPI cache directory. If not specified, the logic of Python's tempfile module should be used to get the system temporary directory to which hapi_data should be appended, e.g., /tmp/hapi_data will be a common default.
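
The lookup described above can be sketched in Python as follows. The function name `hapi_cache_dir` and its `environ` argument are hypothetical, added only so the behavior can be checked without changing the real environment.

```python
import os
import tempfile


def hapi_cache_dir(environ=None):
    """Return the HAPI cache directory: $HAPI_DATA, else <system tmp>/hapi_data."""
    env = os.environ if environ is None else environ
    configured = env.get("HAPI_DATA")
    if configured:
        return configured
    # tempfile.gettempdir() applies Python's own rules (TMPDIR, TEMP, TMP, /tmp, ...).
    return os.path.join(tempfile.gettempdir(), "hapi_data")
```

Treating an empty `HAPI_DATA` the same as an unset one is an assumption of this sketch.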

Data directory naming: If cadence is given

  • cadence < PT1S - files should contain 1 hour of data and be in subdirectory DATASET_ID/$Y/$m/$d/. File names should be $Y$m$dT$H.VARIABLE.EXT.
  • PT1S <= cadence <= PT1H - files should contain 1 day of data and be in subdirectory DATASET_ID/$Y/$m/. File names should be $Y$m$d.VARIABLE.EXT.
  • cadence > PT1H - files should contain 1 month of data and be in subdirectory DATASET_ID/$Y/. File names should be $Y$m.VARIABLE.EXT.
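
The three rules above can be sketched as a single function. The cadence parser here is deliberately small and handles only days, hours, minutes, and seconds (for example `PT0.1S`, `PT1M`, `P1D`), not years or months. The function names are hypothetical.

```python
import re
from datetime import datetime

_NUM = r"\d+(?:\.\d+)?"
_DURATION = re.compile(
    rf"P(?:(?P<days>{_NUM})D)?"
    rf"(?:T(?:(?P<hours>{_NUM})H)?(?:(?P<minutes>{_NUM})M)?(?:(?P<seconds>{_NUM})S)?)?"
)


def cadence_seconds(cadence):
    """Convert a simple ISO 8601 duration (D, H, M, S parts only) to seconds."""
    match = _DURATION.fullmatch(cadence)
    if match is None or not any(match.groupdict().values()):
        raise ValueError(f"unsupported cadence: {cadence!r}")
    part = {k: float(v) if v else 0.0 for k, v in match.groupdict().items()}
    return (part["days"] * 86400 + part["hours"] * 3600
            + part["minutes"] * 60 + part["seconds"])


def cache_file(dataset, variable, ext, start, cadence):
    """Return the cache path, relative to the server's data/ directory."""
    seconds = cadence_seconds(cadence)
    if seconds < 1:          # cadence < PT1S: hourly files
        subdir, stem = start.strftime("%Y/%m/%d"), start.strftime("%Y%m%dT%H")
    elif seconds <= 3600:    # PT1S <= cadence <= PT1H: daily files
        subdir, stem = start.strftime("%Y/%m"), start.strftime("%Y%m%d")
    else:                    # cadence > PT1H: monthly files
        subdir, stem = start.strftime("%Y"), start.strftime("%Y%m")
    return f"{dataset}/{subdir}/{stem}.{variable}.{ext}"
```

For example, `cache_file("A1_K0_MPA", "sc_pot", "csv", datetime(2008, 1, 3), "PT1M")` gives `A1_K0_MPA/2008/01/20080103.sc_pot.csv`.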

If cadence is not given, the caching software should choose the appropriate directory structure itself. The rule for this is still open: always use daily files (Jeremy)? Or something more well defined (Nobes)? Likewise, software using the cache should assume that other software may have used different logic and should check all resolutions.

Files should contain only data for the parameter, e.g., 19991201.Time.csv will contain a single column with just the timestamps that are common to all parameters in the dataset. The file 19991201.Parameter1.csv would not contain timestamps. If a user requests Parameter1, a program reading the cache will need to read two files, the Time file and the Parameter1 file, to return the required data for Parameter1.
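
A sketch of the two-file read for a single parameter. It assumes uncompressed csv files with one record per line. The function name `read_parameter` is hypothetical, and the length check is a precaution of this sketch rather than part of the schema.

```python
import csv
import os


def read_parameter(directory, stem, parameter, ext="csv"):
    """Join STEM.Time.EXT with STEM.PARAMETER.EXT into rows of [time, value, ...]."""
    time_path = os.path.join(directory, f"{stem}.Time.{ext}")
    param_path = os.path.join(directory, f"{stem}.{parameter}.{ext}")
    with open(time_path, newline="") as time_file:
        times = [row[0] for row in csv.reader(time_file) if row]
    with open(param_path, newline="") as param_file:
        values = [row for row in csv.reader(param_file) if row]
    if len(times) != len(values):
        # The two files must describe the same records to be joined by position.
        raise ValueError("Time and parameter files have different record counts")
    return [[t] + v for t, v in zip(times, values)]
```

Calling `read_parameter(directory, "19991201", "Parameter1")` reads `19991201.Time.csv` and `19991201.Parameter1.csv` from `directory` and returns the combined rows as lists of strings.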

Directory structure for PT1S <= cadence <= PT1H:

hapi_data/
  # http://hapi-server.org/servers/SSCWeb/hapi
  http/
    hapi-server.org/
      servers/
        SSCWeb/
          hapi/
            capabilities.json
            capabilities.json.httpheaders
            catalog.json
            catalog.json.httpheaders
            data/
            info/
              
  # https://cdaweb.gsfc.nasa.gov/hapi
  https/
    cdaweb.gsfc.nasa.gov/
          hapi/
            capabilities.json
            capabilities.json.httpheaders
            catalog.json
            catalog.json.httpheaders
            data/
              A1_K0_MPA/2008/01/
                20080103.csv{.gz}         # All parameters   
                20080103.csv{.gz}.httpheaders  # All parameters   
                20080103.binary{.gz}      # All parameters       
                20080103.Time.csv{.gz}    # Single column 
                20080103.Time.binary{.gz}
                20080103.sc_pot.csv{.gz}  # Single column 
                20080103.sc_pot.binary{.gz} 
                ...
              AC_AT_DEF/2009/02/
              ...              
            info/
              A1_K0_MPA.json
              AC_AT_DEF.json
              ...
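
The mapping from a server URL to its place in the tree above can be sketched as follows. The function names are hypothetical. URLs with a port, query string, or characters that are not valid in file names are not handled here, and the schema above does not yet say how they should be.

```python
from urllib.parse import urlsplit


def server_cache_path(server_url):
    """Map a HAPI server URL to its directory relative to the cache root."""
    parts = urlsplit(server_url)
    if parts.scheme not in ("http", "https") or not parts.netloc:
        raise ValueError(f"not an http(s) URL: {server_url!r}")
    pieces = [parts.scheme, parts.netloc, parts.path.strip("/")]
    # Drop the path piece when the URL has no path.
    return "/".join(piece for piece in pieces if piece)


def info_cache_path(server_url, dataset):
    """Relative path of the cached info response for one dataset."""
    return f"{server_cache_path(server_url)}/info/{dataset}.json"
```

For example, `server_cache_path("http://hapi-server.org/servers/SSCWeb/hapi")` gives `http/hapi-server.org/servers/SSCWeb/hapi`, matching the first tree above.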

To be handled later

Thread safety. As we develop, continue to ask if this can be added later without complication.
