cache specification
What is the default behavior? What is the intent? What is the difference between performance caching and 'save local'?
Use Case I: Refresh the cache if server data has possibly changed (based on the server's 'Last-Modified' header)
Use Case II: I need replicable data. Do not refresh the cache unless I tell you.
Use Case III: Poor connection. If there is no internet, use the cache but warn
Use Case IV or compromise: 'Aggression' setting?
Use Case V: ???
Idea: Cache includes a separate 'Aggression' setting ranging from 'never' to 'only on update' to 'always refresh'
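One way the 'Aggression' idea could be expressed is sketched below. This is only an illustration of the use cases listed above, not an agreed design: the names `Aggression` and `should_refresh`, and the choice to pass timestamps as plain numbers (seconds since the epoch), are assumptions made for the example.

```python
from enum import Enum
from typing import Optional


class Aggression(Enum):
    """Hypothetical names for the three settings mentioned above."""
    NEVER = "never"
    ON_UPDATE = "only on update"
    ALWAYS = "always refresh"


def should_refresh(aggression: Aggression,
                   cached_time: Optional[float],
                   server_last_modified: Optional[float],
                   online: bool = True) -> bool:
    """Return True if a new request should be made instead of using the cache.

    cached_time: when the cache file was written, or None if nothing is cached.
    server_last_modified: the server's Last-Modified time, or None if unknown.
    """
    if cached_time is None:
        # Nothing cached: a request is the only option, and only if online.
        return online
    if not online:
        # Use Case III: no connection, so use the cache (caller should warn).
        return False
    if aggression is Aggression.NEVER:
        # Use Case II: replicable data, never refresh automatically.
        return False
    if aggression is Aggression.ALWAYS:
        return True
    # ON_UPDATE (Use Case I): refresh if the server copy is newer or unknown.
    if server_last_modified is None:
        return True
    return server_last_modified > cached_time
```

With this sketch, `should_refresh(Aggression.ON_UPDATE, 100.0, 200.0)` asks for a refresh because the server copy is newer than the cached one, while the same call with `online=False` falls back to the cache.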
- Java CLI interface code (Nobes)
- Review doc (All)
- Unit tests (All)
- Extract existing Java code (Jeremy)
- Paper that tests time series databases for IoT: https://arxiv.org/pdf/1901.08304.pdf
- Comment on caching by jeandet
- Possible locking mechanism: https://stackoverflow.com/questions/11787567/cache-locking-for-lots-of-processes
Usage
java -jar hapi-cache.jar --url "https://server/hapi/data?dataset=...&parameters=...&start=...&stop=...&format={csv,bin}"
java -jar hapi-cache.jar \
--server "https://server/hapi" --dataset=... --parameters=... --start=... --stop=... --format={csv,bin}
Response is CSV or binary according to --format. The default behavior when used as a client is to use HTTP headers and the existing cache to decide how to return data (use the cache or make a new request). The default behavior when used as a server is to use file timestamps (or HTTP headers on the back-end server if used in pass-through mode).
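The two invocation forms above are meant to describe the same request. A minimal sketch of how the separate options could be combined into the single URL form follows; the function name `data_url` is hypothetical, and the dataset and parameter names in the usage note are taken from the directory example later in this document.

```python
from urllib.parse import urlencode


def data_url(server: str, dataset: str, parameters: str,
             start: str, stop: str, fmt: str = "csv") -> str:
    """Build a HAPI data request URL from the individual option values."""
    query = urlencode(
        {
            "dataset": dataset,
            "parameters": parameters,
            "start": start,
            "stop": stop,
            "format": fmt,
        },
        safe=",:",  # keep commas (parameter lists) and colons (times) readable
    )
    return server.rstrip("/") + "/data?" + query
```

For example, `data_url("https://server/hapi", "A1_K0_MPA", "sc_pot", "2008-01-03T00:00:00Z", "2008-01-04T00:00:00Z")` gives a URL ending in `/data?dataset=A1_K0_MPA&parameters=sc_pot&start=2008-01-03T00:00:00Z&stop=2008-01-04T00:00:00Z&format=csv`.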
Other options:
--cache-dir DIR
--write-cache {T,F} (write cache if not there)
--read-cache {T,F} (use cache if there)
--expire-after N{y,d,h,m,s} (use this word? Don't use cache if written > N{y,d,h,m,s} ago - this is a feature of Python `requests_cache` lib; default is never)
--cache-exact (only cache exact request; will lead to fewer cache hits, but fast cache response if exact request made again)
Some code that implements this is located in https://github.com/hapi-server/cache-tools
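As an illustration of the proposed --expire-after N{y,d,h,m,s} option, the sketch below parses the value and checks whether a cache file is too old. It is not taken from cache-tools. The function names are hypothetical, and two interpretations are assumptions: `y` is treated as 365 days and `m` as minutes.

```python
import re
from datetime import datetime, timedelta
from typing import Optional

# Assumed unit lengths in seconds: "y" is 365 days and "m" is minutes.
_UNIT_SECONDS = {"y": 365 * 86400, "d": 86400, "h": 3600, "m": 60, "s": 1}


def parse_expire_after(value: str) -> timedelta:
    """Convert a string such as '12h' or '30d' to a timedelta."""
    match = re.fullmatch(r"(\d+)([ydhms])", value)
    if match is None:
        raise ValueError(f"expected N followed by one of y,d,h,m,s: {value!r}")
    count, unit = int(match.group(1)), match.group(2)
    return timedelta(seconds=count * _UNIT_SECONDS[unit])


def is_expired(written_at: datetime, now: datetime,
               expire_after: Optional[timedelta] = None) -> bool:
    """Return True if the cache entry was written more than expire_after ago.

    expire_after=None models the proposed default of never expiring.
    """
    if expire_after is None:
        return False
    return now - written_at > expire_after
```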
Issues:
- Should metadata (http headers) be cached as well?
- Should the scientist be able to lock the cache so that updates will not occur?
The following is a description of a recommended directory and file schema for programs that cache HAPI data.
HAPI_DATA should be the environment variable indicating the HAPI cache directory. If not specified, the logic of Python's tempfile module should be used to get the system temporary directory, to which hapi_data should be appended; e.g., /tmp/hapi_data will be a common default.
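The rule above can be sketched in a few lines. The function name `hapi_cache_dir` and its optional `environ` argument (which makes the function easy to test) are assumptions for the example.

```python
import os
import tempfile
from typing import Mapping, Optional


def hapi_cache_dir(environ: Optional[Mapping[str, str]] = None) -> str:
    """Return the HAPI cache directory.

    Uses the HAPI_DATA environment variable if it is set and non-empty;
    otherwise appends hapi_data to the system temporary directory.
    """
    env = os.environ if environ is None else environ
    configured = env.get("HAPI_DATA")
    if configured:
        return configured
    return os.path.join(tempfile.gettempdir(), "hapi_data")
```

On many Linux systems `tempfile.gettempdir()` returns `/tmp`, which gives the `/tmp/hapi_data` default mentioned above; the result differs on other platforms or when `TMPDIR` is set.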
Data directory naming: If cadence is given
- cadence < PT1S - files should contain 1 hour of data and be in subdirectory DATASET_ID/$Y/$m/$d/. File names should be $Y$m$dT$H.VARIABLE.EXT.
- PT1S <= cadence <= PT1H - files should contain 1 day of data and be in subdirectory DATASET_ID/$Y/$m/. File names should be $Y$m$d.VARIABLE.EXT.
- cadence > PT1H - files should contain 1 month of data and be in a subdirectory of DATASET_ID/$Y/. File names should be $Y$m.VARIABLE.EXT.
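The three cadence rules above could be implemented roughly as follows. This is a sketch under stated assumptions: the names `parse_cadence` and `cache_path` are hypothetical, and the duration parser only handles cadences of the form PnDTnHnMnS (year and month components are rejected because their length is not fixed).

```python
import re
from datetime import datetime, timedelta

# Minimal ISO 8601 duration pattern: days, hours, minutes, seconds only.
_DURATION = re.compile(
    r"P(?:(\d+)D)?(?:T(?:(\d+)H)?(?:(\d+)M)?(?:(\d+(?:\.\d+)?)S)?)?"
)


def parse_cadence(cadence: str) -> timedelta:
    """Parse a cadence such as 'PT1M' or 'PT0.5S' into a timedelta."""
    match = _DURATION.fullmatch(cadence)
    if match is None or not any(match.groups()):
        raise ValueError(f"unsupported cadence: {cadence!r}")
    days, hours, minutes, seconds = (float(g) if g else 0.0
                                     for g in match.groups())
    return timedelta(days=days, hours=hours, minutes=minutes, seconds=seconds)


def cache_path(dataset_id: str, variable: str, ext: str,
               cadence: str, when: datetime) -> str:
    """Return the relative cache file path for the file that contains `when`."""
    step = parse_cadence(cadence)
    if step < timedelta(seconds=1):
        # Hourly files in DATASET_ID/$Y/$m/$d/
        subdir, stem = when.strftime("%Y/%m/%d"), when.strftime("%Y%m%dT%H")
    elif step <= timedelta(hours=1):
        # Daily files in DATASET_ID/$Y/$m/
        subdir, stem = when.strftime("%Y/%m"), when.strftime("%Y%m%d")
    else:
        # Monthly files in DATASET_ID/$Y/
        subdir, stem = when.strftime("%Y"), when.strftime("%Y%m")
    return f"{dataset_id}/{subdir}/{stem}.{variable}.{ext}"
```

For a one-minute cadence, `cache_path("A1_K0_MPA", "sc_pot", "csv", "PT1M", datetime(2008, 1, 3, 12))` gives `A1_K0_MPA/2008/01/20080103.sc_pot.csv`.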
If cadence is not given, the caching software should (use the rule ... always do daily (Jeremy)? Or more well defined (Nobes)?) and choose the appropriate directory structure. Likewise, software using the cache should assume that other software may have different logic and should check all resolutions.
Files should contain only data for the parameter, e.g., 19991201.Time.csv will contain a single column with just the timestamps that are common to all parameters in the dataset. The file 19991201.Parameter1.csv would not contain timestamps. If a user requests Parameter1, a program reading the cache will need to read two files, the Time file and the Parameter1 file, to return the required data for Parameter1.
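A minimal sketch of the two-file read described above is shown here. The function name `read_parameter` is hypothetical, and the sketch assumes uncompressed CSV files with one record per line and no header.

```python
import csv
from pathlib import Path
from typing import List


def read_parameter(directory: str, date_stem: str,
                   parameter: str) -> List[List[str]]:
    """Join the Time file and one parameter file into rows of [time, values...].

    date_stem is the date part of the file name, e.g. '19991201'.
    """
    base = Path(directory)
    with open(base / f"{date_stem}.Time.csv", newline="") as f:
        times = [row[0] for row in csv.reader(f) if row]
    with open(base / f"{date_stem}.{parameter}.csv", newline="") as f:
        values = [row for row in csv.reader(f) if row]
    if len(times) != len(values):
        # The Time file is shared by all parameters, so counts must agree.
        raise ValueError(
            f"{len(times)} timestamps but {len(values)} records for {parameter}"
        )
    return [[t] + v for t, v in zip(times, values)]
```

The length check matters because the timestamps and the values live in separate files, and a partially written or stale file would otherwise silently misalign them.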
Directory structure for PT1S <= cadence <= PT1H:
hapi_data/
  # http://hapi-server.org/servers/SSCWeb/hapi
  http/
    hapi-server.org/
      servers/
        SSCWeb/
          hapi/
            capabilities.json
            capabilities.json.httpheaders
            catalog.json
            catalog.json.httpheaders
            data/
            info/
  # https://cdaweb.gsfc.nasa.gov/hapi
  https/
    cdaweb.gsfc.nasa.gov/
      hapi/
        capabilities.json
        capabilities.json.httpheaders
        catalog.json
        catalog.json.httpheaders
        data/
          A1_K0_MPA/2008/01/
            20080103.csv{.gz} # All parameters
            20080103.csv{.gz}.httpheaders # All parameters
            20080103.binary{.gz} # All parameters
            20080103.Time.csv{.gz} # Single column
            20080103.Time.binary{.gz}
            20080103.sc_pot.csv{.gz} # Single column
            20080103.sc_pot.binary{.gz}
            ...
          AC_AT_DEF/2009/02/
            ...
        info/
          A1_K0_MPA.json
          AC_AT_DEF.json
          ...
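The mapping from a server URL to its directory in the layout above (scheme, then host, then path) could be sketched as follows. The function names `server_cache_dir` and `headers_file` are hypothetical. The sketch does not handle URLs with a port number, whose colon would need escaping on some file systems.

```python
from pathlib import PurePosixPath
from urllib.parse import urlsplit


def server_cache_dir(cache_root: str, server_url: str) -> PurePosixPath:
    """Map a HAPI server URL to its cache directory: root/scheme/host/path."""
    parts = urlsplit(server_url)
    return PurePosixPath(cache_root, parts.scheme, parts.netloc,
                         parts.path.strip("/"))


def headers_file(cached_file: PurePosixPath) -> PurePosixPath:
    """Return the companion file that stores the HTTP headers of a response."""
    return cached_file.with_name(cached_file.name + ".httpheaders")
```

For example, `server_cache_dir("hapi_data", "https://cdaweb.gsfc.nasa.gov/hapi")` gives `hapi_data/https/cdaweb.gsfc.nasa.gov/hapi`, and applying `headers_file` to `catalog.json` in that directory gives `catalog.json.httpheaders` alongside it.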
Thread safety. As we develop, continue to ask if this can be added later without complication.
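One low-cost precaution that keeps later options open is to write each cache file to a temporary name and then rename it into place, so that readers never see a partially written file. The sketch below illustrates that general technique and is not a decision recorded in this document. The function name `atomic_write` is hypothetical. It does not stop two processes from downloading the same data at the same time; that would need a locking mechanism such as the one linked earlier.

```python
import os
import tempfile


def atomic_write(path: str, data: bytes) -> None:
    """Write data to path so that readers see the old or the new file only."""
    directory = os.path.dirname(os.path.abspath(path))
    os.makedirs(directory, exist_ok=True)
    # Create the temporary file in the same directory so that the final
    # rename stays on one file system.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
        os.replace(tmp_path, path)  # replaces any existing file in one step
    except BaseException:
        # Clean up the temporary file if the write or rename failed.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```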