Develop (podaac#32)
* Don't consume all arguments after --extensions

This behavior is now more like other utilities where specifying the flag
multiple times extends the value of the argument.  For example,
    -e '.nc .h5 .zip'
becomes
    -e '.nc' -e '.h5' -e '.zip'

This is less fragile for the user and possibly less confusing as to how the
argument should be formatted on the command line.

* Add ability to execute arbitrary commands on each downloaded file

I did it this way so each file could be compressed without hard-coding
the compression algorithm. But I could see this being used to run a
pre-processing script on each downloaded file.

* updated README and tests for additive -e examples

* force 'action'

* merged code for extensions, process call, and updated documentation

* fix for podaac#28

* updated CHANGELOG

Co-authored-by: Joe Sapp <joe.sapp@noaa.gov>
Co-authored-by: mgangl <mike.gangl@gmail.com>
3 people authored Nov 15, 2021
1 parent f108dca commit d230a43
Showing 5 changed files with 71 additions and 26 deletions.
11 changes: 11 additions & 0 deletions CHANGELOG.md
@@ -3,6 +3,17 @@ All notable changes to this project will be documented in this file.

The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/)

## [1.7.0]
### Added
- Added ability to call a process on downloaded files. [Thanks to Joe Sapp](https://github.com/sappjw).
### Changed
- Turned -e option into 'additive' mode (multiple -e options allowed.) [Thanks to Joe Sapp](https://github.com/sappjw)
### Deprecated
### Removed
### Fixed
- issue not being able to find granuleUR [#28](https://github.com/podaac/data-subscriber/issues/28)
### Security

## [1.6.1]
### Added
- added warning for more than 2k granules
48 changes: 27 additions & 21 deletions README.md
@@ -33,36 +33,38 @@ you should now have access to the subscriber CLI:

```
$> podaac-data-subscriber -h
usage: podaac-data-subscriber [-h] -c COLLECTION -d OUTPUTDIRECTORY [-m MINUTES] [-b BBOX] [-e [EXTENSIONS [EXTENSIONS ...]]] [-ds DATASINCE] [--version] [--verbose]
usage: podaac_data_subscriber.py [-h] -c COLLECTION -d OUTPUTDIRECTORY [-sd STARTDATE] [-ed ENDDATE] [-b BBOX] [-dc] [-dydoy] [-dymd] [-dy] [--offset OFFSET] [-m MINUTES]
[-e EXTENSIONS] [--process PROCESS_CMD] [--version] [--verbose] [-p PROVIDER]
optional arguments:
-h, --help show this help message and exit
-c COLLECTION, --collection-shortname COLLECTION
The collection shortname for which you want to retrieve data.
-d OUTPUTDIRECTORY, --data-dir OUTPUTDIRECTORY
The directory where data products will be downloaded.
-sd STARTDATE, --start-date STARTDATE
The ISO date time before which data should be retrieved. For Example, --start-date 2021-01-14T00:00:00Z
-ed ENDDATE, --end-date ENDDATE
The ISO date time after which data should be retrieved. For Example, --end-date 2021-01-14T00:00:00Z
-b BBOX, --bounds BBOX
The bounding rectangle to filter result in. Format is W Longitude,S Latitude,E Longitude,N Latitude without spaces. Due to an issue with parsing
arguments, to use this command, please use the -b="-180,-90,180,90" syntax when calling from the command line. Default: "-180,-90,180,90".
-dc Flag to use cycle number for directory where data products will be downloaded.
-dydoy Flag to use start time (Year/DOY) of downloaded data for directory where data products will be downloaded.
-dymd Flag to use start time (Year/Month/Day) of downloaded data for directory where data products will be downloaded.
-dy Flag to use start time (Year) of downloaded data for directory where data products will be downloaded.
--offset OFFSET Flag used to shift timestamp. Units are in hours, e.g. 10 or -10.
-m MINUTES, --minutes MINUTES
How far back in time, in minutes, should the script look for data. If running this script as a cron, this value should be equal to or greater than how often your
cron runs (default: 60 minutes).
-b BBOX, --bounds BBOX
The bounding rectangle to filter result in. Format is W Longitude,S Latitude,E Longitude,N Latitude without spaces. Due to an issue with parsing arguments, to use
this command, please use the -b="-180,-90,180,90" syntax when calling from the command line. Default: "-180,-90,180,90".
-e [EXTENSIONS [EXTENSIONS ...]], --extensions [EXTENSIONS [EXTENSIONS ...]]
The extensions of products to download. Default is [.nc, .h5]
-sd STARTDATE, --start-date STARTDATE
The ISO date time before which data should be retrieved. For Example, --start-date 2021-01-14T00:00:00Z
-ed ENDDATE, --end-date ENDDATE
The ISO date time after which data should be retrieved. For Example, --end-date 2021-01-14T00:00:00Z
How far back in time, in minutes, should the script look for data. If running this script as a cron, this value should be equal to or greater than how
often your cron runs (default: 60 minutes).
-e EXTENSIONS, --extensions EXTENSIONS
The extensions of products to download. Default is [.nc, .h5, .zip]
--process PROCESS_CMD
Processing command to run on each downloaded file (e.g., compression). Can be specified multiple times.
--version Display script version information and exit.
--verbose Verbose mode.
-p PROVIDER, --provider PROVIDER
Specify a provider for collection search. Default is POCLOUD.
```

One can also call the python package directly:
@@ -95,7 +97,8 @@ For setting up your authentication, see the notes on the `netrc` file below.

Usage:
```
usage: podaac-data-subscriber [-h] -c COLLECTION -d OUTPUTDIRECTORY [-m MINUTES] [-b BBOX] [-e [EXTENSIONS [EXTENSIONS ...]]] [-ds DATASINCE] [--version]
usage: podaac_data_subscriber.py [-h] -c COLLECTION -d OUTPUTDIRECTORY [-sd STARTDATE] [-ed ENDDATE] [-b BBOX] [-dc] [-dydoy] [-dymd] [-dy] [--offset OFFSET]
[-m MINUTES] [-e EXTENSIONS] [--version] [--verbose] [-p PROVIDER]
```

To run the script, the following parameters are required:
@@ -206,7 +209,7 @@ The subscriber allows the placement of downloaded files into one of several dire
To automatically run and update a local file system with data files from a collection, one can use a syntax like the following:

```
10 * * * * podaac-data-subscriber -c VIIRS_N20-OSPO-L2P-v2.61 -d /path/to/data/VIIRS_N20-OSPO-L2P-v2.61 -e .nc .h5 -m 60 -b="-180,-90,180,90" --verbose >> ~/.subscriber.log
10 * * * * podaac-data-subscriber -c VIIRS_N20-OSPO-L2P-v2.61 -d /path/to/data/VIIRS_N20-OSPO-L2P-v2.61 -e .nc -e .h5 -m 60 -b="-180,-90,180,90" --verbose >> ~/.subscriber.log
```

@@ -232,17 +235,20 @@ podaac-data-subscriber -c VIIRS_N20-OSPO-L2P-v2.61 -d ./data -b="-180,-90,180,90

### Setting extensions

Some collections have many files. To download a specific set of files, you can set the extensions on which downloads are filtered. By default, ".nc" and ".h5" files are downloaded by default.
Some collections have many files. To download a specific set of files, you can set the extensions on which downloads are filtered. By default, ".nc", ".h5", and ".zip" files are downloaded.

```
-e [EXTENSIONS [EXTENSIONS ...]], --extensions [EXTENSIONS [EXTENSIONS ...]]
The extensions of products to download. Default is [.nc, .h5]
-e EXTENSIONS, --extensions EXTENSIONS
The extensions of products to download. Default is [.nc, .h5, .zip]
```

An example of the -e usage:
An example of -e usage (note that the -e option is additive):
```
podaac-data-subscriber -c VIIRS_N20-OSPO-L2P-v2.61 -d ./data -e .nc .h5
podaac-data-subscriber -c VIIRS_N20-OSPO-L2P-v2.61 -d ./data -e .nc -e .h5
```
### Run a post-download process

Using the `--process` option, you can run a simple command against each newly downloaded file. The command is invoked as "<command> <path/to/file>", so an option like `--process gzip` gzips every downloaded file. More advanced processing (piping, running a process on a directory, etc.) is not supported at this time.


### Changing how far back the script looks for data
2 changes: 1 addition & 1 deletion setup.py
@@ -4,7 +4,7 @@
long_description = fh.read()

setup(name='podaac-data-subscriber',
version='1.6.1',
version='1.7.0',
description='PO.DAAC Data Subscriber Command Line Tool',
url='https://github.com/podaac/data-subscriber',
long_description=long_description,
34 changes: 31 additions & 3 deletions subscriber/podaac_data_subscriber.py
@@ -25,11 +25,12 @@
import os
from os import makedirs
from os.path import isdir, basename, join, splitext
import subprocess
from urllib.parse import urlencode
from urllib.request import urlopen, urlretrieve
from datetime import datetime, timedelta

__version__ = "1.6.1"
__version__ = "1.7.0"

LOGLEVEL = os.environ.get('SUBSCRIBER_LOGLEVEL', 'WARNING').upper()
logging.basicConfig(level=LOGLEVEL)
@@ -207,7 +208,9 @@ def create_parser():
parser.add_argument("--offset", dest="offset", help = "Flag used to shift timestamp. Units are in hours, e.g. 10 or -10.") # noqa E501

parser.add_argument("-m", "--minutes", dest="minutes", help = "How far back in time, in minutes, should the script look for data. If running this script as a cron, this value should be equal to or greater than how often your cron runs (default: 60 minutes).", type=int, default=60) # noqa E501
parser.add_argument("-e", "--extensions", dest="extensions", help = "The extensions of products to download. Default is [.nc, .h5, .zip]", default=[".nc", ".h5", ".zip"], nargs='*') # noqa E501
parser.add_argument("-e", "--extensions", dest="extensions", help = "The extensions of products to download. Default is [.nc, .h5, .zip]", default=None, action='append') # noqa E501
parser.add_argument("--process", dest="process_cmd", help = "Processing command to run on each downloaded file (e.g., compression). Can be specified multiple times.", action='append')


parser.add_argument("--version", dest="version", action="store_true",help="Display script version information and exit.") # noqa E501
parser.add_argument("--verbose", dest="verbose", action="store_true",help="Verbose mode.") # noqa E501
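
A note on why `default=None` is paired with a later fallback: with `action='append'`, argparse appends user-supplied values to any list passed as `default`, so keeping `[".nc", ".h5", ".zip"]` as the default would likely make it impossible to narrow the filter. The following is a minimal standalone sketch of that behavior, not the subscriber's actual parser; `build_parser` and `DEFAULT_EXTENSIONS` are hypothetical names used only for illustration.

```python
import argparse

# Hypothetical constant mirroring the defaults named in the help text.
DEFAULT_EXTENSIONS = [".nc", ".h5", ".zip"]


def build_parser(default):
    """Build a tiny parser with only the -e option, for illustration."""
    parser = argparse.ArgumentParser()
    parser.add_argument("-e", "--extensions", dest="extensions",
                        default=default, action="append")
    return parser


# A list default is extended, not replaced, by 'append'.
with_list_default = build_parser(list(DEFAULT_EXTENSIONS)).parse_args(["-e", ".txt"])

# With default=None the result holds only what the user passed...
with_none_default = build_parser(None).parse_args(["-e", ".txt"])

# ...and stays None when -e is omitted, so the caller applies the fallback.
omitted = build_parser(None).parse_args([])
extensions = omitted.extensions or list(DEFAULT_EXTENSIONS)

print(with_list_default.extensions)
print(with_none_default.extensions)
print(extensions)
```

This matches the shape of the change in `run()`, where an empty or missing `extensions` value is replaced by the three default extensions before filtering.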
@@ -244,6 +247,7 @@ def run():

short_name = args.collection
extensions = args.extensions
process_cmd = args.process_cmd

data_path = args.outputDirectory
# You should change `data_path` to a suitable download path on your file system.
@@ -357,6 +361,7 @@ def run():
'Specify an output directory or '
'choose another output directory flag other than -dc.') # noqa E501


timestamp = datetime.utcnow().strftime("%Y-%m-%dT%H:%M:%SZ")

# Neatly print the first granule record (if one was returned):
@@ -386,6 +391,8 @@ def run():


#filter list based on extension
if not extensions:
extensions = [".nc", ".h5", ".zip"]
filtered_downloads = []
for f in downloads:
for extension in extensions:
@@ -429,8 +436,18 @@ def prepare_time_output(times, prefix, file):
write_path
string path to where granules will be written
"""

time_match = [dt for dt in
times if dt[0] == splitext(basename(file))[0]][0][1]
times if dt[0] == splitext(basename(file))[0]]

# Found on 11/11/21
# https://github.com/podaac/data-subscriber/issues/28
# if we don't find the time match array, try again using the
# filename AND its suffix (above removes it...)
if len(time_match) == 0:
time_match = [dt for dt in
times if dt[0] == basename(file)]
time_match = time_match[0][1]

# offset timestamp for output paths
if args.offset:
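
The lookup change in this hunk can be read in isolation as follows. This is a sketch under the assumption that `times` is a list of `(granule_name, timestamp)` pairs, which is what the indexing `dt[0]` and `[0][1]` suggests; `find_time_match` and the sample records are hypothetical and not part of the commit.

```python
from os.path import basename, splitext


def find_time_match(times, file):
    """Return the timestamp paired with `file`, trying two name forms.

    First compare against the file name without its extension; if nothing
    matches, compare against the full file name (extension included).
    """
    stem = splitext(basename(file))[0]
    matches = [dt for dt in times if dt[0] == stem]
    if len(matches) == 0:
        matches = [dt for dt in times if dt[0] == basename(file)]
    # Like the diff, this raises IndexError if neither form matches.
    return matches[0][1]


# Hypothetical records: one keyed by stem, one keyed by full file name.
times = [
    ("granule_a", "2021-11-11T00:00:00Z"),
    ("granule_b.nc", "2021-11-12T00:00:00Z"),
]

first = find_time_match(times, "https://example.com/data/granule_a.nc")
second = find_time_match(times, "https://example.com/data/granule_b.nc")
print(first)
print(second)
```

The second call only succeeds because of the fallback, which appears to be the situation described in issue #28.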
@@ -480,6 +497,16 @@ def prepare_cycles_output(data_cycles, prefix, file):
write_path = join(prefix, cycle_dir, basename(file))
return write_path

def process_file(output_path):
if not process_cmd:
return
else:
for cmd in process_cmd:
if args.verbose:
print(f'Running: {cmd} {output_path}')
subprocess.run(cmd.split() + [output_path],
check=True)

for f in downloads:
try:
for extension in extensions:
@@ -495,6 +522,7 @@ def prepare_cycles_output(data_cycles, prefix, file):
output_path = prepare_cycles_output(
cycles, data_path, f)
urlretrieve(f, output_path)
process_file(output_path)
print(str(datetime.now()) + " SUCCESS: " + f)
success_cnt = success_cnt + 1
except Exception as e:
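
To make the `--process` mechanics concrete, here is a standalone sketch of how each command string becomes an argument list. It assumes the same approach as the diff (`cmd.split() + [output_path]` passed to `subprocess.run` with `check=True`), but takes the command list as a parameter instead of reading it from the enclosing `run()` scope; `build_argv` is a hypothetical helper introduced only for illustration.

```python
import subprocess


def build_argv(cmd, output_path):
    """Mirror the diff: whitespace-split the command, append the file path."""
    return cmd.split() + [output_path]


def process_file(output_path, process_cmds, verbose=False):
    """Run each command on the file; check=True raises on a non-zero exit."""
    for cmd in process_cmds or []:
        argv = build_argv(cmd, output_path)
        if verbose:
            print("Running:", " ".join(argv))
        subprocess.run(argv, check=True)


# Options given inside one --process value are split into separate arguments.
gzip_argv = build_argv("gzip -9", "/tmp/granule.nc")

# No shell is involved, so '|' is passed as a literal argument, not a pipe.
pipe_argv = build_argv("gzip | wc", "/tmp/granule.nc")

print(gzip_argv)
print(pipe_argv)

# With no commands configured, nothing is executed.
process_file("/tmp/granule.nc", None)
```

Because `str.split()` does not honor quoting, a command containing a quoted argument with spaces would be broken apart; `shlex.split` from the standard library is one alternative, though it is not what this commit uses. Note also that a failing command raises `subprocess.CalledProcessError`, which the surrounding `except Exception` in the download loop would catch.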
2 changes: 1 addition & 1 deletion tests/test_subscriber.py
@@ -26,7 +26,7 @@ def test_validate():
a = validate(["-c", "viirs", "-d", "/data", "-b=-180,-90,180,90", "-m", "100"])
assert a.minutes == 100, "should equal 100"

a = validate(["-c", "viirs", "-d", "/data", "-b=-180,-90,180,90", "-e", ".txt", ".nc"])
a = validate(["-c", "viirs", "-d", "/data", "-b=-180,-90,180,90", "-e", ".txt", "-e", ".nc"])
assert ".txt" in a.extensions
assert ".nc" in a.extensions
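
A caller still using the old space-separated form would, as far as argparse semantics go, see the extra value left unconsumed (and rejected by `parse_args` as an unrecognized argument). The small standalone sketch below uses `parse_known_args` to make that leftover visible; it builds its own parser and does not call the project's `validate` helper.

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("-e", "--extensions", dest="extensions",
                    default=None, action="append")

# New form: one value per -e flag.
new_style, leftover_new = parser.parse_known_args(["-e", ".txt", "-e", ".nc"])

# Old form: the second value is no longer consumed by -e.
old_style, leftover_old = parser.parse_known_args(["-e", ".txt", ".nc"])

print(new_style.extensions, leftover_new)
print(old_style.extensions, leftover_old)
```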

