Extension regex (#121)
* extend -e option to handle regular expressions (#115)

* Develop into Main (1.12.0) (#114)

* Issues/91 (#92)

* added citation creation tests and functionality to subscriber and downloader

* added verbose option to create_citation_file command, previously hard coded

* updated changelog (whoops) and fixed regression test:
1. Issue where the citation file now downloaded affected the counts
2. Issue where the logic for determining if a file modified time was changing or not was picking up the new citation file which _always_ gets rewritten to update the 'last accessed' date.

* updated request to include exec_info in warning; fixed issue with params not being a dictionary caused errors

* changed a warning to debug for citation file. fixed test issues

* Enable debug logging during regression tests and set max parallel workflows to 2

* added output to pytest

* fixed test to only look for downloaded data files not citation file due to 'random' cmr errors when creating a citation.

* added mock testing and retry on 503

* added 503 fixes

Co-authored-by: Frank Greguska <Francis.Greguska@jpl.nasa.gov>

* fixed issues where token was not propagated to CMR queries (#95)

* Misc fixes (#101)

* added ".tiff" to default extensions to address #100

* removed 'warning' message on not downloading all data to close #99

* updated help documentation for start/end times to close #79

* added version update, updates to CHANGELOG

* added token get,delete, refresh and list operations

* Revert "added token get,delete, refresh and list operations"

This reverts commit 15aba90.

* Update python-app.yml

* updated poetry version 

Version matches build/test versions.

* Issues/98 (#107)

* added token get,delete, refresh and list operations

* Revert "added token get,delete, refresh and list operations"

This reverts commit 15aba90.

* added  EDL (not cmr-token) based get, list,delete, refresh token

* updated token regression tests

* updates and tests for subscriber moving to EDL.

* marked tests as regression test

* Update subscriber/podaac_data_downloader.py

Co-authored-by: Frank Greguska <89428916+frankinspace@users.noreply.github.com>

* Update subscriber/podaac_data_subscriber.py

Co-authored-by: Frank Greguska <89428916+frankinspace@users.noreply.github.com>

* Update subscriber/podaac_access.py

Co-authored-by: Frank Greguska <89428916+frankinspace@users.noreply.github.com>

* Update subscriber/podaac_access.py

Co-authored-by: Frank Greguska <89428916+frankinspace@users.noreply.github.com>

* Update subscriber/podaac_access.py

Co-authored-by: Frank Greguska <89428916+frankinspace@users.noreply.github.com>

* added exec info to errors, cleaned up some log statements

Co-authored-by: Frank Greguska <89428916+frankinspace@users.noreply.github.com>

* Issues/109 (#111)

* Develop (#103)

* Issues/91 (#92)

* added citation creation tests and functionality to subscriber and downloader

* added verbose option to create_citation_file command, previously hard coded

* updated changelog (whoops) and fixed regression test:
1. Issue where the citation file now downloaded affected the counts
2. Issue where the logic for determining if a file modified time was changing or not was picking up the new citation file which _always_ gets rewritten to update the 'last accessed' date.

* updated request to include exec_info in warning; fixed issue with params not being a dictionary caused errors

* changed a warning to debug for citation file. fixed test issues

* Enable debug logging during regression tests and set max parallel workflows to 2

* added output to pytest

* fixed test to only look for downloaded data files not citation file due to 'random' cmr errors when creating a citation.

* added mock testing and retry on 503

* added 503 fixes

Co-authored-by: Frank Greguska <Francis.Greguska@jpl.nasa.gov>

* fixed issues where token was not propagated to CMR queries (#95)

* Misc fixes (#101)

* added ".tiff" to default extensions to address #100

* removed 'warning' message on not downloading all data to close #99

* updated help documentation for start/end times to close #79

* added version update, updates to CHANGELOG

* added token get,delete, refresh and list operations

* Revert "added token get,delete, refresh and list operations"

This reverts commit 15aba90.

* Update python-app.yml

Co-authored-by: Frank Greguska <Francis.Greguska@jpl.nasa.gov>

* updated poetry version 

Version matches build/test versions.

* Update README.md

* Update podaac_data_downloader.py

Fixing for issues 109 - adding capability to download by granule-name

* Update Downloader.md

Fixed the help file

* added changelog entries, regression tests

* added poetry lock cleanup

Co-authored-by: Frank Greguska <Francis.Greguska@jpl.nasa.gov>
Co-authored-by: Frank Greguska <89428916+frankinspace@users.noreply.github.com>
Co-authored-by: sureshshsv <45676320+sureshshsv@users.noreply.github.com>
Co-authored-by: sureshshsv <suresh.vannan@jpl.nasa.gov>

* added README information and updates (#113)

* fixed pymock issues... again

Co-authored-by: Frank Greguska <Francis.Greguska@jpl.nasa.gov>
Co-authored-by: Frank Greguska <89428916+frankinspace@users.noreply.github.com>
Co-authored-by: sureshshsv <45676320+sureshshsv@users.noreply.github.com>
Co-authored-by: sureshshsv <suresh.vannan@jpl.nasa.gov>

* extend -e option to handle regular expressions

formerly, -e could not handle PTM_\d+ extensions without the user explicitly
calling all of them.

---------

Co-authored-by: mike-gangl <59702631+mike-gangl@users.noreply.github.com>
Co-authored-by: Frank Greguska <Francis.Greguska@jpl.nasa.gov>
Co-authored-by: Frank Greguska <89428916+frankinspace@users.noreply.github.com>
Co-authored-by: sureshshsv <45676320+sureshshsv@users.noreply.github.com>
Co-authored-by: sureshshsv <suresh.vannan@jpl.nasa.gov>

* added documentation and tests for regex

* converted defaults to regexes, added gtiff test

---------

Co-authored-by: Peter Mao <peter.mao@gmail.com>
Co-authored-by: Frank Greguska <Francis.Greguska@jpl.nasa.gov>
Co-authored-by: Frank Greguska <89428916+frankinspace@users.noreply.github.com>
Co-authored-by: sureshshsv <45676320+sureshshsv@users.noreply.github.com>
Co-authored-by: sureshshsv <suresh.vannan@jpl.nasa.gov>
6 people authored Feb 3, 2023
1 parent 35c8fc2 commit ad50178
Showing 7 changed files with 50 additions and 12 deletions.
4 changes: 4 additions & 0 deletions CHANGELOG.md
@@ -3,6 +3,10 @@ All notable changes to this project will be documented in this file.

 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/)

+## Unreleased
+### Added
+- Added a new feature allowing regular expressions in the `-e`/`--extensions` option. For example, `-e PTM_\\d+` matches data files like `filename.PTM_1`, `filename.PTM_2` and `filename.PTM_10`, instead of requiring every combination to be specified (`-e PTM_1 -e PTM_2 ... -e PTM_10`).
+
 ## 1.12.0
 ### Fixed
 - Added EDL based token downloading, removing CMR tokens [98](https://github.com/podaac/data-subscriber/issues/98),
14 changes: 12 additions & 2 deletions Downloader.md
@@ -33,7 +33,7 @@ optional arguments:
 -dy Flag to use start time (Year) of downloaded data for directory where data products will be downloaded.
 --offset OFFSET Flag used to shift timestamp. Units are in hours, e.g. 10 or -10.
 -e EXTENSIONS, --extensions EXTENSIONS
-The extensions of products to download. Default is [.nc, .h5, .zip, .tar.gz]
+Regexps of extensions of products to download. Default is [.nc, .h5, .zip, .tar.gz, .tiff]
 -gr GRANULE, --granule-name GRANULE
 The name of the granule to download. Only one granule name can be specified. Script will download all files matching similar granule name sans extension.
 --process PROCESS_CMD
@@ -219,13 +219,23 @@ Some collections have many files. To download a specific set of files, you can s

 ```
 -e EXTENSIONS, --extensions EXTENSIONS
-The extensions of products to download. Default is [.nc, .h5, .zip]
+Regexps of extensions of products to download. Default is [.nc, .h5, .zip, .tar.gz, .tiff]
 ```

 An example of `-e` usage (note that the `-e` option is additive):
 ```
 podaac-data-subscriber -c VIIRS_N20-OSPO-L2P-v2.61 -d ./data -e .nc -e .h5 -sd 2020-06-01T00:46:02Z -ed 2020-07-01T00:46:02Z
 ```
+
+One may also specify a regular expression to select files. For example, the following are equivalent:
+
+`podaac-data-subscriber -c VIIRS_N20-OSPO-L2P-v2.61 -d ./data -e PTM_1 -e PTM_2 ... -e PTM_10 -sd 2020-06-01T00:46:02Z -ed 2020-07-01T00:46:02Z`
+
+and
+
+`podaac-data-subscriber -c VIIRS_N20-OSPO-L2P-v2.61 -d ./data -e PTM_\\d+ -sd 2020-06-01T00:46:02Z -ed 2020-07-01T00:46:02Z`
+
+
 ### run a post download process

 Using the `--process` option, you can run a simple command against the "just" downloaded file. This will take the format of "<command> <path/to/file>". This means you can run a command like `--process gzip` to gzip all downloaded files. We do not support more advanced processes at this time (piping, running a process on a directory, etc).
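As a rough illustration of how the `-e` values in the documentation above reach the script, the sketch below splits a shell-style argument string with `shlex` and feeds it to a cut-down `argparse` parser. Only the `-e`/`--extensions` definition mirrors the option in this commit; the `parse_extensions` helper and the reduced parser are assumptions made for illustration and are not part of the tool.

```python
import argparse
import shlex


def parse_extensions(command_line):
    """Return the list of -e values from a shell-style argument string.

    Hypothetical helper: a cut-down parser that knows only the -e option.
    """
    parser = argparse.ArgumentParser()
    # Same shape as the option in this commit: repeated -e values accumulate.
    parser.add_argument("-e", "--extensions", dest="extensions",
                        default=None, action="append")
    # shlex.split applies POSIX shell quoting rules, so an unquoted "\\d"
    # arrives as the two characters backslash + d.
    args = parser.parse_args(shlex.split(command_line))
    return args.extensions


print(parse_extensions(r"-e .nc -e .h5"))   # two separate patterns
print(parse_extensions(r"-e PTM_\\d+"))     # unquoted, doubled backslash
print(parse_extensions(r"-e 'PTM_\d+'"))    # single-quoted, one backslash
print(parse_extensions(""))                 # no -e given
```

Under these assumptions, the unquoted `PTM_\\d+` and the single-quoted `'PTM_\d+'` produce the same pattern string, and omitting `-e` leaves the value as `None`, which is presumably when the documented defaults apply.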
15 changes: 12 additions & 3 deletions Subscriber.md
@@ -28,8 +28,8 @@ optional arguments:
 --offset OFFSET Flag used to shift timestamp. Units are in hours, e.g. 10 or -10.
 -m MINUTES, --minutes MINUTES
 How far back in time, in minutes, should the script look for data. If running this script as a cron, this value should be equal to or greater than how often your cron runs (default: 60 minutes).
--e EXTENSIONS, --extensions EXTENSIONS
-The extensions of products to download. Default is [.nc, .h5, .zip]
+-e EXTENSIONS, --extensions EXTENSIONS
+Regexps of extensions of products to download. Default is [.nc, .h5, .zip, .tar.gz, .tiff]
 --process PROCESS_CMD
 Processing command to run on each downloaded file (e.g., compression). Can be specified multiple times.
 --version Display script version information and exit.
@@ -193,13 +193,22 @@ Some collections have many files. To download a specific set of files, you can s

 ```
 -e EXTENSIONS, --extensions EXTENSIONS
-The extensions of products to download. Default is [.nc, .h5, .zip]
+Regexps of extensions of products to download. Default is [.nc, .h5, .zip, .tar.gz, .tiff]
 ```

 An example of `-e` usage (note that the `-e` option is additive):
 ```
 podaac-data-subscriber -c VIIRS_N20-OSPO-L2P-v2.61 -d ./data -e .nc -e .h5
 ```
+
+One may also specify a regular expression to select files. For example, the following are equivalent:
+
+`podaac-data-subscriber -c VIIRS_N20-OSPO-L2P-v2.61 -d ./data -e PTM_1 -e PTM_2 ... -e PTM_10 -sd 2020-06-01T00:46:02Z -ed 2020-07-01T00:46:02Z`
+
+and
+
+`podaac-data-subscriber -c VIIRS_N20-OSPO-L2P-v2.61 -d ./data -e PTM_\\d+ -sd 2020-06-01T00:46:02Z -ed 2020-07-01T00:46:02Z`
+
 ### run a post download process

 Using the `--process` option, you can run a simple command against the "just" downloaded file. This will take the format of "<command> <path/to/file>". This means you can run a command like `--process gzip` to gzip all downloaded files. We do not support more advanced processes at this time (piping, running a process on a directory, etc).
8 changes: 7 additions & 1 deletion subscriber/podaac_access.py
@@ -2,6 +2,7 @@
 import logging
 import netrc
 import subprocess
+import re
 from datetime import datetime
 from http.cookiejar import CookieJar
 from os import makedirs
@@ -28,7 +29,7 @@
 from datetime import datetime

 __version__ = "1.12.0"
-extensions = [".nc", ".h5", ".zip", ".tar.gz", ".tiff"]
+extensions = ["\\.nc", "\\.h5", "\\.zip", "\\.tar\\.gz", "\\.tiff"]
 edl = "urs.earthdata.nasa.gov"
 cmr = "cmr.earthdata.nasa.gov"
 token_url = "https://" + edl + "/api/users"
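Because the default extensions are now regular expressions, every literal dot in them has to be escaped. One way to avoid hand-written escapes, shown here only as a sketch, is to derive the patterns from plain suffixes with `re.escape`. The `literal_extensions` list and the `as_pattern` helper are hypothetical names that do not appear in the codebase.

```python
import re

# Plain suffixes as a user would write them (hypothetical name).
literal_extensions = [".nc", ".h5", ".zip", ".tar.gz", ".tiff"]


def as_pattern(suffix):
    """Escape regex metacharacters so the suffix matches literally."""
    return re.escape(suffix)


patterns = [as_pattern(s) for s in literal_extensions]
print(patterns)

# With only the first dot escaped, the second dot matches any character,
# so a name such as "archive.tarXgz" is accepted by mistake.
print(re.search(r"\.tar.gz$", "archive.tarXgz") is not None)
# The fully escaped pattern rejects it and still accepts the real suffix.
print(re.search(as_pattern(".tar.gz") + "$", "archive.tarXgz") is not None)
print(re.search(as_pattern(".tar.gz") + "$", "archive.tar.gz") is not None)
```

The trade-off is that escaped suffixes are purely literal; a user who wants something like `PTM_\d+` would still pass a raw pattern.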
@@ -531,6 +532,11 @@ def create_citation(collection_json, access_date):
     year = datetime.strptime(release_date, "%Y-%m-%dT%H:%M:%S.000Z").year
     return citation_template.format(creator=creator, year=year, title=title, version=version, doi_authority=doi_authority, doi=doi, access_date=access_date)

+def search_extension(extension, filename):
+    if re.search(extension + "$", filename) is not None:
+        return True
+    return False
+
 def create_citation_file(short_name, provider, data_path, token=None, verbose=False):
     # get collection umm-c METADATA
     params = [
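To make the matching rules of `search_extension` concrete, here is a small standalone sketch that copies its logic and probes a few inputs. The filenames are made up for illustration. It suggests three things worth knowing: the pattern is anchored only at the end of the name, matching is case-sensitive (unlike the `f.lower().endswith(...)` check this commit replaces), and a top-level alternation is only partly anchored unless it is wrapped in a group.

```python
import re


def search_extension(extension, filename):
    # Same logic as the function added in this commit.
    return re.search(extension + "$", filename) is not None


checks = [
    (r"PTM_\d+", "myfile.PTM_10"),        # suffix at the end of the name
    (r"PTM_\d+", "myfile.PTM_10.xml"),    # suffix not at the end
    (r"\.nc", "DATA.NC"),                 # upper-case name, plain pattern
    (r"(?i)\.nc", "DATA.NC"),             # inline flag for case-insensitivity
    (r"\.nc|\.h5", "file.nc.bak"),        # "$" binds to the last branch only
    (r"(\.nc|\.h5)", "file.nc.bak"),      # grouped, so "$" covers both
]
for pattern, name in checks:
    print(f"{pattern!r:16} {name!r:22} {search_extension(pattern, name)}")
```

Only the first, fourth and fifth checks match. The fifth is the surprising one: `\.nc|\.h5` becomes `\.nc|\.h5$`, so `.nc` is accepted anywhere in the name.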
6 changes: 3 additions & 3 deletions subscriber/podaac_data_downloader.py
@@ -1,7 +1,7 @@
 #!/usr/bin/env python3
 import argparse
 import logging
-import os
+import os, re
 import sys
 from datetime import datetime, timedelta
 from os import makedirs
@@ -86,7 +86,7 @@ def create_parser():
                         help="Flag used to shift timestamp. Units are in hours, e.g. 10 or -10.") # noqa E501

     parser.add_argument("-e", "--extensions", dest="extensions",
-                        help="The extensions of products to download. Default is [.nc, .h5, .zip, .tar.gz]",
+                        help="Regexps of extensions of products to download. Default is [.nc, .h5, .zip, .tar.gz, .tiff]",
                         default=None, action='append') # noqa E501

     # Get specific granule from the search
@@ -253,7 +253,7 @@ def run(args=None):
     filtered_downloads = []
     for f in downloads:
         for extension in extensions:
-            if f.lower().endswith(extension):
+            if pa.search_extension(extension, f):
                 filtered_downloads.append(f)

     downloads = filtered_downloads
6 changes: 3 additions & 3 deletions subscriber/podaac_data_subscriber.py
@@ -14,7 +14,7 @@
 # Accounts are free to create and take just a moment to set up.
 import argparse
 import logging
-import os
+import os, re
 import sys
 from datetime import datetime, timedelta
 from os import makedirs
@@ -92,7 +92,7 @@ def create_parser():
                         help="How far back in time, in minutes, should the script look for data. If running this script as a cron, this value should be equal to or greater than how often your cron runs.",
                         type=int, default=None) # noqa E501
     parser.add_argument("-e", "--extensions", dest="extensions",
-                        help="The extensions of products to download. Default is [.nc, .h5, .zip]", default=None,
+                        help="Regexps of extensions of products to download. Default is [.nc, .h5, .zip, .tar.gz, .tiff]", default=None,
                         action='append') # noqa E501
     parser.add_argument("--process", dest="process_cmd",
                         help="Processing command to run on each downloaded file (e.g., compression). Can be specified multiple times.",
@@ -260,7 +260,7 @@ def run(args=None):
     filtered_downloads = []
     for f in downloads:
         for extension in extensions:
-            if f.lower().endswith(extension):
+            if pa.search_extension(extension, f):
                 filtered_downloads.append(f)

     downloads = filtered_downloads
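One property of the nested filter loop in `run()` is that a file is appended once per matching pattern. With the non-overlapping defaults this makes no difference, but user-supplied regexes can overlap. The sketch below reproduces the loop next to a hypothetical `filter_once` variant built on `any()`. The URLs and patterns are made up, and whether duplicate entries matter downstream is not visible in this diff.

```python
import re


def search_extension(extension, filename):
    # Same end-anchored regex check as in podaac_access.py in this commit.
    return re.search(extension + "$", filename) is not None


def filter_with_loops(downloads, extensions):
    """Nested loops as in run(): one append per matching pattern."""
    filtered = []
    for f in downloads:
        for extension in extensions:
            if search_extension(extension, f):
                filtered.append(f)
    return filtered


def filter_once(downloads, extensions):
    """Hypothetical variant: keep each URL once if any pattern matches."""
    return [f for f in downloads
            if any(search_extension(ext, f) for ext in extensions)]


# Made-up URLs and deliberately overlapping patterns.
urls = [
    "https://example.com/a.nc",
    "https://example.com/a.PTM_1",
    "https://example.com/a.txt",
]
patterns = [r"\.nc", r"nc", r"PTM_\d+"]

print(filter_with_loops(urls, patterns))  # a.nc appears twice
print(filter_once(urls, patterns))        # each match appears once
```

Adding a `break` after the append in the original loop would have the same effect as the `any()` form while keeping the loop structure.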
9 changes: 9 additions & 0 deletions tests/test_subscriber.py
@@ -206,3 +206,12 @@ def validate(args):
     args2 = parser.parse_args(args)
     pa.validate(args2)
     return args2
+
+def test_extensions():
+    assert pa.search_extension('\\.tiff', "myfile.tiff") == True
+    assert pa.search_extension('\\.tiff', "myfile.tif") == False
+    assert pa.search_extension('\\.tiff', "myfile.gtiff") == False
+    assert pa.search_extension('PTM_\\d+', "myfile.PTM_1") == True
+    assert pa.search_extension('PTM_\\d+', "myfile.PTM_10") == True
+    assert pa.search_extension('PTM_\\d+', "myfile.PTM_09") == True
+    assert pa.search_extension('PTM_\\d+', "myfile.PTM_9") == True
