Extension regex (#121)

* extend -e option to handle regular expressions (#115) * Develop into Main (1.12.0) (#114) * Issues/91 (#92) * added citation creation tests and functionality to subscriber and downloader * added verbose option to create_citation_file command, previously hard coded * updated changelog (whoops) and fixed regression test: 1. Issue where the citation file now downloaded affected the counts 2. Issue where the logic for determining if a file modified time was changing or not was picking up the new citation file which _always_ gets rewritten to update the 'last accessed' date. * updated request to include exec_info in warning; fixed issue with params not being a dictionary caused errors * changed a warning to debug for citation file. fixed test issues * Enable debug logging during regression tests and set max parallel workflows to 2 * added output to pytest * fixed test to only look for downlaoded data files not citation file due to 'random' cmr errors when creating a citation. * added mock testing and retry on 503 * added 503 fixes Co-authored-by: Frank Greguska <Francis.Greguska@jpl.nasa.gov> * fixed issues where token was not proagated to CMR queries (#95) * Misc fixes (#101) * added ".tiff" to default extensions to address #100 * removed 'warning' message on not downloading all data to close #99 * updated help documentation for start/end times to close #79 * added version update, updates to CHANGELOG * added token get,delete, refresh and list operations * Revert "added token get,delete, refresh and list operations" This reverts commit 15aba90. * Update python-app.yml * updated poetry version Version matches build/test versions. * Issues/98 (#107) * added token get,delete, refresh and list operations * Revert "added token get,delete, refresh and list operations" This reverts commit 15aba90. * added EDL (not cmr-token) based get, list,delete, refresh token * updated token regression tests * updates and tests for subscriber moving to EDL. 
* marked tests as regression test * Update subscriber/podaac_data_downloader.py Co-authored-by: Frank Greguska <89428916+frankinspace@users.noreply.github.com> * Update subscriber/podaac_data_subscriber.py Co-authored-by: Frank Greguska <89428916+frankinspace@users.noreply.github.com> * Update subscriber/podaac_access.py Co-authored-by: Frank Greguska <89428916+frankinspace@users.noreply.github.com> * Update subscriber/podaac_access.py Co-authored-by: Frank Greguska <89428916+frankinspace@users.noreply.github.com> * Update subscriber/podaac_access.py Co-authored-by: Frank Greguska <89428916+frankinspace@users.noreply.github.com> * added exec info to errors, cleaned up some log statements Co-authored-by: Frank Greguska <89428916+frankinspace@users.noreply.github.com> * Issues/109 (#111) * Develop (#103) * Issues/91 (#92) * added citation creation tests and functionality to subscriber and downloader * added verbose option to create_citation_file command, previously hard coded * updated changelog (whoops) and fixed regression test: 1. Issue where the citation file now downloaded affected the counts 2. Issue where the logic for determining if a file modified time was changing or not was picking up the new citation file which _always_ gets rewritten to update the 'last accessed' date. * updated request to include exec_info in warning; fixed issue with params not being a dictionary caused errors * changed a warning to debug for citation file. fixed test issues * Enable debug logging during regression tests and set max parallel workflows to 2 * added output to pytest * fixed test to only look for downlaoded data files not citation file due to 'random' cmr errors when creating a citation. 
* added mock testing and retry on 503 * added 503 fixes Co-authored-by: Frank Greguska <Francis.Greguska@jpl.nasa.gov> * fixed issues where token was not proagated to CMR queries (#95) * Misc fixes (#101) * added ".tiff" to default extensions to address #100 * removed 'warning' message on not downloading all data to close #99 * updated help documentation for start/end times to close #79 * added version update, updates to CHANGELOG * added token get,delete, refresh and list operations * Revert "added token get,delete, refresh and list operations" This reverts commit 15aba90. * Update python-app.yml Co-authored-by: Frank Greguska <Francis.Greguska@jpl.nasa.gov> * updated poetry version Version matches build/test versions. * Update README.md * Update podaac_data_downloader.py Fixing for issues 109 - adding capability to download by granule-name * Update Downloader.md Fixed the help file * added changelog entries, regressiont ests * added poetry lock cleanup Co-authored-by: Frank Greguska <Francis.Greguska@jpl.nasa.gov> Co-authored-by: Frank Greguska <89428916+frankinspace@users.noreply.github.com> Co-authored-by: sureshshsv <45676320+sureshshsv@users.noreply.github.com> Co-authored-by: sureshshsv <suresh.vannan@jpl.nasa.gov> * added README information and updates (#113) * fixed pymock issues... again Co-authored-by: Frank Greguska <Francis.Greguska@jpl.nasa.gov> Co-authored-by: Frank Greguska <89428916+frankinspace@users.noreply.github.com> Co-authored-by: sureshshsv <45676320+sureshshsv@users.noreply.github.com> Co-authored-by: sureshshsv <suresh.vannan@jpl.nasa.gov> * extend -e option to handle regular expressions formerly, -e could not handle PTM_\d+ extensions without the user explicitly calling all of them. 
--------- Co-authored-by: mike-gangl <59702631+mike-gangl@users.noreply.github.com> Co-authored-by: Frank Greguska <Francis.Greguska@jpl.nasa.gov> Co-authored-by: Frank Greguska <89428916+frankinspace@users.noreply.github.com> Co-authored-by: sureshshsv <45676320+sureshshsv@users.noreply.github.com> Co-authored-by: sureshshsv <suresh.vannan@jpl.nasa.gov> * added dcoumentation and tests for regex * converted defaults to regexes, added gtiff test --------- Co-authored-by: Peter Mao <peter.mao@gmail.com> Co-authored-by: Frank Greguska <Francis.Greguska@jpl.nasa.gov> Co-authored-by: Frank Greguska <89428916+frankinspace@users.noreply.github.com> Co-authored-by: sureshshsv <45676320+sureshshsv@users.noreply.github.com> Co-authored-by: sureshshsv <suresh.vannan@jpl.nasa.gov>
podaac · Feb 3, 2023 · ad50178
1 parent 35c8fc2
commit ad50178

Showing 7 changed files with 50 additions and 12 deletions.
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -3,6 +3,10 @@ All notable changes to this project will be documented in this file.
 
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/)
 
+## Unreleased
+### Added
+- Added support for regular expressions in the `-e`/`--extensions` option. For example, `-e PTM_\\d+` matches data files such as `filename.PTM_1`, `filename.PTM_2`, and `filename.PTM_10`, so there is no need to list every combination (`-e PTM_1 -e PTM_2 ... -e PTM_10`).
+
 ## 1.12.0
 ### Fixed
 - Added EDL based token downloading, removing CMR tokens [98](https://github.com/podaac/data-subscriber/issues/98),

diff --git a/Downloader.md b/Downloader.md
@@ -33,7 +33,7 @@ optional arguments:
   -dy                   Flag to use start time (Year) of downloaded data for directory where data products will be downloaded.
   --offset OFFSET       Flag used to shift timestamp. Units are in hours, e.g. 10 or -10.
   -e EXTENSIONS, --extensions EXTENSIONS
-                        The extensions of products to download. Default is [.nc, .h5, .zip, .tar.gz]
+                      Regexps of extensions of products to download. Default is [.nc, .h5, .zip, .tar.gz, .tiff]
   -gr GRANULE, --granule-name GRANULE
   						The name of the granule to download. Only one granule name can be specified. Script will download all files matching similar granule name sans extension.
   --process PROCESS_CMD
@@ -219,13 +219,23 @@ Some collections have many files. To download a specific set of files, you can s
 
 ```
 -e EXTENSIONS, --extensions EXTENSIONS
-                       The extensions of products to download. Default is [.nc, .h5, .zip]
+                      Regexps of extensions of products to download. Default is [.nc, .h5, .zip, .tar.gz, .tiff]
 ```
 
 An example of the -e usage- note the -e option is additive:
 ```
 podaac-data-subscriber -c VIIRS_N20-OSPO-L2P-v2.61 -d ./data -e .nc -e .h5 -sd 2020-06-01T00:46:02Z -ed 2020-07-01T00:46:02Z
 ```
+
+One may also specify a regular expression to select files. For example, the following are equivalent:
+
+`podaac-data-subscriber -c VIIRS_N20-OSPO-L2P-v2.61 -d ./data -e PTM_1 -e PTM_2 ... -e PTM_10 -sd 2020-06-01T00:46:02Z -ed 2020-07-01T00:46:02Z`
+
+and
+
+`podaac-data-subscriber -c VIIRS_N20-OSPO-L2P-v2.61 -d ./data -e PTM_\\d+ -sd 2020-06-01T00:46:02Z -ed 2020-07-01T00:46:02Z`
+
+
 ### run a post download process
 
 Using the `--process` option, you can run a simple command agaisnt the "just" downloaded file. This will take the format of "<command> <path/to/file>". This means you can run a command like `--process gzip` to gzip all downloaded files. We do not support more advanced processes at this time (piping, running a process on a directory, etc).

diff --git a/Subscriber.md b/Subscriber.md
@@ -28,8 +28,8 @@ optional arguments:
   --offset OFFSET       Flag used to shift timestamp. Units are in hours, e.g. 10 or -10.
   -m MINUTES, --minutes MINUTES
                         How far back in time, in minutes, should the script look for data. If running this script as a cron, this value should be equal to or greater than how often your cron runs (default: 60 minutes).
-  -e EXTENSIONS, --extensions EXTENSIONS
-                        The extensions of products to download. Default is [.nc, .h5, .zip]
+  -e EXTENSIONS, --extensions EXTENSIONS
+                        Regexps of extensions of products to download. Default is [.nc, .h5, .zip, .tar.gz, .tiff]
   --process PROCESS_CMD
                         Processing command to run on each downloaded file (e.g., compression). Can be specified multiple times.
   --version             Display script version information and exit.
@@ -193,13 +193,22 @@ Some collections have many files. To download a specific set of files, you can s
 
 ```
 -e EXTENSIONS, --extensions EXTENSIONS
-                       The extensions of products to download. Default is [.nc, .h5, .zip]
+                      Regexps of extensions of products to download. Default is [.nc, .h5, .zip, .tar.gz, .tiff]
 ```
 
 An example of the -e usage- note the -e option is additive:
 ```
 podaac-data-subscriber -c VIIRS_N20-OSPO-L2P-v2.61 -d ./data -e .nc -e .h5
 ```
+
+One may also specify a regular expression to select files. For example, the following are equivalent:
+
+`podaac-data-subscriber -c VIIRS_N20-OSPO-L2P-v2.61 -d ./data -e PTM_1 -e PTM_2 ... -e PTM_10 -sd 2020-06-01T00:46:02Z -ed 2020-07-01T00:46:02Z`
+
+and
+
+`podaac-data-subscriber -c VIIRS_N20-OSPO-L2P-v2.61 -d ./data -e PTM_\\d+ -sd 2020-06-01T00:46:02Z -ed 2020-07-01T00:46:02Z`
+
 ### run a post download process
 
 Using the `--process` option, you can run a simple command agaisnt the "just" downloaded file. This will take the format of "<command> <path/to/file>". This means you can run a command like `--process gzip` to gzip all downloaded files. We do not support more advanced processes at this time (piping, running a process on a directory, etc).

diff --git a/subscriber/podaac_access.py b/subscriber/podaac_access.py
@@ -2,6 +2,7 @@
 import logging
 import netrc
 import subprocess
+import re
 from datetime import datetime
 from http.cookiejar import CookieJar
 from os import makedirs
@@ -28,7 +29,7 @@
 from datetime import datetime
 
 __version__ = "1.12.0"
-extensions = [".nc", ".h5", ".zip", ".tar.gz", ".tiff"]
+extensions = ["\\.nc", "\\.h5", "\\.zip", "\\.tar\\.gz", "\\.tiff"]
 edl = "urs.earthdata.nasa.gov"
 cmr = "cmr.earthdata.nasa.gov"
 token_url = "https://" + edl + "/api/users"
@@ -531,6 +532,11 @@ def create_citation(collection_json, access_date):
     year = datetime.strptime(release_date, "%Y-%m-%dT%H:%M:%S.000Z").year
     return citation_template.format(creator=creator, year=year, title=title, version=version, doi_authority=doi_authority, doi=doi, access_date=access_date)
 
+def search_extension(extension, filename):
+    if re.search(extension + "$", filename) is not None:
+        return True
+    return False
+
 def create_citation_file(short_name, provider, data_path, token=None, verbose=False):
     # get collection umm-c METADATA
     params = [
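
Review note on the matching semantics of `search_extension`: the function is a suffix-anchored `re.search`, so a few behaviours differ from the `f.lower().endswith(...)` check it replaces. The sketch below re-declares the function so that it runs standalone; the file names are invented for illustration and are not taken from any real collection.

```python
import re


def search_extension(extension, filename):
    # Same logic as the function in this hunk: the pattern must match
    # at the very end of the filename.
    return re.search(extension + "$", filename) is not None


# Only the end of the name has to match the pattern.
print(search_extension(r"PTM_\d+", "granule.PTM_10"))       # True
print(search_extension(r"PTM_\d+", "granule.PTM_10.md5"))   # False

# Matching is case-sensitive; the old check lowercased the name first.
print(search_extension(r"\.nc", "GRANULE.NC"))               # False
print("GRANULE.NC".lower().endswith(".nc"))                  # True

# An unescaped "." matches any character, so escape every literal dot.
print(search_extension(r"\.tar.gz", "archive.tarXgz"))       # True
print(search_extension(r"\.tar\.gz", "archive.tarXgz"))      # False

# "|" binds more loosely than the appended "$": group alternations.
print(search_extension(r"nc|h5", "nc_notes.txt"))            # True
print(search_extension(r"(?:nc|h5)", "nc_notes.txt"))        # False
```

Users who pass alternations through `-e` would likely want to wrap them in a group, since the `$` is appended to the whole pattern string.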

diff --git a/subscriber/podaac_data_downloader.py b/subscriber/podaac_data_downloader.py
@@ -1,7 +1,7 @@
 #!/usr/bin/env python3
 import argparse
 import logging
-import os
+import os, re
 import sys
 from datetime import datetime, timedelta
 from os import makedirs
@@ -86,7 +86,7 @@ def create_parser():
                         help="Flag used to shift timestamp. Units are in hours, e.g. 10 or -10.")  # noqa E501
 
     parser.add_argument("-e", "--extensions", dest="extensions",
-                        help="The extensions of products to download. Default is [.nc, .h5, .zip, .tar.gz]",
+                        help="Regexps of extensions of products to download. Default is [.nc, .h5, .zip, .tar.gz, .tiff]",
                         default=None, action='append')  # noqa E501
 
    # Get specific granule from the search
@@ -253,7 +253,7 @@ def run(args=None):
     filtered_downloads = []
     for f in downloads:
         for extension in extensions:
-            if f.lower().endswith(extension):
+            if pa.search_extension(extension, f):
                 filtered_downloads.append(f)
 
     downloads = filtered_downloads
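
Review note on the filter loop in this hunk: the nested loop appends a file once for every pattern it matches, so a name matched by two patterns ends up in the list twice. A minimal sketch of one alternative that keeps each file at most once is shown below; `filter_downloads` is a hypothetical helper, not a function in this project, and `search_extension` is re-declared so that the example runs on its own.

```python
import re


def search_extension(extension, filename):
    # Suffix-anchored regex match, as in subscriber/podaac_access.py.
    return re.search(extension + "$", filename) is not None


def filter_downloads(downloads, extensions):
    # Hypothetical helper: keep a file if ANY pattern matches it,
    # so overlapping patterns cannot add the same file twice.
    return [f for f in downloads
            if any(search_extension(e, f) for e in extensions)]


downloads = ["a.tar.gz", "b.nc", "c.txt"]
extensions = [r"\.gz", r"\.tar\.gz", r"\.nc"]

# Nested-loop behaviour, for comparison: "a.tar.gz" is appended twice.
nested = [f for f in downloads for e in extensions if search_extension(e, f)]
print(nested)                                   # ['a.tar.gz', 'a.tar.gz', 'b.nc']
print(filter_downloads(downloads, extensions))  # ['a.tar.gz', 'b.nc']
```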

diff --git a/subscriber/podaac_data_subscriber.py b/subscriber/podaac_data_subscriber.py
@@ -14,7 +14,7 @@
 # Accounts are free to create and take just a moment to set up.
 import argparse
 import logging
-import os
+import os, re
 import sys
 from datetime import datetime, timedelta
 from os import makedirs
@@ -92,7 +92,7 @@ def create_parser():
                         help="How far back in time, in minutes, should the script look for data. If running this script as a cron, this value should be equal to or greater than how often your cron runs.",
                         type=int, default=None)  # noqa E501
     parser.add_argument("-e", "--extensions", dest="extensions",
-                        help="The extensions of products to download. Default is [.nc, .h5, .zip]", default=None,
+                        help="Regexps of extensions of products to download. Default is [.nc, .h5, .zip, .tar.gz, .tiff]", default=None,
                         action='append')  # noqa E501
     parser.add_argument("--process", dest="process_cmd",
                         help="Processing command to run on each downloaded file (e.g., compression). Can be specified multiple times.",
@@ -260,7 +260,7 @@ def run(args=None):
     filtered_downloads = []
     for f in downloads:
         for extension in extensions:
-            if f.lower().endswith(extension):
+            if pa.search_extension(extension, f):
                 filtered_downloads.append(f)
 
     downloads = filtered_downloads
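
Review note on the `-e` option defined in this file: with `action='append'` and `default=None`, each `-e` adds one pattern to a list, and the list is `None` when the flag is absent. The sketch below shows that behaviour in isolation. It assumes the caller falls back to a module-level default list when no `-e` is given; `DEFAULT_EXTENSIONS` and `resolve_extensions` are illustrative names, not identifiers from this repository.

```python
import argparse

# Illustrative defaults, written as regexes with every literal dot escaped.
DEFAULT_EXTENSIONS = [r"\.nc", r"\.h5", r"\.zip", r"\.tar\.gz", r"\.tiff"]


def create_parser():
    parser = argparse.ArgumentParser()
    # Mirrors the option in this diff: repeated -e flags accumulate.
    parser.add_argument("-e", "--extensions", dest="extensions",
                        default=None, action="append")
    return parser


def resolve_extensions(argv):
    # Assumed fallback: use the defaults only when -e was never passed.
    args = create_parser().parse_args(argv)
    return args.extensions if args.extensions else DEFAULT_EXTENSIONS


# Two -e flags yield a two-element list, in the order given.
print(resolve_extensions(["-e", r"PTM_\d+", "-e", r"\.nc"]))
# No -e flag yields the default list.
print(resolve_extensions([]))
```

Under this assumption, passing any `-e` replaces the defaults rather than extending them, which is worth keeping in mind when combining a custom pattern with the standard extensions.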

diff --git a/tests/test_subscriber.py b/tests/test_subscriber.py
@@ -206,3 +206,12 @@ def validate(args):
     args2 = parser.parse_args(args)
     pa.validate(args2)
     return args2
+
+def test_extensions():
+    assert pa.search_extension('\\.tiff', "myfile.tiff")
+    assert not pa.search_extension('\\.tiff', "myfile.tif")
+    assert not pa.search_extension('\\.tiff', "myfile.gtiff")
+    assert pa.search_extension('PTM_\\d+', "myfile.PTM_1")
+    assert pa.search_extension('PTM_\\d+', "myfile.PTM_10")
+    assert pa.search_extension('PTM_\\d+', "myfile.PTM_09")
+    assert pa.search_extension('PTM_\\d+', "myfile.PTM_9")