Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CUMULUS-3368]: [User Contribution] Introducing allFilesPresent metadata field for Cumulus Collections. Fixes Issue #3193 #3492

Open
wants to merge 7 commits into
base: master
Choose a base branch
from
5 changes: 5 additions & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -48,6 +48,11 @@ Users/clients that do not make use of these endpoints will not be impacted.
- Added optional `maxDownloadTime` field to `provider` schema
- Added `max_download_time` column to PostgreSQL `providers` table
- Updated `@cumulus/ingest/lock` to check expired locks based on `provider.maxDownloadTime`
- **CUMULUS-3368**
- Added `allFilesPresent` field for `collection.meta` that defines whether or not all files
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since allFilesPresent is used in our task, suggest to add it to https://github.com/nasa/cumulus/blob/master/packages/api/lib/schemas.js#L211

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, please add the following to collection and granule mapping:
https://github.com/nasa/cumulus/blob/master/packages/es-client/config/mappings.json
"meta": {
"type": "object",
"dynamic": false
},
There is a ticket CUMULUS-3417 about mapping issue, and it can happen for collection record as well

should be present in a granule before it can be discovered. If `allFilesPresent` is true,
the granules missing files will be removed. Otherwise, the default behavior is discover-granules
will ignore missing files in granules.

### Changed

Expand Down
4 changes: 4 additions & 0 deletions docs/workflow_tasks/discover_granules.md
Original file line number Diff line number Diff line change
Expand Up @@ -48,6 +48,10 @@ A list of buckets with types that will be used to assign bucket targets based on

A Cumulus [collection](https://github.com/nasa/cumulus/blob/master/packages/api/lib/schemas.js) object. Used to define granule file groupings and granule metadata for discovered files. The collection object utilizes the collection type key to generate types in the output object on discovery.

##### Collection Meta

If the collection is configured with `collection.meta.allFilesPresent` set to `true`, the task will remove granules's missing files. Otherwise, the default behavior ignores missing files in granules.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
If the collection is configured with `collection.meta.allFilesPresent` set to `true`, the task will remove granules's missing files. Otherwise, the default behavior ignores missing files in granules.
If the collection is configured with `collection.meta.allFilesPresent` set to `true`, the task will remove granules with all the files. Otherwise, by default, task continue to process granules without all the files.

remove granules's missing files is confusing

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The wording above is still confusing. Do you mean something like the following?

If the collection is configured with collection.meta.allFilesPresent set to true, the task will include only granules where all files specified in the collection.files list are present, thus excluding granules where at least one file is absent. Otherwise, by default (i.e., collection.meta.allFilesPresent defaults to false), the task includes every granule, whether or not all of its files are present.


#### DuplicateGranuleHandling

A string configuration that configures the step to filter the granules discovered:
Expand Down
4 changes: 4 additions & 0 deletions packages/api/lib/schemas.js
Original file line number Diff line number Diff line change
Expand Up @@ -216,6 +216,10 @@ module.exports.collection = {
description: 'chunk size of the S3 multipart uploads for the collection',
type: 'number',
},
allFilesPresent: {
description: 'boolean flag used for DiscoverGranules task. If the collection is configured with `allFilesPresent` set to `true`, the task will remove granules\'s missing files. Otherwise, the default behavior ignores missing files in granules.',
type: 'boolean',
},
},
tags: {
title: 'Optional tags for search',
Expand Down
4 changes: 4 additions & 0 deletions packages/es-client/config/mappings.json
Original file line number Diff line number Diff line change
Expand Up @@ -31,6 +31,10 @@
"timestamp": {
"type": "date"
}
},
"meta": {
"type": "object",
"dynamic": false
}
},
"asyncOperation": {
Expand Down
103 changes: 83 additions & 20 deletions tasks/discover-granules/index.js
Original file line number Diff line number Diff line change
Expand Up @@ -25,14 +25,14 @@ const logger = (logOptions) => new Logger({
/**
* Fetch a list of files from the provider
*
* @param {Object} params
* @param {Object} params.providerConfig - the connection config for the provider
* @param {object} params
* @param {object} params.providerConfig - the connection config for the provider
* @param {bool} params.useList - flag to tell ftp server to use 'LIST'
* instead of 'STAT'
* @param {number} [params.httpRequestTimeout=300] - seconds for http provider
* to wait before timing out
* @param {string} params.path - the provider path to search
* @returns {Array<Object>} a list of discovered file objects
* @returns {Array<object>} a list of discovered file objects
*/
const listFiles = async (params) => {
const { providerConfig, useList, httpRequestTimeout = 300, path } = params;
Expand All @@ -56,7 +56,7 @@ const listFiles = async (params) => {
*
* @param {RegExp} granuleIdRegex - a regular expression where the first
* matching group is the granule id
* @param {Object} file - a file containing a `name` property
* @param {object} file - a file containing a `name` property
* @returns {string|null} returns the granule id, if one could be extracted,
* or null otherwise
*/
Expand All @@ -76,8 +76,8 @@ const granuleIdOfFile = curry(
*
* @param {RegExp} granuleIdRegex - a regular expression where the first
* matching group is the granule id
* @param {Array<Object>} files - a list of files containing a `name` property
* @returns {Object<Array>} the files, grouped by granule id
* @param {Array<object>} files - a list of files containing a `name` property
* @returns {object<Array>} the files, grouped by granule id
*/
const groupFilesByGranuleId = (granuleIdRegex, files) => {
const result = groupBy(files, granuleIdOfFile(granuleIdRegex));
Expand All @@ -88,10 +88,10 @@ const groupFilesByGranuleId = (granuleIdRegex, files) => {
/**
* Find the collection file config associated with the file
*
* @param {Array<Object>} collectionFileConfigs - a list of collection file
* @param {Array<object>} collectionFileConfigs - a list of collection file
* configs
* @param {Object} file - a file
* @returns {Object|undefined} returns the matching collection file config, or
* @param {object} file - a file
* @returns {object|undefined} returns the matching collection file config, or
* `undefined` if a matching config could not be found
*/
const getCollectionFileConfig = (collectionFileConfigs, file) =>
Expand All @@ -100,9 +100,9 @@ const getCollectionFileConfig = (collectionFileConfigs, file) =>
/**
* Check to see if a file has an associated collection file config
*
* @param {Array<Object>} collectionFileConfigs - a list of collection file
* @param {Array<object>} collectionFileConfigs - a list of collection file
* configs
* @param {Object} file - a file
* @param {object} file - a file
* @returns {boolean}
*/
const fileHasCollectionFileConfig = curry(
Expand Down Expand Up @@ -137,10 +137,10 @@ const returnAllFiles = (config) => {
* with the file. If one is found, add `bucket`, `url_path`, and `type`
* properties to the file.
*
* @param {Object} config - a config object containing `buckets` and
* @param {object} config - a config object containing `buckets` and
* `collection` properties
* @param {Object} file - a file object
* @returns {Object} a file object, possibly with three additional properties
* @param {object} file - a file object
* @returns {object} a file object, possibly with three additional properties
*/
const updateFileFromCollectionFileConfig = curry(
({ buckets, collection }, file) => {
Expand Down Expand Up @@ -225,6 +225,64 @@ const checkGranuleHasNoDuplicate = async (granuleId, duplicateHandling) => {
throw new Error(`Unexpected return from Private API lambda: ${JSON.stringify(response)}`);
};

/**
* Checks if all files from a collection are present for files from a granuleId
*
*
* @param {object} config - the event config
* @param {object} filesByGranuleId - Object with granuleId for keys with an array of
* matching files for each
* @param {string} granuleId - the Id of a granule
* @returns {boolean} - returns true if all file in collection config are present for granuleId
*/
const checkGranuleHasAllFiles = (config, filesByGranuleId, granuleId) => {
const granuleHasAllFiles = filesByGranuleId[granuleId].length >= config.collection.files.length;
Copy link
Contributor

@jennyhliu jennyhliu Oct 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If collection.files[].regex matchs multiple granule files, how do we make sure each regex has at least one granule file? The assertion is weak.

Copy link
Contributor

@jennyhliu jennyhliu Oct 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A granule can have 10 files but only 5 collection.files configuration.

return granuleHasAllFiles;
};

/**
* Filters out granules that are missing files from a collection
*
*
* @param {Object} params.filesByGranuleId - Object with granuleId for keys with an array of
* matching files for each
* @param {Object} params.config - the event config
* @returns {Array.string} - returns granuleIds that contain all files from a collection
*/
const filterGranulesWithoutAllFiles = ({ filesByGranuleId, config }) => {
const checkResults = Object.keys(filesByGranuleId).filter(
checkGranuleHasAllFiles.bind(this, config, filesByGranuleId)
);
return checkResults;
};

/**
* Handles granules without all files present according to the allFilesPresent boolean
*
* allFilesPresent = true : granules missing files will be removed
* allFilesPresent = false: ignore missing files in granules
*
* @param {object} params.filesByGranuleId - Object with granuleId for keys with an array of
* matching files for each
* @param {object} params.config - the event config
* @returns {object} returns filesByGranuleId with all the granules missing files removed
*/
const handleGranulesWithoutAllFilesPresent = ({ filesByGranuleId, config }) => {
let allFilesPresent = false;
if (config.collection.meta) {
allFilesPresent = config.collection.meta.allFilesPresent || false;
}
if (!allFilesPresent) {
return filesByGranuleId;
}
const filteredKeys = filterGranulesWithoutAllFiles({
granuleIds: Object.keys(filesByGranuleId),
filesByGranuleId,
config,
});
return pick(filesByGranuleId, filteredKeys);
};

/**
* Filters granule duplicates from a list of granuleIds according to the
* configuration in duplicateHandling:
Expand Down Expand Up @@ -258,16 +316,16 @@ const filterDuplicates = async ({ granuleIds, duplicateHandling, concurrency })
* error: Duplicates encountered will result in a thrown error
* replace, version: Duplicates will be ignored
*
* @param {Object} params - params object
* @param {Object} params.filesByGranuleId - Object with granuleId for keys with an array of
* @param {object} params - params object
* @param {object} params.filesByGranuleId - Object with granuleId for keys with an array of
* matching files for each
*
* @param {string} params.duplicateHandling - flag that defines this function's behavior
* (see description)
* @param {number} params.concurrency - granule duplicate filtering max concurrency
* (`skip` or `error` handling only)
*
* @returns {Object} returns filesByGranuleId with applicable duplciates removed
* @returns {object} returns filesByGranuleId with applicable duplciates removed
*/
const handleDuplicates = async ({ filesByGranuleId, duplicateHandling, concurrency }) => {
logger().info(`Running discoverGranules with duplicateHandling set to ${duplicateHandling}`);
Expand Down Expand Up @@ -314,6 +372,11 @@ const discoverGranules = async ({ config }) => {
concurrency: get(config, 'concurrency', 3),
});

filesByGranuleId = handleGranulesWithoutAllFilesPresent({
filesByGranuleId,
config,
});

const discoveredGranules = map(filesByGranuleId, buildGranule(config));

logger({ granules: discoveredGranules.map((g) => g.granuleId) }).info(`Discovered ${discoveredGranules.length} granules.`);
Expand All @@ -323,9 +386,9 @@ const discoverGranules = async ({ config }) => {
/**
* Lambda handler.
*
* @param {Object} event - a Cumulus Message
* @param {Object} context - an AWS Lambda context
* @returns {Promise<Object>} - Returns output from task.
* @param {object} event - a Cumulus Message
* @param {object} context - an AWS Lambda context
* @returns {Promise<object>} - Returns output from task.
* See schemas/output.json for detailed output schema
*/
const handler = async (event, context) => await runCumulusTask(discoverGranules, event, context);
Expand Down
124 changes: 122 additions & 2 deletions tasks/discover-granules/tests/index.js
Original file line number Diff line number Diff line change
Expand Up @@ -190,8 +190,8 @@ const discoverGranulesUsingS3 = (configure, assert = assertDiscoveredGranules) =
// State sample files
const files = [
'granule-1.nc', 'granule-1.nc.md5',
'granule-2.nc', 'granule-2.nc.md5',
'granule-3.nc', 'granule-3.nc.md5',
'granule-2.nc', 'granule-2.nc.md5', 'granule-2.nc.cmr.json',
'granule-3.nc', 'granule-3.nc.md5', 'granule-3.nc.cmr.json', 'granule-3.nc.xml',
];

config.sourceBucketName = randomString();
Expand All @@ -216,6 +216,126 @@ const discoverGranulesUsingS3 = (configure, assert = assertDiscoveredGranules) =
}
};

test('discover granules with allFilesPresent set to true',
discoverGranulesUsingS3(({ context: { event: { config } } }) => {
config.provider = {
id: 'MODAPS',
protocol: 's3',
host: config.sourceBucketName,
};
// Adds cmr.json files to collection config
config.collection.granuleIdExtraction = '^(.*)\\.(nc|nc\\.md5|nc\\.cmr\\.json)$';
config.collection.meta = {
allFilesPresent: true,
};
config.collection.files = [
{
regex: '.*.nc.cmr.json$',
sampleFileName: '20170603090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc.cmr.json',
bucket: 'protected',
type: 'metadata',
},
{
regex: '.*.nc$',
sampleFileName: '20170603090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc',
bucket: 'protected',
type: 'data',
},
{
regex: '.*.nc.md5$',
sampleFileName: '20170603090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc.md5',
bucket: 'public',
type: 'metadata',
},
];
}, async (t, output) => {
await validateOutput(t, output);
t.is(output.granules.length, 2);
output.granules.forEach(({ files }) => t.is(files.length, 3));
}));

test('discover granules task discovers no granules when allFilesPresent is set to true with files missing',
discoverGranulesUsingS3(({ context: { event: { config } } }) => {
config.provider = {
id: 'MODAPS',
protocol: 's3',
host: config.sourceBucketName,
};
// Adds cmr.json files to collection config
config.collection.granuleIdExtraction = '^(.*)\\.(nc|nc\\.md5|nc\\.cmr\\.json|nc\\.png)$';
config.collection.meta = {
allFilesPresent: true,
};
config.collection.files = [
{
regex: '.*.nc.cmr.json$',
sampleFileName: '20170603090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc.cmr.json',
bucket: 'protected',
type: 'metadata',
},
{
regex: '.*.nc$',
sampleFileName: '20170603090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc',
bucket: 'protected',
type: 'data',
},
{
regex: '.*.nc.md5$',
sampleFileName: '20170603090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc.md5',
bucket: 'public',
type: 'metadata',
},
{
regex: '.*.nc.png$',
sampleFileName: '20170603090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc.md5',
bucket: 'public',
type: 'metadata',
},
];
}, async (t, output) => {
await validateOutput(t, output);
t.is(output.granules.length, 0);
output.granules.forEach(({ files }) => t.is(files.length, 0));
}));

test('discover granules task discovers granules with allFilesPresent set to true with extra files for granule',
discoverGranulesUsingS3(({ context: { event: { config } } }) => {
config.provider = {
id: 'MODAPS',
protocol: 's3',
host: config.sourceBucketName,
};
// Adds cmr.json and .xml granuleId
config.collection.granuleIdExtraction = '^(.*)\\.(nc|nc\\.md5|nc\\.cmr\\.json|nc\\.xml)$';
config.collection.meta = {
allFilesPresent: true,
};
config.collection.files = [
{
regex: '.*.nc.cmr.json$',
sampleFileName: '20170603090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc.cmr.json',
bucket: 'protected',
type: 'metadata',
},
{
regex: '.*.nc$',
sampleFileName: '20170603090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc',
bucket: 'protected',
type: 'data',
},
{
regex: '.*.nc.md5$',
sampleFileName: '20170603090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc.md5',
bucket: 'public',
type: 'metadata',
},
];
}, async (t, output) => {
await validateOutput(t, output);
t.is(output.granules.length, 2);
output.granules.forEach(({ files }) => t.is(files.length, 3));
}));

test('discover granules using S3',
discoverGranulesUsingS3(({ context: { event: { config } } }) => {
config.provider = {
Expand Down