Ingest Service

The scope of this document is the capabilities needed for the AR7 FastTrack timeline.

The Ingest Service allows authorized users in the federation to publish and manage metadata in the federation-managed indices. The service is an authorized producer to the ESGF event stream service and publishes an event to that service for each valid request.

When a request is received, the ingest service performs the following steps (a sketch follows this list):

  • Authenticates the requestor
  • Checks whether the user is authorized to perform the type of publication action being requested
  • Performs a well-defined set of ESGF-specific schema/CV validations on the request's contents
  • Publishes an event to the ESGF event stream service for each valid request
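
The following is a minimal sketch of this flow; every name in it (authenticate, is_authorized, validate, publish_event) is an illustrative placeholder, not the actual implementation:

# Illustrative sketch only; all helpers below are placeholders, not the real ESGF code.
from dataclasses import dataclass


@dataclass
class IngestResult:
    accepted: bool
    reason: str = ""


def authenticate(token: str):
    """Placeholder: would introspect the OAuth 2.0 access token with the auth service."""
    return {"sub": "demo-user"} if token else None


def is_authorized(user: dict, action: str, payload: dict) -> bool:
    """Placeholder: would enforce the federation's policy for this publication action."""
    return action in {"publish", "retract", "replica-add", "replica-remove"}


def validate(action: str, payload: dict) -> list[str]:
    """Placeholder: would run the ESGF-specific schema/CV validation."""
    return [] if payload.get("collection") else ["missing collection (ESGF project)"]


def publish_event(action: str, payload: dict, user: dict) -> None:
    """Placeholder: would publish the event to the ESGF event stream service."""
    print(f"event: {action} by {user['sub']} on {payload.get('collection')}")


def handle_ingest_request(token: str, action: str, payload: dict) -> IngestResult:
    user = authenticate(token)                      # 1. authenticate the requestor
    if user is None:
        return IngestResult(False, "authentication failed")
    if not is_authorized(user, action, payload):    # 2. check the publication-action policy
        return IngestResult(False, "not authorized for this publication action")
    errors = validate(action, payload)              # 3. ESGF schema/CV validation
    if errors:
        return IngestResult(False, "; ".join(errors))
    publish_event(action, payload, user)            # 4. emit an event for each valid request
    return IngestResult(True)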

Publication actions

The publication actions that need to be supported, along with the policy that should be enforced for such requests, are documented in Publication Actions.

Ingest service API

The Ingest Service must provide the STAC API's Transaction extension.

Publication actions (see above) are mapped to actions supported by the STAC Transaction extension. We need details here about how each publication action is expressed as a STAC transaction. This looks like a non-trivial issue, especially for the replica added/removed actions.
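
As one possible illustration (not a decided mapping), a publish action could be expressed as a Transaction-extension POST of a STAC Item to the target collection, and a retract as a DELETE of that Item. The host name, collection ID, item fields, and token below are hypothetical:

import requests

# Hypothetical endpoint and payload; the real mapping of publication actions to
# STAC transactions is still to be worked out (see the note above).
INGEST_API = "https://ingest.example.org"   # placeholder host
COLLECTION = "CMIP6"                        # ESGF project used as the STAC Collection ID
TOKEN = "<access token>"                    # OAuth 2.0 token (see "Authentication" below)

item = {
    "type": "Feature",
    "stac_version": "1.0.0",
    "id": "CMIP6.CMIP.CSIRO-ARCCSS.ACCESS-CM2.historical.r1i1p1f1.Amon.clt.gn.v20191108",
    "collection": COLLECTION,
    "geometry": None,
    "properties": {"datetime": "2019-11-08T00:00:00Z"},
    "assets": {},
    "links": [],
}

# "publish" -> create the Item in the collection (Transaction extension: POST .../items)
resp = requests.post(
    f"{INGEST_API}/collections/{COLLECTION}/items",
    json=item,
    headers={"Authorization": f"Bearer {TOKEN}"},
    timeout=30,
)
resp.raise_for_status()

# "retract" might map to DELETE .../items/{item_id}; how replica add/remove is expressed
# remains the open question noted above.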

Authentication

Clients must authenticate with the auth service used by the Ingest Service.

  • The west Ingest Service uses Globus.
  • The east Ingest Service uses EGI Check-In.

The result of successful authentication should be an OAuth 2.0 access token scoped for use with the Ingest Service. ESGF has no other requirements on this access token.
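
For the Globus case, the following is a minimal sketch of obtaining such a token with the globus-sdk Python package, assuming a native-app registration; the client ID and scope string are placeholders, and the equivalent EGI Check-In flow would use a standard OIDC client instead:

import globus_sdk  # pip install globus-sdk

# Placeholders: the real client ID and the Ingest Service scope are not defined in this document.
CLIENT_ID = "<native app client id>"
INGEST_SCOPE = "<ingest service scope>"

client = globus_sdk.NativeAppAuthClient(CLIENT_ID)
client.oauth2_start_flow(requested_scopes=INGEST_SCOPE)

print("Log in at:", client.oauth2_get_authorize_url())
auth_code = input("Paste the authorization code here: ").strip()

tokens = client.oauth2_exchange_code_for_tokens(auth_code)
# The access token scoped for the Ingest Service is then presented as a Bearer token.
print(tokens.by_resource_server)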

Authorization

Authorization policies for each publication action are defined by the ESGF Federation.

We need details here about how these policies are defined and maintained. Is there a policy team or policy board? How are policy changes determined? How are policy changes communicated? Where is the source of truth for the current policy?

The Ingest Service must enforce the authorization policy for each publication action.

Each publication action MUST have a corresponding authorization policy, and that policy MAY be distinct from the policy for other actions.

The authorization policies MUST be implemented using the authorization services offered by the auth service used by the Ingest Service. (Globus offers a group service; EGI Check-In offers an attribute service.)

Authorization policies may utilize the following information from the target of the request (the part of the STAC catalog being modified):

  • A specific Collection ID (aka ESGF Project)
  • A specific ESGF Data Node ID identified in a replica add or remove action. (We need details here about how this is expressed in the STAC Transaction API request; see the corresponding note in the "Ingest service API" section.)
  • Is anything else about the request itself (not who is making the request) used as a basis for authorization decisions? It seems likely that publish/retract policies will need more than just the collection ID.

Authorization policies may utilize the following elements of the auth context of the STAC Transaction API request (a policy-check sketch follows this list):

  • User identity (an OAuth sub or preferred_username value)
  • Specific attribute IDs provided by EGI Check-In (attributes possessed by the requester)
  • Specific group IDs provided by Globus (groups of which the requester is a member)
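
A minimal sketch of how such a policy check might combine the target of the request with the requester's groups/attributes; the policy table, action names, and group IDs are hypothetical:

# Hypothetical policy table; where the real policies live is one of the open questions above.
POLICIES = {
    # (publication action, Collection ID) -> group/attribute IDs allowed to perform it
    ("publish", "CMIP6"): {"esgf:publishers:cmip6"},
    ("replica-add", "CMIP6"): {"esgf:replica-managers"},
    ("replica-remove", "CMIP6"): {"esgf:replica-managers"},
}

def is_authorized(action: str, collection: str, requester_credentials: set[str]) -> bool:
    """Allow the action if the requester holds any group (Globus) or attribute (Check-In) named in the policy."""
    allowed = POLICIES.get((action, collection), set())
    return bool(allowed & requester_credentials)

# A requester whose groups/attributes include the CMIP6 publishers group may publish to CMIP6.
print(is_authorized("publish", "CMIP6", {"esgf:publishers:cmip6"}))      # True
print(is_authorized("replica-add", "CMIP6", {"esgf:publishers:cmip6"}))  # False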

Request validation

We need details here about the specific schema and/or CV validation that must be performed. These details need to include what sources of truth we have for schemas and CVs, how they are communicated to the operations team, and how the operations team configures the ingest service to enforce them.

Note: This isn't just about publish actions! Replica added and replica removed actions also likely need some validation to make sure requests reference known data node IDs.
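
A minimal sketch of what such validation might look like if the controlled vocabularies and the list of known data node IDs are available as simple sets; all vocabularies and field names below are placeholders:

# Placeholder controlled vocabularies; the real CVs and data node registry are the open questions above.
KNOWN_PROJECTS = {"CMIP6", "CMIP7", "obs4MIPs"}
KNOWN_DATA_NODES = {"esgf-node.example.org"}

def validate_request(action: str, payload: dict) -> list[str]:
    """Return a list of validation errors; an empty list means the request is valid."""
    errors = []
    if payload.get("project") not in KNOWN_PROJECTS:
        errors.append(f"unknown project: {payload.get('project')!r}")
    if action in {"replica-add", "replica-remove"}:
        # Replica actions must reference a known data node ID (see the note above).
        if payload.get("data_node") not in KNOWN_DATA_NODES:
            errors.append(f"unknown data node: {payload.get('data_node')!r}")
    return errors

print(validate_request("replica-add", {"project": "CMIP6", "data_node": "esgf-node.example.org"}))  # []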

Payload format

This section describes the payload of the event published to the stream.

Initial work on a PoC for a Kafka-based event system has been defined here. Please see the kafka branch and the producer.py file for more on events. In my example, there is a simple AWS MSK Kafka cluster running with one broker, with events streaming to S3 (JSON) via AWS Firehose.

Steps to reproduce the event system (a minimal producer sketch follows these steps):

  1. Create a serverless AWS MSK cluster
    a. Create an S3 bucket and connect it to the cluster via Firehose
  2. Make sure a client machine is also running
  3. Clone the kafka branch from the above event-stream-service repo. Make sure to change the broker URL to your cluster.
  4. Grab some CMIP6 data, put it somewhere the script can read it, and run generate_events.py
    python generate_events.py --path \
    CMIP6/AerChemMIP/AS-RCEC/TaiESM1/hist-piNTCF/r1i1p1f1
    
    a. If you're running a consumer, you will also see the events stream to stdout.
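
For context, a minimal sketch of what a producer in the spirit of producer.py might look like, assuming the kafka-python package; the broker address, topic name, and event fields are placeholders rather than the PoC's actual configuration (an MSK serverless cluster would additionally need IAM/SASL settings):

import json
from kafka import KafkaProducer  # pip install kafka-python

# Placeholders: broker address and topic are not the actual MSK/PoC configuration.
BROKER = "b-1.example.kafka.us-east-1.amazonaws.com:9092"
TOPIC = "esgf-ingest-events"

producer = KafkaProducer(
    bootstrap_servers=[BROKER],
    value_serializer=lambda event: json.dumps(event).encode("utf-8"),
)

# A made-up event body for illustration; the real payload format is under discussion below.
event = {
    "action": "publish",
    "project": "CMIP6",
    "dataset_id": "CMIP6.AerChemMIP.AS-RCEC.TaiESM1.hist-piNTCF.r1i1p1f1",
}

producer.send(TOPIC, value=event)
producer.flush()  # block until the event has been delivered to the broker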

Here are two sample payloads:

Publishing an event to the stream

  • Publishing will use the native authentication of whichever event stream service is chosen. This does not preclude adding a layer in front of the service with a different security model/protocol in the future.

CEDA publishing

  • Our current STAC catalog is generated using the STAC Generator (a conceptual sketch of this pipeline follows the list)
    • It has a range of plugins to generate the records
    • For each message
      • The message’s path is used to select the relevant “Recipe”
      • The recipe defines a list of extraction methods that extract/augment metadata
      • The extraction methods are run in sequence
      • The produced record is passed to any output plugins
  • There is an example of its use in the repo here.
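
A conceptual sketch of the pipeline described above; this is not the STAC Generator's real API, and the recipe selection, extraction methods, and output plugin are simplified stand-ins:

# Conceptual sketch only; none of these names come from the STAC Generator itself.
def select_recipe(path: str) -> list:
    """Pick the list of extraction methods based on the message's path."""
    if path.startswith("/badc/cmip6/"):
        return [extract_ids_from_path, add_static_cmip6_properties]
    return []

def extract_ids_from_path(record: dict, path: str) -> dict:
    parts = path.strip("/").split("/")
    record["properties"]["variable_id"] = parts[-3]  # e.g. "clt"
    record["properties"]["version"] = parts[-1]      # e.g. "v20191108"
    return record

def add_static_cmip6_properties(record: dict, path: str) -> dict:
    record["properties"]["project"] = "CMIP6"
    return record

def output_stdout(record: dict) -> None:
    """Stand-in for an output plugin (e.g. writing to a STAC index)."""
    print(record)

def process_message(message: dict) -> None:
    path = message["uri"]
    record = {"id": path, "properties": {}}
    for method in select_recipe(path):  # run the extraction methods in sequence
        record = method(record, path)
    output_stdout(record)               # pass the produced record to the output plugin(s)

process_message({"uri": "/badc/cmip6/data/CMIP6/CMIP/CSIRO-ARCCSS/ACCESS-CM2/historical/r1i1p1f1/Amon/clt/gn/v20191108"})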

STAC Generator configuration

Example text file contents:

{"uri": "/badc/cmip6/data/CMIP6/CMIP/CSIRO-ARCCSS/ACCESS-CM2/historical/r1i1p1f1/Amon/clt/gn/v20191108"}
{"uri": "/badc/cmip6/data/CMIP6/CMIP/CSIRO-ARCCSS/ACCESS-CM2/historical/r1i1p1f1/Amon/hurs/gn/v20191108"}
{"uri": "/badc/cmip6/data/CMIP6/CMIP/CSIRO-ARCCSS/ACCESS-CM2/historical/r1i1p1f1/Amon/huss/gn/v20191108"}
{"uri": "/badc/cmip6/data/CMIP6/CMIP/CSIRO-ARCCSS/ACCESS-CM2/historical/r1i1p1f1/Amon/pr/gn/v20191108"}
{"uri": "/badc/cmip6/data/CMIP6/CMIP/CSIRO-ARCCSS/ACCESS-CM2/historical/r1i1p1f1/Amon/prsn/gn/v20191108"}
{"uri": "/badc/cmip6/data/CMIP6/CMIP/CSIRO-ARCCSS/ACCESS-CM2/historical/r1i1p1f1/Amon/psl/gn/v20191108"}
{"uri": "/badc/cmip6/data/CMIP6/CMIP/CSIRO-ARCCSS/ACCESS-CM2/historical/r1i1p1f1/Amon/rlds/gn/v20191108"}
{"uri": "/badc/cmip6/data/CMIP6/CMIP/CSIRO-ARCCSS/ACCESS-CM2/historical/r1i1p1f1/Amon/rlus/gn/v20191108"}
{"uri": "/badc/cmip6/data/CMIP6/CMIP/CSIRO-ARCCSS/ACCESS-CM2/historical/r1i1p1f1/Amon/rsds/gn/v20191108"}
{"uri": "/badc/cmip6/data/CMIP6/CMIP/CSIRO-ARCCSS/ACCESS-CM2/historical/r1i1p1f1/Amon/rsus/gn/v20191108"}
…

Notes

  • I thought we would have a universal payload (not STAC/Globus specific); a hypothetical example follows this list
  • The consumers would then read the message and map it into the desired format
  • For CEDA, I imagined this is where the STAC Generator would fit
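
To make the idea concrete, a hypothetical universal payload and a consumer-side mapping into a STAC-like record; every field name here is invented for illustration:

# Hypothetical "universal" event payload; the actual schema is an open question.
event = {
    "action": "publish",
    "project": "CMIP6",
    "dataset_id": "CMIP6.CMIP.CSIRO-ARCCSS.ACCESS-CM2.historical.r1i1p1f1.Amon.clt.gn.v20191108",
    "data_node": "esgf-node.example.org",
}

def to_stac_item(event: dict) -> dict:
    """Consumer-side mapping into a STAC-like record (e.g. where the STAC Generator would fit for CEDA)."""
    return {
        "type": "Feature",
        "id": event["dataset_id"],
        "collection": event["project"],
        "properties": {"data_node": event["data_node"]},
    }

print(to_stac_item(event))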

Use cases/ideas for timelines beyond AR7 FastTrack

Open questions