Ingest Service

The scope of this document is the capabilities needed for the AR7 FastTrack timeline.

The Ingest Service allows authorized users in the federation to publish and manage metadata in the federation-managed indices. The service is an authorized producer to the ESGF event stream service and publishes an event to that service for each valid request.

When a request is received, the ingest service performs the following steps (a sketch follows this list):

  • Authenticates the requestor
  • Checks whether the user is authorized to perform the type of publication action being requested
  • Performs a well-defined set of ESGF-specific schema/CV validations on the request's contents
  • Publishes an event to the ESGF event stream service for each valid request
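
The following is a minimal sketch of this flow; every name in it (authenticate, is_authorized, validate, publish_event) is an illustrative placeholder, not the actual implementation:

# Illustrative sketch only; all helpers below are placeholders, not the real ESGF code.
from dataclasses import dataclass


@dataclass
class IngestResult:
    accepted: bool
    reason: str = ""


def authenticate(token: str):
    """Placeholder: would introspect the OAuth 2.0 access token with the auth service."""
    return {"sub": "demo-user"} if token else None


def is_authorized(user: dict, action: str, payload: dict) -> bool:
    """Placeholder: would enforce the federation's policy for this publication action."""
    return action in {"publish", "retract", "replica-add", "replica-remove"}


def validate(action: str, payload: dict) -> list[str]:
    """Placeholder: would run the ESGF-specific schema/CV validation."""
    return [] if payload.get("collection") else ["missing collection (ESGF project)"]


def publish_event(action: str, payload: dict, user: dict) -> None:
    """Placeholder: would publish the event to the ESGF event stream service."""
    print(f"event: {action} by {user['sub']} on {payload.get('collection')}")


def handle_ingest_request(token: str, action: str, payload: dict) -> IngestResult:
    user = authenticate(token)                      # 1. authenticate the requestor
    if user is None:
        return IngestResult(False, "authentication failed")
    if not is_authorized(user, action, payload):    # 2. check the publication-action policy
        return IngestResult(False, "not authorized for this publication action")
    errors = validate(action, payload)              # 3. ESGF schema/CV validation
    if errors:
        return IngestResult(False, "; ".join(errors))
    publish_event(action, payload, user)            # 4. emit an event for each valid request
    return IngestResult(True)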

Publication actions

The publication actions that need to be supported, along with the policy that should be enforced for such requests, are documented in Publication Actions.

Ingest service API

The Ingest Service must provide the STAC API's Transaction extension.

Publication actions (see above) are mapped to actions supported by the STAC Transaction extension. We need details here about how each publication action is expressed as a STAC transaction. This looks like a non-trivial issue, especially for the replica added/removed actions.
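
As one possible illustration (not a decided mapping), a publish action could be expressed as a Transaction-extension POST of a STAC Item to the target collection, and a retract as a DELETE of that Item. The host name, collection ID, item fields, and token below are hypothetical:

import requests

# Hypothetical endpoint and payload; the real mapping of publication actions to
# STAC transactions is still to be worked out (see the note above).
INGEST_API = "https://ingest.example.org"   # placeholder host
COLLECTION = "CMIP6"                        # ESGF project used as the STAC Collection ID
TOKEN = "<access token>"                    # OAuth 2.0 token (see "Authentication" below)

item = {
    "type": "Feature",
    "stac_version": "1.0.0",
    "id": "CMIP6.CMIP.CSIRO-ARCCSS.ACCESS-CM2.historical.r1i1p1f1.Amon.clt.gn.v20191108",
    "collection": COLLECTION,
    "geometry": None,
    "properties": {"datetime": "2019-11-08T00:00:00Z"},
    "assets": {},
    "links": [],
}

# "publish" -> create the Item in the collection (Transaction extension: POST .../items)
resp = requests.post(
    f"{INGEST_API}/collections/{COLLECTION}/items",
    json=item,
    headers={"Authorization": f"Bearer {TOKEN}"},
    timeout=30,
)
resp.raise_for_status()

# "retract" might map to DELETE .../items/{item_id}; how replica add/remove is expressed
# remains the open question noted above.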

Authentication

Clients must authenticate with the auth service used by the Ingest Service.

  • The west Ingest Service uses Globus.
  • The east Ingest Service uses EGI Check-In.

The result of successful authentication should be an OAuth 2.0 access token scoped for use with the Ingest Service. ESGF has no other requirements on this access token.
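
For the Globus case, the following is a minimal sketch of obtaining such a token with the globus-sdk Python package, assuming a native-app registration; the client ID and scope string are placeholders, and the equivalent EGI Check-In flow would use a standard OIDC client instead:

import globus_sdk  # pip install globus-sdk

# Placeholders: the real client ID and the Ingest Service scope are not defined in this document.
CLIENT_ID = "<native app client id>"
INGEST_SCOPE = "<ingest service scope>"

client = globus_sdk.NativeAppAuthClient(CLIENT_ID)
client.oauth2_start_flow(requested_scopes=INGEST_SCOPE)

print("Log in at:", client.oauth2_get_authorize_url())
auth_code = input("Paste the authorization code here: ").strip()

tokens = client.oauth2_exchange_code_for_tokens(auth_code)
# The access token scoped for the Ingest Service is then presented as a Bearer token.
print(tokens.by_resource_server)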

Authorization

Authorization policies for each publication action are defined by the ESGF Federation.

We need details here about how these policies are defined and maintained. Is there a policy team or policy board? How are policy changes determined? How are policy changes communicated? Where is the source of truth for the current policy?

The Ingest Service must enforce the authorization policy for each publication action.

Each publication action MUST have a corresponding authorization policy, and that policy MAY be distinct from the policy for other actions.

The authorization policies MUST be implemented using the authorization services offered by the auth service used by the Ingest Service. (Globus offers a group service; EGI Check-In offers an attribute service.)

Authorization policies may utilize the following information from the target of the request (the part of the STAC catalog being modified):

  • A specific Collection ID (aka ESGF Project)
  • A specific ESGF Data Node ID identified in a replica add or remove action. (We need details here about how this is expressed in the STAC Transaction API request; see the corresponding note in the "Ingest service API" section.)
  • Is anything else about the request itself (not who is making the request) used as a basis for authorization decisions? It seems likely that publish/retract policies will need more than just the collection ID.

Authorization policies may utilize the following elements of the auth context of the STAC Transaction API request (a policy-check sketch follows this list):

  • User identity (an OAuth sub or preferred_username value)
  • Specific attribute IDs provided by EGI Check-In (attributes possessed by the requester)
  • Specific group IDs provided by Globus (groups of which the requester is a member)
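
A minimal sketch of how such a policy check might combine the target of the request with the requester's groups/attributes; the policy table, action names, and group IDs are hypothetical:

# Hypothetical policy table; where the real policies live is one of the open questions above.
POLICIES = {
    # (publication action, Collection ID) -> group/attribute IDs allowed to perform it
    ("publish", "CMIP6"): {"esgf:publishers:cmip6"},
    ("replica-add", "CMIP6"): {"esgf:replica-managers"},
    ("replica-remove", "CMIP6"): {"esgf:replica-managers"},
}

def is_authorized(action: str, collection: str, requester_credentials: set[str]) -> bool:
    """Allow the action if the requester holds any group (Globus) or attribute (Check-In) named in the policy."""
    allowed = POLICIES.get((action, collection), set())
    return bool(allowed & requester_credentials)

# A requester whose groups/attributes include the CMIP6 publishers group may publish to CMIP6.
print(is_authorized("publish", "CMIP6", {"esgf:publishers:cmip6"}))      # True
print(is_authorized("replica-add", "CMIP6", {"esgf:publishers:cmip6"}))  # False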

Request validation

We need details here about the specific schema and/or CV validation that must be performed. These details need to include what sources of truth we have for schemas and CVs, how they are communicated to the operations team, and how the operations team configures the ingest service to enforce them.

Note: This isn't just about publish actions! Replica added and replica removed actions also likely need some validation to make sure requests reference known data node IDs.
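
A minimal sketch of what such validation might look like if the controlled vocabularies and the list of known data node IDs are available as simple sets; all vocabularies and field names below are placeholders:

# Placeholder controlled vocabularies; the real CVs and data node registry are the open questions above.
KNOWN_PROJECTS = {"CMIP6", "CMIP7", "obs4MIPs"}
KNOWN_DATA_NODES = {"esgf-node.example.org"}

def validate_request(action: str, payload: dict) -> list[str]:
    """Return a list of validation errors; an empty list means the request is valid."""
    errors = []
    if payload.get("project") not in KNOWN_PROJECTS:
        errors.append(f"unknown project: {payload.get('project')!r}")
    if action in {"replica-add", "replica-remove"}:
        # Replica actions must reference a known data node ID (see the note above).
        if payload.get("data_node") not in KNOWN_DATA_NODES:
            errors.append(f"unknown data node: {payload.get('data_node')!r}")
    return errors

print(validate_request("replica-add", {"project": "CMIP6", "data_node": "esgf-node.example.org"}))  # []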

Payload format

This section describes the payload of the event published to the stream.

Initial work on a PoC for a Kafka-based event system has been defined here. Please see the kafka branch and the producer.py file for more on events. In my example, there is a simple AWS MSK Kafka cluster running with one broker, with events streaming to S3 (JSON) via AWS Firehose.

Steps to reproduce the event system (a minimal producer sketch follows these steps):

  1. Create a serverless AWS MSK cluster
    a. Create an S3 bucket and connect it to the cluster via Firehose
  2. Make sure a client machine is also running
  3. Clone the kafka branch from the above event-stream-service repo. Make sure to change the broker URL to your cluster.
  4. Grab some CMIP6 data, put it somewhere the script can read it, and run generate_events.py
    python generate_events.py --path \
    CMIP6/AerChemMIP/AS-RCEC/TaiESM1/hist-piNTCF/r1i1p1f1
    
    a. If you're running a consumer, you will also see the events stream to stdout.
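
For context, a minimal sketch of what a producer in the spirit of producer.py might look like, assuming the kafka-python package; the broker address, topic name, and event fields are placeholders rather than the PoC's actual configuration (an MSK serverless cluster would additionally need IAM/SASL settings):

import json
from kafka import KafkaProducer  # pip install kafka-python

# Placeholders: broker address and topic are not the actual MSK/PoC configuration.
BROKER = "b-1.example.kafka.us-east-1.amazonaws.com:9092"
TOPIC = "esgf-ingest-events"

producer = KafkaProducer(
    bootstrap_servers=[BROKER],
    value_serializer=lambda event: json.dumps(event).encode("utf-8"),
)

# A made-up event body for illustration; the real payload format is under discussion below.
event = {
    "action": "publish",
    "project": "CMIP6",
    "dataset_id": "CMIP6.AerChemMIP.AS-RCEC.TaiESM1.hist-piNTCF.r1i1p1f1",
}

producer.send(TOPIC, value=event)
producer.flush()  # block until the event has been delivered to the broker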

Here are two sample payloads:

Publishing an event to the stream

  • Publishing will use the native authentication of whichever event stream service is chosen. This does not preclude adding a layer in front of the service with a different security model/protocol in the future.

CEDA publishing

  • Our current STAC catalog is generated using the STAC Generator (a conceptual sketch of this pipeline follows the list)
    • It has a range of plugins to generate the records
    • For each message
      • The message’s path is used to select the relevant “Recipe”
      • The recipe defines a list of extraction methods that extract/augment metadata
      • The extraction methods are run in sequence
      • The produced record is passed to any output plugins
  • There is an example of its use in the repo here.
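
A conceptual sketch of the pipeline described above; this is not the STAC Generator's real API, and the recipe selection, extraction methods, and output plugin are simplified stand-ins:

# Conceptual sketch only; none of these names come from the STAC Generator itself.
def select_recipe(path: str) -> list:
    """Pick the list of extraction methods based on the message's path."""
    if path.startswith("/badc/cmip6/"):
        return [extract_ids_from_path, add_static_cmip6_properties]
    return []

def extract_ids_from_path(record: dict, path: str) -> dict:
    parts = path.strip("/").split("/")
    record["properties"]["variable_id"] = parts[-3]  # e.g. "clt"
    record["properties"]["version"] = parts[-1]      # e.g. "v20191108"
    return record

def add_static_cmip6_properties(record: dict, path: str) -> dict:
    record["properties"]["project"] = "CMIP6"
    return record

def output_stdout(record: dict) -> None:
    """Stand-in for an output plugin (e.g. writing to a STAC index)."""
    print(record)

def process_message(message: dict) -> None:
    path = message["uri"]
    record = {"id": path, "properties": {}}
    for method in select_recipe(path):  # run the extraction methods in sequence
        record = method(record, path)
    output_stdout(record)               # pass the produced record to the output plugin(s)

process_message({"uri": "/badc/cmip6/data/CMIP6/CMIP/CSIRO-ARCCSS/ACCESS-CM2/historical/r1i1p1f1/Amon/clt/gn/v20191108"})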

STAC Generator configuration

Example text file contents:

{"uri": "/badc/cmip6/data/CMIP6/CMIP/CSIRO-ARCCSS/ACCESS-CM2/historical/r1i1p1f1/Amon/clt/gn/v20191108"}
{"uri": "/badc/cmip6/data/CMIP6/CMIP/CSIRO-ARCCSS/ACCESS-CM2/historical/r1i1p1f1/Amon/hurs/gn/v20191108"}
{"uri": "/badc/cmip6/data/CMIP6/CMIP/CSIRO-ARCCSS/ACCESS-CM2/historical/r1i1p1f1/Amon/huss/gn/v20191108"}
{"uri": "/badc/cmip6/data/CMIP6/CMIP/CSIRO-ARCCSS/ACCESS-CM2/historical/r1i1p1f1/Amon/pr/gn/v20191108"}
{"uri": "/badc/cmip6/data/CMIP6/CMIP/CSIRO-ARCCSS/ACCESS-CM2/historical/r1i1p1f1/Amon/prsn/gn/v20191108"}
{"uri": "/badc/cmip6/data/CMIP6/CMIP/CSIRO-ARCCSS/ACCESS-CM2/historical/r1i1p1f1/Amon/psl/gn/v20191108"}
{"uri": "/badc/cmip6/data/CMIP6/CMIP/CSIRO-ARCCSS/ACCESS-CM2/historical/r1i1p1f1/Amon/rlds/gn/v20191108"}
{"uri": "/badc/cmip6/data/CMIP6/CMIP/CSIRO-ARCCSS/ACCESS-CM2/historical/r1i1p1f1/Amon/rlus/gn/v20191108"}
{"uri": "/badc/cmip6/data/CMIP6/CMIP/CSIRO-ARCCSS/ACCESS-CM2/historical/r1i1p1f1/Amon/rsds/gn/v20191108"}
{"uri": "/badc/cmip6/data/CMIP6/CMIP/CSIRO-ARCCSS/ACCESS-CM2/historical/r1i1p1f1/Amon/rsus/gn/v20191108"}
…

Notes

  • I thought we would have a universal payload (not STAC/Globus specific); a hypothetical example follows this list
  • The consumers would then read the message and map it into the desired format
  • For CEDA, I imagined this is where the STAC Generator would fit
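
To make the idea concrete, a hypothetical universal payload and a consumer-side mapping into a STAC-like record; every field name here is invented for illustration:

# Hypothetical "universal" event payload; the actual schema is an open question.
event = {
    "action": "publish",
    "project": "CMIP6",
    "dataset_id": "CMIP6.CMIP.CSIRO-ARCCSS.ACCESS-CM2.historical.r1i1p1f1.Amon.clt.gn.v20191108",
    "data_node": "esgf-node.example.org",
}

def to_stac_item(event: dict) -> dict:
    """Consumer-side mapping into a STAC-like record (e.g. where the STAC Generator would fit for CEDA)."""
    return {
        "type": "Feature",
        "id": event["dataset_id"],
        "collection": event["project"],
        "properties": {"data_node": event["data_node"]},
    }

print(to_stac_item(event))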

Use cases/ideas for timelines beyond AR7 FastTrack

Open questions