Skip to content
This repository has been archived by the owner on Jun 9, 2022. It is now read-only.
/ data-extractor Public archive

Simple data extractor/shuffler

License

Notifications You must be signed in to change notification settings

hmcts/data-extractor

Folders and files

NameName
Last commit message
Last commit date

Latest commit

002e0af · Jun 9, 2022

History

91 Commits
Feb 26, 2020
Aug 25, 2021
Jan 10, 2022
Nov 29, 2021
Nov 29, 2019
Jan 10, 2022
May 7, 2019
Nov 29, 2019
Jan 29, 2020
Mar 31, 2020
Apr 30, 2019
Jun 9, 2022
Jan 28, 2021
Jan 10, 2022
Jan 28, 2022
Nov 29, 2021
Nov 29, 2021
Oct 1, 2020

Repository files navigation

data-extractor

Archive notice

This repository is archive now, SDP ingestion has moved to Azure data factory https://github.com/hmcts/mi-data-ingestion

Simple data extractor/shuffler

Runs a sql query against a database and saves the results in an azure storage container.

It can export the results as one of:

  • csv
  • json
  • jsonlines

It obtains its configuration from the following environment variables:

  • ETL_DB_URL: jdbc connection url (e.g. "jdbc:postgresql://localhost:5432")
  • ETL_DB_USER_FILE: file containing the db username relative to "/mnt/secrets" (e.g. data-extractor/aat-ccd-user)
  • ETL_DB_PASSWORD_FILE: file containing the db password relative to "/mnt/secrets" (e.g. data-extractor/aat-ccd-pwd)
  • ETL_MSI_CLIENT_ID: pod identity client id to get credentials from keyvault and write access to blob storage.
  • ETL_CONNECTIONSTRING: This is an alternative authentication system in case manage identity is not setup in the cluster.
  • ETL_ACCOUNT: Azure storage account where output should be saved (e.g. "devstoreaccount1")
  • EXTRACTION_CASETYPES: Configuration list to extract each case type. This list can contain the following values:
    • CONTAINER: Azure storage container where output should be saved (e.g. "testcontainer")
    • TYPE: output file type. One of: jsonlines, csv, json (default "jsonlines")
    • PREFIX: prefix for the output file (e.g. "test01").
    • CASETYPE: CCD case type to extract

The 2 values: ETL_DB_USER_FILE and ETL_DB_PASSWORD_FILE are useful if the username and password are retrieved from Azure keyvault and exposed as flexvolumes. The same username and password can alternatively be passed as environment variables (ETL_DB_USER and ETL_DB_PASSWORD).

The output file obtained contains all the records generated by the query. File naming follows this convention: <prefix>-<datetime>.<type>

Helm chart

The easiest way to run a job is by using the included helm chart which is based on chart-job. This can be done running the following command: helm install hmcts/data-extractor-job --name data-extractor-job-001 --namespace mi -f job-values.yaml --wait where job-values.yaml is:

job:
  image: hmcts.azurecr.io/hmcts/data-extractor-job:prod-f888e665
  aadIdentityName: mi
  keyVaults:
    "data-extractor":
      resourceGroup: data-extractor 
      secrets:
        - ccdro-user
        - ccdro-password
        - appinsights-instrumentationkey
  labels:
    app.kubernetes.io/instance : data-extractor-job-001
    app.kubernetes.io/name: data-extractor-job
  environment:
    ETL_DB_URL: jdbc:postgresql://ccd-data-store-api-postgres-db-aat.postgres.database.azure.com:5432/ccd_data_store
    ETL_DB_USER_FILE: data-extractor/ccdro-user
    ETL_DB_PASSWORD_FILE: data-extractor/ccdro-password
    ETL_SQL: >
      SELECT id, created_date, event_id, summary, description, user_id, case_data_id,
      case_type_id, case_type_version, state_id, user_first_name, user_last_name,
      event_name, state_name, security_classification
      FROM case_event
      WHERE created_date >= (current_date-1 + time '00:00')
      AND created_date < (current_date + time '00:00')
      ORDER BY created_date ASC;
    ETL_ACCOUNT: midatastg
    ETL_MSI_CLIENT_ID: 1461ff03-675c-423c-95a4-fb50d31254ff
    EXTRACTION_CASETYPES_0__CONTAINER: "test-container"
    EXTRACTION_CASETYPES_0__CASETYPE: "Caveat"
    EXTRACTION_CASETYPES_0__TYPE: "jsonlines"
    EXTRACTION_CASETYPES_0__PREFIX: "CCD-TEST"
global:
  job:
    kind: Job
  subscriptionId: "1c4f0704-a29e-403d-b719-b90c34ef14c9"
  tenantId: "531ff96d-0ae9-462a-8d2d-bec7c0b42082"
  environment: aat

For an example of how to run this using flux please see: flux github repo

About

Simple data extractor/shuffler

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published