Skip to content

qzhou-hmcts/data-extractor

 
 

Repository files navigation

data-extractor

Simple data extractor/shuffler

Runs a sql query against a database and saves the results in an azure storage container.

It can export the results as one of:

  • csv
  • json
  • jsonlines

It obtains its configuration from the following environment variables:

  • ETL_DB_URL: jdbc connection url (e.g. "jdbc:postgresql://localhost:5432")
  • ETL_DB_USER_FILE: file containing the db username relative to "/mnt/secrets" (e.g. data-extractor/aat-ccd-user)
  • ETL_DB_PASSWORD_FILE: file containing the db password relative to "/mnt/secrets" (e.g. data-extractor/aat-ccd-pwd)
  • ETL_SQL: sql statement to execute (e.g. "SELECT ID FROM parent WHERE ID = 1")
  • ETL_MSI_CLIENT_ID: pod identity client id to get credentials from keyvault and write access to blob storage.
  • ETL_ACCOUNT: Azure storage account where output should be saved (e.g. "devstoreaccount1")
  • ETL_CONTAINER: Azure storage container where output should be saved (e.g. "testcontainer")
  • ETL_FILE_TYPE: output file type. One of: jsonlines, csv, json (default "jsonlines")
  • ETL_FILE_PREFIX: prefix for the output file (e.g. "test01").

The 2 values: ETL_DB_USER_FILE and ETL_DB_PASSWORD_FILE are useful if the username and password are retrieved from Azure keyvault and exposed as flexvolumes. The same username and password can alternatively be passed as environment variables (ETL_DB_USER and ETL_DB_PASSWORD).

The output file obtained contains all the records generated by the query. File naming follows this convention: <prefix>-<datetime>.<type>

Helm chart

The easiest way to run a job is by using the included helm chart which is based on chart-job. This can be done running the following command: helm install hmcts/data-extractor-job --name data-extractor-job-001 --namespace mi -f job-values.yaml --wait where job-values.yaml is:

job:
  image: hmcts.azurecr.io/hmcts/data-extractor-job:prod-f888e665
  aadIdentityName: mi
  keyVaults:
    "data-extractor":
      resourceGroup: data-extractor
      secrets:
        - aat-ccdro-user
        - aat-ccdro-password
  labels:
    app.kubernetes.io/instance : data-extractor-job-001
    app.kubernetes.io/name: data-extractor-job
  environment:
    ETL_DB_URL: jdbc:postgresql://ccd-data-store-api-postgres-db-aat.postgres.database.azure.com:5432/ccd_data_store
    ETL_DB_USER_FILE: data-extractor/aat-ccdro-user
    ETL_DB_PASSWORD_FILE: data-extractor/aat-ccdro-password
    ETL_SQL: >
      SELECT id, created_date, event_id, summary, description, user_id, case_data_id,
      case_type_id, case_type_version, state_id, user_first_name, user_last_name,
      event_name, state_name, security_classification
      FROM case_event
      WHERE created_date >= (current_date-1 + time '00:00')
      AND created_date < (current_date + time '00:00')
      ORDER BY created_date ASC;
    ETL_ACCOUNT: midatastg
    ETL_CONTAINER: probate
    ETL_FILE_TYPE: jsonlines
    ETL_FILE_PREFIX: test01
    ETL_MSI_CLIENT_ID: 1461ff03-675c-423c-95a4-fb50d31254ff
global:
  job:
    kind: Job
  subscriptionId: "1c4f0704-a29e-403d-b719-b90c34ef14c9"
  tenantId: "531ff96d-0ae9-462a-8d2d-bec7c0b42082"
  environment: aat

For an example of how to run this using flux please see: flux github repo

About

Simple data extractor/shuffler

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • Java 98.8%
  • Other 1.2%