Update readMe (#178)
* Update how-to and diagrams

* Fix codacy issues

* Fix codacy issues

* Add line about empty bucket first

* Fix codacy issues

* Fix codacy issues
monaullah authored Sep 30, 2024
1 parent b028a03 commit 4eb18f7
Showing 4 changed files with 79 additions and 70 deletions.
82 changes: 12 additions & 70 deletions README.md
````diff
@@ -1,80 +1,22 @@
 # NVA Data Report API
 
-This repository contains the NVA data report API.
+This repository contains functions for generating csv reports of data from NVA.
+See [reportTypes](documentation/reportTypes.md) for a list of reports and data types.
 
-## How to run a bulk upload
-
-The steps below can be outlined briefly as:
-
-- Pre-run
-    - Stop incoming live-update events
-    - Delete data from previous runs
-    - Delete all data in database
-- Bulk upload
-    - Generate batches of document keys for upload
-    - Transform the data to a format compatible with the bulk-upload action
-    - Initiate bulk upload
-    - Verify data integrity
-- Post-run
-    - Start incoming live-update events
-
-### Pre-run steps
+## Architectural overview
 
-1. Remove all objects from S3 bucket `loader-input-files-{accountName}`
-2. Turn off S3 event notifications for bucket `persisted-resources-{accountName}`.
-   In the AWS console, go to
-   <br>_S3_ -> _persisted-resources-{accountName}_ -> _Properties_ -> _Amazon EventBridge_ ->
-   _Edit_ -> _Off_
-3. Press `ResetDatabaseButton` (trigger `DatabaseResetHandler`). This might take around a minute
-   to complete.
-4. Verify that the database is empty. You can use a SageMaker notebook to query the database*.
-   Example sparql queries:
-   ```
-   SELECT (COUNT(DISTINCT ?g) as ?gCount) WHERE {GRAPH ?g {?s ?p ?o}}
-   ```
-   or
-   ```
-   SELECT ?g ?s ?p ?o WHERE {GRAPH ?g {?s ?p ?o}} LIMIT 100
-   ```
+![Architecture](documentation/images/data_export_overview.png)
 
-### Bulk upload steps
+## Integration overview
 
-1. Generate key batches for both locations: `resources` and `nvi-candidates`. Manually trigger
-   `GenerateKeyBatchesHandler` with the following input:
-   ```json
-   {
-     "detail": {
-       "location": "resources|nvi-candidates"
-     }
-   }
-   ```
-2. Verify that `GenerateKeyBatchesHandler` is done processing (i.e. check logs) and that key
-   batches have been generated in S3 bucket `data-report-key-batches-{accountName}`
-3. Trigger `BulkTransformerHandler`
-4. Verify that `BulkTransformerHandler` is done processing (i.e. check logs) and that nquads
-   have been generated in S3 bucket `loader-input-files-{accountName}`
-5. Trigger `BulkDataLoader`
-6. To check progress of the bulk upload to Neptune, trigger `BulkDataLoader` with the following
-   input:
-   ```json
-   {
-     "loadId": "{copy loadId UUID from test log}"
-   }
-   ```
-7. Verify that the expected count is in the database. Query for counting distinct named graphs:
-   ```
-   SELECT (COUNT(DISTINCT ?g) as ?gCount) WHERE {GRAPH ?g {?s ?p ?o}}
-   ```
+The s3 bucket `data-report-csv-export-{accountName}` (defined in template) is
+set up as a data source in Databricks (in another AWS account) following the
+databricks [guide _Create a storage credential for connecting to AWS S3_](https://docs.databricks.com/en/connect/unity-catalog/storage-credentials.html#create-a-storage-credential-for-connecting-to-aws-s3).
+This is how the data platform accesses files from
+`data-report-csv-export-{accountName}`:
 
-### Post-run steps
+![Databricks integration](documentation/images/data_report_aws_databricks_storage_credential.png)
 
-1. Turn on S3 event notifications for bucket `persisted-resources-{accountName}`.
-   In the AWS console, go to
-   <br>_S3_ -> _persisted-resources-{accountName}_ -> _Properties_ -> _Amazon EventBridge_ ->
-   _Edit_ -> _On_
+## How-to guides
 
-*Note: You can use SageMaker notebook to query the database. Notebook can be opened from the AWS
-console through _SageMaker_ -> _Notebooks_ -> _Notebook instances_ -> _Open JupyterLab_
+- [Run bulk export](documentation/bulkExport.md)
````
67 changes: 67 additions & 0 deletions documentation/bulkExport.md
@@ -0,0 +1,67 @@
# How to run a bulk export

This process is intended as an "initial export"/database dump for the first
export to the data platform. It can also be used if changes in the data model
require a full re-import of the data.

The steps below can be outlined briefly as:

- Pre-run
    - Stop incoming live-update events
    - Delete data from previous runs
- Bulk export
    - Generate batches of document keys for export
    - Transform the key batches to csv files
- Post-run
    - Start incoming live-update events

## Pre-run steps

1. Turn off S3 event notifications for bucket `persisted-resources-{accountName}`.
   In the AWS console, go to

   _S3_ -> _persisted-resources-{accountName}_ -> _Properties_ ->
   _Amazon EventBridge_ -> _Edit_ -> _Off_
2. Remove all objects from S3 bucket `data-report-csv-export-{accountName}`
   (both steps can also be scripted; see the sketch after this list)
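
For repeatability, here is a minimal boto3 sketch of the two pre-run steps. It
assumes default AWS credentials and uses an illustrative account name; an empty
notification configuration switches off the EventBridge integration, and a
versioned bucket would additionally need its object versions deleted.

```python
# Minimal sketch of the pre-run steps. Assumption: "example" stands in
# for the real account name; adjust bucket names to your environment.
import boto3

ACCOUNT_NAME = "example"  # hypothetical placeholder

# Step 1: turn off S3 event notifications. An empty configuration
# removes all notification settings, including Amazon EventBridge.
s3 = boto3.client("s3")
s3.put_bucket_notification_configuration(
    Bucket=f"persisted-resources-{ACCOUNT_NAME}",
    NotificationConfiguration={},
)

# Step 2: remove all current objects from the csv export bucket.
bucket = boto3.resource("s3").Bucket(f"data-report-csv-export-{ACCOUNT_NAME}")
bucket.objects.all().delete()
```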

## Bulk export steps

1. Generate key batches for both locations: `resources` and `nvi-candidates`.
   Manually trigger `GenerateKeyBatchesHandler` once per location with the
   following input (choose one of the two `location` values):

   ```json
   {
     "detail": {
       "location": "resources|nvi-candidates"
     }
   }
   ```

2. Verify that `GenerateKeyBatchesHandler` is done processing (i.e. check logs)
   and that key batches have been generated in S3 bucket
   `data-report-key-batches-{accountName}`
3. Process the key batches and generate csv files for both locations:
   `resources` and `nvi-candidates`.
   Manually trigger `CsvBulkTransformerHandler` once per location with the
   following input:

   ```json
   {
     "detail": {
       "location": "resources|nvi-candidates"
     }
   }
   ```

4. Verify that `CsvBulkTransformerHandler` is done processing (i.e. check logs)
   and that csv files have been generated in S3 bucket
   `data-report-csv-export-{accountName}`.
   A scripted version of these steps is sketched below.
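
The four steps above can also be scripted. A boto3 sketch, under the
assumption that the deployed Lambda function names match the handler names
used here (real deployments often add a stack prefix) and that the account
name is a placeholder:

```python
# Sketch of the bulk export steps. Assumptions: function and bucket
# names are illustrative; adjust them to your deployment.
import json
import boto3

ACCOUNT_NAME = "example"  # hypothetical placeholder
LOCATIONS = ["resources", "nvi-candidates"]

lambda_client = boto3.client("lambda")
s3 = boto3.client("s3")


def trigger(function_name: str, location: str) -> None:
    """Invoke a handler once for one location (steps 1 and 3)."""
    payload = {"detail": {"location": location}}
    lambda_client.invoke(
        FunctionName=function_name,
        InvocationType="Event",  # async; completion is checked via logs
        Payload=json.dumps(payload).encode("utf-8"),
    )


def count_objects(bucket: str) -> int:
    """Count objects in a bucket to verify output (steps 2 and 4)."""
    paginator = s3.get_paginator("list_objects_v2")
    return sum(page.get("KeyCount", 0) for page in paginator.paginate(Bucket=bucket))


for location in LOCATIONS:
    trigger("GenerateKeyBatchesHandler", location)
# ...wait for completion (check logs), then:
print(count_objects(f"data-report-key-batches-{ACCOUNT_NAME}"))

for location in LOCATIONS:
    trigger("CsvBulkTransformerHandler", location)
# ...wait for completion (check logs), then:
print(count_objects(f"data-report-csv-export-{ACCOUNT_NAME}"))
```

The asynchronous invocation mirrors the manual trigger; as in the manual
steps, completion is still verified through the logs and the bucket contents.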

## Post-run steps

1. Turn on S3 event notifications for bucket `persisted-resources-{accountName}`.
   In the AWS console, go to

   _S3_ -> _persisted-resources-{accountName}_ -> _Properties_ ->
   _Amazon EventBridge_ -> _Edit_ -> _On_

   This step can also be scripted; see the sketch below.
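
A minimal boto3 sketch of the post-run step, assuming the same placeholder
account name; setting `EventBridgeConfiguration` to an empty object re-enables
delivery of bucket events to Amazon EventBridge:

```python
# Sketch: re-enable S3 -> Amazon EventBridge notifications so that
# live-update events flow again. "example" is a hypothetical account name.
import boto3

s3 = boto3.client("s3")
s3.put_bucket_notification_configuration(
    Bucket="persisted-resources-example",
    NotificationConfiguration={"EventBridgeConfiguration": {}},
)
```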
Binary file added documentation/images/data_export_overview.png
Binary file added documentation/images/data_report_aws_databricks_storage_credential.png
