Commit
* Update how-to and diagrams
* Fix codacy issues
* Fix codacy issues
* Add line about empty bucket first
* Fix codacy issues
* Fix codacy issues
Showing 4 changed files with 79 additions and 70 deletions.
@@ -1,80 +1,22 @@
# NVA Data Report API

This repository contains the NVA data report API.

This repository contains functions for generating csv reports of data from NVA.
See [reportTypes](documentation/reportTypes.md) for a list of reports and data types.
## How to run a bulk upload

The steps below can be outlined briefly as:

- Pre-run
  - Stop incoming live-update events
  - Delete data from previous runs
  - Delete all data in database
- Bulk upload
  - Generate batches of document keys for upload
  - Transform the data to a format compatible with the bulk-upload action
  - Initiate bulk upload
  - Verify data integrity
- Post-run
  - Start incoming live-update events

### Pre-run steps

## Architectural overview

1. Remove all objects from the S3 bucket `loader-input-files-{accountName}`
2. Turn off S3 event notifications for the bucket `persisted-resources-{accountName}`.
   In the AWS console, go to
   <br>_S3_ -> _persisted-resources-{accountName}_ -> _Properties_ -> _Amazon EventBridge_ ->
   _Edit_ -> _Off_
3. Press `ResetDatabaseButton` (triggers `DatabaseResetHandler`). This might take around a minute
   to complete.
4. Verify that the database is empty. You can use the SageMaker notebook to query the database*
   (a programmatic version of this check is sketched after this list). Example SPARQL queries:

   ```
   SELECT (COUNT(DISTINCT ?g) as ?gCount) WHERE {GRAPH ?g {?s ?p ?o}}
   ```

   or

   ```
   SELECT ?g ?s ?p ?o WHERE {GRAPH ?g {?s ?p ?o}} LIMIT 100
   ```
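The emptiness check can also be run outside the notebook UI. The sketch below is illustrative only: the endpoint URL is a placeholder, and it assumes the Neptune cluster does not enforce IAM (SigV4) authentication.

```python
# A minimal sketch (assumed setup): query the Neptune SPARQL endpoint directly
# to confirm the database is empty. The endpoint below is a placeholder.
import requests

NEPTUNE_SPARQL_ENDPOINT = "https://<neptune-cluster-endpoint>:8182/sparql"

query = "SELECT (COUNT(DISTINCT ?g) as ?gCount) WHERE {GRAPH ?g {?s ?p ?o}}"
response = requests.post(
    NEPTUNE_SPARQL_ENDPOINT,
    data={"query": query},
    headers={"Accept": "application/sparql-results+json"},
)
response.raise_for_status()

count = response.json()["results"]["bindings"][0]["gCount"]["value"]
print(f"Distinct named graphs: {count}")  # expected to be 0 after the reset
```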
![Architecture](documentation/images/data_export_overview.png)

### Bulk upload steps

## Integration overview

1. Generate key batches for both locations: `resources` and `nvi-candidates`. Manually trigger
   `GenerateKeyBatchesHandler` with the following input:

   ```json
   {
     "detail": {
       "location": "resources|nvi-candidates"
     }
   }
   ```

2. Verify that `GenerateKeyBatchesHandler` is done processing (i.e. check the logs) and that key
   batches have been generated in the S3 bucket `data-report-key-batches-{accountName}`
3. Trigger `BulkTransformerHandler`
4. Verify that `BulkTransformerHandler` is done processing (i.e. check the logs) and that nquads
   have been generated in the S3 bucket `loader-input-files-{accountName}`
5. Trigger `BulkDataLoader`
6. To check progress for the bulk upload to Neptune, trigger `BulkDataLoader` with the following
   input (see also the sketch after this list):

   ```json
   {
     "loadId": "{copy loadId UUID from test log}"
   }
   ```

7. Verify that the expected count is in the database. Query for counting distinct named graphs:

   ```
   SELECT (COUNT(DISTINCT ?g) as ?gCount) WHERE {GRAPH ?g {?s ?p ?o}}
   ```
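The handlers can also be invoked with the AWS SDK instead of the console. The sketch below is an assumption: the `FunctionName` values reuse the handler names from this guide and must be replaced with the Lambda function names actually deployed in your account.

```python
# A minimal sketch (assumed function names): trigger the bulk-upload handlers
# via boto3 and check the load status reported by BulkDataLoader.
import json
import boto3

lambda_client = boto3.client("lambda")


def invoke(function_name, payload):
    """Invoke a Lambda synchronously and return its decoded response payload."""
    response = lambda_client.invoke(
        FunctionName=function_name,
        Payload=json.dumps(payload).encode("utf-8"),
    )
    return json.loads(response["Payload"].read())


# Step 1: generate key batches for both locations.
for location in ("resources", "nvi-candidates"):
    invoke("GenerateKeyBatchesHandler", {"detail": {"location": location}})

# Step 6: check bulk-load progress using the loadId from the BulkDataLoader log.
status = invoke("BulkDataLoader", {"loadId": "<loadId UUID from the log>"})
print(status)
```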
The S3 bucket `data-report-csv-export-{accountName}` (defined in the template) is
set up as a data source in Databricks (in another AWS account) following the
Databricks guide [_Create a storage credential for connecting to AWS S3_](https://docs.databricks.com/en/connect/unity-catalog/storage-credentials.html#create-a-storage-credential-for-connecting-to-aws-s3).
This is how the data platform accesses files from
`data-report-csv-export-{accountName}`:

### Post-run steps

![Databricks integration](documentation/images/data_report_aws_databricks_storage_credential.png)

1. Turn on S3 event notifications for the bucket `persisted-resources-{accountName}`.
   In the AWS console, go to
   <br> _S3_ -> _persisted-resources-{accountName}_ -> _Properties_ -> _Amazon EventBridge_ ->
   _Edit_ -> _On_

## How-to guides

*Note: You can use a SageMaker notebook to query the database. The notebook can be opened from the
AWS console through _SageMaker_ -> _Notebooks_ -> _Notebook instances_ -> _Open JupyterLab_

- [Run bulk export](documentation/bulkExport.md)
@@ -0,0 +1,67 @@
# How to run a bulk export

This process is intended as an "initial export"/database dump for the first
export to the data platform. It can also be used if changes in the data model
require a full re-import of the data.

The steps below can be outlined briefly as:

- Pre-run
  - Stop incoming live-update events
  - Delete data from previous runs
- Bulk export
  - Generate batches of document keys for export
  - Transform the key batches to csv files
- Post-run
  - Start incoming live-update events

## Pre-run steps

1. Turn off S3 event notifications for the bucket `persisted-resources-{accountName}`.
   In the AWS console, go to

   _S3_ -> _persisted-resources-{accountName}_ -> _Properties_ ->
   _Amazon EventBridge_ -> _Edit_ -> _Off_
2. Remove all objects from the S3 bucket `data-report-csv-export-{accountName}`
   (a scripted version of these pre-run steps is sketched after this list)
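As an illustration only, the pre-run steps could also be scripted with boto3. The bucket names below follow the patterns in this guide, the account name is a placeholder, and the sketch assumes the export bucket is not versioned.

```python
# A minimal sketch (assumed bucket names): disable EventBridge notifications and
# empty the csv export bucket before a new bulk export.
import boto3

s3 = boto3.client("s3")
account_name = "<accountName>"  # placeholder

# 1. Turn off S3 -> EventBridge notifications by writing a configuration
#    that contains no EventBridgeConfiguration block.
s3.put_bucket_notification_configuration(
    Bucket=f"persisted-resources-{account_name}",
    NotificationConfiguration={},
)

# 2. Remove every object from the export bucket.
export_bucket = f"data-report-csv-export-{account_name}"
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=export_bucket):
    objects = [{"Key": obj["Key"]} for obj in page.get("Contents", [])]
    if objects:
        s3.delete_objects(Bucket=export_bucket, Delete={"Objects": objects})
```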

## Bulk export steps

1. Generate key batches for both locations: `resources` and `nvi-candidates`.
   Manually trigger `GenerateKeyBatchesHandler` with the following input:

   ```json
   {
     "detail": {
       "location": "resources|nvi-candidates"
     }
   }
   ```

2. Verify that `GenerateKeyBatchesHandler` is done processing (i.e. check the logs)
   and that key batches have been generated in the S3 bucket
   `data-report-key-batches-{accountName}`
3. Process the key batches and generate csv files for both locations: `resources`
   and `nvi-candidates`.
   Manually trigger `CsvBulkTransformerHandler` with the following input:

   ```json
   {
     "detail": {
       "location": "resources|nvi-candidates"
     }
   }
   ```

4. Verify that `CsvBulkTransformerHandler` is done processing (i.e. check the logs)
   and that csv files have been generated in the S3 bucket
   `data-report-csv-export-{accountName}` (a scripted version of these steps is
   sketched after this list)
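For illustration, the export steps above could be driven from a script. The `FunctionName` values reuse the handler names from this guide as an assumption; substitute the deployed Lambda names and your account name.

```python
# A minimal sketch (assumed function and bucket names): run the bulk export
# handlers for both locations and confirm that csv files were produced.
import json
import boto3

lambda_client = boto3.client("lambda")
s3 = boto3.client("s3")
account_name = "<accountName>"  # placeholder


def invoke(function_name, location):
    """Invoke a handler with the event payload shown in the steps above."""
    payload = {"detail": {"location": location}}
    lambda_client.invoke(
        FunctionName=function_name,
        Payload=json.dumps(payload).encode("utf-8"),
    )


# Steps 1-2: generate key batches for both locations, then confirm in the logs
# that the handler has finished before continuing.
for location in ("resources", "nvi-candidates"):
    invoke("GenerateKeyBatchesHandler", location)

# Step 3: transform the key batches into csv files for both locations.
for location in ("resources", "nvi-candidates"):
    invoke("CsvBulkTransformerHandler", location)

# Step 4: confirm that csv files landed in the export bucket.
result = s3.list_objects_v2(Bucket=f"data-report-csv-export-{account_name}")
print(f"Objects in export bucket: {result.get('KeyCount', 0)}")
```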

## Post-run steps

1. Turn on S3 event notifications for the bucket `persisted-resources-{accountName}`.
   In the AWS console, go to

   _S3_ -> _persisted-resources-{accountName}_ -> _Properties_ ->
   _Amazon EventBridge_ -> _Edit_ -> _On_
   (or see the sketch after this list)
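The same toggle can be flipped with boto3; this is a sketch under the same naming assumptions as above.

```python
# A minimal sketch (assumed bucket name): re-enable S3 -> EventBridge
# notifications after the bulk export has finished.
import boto3

boto3.client("s3").put_bucket_notification_configuration(
    Bucket="persisted-resources-<accountName>",  # placeholder account name
    NotificationConfiguration={"EventBridgeConfiguration": {}},
)
```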
Binary file added (+21.8 KB): documentation/images/data_report_aws_databricks_storage_credential.png