Update readMe (#178)
* Update how-to and diagrams

* Fix codacy issues

* Fix codacy issues

* Add line about empty bucket first

* Fix codacy issues

* Fix codacy issues
monaullah authored Sep 30, 2024
1 parent b028a03 commit 4eb18f7
Showing 4 changed files with 79 additions and 70 deletions.
82 changes: 12 additions & 70 deletions README.md
````diff
@@ -1,80 +1,22 @@
 # NVA Data Report API
 
-This repository contains the NVA data report API.
+This repository contains functions for generating csv reports of data from NVA.
+See [reportTypes](documentation/reportTypes.md) for a list of reports and data types.
 
-## How to run a bulk upload
-
-The steps below can be outlined briefly as:
-
-- Pre-run
-    - Stop incoming live-update events
-    - Delete data from previous runs
-    - Delete all data in database
-- Bulk upload
-    - Generate batches of document keys for upload
-    - Transform the data to a format compatible with the bulk-upload action
-    - Initiate bulk upload
-    - Verify data integrity
-- Post-run
-    - Start incoming live-update events
-
-### Pre-run steps
+## Architectural overview
 
-1. Remove all objects from S3 bucket `loader-input-files-{accountName}`
-2. Turn off S3 event notifications for bucket `persisted-resources-{accountName}`.
-   In the AWS console, go to
-   <br>_S3_ -> _persisted-resources-{accountName}_ -> _Properties_ -> _Amazon EventBridge_ ->
-   _Edit_ -> _Off_
-3. Press `ResetDatabaseButton` (trigger `DatabaseResetHandler`). This might take around a minute
-   to complete.
-4. Verify that the database is empty. You can use a SageMaker notebook to query the database*.
-   Example sparql queries:
-   ```
-   SELECT (COUNT(DISTINCT ?g) as ?gCount) WHERE {GRAPH ?g {?s ?p ?o}}
-   ```
-   or
-   ```
-   SELECT ?g ?s ?p ?o WHERE {GRAPH ?g {?s ?p ?o}} LIMIT 100
-   ```
+![Architecture](documentation/images/data_export_overview.png)
 
-### Bulk upload steps
+## Integration overview
 
-1. Generate key batches for both locations: `resources` and `nvi-candidates`. Manually trigger
-   `GenerateKeyBatchesHandler` with the following input:
-   ```json
-   {
-     "detail": {
-       "location": "resources|nvi-candidates"
-     }
-   }
-   ```
-2. Verify that `GenerateKeyBatchesHandler` is done processing (i.e. check logs) and that key
-   batches have been generated in S3 bucket `data-report-key-batches-{accountName}`
-3. Trigger `BulkTransformerHandler`
-4. Verify that `BulkTransformerHandler` is done processing (i.e. check logs) and that nquads
-   have been generated in S3 bucket `loader-input-files-{accountName}`
-5. Trigger `BulkDataLoader`
-6. To check progress of the bulk upload to Neptune, trigger `BulkDataLoader` with the following
-   input:
-   ```json
-   {
-     "loadId": "{copy loadId UUID from test log}"
-   }
-   ```
-7. Verify that the expected count is in the database. Query for counting distinct named graphs:
-   ```
-   SELECT (COUNT(DISTINCT ?g) as ?gCount) WHERE {GRAPH ?g {?s ?p ?o}}
-   ```
+The s3 bucket `data-report-csv-export-{accountName}` (defined in template) is
+set up as a data source in Databricks (in another AWS account) following the
+databricks [guide _Create a storage credential for connecting to AWS S3_](https://docs.databricks.com/en/connect/unity-catalog/storage-credentials.html#create-a-storage-credential-for-connecting-to-aws-s3).
+This is how the data platform accesses files from
+`data-report-csv-export-{accountName}`:
 
-### Post-run steps
+![Databricks integration](documentation/images/data_report_aws_databricks_storage_credential.png)
 
-1. Turn on S3 event notifications for bucket `persisted-resources-{accountName}`.
-   In the AWS console, go to
-   <br>_S3_ -> _persisted-resources-{accountName}_ -> _Properties_ -> _Amazon EventBridge_ ->
-   _Edit_ -> _On_
+## How-to guides
 
-*Note: You can use SageMaker notebook to query the database. Notebook can be opened from the AWS
-console through _SageMaker_ -> _Notebooks_ -> _Notebook instances_ -> _Open JupyterLab_
+- [Run bulk export](documentation/bulkExport.md)
````
67 changes: 67 additions & 0 deletions documentation/bulkExport.md
@@ -0,0 +1,67 @@
# How to run a bulk export

This process is intended as an "initial export"/database dump for the first
export to the data platform. It can also be used if changes in the data model
require a full re-import of the data.

The steps below can be outlined briefly as:

- Pre-run
    - Stop incoming live-update events
    - Delete data from previous runs
- Bulk export
    - Generate batches of document keys for export
    - Transform the key batches to csv files
- Post-run
    - Start incoming live-update events

## Pre-run steps

1. Turn off S3 event notifications for bucket `persisted-resources-{accountName}`.
   In the AWS console, go to

   _S3_ -> _persisted-resources-{accountName}_ -> _Properties_ ->
   _Amazon EventBridge_ -> _Edit_ -> _Off_
2. Remove all objects from S3 bucket `data-report-csv-export-{accountName}`
   (both steps can also be scripted; see the sketch after this list)
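
For repeatability, here is a minimal boto3 sketch of the two pre-run steps. It
assumes default AWS credentials and uses an illustrative account name; an empty
notification configuration switches off the EventBridge integration, and a
versioned bucket would additionally need its object versions deleted.

```python
# Minimal sketch of the pre-run steps. Assumption: "example" stands in
# for the real account name; adjust bucket names to your environment.
import boto3

ACCOUNT_NAME = "example"  # hypothetical placeholder

# Step 1: turn off S3 event notifications. An empty configuration
# removes all notification settings, including Amazon EventBridge.
s3 = boto3.client("s3")
s3.put_bucket_notification_configuration(
    Bucket=f"persisted-resources-{ACCOUNT_NAME}",
    NotificationConfiguration={},
)

# Step 2: remove all current objects from the csv export bucket.
bucket = boto3.resource("s3").Bucket(f"data-report-csv-export-{ACCOUNT_NAME}")
bucket.objects.all().delete()
```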

## Bulk export steps

1. Generate key batches for both locations: `resources` and `nvi-candidates`.
   Manually trigger `GenerateKeyBatchesHandler` once per location with the
   following input (choose one of the two `location` values):

   ```json
   {
     "detail": {
       "location": "resources|nvi-candidates"
     }
   }
   ```

2. Verify that `GenerateKeyBatchesHandler` is done processing (i.e. check logs)
   and that key batches have been generated in S3 bucket
   `data-report-key-batches-{accountName}`
3. Process the key batches and generate csv files for both locations:
   `resources` and `nvi-candidates`.
   Manually trigger `CsvBulkTransformerHandler` once per location with the
   following input:

   ```json
   {
     "detail": {
       "location": "resources|nvi-candidates"
     }
   }
   ```

4. Verify that `CsvBulkTransformerHandler` is done processing (i.e. check logs)
   and that csv files have been generated in S3 bucket
   `data-report-csv-export-{accountName}`.
   A scripted version of these steps is sketched below.
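
The four steps above can also be scripted. A boto3 sketch, under the
assumption that the deployed Lambda function names match the handler names
used here (real deployments often add a stack prefix) and that the account
name is a placeholder:

```python
# Sketch of the bulk export steps. Assumptions: function and bucket
# names are illustrative; adjust them to your deployment.
import json
import boto3

ACCOUNT_NAME = "example"  # hypothetical placeholder
LOCATIONS = ["resources", "nvi-candidates"]

lambda_client = boto3.client("lambda")
s3 = boto3.client("s3")


def trigger(function_name: str, location: str) -> None:
    """Invoke a handler once for one location (steps 1 and 3)."""
    payload = {"detail": {"location": location}}
    lambda_client.invoke(
        FunctionName=function_name,
        InvocationType="Event",  # async; completion is checked via logs
        Payload=json.dumps(payload).encode("utf-8"),
    )


def count_objects(bucket: str) -> int:
    """Count objects in a bucket to verify output (steps 2 and 4)."""
    paginator = s3.get_paginator("list_objects_v2")
    return sum(page.get("KeyCount", 0) for page in paginator.paginate(Bucket=bucket))


for location in LOCATIONS:
    trigger("GenerateKeyBatchesHandler", location)
# ...wait for completion (check logs), then:
print(count_objects(f"data-report-key-batches-{ACCOUNT_NAME}"))

for location in LOCATIONS:
    trigger("CsvBulkTransformerHandler", location)
# ...wait for completion (check logs), then:
print(count_objects(f"data-report-csv-export-{ACCOUNT_NAME}"))
```

The asynchronous invocation mirrors the manual trigger; as in the manual
steps, completion is still verified through the logs and the bucket contents.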

## Post-run steps

1. Turn on S3 event notifications for bucket `persisted-resources-{accountName}`.
   In the AWS console, go to

   _S3_ -> _persisted-resources-{accountName}_ -> _Properties_ ->
   _Amazon EventBridge_ -> _Edit_ -> _On_

   This step can also be scripted; see the sketch below.
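
A minimal boto3 sketch of the post-run step, assuming the same placeholder
account name; setting `EventBridgeConfiguration` to an empty object re-enables
delivery of bucket events to Amazon EventBridge:

```python
# Sketch: re-enable S3 -> Amazon EventBridge notifications so that
# live-update events flow again. "example" is a hypothetical account name.
import boto3

s3 = boto3.client("s3")
s3.put_bucket_notification_configuration(
    Bucket="persisted-resources-example",
    NotificationConfiguration={"EventBridgeConfiguration": {}},
)
```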
Binary file added documentation/images/data_export_overview.png
Binary file added documentation/images/data_report_aws_databricks_storage_credential.png
