This document serves as the general focal point for planning and coding the implementation of the Knowledge Graph Exchange Archive system ("Archive"). Its design vision is discussed in the high-level Architectural Vision document; its practical deployment specifics are covered in the Getting Started document.
In this plan, the 'primary' client refers to the initial user transaction with the Archive: uploading the original KGE File Set. A 'secondary' client refers to a subsequent user activity (by the same, or a different, user): accessing that KGE File Set after it has been uploaded to the Archive.
The components and operations of the KGE Archive system (including related Translator SmartAPI Registry functions) are:
- client user authentication and authorization;
- a client interface (web form and, possibly, a command line interface modality) to upload KGX format-compliant KGE File Sets capturing the content of Biolink Model-compliant Translator knowledge graphs, with partial or complete metadata;
- a web server application manager of the KGE File Sets;
- persistence of KGE File Sets (and their metadata) in online network storage (i.e. an NCATS-hosted Amazon S3 bucket);
- if complete content metadata was not already provided by the user who uploaded a given KGE File Set, generation of the missing content metadata by (re-)processing those KGX files through KGX;
- publication of Translator SmartAPI Registry ("Registry") entries pointing to (meta-)data access details for each KGE File Set, one per distinct knowledge graph;
- client access to KGE File Set metadata from the Registry (note that the details and implementation of the indexing and accessing of KGE entries in the Registry will be within the technical scope of the Registry team, not this Archive code base);
- a file streaming gateway/protocol to download KGE File Sets, using information from KGEA API entries retrieved from the Registry.
The proposed implementation of the Archive is a Python web application consisting of components running within a Docker Compose-coordinated set of Docker containers, hosted on a (Translator-hosted AWS EC2 cloud?) server instance and accessing suitable (Translator-hosted AWS S3 or EBS cloud) storage.
Client communication with the web application will generally be through an OpenAPI 3 templated web service specified in kgea_api.yaml. A human browser accessible web form and/or a (KGX-based?) command line interface will be implemented to upload KGX files to the Archive web server, for further processing.
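By way of illustration, here is a minimal sketch of what such an upload handler might look like in Flask; the route path, multipart form field names and local spool location are hypothetical placeholders, not the final Archive API:

```python
from flask import Flask, jsonify, request
from werkzeug.utils import secure_filename

app = Flask(__name__)

@app.route("/upload", methods=["POST"])
def upload_kge_file_set():
    # 'kg_name' and 'files' are assumed multipart form field names
    kg_name = request.form.get("kg_name")
    if not kg_name:
        return jsonify(error="missing 'kg_name' form field"), 400
    saved = []
    for upload in request.files.getlist("files"):
        filename = secure_filename(upload.filename)
        # The real Archive would relay these bytes onward to network
        # storage; here they are simply spooled to local disk.
        upload.save(f"/tmp/{kg_name}-{filename}")
        saved.append(filename)
    return jsonify(kg_name=kg_name, files=saved), 201
```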
Once on the server, if the associated Translator Resource "Content Metadata" of the KGE files is incomplete, KGX may be run (as a background, asynchronous task) to generate the required content metadata. To the resulting (uploaded or generated) metadata, suitable additional Translator Resource "Provider Metadata" will be added.
The web application will publish KGE SmartAPI entries to the Translator SmartAPI Registry through an outgoing (web service?) automated protocol to be negotiated with the SmartAPI team. Provider Metadata will generally be hard-coded into the API YAML file uploaded to SmartAPI (similar to kgea_api.yaml, but based on a KGE File Set-parameterized kge_smartapi.yaml template); access to Content Metadata will be deferred to the API's "knowledge_map" endpoint published in the entry (see below). Clients (likely different from the clients which uploaded the original KGE files) will access KGE SmartAPI entries through the normal modalities of Translator SmartAPI site access.
Using the accessed KGE SmartAPI metadata, clients will connect to the Archive to read the available Content Metadata, then access the files themselves through some link or protocol of file data transfer from the remote network storage (by a protocol to be further specified), for their intended local computational use.
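A sketch of what the Content Metadata access step might look like on the Archive side, assuming the metadata is persisted alongside the KGE File Set in S3 (the bucket name, key convention and route shown are assumptions):

```python
import boto3
from flask import Flask, Response

app = Flask(__name__)
s3 = boto3.client("s3")

KGEA_BUCKET = "kgea-archive"  # hypothetical bucket name

@app.route("/<kg_name>/knowledge_map", methods=["GET"])
def knowledge_map(kg_name: str):
    # Assumed key convention: <kg_name>/content_metadata.json
    obj = s3.get_object(
        Bucket=KGEA_BUCKET,
        Key=f"{kg_name}/content_metadata.json",
    )
    return Response(obj["Body"].read(), mimetype="application/json")
```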
Further specific details and the road map of the Archive's design and implementation, both in global terms and with respect to the numbered functional parts noted on the KGE Archive Component Architecture diagram, are summarized below:
- Client Authentication & Authorization
- Primary Client KGE File Set Upload
- KGE Archive Server
- KGE File Set Network Storage
- KGX Metadata Generation
- KGE SmartAPI Registration
- Secondary Client Access to KGE SmartAPI Entries
- Secondary Client Access to KGE Files
- The heart of the KGE Archive system will be a publicly visible, user-authentication-secured web service application [3].
- Since NCATS (Biomedical Translator Consortium) operates most of its deployed infrastructure on a private (VPN) leased sub-cloud of Amazon Web Services (AWS) servers and related components, deployment of the KGE Archive is assumed to target AWS EC2, and related PaaS services, as its deployment platform.
Client access to the Archive will need to be authenticated (see related comments in other sections below). AWS off-the-shelf services (e.g. AWS Cognito) will be leveraged, through available AWS SDKs, to implement all client user authentication and authorization. Given the complexities of this task, a separate KGE Archive client authentication and authorization document is provided here to review the design and configuration of the associated software components.
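For illustration, a minimal sketch of programmatic user authentication against a Cognito user pool with Boto3 follows; the app client ID and region are placeholders, and the production flow (e.g. the Cognito hosted UI with OAuth2 redirects) may well differ:

```python
import boto3

# Region and app client ID are placeholders for the Archive's actual pool
cognito = boto3.client("cognito-idp", region_name="us-east-1")

def authenticate(username: str, password: str) -> dict:
    """Return the Cognito token set for a user, or raise on failure."""
    response = cognito.initiate_auth(
        ClientId="YOUR_COGNITO_APP_CLIENT_ID",  # placeholder
        AuthFlow="USER_PASSWORD_AUTH",
        AuthParameters={"USERNAME": username, "PASSWORD": password},
    )
    # Contains AccessToken, IdToken, RefreshToken, ExpiresIn, TokenType
    return response["AuthenticationResult"]
```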
The precise form, protocol and software support for KGE File Set uploading [2] is to be elaborated. Two design patterns are under consideration:
- Files are uploaded via the client interface [2] to the KGE Archive Server [3], then proxied onward to network storage [4];
- Files are uploaded directly to network storage [4] by network storage SDK calls embedded in the client interface [2] (see the sketch below).
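A sketch of the second pattern, assuming AWS S3 presigned POSTs (the bucket name and key layout are assumptions):

```python
import boto3

s3 = boto3.client("s3")

def presigned_upload(kg_name: str, filename: str) -> dict:
    """Return URL + form fields the client POSTs the file to directly."""
    return s3.generate_presigned_post(
        Bucket="kgea-archive",       # hypothetical bucket
        Key=f"{kg_name}/{filename}",
        ExpiresIn=3600,              # one hour to complete the upload
    )
```

With this pattern, the Archive server only brokers the credentials; the file bytes themselves never transit the EC2 instance.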
- The core web service application [3] will be installed to run within a suitable software deployment framework (a Docker Compose-managed set of Docker containers) hosted on an AWS EC2 instance running a common flavor (i.e. Ubuntu) of the Linux operating system.
- Web Services Specification: a current release of the OpenAPI 3 web services specification standard is used to specify the web services application programming interface (API) of the Archive. OpenAPI Tools code generation is used to convert the API into stub server code for elaboration. This tool is wrapped in a pair of (bash) shell scripts, a custom generate-kge-server.sh script calling a locally mirrored copy of the OpenAPI Generator script, which are executed to further automate and parameterize the server code generation process.
- Primary Implementation Language: Archive software components will be written in a recent release (3.9+) of Python. Related off-the-shelf standards and libraries are being used to develop the required components:
- Build & Dependency Management: application configuration, Python dependency management and the build process will be managed by the Python pipenv tool;
- Web Application Framework: Python Flask, with customized business logic handlers serving the web service API code stubs generated by the aforementioned OpenAPI code generator tool;
- Amazon Web Services: the latest release of the available Python AWS Software Development Kit (Boto3) will be leveraged to integrate the web services application with AWS infrastructure;
- Github Transactions: publication of SmartAPI entries for KGE File Sets [6] can be accomplished using an available Python library for programmatic access to Github (a sketch appears below, under KGE SmartAPI Registration).
- Archived KGE File Sets encoding Translator knowledge graphs are KGX text files anticipated to be, on average, fairly large (e.g. gigabytes in size). Although such files could potentially be hosted on AWS EBS volumes attached to the aforementioned EC2 server instance, the longer-term stable persistence of such large, only periodically accessed files suggests that they should be hosted within an AWS S3 bucket. The aforementioned Boto3 Python AWS SDK library will (hopefully) efficiently broker web service application transactions with S3.
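A sketch of how Boto3's managed transfer machinery might broker such uploads, switching automatically to multipart transfer for large files (the bucket name, key layout and thresholds are assumptions):

```python
import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.client("s3")

# Switch to multipart at 100 MB, uploading 25 MB parts concurrently
config = TransferConfig(
    multipart_threshold=100 * 1024 * 1024,
    multipart_chunksize=25 * 1024 * 1024,
)

def archive_file(local_path: str, kg_name: str, filename: str) -> None:
    """Stream one KGX file from local disk into the Archive's S3 bucket."""
    with open(local_path, "rb") as fileobj:
        s3.upload_fileobj(
            fileobj, "kgea-archive", f"{kg_name}/{filename}", Config=config
        )
```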
We will need to consider how to manage the versioning of KGE File Sets, reflecting both Translator policy for knowledge graph versioning and the versioning characteristics of the network storage (AWS S3); see the sketch below.
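If S3 object versioning proves an adequate fit for Translator policy, it can be enabled once per bucket, after which every re-upload of a KGE File Set key retains prior object versions (sketch only; the bucket name is a placeholder):

```python
import boto3

s3 = boto3.client("s3")

# One-time bucket configuration: keep prior versions of re-uploaded keys
s3.put_bucket_versioning(
    Bucket="kgea-archive",  # hypothetical bucket
    VersioningConfiguration={"Status": "Enabled"},
)
```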
KGX-generated 'content' metadata, perhaps augmented with some additional KGE Archive-specific 'provider' metadata, will need to be associated with every distinct knowledge graph published as a KGE File Set.
Two options are possible for the provision of KGX 'content' metadata:
- the primary client will pre-generate a KGX content metadata file for upload alongside the KGX files forming the KGE File Set;
- the Archive server will run an instance of KGX (as a background process on the server, perhaps in a dedicated Docker container?) to generate the required content metadata file from the previously uploaded KGX files (see the sketch after this list).
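A sketch of the second option, running the KGX command line tool as an asynchronous background task so that it does not block the web application; the exact KGX subcommand and flags shown here are assumptions to be validated against the KGX release actually used:

```python
import asyncio

async def generate_content_metadata(
    nodes_file: str, edges_file: str, output_file: str
) -> int:
    """Run KGX in the background to summarize a KGE File Set's content."""
    process = await asyncio.create_subprocess_exec(
        "kgx", "graph-summary",      # assumed KGX CLI subcommand
        "--input-format", "tsv",     # assumed flags
        "--output", output_file,
        nodes_file, edges_file,
    )
    # Await completion without blocking the web application's event loop
    return await process.wait()
```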
The standard mechanism for recording Translator SmartAPI Registry ("Registry") entries for Translator resources is to consume SmartAPI-compliant OpenAPI 3 YAML specifications from a designated Github repository location (e.g. the Translator API Registry Github repository). Such a location can contain multiple distinct SmartAPI entries, one per distinct API. Every API registered in such a location needs to resolve to a host/path location of a live OpenAPI 3 implementation of the API.
After a KGE File Set is uploaded (and KGX 'content' metadata [5] added, if required), the Archive needs to publish such a SmartAPI entry to the Registry and make provisions for the live hosting of the API.
For the former objective, the Archive will populate a KGE File Set-specific instance of a KGE File Set SmartAPI specification template. For the latter objective, the resulting KGE SmartAPI YAML file will resolve to a KGE File Set-associated subpath on the Archive, as programmatically defined in the full Archive API web services OpenAPI specification and its implementation, described herein.
Archive publication of KGE File Set SmartAPI entries will consist of a git commit and push of the KGE File Set-specific API YAML file to a SmartAPI-registered Github repository location, likely within the Translator API Registry Github repository, along with an update of any associated API list on the site (a sketch follows below). The full details of automating this process require further elaboration.
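A sketch of what such an automated publication step could look like, populating the kge_smartapi.yaml template and committing the result via the PyGithub library (the repository name, token handling and template variable are all assumptions pending negotiation with the SmartAPI team):

```python
from string import Template
from github import Github

def publish_smartapi_entry(kg_name: str, token: str) -> None:
    """Fill the KGE SmartAPI template and commit it to the Registry repo."""
    # Assumes the template uses string.Template placeholders, e.g. $kg_name
    with open("kge_smartapi.yaml") as f:
        yaml_text = Template(f.read()).substitute(kg_name=kg_name)

    # Placeholder repository path; the real target is to be negotiated
    repo = Github(token).get_repo("NCATS-Tangerine/translator-api-registry")
    repo.create_file(
        path=f"kge/{kg_name}_smartapi.yaml",
        message=f"Publish KGE SmartAPI entry for '{kg_name}'",
        content=yaml_text,
        branch="master",
    )
```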
As mentioned in the high-level architectural vision document, the SmartAPI Registry team already has infrastructure in place on the Translator SmartAPI Portal to handle this step in the KGE workflow, so no further action is required here.
As noted in [6] above, KGE File Set SmartAPI entries will point to a live server path, indexed by knowledge graph name and hosted by the Archive web service application [3]. The key design considerations for this step are to specify the specific modality of access and its practical implementation.
With respect to modality, one specific REST path definition in the Archive API web services OpenAPI specification is designated to handle file retrieval as some kind of URL-brokered access to the files. How this URL is to be accessed remains to be further elaborated.
In terms of practical implementation, given the large anticipated size of many KGE File Sets, technical options for downloading or streaming such files will require additional consideration. Two general ideas come to mind at the moment:
- that the Archive web server acts as a proxy gateway streaming such files to a secondary client, through the server, from the back-end network storage system (AWS S3 bucket, perhaps via EBS buffering on the EC2 server). It is uncertain at this moment what level of bandwidth performance demands this may place upon the KGE Server; thus, we have a second option...
- ...that the Archive merely returns a suitable interface (CLI, web form, web service call endpoint, etc.) authorizing system access with temporary resource access tokens (i.e. AWS Cognito) to the back-end network storage system (AWS S3 bucket), which the secondary client then uses to access the files directly from the storage system, without going through the server (see the sketch below).
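A sketch of this second option using time-limited S3 presigned GET URLs (the bucket and key layout are assumptions; a Cognito-token variant would substitute temporary credentials here instead):

```python
import boto3

s3 = boto3.client("s3")

def presigned_download(kg_name: str, filename: str) -> str:
    """Return a short-lived URL the secondary client fetches directly
    from S3, so the download bypasses the Archive server entirely."""
    return s3.generate_presigned_url(
        "get_object",
        Params={"Bucket": "kgea-archive", "Key": f"{kg_name}/{filename}"},
        ExpiresIn=3600,  # URL valid for one hour
    )
```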