Data Store Charter #8

Merged · 17 commits · Oct 4, 2018
Conversation

@kozbo (Contributor) commented Aug 11, 2018:

Ready for review, choosing a few members from the Architecture team to get this started.

## In-scope

### Interfaces
* DSS data read and write API (PUT bundle, PUT file, GET bundle, GET file) - maintenance and extension of the implementation of the basic data access APIs.
@brianraymor (Collaborator) commented Aug 14, 2018:
DSS needs to be defined per its use. Then there should be consistency about when DSS is used instead of Data Store.

kozbo (author):
I will move them all to Data Store
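To make the data access bullet above concrete, here is a minimal sketch of two of the four calls in Python; the base URL, paths, and payload fields are illustrative assumptions, not the documented Data Store API.

```python
# Minimal sketch of the basic data access API (GET bundle, PUT file).
# Base URL, endpoint paths, and field names are assumptions for illustration.
import requests

DSS = "https://dss.example.org/v1"  # hypothetical base URL

def get_bundle(uuid: str, replica: str = "aws") -> dict:
    """GET bundle: fetch the manifest listing the files in a bundle."""
    resp = requests.get(f"{DSS}/bundles/{uuid}", params={"replica": replica})
    resp.raise_for_status()
    return resp.json()

def put_file(uuid: str, source_url: str, creator_uid: int = 0) -> dict:
    """PUT file: register a file from a staging area into the store."""
    resp = requests.put(
        f"{DSS}/files/{uuid}",
        json={"source_url": source_url, "creator_uid": creator_uid},
    )
    resp.raise_for_status()
    return resp.json()

# GET file and PUT bundle follow the same request/response pattern.
```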


### Core capabilities
* DSS data model and lifecycle (versioned bundles, etc) - Ongoing support and maintenance of the implementation of the data model.
* Subscriptions/Eventing - Implementation of data lifecycle web-hooks (new bundle, new file, delete bundle, delete file). The Data Store implementation will move away from the current dependence on Elastic Search Percolate for our event subsystem. Eventing will depend instead on the AWS and GCP cloud infrastructure directly.
@brianraymor (Collaborator) commented Aug 14, 2018:
The Data Store event subsystem will transition from its current dependence on Elastic Search Percolate to the AWS and GCP cloud infrastructure.

Or:

  • Transition Data Store Subscriptions/Eventing services from the current dependence on Elastic Search Percolate to the AWS and GCP cloud infrastructure

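As a rough sketch of the web-hook subscriptions described in the Subscriptions/Eventing bullet, a consumer might register a callback URL like this; the endpoint and field names are assumptions for illustration.

```python
# Register a web-hook so the Data Store POSTs lifecycle events
# (new bundle, delete file, ...) to our service. Hypothetical endpoint.
import requests

DSS = "https://dss.example.org/v1"  # hypothetical base URL

def subscribe(callback_url: str, replica: str = "aws") -> str:
    resp = requests.put(
        f"{DSS}/subscriptions",
        params={"replica": replica},
        json={"callback_url": callback_url},
    )
    resp.raise_for_status()
    return resp.json()["uuid"]  # keep this id to unsubscribe later
```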
* Multi-cloud replication of objects - There are three parts to this:
1. Maintenance and improvements to the synchronization implementation between AWS and GCP
2. Extending the cloud support to more vendors (such as Microsoft Azure)
Collaborator:

Is Azure in scope for this charter?

kozbo (author):

Azure is not specifically in scope for this charter. I think I should reword this to:
2. Document interfaces to enable new cloud implementations by 3rd parties.

3. Supporting multiple replicas within a single cloud.
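To illustrate item 1, a single-object synchronization between the AWS and GCP replicas might reduce to something like the sketch below; the bucket names are placeholders, and a production implementation would be event-driven and verify checksums rather than copy unconditionally.

```python
# Copy one object from the AWS replica bucket to the GCP replica bucket.
# Bucket names are placeholders for illustration.
import boto3
from google.cloud import storage

def sync_object(key: str,
                s3_bucket: str = "dss-replica-aws",
                gcs_bucket: str = "dss-replica-gcp") -> None:
    # Read the object body from S3, then write it to the GCS replica.
    body = boto3.client("s3").get_object(Bucket=s3_bucket, Key=key)["Body"].read()
    storage.Client().bucket(gcs_bucket).blob(key).upload_from_string(body)
```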
* Support for pluggable indexes - Provide a standard interface for connecting indexing modules to the Data Store. This interface will provide a mechanism to connect indexing subsystems to receive events about the data.
Collaborator:

Repetitive. How about: "Define a standard interface to enable pluggable indexing modules to receive Data Store events"?
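One hypothetical shape for that standard interface, sketched in Python; the class and method names are invented for illustration.

```python
# A pluggable-index contract: indexing modules implement these hooks and
# the Data Store fans lifecycle events out to every registered module.
from abc import ABC, abstractmethod

class Indexer(ABC):
    @abstractmethod
    def on_bundle_created(self, bundle_uuid: str, manifest: dict) -> None: ...

    @abstractmethod
    def on_bundle_deleted(self, bundle_uuid: str) -> None: ...

class LoggingIndexer(Indexer):
    """Trivial example module: just logs the events it receives."""
    def on_bundle_created(self, bundle_uuid, manifest):
        print(f"index {bundle_uuid} ({len(manifest.get('files', []))} files)")

    def on_bundle_deleted(self, bundle_uuid):
        print(f"remove {bundle_uuid} from index")
```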

## Milestones
* Mid-2018: 1000 bundle test scale, deploy as part of HCA DCP Pilot
* EOY 2018: add checkout, collections, improved scaling/hardening, generic events to support stand-alone indexers, additional gaps identified in HCA DCP Pilot.
* Future: (not in order of precedence) native GCP support, Authorization support for controlled-access data, additional scale/hardening, Biosphere requirements, tiered storage, content zones, FISMA moderate capabilities, single-replica deployments
Collaborator:

Future would represent a re-charter to extend scope and milestones.

kozbo (author):

So should I take out the Future section?

Collaborator:

Correct. Futures represent potential scope for a future, refreshed Data Store charter. Well-scoped charters are not intended to last forever. Now, if you have a clear milestone to deliver (for example) native GCP support by May 2019, then that would be different. Does that make sense?

## Communication
### Slack Channels
* HumanCellAtlas/data-store : general data store discussions
* HumanCellAtlas/data-store-eng : development discussions
### Mailing list(s)
Collaborator:

If not used, then Mailing list(s) and Discussion Forum(s) can be deleted.


### Community engagement
* Triage and integration of feature requests from the community into the Data Store roadmap.
* Outreach and engagement of the community
Collaborator:

Are Training and Hackathons sub-bullets or examples of Outreach and engagement of the community?



## Out-of-scope
* Other index/query methods/engines - we should implement these as stand-alone projects against modular index/query API.
Collaborator:

Who is the "we" in "we should implement these"?


* Checkout service APIs - Provide continuing support for the ability to check out data to a local filesystem or a personal cloud environment
@brianraymor (Collaborator) commented Aug 14, 2018:

What is the difference between continuing support and maintenance [and extension]?

In general, I find this style to be wordy. Why not eliminate the API "titles" and use a form like:

  • Maintain and extend (or continue to support if that's preferred) the Checkout service API which enables data checkout to a local filesystem or a personal cloud environment
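For concreteness, the checkout flow under discussion might look roughly like the sketch below; the endpoint path and job fields are assumptions, reflecting only that checkout to a filesystem or bucket is an asynchronous job.

```python
# Start a checkout and poll until the files have been staged.
# Endpoint and field names are hypothetical.
import time
import requests

DSS = "https://dss.example.org/v1"  # hypothetical base URL

def checkout_bundle(uuid: str, replica: str = "aws") -> str:
    resp = requests.post(f"{DSS}/bundles/{uuid}/checkout",
                         params={"replica": replica})
    resp.raise_for_status()
    job_id = resp.json()["checkout_job_id"]
    # Staging files to a bucket or local filesystem takes time, so poll.
    while requests.get(f"{DSS}/bundles/checkout/{job_id}").json()["status"] == "RUNNING":
        time.sleep(5)
    return job_id
```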

* Collections service APIs - Maintenance and extension of the ability to do basic operations on arbitrary collections of objects in the Data Store.
* API Documentation - Programmatic APIs available for the Data Store include the REST interface and the Python bindings. Documentation and examples will be created for both of these APIs.
@brianraymor (Collaborator) commented Aug 14, 2018:

Simplify? API Documentation and examples for both the Data Store REST interface and Python bindings will be published.

Or if you adopt the style above:

  • Publish API documentation and examples for both the Data Store REST interface and Python bindings

kozbo (author):

I like the latter.

* Multi-cloud replication of objects - There are three parts to this:
Contributor:

nit: I would remove "- There are three parts to this:". That there are three parts to this is implied by the outline format.

Collaborator:

Fetch ... the comfy chair!


### Security
* User authentication system implementation
* Data access authorization system implementation
Contributor:

Would it be more accurate to say that DSS will provide authentication and authorization? Implementation of the foundational systems is not really the remit of the DSS.

kozbo (author):

Won't each Box/module have to implement their part of Auth? I was thinking that the Auth architecture and implementation of basic libs would be the responsibility of the DevSecOps group, but the implementation required to hook into those systems would be the responsibility of each Box/module. That is what I was going for here.



## Description
The Data Store is a scientific data sharing/publishing/distribution framework, providing file/bundle management on multiple clouds at PB scale, with strong public APIs. It provides a simple API for storage, retrieval, and subscription to events that functions transparently across multiple cloud systems such as AWS and GCP.
Contributor:

I would define the acronym that you use below so that people know what DSS is: The Data Storage System (DSS) ...

* DevSecOps - implementation of features required for eventual FISMA moderate deployments (authentication, authorization, logging, auditing, etc).
* Operations for DSS - Implement and configure tools to facilitate the operation of the Data Store service in a production environment
Contributor:

Use DSS or Data Storage Service (with capitalization). Not both.



* Other index/query methods/engines - we should implement these as stand-alone projects against modular index/query API.
Contributor:

How about: query languages and indices?

* Matrix service API

* Mid-2018: 1000 bundle test scale, deploy as part of HCA DCP Pilot
Contributor:

double space

* HumanCellAtlas/data-store : general data store discussions
* HumanCellAtlas/data-store-eng : development discussions
Contributor:

nit: I'd eliminate the extra space for consistency.

* Collections service APIs - Maintenance and extension of the ability to do basic operations on arbitrary collections of objects in the Data Store.
Collaborator:

Or:

  • Maintain and extend the Collections service API which enables basic operations on arbitrary collections of objects in the Data Store.

* HCA DCP CLI tool - The HCA DCP CLI is a foundational tool for the DCP and its users. All subcomponents in the DCP use the same CLI system. The Data Store team will maintain the infrastructure to support the general CLI architecture as well as the CLI commands relating to the Data Store itself. Other modules such as Upload and Ingest will be responsible for implementing their respective functional components of the CLI.
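As a sketch of how such a shared CLI architecture is typically wired together (the module and command names here are illustrative, not the actual hca CLI internals):

```python
# A core parser owns the entry point; each DCP module registers its own
# subcommands. The Data Store team maintains the registration machinery
# and the "dss" commands; other teams plug in theirs the same way.
import argparse

def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="hca")
    modules = parser.add_subparsers(dest="module", required=True)

    dss = modules.add_parser("dss", help="Data Store commands")
    dss.add_argument("command", choices=["get-bundle", "put-file"])

    upload = modules.add_parser("upload", help="Upload service commands")
    upload.add_argument("command", choices=["file", "status"])
    return parser

if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.module, args.command)
```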

* DSS data model and lifecycle (versioned bundles, etc) - Ongoing support and maintenance of the implementation of the data model.
Collaborator:

  • Maintain and extend the DSS data model and lifecycle (such as versioned bundles)

Collaborator:

How about something a bit tighter like:

The Command Line Interface (CLI) is a foundational tool for interacting with the DCP. The Data Store team is responsible for the specific Data Store commands and the maintenance of the infrastructure that allows other services such as Upload and Ingest to integrate their commands into the CLI.

kozbo (author):

Yes, thanks for the improved wording. I am thinking that the Data Store should be included in the review of all CLI changes by the other modules. What do folks think of that addition?

@brianraymor (Collaborator) left a comment:

See my inlined editorial pass. In general, imagine that a new member of the community is reading your charter to understand the Data Store scope and responsibilities.

Added mention of software contributions from community.
@mweiden (Contributor) commented Sep 14, 2018:

The architecture team, sans an ingest tech lead, approved this charter in its meeting yesterday.

If @tburdett approves, we can merge it.

@tburdett (Contributor) left a comment:

Sorry for the super late review!

There's one thing I think must be addressed before I'd be happy to approve this. There's no user focus, so it is not clear what the process is for determining how data is to be organised in the data store. The DSS is the bridge between contributors and consumers, so from the ingest perspective, the one piece of information I need is how to structure data from contributors based on consumer needs. In practice this means that ownership of bundle specifications - and the process for how they evolve to meet new use cases (e.g. from red box portals) - either needs to be in scope for the data store or explicitly pushed out to another charter (metadata?). I would also want to see a declaration of how that process is managed. For example, how often will bundle structuring requirements change and with how much notice? Given the close relationship between the bundle definition and the DSS implementation, I personally think the DSS probably needs to own bundle specs and the process for modifying them.

I've also provided some other inline comments, most of them minor, that would be nice to clarify. These are mostly suggestions for clarity and readability though.

@kozbo (Contributor, Author) commented Sep 19, 2018:

@tburdett The only section I am having trouble resolving is from your main problem with the charter :-) Do you think that I should include use cases? I see the bundle structure as being owned by the metadata, indexing, Green, and Orange teams. Blue just stores the data, especially once the ES index is moved out.

Most of comments from Tony B.
@tburdett (Contributor) commented Sep 20, 2018:

@kozbo thanks for the changes, much clearer!

On bundle structure, it makes sense to me for the DSS to not own the specification. However, the DSS is stuck between ingest and then pipelines, tertiary portals and the data portal. Given this organisation, I'd want to see a process whereby requests for a new mechanism of data access will come to the DSS, be assessed against those use cases and requirements that motivated the bundle-centric design, and, if the request is compatible, then be handed off to (say) the metadata team to define the spec for a new bundle type.

To provide a concrete example: let's say a tertiary portal asks for a new type of bundle that contains "all smartseq data". This might be trivial for the metadata team and ingest to implement. But we definitely wouldn't want the request to hit those teams first, as I assume (correct me if I'm wrong) that this would be A Bad Thing(tm) for the DSS. This is a very different use case from the current assay-driven bundle design driven by pipeline use cases. This example is not a million miles away from the drive to redefine bundles to make them easier for the matrix service to handle.

If this all makes sense, some indication in this charter that the DSS will own the process of collecting new bundle requirements from users and assessing them against the data store design before sending them to the metadata team for the definition of a new spec would be good enough for me to accept this review. This means the DSS charter owns the bundle use cases and bundle type definitions, and the metadata team takes ownership of defining the precise specification. It would be worth consulting with the metadata schema charter @lauraclarke @morrisonnorman @diekhans as to whether that aligns with their expected process.

@lauraclarke (Member):

So I don't seem to be able to thread with @tburdett's comment about bundle specification. I was under the impression that the metadata/ingest service was going to specify the individual files and which files belong together in a bundle (so all the data and metadata files associated with a single experiment output, such as a run/plex of a sequencing library), but we weren't going to specify how that bundle was structured inside the data store.

I am happy for there to be reciprocal statements in both the metadata charter and the data store charter about this, but as the metadata charter does not consider the concept of a bundle at all at the moment, we will need to discuss how it is framed.

As you said, whatever is put in both the data store and metadata charters, we do need to make sure that the other components (and other data consumers) have some input into the process.

@kozbo (Contributor, Author) commented Sep 23, 2018:

@tburdett @lauraclarke
I added language around GDPR and managing the data model. Let's put the final polish on this on Tuesday.

## Definitions
**Bundle** A bundle is a list of related files along with some very basic metadata such as filenames.
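For illustration, a bundle under this definition might look like the following; the field names and values are hypothetical.

```python
# Hypothetical bundle manifest: a versioned list of related files plus
# minimal metadata such as filenames and checksums.
bundle = {
    "uuid": "4be0071d-b36e-4414-a7ee-7b879f60dcf8",   # invented example id
    "version": "2018-09-20T105924.351000Z",
    "files": [
        {"name": "assay.json", "uuid": "…", "sha256": "…"},
        {"name": "r1.fastq.gz", "uuid": "…", "sha256": "…"},
    ],
}
```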

**DCP** The Data Coordination Platform is the name given to the entire system used to ingest, validate, store, analyze, and make available the datga in the Human Cell Atlas project.
Contributor:

datga typo

@kbergin (Contributor) commented Sep 28, 2018:

Hello!

Would you like to add a reference to the new upload team email on this charter?

On the Data Processing Charter I decided to make the title of the charter mailto:email@data.humancellatlas.org and to add it to the Communication section. I formatted it as
[Team email](mailto:email@data.humancellatlas.org): email@data.humancellatlas.org

For your reference, your team email is dss-team@data.humancellatlas.org

Thanks! I'll be posting this on each charter for convenience, please do let me know your thoughts :)

kozbo merged commit 3292a48 into HumanCellAtlas:master on Oct 4, 2018