Data Store Charter #8

Merged · 17 commits · Oct 4, 2018
Conversation

@kozbo (Contributor) commented Aug 11, 2018:

Ready for review, choosing a few members from the Architecture team to get this started.

## In-scope

### Interfaces
* DSS data read and write API (PUT bundle, PUT file, GET bundle, GET file) - maintenance and extension of the implementation of the basic data access APIs.
@brianraymor (Collaborator) commented Aug 14, 2018:
DSS needs to be defined per its use. Then there should be consistency about when DSS is used instead of Data Store.

kozbo (author):
I will move them all to Data Store
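To make the data access bullet above concrete, here is a minimal sketch of two of the four calls in Python; the base URL, paths, and payload fields are illustrative assumptions, not the documented Data Store API.

```python
# Minimal sketch of the basic data access API (GET bundle, PUT file).
# Base URL, endpoint paths, and field names are assumptions for illustration.
import requests

DSS = "https://dss.example.org/v1"  # hypothetical base URL

def get_bundle(uuid: str, replica: str = "aws") -> dict:
    """GET bundle: fetch the manifest listing the files in a bundle."""
    resp = requests.get(f"{DSS}/bundles/{uuid}", params={"replica": replica})
    resp.raise_for_status()
    return resp.json()

def put_file(uuid: str, source_url: str, creator_uid: int = 0) -> dict:
    """PUT file: register a file from a staging area into the store."""
    resp = requests.put(
        f"{DSS}/files/{uuid}",
        json={"source_url": source_url, "creator_uid": creator_uid},
    )
    resp.raise_for_status()
    return resp.json()

# GET file and PUT bundle follow the same request/response pattern.
```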


### Core capabilities
* DSS data model and lifecycle (versioned bundles, etc) - Ongoing support and maintenance of the implementation of the data model.
* Subscriptions/Eventing - Implementation of data lifecycle web-hooks (new bundle, new file, delete bundle, delete file). The Data Store implementation will move away from the current dependence on Elastic Search Percolate for our event subsystem. Eventing will depend instead on the AWS and GCP cloud infrastructure directly.
@brianraymor (Collaborator) commented Aug 14, 2018:
The Data Store event subsystem will transition from its current dependence on Elastic Search Percolate to the AWS and GCP cloud infrastructure.

Or:

  • Transition Data Store Subscriptions/Eventing services from the current dependence on Elastic Search Percolate to the AWS and GCP cloud infrastructure

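As a rough sketch of the web-hook subscriptions described in the Subscriptions/Eventing bullet, a consumer might register a callback URL like this; the endpoint and field names are assumptions for illustration.

```python
# Register a web-hook so the Data Store POSTs lifecycle events
# (new bundle, delete file, ...) to our service. Hypothetical endpoint.
import requests

DSS = "https://dss.example.org/v1"  # hypothetical base URL

def subscribe(callback_url: str, replica: str = "aws") -> str:
    resp = requests.put(
        f"{DSS}/subscriptions",
        params={"replica": replica},
        json={"callback_url": callback_url},
    )
    resp.raise_for_status()
    return resp.json()["uuid"]  # keep this id to unsubscribe later
```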
* Multi-cloud replication of objects - There are three parts to this:
1. Maintenance and improvements to the synchronization implementation between AWS and GCP
2. Extending the cloud support to more vendors (such as Microsoft Azure)
Collaborator:

Is Azure in scope for this charter?

kozbo (author):

Azure is not specifically in scope for this charter. I think I should reword this to:
2. Document interfaces to enable new cloud implementations by 3rd parties.

3. Supporting multiple replicas within a single cloud.
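To illustrate item 1, a single-object synchronization between the AWS and GCP replicas might reduce to something like the sketch below; the bucket names are placeholders, and a production implementation would be event-driven and verify checksums rather than copy unconditionally.

```python
# Copy one object from the AWS replica bucket to the GCP replica bucket.
# Bucket names are placeholders for illustration.
import boto3
from google.cloud import storage

def sync_object(key: str,
                s3_bucket: str = "dss-replica-aws",
                gcs_bucket: str = "dss-replica-gcp") -> None:
    # Read the object body from S3, then write it to the GCS replica.
    body = boto3.client("s3").get_object(Bucket=s3_bucket, Key=key)["Body"].read()
    storage.Client().bucket(gcs_bucket).blob(key).upload_from_string(body)
```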
* Support for pluggable indexes - Provide a standard interface for connecting indexing modules to the Data Store. This interface will provide a mechanism to connect indexing subsystems to receive events about the data.
Collaborator:

Repetitive. How about: "Define a standard interface to enable pluggable indexing modules to receive Data Store events"?
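One hypothetical shape for that standard interface, sketched in Python; the class and method names are invented for illustration.

```python
# A pluggable-index contract: indexing modules implement these hooks and
# the Data Store fans lifecycle events out to every registered module.
from abc import ABC, abstractmethod

class Indexer(ABC):
    @abstractmethod
    def on_bundle_created(self, bundle_uuid: str, manifest: dict) -> None: ...

    @abstractmethod
    def on_bundle_deleted(self, bundle_uuid: str) -> None: ...

class LoggingIndexer(Indexer):
    """Trivial example module: just logs the events it receives."""
    def on_bundle_created(self, bundle_uuid, manifest):
        print(f"index {bundle_uuid} ({len(manifest.get('files', []))} files)")

    def on_bundle_deleted(self, bundle_uuid):
        print(f"remove {bundle_uuid} from index")
```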

## Milestones
* Mid-2018: 1000 bundle test scale, deploy as part of HCA DCP Pilot
* EOY 2018: add checkout, collections, improved scaling/hardening, generic events to support stand-alone indexers, additional gaps identified in HCA DCP Pilot.
* Future: (not in order of precedence) native GCP support, Authorization support for controlled-access data, additional scale/hardening, Biosphere requirements, tiered storage, content zones, FISMA moderate capabilities, single-replica deployments
Collaborator:

Future would represent a re-charter to extend scope and milestones.

kozbo (author):

So should I take out the Future section?

Collaborator:

Correct. Futures represent potential scope for a future, refreshed Data Store charter. Well-scoped charters are not intended to last forever. Now, if you have a clear milestone to deliver (for example) native GCP support by May 2019, then that would be different. Does that make sense?

## Communication
### Slack Channels
* HumanCellAtlas/data-store : general data store discussions
* HumanCellAtlas/data-store-eng : development discussions
### Mailing list(s)
Collaborator:

If not used, then Mailing list(s) and Discussion Forum(s) can be deleted.


### Community engagement
* Triage and integration of feature requests from the community into the Data Store roadmap.
* Outreach and engagement of the community
Collaborator:

Are Training and Hackathons sub-bullets or examples of Outreach and engagement of the community?



## Out-of-scope
* Other index/query methods/engines - we should implement these as stand-alone projects against modular index/query API.
Collaborator:

Who is the "we" in "we should implement these"?


* Checkout service APIs - Provide continuing support for the ability to check out data to a local filesystem or a personal cloud environment
@brianraymor (Collaborator) commented Aug 14, 2018:

What is the difference between continuing support and maintenance [and extension]?

In general, I find this style to be wordy. Why not eliminate the API "titles" and use a form like:

  • Maintain and extend (or continue to support if that's preferred) the Checkout service API which enables data checkout to a local filesystem or a personal cloud environment
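For concreteness, the checkout flow under discussion might look roughly like the sketch below; the endpoint path and job fields are assumptions, reflecting only that checkout to a filesystem or bucket is an asynchronous job.

```python
# Start a checkout and poll until the files have been staged.
# Endpoint and field names are hypothetical.
import time
import requests

DSS = "https://dss.example.org/v1"  # hypothetical base URL

def checkout_bundle(uuid: str, replica: str = "aws") -> str:
    resp = requests.post(f"{DSS}/bundles/{uuid}/checkout",
                         params={"replica": replica})
    resp.raise_for_status()
    job_id = resp.json()["checkout_job_id"]
    # Staging files to a bucket or local filesystem takes time, so poll.
    while requests.get(f"{DSS}/bundles/checkout/{job_id}").json()["status"] == "RUNNING":
        time.sleep(5)
    return job_id
```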

* Collections service APIs - Maintenance and extension of the ability to do basic operations on arbitrary collections of objects in the Data Store.
* API Documentation - Programmatic APIs available for the Data Store include the REST interface and the Python bindings. Documentation and examples will be created for both of these APIs.
@brianraymor (Collaborator) commented Aug 14, 2018:

Simplify? API Documentation and examples for both the Data Store REST interface and Python bindings will be published.

Or if you adopt the style above:

  • Publish API documentation and examples for both the Data Store REST interface and Python bindings

kozbo (author):

I like the latter.

* Multi-cloud replication of objects - There are three parts to this:
Contributor:

nit: I would remove "- There are three parts to this:". That there are three parts to this is implied by the outline format.

Collaborator:

Fetch ... the comfy chair!


### Security
* User authentication system implementation
* Data access authorization system implementation
Contributor:

Would it be more accurate to say that DSS will provide authentication and authorization? Implementation of the foundational systems is not really the remit of the DSS.

kozbo (author):

Won't each Box/module have to implement their part of Auth? I was thinking that the Auth architecture and implementation of basic libs would be the responsibility of the DevSecOps group, but the implementation required to hook into those systems would be the responsibility of each Box/module. That is what I was going for here.



## Description
The Data Store is a scientific data sharing/publishing/distribution framework, providing file/bundle management on multiple clouds at PB scale, with strong public APIs. It provides a simple API for storage, retrieval, and subscription to events that functions transparently across multiple cloud systems such as AWS and GCP.
Contributor:

I would define the acronym that you use below so that people know what DSS is: The Data Storage System (DSS) ...

* DevSecOps - implementation of features required for eventual FISMA moderate deployments (authentication, authorization, logging, auditing, etc).
* Operations for DSS - Implement and configure tools to facilitate the operation of the Data Store service in a production environment
Contributor:

Use DSS or Data Storage Service (with capitalization). Not both.



* Other index/query methods/engines - we should implement these as stand-alone projects against modular index/query API.
Contributor:

How about: query languages and indices?

* Matrix service API

* Mid-2018: 1000 bundle test scale, deploy as part of HCA DCP Pilot
Contributor:

double space

* HumanCellAtlas/data-store : general data store discussions
* HumanCellAtlas/data-store-eng : development discussions
Contributor:

nit: I'd eliminate the extra space for consistency.

* Collections service APIs - Maintenance and extension of the ability to do basic operations on arbitrary collections of objects in the Data Store.
Collaborator:

Or:

  • Maintain and extend the Collections service API which enables basic operations on arbitrary collections of objects in the Data Store.

* HCA DCP CLI tool - The HCA DCP CLI is a foundational tool for the DCP and its users. All subcomponents in the DCP use the same CLI system. The Data Store team will maintain the infrastructure to support the general CLI architecture as well as the CLI commands relating to the Data Store itself. Other modules such as Upload and Ingest will be responsible for implementing their respective functional components of the CLI.
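As a sketch of how such a shared CLI architecture is typically wired together (the module and command names here are illustrative, not the actual hca CLI internals):

```python
# A core parser owns the entry point; each DCP module registers its own
# subcommands. The Data Store team maintains the registration machinery
# and the "dss" commands; other teams plug in theirs the same way.
import argparse

def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="hca")
    modules = parser.add_subparsers(dest="module", required=True)

    dss = modules.add_parser("dss", help="Data Store commands")
    dss.add_argument("command", choices=["get-bundle", "put-file"])

    upload = modules.add_parser("upload", help="Upload service commands")
    upload.add_argument("command", choices=["file", "status"])
    return parser

if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.module, args.command)
```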

* DSS data model and lifecycle (versioned bundles, etc) - Ongoing support and maintenance of the implementation of the data model.
Collaborator:

  • Maintain and extend the DSS data model and lifecycle (such as versioned bundles)

Collaborator:

How about something a bit tighter like:

The Command Line Interface (CLI) is a foundational tool for interacting with the DCP. The Data Store team is responsible for the specific Data Store commands and the maintenance of the infrastructure that allows other services such as Upload and Ingest to integrate their commands into the CLI.

kozbo (author):

Yes, thanks for the improved wording. I am thinking that the Data Store should be included in the review of all CLI changes by the other modules. What do folks think of that addition?

@brianraymor (Collaborator) left a comment:

See my inlined editorial pass. In general, imagine that a new member of the community is reading your charter to understand the Data Store scope and responsibilities.

Added mention of software contributions from community.
@mweiden (Contributor) commented Sep 14, 2018:

The architecture team, sans an ingest tech lead, approved this charter in its meeting yesterday.

If @tburdett approves, we can merge it.

@tburdett (Contributor) left a comment:

Sorry for the super late review!

There's one thing I think must be addressed before I'd be happy to approve this. There's no user focus, so it is not clear what the process is for determining how data is to be organised in the data store. The DSS is the bridge between contributors and consumers, so from the ingest perspective, the one piece of information I need is how to structure data from contributors based on consumer needs. In practice this means that ownership of bundle specifications - and the process for how they evolve to meet new use cases (e.g. from red box portals) - either needs to be in scope for the data store or explicitly pushed out to another charter (metadata?). I would also want to see a declaration of how that process is managed. For example, how often will bundle structuring requirements change and with how much notice? Given the close relationship between the bundle definition and the DSS implementation, I personally think the DSS probably needs to own bundle specs and the process for modifying them.

I've also provided some other inline comments, most of them minor, that would be nice to clarify. These are mostly suggestions for clarity and readability though.

@kozbo (Contributor, Author) commented Sep 19, 2018:

@tburdett The only section I am having trouble resolving is from your main problem with the charter :-) Do you think that I should include use cases? I see the bundle structure as being owned by the metadata, indexing, Green, and Orange teams. Blue just stores the data, especially once the ES index is moved out.

Most of comments from Tony B.
@tburdett (Contributor) commented Sep 20, 2018:

@kozbo thanks for the changes, much clearer!

On bundle structure, it makes sense to me for the DSS to not own the specification. However, the DSS is stuck between ingest and then pipelines, tertiary portals and the data portal. Given this organisation, I'd want to see a process whereby requests for a new mechanism of data access will come to the DSS, be assessed against those use cases and requirements that motivated the bundle-centric design, and, if the request is compatible, then be handed off to (say) the metadata team to define the spec for a new bundle type.

To provide a concrete example: let's say a tertiary portal asks for a new type of bundle that contains "all smartseq data". This might be trivial for the metadata team and ingest to implement. But we definitely wouldn't want the request to hit those teams first, as I assume (correct me if I'm wrong) that this would be A Bad Thing(tm) for the DSS. This is a very different use case from the current assay-driven bundle design driven by pipeline use cases. This example is not a million miles away from the drive to redefine bundles to make them easier for the matrix service to handle.

If this all makes sense, some indication in this charter that the DSS will own the process of collecting new bundle requirements from users and assessing them against the data store design before sending them to the metadata team for the definition of a new spec would be good enough for me to accept this review. This means the DSS charter owns the bundle use cases and bundle type definitions, and the metadata team takes ownership of defining the precise specification. It would be worth consulting with the metadata schema charter @lauraclarke @morrisonnorman @diekhans as to whether that aligns with their expected process.

@lauraclarke (Member):

So I don't seem to be able to thread with @tburdett's comment about bundle specification. I was under the impression that the metadata/ingest service was going to specify the individual files and which files belong together in a bundle (so all the data and metadata files associated with a single experiment output, such as a run/plex of a sequencing library), but we weren't going to specify how that bundle was structured inside the data store.

I am happy for there to be reciprocal statements in both the metadata charter and the data store charter about this, but as the metadata charter does not consider the concept of a bundle at all at the moment, we will need to discuss how it is framed.

As you said, whatever is put in both the data store and metadata charters, we do need to make sure that the other components (and other data consumers) have some input into the process.

@kozbo (Contributor, Author) commented Sep 23, 2018:

@tburdett @lauraclarke
I added language around GDPR and managing the data model. Let's put the final polish on this on Tuesday.

## Definitions
**Bundle** A bundle is a list of related files along with some very basic metadata such as filenames.
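For illustration, a bundle under this definition might look like the following; the field names and values are hypothetical.

```python
# Hypothetical bundle manifest: a versioned list of related files plus
# minimal metadata such as filenames and checksums.
bundle = {
    "uuid": "4be0071d-b36e-4414-a7ee-7b879f60dcf8",   # invented example id
    "version": "2018-09-20T105924.351000Z",
    "files": [
        {"name": "assay.json", "uuid": "…", "sha256": "…"},
        {"name": "r1.fastq.gz", "uuid": "…", "sha256": "…"},
    ],
}
```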

**DCP** The Data Coordination Platform is the name given to the entire system used to ingest, validate, store, analyze, and make available the datga in the Human Cell Atlas project.
Contributor:

datga typo

@kbergin (Contributor) commented Sep 28, 2018:

Hello!

Would you like to add a reference to the new upload team email on this charter?

On the Data Processing Charter I decided to make the title of the charter mailto:email@data.humancellatlas.org and to add it to the Communication section. I formatted it as
[Team email](mailto:email@data.humancellatlas.org): email@data.humancellatlas.org

For your reference, your team email is dss-team@data.humancellatlas.org

Thanks! I'll be posting this on each charter for convenience, please do let me know your thoughts :)

kozbo merged commit 3292a48 into HumanCellAtlas:master on Oct 4, 2018