Data Store Charter #8

Merged: 17 commits, Oct 4, 2018
75 changes: 75 additions & 0 deletions charters/DataStore/charter.md
@@ -0,0 +1,75 @@

# Data Store


## Description
kozbo marked this conversation as resolved.
The Data Store is a scientific data sharing/publishing/distribution framework, providing file/bundle management on multiple clouds at PB scale, with strong public APIs. It provides a simple API for storage, retrieval, and subscription to events that functions transparently across multiple cloud systems such as AWS and GCP.
Contributor

I would define the acronym that you use below so that people know what DSS is: The Data Storage System (DSS) ...


## Objective

Elsewhere in the document it is clear that the team intends to a) build software, and b) provide devops services in support of operating this software. This section should make that explicit, as devops support (to the lead Ops team) is a core objective of the team.

The objective of the Data Store group is to deliver substantively complete functionality on all of the in-scope items listed in this charter.
Contributor

Would this be generally true for all charters? If we approach objective this way it does not add information to the charter.

Contributor Author

@kozbo, Aug 20, 2018

Honestly, I wasn't sure what to put here. Now that there are other charters I will take a look to see if I can get inspired!
UPDATE: OK that helped. I have taken another pass at it.


## In-scope
kozbo marked this conversation as resolved.

### Interfaces
* DSS data read and write API (PUT bundle, PUT file, GET bundle, GET file) - maintenance and extension of the implementation of the basic data access APIs.
Collaborator

@brianraymor, Aug 14, 2018

DSS needs to be defined per its use. Then there should be consistency about when DSS is used instead of Data Store.

Contributor Author

I will move them all to Data Store

* Checkout service APIs - Provide continuing support for the ability to check out the data to a local filesystem or a personal cloud environment.
Collaborator

@brianraymor, Aug 14, 2018

What is the difference between continuing support and maintenance [and extension] ?

In general, I find this style to be wordy. Why not eliminate the API "titles" and use a form like:

  • Maintain and extend (or continue to support if that's preferred) the Checkout service API which enables data checkout to a local filesystem or a personal cloud environment

* Collections service APIs - Maintenance and extension of the ability to do basic operations on arbitrary collections of objects in the Data Store.
Collaborator

Or:

  • Maintain and extend the Collections service API which enables basic operations on arbitrary collections of objects in the Data Store.

* API Documentation - Programmatic APIs available for the Data Store include the REST interface and the Python bindings. Documentation and examples will be created for both of these APIs.
Collaborator

@brianraymor, Aug 14, 2018

Simplify? API Documentation and examples for both the Data Store REST interface and Python bindings will be published.

Or if you adopt the style above:

  • Publish API documentation and examples for both the Data Store REST interface and Python bindings

Contributor Author

I like the latter.

* HCA DCP CLI tool - The HCA DCP CLI is a foundational tool for the DCP and its users. All subcomponents in the DCP use the same CLI system. The Data Store team will maintain the infrastructure to support the general CLI architecture as well as the CLI commands relating to the Data Store itself. Other modules such as Upload and Ingest will be responsible for implementing their respective functional components of the CLI.
Collaborator

How about something a bit tighter like:

The Command Line Interface (CLI) is a foundational tool for interacting with the DCP. The Data Store team is responsible for the specific Data Store commands and the maintenance of the infrastructure that allows other services such as Upload and Ingest to integrate their commands into the CLI.

Contributor Author

Yes, thanks for the improved wording. I am thinking that the Data Store should be included in the review of all CLI changes by the other modules. What do folks think of that addition?


### Core capabilities
kozbo marked this conversation as resolved.
* DSS data model and lifecycle (versioned bundles, etc) - Ongoing support and maintenance of the implementation of the data model.
Collaborator

  • Maintain and extend the DSS data model and lifecycle (such as versioned bundles)
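
To make the "versioned bundles" part of this bullet concrete, here is a minimal Python sketch of how such a model could behave. It is an illustration under stated assumptions, not the Data Store implementation: `Bundle`, `InMemoryBundleStore`, and the timestamp-style version strings are hypothetical names and conventions chosen only for this example.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Dict, Iterable, Optional, Tuple


@dataclass(frozen=True)
class Bundle:
    """Hypothetical immutable bundle: a UUID, a version, and the files it references."""
    uuid: str
    version: str
    file_uuids: Tuple[str, ...]


class InMemoryBundleStore:
    """Toy store that keeps every version of every bundle (assumption: versions are never overwritten)."""

    def __init__(self) -> None:
        # Maps bundle UUID -> {version string -> Bundle}.
        self._bundles: Dict[str, Dict[str, Bundle]] = {}

    def put_bundle(self, uuid: str, file_uuids: Iterable[str],
                   version: Optional[str] = None) -> Bundle:
        """Store a new version; the default version is a fixed-width UTC timestamp."""
        if version is None:
            version = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H%M%S.%fZ")
        versions = self._bundles.setdefault(uuid, {})
        if version in versions:
            raise ValueError(f"bundle {uuid} already has version {version}")
        bundle = Bundle(uuid, version, tuple(file_uuids))
        versions[version] = bundle
        return bundle

    def get_bundle(self, uuid: str, version: Optional[str] = None) -> Bundle:
        """Return the requested version, or the latest one when no version is given."""
        versions = self._bundles[uuid]
        if version is None:
            # Fixed-width timestamps sort lexicographically in chronological order.
            version = max(versions)
        return versions[version]


# Usage: two versions of the same bundle; reads default to the newest one.
store = InMemoryBundleStore()
store.put_bundle("b1", ["f1"], version="2018-08-01T000000.000000Z")
store.put_bundle("b1", ["f1", "f2"], version="2018-09-01T000000.000000Z")
latest = store.get_bundle("b1")
```

The design choice sketched here is that a write never mutates an existing version, so readers that pin a version always see the same bundle contents.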

* Subscriptions/Eventing - Implementation of data lifecycle web-hooks (new bundle, new file, delete bundle, delete file). The Data Store implementation will move away from its current dependence on Elastic Search Percolate for the event subsystem. Eventing will depend instead on the AWS and GCP cloud infrastructure directly.
Collaborator

@brianraymor, Aug 14, 2018

The Data Store event subsystem will transition from its current dependence on Elastic Search Percolate to the AWS and GCP cloud infrastructure.

Or:

  • Transition Data Store Subscriptions/Eventing services from the current dependence on Elastic Search Percolate to the AWS and GCP cloud infrastructure
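
As an illustration of the four lifecycle events named in the bullet (new bundle, new file, delete bundle, delete file), the following Python sketch shows a subscription registry that delivers events to subscribers. It is only a sketch: `EventType`, `Event`, and `SubscriptionRegistry` are hypothetical names, and plain Python callables stand in for real web-hook HTTP deliveries or any AWS/GCP notification service.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Dict, List


class EventType(Enum):
    """The four data lifecycle events listed in the charter bullet."""
    NEW_BUNDLE = "new bundle"
    NEW_FILE = "new file"
    DELETE_BUNDLE = "delete bundle"
    DELETE_FILE = "delete file"


@dataclass(frozen=True)
class Event:
    """Hypothetical event payload: what happened, and to which object."""
    event_type: EventType
    object_uuid: str


Callback = Callable[[Event], None]


class SubscriptionRegistry:
    """Toy stand-in for an event subsystem (assumption: callbacks replace web-hook calls)."""

    def __init__(self) -> None:
        self._subscribers: Dict[EventType, List[Callback]] = {t: [] for t in EventType}

    def subscribe(self, event_type: EventType, callback: Callback) -> None:
        """Register a callback for one event type."""
        self._subscribers[event_type].append(callback)

    def publish(self, event: Event) -> int:
        """Deliver the event to every matching subscriber; return how many were notified."""
        callbacks = self._subscribers[event.event_type]
        for callback in callbacks:
            callback(event)
        return len(callbacks)


# Usage: one subscriber that records every new-bundle event it receives.
received: List[Event] = []
registry = SubscriptionRegistry()
registry.subscribe(EventType.NEW_BUNDLE, received.append)
delivered = registry.publish(Event(EventType.NEW_BUNDLE, "bundle-1"))
```

Because subscribers only depend on the event shape, the delivery mechanism behind `publish` could be swapped without changing them, which is the property the proposed transition relies on.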

* Multi-cloud replication of objects - There are three parts to this:
Contributor

nit: I would remove "- There are three parts to this:". That there are three parts is implied by the outline format.

Collaborator

Fetch ... the comfy chair!

1. Maintenance and improvements to the synchronization implementation between AWS and GCP
2. Extending the cloud support to more vendors (such as Microsoft Azure)
Collaborator

Is Azure in scope for this charter?

Contributor Author

Azure is not specifically in scope for this charter. I think I should reword this to:
2. Document interfaces to enable new cloud implementations by 3rd parties.

3. Supporting multiple replicas within a single cloud.
* Support for pluggable indexes - Provide a standard interface for connecting indexing modules to the Data Store. This interface will provide a mechanism to connect indexing subsystems to receive events about the data.
Collaborator

Repetitive. How about Define a standard interface to enable pluggable indexing modules to receive Data Store events
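
For readers wondering what a "standard interface" for pluggable indexing modules might look like, here is a small Python sketch. Everything in it is an assumption made for illustration: `Indexer`, `BundleSetIndexer`, `IndexerHub`, and the dictionary-shaped events are hypothetical and are not taken from the Data Store code.

```python
from abc import ABC, abstractmethod
from typing import Dict, List, Set


class Indexer(ABC):
    """Hypothetical plug-in contract: an indexing module only has to accept events."""

    @abstractmethod
    def handle_event(self, event: Dict[str, str]) -> None:
        """Update the index in response to one Data Store event."""


class BundleSetIndexer(Indexer):
    """Example plug-in that tracks which bundle UUIDs currently exist."""

    def __init__(self) -> None:
        self.bundle_uuids: Set[str] = set()

    def handle_event(self, event: Dict[str, str]) -> None:
        if event["type"] == "new bundle":
            self.bundle_uuids.add(event["uuid"])
        elif event["type"] == "delete bundle":
            self.bundle_uuids.discard(event["uuid"])
        # Other event types (for example file events) are ignored by this indexer.


class IndexerHub:
    """Toy registry that forwards every event to all registered indexers."""

    def __init__(self) -> None:
        self._indexers: List[Indexer] = []

    def register(self, indexer: Indexer) -> None:
        if not isinstance(indexer, Indexer):
            raise TypeError("indexer must implement the Indexer interface")
        self._indexers.append(indexer)

    def emit(self, event: Dict[str, str]) -> None:
        for indexer in self._indexers:
            indexer.handle_event(event)


# Usage: plug in one indexer and replay a few lifecycle events.
hub = IndexerHub()
bundle_index = BundleSetIndexer()
hub.register(bundle_index)
hub.emit({"type": "new bundle", "uuid": "a"})
hub.emit({"type": "new bundle", "uuid": "b"})
hub.emit({"type": "delete bundle", "uuid": "a"})
hub.emit({"type": "new file", "uuid": "f"})
```

The point of the abstract base class is that the hub knows nothing about any particular index, so other index or query engines could live in stand-alone projects, as the Out-of-scope section proposes.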


### Security
* User authentication system implementation
* Data access authorization system implementation
Contributor

Would it be more accurate to say that DSS will provide authentication and authorization? Implementation of the foundational systems is not really the remit of the DSS.

Contributor Author

Won't each Box/module have to implement its part of Auth? I was thinking that Auth architecture and implementation of basic libs would be the responsibility of the DevSecOps group, but the implementation required to hook into those systems would be the responsibility of each Box/module. That is what I was going for here.

kozbo marked this conversation as resolved.
* DevSecOps - implementation of features required for eventual FISMA moderate deployments (authentication, authorization, logging, auditing, etc.).
* Operations for DSS - Implement and configure tools to facilitate the operation of the Data Store service in a production environment
Contributor

Use DSS or Data Storage Service (with capitalization). Not both.


### Community engagement
* Triage and integration of feature requests from the community into the Data Store roadmap.
* Outreach and engagement of the community
Collaborator

Are Training and Hackathons sub-bullets or examples of Outreach and engagement of the community?

kozbo marked this conversation as resolved.
* Training
* Hackathons
Contributor

Could you mention a little more about what the hackathons and trainings would be about? I think it is less important what is the type of event facilitating the training and more important what they would focus on.



## Out-of-scope
kozbo marked this conversation as resolved.
* Other index/query methods/engines - we should implement these as stand-alone projects against modular index/query API.
Collaborator

Who is "we" in "we should implement these"?

Contributor

How about: query languages and indices?

Contributor

Change index to plural to match the rest of the list.

* Matrix service API

## Milestones
Collaborator

Milestones should refer to features with the same names that were introduced in earlier sections. For example, pluggable indexing modules versus stand-alone indexers.

Perhaps, something like:
EOY 2018: Add Support for:

  • Checkout Service
  • Collection Service
  • Pluggable indexing modules
  • ...

To clarify, is Data Store planning to re-charter in January 2019 based on the milestones?

kozbo marked this conversation as resolved.
* Mid-2018: 1000 bundle test scale, deploy as part of HCA DCP Pilot
Contributor

double space

* EOY 2018: add checkout, collections, improved scaling/hardening, generic events to support stand-alone indexers, additional gaps identified in HCA DCP Pilot.
* Future: (not in order of precedence) native GCP support, Authorization support for controlled-access data, additional scale/hardening, Biosphere requirements, tiered storage, content zones, FISMA moderate capabilities, single-replica deployments
Collaborator

Future would represent a re-charter to extend scope and milestones.

Contributor Author

So should I take out the Future section?

Collaborator

Correct. Futures represent potential scope for a future, refreshed Data Store charter. Well-scoped charters are not intended to be for forever. Now, if you have a clear milestone to deliver (for example) native GCP support by May 2019, then that would be different. Does that make sense?


## Roles

### Project Lead
[Brian O’Connor](mailto:brocono@ucsc.edu)

### Product Owner
[Kevin Osborn](mailto:kosborn2@ucsc.edu)

### Technical Lead
[Hannes Schmidt](mailto:hannes@ucsc.edu)

## Communication
### Slack Channels
* HumanCellAtlas/data-store : general data store discussions
* HumanCellAtlas/data-store-eng : development discussions
Contributor

nit: I'd eliminate the extra space for consistency.

### Mailing list(s)
Collaborator

If not used, then Mailing list(s) and Discussion Forum(s) can be deleted.

### Discussion Forum(s)

## GitHub repositories
* https://github.com/HumanCellAtlas/data-store
* https://github.com/HumanCellAtlas/dcp-cli
* https://github.com/HumanCellAtlas/metadata-api
* https://github.com/chanzuckerberg/cloud-blobstore
* https://github.com/HumanCellAtlas/checksumming_io