This repository has been archived by the owner on Oct 28, 2022. It is now read-only.

Testing the schema with data on Mesa at Google #131

Closed
pgrosu opened this issue Aug 24, 2014 · 4 comments

Comments

@pgrosu
Contributor

pgrosu commented Aug 24, 2014

Hi Cassie, David, et al,

I just read the paper regarding Google's Mesa from the following links:

http://research.google.com/pubs/pub42851.html

http://research.google.com/pubs/archive/42851.pdf

I noticed it has several advantages, such as online schema changes and queries by function over sets of values, which can run in near real time. Another advantage is that it offers petascale data warehousing with ACID properties for transactions. The functions on sets could be especially useful for the many-to-many relationships we have in our schema.

This seems to have some advantages over Megastore, Spanner, and F1 that we can try to leverage.

I was wondering if we can test the schema with data on a development area of Mesa.

Thank you,
Paul

@max-biodatomics

Paul,

Mesa is closed source.
There are several open source alternatives for it in the Hadoop stack. They
include: Impala, Hive + Tez, Shark

Max

Maxim Mikheev M.D. Ph.D.
Founder & CEO

http://www.biodatomics.com/
Tel +1.412.475.8886
Fax +1.470.201.6233

@pgrosu
Contributor Author

pgrosu commented Aug 25, 2014

Hi Max,

I understand, and thank you for the alternatives, but I'm not sure that should preclude us, especially if we gain other benefits. You'll notice BigQuery is also closed source, but we have an implementation of it for Google Genomics here:

https://github.com/googlegenomics/bigquery-examples

Thus, utilizing the capabilities via a service does not require seeing the source code. All I am saying is that we have a better platform in operation, ready to go for large-scale storage and analysis, and there seem to be advantages to such a closed platform, which has components implemented in C/C++ rather than Java, along with other optimizations (e.g. Colossus, etc.). Think of Google Caffeine, which is based on Percolator and replaced processing on MapReduce because it allowed more efficient processing of the indexing system. In fact, Google Pregel could be very helpful for the variant analysis step, where for GAVariationReference we can have very complex graphs, and it also has a C++ implementation to make it fast. Again, the code is not necessary; only the ability to process the data via a service is.

Since we are all working together as a team, everyone brings their own expertise, which makes this project great. At least for me, the source code of the system we use is not that critical. If the service to the system can correctly accept and process a schema for updating the keys or data for storage and processing, then I see no downside. If a new platform exists with added benefits over the current implementations, and it does not impact production negatively, then I say let's try it out. Otherwise we keep tweaking the schema because of limitations in older technologies, which might limit some of the analysis possibilities down the line.

Paul

@cassiedoll
Member

@pgrosu - this kind of thing probably isn't a good fit for ga4gh.
This is very Google-specific, and so would fit best in a Google-specific area. Likewise, that bigquery repo you linked to isn't part of ga4gh, and there are no plans to move it over.

Like Max said, there are plenty of open source alternatives for this use case if we decide we need this kind of solution.

In this repo though (ga4gh/schemas) we actually don't have a need for any large scale backend as this is just an API definition - and not an implementation. Because of that, I'm closing this issue.

@pgrosu
Contributor Author

pgrosu commented Aug 25, 2014

@cassiedoll - I understand, no problem :)
