Testing the schema with data on Mesa at Google #131

pgrosu · 2014-08-24T17:20:52Z

Hi Cassie, David, et al,

I just read the paper regarding Google's Mesa from the following links:

http://research.google.com/pubs/pub42851.html

http://research.google.com/pubs/archive/42851.pdf

I noticed it has several advantages such as online schema changes as well as query by function on sets of values, which can perform in near real-time. Another advantage is that it has petascale data-warehousing with ACID properties for transactions. The function on sets can be especially useful for the many-to-many relationships we have in our schema.

This seems to have some advantages over Megastore, Spanner, and F1 that we can try to leverage.

I was wondering if we can test the schema with data on a development area of Mesa.

Thank you,
Paul

max-biodatomics · 2014-08-25T14:48:08Z

Paul,

Mesa is closed source.
There are several open source alternatives for it in the Hadoop stack,
including Impala, Hive + Tez, and Shark.

Max

pgrosu · 2014-08-25T16:15:25Z

Hi Max,

I understand and thank you for the alternatives, but I'm not sure that it should preclude us - especially if we gain other benefits. You'll notice BigQuery is also closed source, but we have an implementation of it for Google Genomics here:

https://github.com/googlegenomics/bigquery-examples

Thus, utilizing the capabilities via a service does not require seeing the source code. All I am saying is that we have a better platform in operation, ready to go for large-scale storage and analysis, and there seem to be advantages to such a closed platform, which has components implemented in C/C++ rather than Java, along with other optimizations (e.g. Colossus). Think of Google Caffeine, which is based on Percolator and replaced processing on MapReduce because it allowed more efficient processing of the indexing system. In fact, Google Pregel can be very helpful for the variant analysis step, where for GAVariationReference we can have very complex graphs, and it also has a C++ implementation to make it fast. Again, the source code is not necessary; what matters is processing the data via a service.

Since we are all working together as a team, everyone brings their own expertise, which makes this project great. At least for me, the source code of the system we use is not that critical. If the service can correctly accept and process a schema for updating the keys or data for storage and processing, then I see no downside. If a new platform exists with added benefits over the current implementations, and it does not impact production negatively, then I say let's try it out. Otherwise we keep tweaking the schema because of limitations in older technologies, which might limit some of the analysis possibilities down the line.

Paul

cassiedoll · 2014-08-25T16:54:37Z

@pgrosu - this kind of thing probably isn't a good fit for ga4gh.
This is very google specific, and so would fit best in a google specific area. Likewise, that bigquery repo you linked to isn't part of ga4gh, and there are no plans to move it over.

Like Max said, there are plenty of open source alternatives for this use case if we decide we need this kind of solution.

In this repo though (ga4gh/schemas) we actually don't have a need for any large scale backend as this is just an API definition - and not an implementation. Because of that, I'm closing this issue.

pgrosu · 2014-08-25T19:06:23Z

@cassiedoll - I understand, no problem :)

cassiedoll closed this as completed Aug 25, 2014

pgrosu mentioned this issue Sep 11, 2014

GA4GH APIs need to address scientific reproducibility (propose immutable datatypes) #142

Closed

pgrosu mentioned this issue Mar 18, 2015

Side graphs: Sequences and Joins #250

Merged

pgrosu mentioned this issue Mar 26, 2015

Rationalises the SearchObjectRequest semantics. #253

Merged

pgrosu mentioned this issue Jun 5, 2015

It's not clear that the avro schema files are not required to be used in implementations ( / Is AVRO the right tool for the job?) #287

Closed

dcolligan pushed a commit to dcolligan/ga4gh-schemas that referenced this issue Jul 20, 2016

Fixed ga4gh#131. Filename in convert_to_binary.sh is now correct.

b043c69
