
Fixes map-only jobs to accommodate both an lzo source and sink binary converter #1431

Merged
ulyssepence merged 1 commit into twitter:develop from ulyssepence:map_only_converter
Aug 27, 2015

Conversation

@ulyssepence
Contributor

The LzoGenericScheme was using the same jobconf key for both the source and the sink BinaryConverter, leading to a ClassCastException whenever a map-only job read one type and wrote another using LzoGenericScheme. After the fix, a job like the following, which has no reduce phase, no longer fails:

val sourceA = LzoGenericSource(ScroogeBinaryConverter[ThriftTypeA], classOf[ThriftTypeA], inputPathA)
val sourceB = LzoGenericSource(ScroogeBinaryConverter[ThriftTypeB], classOf[ThriftTypeB], inputPathB)
val sink    = LzoGenericSource(ScroogeBinaryConverter[ThriftTypeC], classOf[ThriftTypeC], outputPath)

val unioned = TypedPipe.from(sourceA) ++ TypedPipe.from(sourceB)

unioned
  .map {
    case ThriftTypeA(i) => ThriftTypeC(i)
    case ThriftTypeB(i) => ThriftTypeC(i)
  }
  .write(sink)

Unfortunately, it is difficult to create a good test because running this in a local hadoop cluster requires GPL compression libraries to be installed.

@ulyssepence changed the title from "Fixes map-only jobs to accommodate both an lzo scrooge source and sink binary converter" to "Fixes map-only jobs to accommodate both an lzo source and sink binary converter" on Aug 26, 2015
@johnynek
Collaborator

What if there are two different source types? Also, are we 100% sure cascading cannot write two sinks in the same job?

I thought cascading had a means to keep keys separated on a per-tap basis. Are we not reimplementing that?

/cc @cwensel

@johnynek
Collaborator

To be more clear: I see how your job has two different source types, but I wonder: if there were a reduce phase, would this job have worked even without the fix?

Also, how can we have code that is okay (with respect to GPL) when testing it requires GPL code? Also, I think Hadoop-lzo is GPL, so the module that depends on it has to be GPL (we can dual-license it, but the lzo stuff is probably GPL).

@ulyssepence
Contributor Author

Oh, I'm so dumb: I didn't realize GPL was GPL and thought it was some compression suite. What I should have said is that it requires the LZO system libraries to be installed.

Mr. @ianoc told me multiple writes in Scalding always lead to multiple jobs, but I can look into it tomorrow.

I don't understand your question about the reducer.

@ianoc
Collaborator

ianoc commented Aug 26, 2015

@johnynek if there were a force-to-disk/reduce phase this would have been OK, which is why .shard gets people who've hit it unblocked. We aren't re-implementing the multi-source/multi-tap stuff here; we actually still require it.

Previously this code used the same key in the job conf for both inputs and outputs, which could cause a collision when one ended up in the global job conf. Hadoop MR formats use separate keys on the input and output sides for the same thing, so here we are just aligning with that; it should support any multi-source/tap scenario the same as the built-in formats, I believe.
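
A minimal sketch of that keying change, with hypothetical class and method names (the two key strings match the review snippet further down):

object ConfigKeys {
  // One key per side of the job, as proposed in the review below.
  val sourceKey = "com.twitter.scalding.lzo.converter.provider.source"
  val sinkKey   = "com.twitter.scalding.lzo.converter.provider.sink"
}

// Hypothetical scheme sketch: the source side and the sink side each record
// their converter provider's class name under a distinct key, so a map-only
// job's single JobConf can carry both without one overwriting the other.
class SketchScheme(sourceProvider: Class[_], sinkProvider: Class[_]) {
  import org.apache.hadoop.mapred.JobConf

  def sourceConfInit(conf: JobConf): Unit =
    conf.set(ConfigKeys.sourceKey, sourceProvider.getName)

  def sinkConfInit(conf: JobConf): Unit =
    conf.set(ConfigKeys.sinkKey, sinkProvider.getName)
}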

That make sense?

@isnotinvain
Contributor

@johnynek our (maybe unconfirmed) understanding is that cascading separates/isolates configs for each source but not for each sink; since there can be only one sink (unlike sources) per MR job, isolated sink configs aren't needed / weren't implemented.

@isnotinvain
Contributor

We can't write an integration test because of how painful it is to install the native hadoop lzo libraries on mac laptops (though we could probably write a test that only runs in Travis). However, we could write a unit test that does some setup and then inspects the configs for the right keys?

Contributor

I don't think you need this object, nor SinkConfigBinaryConverterProvider

just put these vals in:

object ConfigBinaryConverterProvider {
  val sourceKey = "com.twitter.scalding.lzo.converter.provider.source"
  val sinkKey = "com.twitter.scalding.lzo.converter.provider.sink"
}

Contributor

Oh, I see: you need these so you can have a class to put in the conf. Never mind.

@isnotinvain
Contributor

LGTM. Think you can write a unit test, though? If you skip the parts that actually read data, you can just assert that the configs contain the right keys/values.
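
A sketch of what that test could assert, using the key names from the snippet above; the shared "old" key and the provider class names are hypothetical stand-ins, and a real test would call the scheme's conf-init methods rather than setting keys by hand:

import org.apache.hadoop.mapred.JobConf

// Old behavior (hypothetical shared key): the sink's conf-init clobbers the
// source's entry, so reads later pick up the wrong converter and throw a
// ClassCastException.
val shared = "com.twitter.scalding.lzo.converter.provider"
val broken = new JobConf()
broken.set(shared, "SourceProviderA") // hypothetical provider class names
broken.set(shared, "SinkProviderC")
assert(broken.get(shared) == "SinkProviderC") // the source entry is lost

// Fixed behavior: separate source/sink keys, so both entries survive in a
// map-only job's single JobConf.
val fixed = new JobConf()
fixed.set("com.twitter.scalding.lzo.converter.provider.source", "SourceProviderA")
fixed.set("com.twitter.scalding.lzo.converter.provider.sink", "SinkProviderC")
assert(fixed.get("com.twitter.scalding.lzo.converter.provider.source") == "SourceProviderA")
assert(fixed.get("com.twitter.scalding.lzo.converter.provider.sink") == "SinkProviderC")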

@ulyssepence
Contributor Author

In order to test this without running a local hadoop cluster/using the lzo system binary, I believe we would need to strip away all the elements that would make this a good test (per conversation with @ianoc).

@rubanm
Contributor

rubanm commented Aug 27, 2015

LGTM. Given the additional lzo binary setup, adding an e2e test internally for now sgtm.

ulyssepence pushed a commit that referenced this pull request on Aug 27, 2015: Fixes map-only jobs to accommodate both an lzo source and sink binary converter
ulyssepence merged commit 4ebb2d1 into twitter:develop on Aug 27, 2015