Respect textinputformat.record.delimiter conf setting in LzoLineRecordReader #86

ash211 · 2014-01-03T10:20:19Z

This causes LzoTextInputFormat to follow the behavior of TextInputFormat as
shown in the below code

http://grepcode.com/file/repository.cloudera.com/content/repositories/releases/org.apache.hadoop/hadoop-mapreduce-client-core/2.0.0-cdh4.4.0/org/apache/hadoop/mapred/TextInputFormat.java#62

…dReader This causes LzoTextInputFormat to follow the behavior of TextInputFormat as shown in the below code http://grepcode.com/file/repository.cloudera.com/content/repositories/releases/org.apache.hadoop/hadoop-mapreduce-client-core/2.0.0-cdh4.4.0/org/apache/hadoop/mapred/TextInputFormat.java#62

ash211 · 2014-01-03T10:24:21Z

I spent quite some time writing a unit test for this, but ended up not having one I was satisfied with.

Following the tests in TestLzoTextInputFormat I used a RecordWriter<Text, Text> with data in the key and empty Text() in the value positions. But when reading from the produced file I ended up with spurious tabs all over the place. What's the proper way to write data with the LZO codec in a unit test in a way that the LzoTextInputFormat can read it without having junk characters inserted into the data?

ash211 · 2014-01-03T16:25:45Z

Weird, that method doesn't exist in your CI's version of Hadoop but does on my CDH4.4.0 -- any suggestions for reconciling the two?

rangadi · 2014-01-03T18:44:04Z

The patch looks fine. Regd, unit test, I don't think extra tabs are added by hadoop-lzo anywhere. Can you update the patch with the test you have?

rangadi · 2014-01-03T18:52:59Z

What version is CI using? this method exists in all the hadoop versions I think.

ash211 · 2014-01-03T18:55:01Z

You're right that hadoop-lzo doesn't add additional tabs, I mean that the RecordWriter I'm using to create the test file puts a tab between the key and value of the record, which is then picked up by the RecordReader as data when reading it back.

ash211@dcbad20#diff-6e30a0c822cde1296ce246e316c91d59R81 -- write

ash211@dcbad20#diff-6e30a0c822cde1296ce246e316c91d59R104 -- read

ash211 · 2014-01-03T18:57:09Z

I found these lines in the Travis CI logs:

Downloading: https://oss.sonatype.org/content/repositories/releases/org/apache/hadoop/hadoop-core/1.0.4/hadoop-core-1.0.4.pom
Downloading: https://repository.apache.org/releases/org/apache/hadoop/hadoop-core/1.0.4/hadoop-core-1.0.4.pom
Downloading: http://repo.maven.apache.org/maven2/org/apache/hadoop/hadoop-core/1.0.4/hadoop-core-1.0.4.pom
Downloaded: http://repo.maven.apache.org/maven2/org/apache/hadoop/hadoop-core/1.0.4/hadoop-core-1.0.4.pom (5 KB at 39.0 KB/sec)

So looks like 1.0.4

sjlee · 2014-01-06T16:37:11Z

Actually hadoop-lzo is built against two versions of hadoop: 1.0.4 and 2.1.0 (see https://github.com/twitter/hadoop-lzo/blob/master/pom.xml#L91). So the code must build and test cleanly against both versions.

We do have a small compatibility utility where we have to deal with binary- or API-compatibility issues (https://github.com/twitter/hadoop-lzo/blob/master/src/main/java/com/hadoop/compression/lzo/util/CompatibilityUtil.java).

The suggestion is to stick with the APIs that are unchanged in both versions. If you must use APIs that are affected by the incompatibility, then you may need to look at the compatibility util and adopt or augment it.

Hope this helps.

ash211 · 2014-01-06T18:44:49Z

Hi @sjlee ,

From my reading of the 1.0 source -- https://svn.apache.org/repos/asf/hadoop/common/branches/branch-1.0/src/core/org/apache/hadoop/util/LineReader.java -- it looks like 1.0.4's LineReader doesn't have the method I need, and doesn't have an alternative to use either.

I think the desired behavior I'd like then is if you're running against the 2.0 APIs and have the conf setting set it uses that conf setting, but if you're running against 1.0 APIs and have it set then you get a warning that the conf setting only works in the new APIs. If you're running against either and don't have the conf setting set, things work as before.

Is that the right goal to aim for?

It should be pretty easy to set this up if I can detect which API version is being used and CompatibilityUtil. isVersion2x() looks like exactly what I'd need for that.

rangadi · 2014-01-06T22:03:02Z

sounds good. Looks like CDH includes this API, even with hadoop-1x. This is a fairly useful feature. Rather than checking hadoop version, you can check if the constructor exists when the config is set. Otherwise initialize it without delimiter.

ash211 · 2014-01-07T00:28:30Z

You'd suggest doing that check with reflection, checking for the
constructor on the class?

On Mon, Jan 6, 2014 at 2:03 PM, Raghu Angadi notifications@github.comwrote:

sounds good. Looks like CDH includes this API, even with hadoop-1x. This
is a fairly useful feature. Rather than checking hadoop version, you can
check if the constructor exists when the config is set. Otherwise
initialize it without delimiter.

—
Reply to this email directly or view it on GitHubhttps://github.com//pull/86#issuecomment-31692194
.

rangadi · 2014-01-07T00:34:28Z

Yep, IMHO.

ash211 · 2014-01-07T00:48:42Z

Okay I'll give that a shot and see what I come up with. Might be a day or
two before I get it back here.

On Mon, Jan 6, 2014 at 4:34 PM, Raghu Angadi notifications@github.comwrote:

Yep, IMHO.

—
Reply to this email directly or view it on GitHubhttps://github.com//pull/86#issuecomment-31702761
.

ash211 · 2014-07-08T07:35:52Z

So apparently I never quite got back to this! Fixing the test issue is probably something someone else should take on.

@rangadi or @sjlee would you be able to help push this PR over the edge and get it committed?

CLAassistant · 2019-07-18T15:08:10Z

Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
_{You have signed the CLA already but the status is still pending? Let us recheck it.}

ash211 added 3 commits January 6, 2014 21:31

Use reflection to determine if LineReader supports custom delimiters

1906a95

BROKEN: in-progress patch that currently fails

efa9bde

Disable test

5c79846

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Respect textinputformat.record.delimiter conf setting in LzoLineRecordReader #86

Respect textinputformat.record.delimiter conf setting in LzoLineRecordReader #86

ash211 commented Jan 3, 2014

ash211 commented Jan 3, 2014

ash211 commented Jan 3, 2014

rangadi commented Jan 3, 2014

rangadi commented Jan 3, 2014

ash211 commented Jan 3, 2014

ash211 commented Jan 3, 2014

sjlee commented Jan 6, 2014

ash211 commented Jan 6, 2014

rangadi commented Jan 6, 2014

ash211 commented Jan 7, 2014

rangadi commented Jan 7, 2014

ash211 commented Jan 7, 2014

ash211 commented Jul 8, 2014

CLAassistant commented Jul 18, 2019 •

edited

Loading

Respect textinputformat.record.delimiter conf setting in LzoLineRecordReader #86

Are you sure you want to change the base?

Respect textinputformat.record.delimiter conf setting in LzoLineRecordReader #86

Conversation

ash211 commented Jan 3, 2014

ash211 commented Jan 3, 2014

ash211 commented Jan 3, 2014

rangadi commented Jan 3, 2014

rangadi commented Jan 3, 2014

ash211 commented Jan 3, 2014

ash211 commented Jan 3, 2014

sjlee commented Jan 6, 2014

ash211 commented Jan 6, 2014

rangadi commented Jan 6, 2014

ash211 commented Jan 7, 2014

rangadi commented Jan 7, 2014

ash211 commented Jan 7, 2014

ash211 commented Jul 8, 2014

CLAassistant commented Jul 18, 2019 • edited Loading

CLAassistant commented Jul 18, 2019 •

edited

Loading