Respect textinputformat.record.delimiter conf setting in LzoLineRecordReader #86
base: master
…dReader
This causes LzoTextInputFormat to follow the behavior of TextInputFormat, as shown in the code below:
http://grepcode.com/file/repository.cloudera.com/content/repositories/releases/org.apache.hadoop/hadoop-mapreduce-client-core/2.0.0-cdh4.4.0/org/apache/hadoop/mapred/TextInputFormat.java#62
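As a rough illustration of what honoring `textinputformat.record.delimiter` means, here is a minimal sketch in plain Java. It is not the Hadoop or hadoop-lzo code: the class `RecordDelimiterSketch`, its methods, and the use of a `Map` in place of a Hadoop configuration object are all hypothetical stand-ins, and the default case is simplified to a bare `\n`.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in; a Map plays the role of the job configuration.
final class RecordDelimiterSketch {
    static final String DELIMITER_KEY = "textinputformat.record.delimiter";

    /** Returns the configured delimiter as bytes, or null when the setting is absent. */
    static byte[] delimiterBytes(Map<String, String> conf) {
        String delimiter = conf.get(DELIMITER_KEY);
        return delimiter == null ? null : delimiter.getBytes(StandardCharsets.UTF_8);
    }

    /** Splits data into records; a null or empty delimiter falls back to "\n" (a simplification). */
    static List<String> splitRecords(String data, byte[] delimiter) {
        String sep = (delimiter == null || delimiter.length == 0)
                ? "\n"
                : new String(delimiter, StandardCharsets.UTF_8);
        List<String> records = new ArrayList<>();
        int start = 0;
        int idx;
        while ((idx = data.indexOf(sep, start)) >= 0) {
            records.add(data.substring(start, idx));
            start = idx + sep.length();
        }
        if (start < data.length()) {
            records.add(data.substring(start)); // trailing record without a final delimiter
        }
        return records;
    }
}
```

With the setting at `||`, the input `"a\nb||c||"` yields the two records `"a\nb"` and `"c"`; with the setting absent, `"a\nb"` yields `"a"` and `"b"`.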
I spent quite some time writing a unit test for this, but did not end up with one I was satisfied with. Following the tests in TestLzoTextInputFormat, I used a RecordWriter<Text, Text> with data in the key position and an empty Text() in the value position. But when reading back from the produced file, I ended up with spurious tabs all over the place. What's the proper way to write data with the LZO codec in a unit test so that LzoTextInputFormat can read it without junk characters inserted into the data?
Weird, that method doesn't exist in your CI's version of Hadoop but does in my CDH4.4.0. Any suggestions for reconciling the two?
The patch looks fine. Regarding the unit test, I don't think hadoop-lzo adds extra tabs anywhere. Can you update the patch with the test you have?
What version is CI using? I think this method exists in all Hadoop versions.
You're right that hadoop-lzo doesn't add additional tabs. What I mean is that the RecordWriter I'm using to create the test file puts a tab between the key and value of each record, and the RecordReader then picks that tab up as data when reading the file back.
Write: ash211@dcbad20#diff-6e30a0c822cde1296ce246e316c91d59R81
Read: ash211@dcbad20#diff-6e30a0c822cde1296ce246e316c91d59R104
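To make the tab issue concrete, here is a simplified model of a key/value text line writer, written in plain Java. It assumes the test's RecordWriter behaves like Hadoop's text output writer, which to my understanding emits the separator only when both key and value are present and skips it when either is null (or NullWritable in Hadoop). The class `TabSeparatorSketch` and its method are hypothetical and are not taken from the patch.

```java
// Hypothetical, simplified model of a key/value text line writer; not Hadoop code.
final class TabSeparatorSketch {

    /** Writes key, then a tab only if both sides are non-null, then value, then a newline. */
    static String writeLine(String key, String value) {
        StringBuilder line = new StringBuilder();
        if (key != null) {
            line.append(key);
        }
        if (key != null && value != null) {
            line.append('\t'); // an empty (but non-null) value still triggers the separator
        }
        if (value != null) {
            line.append(value);
        }
        return line.append('\n').toString();
    }
}
```

Under this model, an empty value produces `"data\t\n"`, which matches the spurious tabs described above, while a null value produces `"data\n"`. Whether the same holds for the writer used in the actual test is an assumption to verify.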
I found these lines in the Travis CI logs:
So it looks like 1.0.4.
Actually, hadoop-lzo is built against two versions of Hadoop: 1.0.4 and 2.1.0 (see https://github.com/twitter/hadoop-lzo/blob/master/pom.xml#L91). So the code must build and test cleanly against both versions. We have a small compatibility utility for the places where we have to deal with binary- or API-compatibility issues (https://github.com/twitter/hadoop-lzo/blob/master/src/main/java/com/hadoop/compression/lzo/util/CompatibilityUtil.java). The suggestion is to stick with APIs that are unchanged across both versions. If you must use APIs that are affected by the incompatibility, you may need to look at the compatibility util and adopt or augment it. Hope this helps.
Hi @sjlee, from my reading of the 1.0 source (https://svn.apache.org/repos/asf/hadoop/common/branches/branch-1.0/src/core/org/apache/hadoop/util/LineReader.java), it looks like 1.0.4's LineReader doesn't have the method I need and doesn't offer an alternative either. I think the behavior I'd like, then, is this: if you're running against the 2.0 APIs and have the conf setting set, it uses that setting; if you're running against the 1.0 APIs and have it set, you get a warning that the setting only works with the new APIs. If the setting isn't set, things work as before on either version. Is that the right goal to aim for? It should be pretty easy to set this up if I can detect which API version is in use, and CompatibilityUtil.isVersion2x() looks like exactly what I'd need for that.
Sounds good. It looks like CDH includes this API, even with hadoop-1x. This is a fairly useful feature. Rather than checking the Hadoop version, you can check whether the constructor exists when the config is set, and otherwise initialize the reader without a delimiter.
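The constructor check suggested here could look roughly like the sketch below. It uses two hypothetical stand-in classes, `OldReader` and `NewReader`, rather than Hadoop's LineReader, and the two-argument constructor signature is invented for illustration, so it does not reflect the real Hadoop API. The warning for the unsupported case follows the behavior proposed earlier in the thread.

```java
import java.io.InputStream;
import java.lang.reflect.Constructor;

// Sketch only: OldReader/NewReader are hypothetical stand-ins for two library versions.
final class ConstructorProbeSketch {

    /** Hypothetical older reader: no delimiter-aware constructor. */
    static class OldReader {
        public OldReader(InputStream in) {}
    }

    /** Hypothetical newer reader: adds a delimiter-aware constructor. */
    static class NewReader {
        final byte[] delimiter;
        public NewReader(InputStream in) { this(in, null); }
        public NewReader(InputStream in, byte[] delimiter) { this.delimiter = delimiter; }
    }

    /** True if readerClass declares a public (InputStream, byte[]) constructor. */
    static boolean hasDelimiterConstructor(Class<?> readerClass) {
        try {
            readerClass.getConstructor(InputStream.class, byte[].class);
            return true;
        } catch (NoSuchMethodException e) {
            return false;
        }
    }

    /** Uses the delimiter-aware constructor when available; otherwise warns and falls back. */
    static Object createReader(Class<?> readerClass, InputStream in, byte[] delimiter) {
        try {
            if (delimiter != null) {
                if (hasDelimiterConstructor(readerClass)) {
                    Constructor<?> ctor =
                            readerClass.getConstructor(InputStream.class, byte[].class);
                    return ctor.newInstance(in, delimiter);
                }
                System.err.println("Warning: record delimiter is set but "
                        + readerClass.getName() + " does not support it; ignoring it");
            }
            // No delimiter configured, or not supported: behave as before.
            return readerClass.getConstructor(InputStream.class).newInstance(in);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Could not construct " + readerClass.getName(), e);
        }
    }
}
```

Probing for the constructor instead of the version number means a distribution that backports the constructor to an older release line gets the feature too, which is the situation described for CDH above.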
You'd suggest doing that check with reflection, checking for the …
On Mon, Jan 6, 2014 at 2:03 PM, Raghu Angadi notifications@github.com wrote:
Yep, IMHO.
Okay, I'll give that a shot and see what I come up with. Might be a day or …
On Mon, Jan 6, 2014 at 4:34 PM, Raghu Angadi notifications@github.com wrote: