Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Line reader #30

Merged
merged 4 commits into from
Sep 14, 2018
Merged

Line reader #30

merged 4 commits into from
Sep 14, 2018

Conversation

yingsu00
Copy link

@yingsu00 yingsu00 commented Sep 7, 2018

Throw IOException when the text line is too big and

  1. The capacity of the Text object is over the VM limit for the array size;
  2. The bytes to store into the Text object is over the maxLineLength limit;
  3. The consumed bytes is over the maxBytesToConsume.

Case 1) would cause OOM, and the fix (the same as the second commit in this PR)
was submitted as the following PR to Apache Hadoop:
apache/hadoop#414

The fix for case 2) and 3) was behavior change so that when the line is over the configured
maxLineLength limit (suppose it's less than the VM array size limit), the LineReader
throws IOException so that the query would fail loudly. The original LineReader behavior
was to silently discard the content for a line after the maxLineLength limit.

<configuration>
<failBuildInCaseOfConflict>true</failBuildInCaseOfConflict>
</configuration>
</plugin>
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this removed?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kokosing The LineReader class is duplicated and this plugin would throw error when building.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But you excluded this file here: https://github.com/prestodb/presto-hadoop-apache2/pull/30/files#diff-600376dffeb79835ede4a0b285078036R546.

This plugin was added here for a reason to catch that situation. If class is duplicated, then which one will be used?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That exclude doesn't work. I don't see what this maven-duplicate-finder-plugin is used for. There are no other duplicated classes. As a similar example, the presto-hive-apache pom.xml doesn't include this maven-duplicate-finder-plugin as well. However I added it back but changed the configuration to

true

@electrum what do you think?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The shade plugin does a duplicate check itself. Unfortunately, it's only a warning and doesn't fail the build, but shading is complicated and doesn't change often, so it seems reasonable that anyone working on it or reviewing can review the build output. (there are lots of other things you have to review manually when shading, this is one of them)

Having the duplicate checker is problematic because it doesn't use the exclusion config from the shade plugin. It has to be configured separately, and can be misleading since you might only configure the duplicate checker and not the shade plugin and would be lead to believe the build was safe.

Hence, my recommendation is to remove it.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, for explanation.

@dain
Copy link

dain commented Sep 7, 2018

Can you submit a PR (or issue) to Hadoop to get this change merged upstream, and then reference it in the code... basically, say, "Remove this class once xxx is resolved in Hadoop"

@yingsu00 yingsu00 force-pushed the LineReader branch 2 times, most recently from 3ca2406 to 56d2f8b Compare September 10, 2018 09:16
@dain dain removed their request for review September 10, 2018 19:03
@dain dain assigned electrum and unassigned dain Sep 10, 2018
@yingsu00 yingsu00 removed their assignment Sep 10, 2018
@yingsu00
Copy link
Author

@kokosing Regarding the maven-duplicate-finder-plugin, as @electrum explained, is a similar plugin that does the same thing of maven-shade-plugin, and we only need one. I have again removed maven-duplicate-finder-plugin and kept the maven-shade-plugin with the exclude of LineReader class.

@yingsu00 yingsu00 assigned dain and unassigned dain Sep 10, 2018
@electrum
Copy link
Contributor

The last commit title is too long (notice how GitHub truncates it). Please see guidelines here: https://chris.beams.io/posts/git-commit/

@electrum
Copy link
Contributor

Can you add test for LineReader to cover the new behavior? It should be easy to create one using ByteArrayInputStream as the input source. Ideally, the test should cover all of the places where the exception is thrown (run with coverage in your IDE and make sure all of the throw statements are hit).

<configuration>
<failBuildInCaseOfConflict>true</failBuildInCaseOfConflict>
</configuration>
</plugin>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The shade plugin does a duplicate check itself. Unfortunately, it's only a warning and doesn't fail the build, but shading is complicated and doesn't change often, so it seems reasonable that anyone working on it or reviewing can review the build output. (there are lots of other things you have to review manually when shading, this is one of them)

Having the duplicate checker is problematic because it doesn't use the exclusion config from the shade plugin. It has to be configured separately, and can be misleading since you might only configure the duplicate checker and not the shade plugin and would be lead to believe the build was safe.

Hence, my recommendation is to remove it.

pom.xml Outdated
@@ -340,6 +340,7 @@
<version>6.2.1</version>
<scope>test</scope>
</dependency>

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: this looks like an accidental change

pom.xml Show resolved Hide resolved
src/main/java/org/apache/hadoop/util/LineReader.java Outdated Show resolved Hide resolved
src/main/java/org/apache/hadoop/util/LineReader.java Outdated Show resolved Hide resolved
@@ -240,12 +246,19 @@ private int readDefaultLine(Text str, int maxLineLength, int maxBytesToConsume)
appendLength = maxLineLength - txtLength;
}
if (appendLength > 0) {
int newTxtLength = txtLength + appendLength;
if (str.getBytes().length < newTxtLength && Math.max(newTxtLength, txtLength << 1) > MAX_ARRAY_SIZE) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to worry about overflow of txtLength << 1? Same question for the one below

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In such case Math.max(newTxtLength, txtLength << 1) will be newTxtLength which is <= maxLineLength and won't overflow. This is the same logic in Text.setCapacity().

src/main/java/org/apache/hadoop/util/LineReader.java Outdated Show resolved Hide resolved
The shade plugin already checks for duplicate classes.
@yingsu00 yingsu00 force-pushed the LineReader branch 3 times, most recently from 34edac2 to 95bc0d1 Compare September 11, 2018 22:29
Copy link
Contributor

@electrum electrum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See comments, otherwise looks good

@electrum
Copy link
Contributor

Can you also add a test for LineReader that passes?

import static java.nio.charset.StandardCharsets.UTF_8;
import static org.testng.Assert.assertEquals;

public class TestLineReader {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: formatting for this is off

// There should not be any bytes read into str because the maxLineLength is 0
reader.readLine(str, 0, 30);
assertEquals(str.getLength(), 0);
assertEquals(str.getBytes(), "".getBytes());
Copy link
Contributor

@electrum electrum Sep 14, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This second test will never be hit, because the first one would fail first if the string is non-empty. It would be a bit better to replace both tests with this:

assertEquals(str, new Text());

This will use the equals method of Text, which works correctly. The advantage is that on a failure, it will show the results of toString for the actual and expected values, which makes it easier to debug (compared to just showing the length is different).

The same goes for the other tests.

assertEquals(str.getLength(), 0);
assertEquals(str.getBytes(), "".getBytes());
}
catch (IOException e) {
Copy link
Contributor

@electrum electrum Sep 14, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This catch is not needed. Just declare the checked exception for the test method. If it throws, the test will fail.
Same goes for the rest of the methods.

// The LineReader does not store the new line character into str,
// and the str was doubled 3 times so its size is 32, with 5 0-valued
// bytes at the end, so we need to compare just the first 27 characters.
assertEquals(str.getLength(), 27);
Copy link
Contributor

@electrum electrum Sep 14, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would write this as

String input = "Hello world! Goodbye world!\n";
...
assertEquals(str, new Text(input));

You want to make tests as simple as possible, both so they are easy to read and to avoid bugs in tests.

Fail the read when the text file line is over the maxLineLength limit.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants