Line reader #30

yingsu00 · 2018-09-07T06:31:40Z

Throw IOException when the text line is too big, that is, when any of the following holds:

1) The capacity of the Text object would exceed the VM limit for the array size;
2) The number of bytes to store into the Text object exceeds the maxLineLength limit;
3) The number of consumed bytes exceeds maxBytesToConsume.

Case 1) would cause OOM, and the fix (the same as the second commit in this PR)
was submitted as the following PR to Apache Hadoop:
apache/hadoop#414

The fix for cases 2) and 3) is a behavior change: when a line exceeds the configured
maxLineLength limit (assuming it is less than the VM array size limit), the LineReader
throws an IOException so that the query fails loudly. The original LineReader behavior
was to silently discard the content of a line past the maxLineLength limit.
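
For illustration, here is a minimal sketch of the difference between the two behaviors. It is a simplified stand-in, not the actual Hadoop LineReader code: the class name `LineLimitSketch` and the methods `readLineTruncating` and `readLineStrict` are hypothetical, and it handles only `\n` line endings.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for illustration only; not the Hadoop LineReader.
class LineLimitSketch {
    // Old-style behavior: keep at most maxLineLength bytes and silently
    // drop the remainder of the line.
    static String readLineTruncating(InputStream in, int maxLineLength) throws IOException {
        ByteArrayOutputStream line = new ByteArrayOutputStream();
        int b;
        while ((b = in.read()) != -1 && b != '\n') {
            if (line.size() < maxLineLength) {
                line.write(b);
            }
        }
        return new String(line.toByteArray(), StandardCharsets.UTF_8);
    }

    // New-style behavior: fail loudly as soon as the line would exceed
    // maxLineLength bytes.
    static String readLineStrict(InputStream in, int maxLineLength) throws IOException {
        ByteArrayOutputStream line = new ByteArrayOutputStream();
        int b;
        while ((b = in.read()) != -1 && b != '\n') {
            if (line.size() >= maxLineLength) {
                throw new IOException("Line exceeds maxLineLength of " + maxLineLength + " bytes");
            }
            line.write(b);
        }
        return new String(line.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "0123456789\nnext\n".getBytes(StandardCharsets.UTF_8);
        // The truncating reader returns a shortened line without any signal.
        System.out.println(readLineTruncating(new ByteArrayInputStream(data), 4));
        try {
            readLineStrict(new ByteArrayInputStream(data), 4);
        } catch (IOException e) {
            System.out.println("failed: " + e.getMessage());
        }
    }
}
```

With the truncating variant, a caller cannot tell a short line from a long line that was cut off, which is why the strict variant surfaces the problem as an exception.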

kokosing · 2018-09-07T07:34:45Z

pom.xml

-                <configuration>
-                    <failBuildInCaseOfConflict>true</failBuildInCaseOfConflict>
-                </configuration>
-            </plugin>


why is this removed?

@kokosing The LineReader class is duplicated and this plugin would throw an error when building.

But you excluded this file here: https://github.com/prestodb/presto-hadoop-apache2/pull/30/files#diff-600376dffeb79835ede4a0b285078036R546.

This plugin was added here for a reason: to catch exactly that situation. If a class is duplicated, which one will be used?

That exclude doesn't work. I don't see what this maven-duplicate-finder-plugin is used for; there are no other duplicated classes. As a similar example, the presto-hive-apache pom.xml doesn't include the maven-duplicate-finder-plugin either. However, I added it back but changed the configuration to

true

@electrum what do you think?

The shade plugin does a duplicate check itself. Unfortunately, it's only a warning and doesn't fail the build, but shading is complicated and doesn't change often, so it seems reasonable that anyone working on it or reviewing can review the build output. (there are lots of other things you have to review manually when shading, this is one of them)

Having the duplicate checker is problematic because it doesn't use the exclusion config from the shade plugin. It has to be configured separately, and can be misleading, since you might configure only the duplicate checker and not the shade plugin and would be led to believe the build was safe.

Hence, my recommendation is to remove it.

Thanks for the explanation.

dain · 2018-09-07T19:44:44Z

Can you submit a PR (or issue) to Hadoop to get this change merged upstream, and then reference it in the code... basically, say, "Remove this class once xxx is resolved in Hadoop"

yingsu00 · 2018-09-10T22:19:31Z

@kokosing Regarding the maven-duplicate-finder-plugin: as @electrum explained, it duplicates a check that the maven-shade-plugin already performs, and we only need one. I have removed the maven-duplicate-finder-plugin again and kept the maven-shade-plugin with the exclusion of the LineReader class.

electrum · 2018-09-10T23:20:56Z

The last commit title is too long (notice how GitHub truncates it). Please see guidelines here: https://chris.beams.io/posts/git-commit/

electrum · 2018-09-10T23:26:51Z

Can you add a test for LineReader to cover the new behavior? It should be easy to create one using ByteArrayInputStream as the input source. Ideally, the test should cover all of the places where the exception is thrown (run with coverage in your IDE and make sure all of the throw statements are hit).

electrum · 2018-09-10T23:10:41Z

pom.xml

-                <configuration>
-                    <failBuildInCaseOfConflict>true</failBuildInCaseOfConflict>
-                </configuration>
-            </plugin>


The shade plugin does a duplicate check itself. Unfortunately, it's only a warning and doesn't fail the build, but shading is complicated and doesn't change often, so it seems reasonable that anyone working on it or reviewing can review the build output. (there are lots of other things you have to review manually when shading, this is one of them)

Having the duplicate checker is problematic because it doesn't use the exclusion config from the shade plugin. It has to be configured separately, and can be misleading, since you might configure only the duplicate checker and not the shade plugin and would be led to believe the build was safe.

Hence, my recommendation is to remove it.

electrum · 2018-09-10T23:13:05Z

pom.xml

@@ -340,6 +340,7 @@
            <version>6.2.1</version>
            <scope>test</scope>
        </dependency>
+


Nit: this looks like an accidental change

electrum · 2018-09-10T23:31:35Z

src/main/java/org/apache/hadoop/util/LineReader.java

@@ -240,12 +246,19 @@ private int readDefaultLine(Text str, int maxLineLength, int maxBytesToConsume)
                appendLength = maxLineLength - txtLength;
            }
            if (appendLength > 0) {
+                int newTxtLength = txtLength + appendLength;
+                if (str.getBytes().length < newTxtLength && Math.max(newTxtLength, txtLength << 1) > MAX_ARRAY_SIZE) {


Do we need to worry about overflow of txtLength << 1? Same question for the one below

In such a case, txtLength << 1 wraps to a negative value, so Math.max(newTxtLength, txtLength << 1) will be newTxtLength, which is <= maxLineLength and won't overflow. This is the same logic as in Text.setCapacity().
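
To make the overflow argument concrete, here is a small sketch that isolates the condition from the diff. The class name `CapacityCheck` and the method `wouldExceedLimit` are hypothetical, and the value of `MAX_ARRAY_SIZE` is an assumption (the constant's definition is not shown in this thread); `currentCapacity` stands in for `str.getBytes().length`.

```java
// Hypothetical helper for illustration; mirrors the condition in the diff above.
class CapacityCheck {
    // Assumed value: the actual constant used in the PR is not shown here.
    static final int MAX_ARRAY_SIZE = Integer.MAX_VALUE - 8;

    // currentCapacity stands in for str.getBytes().length.
    static boolean wouldExceedLimit(int currentCapacity, int txtLength, int appendLength) {
        int newTxtLength = txtLength + appendLength;
        return currentCapacity < newTxtLength
                && Math.max(newTxtLength, txtLength << 1) > MAX_ARRAY_SIZE;
    }

    public static void main(String[] args) {
        int large = 1_500_000_000;
        // For any txtLength >= 2^30, the shift wraps to a negative int,
        // so Math.max falls back to newTxtLength.
        System.out.println(large << 1);
        System.out.println(wouldExceedLimit(0, large, 1000));

        // Just below 2^30 the doubled length does not wrap, but it is
        // larger than the assumed MAX_ARRAY_SIZE, so the check fires.
        int nearLimit = (1 << 30) - 1;
        System.out.println(wouldExceedLimit(0, nearLimit, 1));
    }
}
```

Under these assumptions, the shift either stays positive and is compared against the limit directly, or wraps negative and is ignored by Math.max, so the check depends on newTxtLength alone.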

src/main/java/org/apache/hadoop/util/LineReader.java

The shade plugin already checks for duplicate classes.

electrum

See comments, otherwise looks good