Limit the max line size in LineRecordReader #11370

Merged: 1 commit merged into prestodb:master on Sep 14, 2018

Conversation

@yingsu00 (Contributor) commented Aug 28, 2018

This is the fix for #11367.

When reading large text files, in fairly rare cases Presto Hive requests allocation of a byte[] of size 2,147,483,643 to 2,147,483,647. This exceeds the VM array size limit, which is Integer.MAX_VALUE - 5, but the Apache Hadoop library uses Integer.MAX_VALUE as its limit and allows such allocations. The resulting OutOfMemoryError is not caught by Presto Hive, and because of our JVM configuration it causes a JVM restart that kills all queries running on the cluster.

This commit:
1) fixes the above bug in Hadoop and allows a loud failure when a line is over the max limit;
2) makes Presto depend on the snapshot version of the Hadoop library;
3) sets the default text file line size limit to 100MB in Presto.

The Hadoop snapshot library PR is at prestodb/presto-hadoop-apache2#30.

@@ -107,6 +109,8 @@ public void updateConfiguration(Configuration config)
{
copy(resourcesConfiguration, config);

config.setInt(LineRecordReader.MAX_LINE_LENGTH, toIntExact(new DataSize(100, DataSize.Unit.MEGABYTE).toBytes()));
Contributor

Why not something closer to the original limit, which was 20x bigger?

Contributor Author

With the original limit, many humongous byte[] arrays close to 2GB would be allocated and then collected (see org.apache.hadoop.io.Text.setCapacity()). That would cause GC pressure even if the OutOfMemoryError on array size were not hit first. In one heap dump we took, I saw ~20 byte[] arrays of nearly 2GB being collected.
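
For context, the growth pattern behind those allocations looks roughly like the following sketch. This is not Hadoop's actual code, only an illustration of the doubling behavior in Text.setCapacity() described above; the class and method names here are made up.

import java.util.Arrays;

// Illustrative sketch only (not org.apache.hadoop.io.Text): each time the buffer
// is too small, it is reallocated to max(requested, 2 * current length), so lines
// approaching 2GB trigger repeated near-2GB byte[] allocations that the GC must
// later collect as humongous objects.
public class GrowableLineBuffer
{
    private byte[] bytes = new byte[0];
    private int length;

    public void append(byte[] chunk, int chunkLength)
    {
        int required = length + chunkLength;
        if (bytes.length < required) {
            // If doubling overflows, Math.max falls back to the exact required size.
            bytes = Arrays.copyOf(bytes, Math.max(required, bytes.length * 2));
        }
        System.arraycopy(chunk, 0, bytes, length, chunkLength);
        length = required;
    }
}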

Contributor

I'm concerned that we may be overriding a value set in a hadoop.conf file, so maybe we should only set this if it is not already set. Also, I think we should make this value configurable.

I'm not sure about the default value. @findepi what do you think is a good default for this?

Contributor

After more thought, I think we should handle this like we do the dfsTimeout config option in this method: there is a value that comes from the HiveClientConfig, and we always apply it (see the sketch below).
We would still need a good default value.
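
For illustration, the dfsTimeout-style wiring could look roughly like this. It is a hypothetical sketch, not the final code; the class and field names are made up, and only LineRecordReader.MAX_LINE_LENGTH and the DataSize usage come from the diff above.

import io.airlift.units.DataSize;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

import static java.lang.Math.toIntExact;

// Hypothetical sketch: the limit is injected from HiveClientConfig and always
// applied to the Hadoop Configuration, mirroring how dfsTimeout is handled.
public class TextLineLengthConfigurer
{
    private final DataSize textMaxLineLength;

    public TextLineLengthConfigurer(DataSize textMaxLineLength)
    {
        this.textMaxLineLength = textMaxLineLength;
    }

    public void updateConfiguration(Configuration config)
    {
        config.setInt(LineRecordReader.MAX_LINE_LENGTH, toIntExact(textMaxLineLength.toBytes()));
    }
}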

Contributor

Once this becomes configurable, I am not that concerned about the value. The only problem is that, in case of a too-long line, the exception message would not help in tracing the config switch governing the behavior.
As this is supposed to be an OOM-preventing safety mechanism only, we can start with 100MB as the default and revisit the decision if we face any problems with it.

@yingsu00 force-pushed the LineReaderMaxBytes branch from 52c5129 to e94eb91 on August 29, 2018 01:55
@@ -892,6 +894,18 @@ public HiveClientConfig setAssumeCanonicalPartitionKeys(boolean assumeCanonicalP
return this;
}

public DataSize getMaxLineLength()
Contributor

add

@MinDataSize("1kB")   ← technically 1B, but that's not practical min value
@MaxDataSize(...) ← highest value that won't OOM (may vary between JVMs, see io.airlift.slice.Slices#MAX_ARRAY_SIZE)

Contributor Author

@findepi I couldn't find io.airlift.slice.Slices#MAX_ARRAY_SIZE, so I'm using @MaxDataSize("1GB") for now

return maxLineLength;
}

@Config("mapreduce.input.linerecordreader.line.maxlength")
Contributor

not sure what the config name should be, but all config names in this class begin with "hive."

Contributor Author

This is the original config name in org.apache.hadoop.mapreduce.lib.input.LineRecordReader. @dain do you think it's ok to keep using it?

Contributor

It should follow the convention of the other file formats (e.g., hive.orc.max-merge-distance), so something like hive.text.max-line-length.

Contributor Author

@dain would it cause confusion if it's different from the original config in org.apache.hadoop.mapreduce.lib.input.LineRecordReader?

Contributor

I don't think so. None of the configuration properties use the Hadoop names. If anything, I think it would be more confusing to have one property that does not follow the existing convention.

Contributor

Let's change the config name so that it's clear this is truncation. How about hive.text.line-truncation-limit?

@@ -107,7 +107,8 @@ public void testDefaults()
.setCreatesOfNonManagedTablesEnabled(true)
.setHdfsWireEncryptionEnabled(false)
.setPartitionStatisticsSampleSize(100)
.setCollectColumnStatisticsOnWrite(false));
.setCollectColumnStatisticsOnWrite(false)
.setMaxLineLength(new DataSize(100, DataSize.Unit.MEGABYTE)));
Contributor

Please add this in the same order as in the HiveClientConfig (before isUseParquetColumnNames).

(Having the same order everywhere reduces potential conflicts.)

@yingsu00 force-pushed the LineReaderMaxBytes branch 2 times, most recently from 0e0630b to 7db27c0 on August 29, 2018 19:38
@@ -96,6 +96,8 @@

private List<String> resourceConfigFiles = ImmutableList.of();

private DataSize maxLineLength = new DataSize(100, DataSize.Unit.MEGABYTE);
Contributor

@findepi, do you have a suggestion for a good default here?

Contributor Author

@dain Piotr had this comment before:

findepi (13 hours ago): Once this becomes configurable, I am not that concerned about the value. The only problem is that, in case of a too-long line, the exception message would not help in tracing the config switch governing the behavior. As this is supposed to be an OOM-preventing safety mechanism only, we can start with 100MB as the default and revisit the decision if we face any problems with it.

Contributor

What does the error message look like when we hit this?

Contributor Author

@dain It's just this line in the GC log:
java.lang.OutOfMemoryError: Requested array size exceeds VM limit

However, after this change we shouldn't hit this anymore.

Contributor

Right. I mean, when we hit this limit do we get something like "Text line larger than 100MB"?

Contributor

Wow. I would not have expected that. @findepi does that seem reasonable?

Contributor

With David's suggested config changes, I think this is reasonable.

Contributor Author

Cool. As long as we agree that truncating the line at 100MB is fine.

Contributor

the rest of the line is silently discarded.

A bit scary. Is it really discarded, or is it taken to constitute the next line?

Contributor Author

@findepi The characters after the limit are really discarded.

@@ -154,6 +155,7 @@ public void testExplicitPropertyMappings()
.put("hive.force-local-scheduling", "true")
.put("hive.max-concurrent-file-renames", "100")
.put("hive.assume-canonical-partition-keys", "true")
.put("mapreduce.input.linerecordreader.line.maxlength", "1MB")
Contributor

Maybe pick a more "unnatural" value that could never be a possible default, like 13MB.

return maxLineLength;
}

@Config("mapreduce.input.linerecordreader.line.maxlength")
Contributor

Add a description

@ConfigDescription("Text file lines larger than this will be silently truncated")

@@ -78,6 +78,7 @@ public void testDefaults()
.setMaxPartitionsPerWriter(100)
.setMaxOpenSortFiles(50)
.setWriteValidationThreads(16)
.setMaxLineLength(new DataSize(100, DataSize.Unit.MEGABYTE))
Contributor

The DataSize prefix is not necessary since Unit is already imported. Your IDE should have a warning for this.
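
In other words, assuming DataSize.Unit is imported in the test class, the call can be shortened to:

.setMaxLineLength(new DataSize(100, Unit.MEGABYTE))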

@@ -892,6 +894,20 @@ public HiveClientConfig setAssumeCanonicalPartitionKeys(boolean assumeCanonicalP
return this;
}

@MinDataSize("1kB")
@MaxDataSize("1GB")
public DataSize getMaxLineLength()
Contributor

Also add @NotNull
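
Putting the review suggestions together, the getter would end up looking something like this sketch (not necessarily the exact merged code):

@MinDataSize("1kB")
@MaxDataSize("1GB")
@NotNull
public DataSize getMaxLineLength()
{
    return maxLineLength;
}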

@@ -892,6 +894,20 @@ public HiveClientConfig setAssumeCanonicalPartitionKeys(boolean assumeCanonicalP
return this;
}

@MinDataSize("1kB")
Contributor

Why not make this 1B? It would be strange to set it this low, but I see no reason why we need to prevent it.

Contributor Author

@electrum @findepi suggested making this 1kB:

findepi (15 hours ago): add
@MinDataSize("1kB") ← technically 1B, but that's not a practical min value

@yingsu00 force-pushed the LineReaderMaxBytes branch from 7db27c0 to b6c90b3 on August 29, 2018 23:09
return maxLineLength;
}

@Config("hive.text.line-truncation-limit")
Contributor

hive.text.line-truncation-threshold?

Contributor Author

ok

@nezihyigitbasi (Contributor)

Do we really want silent truncation? I feel like failing loudly for larger lines is better (in terms of correctness) than silently returning a subset of the data. We have other similar limits where we fail loudly; e.g., in SliceDirectStreamReader::readBlock() we have a 1GB limit.

@dain (Contributor) commented Aug 30, 2018

@nezihyigitbasi, I agree, but the problem is that the Text reader code comes from Hadoop and doesn't have a "fail" mode, so we are trying to figure out the best alternative to writing our own text reader. For perspective, the current behavior is that the whole JVM crashes because we get an OOM.

@findepi (Contributor) commented Aug 30, 2018

For perspective, the current behavior is that the whole JVM crashes because we get an OOM

A JVM crash is a bad thing. However, if we don't crash, we let a query return wrong (truncated) results. Neither option looks great.

@dain (Contributor) commented Aug 30, 2018

@findepi the phrase "wrong (truncated) results" depends on your definition of what correct results are. If we define correctness as what Hive/Hadoop will produce, then that behavior depends on the hadoop.conf file, which can have this property set. I don't believe that Hadoop has a default for this, so maybe we should make the config null by default?

@yingsu00 (Contributor Author)

Probably the "functionally correct way" is to set this default limit as the array size VM limit, which is 2,147,483,645 on our system. Then we can avoid the error "java.lang.OutOfMemoryError: Requested array size exceeds VM limit", and lines longer than this is not supported by Hadoop anyways. But this may cause "java.lang.OutOfMemoryError: Heap space" on the other hand, because the Hadoop code keep allocating and collecting arrays of near 2GB size. In the heap dump we saw ~20 such arrays which takes 40GB. To see this look at org.apache.hadoop.io.Text.setCapacity().

But then in the hadoop.conf file we can put a stricter limit like 100MB and let users know the change, and monitor if this still causes any GC pressure. @dain @findepi @nezihyigitbasi @electrum What do you think about this?

@findepi (Contributor) commented Aug 31, 2018

the phrase "wrong (truncated) results" is dependent on your definition of what correct results are. If we define correctness as what Hive/Hadoop will produce ...

@dain indeed, I didn't look at it that way.

Probably the "functionally correct way" is to set this default limit as the array size VM limit, which is 2,147,483,645 on our system [..]

@yingsu00 that was my thinking as well. However, as you observed, this doesn't really prevent the OOM from happening, so it might not solve the real problem.

What about setting this to some large value, like 100MB, and sending a patch to Hadoop changing the default upstream as well?
Or what about changing LineRecordReader to throw on too-long lines and sending a patch upstream?
(Or both.)

@yingsu00 (Contributor Author)

@findepi Did you mean changing the truncation limit to 100MB in Hadoop as well? Anyway, I think we can proceed with this PR on the Presto side first. We can send a patch to Hadoop with both options later.

Now, do we have agreement on setting the default truncation limit to 100MB in the Presto code? I have asked the Hive team and they don't have such a limit, which means the limit is 2GB for them. But they don't really care, because each job uses a separate JVM. So we have two options here:

  1. Set the default limit to 100MB in the code, as this PR currently does.
  2. Set the default limit to 2GB-2, and set hive.text.line-truncation-limit to 100MB in the Hive configuration properties file.

@dain @findepi Could you please let me know your opinion on this? Thanks!

@nezihyigitbasi (Contributor)

@dain

I agree, but the problem is that the Text reader code comes from Hadoop and doesn't have a "fail" mode, so we are trying to figure out the best alternative to writing our own text reader

I understand all that. If we set a limit of, say, ~100MB, can we detect in our record readers/cursor, after we read a line, that the line we just read is ~100MB, and then fail the query?

@dain (Contributor) commented Sep 1, 2018

@nezihyigitbasi

I understand all that. If we set a limit of, say, ~100MB, can we detect in our record readers/cursor, after we read a line, that the line we just read is ~100MB, and then fail the query?

I'm pretty sure we get the final decoded data back rather than the raw text line.

@findepi (Contributor) commented Sep 1, 2018

Did you mean changing the truncation limit to 100MB in Hadoop as well? Anyway, I think we can proceed with this PR on the Presto side first. We can send a patch to Hadoop with both options later.

@yingsu00 we're considering overriding Hadoop's default behavior with something safer, at the cost of a discrepancy. Promoting this change upstream removes the discrepancy.
And yes, PR on the Presto side first.

As for the default value: it's apparent there isn't a silver bullet here until we update LineRecordReader to introduce an option to fail on too-long lines.
As I understand it, you had users running queries over files with very long lines (or maybe files that weren't in the expected format), leading to node crashes via OOM. I still need to understand what your workflow will be once you introduce the limit. The queries will stop crashing the nodes via OOM, but how will you identify those bad queries / bad files?

@yingsu00 (Contributor Author) commented Sep 4, 2018

@findepi Hi Piotr, until LineRecordReader introduces the fail mode, we won't be able to tell which files or queries are "bad". These queries will now succeed.

@findepi (Contributor) commented Sep 4, 2018

@yingsu00 yes, I know. I suspect that LineRecordReader won't support the fail mode, unless we make it do that.

@yingsu00 (Contributor Author) commented Sep 5, 2018

We had a discussion with @electrum and we will create our own copy of the LineReader. This will be done in two separate PRs on the presto-hadoop-apache2 repo.

@yingsu00 (Contributor Author) commented Sep 7, 2018

Hi all, I have updated PR prestodb/presto-hadoop-apache2#30 so that it fails the query (throws an IOException when the line is too long).

@dain assigned electrum and unassigned dain on Sep 10, 2018
@yingsu00 (Contributor Author)

If the line is over the limit, the following error will be thrown:
java.lang.RuntimeException: Too many bytes before newline: n
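
For reference, the loud failure itself lives in the patched line reader from prestodb/presto-hadoop-apache2#30. Conceptually it amounts to something like the following sketch; this is not the actual patched code, and the class and method names here are made up.

import java.io.IOException;

// Conceptual sketch: while scanning for the next newline, count the bytes
// consumed so far and fail loudly once the configured limit is exceeded,
// instead of growing the line buffer toward the VM array size limit.
public final class LineLimitChecker
{
    private LineLimitChecker() {}

    public static void checkLineLength(long bytesConsumed, long maxLineLength)
            throws IOException
    {
        if (bytesConsumed > maxLineLength) {
            throw new IOException("Too many bytes before newline: " + bytesConsumed);
        }
    }
}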

@@ -367,7 +367,7 @@
<dependency>
<groupId>com.facebook.presto.hadoop</groupId>
<artifactId>hadoop-apache2</artifactId>
<version>2.7.4-3</version>
<version>2.7.4-4-SNAPSHOT</version>
Contributor

Update to 2.7.4-4

@electrum (Contributor) left a comment

Update to the release version of hadoop-apache2 before merging

@yingsu00 merged commit 42ec5ca into prestodb:master on Sep 14, 2018