@@ -28,7 +28,7 @@ import org.scalatest.FunSuite

 import org.apache.hadoop.io.Text
 
-import org.apache.spark.SparkContext
+import org.apache.spark.{SparkConf, SparkContext}
 import org.apache.spark.util.Utils
 import org.apache.hadoop.io.compress.{DefaultCodec, CompressionCodecFactory, GzipCodec}

@@ -42,7 +42,15 @@ class WholeTextFileRecordReaderSuite extends FunSuite with BeforeAndAfterAll {
   private var factory: CompressionCodecFactory = _
 
   override def beforeAll() {
-    sc = new SparkContext("local", "test")
+    // Hadoop's FileSystem caching does not use the Configuration as part of its cache key, which
+    // can cause FileSystem.get(Configuration) to return a cached instance created with a different
+    // configuration than the one passed to get() (see HADOOP-8490 for more details). This caused
+    // hard-to-reproduce test failures, since any suites that were run after this one would inherit
+    // the new value of "fs.local.block.size" (see SPARK-5227 and SPARK-5679). To work around this,
+    // we disable FileSystem caching in this suite.
+    val conf = new SparkConf().set("spark.hadoop.fs.file.impl.disable.cache", "true")

So, do you think we should disable it across all tests, just in case there are other tests that also modify the Hadoop configuration thinking that the config objects are local to them? It might bite someone else in the butt later if we don't do this globally. I don't think there is a global test class that every test inherits; maybe we can add it in SharedSparkContext, since a lot of the newly written tests use that trait?
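
If we did want one shared place for it, here's a rough sketch of what that could look like in the trait (the trait's exact shape is assumed here, not copied from the actual sources):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.scalatest.{BeforeAndAfterAll, Suite}

// Hypothetical sketch: bake the workaround into the shared test trait so
// every suite that mixes it in gets FileSystem caching disabled.
trait SharedSparkContext extends BeforeAndAfterAll { self: Suite =>
  @transient private var _sc: SparkContext = _
  def sc: SparkContext = _sc

  override def beforeAll() {
    val conf = new SparkConf().set("spark.hadoop.fs.file.impl.disable.cache", "true")
    _sc = new SparkContext("local", "test", conf)
    super.beforeAll()
  }

  override def afterAll() {
    _sc.stop()
    _sc = null
    super.afterAll()
  }
}
```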

Contributor Author

Good question. If we wanted to disable this in all tests, then I think the right place to do that would be in the Maven and SBT builds via system properties.

I chose not to do that here because I wasn't sure whether doing so might mask bugs: most users of Spark run with FileSystem caching enabled, and I think that disabling it across the board may harm performance, since a lot of Hadoop code seems to assume that FileSystem.get is cheap and, accordingly, calls it many times.
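
If we did go that route, a minimal sketch of the SBT side might look like this (the Maven build would need the equivalent via the surefire/scalatest argLine); it leans on the fact that new SparkConf() loads "spark."-prefixed system properties by default:

```scala
// Hypothetical build.sbt fragment: pass the flag to every forked test JVM,
// so each suite's SparkContext picks it up without any code changes.
fork in Test := true
javaOptions in Test += "-Dspark.hadoop.fs.file.impl.disable.cache=true"
```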

Contributor

My opinion is that doing it for all tests would be a bit drastic. It's not that modifications to Configuration objects arbitrarily affect other Configuration objects. It's that this test specifically relies on some underlying properties set on the FileSystem, and the FS Cache allows these to leak to other FS instances.
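
A standalone sketch of the leak in question (not from the patch; uses the local file:// scheme):

```scala
import java.net.URI

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.FileSystem

object FsCacheLeakDemo {
  def main(args: Array[String]): Unit = {
    // One caller tweaks a property and asks for a FileSystem for file://.
    val conf1 = new Configuration()
    conf1.setLong("fs.local.block.size", 32)
    val fs1 = FileSystem.get(new URI("file:///"), conf1)

    // A later caller uses a fresh, default Configuration...
    val conf2 = new Configuration()
    val fs2 = FileSystem.get(new URI("file:///"), conf2)

    // ...but the cache key ignores the Configuration (HADOOP-8490), so it
    // gets back the instance created with conf1, block-size override and all.
    println(fs1 eq fs2)                                     // true
    println(fs2.getConf.getLong("fs.local.block.size", -1)) // 32, not the default
  }
}
```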


+    sc = new SparkContext("local", "test", conf)
 
     // Set the block size of local file system to test whether files are split right or not.
     sc.hadoopConfiguration.setLong("fs.local.block.size", 32)
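
For reference, the "spark.hadoop.*" prefix convention the fix relies on can be checked in isolation; a small standalone sketch:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object HadoopConfPropagationCheck {
  def main(args: Array[String]): Unit = {
    // Spark copies "spark.hadoop.*" entries from the SparkConf into the
    // Hadoop Configuration with the prefix stripped, so the flag becomes
    // "fs.file.impl.disable.cache" and FileSystem.get() stops returning
    // cached instances for the file:// scheme.
    val conf = new SparkConf().set("spark.hadoop.fs.file.impl.disable.cache", "true")
    val sc = new SparkContext("local", "test", conf)
    assert(sc.hadoopConfiguration.get("fs.file.impl.disable.cache") == "true")
    sc.stop()
  }
}
```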