Conversation

@koertkuipers (Contributor) commented Jun 16, 2015

What changes were proposed in this pull request?

Consistently expose the Hadoop Configuration/JobConf for all methods that use Hadoop input/output formats. This facilitates re-use and avoids adding many extra parameters that end up just modifying the Configuration/JobConf internally anyway.
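For context, a minimal sketch of the usage pattern this enables, runnable in spark-shell (the split-size value and path are illustrative only; `newAPIHadoopFile` already accepts a `Configuration`, and the PR applies the same pattern to the remaining Hadoop-format methods):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("example").setMaster("local[*]"))

// Build a per-call Configuration instead of mutating the shared
// sc.hadoopConfiguration, which is global to every job on the context.
val conf = new Configuration(sc.hadoopConfiguration)
conf.set("mapreduce.input.fileinputformat.split.maxsize", "134217728")

// The trailing Configuration parameter carries the per-call settings
// down into the resulting RDD.
val rdd = sc.newAPIHadoopFile(
  "hdfs://namenode/path/to/input",
  classOf[TextInputFormat],
  classOf[LongWritable],
  classOf[Text],
  conf)
```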

How was this patch tested?

New tests in SparkContextSuite that check that the resulting HadoopRDD/NewHadoopRDD indeed has the settings passed in using the Configuration/JobConf parameter.
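A sketch of what such a test can look like (the test name, custom key, and path are illustrative, not copied from the actual suite; assumes a ScalaTest suite with a running SparkContext `sc`):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.spark.rdd.NewHadoopRDD

test("Configuration passed to newAPIHadoopFile reaches the NewHadoopRDD") {
  val conf = new Configuration(sc.hadoopConfiguration)
  conf.set("test.custom.key", "custom-value")
  val rdd = sc.newAPIHadoopFile(
    "/tmp/some-input",
    classOf[TextInputFormat],
    classOf[LongWritable],
    classOf[Text],
    conf)
  // The per-call setting must be visible on the resulting RDD's conf,
  // not just on the context-wide default.
  assert(rdd.asInstanceOf[NewHadoopRDD[_, _]].getConf.get("test.custom.key") === "custom-value")
}
```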

@koertkuipers changed the title from "SPARK 8398 hadoop input/output format advanced control" to "SPARK-8398 hadoop input/output format advanced control" on Jun 16, 2015
@squito (Contributor) commented Jun 17, 2015

Jenkins, this is OK to test

@squito (Contributor) commented Jun 17, 2015

I think we'll also want to add these to JavaSparkContext and the python api (though I am not entirely certain how to do that off the top of my head).
It would also be nice to have some really simple regression tests that those confs actually get used.

@koertkuipers (Contributor, Author)

ok i will look into JavaSparkContext and a few simple regression tests.
will probably need some help with python.


Review comment on the diff (Contributor):

remove space around the {}

@andrewor14 (Contributor)

add to whitelist

@andrewor14 (Contributor)

The changes here look fine. @JoshRosen do we have to worry about breaking binary compatibility in some way here? Even though we provide a default value for the last parameter, we're technically adding a new parameter to several public APIs here.

@JoshRosen (Contributor)

@andrewor14 Adding a new parameter with a default value will break binary compatibility from a Java point-of-view, as far as I know.
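A toy illustration of why (simplified signatures, not Spark's actual ones): scalac compiles a default argument into the full-arity method plus a synthetic default accessor, and the old narrow descriptor disappears from the bytecode.

```scala
import org.apache.hadoop.conf.Configuration

object Before {
  // Compiles to exactly one JVM method: textFile(String): List[String]
  def textFile(path: String): List[String] = List(path)
}

object After {
  // Compiles to textFile(String, Configuration) plus a synthetic
  // textFile$default$2(): Configuration accessor. The one-argument
  // descriptor no longer exists, so code compiled against Before fails
  // with NoSuchMethodError at runtime, even though recompiling it from
  // source still succeeds.
  def textFile(path: String, conf: Configuration = new Configuration()): List[String] =
    List(path, conf.get("fs.defaultFS", "file:///"))
}
```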

@JoshRosen (Contributor)

MiMa should tell us, though.

@SparkQA commented Jun 18, 2015

Test build #35167 has finished for PR 6848 at commit 135b96e.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jun 19, 2015

Test build #35275 has finished for PR 6848 at commit 425a578.

  • This patch fails Scala style tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@SparkQA commented Jun 19, 2015

Test build #35293 has finished for PR 6848 at commit 2bfa320.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jun 19, 2015

Test build #35323 has finished for PR 6848 at commit 2122160.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jun 19, 2015

Test build #35330 has finished for PR 6848 at commit df2c2ae.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@koertkuipers (Contributor, Author)

i see MiMa failed. i will try to produce a version that is binary compatible.
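For reference, the usual binary-compatible alternative is an explicit overload that preserves the old descriptor, rather than a default argument (a sketch with a toy signature, not the actual change in this PR):

```scala
import org.apache.hadoop.conf.Configuration

object Overloads {
  // The original one-argument method keeps its JVM descriptor intact...
  def textFile(path: String): List[String] =
    textFile(path, new Configuration())

  // ...and the new functionality lives in an explicit two-argument overload.
  def textFile(path: String, conf: Configuration): List[String] =
    List(path, conf.get("fs.defaultFS", "file:///"))
}
```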

@SparkQA commented Jun 20, 2015

Test build #35369 has finished for PR 6848 at commit e2f7023.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • * converters, but then we couldn't have an object for every subclass of Writable (you can't

@andrewor14 (Contributor)

retest this please. MiMa tests have been a little flaky recently.

@SparkQA commented Jun 22, 2015

Test build #35471 has finished for PR 6848 at commit e2f7023.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • * converters, but then we couldn't have an object for every subclass of Writable (you can't

@SparkQA commented Oct 2, 2015

Test build #43170 has finished for PR 6848 at commit 7ca662c.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • * converters, but then we couldn't have an object for every subclass of Writable (you can't

@SparkQA commented Oct 18, 2015

Test build #43898 has finished for PR 6848 at commit 470b3d9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • * converters, but then we couldn't have an object for every subclass of Writable (you can't

@SparkQA commented Nov 11, 2015

Test build #45653 has finished for PR 6848 at commit 208c019.

  • This patch fails RAT tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • * converters, but then we couldn't have an object for every subclass of Writable (you can't

@holdenk (Contributor) commented Apr 19, 2016

Would we want to maybe consider this for Spark 2.0? It seems like if we're going to be adding new default params to functions this might be the time to do it (of course only if people have the bandwidth to update & also review)?

It also seems like some unrelated R changes might have accidentally gotten mixed in during one of the merges that should be reverted if we want to move forward with this.

@koertkuipers (Contributor, Author)

i am happy to update this if there is any interest; otherwise i will close it.


@ScrapCodes (Member)

IMO, this is useful in that the hadoop configuration need not be global state. We can have a default set of configuration that is used everywhere by default, and then in every hadoop-related method the user has a way to override that default.

Binary compatibility will definitely be broken, but source compatibility should not be affected, i.e. one would just need to recompile the project against the newer spark version. As was asked already, this should be okay for 2.0?

@andrewor14 ping !
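A minimal sketch of that default-plus-override shape described above (hypothetical class and method, not Spark code):

```scala
import org.apache.hadoop.conf.Configuration

// One shared default Configuration, overridable per call.
class HadoopIO(defaultConf: Configuration) {
  def read(path: String, conf: Option[Configuration] = None): Seq[String] = {
    val effective = conf.getOrElse(defaultConf) // a per-call override wins
    Seq(path, effective.get("fs.defaultFS", "file:///"))
  }
}
```

Callers that pass nothing get the shared default; callers that need different Hadoop settings pass their own `Configuration` without touching any global state.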

@SparkQA commented Apr 20, 2016

Test build #56367 has finished for PR 6848 at commit 0ea4e5c.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Apr 20, 2016

Test build #56370 has finished for PR 6848 at commit 60c34e1.

  • This patch fails Spark unit tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@koertkuipers (Contributor, Author)

Jenkins, retest this please.


@SparkQA commented Apr 20, 2016

Test build #56392 has finished for PR 6848 at commit c06548d.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class Timer(val iteration: Int)
    • case class Case(name: String, fn: Timer => Unit)
    • class ContinuousQuery(object):
    • class Trigger(object):
    • class ProcessingTime(Trigger):
    • trait ScalaReflection
    • case class FilePartition(index: Int, files: Seq[PartitionedFile]) extends RDDPartition
    • abstract class OutputWriterFactory extends Serializable
    • abstract class OutputWriter
    • case class HadoopFsRelation(
    • trait FileFormat
    • case class Partition(values: InternalRow, files: Seq[FileStatus])
    • trait FileCatalog
    • class HDFSFileCatalog(
    • case class FakeFileStatus(

@koertkuipers (Contributor, Author)

ok i updated this for spark 2. the unit test failures seem unrelated

@ScrapCodes (Member)

Jenkins, retest this please.

@SparkQA commented Apr 21, 2016

Test build #56527 has finished for PR 6848 at commit c06548d.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class Timer(val iteration: Int)
    • case class Case(name: String, fn: Timer => Unit)
    • class ContinuousQuery(object):
    • class Trigger(object):
    • class ProcessingTime(Trigger):
    • trait ScalaReflection
    • case class FilePartition(index: Int, files: Seq[PartitionedFile]) extends RDDPartition
    • abstract class OutputWriterFactory extends Serializable
    • abstract class OutputWriter
    • case class HadoopFsRelation(
    • trait FileFormat
    • case class Partition(values: InternalRow, files: Seq[FileStatus])
    • trait FileCatalog
    • class HDFSFileCatalog(
    • case class FakeFileStatus(

@SparkQA commented Apr 22, 2016

Test build #56599 has finished for PR 6848 at commit 34f97d4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@holdenk (Contributor) commented Apr 22, 2016

@koertkuipers nowadays we try to provide a description for our pull requests (sometimes it can be copied from the JIRA) for the eventual commit message - it might be good to add that?
Also CC @JoshRosen & @squito to maybe take a look at the updated PR :)

@koertkuipers changed the title from "SPARK-8398 hadoop input/output format advanced control" to "[SPARK-8398][CORE] Hadoop input/output format advanced control" on Apr 22, 2016
@koertkuipers (Contributor, Author) commented Apr 23, 2016

@holdenk ok i tried to bring it up to the latest standards for pullreqs

@SparkQA commented May 11, 2016

Test build #58363 has finished for PR 6848 at commit 34f97d4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jun 10, 2016

Test build #60306 has finished for PR 6848 at commit 34f97d4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rxin (Contributor) commented Dec 7, 2016

I'm going to close this for now.

@asfgit closed this in 08d6441 on Dec 7, 2016
zifeif2 pushed a commit to zifeif2/spark that referenced this pull request Nov 22, 2025