[SPARK-4361][Doc] Add more docs for Hadoop Configuration #3225

zsxwing · 2014-11-12T09:53:34Z

I'm trying to point out that reusing a Configuration in these APIs is dangerous. Any better ideas?

SparkQA · 2014-11-12T10:00:05Z

Test build #23263 has started for PR 3225 at commit fe4e3d5.

This patch merges cleanly.

SparkQA · 2014-11-12T11:26:42Z

Test build #23263 has finished for PR 3225 at commit fe4e3d5.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2014-11-12T11:26:46Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23263/

rxin · 2014-11-21T08:25:44Z

core/src/main/scala/org/apache/spark/SparkContext.scala

@@ -630,7 +634,10 @@ class SparkContext(config: SparkConf) extends SparkStatusAPI with Logging {
   * necessary info (e.g. file name for a filesystem-based dataset, table name for HyperTable),
   * using the older MapReduce API (`org.apache.hadoop.mapred`).
   *
-   * @param conf JobConf for setting up the dataset
+   * @param conf JobConf for setting up the dataset. Note: This will be put into a Broadcast.


I don't think we reuse the conf across different RDDs, do we?

zsxwing · 2014-11-21

People may call this method directly and pass their own Configuration.

def newAPIHadoopFile[K, V, F <: NewInputFormat[K, V]](
    path: String,
    fClass: Class[F],
    kClass: Class[K],
    vClass: Class[V],
    conf: Configuration = hadoopConfiguration)

For example, creating a configuration for accessing HBase:

import java.io.{DataOutputStream, ByteArrayOutputStream}
import java.lang.String
import org.apache.hadoop.hbase.client.Scan
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.hadoop.hbase.util.Base64

def convertScanToString(scan: Scan): String = {
  val out: ByteArrayOutputStream = new ByteArrayOutputStream
  val dos: DataOutputStream = new DataOutputStream(out)
  scan.write(dos)
  Base64.encodeBytes(out.toByteArray)
}

val conf = HBaseConfiguration.create()
val scan = new Scan()
scan.setCaching(500)
scan.setCacheBlocks(false)
conf.set(TableInputFormat.INPUT_TABLE, "table_name")
conf.set(TableInputFormat.SCAN, convertScanToString(scan))
val rdd = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat], classOf[ImmutableBytesWritable], classOf[Result])
rdd.count()

This is fine. However, some people may need to access two tables and union them. They may reuse the Configuration like this:

val conf = HBaseConfiguration.create()
val scan = new Scan()
scan.setCaching(500)
scan.setCacheBlocks(false)
conf.set(TableInputFormat.INPUT_TABLE, "table_name")
conf.set(TableInputFormat.SCAN, convertScanToString(scan))
val rdd = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat], classOf[ImmutableBytesWritable], classOf[Result])

conf.set(TableInputFormat.INPUT_TABLE, "another_table_name")
val rdd2 = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat], classOf[ImmutableBytesWritable], classOf[Result])

rdd.union(rdd2).count()

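The result will be wrong: both RDDs hold a reference to the same mutable Configuration and are evaluated lazily, so the second conf.set can also affect rdd.
I think the docs should tell people not to reuse a Configuration like this.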

My motivation is this mail thread: http://apache-spark-user-list.1001560.n3.nabble.com/How-did-the-RDD-union-work-td18686.html
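
The reuse problem described above can be reproduced without Spark or HBase. The following is a minimal sketch, not Spark's actual code path (which also involves broadcasting the conf): a `mutable.Map` stands in for the Hadoop `Configuration`, and the hypothetical `LazyTableScan` class and `input.table` key stand in for an RDD that only reads its table name when an action runs.

```scala
import scala.collection.mutable

// Hypothetical stand-in for an RDD: it keeps a reference to the conf and
// only reads the table name when run() is called (lazy evaluation).
final class LazyTableScan(conf: mutable.Map[String, String]) {
  def run(): String = conf("input.table")
}

object SharedConfPitfall {
  def demo(): Seq[String] = {
    // Stand-in for a shared, mutable Hadoop Configuration.
    val conf = mutable.Map("input.table" -> "table_name")
    val scan1 = new LazyTableScan(conf)

    // Mutating the shared conf also changes what scan1 will read later.
    conf("input.table") = "another_table_name"
    val scan2 = new LazyTableScan(conf)

    // Nothing has been read until this point.
    Seq(scan1.run(), scan2.run())
  }
}
```

In this sketch both scans report `another_table_name`, because each holds a reference to the same mutable object and nothing is read until `run()` is called.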

JoshRosen · 2014-12-22

In fact, there are many uses of sc.hadoopConfiguration in the wild which assume that it's shared:

https://github.com/search?q=sc.hadoopConfiguration&type=Code&utf8=%E2%9C%93

Most of those are using it to configure S3 credentials.

andrewor14 · 2015-02-06

Nice find. It seems perfectly reasonable from the user's perspective to just save sc.hadoopConfiguration into a val and use it for many things. That's probably what I would have done if I didn't know about the nuances here.

andrewor14 · 2014-12-22T22:28:11Z

@JoshRosen

JoshRosen · 2014-12-22T22:40:33Z

I agree that the current mutable nature of sc.hadoopConfiguration is confusing and this seems like it's worth documenting. It would be nicer if we didn't have this messy mutable configuration, though. I think that the combination of a mutable conf + lazy evaluation is what makes this confusing, since @zsxwing's example of reading from two tables would work correctly under eager evaluation:

conf.set(TableInputFormat.INPUT_TABLE, "table_name")
conf.set(TableInputFormat.SCAN, convertScanToString(scan))
val rdd = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat], classOf[ImmutableBytesWritable], classOf[Result])

conf.set(TableInputFormat.INPUT_TABLE, "another_table_name")
val rdd2 = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat], classOf[ImmutableBytesWritable], classOf[Result])

I suppose one approach would be to have the configuration stay mutable but to make a defensive copy of it when constructing RDDs that accept configurations. This would break programs that were relying on being able to mutate credentials after having defined a bunch of RDDs (e.g. define some RDDs, fail due to missing S3 credentials, supply new credentials, and re-run), but I think it makes things easier to reason about.

If we're not going to introduce any change in behavior, though, then I think we should document the current behavior more explicitly, as this patch has done.
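
The defensive-copy idea and its trade-off can be illustrated with a small sketch. This is not a proposed Spark implementation: a `mutable.Map` again stands in for the Hadoop `Configuration`, and `SnapshotTableScan`, `input.table`, and `s3.secret` are hypothetical names chosen for illustration.

```scala
import scala.collection.mutable

// Hypothetical stand-in for an RDD that copies the conf at construction time.
final class SnapshotTableScan(conf: mutable.Map[String, String]) {
  // toMap builds an immutable copy, so later changes to `conf` are not seen.
  private val frozen: Map[String, String] = conf.toMap

  def run(): String = frozen("input.table")
  def credentials: Option[String] = frozen.get("s3.secret")
}

object DefensiveCopyDemo {
  def demo(): (String, String, Option[String]) = {
    val conf = mutable.Map("input.table" -> "table_name")
    val scan1 = new SnapshotTableScan(conf)

    conf("input.table") = "another_table_name"
    val scan2 = new SnapshotTableScan(conf)

    // Credentials supplied after the scans were defined never reach them.
    conf("s3.secret") = "hypothetical-secret"

    (scan1.run(), scan2.run(), scan1.credentials)
  }
}
```

Here each scan reads the table it was defined with, but `scan1.credentials` is `None`: a value set after construction is invisible to the copy, which is the kind of breakage mentioned above for programs that supply credentials after defining their RDDs.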

srowen · 2015-02-06T11:46:17Z

Is the outcome here that the doc changes are OK to merge? Documenting current behavior seems good.

andrewor14 · 2015-02-06T19:20:44Z

core/src/main/scala/org/apache/spark/SparkContext.scala

@@ -242,7 +242,11 @@ class SparkContext(config: SparkConf) extends SparkStatusAPI with Logging {
  // the bound port to the cluster manager properly
  ui.foreach(_.bind())

-  /** A default Hadoop Configuration for the Hadoop code (e.g. file systems) that we reuse. */
+  /** A default Hadoop Configuration for the Hadoop code (e.g. file systems) that we reuse.


Really small nit, but this should be Javadoc style instead of Scaladoc.

andrewor14 · 2015-02-06T19:24:12Z

I think it's safe to say that we won't implement the alternative behavior that @JoshRosen suggested by the release. For this reason I think we should at least document this unexpected behavior for 1.3, and delay the fix until later. I'm going to merge this into master and 1.3.

I'm trying to point out reusing a Configuration in these APIs is dangerous. Any better idea?

Author: zsxwing <zsxwing@gmail.com>

Closes #3225 from zsxwing/SPARK-4361 and squashes the following commits:

fe4e3d5 [zsxwing] Add more docs for Hadoop Configuration

(cherry picked from commit af2a2a2)
Signed-off-by: Andrew Or <andrew@databricks.com>

Add more docs for Hadoop Configuration

fe4e3d5

rxin reviewed Nov 21, 2014

andrewor14 reviewed Feb 6, 2015

asfgit closed this in af2a2a2 Feb 6, 2015

zsxwing deleted the SPARK-4361 branch February 25, 2015 04:20
