
[SPARK-4361][Doc] Add more docs for Hadoop Configuration #3225

Closed
wants to merge 1 commit into apache:master from zsxwing:SPARK-4361

Conversation

@zsxwing (Member) commented Nov 12, 2014

I'm trying to point out reusing a Configuration in these APIs is dangerous. Any better idea?

@SparkQA commented Nov 12, 2014

Test build #23263 has started for PR 3225 at commit fe4e3d5.

  • This patch merges cleanly.

@SparkQA commented Nov 12, 2014

Test build #23263 has finished for PR 3225 at commit fe4e3d5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins commented:

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23263/

@@ -630,7 +634,10 @@ class SparkContext(config: SparkConf) extends SparkStatusAPI with Logging {
   * necessary info (e.g. file name for a filesystem-based dataset, table name for HyperTable),
   * using the older MapReduce API (`org.apache.hadoop.mapred`).
   *
-  * @param conf JobConf for setting up the dataset
+  * @param conf JobConf for setting up the dataset. Note: This will be put into a Broadcast.
Contributor

I don't think we reuse the conf across different RDDs, do we?

Member Author (zsxwing)

People may call this method directly and pass their Configuration.

   def newAPIHadoopFile[K, V, F <: NewInputFormat[K, V]](
       path: String,
       fClass: Class[F],
       kClass: Class[K],
       vClass: Class[V],
       conf: Configuration = hadoopConfiguration)

E.g., creating a configuration for accessing hbase:

import java.io.{DataOutputStream, ByteArrayOutputStream}
import java.lang.String
import org.apache.hadoop.hbase.client.Scan
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.hadoop.hbase.util.Base64

def convertScanToString(scan: Scan): String = {
  val out: ByteArrayOutputStream = new ByteArrayOutputStream
  val dos: DataOutputStream = new DataOutputStream(out)
  scan.write(dos)
  Base64.encodeBytes(out.toByteArray)
}

val conf = HBaseConfiguration.create()
val scan = new Scan()
scan.setCaching(500)
scan.setCacheBlocks(false)
conf.set(TableInputFormat.INPUT_TABLE, "table_name")
conf.set(TableInputFormat.SCAN, convertScanToString(scan))
val rdd = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat], classOf[ImmutableBytesWritable], classOf[Result])
rdd.count()

This is fine. However, some people may need to access two tables and union them. They may reuse the Configuration like this:

val conf = HBaseConfiguration.create()
val scan = new Scan()
scan.setCaching(500)
scan.setCacheBlocks(false)
conf.set(TableInputFormat.INPUT_TABLE, "table_name")
conf.set(TableInputFormat.SCAN, convertScanToString(scan))
val rdd = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat], classOf[ImmutableBytesWritable], classOf[Result])

conf.set(TableInputFormat.INPUT_TABLE, "another_table_name")
val rdd2 = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat], classOf[ImmutableBytesWritable], classOf[Result])

rdd.union(rdd2).count()

The result will be wrong: both RDDs share the same Configuration object, and because the data is not read until an action runs, both end up scanning the table that was set last.
I think the docs should tell people not to reuse a Configuration like this.

My motivation is this mail thread: http://apache-spark-user-list.1001560.n3.nabble.com/How-did-the-RDD-union-work-td18686.html
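
As a hedged sketch (not part of this patch), the safe pattern would be to build a separate Configuration per table instead of mutating a shared one; the helper below just reuses the names from the example above:

// Build an independent Configuration per table so that later mutations cannot
// affect RDDs that were already defined.
def tableConf(tableName: String, scan: Scan): org.apache.hadoop.conf.Configuration = {
  val conf = HBaseConfiguration.create()
  conf.set(TableInputFormat.INPUT_TABLE, tableName)
  conf.set(TableInputFormat.SCAN, convertScanToString(scan))
  conf
}

val rdd = sc.newAPIHadoopRDD(tableConf("table_name", scan),
  classOf[TableInputFormat], classOf[ImmutableBytesWritable], classOf[Result])
val rdd2 = sc.newAPIHadoopRDD(tableConf("another_table_name", scan),
  classOf[TableInputFormat], classOf[ImmutableBytesWritable], classOf[Result])

rdd.union(rdd2).count() // each RDD now reads its own table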

Contributor

In fact, there are many uses of sc.hadoopConfiguration in the wild which assume that it's shared:

https://github.com/search?q=sc.hadoopConfiguration&type=Code&utf8=%E2%9C%93

Most of those are using it to configure S3 credentials.
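
For illustration only, that shared-credentials pattern typically looks like the snippet below; the property names are the standard Hadoop s3n keys of that era, and the bucket/path and credential values are placeholders:

// Set S3 credentials once on the shared SparkContext configuration, then read
// from S3 anywhere in the application.
sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "<access-key>")     // placeholder
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "<secret-key>") // placeholder
val logs = sc.textFile("s3n://some-bucket/path/to/logs")                // hypothetical path
logs.count()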

Contributor

Nice find. It seems perfectly reasonable from the user's perspective to just save sc.hadoopConfiguration into a val and use it for many things. That's probably what I would have done if I didn't know about the nuances here.

@andrewor14 (Contributor) commented:

@JoshRosen

@JoshRosen (Contributor) commented:
I agree that the current mutable nature of sc.hadoopConfiguration is confusing and this seems like it's worth documenting. It would be nicer if we didn't have this messy mutable configuration, though. I think that the combination of a mutable conf + lazy evaluation is what makes this confusing, since @zsxwing's example of reading from two tables would work correctly under eager evaluation:

conf.set(TableInputFormat.INPUT_TABLE, "table_name")
conf.set(TableInputFormat.SCAN, convertScanToString(scan))
val rdd = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat], classOf[ImmutableBytesWritable], classOf[Result])

conf.set(TableInputFormat.INPUT_TABLE, "another_table_name")
val rdd2 = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat], classOf[ImmutableBytesWritable], classOf[Result])

I suppose one approach would be to have the configuration stay mutable but to make a defensive copy of it when constructing RDDs that accept configurations. This would break programs that were relying on being able to mutate credentials after having defined a bunch of RDDs (e.g. define some RDDs, fail due to missing S3 credentials, supply new credentials, and re-run), but I think it makes things easier to reason about.
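
A minimal sketch of that defensive-copy idea (an assumption about how it could look, not what Spark currently does); Hadoop's Configuration has a copy constructor suited to this:

import org.apache.hadoop.conf.Configuration

// Snapshot the caller's conf at RDD-construction time so that later mutations
// of `conf` cannot change what this RDD reads.
val confSnapshot = new Configuration(conf)
val rdd = sc.newAPIHadoopRDD(confSnapshot, classOf[TableInputFormat],
  classOf[ImmutableBytesWritable], classOf[Result])

conf.set(TableInputFormat.INPUT_TABLE, "another_table_name") // no longer affects rdd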

If we're not going to introduce any change in behavior, though, then I think we should document the current behavior more explicitly, as this patch has done.

@srowen (Member) commented Feb 6, 2015

Is the outcome here that the doc changes are OK to merge? Documenting current behavior seems good.

@@ -242,7 +242,11 @@ class SparkContext(config: SparkConf) extends SparkStatusAPI with Logging {
// the bound port to the cluster manager properly
ui.foreach(_.bind())

-  /** A default Hadoop Configuration for the Hadoop code (e.g. file systems) that we reuse. */
+  /** A default Hadoop Configuration for the Hadoop code (e.g. file systems) that we reuse.
Contributor

Really small nit, but this should be javadoc style instead of scaladoc.
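
For context, assuming the nit refers to the usual comment-alignment difference, the two layouts look like this (illustration only):

/** Scaladoc style: continuation asterisks are indented two spaces
  * under the opening delimiter.
  */

/**
 * Javadoc style: continuation asterisks are indented one space,
 * which is the style being asked for here.
 */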

@andrewor14 (Contributor) commented:

I think it's safe to say that we won't implement the alternative behavior that @JoshRosen suggested by the 1.3 release. For this reason, I think we should at least document this unexpected behavior for 1.3 and delay the fix until later. I'm going to merge this into master and 1.3.

@asfgit closed this in af2a2a2 on Feb 6, 2015
asfgit pushed a commit that referenced this pull request Feb 6, 2015
I'm trying to point out reusing a Configuration in these APIs is dangerous. Any better idea?

Author: zsxwing <zsxwing@gmail.com>

Closes #3225 from zsxwing/SPARK-4361 and squashes the following commits:

fe4e3d5 [zsxwing] Add more docs for Hadoop Configuration

(cherry picked from commit af2a2a2)
Signed-off-by: Andrew Or <andrew@databricks.com>
@zsxwing deleted the SPARK-4361 branch on February 25, 2015 04:20
yaooqinn pushed a commit that referenced this pull request Aug 26, 2024
…42.7.4 and `mssql` to 12.8.1.jre11

### What changes were proposed in this pull request?

This PR aims to upgrade `h2` to 2.3.232, `postgresql` to 42.7.4 and `mssql` to 12.8.1.jre11.

### Why are the changes needed?

1. For `h2`, there are some issues fixed in version 2.3.232 (full release notes: https://www.h2database.com/html/changelog.html):

    - [Issue #3945](h2database/h2database#3945): Column not found in correlated subquery, when referencing outer column from LEFT JOIN .. ON clause
    - [Issue #4097](h2database/h2database#4097): StackOverflowException when using multiple SELECT statements in one query (2.3.230)
    - [Issue #3982](h2database/h2database#3982): Potential issue when using ROUND
    - [Issue #3894](h2database/h2database#3894): Race condition causing stale data in query last result cache
    - [Issue #4075](h2database/h2database#4075): infinite loop in compact
    - [Issue #4091](h2database/h2database#4091): Wrong case with linked table to postgresql
    - [Issue #4088](h2database/h2database#4088): BadGrammarException when the same alias is used within two different CTEs

2. For `postgresql`, there are some issues fixed and improvements in version 42.7.4 (full release notes: https://jdbc.postgresql.org/changelogs/2024-08-22-42.7.4-release/):

    - fix: PgInterval ignores case for represented interval string [PR #3344](pgjdbc/pgjdbc#3344)
    - perf: Avoid extra copies when receiving int4 and int2 in PGStream [PR #3295](pgjdbc/pgjdbc#3295)
    - fix: Add support for Infinity::numeric values in ResultSet.getObject [PR #3304](pgjdbc/pgjdbc#3304)
    - fix: Ensure order of results for getDouble [PR #3301](pgjdbc/pgjdbc#3301)
    - perf: Replace BufferedOutputStream with unsynchronized PgBufferedOutputStream, allow configuring different Java and SO_SNDBUF buffer sizes [PR #3248](pgjdbc/pgjdbc#3248)
    - fix: Fix SSL tests [PR #3260](pgjdbc/pgjdbc#3260)
    - fix: Support bytea in preferQueryMode=simple [PR #3243](pgjdbc/pgjdbc#3243)
    - fix: Fix [Issue #3234](pgjdbc/pgjdbc#3234) - Return -1 as update count for stored procedure calls [PR #3235](pgjdbc/pgjdbc#3235)
    - fix: Fix [Issue #3224](pgjdbc/pgjdbc#3224) - conversion for TIME ‘24:00’ to LocalTime breaks in binary-mode [PR #3225](pgjdbc/pgjdbc#3225)

3. For `mssql`, there are some issues fixed in 12.8.1.jre11 (full release notes: https://github.com/microsoft/mssql-jdbc/releases/tag/v12.8.1):

    - Adjusted DESTINATION_COL_METADATA_LOCK, in SQLServerBulkCopy, so that it is properly released in all cases [PR #2492](microsoft/mssql-jdbc#2492)
    - Reverted "Execute Stored Procedures Directly" feature, as well as subsequent changes related to the feature [PR #2493](microsoft/mssql-jdbc#2493)
    - Changed driver behavior to allow prepared statement objects to be reused, preventing a "multiple queries are not allowed" error [PR #2494](microsoft/mssql-jdbc#2494)

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass GA.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #47810 from wayneguow/ug_h2.

Authored-by: Wei Guo <guow93@gmail.com>
Signed-off-by: Kent Yao <yao@apache.org>
IvanK-db pushed a commit to IvanK-db/spark that referenced this pull request Sep 20, 2024
attilapiros pushed a commit to attilapiros/spark that referenced this pull request Oct 4, 2024
himadripal pushed a commit to himadripal/spark that referenced this pull request Oct 19, 2024