6 changes: 3 additions & 3 deletions dev/deps/spark-deps-hadoop-2.6
@@ -153,9 +153,9 @@ objenesis-2.5.1.jar
okhttp-3.8.1.jar
okio-1.13.0.jar
opencsv-2.3.jar
-orc-core-1.5.2-nohive.jar
-orc-mapreduce-1.5.2-nohive.jar
-orc-shims-1.5.2.jar
+orc-core-1.5.3-nohive.jar
+orc-mapreduce-1.5.3-nohive.jar
+orc-shims-1.5.3.jar
oro-2.0.8.jar
osgi-resource-locator-1.0.1.jar
paranamer-2.8.jar
6 changes: 3 additions & 3 deletions dev/deps/spark-deps-hadoop-2.7
@@ -154,9 +154,9 @@ objenesis-2.5.1.jar
okhttp-3.8.1.jar
okio-1.13.0.jar
opencsv-2.3.jar
-orc-core-1.5.2-nohive.jar
-orc-mapreduce-1.5.2-nohive.jar
-orc-shims-1.5.2.jar
+orc-core-1.5.3-nohive.jar
+orc-mapreduce-1.5.3-nohive.jar
+orc-shims-1.5.3.jar
oro-2.0.8.jar
osgi-resource-locator-1.0.1.jar
paranamer-2.8.jar
6 changes: 3 additions & 3 deletions dev/deps/spark-deps-hadoop-3.1
@@ -172,9 +172,9 @@ okhttp-2.7.5.jar
okhttp-3.8.1.jar
okio-1.13.0.jar
opencsv-2.3.jar
-orc-core-1.5.2-nohive.jar
-orc-mapreduce-1.5.2-nohive.jar
-orc-shims-1.5.2.jar
+orc-core-1.5.3-nohive.jar
+orc-mapreduce-1.5.3-nohive.jar
+orc-shims-1.5.3.jar
oro-2.0.8.jar
osgi-resource-locator-1.0.1.jar
paranamer-2.8.jar
2 changes: 1 addition & 1 deletion pom.xml
@@ -131,7 +131,7 @@
<hive.version.short>1.2.1</hive.version.short>
<derby.version>10.12.1.1</derby.version>
<parquet.version>1.10.0</parquet.version>
-<orc.version>1.5.2</orc.version>
+<orc.version>1.5.3</orc.version>
<orc.classifier>nohive</orc.classifier>
<hive.parquet.version>1.6.0</hive.parquet.version>
<jetty.version>9.3.24.v20180605</jetty.version>
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcSourceSuite.scala
@@ -25,6 +25,7 @@ import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.orc.OrcConf.COMPRESS
import org.apache.orc.OrcFile
+import org.apache.orc.OrcProto.ColumnEncoding.Kind.{DICTIONARY_V2, DIRECT, DIRECT_V2}
import org.apache.orc.OrcProto.Stream.Kind
import org.apache.orc.impl.RecordReaderImpl
import org.scalatest.BeforeAndAfterAll
@@ -115,6 +116,76 @@ abstract class OrcSuite extends OrcTest with BeforeAndAfterAll {
}
}

protected def testSelectiveDictionaryEncoding(isSelective: Boolean): Unit = {
val tableName = "orcTable"

withTempDir { dir =>
withTable(tableName) {
val sqlStatement = orcImp match {
case "native" =>
s"""
|CREATE TABLE $tableName (zipcode STRING, uniqColumn STRING, value DOUBLE)
|USING ORC
|OPTIONS (
| path '${dir.toURI}',
| orc.dictionary.key.threshold '1.0',
| orc.column.encoding.direct 'uniqColumn'
Member: This new feature needs a doc update. We need to let our end users know how to use it.

Member Author: Apache ORC is an independent Apache project which has its own website and documentation. We should respect that. If we introduce new ORC configurations one by one in the Apache Spark documentation, it will eventually duplicate the Apache ORC documentation.

We had better guide ORC users to the Apache ORC website. If something is missing there, they can file an ORC JIRA, not a SPARK JIRA.

Member: I am fine either way. However, our current doc does not explain that we pass data-source-specific options to the underlying data source:

https://spark.apache.org/docs/latest/sql-programming-guide.html#manually-specifying-options

Could you help improve it?

Member: Also, could you give an example?

Member Author: That sounds like a different issue. This PR covers both the TBLPROPERTIES and OPTIONS syntaxes, which were historically designed for this configuration purpose. I mean, this is not a PR about data-source-specific options. Also, the scope of this PR is write-side configurations only.

In any case, +1 for adding an introduction section there with both Parquet and ORC examples. We had better give both read-side and write-side configuration examples, too. Could you file a JIRA issue for that?

Member Author: Maybe dictionary encoding could be a good candidate: parquet.enable.dictionary and orc.dictionary.key.threshold, et al.
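For illustration, a sketch of setting these options programmatically through the DataFrameWriter (not code from this PR; the session setup and output path are assumptions). Writer options are forwarded to the underlying data source, so the same knobs used in the OPTIONS clause of this test can be supplied from code:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[1]").getOrCreate()

spark.range(1000)
  .selectExpr("'94086' AS zipcode", "cast(id AS string) AS uniqColumn", "rand() AS value")
  .write
  .format("orc")
  .option("orc.dictionary.key.threshold", "1.0")      // keep dictionaries even for all-distinct keys
  .option("orc.column.encoding.direct", "uniqColumn") // but force DIRECT encoding for this column
  .save("/tmp/orc-direct-encoding-example")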
|)
""".stripMargin
case "hive" =>
s"""
|CREATE TABLE $tableName (zipcode STRING, uniqColumn STRING, value DOUBLE)
|STORED AS ORC
|LOCATION '${dir.toURI}'
|TBLPROPERTIES (
| orc.dictionary.key.threshold '1.0',
| hive.exec.orc.dictionary.key.size.threshold '1.0',
| orc.column.encoding.direct 'uniqColumn'
|)
""".stripMargin
case impl =>
throw new UnsupportedOperationException(s"Unknown ORC implementation: $impl")
}

sql(sqlStatement)
sql(s"INSERT INTO $tableName VALUES ('94086', 'random-uuid-string', 0.0)")

val partFiles = dir.listFiles()
.filter(f => f.isFile && !f.getName.startsWith(".") && !f.getName.startsWith("_"))
assert(partFiles.length === 1)

val orcFilePath = new Path(partFiles.head.getAbsolutePath)
val readerOptions = OrcFile.readerOptions(new Configuration())
val reader = OrcFile.createReader(orcFilePath, readerOptions)
var recordReader: RecordReaderImpl = null
try {
recordReader = reader.rows.asInstanceOf[RecordReaderImpl]

// Check the kind
val stripe = recordReader.readStripeFooter(reader.getStripes.get(0))

// The encodings are divided into direct and dictionary-based categories and
// further refined by whether they use RLE v1 or v2. RLE v1 is used by
// Hive 0.11, and RLE v2 was introduced in Hive 0.12 ORC with further improvements.
// For more details, see https://orc.apache.org/specification/
assert(stripe.getColumns(1).getKind === DICTIONARY_V2)
Member: Could you write some comments to explain what DICTIONARY_V2, DIRECT_V2, and DIRECT are?

Member Author: Sure!

if (isSelective) {
assert(stripe.getColumns(2).getKind === DIRECT_V2)
} else {
assert(stripe.getColumns(2).getKind === DICTIONARY_V2)
}
// Floating point types are stored with DIRECT encoding in IEEE 754 floating
// point bit layout.
assert(stripe.getColumns(3).getKind === DIRECT)
} finally {
if (recordReader != null) {
recordReader.close()
}
}
}
}
}

test("create temporary orc table") {
checkAnswer(sql("SELECT COUNT(*) FROM normal_orc_source"), Row(10))

Expand Down Expand Up @@ -284,4 +355,8 @@ class OrcSourceSuite extends OrcSuite with SharedSQLContext {
test("Check BloomFilter creation") {
testBloomFilterCreation(Kind.BLOOM_FILTER_UTF8) // After ORC-101
}

test("Enforce direct encoding column-wise selectively") {
testSelectiveDictionaryEncoding(isSelective = true)
}
}
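As a side note on the encoding kinds asserted above: a minimal standalone sketch (the object name and file-path argument are assumptions; it expects orc-core 1.5.x on the classpath) that dumps the per-column encodings of the first stripe, the same way the test inspects them:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.orc.OrcFile
import org.apache.orc.impl.RecordReaderImpl

object DumpColumnEncodings {
  def main(args: Array[String]): Unit = {
    val reader = OrcFile.createReader(new Path(args(0)), OrcFile.readerOptions(new Configuration()))
    val rows = reader.rows().asInstanceOf[RecordReaderImpl]
    try {
      // Read the footer of the first stripe; it records one encoding per column.
      val stripe = rows.readStripeFooter(reader.getStripes.get(0))
      for (i <- 0 until stripe.getColumnsCount) {
        // DIRECT stores values as-is; DICTIONARY stores distinct values plus
        // indexes into them; the _V2 suffix marks RLE v2 rather than RLE v1.
        println(s"column $i: ${stripe.getColumns(i).getKind}")
      }
    } finally {
      rows.close()
    }
  }
}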
sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/HiveOrcSourceSuite.scala
@@ -182,4 +182,12 @@ class HiveOrcSourceSuite extends OrcSuite with TestHiveSingleton {
}
}
}

test("Enforce direct encoding column-wise selectively") {
Seq(true, false).foreach { convertMetastore =>
withSQLConf(HiveUtils.CONVERT_METASTORE_ORC.key -> s"$convertMetastore") {
testSelectiveDictionaryEncoding(isSelective = false)
Member: So even with CONVERT_METASTORE_ORC set to true, we still can't use selective direct encoding?

Member Author: Yep. This is based on the current behavior, which is a little related to your CTAS PR. Only the read path works as expected.

Member Author: When we change the Spark behavior later, this test will be adapted accordingly.

Member: Ok. I see. Thanks.
}
}
}
}
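To make the behavior discussed in the last thread concrete, a sketch under stated assumptions (the table name and session setup are illustrative, not from this PR): with the Hive write path, the selective direct-encoding property is not honored regardless of the conversion flag.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[1]")
  .enableHiveSupport()
  .getOrCreate()

// The flag controls whether Spark converts metastore ORC tables to the native
// data source implementation; per the discussion above, only the read path is converted.
spark.conf.set("spark.sql.hive.convertMetastoreOrc", "true")

spark.sql("""
  CREATE TABLE hive_orc_example (zipcode STRING, uniqColumn STRING, value DOUBLE)
  STORED AS ORC
  TBLPROPERTIES (orc.column.encoding.direct 'uniqColumn')
""")
spark.sql("INSERT INTO hive_orc_example VALUES ('94086', 'some-unique-string', 0.0)")
// Inspecting the written file as testSelectiveDictionaryEncoding does would still
// report DICTIONARY_V2 for uniqColumn, hence isSelective = false in this suite.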