
Conversation

@dongjoon-hyun
Member

@dongjoon-hyun dongjoon-hyun commented Jul 9, 2016

What changes were proposed in this pull request?

This PR introduces INFORMATION_SCHEMA, a database consisting of views that provide information about all of the tables, views, and columns in a database.

scala> spark.catalog
scala> sql("create table t(a int, b double)")
scala> sql("create view v as select 1")
scala> sql("select * from information_schema.databases").show()
+------------+------------------+
|CATALOG_NAME|       SCHEMA_NAME|
+------------+------------------+
|     default|           default|
|     default|information_schema|
+------------+------------------+

scala> sql("select * from information_schema.schemata").show()
+------------+------------------+
|CATALOG_NAME|       SCHEMA_NAME|
+------------+------------------+
|     default|           default|
|     default|information_schema|
+------------+------------------+

scala> sql("select * from information_schema.tables").show()
+-------------+------------+----------+----------+
|TABLE_CATALOG|TABLE_SCHEMA|TABLE_NAME|TABLE_TYPE|
+-------------+------------+----------+----------+
|      default|     default|         t|     TABLE|
|      default|     default|         v|      VIEW|
+-------------+------------+----------+----------+

scala> sql("select * from information_schema.views").show()
+-------------+------------+----------+---------------+
|TABLE_CATALOG|TABLE_SCHEMA|TABLE_NAME|VIEW_DEFINITION|
+-------------+------------+----------+---------------+
|      default|     default|         v|           VIEW|
+-------------+------------+----------+---------------+


scala> sql("select * from information_schema.columns").show()
+-------------+------------+----------+-----------+----------------+-----------+---------+
|TABLE_CATALOG|TABLE_SCHEMA|TABLE_NAME|COLUMN_NAME|ORDINAL_POSITION|IS_NULLABLE|DATA_TYPE|
+-------------+------------+----------+-----------+----------------+-----------+---------+
|      default|     default|         t|          a|               0|       true|      int|
|      default|     default|         t|          b|               1|       true|   double|
|      default|     default|         v|          1|               0|       true|      int|
+-------------+------------+----------+-----------+----------------+-----------+---------+

scala> sql("select * from information_schema.session_variables").show(false)
+-------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------+
|VARIABLE_NAME                  |VARIABLE_VALUE                                                                                                                               |
+-------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------+
|hive.metastore.warehouse.dir   |file:/Users/dongjoon/SPARK-16452-SCHEMA/spark-warehouse                                                                                      |
|spark.app.id                   |local-1468412934354                                                                                                                          |
|spark.app.name                 |Spark shell                                                                                                                                  |
|spark.driver.host              |192.168.0.12                                                                                                                                 |
|spark.driver.memory            |6G                                                                                                                                           |
|spark.driver.port              |56198                                                                                                                                        |
|spark.executor.id              |driver                                                                                                                                       |
|spark.home                     |/Users/dongjoon/SPARK-16452-SCHEMA                                                                                                           |
|spark.jars                     |                                                                                                                                             |
|spark.master                   |local[*]                                                                                                                                     |
|spark.repl.class.outputDir     |/private/var/folders/dc/1pz9m69x14q_gw8t7m143t1c0000gn/T/spark-fd7a86e9-651f-4718-be82-6b84b30a97ac/repl-d4e25a57-c5f1-40d9-809b-1276c55f59c6|
|spark.repl.class.uri           |spark://192.168.0.12:56198/classes                                                                                                           |
|spark.sql.catalogImplementation|hive                                                                                                                                         |
|spark.submit.deployMode        |client                                                                                                                                       |
+-------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------+
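For illustration only (this query is not part of the PR description): since the metadata are exposed as ordinary SQL views, they should compose with standard predicates. The column names below are assumed to match the output shown above.

```scala
// Hedged sketch, continuing the session above. Given the columns output
// shown earlier, this should return the rows (a, int) and (b, double).
scala> sql("SELECT COLUMN_NAME, DATA_TYPE FROM information_schema.columns " +
     |   "WHERE TABLE_NAME = 't' ORDER BY ORDINAL_POSITION").show()
```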

How was this patch tested?

Pass the Jenkins tests, including a new test suite.

@dongjoon-hyun
Member Author

Hi, @rxin.
I made this PR following your directional advice.

@SparkQA

SparkQA commented Jul 9, 2016

Test build #62023 has finished for PR 14116 at commit cbdd641.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class DatabasesSource extends SchemaRelationProvider
    • case class DatabasesRelation(@transient sparkSession: SparkSession)
    • class TablesSource extends SchemaRelationProvider
    • case class TablesRelation(@transient sparkSession: SparkSession)
    • class ViewsSource extends SchemaRelationProvider
    • case class ViewsRelation(@transient sparkSession: SparkSession)
    • class ColumnsSource extends SchemaRelationProvider
    • case class ColumnsRelation(@transient sparkSession: SparkSession)
    • class SessionVariablesSource extends SchemaRelationProvider
    • case class SessionVariablesRelation(@transient sparkSession: SparkSession)

@dongjoon-hyun
Member Author

Wow, there was only one error. I fixed it a few minutes ago.

- alter table: rename *** FAILED *** (8 milliseconds)
[info]   ArrayBuffer(`dbx`.`tab2`, `columns`, `databases`, `schemata`, `session_variables`, `tables`, `views`) did not equal List(`dbx`.`tab2`) (DDLSuite.scala:466)

listTables returns temporary tables for all databases. This suite should be fixed generally, not specifically for this PR.

@dongjoon-hyun
Member Author

cc @hvanhovell , too.

@SparkQA

SparkQA commented Jul 9, 2016

Test build #62029 has finished for PR 14116 at commit d9d9344.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member Author

The failure seems to be unrelated to this PR.

FAIL [31.391s]: test_predictions (pyspark.mllib.tests.StreamingLogisticRegressionWithSGDTests)
Test predicted values on a toy model.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/mllib/tests.py", line 1426, in test_predictions
    self._eventually(condition, catch_assertions=True)
  File "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/mllib/tests.py", line 135, in _eventually
    raise lastValue
AssertionError: 11 != 20

Locally, it passes like the following.

$ python python/run-tests.py --python-executables python2.7 --modules pyspark-mllib
...
Finished test(python2.7): pyspark.mllib.tests (199s)
Tests passed in 199 seconds

@dongjoon-hyun
Member Author

Retest this please.

Contributor

mhm the indentation is really weird - i don't think we need to indent each line with one more level ...

Member Author

Yep.

@rxin
Contributor

rxin commented Jul 9, 2016

It doesn't look like we are getting any benefits from column pruning - perhaps we should just do predicate pushdown? The code would be simpler.
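For context, a filter-pushdown variant of one of these relations can be sketched against Spark's public data source API. This is a hypothetical sketch, not code from this PR: the class and helper names are invented, and it assumes the `PrunedFilteredScan` trait from `org.apache.spark.sql.sources` in Spark 2.x.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, EqualTo, Filter, PrunedFilteredScan}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Hypothetical relation: pushes an EqualTo predicate on TABLE_NAME down to
// the catalog lookup instead of listing everything and filtering afterwards.
class TablesRelation(override val sqlContext: SQLContext)
    extends BaseRelation with PrunedFilteredScan {

  override def schema: StructType = StructType(Seq(
    StructField("TABLE_SCHEMA", StringType),
    StructField("TABLE_NAME", StringType)))

  override def buildScan(
      requiredColumns: Array[String], filters: Array[Filter]): RDD[Row] = {
    // If the query constrains TABLE_NAME, fetch only that table.
    val byName = filters.collectFirst { case EqualTo("TABLE_NAME", v: String) => v }
    val rows = byName match {
      case Some(name) => lookupTable(name) // targeted catalog call (hypothetical)
      case None       => listAllTables()   // full listing fallback (hypothetical)
    }
    sqlContext.sparkContext.parallelize(rows.map(Row.fromSeq(_)))
  }

  // Hypothetical helpers standing in for metastore access.
  private def lookupTable(name: String): Seq[Seq[String]] = Seq(Seq("default", name))
  private def listAllTables(): Seq[Seq[String]] = Seq(Seq("default", "t"), Seq("default", "v"))
}
```

With this shape, a query such as `SELECT * FROM information_schema.tables WHERE TABLE_NAME = 't'` could hit the catalog once instead of enumerating every table, which is the simplification being suggested.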

Contributor

private?

Contributor

and setupTable or registerTable

Member Author

I used registerTable.

@rxin
Contributor

rxin commented Jul 9, 2016

This looks pretty good. Can you add more comments explaining what each class/method does, and how the whole thing works?

@dongjoon-hyun
Member Author

Sure. I'll update the PR and proceed that way.
Thank you, @rxin.

@SparkQA

SparkQA commented Jul 9, 2016

Test build #62030 has finished for PR 14116 at commit d9d9344.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 9, 2016

Test build #62031 has finished for PR 14116 at commit d9d9344.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 10, 2016

Test build #62038 has finished for PR 14116 at commit b89039d.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class DatabasesRelationProvider extends SchemaRelationProvider
    • class TablesRelationProvider extends SchemaRelationProvider
    • class ViewsRelationProvider extends SchemaRelationProvider
    • class ColumnsRelationProvider extends SchemaRelationProvider
    • class SessionVariablesRelationProvider extends SchemaRelationProvider

@SparkQA

SparkQA commented Jul 10, 2016

Test build #62040 has finished for PR 14116 at commit 8cb4956.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 10, 2016

Test build #62041 has finished for PR 14116 at commit a55da04.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 10, 2016

Test build #62043 has finished for PR 14116 at commit c770315.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member Author

The SparkR failures are due to my remaining TODO item.

@dongjoon-hyun dongjoon-hyun changed the title [SPARK-16452][SQL][WIP] Basic INFORMATION_SCHEMA support [SPARK-16452][SQL] Basic INFORMATION_SCHEMA support Jul 10, 2016
@dongjoon-hyun dongjoon-hyun changed the title [SPARK-16452][SQL] Basic INFORMATION_SCHEMA support [SPARK-16452][SQL] Support basic INFORMATION_SCHEMA Jul 10, 2016
@dongjoon-hyun
Member Author

Locally, the last commit passes the R tests, too. Now, I think I've finished my first implementation.
While waiting for #14114 and #14115, I'll move on to other new issues.

@SparkQA

SparkQA commented Jul 10, 2016

Test build #62064 has finished for PR 14116 at commit a645410.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 11, 2016

Test build #62065 has finished for PR 14116 at commit e6e96eb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member Author

Now, all tests pass.

Contributor

val maybe?

Member Author

Oh, thank you for the review! Right.

Member

Can you explain which cases will enter this processing? Is it possible that we could hit a backtick-quoted INFORMATION_SCHEMA_DATABASE here?

Member Author

A backtick-quoted name will not reach here.

scala> sql("create table `aaa.bbb`(a int)")
org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: org.apache.hadoop.hive.ql.metadata.HiveException: [aaa.bbb]: is not a valid table name;

@gatorsmile
Member

        val catalog = spark.sessionState.catalog
        catalog.setCurrentDatabase(SessionCatalog.INFORMATION_SCHEMA_DATABASE)
        sql("CREATE TABLE my_tab (age INT, name STRING)")

We can set the current database to INFORMATION_SCHEMA_DATABASE, but we get an error when trying to create a table in this database.

Database 'information_schema' not found;
org.apache.spark.sql.catalyst.analysis.NoSuchDatabaseException: Database 'information_schema' not found;

Member

@gatorsmile gatorsmile Jul 18, 2016

I am lost. The INFORMATION_SCHEMA_DATABASE database is created through a SQL command, but the function databaseExists always returns true for INFORMATION_SCHEMA_DATABASE.

See the code:
https://github.com/dongjoon-hyun/spark/blob/b04f70127760dde5700a63e7dad4100cf407b863/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala#L177-L184

@gatorsmile
Member

Finished my first pass. My major concern is that the handling of INFORMATION_SCHEMA does not look clean to me. It looks hacky, and many holes are caused by it. More test cases are needed.

@dongjoon-hyun
Member Author

Thank you so much, @gatorsmile. And sorry for the late response. I definitely have many things to do. Now it's my turn; let's see how many of them I can handle. :)

@SparkQA

SparkQA commented Jul 22, 2016

Test build #62721 has finished for PR 14116 at commit eddaec6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 26, 2016

Test build #62897 has finished for PR 14116 at commit 72bc2dd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 30, 2016

Test build #63023 has finished for PR 14116 at commit 0fc02f6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member Author

Rebased onto master because CatalogColumn was removed from master.

@SparkQA

SparkQA commented Aug 8, 2016

Test build #63351 has finished for PR 14116 at commit 9ad03d9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 12, 2016

Test build #63641 has finished for PR 14116 at commit dc5d1dc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 14, 2016

Test build #63757 has finished for PR 14116 at commit e9302e1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 19, 2016

Test build #64041 has finished for PR 14116 at commit bd85aa5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 24, 2016

Test build #64358 has finished for PR 14116 at commit 9bb92bf.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 27, 2016

Test build #64530 has finished for PR 14116 at commit 5704e83.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 29, 2016

Test build #64577 has finished for PR 14116 at commit 7543069.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 2, 2016

Test build #64876 has finished for PR 14116 at commit d7bfc7b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 7, 2016

Test build #65019 has finished for PR 14116 at commit d107721.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 9, 2016

Test build #65141 has finished for PR 14116 at commit a8c30c6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 12, 2016

Test build #65250 has finished for PR 14116 at commit c531025.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 15, 2016

Test build #65431 has finished for PR 14116 at commit 2296c0e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member Author

Rebased to resolve conflicts.

@SparkQA

SparkQA commented Sep 20, 2016

Test build #65664 has finished for PR 14116 at commit e832f5b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member Author

The issue might be tackled later, after Catalog Federation. For now, I'm closing this PR since it's too stale. Thank you all for spending time on this PR.

@dongjoon-hyun dongjoon-hyun deleted the SPARK-16452 branch January 7, 2019 07:02

8 participants