Skip to content

Oracle Catalog

Harish Butani edited this page Jan 13, 2022 · 9 revisions

Details

Oracle Catalog is an implementation of Spark's CatalogPlugin interface that provides the entry point to catalog information of an Oracle instance to Spark.

Oracle Catalog provides TableCatalog, SupportsNamespaces, StagingTableCatalog and CatalogExtension functionality. It is enabled by setting spark.sql.catalog.oracle. Connection information must also be provided. For example:

spark.sql.catalog.oracle=org.apache.spark.sql.connector.catalog.oracle.OracleCatalog

spark.sql.catalog.oracle.authMethod=ORACLE_WALLET
spark.sql.catalog.oracle.url=jdbc:oracle:thin:@<tns-address>
spark.sql.catalog.oracle.user=<your_demo_user>
spark.sql.catalog.oracle.password=<your_passwd>
spark.sql.catalog.oracle.net.tns_admin=<wallet_location>

The Oracle Catalog enables users to browse schemas, describe tables, query tables, write tables, and create and write external tables.

Oracle Catalog surfaces Oracle tables in Spark as an OracleTable which provides functionality to Spark for StagedTable with SupportsRead with SupportsWrite with SupportsDelete. See the read-path flow and write path flow documents on how the Spark Engine interacts with components of the Oracle plugin during planning and execution.

Planned Support

Step 1 (Catalog API + Table Metadata API + Read Path)

  • Introspect Oracle dictionary
  • Query tables
  • Oracle function registration

Step 2 (Write Path)

  • Write tables
  • Insert, Insert Overwrite, and Delete on tables

Step 3 (Write Path, OCI Object Store Tables)

  • Create, CTAS, replace external tables that are stored in OCI object store and are in Parquet format
  • Special bypass code path for external tables for Insert, Insert Overwrite

Step 4 (Write Path, Merge/Update Tables)

  • Update and Merge operations, when Spark supports them.

Others

  • View integration TBD

Schema/Namespace functionality

The schemas accessible by connected oracle user exposed as Spark Namespaces Query used is:

select username, all_shard
from all_users
where ORACLE_MAINTAINED = 'N'

Tables accessible within a namespace are based on the following query on ALL_TABLES dictionary view

select owner, table_name
from ALL_TABLES
where owner in (select username from all_users where ORACLE_MAINTAINED = 'N')
order by OWNER, TABLE_NAME

Table metadata is read using DBMS_METADATA.get_sxml and DBMS_METADATA.get_xml calls. Most metadata is obtained via the get_sxml call; certain physical characteristics are obtained from the get_xml call.

Namespace API behavior:

  • Oracle namespaces are flat; so a namespace doesn't have any sub-namespaces
  • You cannot create, alter, or drop Oracle catalog namespaces.

Examples:

show namespaces;

+---------+
|namespace|
+---------+
|    ADMIN|
|      UWH|
|    MIKEV|
|SPARKTEST|
|  GGADMIN|
|  MASHUPS|
| PERFUSER|
|    TPCDS|
+---------+

Show current namespace;

+-------+---------+
|catalog|namespace|
+-------+---------+
| oracle|    tpcds|
+-------+---------+

Describe namespace extended oracle.tpcds;

+--------------+--------------------------------+
|name          |value                           |
+--------------+--------------------------------+
|Namespace Name|tpcds                           |
|Location      |jdbc:oracle:thin:@mammoth_medium|
|Owner         |tpcds                           |
+--------------+--------------------------------+

use TPCDS;
show tables;

+---------+----------------------+
|namespace|tableName             |
+---------+----------------------+
|tpcds    |WEB_RETURNS           |
|tpcds    |ITEM                  |
|tpcds    |STORE_SALES           |
|tpcds    |WAREHOUSE             |
|tpcds    |STORE                 |
|tpcds    |HOUSEHOLD_DEMOGRAPHICS|
|tpcds    |SHIP_MODE             |
|tpcds    |CATALOG_RETURNS       |
|tpcds    |REASON                |
|tpcds    |WEB_SALES             |
|tpcds    |STORE_RETURNS         |
|tpcds    |CUSTOMER              |
|tpcds    |TIME_DIM              |
|tpcds    |CUSTOMER_ADDRESS      |
|tpcds    |CUSTOMER_DEMOGRAPHICS |
|tpcds    |DATE_DIM              |
|tpcds    |WEB_PAGE              |
|tpcds    |PROMOTION             |
|tpcds    |CATALOG_SALES         |
|tpcds    |CALL_CENTER           |
|tpcds    |INVENTORY             |
|tpcds    |CATALOG_PAGE          |
|tpcds    |INCOME_BAND           |
|tpcds    |WEB_SITE              |
+---------+----------------------+

use SPARKTEST;
show tables;

+---------+------------------------------+
|namespace|tableName                     |
+---------+------------------------------+
|SPARKTEST|COMP                          |
|SPARKTEST|INTERVAL_PAR_DEMO             |
|SPARKTEST|SALES_BY_REGION_UNKNOWN_VALUES|
|SPARKTEST|MIXED_CASE_NAME               |
|SPARKTEST|SPARK_TEST_T1                 |
|SPARKTEST|mixed_case_name               |
|SPARKTEST|SALES_BY_REGION_AND_CHANNEL   |
|SPARKTEST|Mixed_case_name               |
|SPARKTEST|QUARTERLY_REGIONAL_SALES      |
|SPARKTEST|SALES_BY_REGION               |
|SPARKTEST|HASH_PARTITION_TABLE          |
|SPARKTEST|REGIONS                       |
|SPARKTEST|SALES_RANGE_PARTITION         |
|SPARKTEST|COUNTRIES2                    |
+---------+------------------------------+

Table functionality

  • Both oracle managed and external tables are surfaced in Spark.
  • You can scan and issue Insert/Insert Overwrite/Delete on these tables.
  • You can only create external tables from Spark.
  • All other DDL actions should be done using Oracle SQL.

Examples

describe extended tpcds.store_sales;

+----------------------------+--------------------------------------------------------------------------+-------+
|col_name                    |data_type                                                                 |comment|
+----------------------------+--------------------------------------------------------------------------+-------+
|SS_SOLD_DATE_SK             |decimal(38,18)                                                            |       |
|SS_SOLD_TIME_SK             |decimal(38,18)                                                            |       |
|SS_ITEM_SK                  |decimal(38,18)                                                            |       |
|SS_CUSTOMER_SK              |decimal(38,18)                                                            |       |
|SS_CDEMO_SK                 |decimal(38,18)                                                            |       |
|SS_HDEMO_SK                 |decimal(38,18)                                                            |       |
|SS_ADDR_SK                  |decimal(38,18)                                                            |       |
|SS_STORE_SK                 |decimal(38,18)                                                            |       |
|SS_PROMO_SK                 |decimal(38,18)                                                            |       |
|SS_TICKET_NUMBER            |decimal(38,18)                                                            |       |
|SS_QUANTITY                 |decimal(38,18)                                                            |       |
|SS_WHOLESALE_COST           |decimal(38,18)                                                            |       |
|SS_LIST_PRICE               |decimal(38,18)                                                            |       |
|SS_SALES_PRICE              |decimal(38,18)                                                            |       |
|SS_EXT_DISCOUNT_AMT         |decimal(38,18)                                                            |       |
|SS_EXT_SALES_PRICE          |decimal(38,18)                                                            |       |
|SS_EXT_WHOLESALE_COST       |decimal(38,18)                                                            |       |
|SS_EXT_LIST_PRICE           |decimal(38,18)                                                            |       |
|SS_EXT_TAX                  |decimal(38,18)                                                            |       |
|SS_COUPON_AMT               |decimal(38,18)                                                            |       |
|SS_NET_PAID                 |decimal(38,18)                                                            |       |
|SS_NET_PAID_INC_TAX         |decimal(38,18)                                                            |       |
|SS_NET_PROFIT               |decimal(38,18)                                                            |       |
|                            |                                                                          |       |
|# Partitioning              |                                                                          |       |
|Part 0                      |SS_SOLD_DATE_SK                                                           |       |
|                            |                                                                          |       |
|# Detailed Table Information|                                                                          |       |
|Name                        |TPCDS.STORE_SALES                                                         |       |
|Table Properties            |[connURL=jdbc:oracle:thin:@mammoth_medium,isExternal=false,userName=tpcds]|       |
+----------------------------+--------------------------------------------------------------------------+-------+

-- table partitioned by state and channel
describe sparktest.sales_by_region_and_channel;

+----------------------------+--------------------------------------------------------------------------+-------+
|col_name                    |data_type                                                                 |comment|
+----------------------------+--------------------------------------------------------------------------+-------+
|DEPT_NUMBER                 |decimal(38,18)                                                            |       |
|DEPT_NAME                   |string                                                                    |       |
|QUARTERLY_SALES             |decimal(10,2)                                                             |       |
|STATE                       |string                                                                    |       |
|CHANNEL                     |string                                                                    |       |
|                            |                                                                          |       |
|# Partitioning              |                                                                          |       |
|Part 0                      |STATE                                                                     |       |
|Part 1                      |CHANNEL                                                                   |       |
|                            |                                                                          |       |
|# Detailed Table Information|                                                                          |       |
|Name                        |SPARKTEST.SALES_BY_REGION_AND_CHANNEL                                     |       |
|Table Properties            |[connURL=jdbc:oracle:thin:@mammoth_medium,isExternal=false,userName=tpcds]|       |
+----------------------------+--------------------------------------------------------------------------+-------+

-- external table
describe sparktest.spark_test_t1;

+----------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------+
|col_name                    |data_type                                                                                                                                                                                                                                                                                                                                                                      |comment|
+----------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------+
|NAME                        |string                                                                                                                                                                                                                                                                                                                                                                         |       |
|AGE                         |bigint                                                                                                                                                                                                                                                                                                                                                                         |       |
|                            |                                                                                                                                                                                                                                                                                                                                                                               |       |
|# Partitioning              |                                                                                                                                                                                                                                                                                                                                                                               |       |
|Not partitioned             |                                                                                                                                                                                                                                                                                                                                                                               |       |
|                            |                                                                                                                                                                                                                                                                                                                                                                               |       |
|# Detailed Table Information|                                                                                                                                                                                                                                                                                                                                                                               |       |
|Name                        |SPARKTEST.SPARK_TEST_T1                                                                                                                                                                                                                                                                                                                                                        |       |
|Table Properties            |[ACCESS_DRIVER_TYPE=ORACLE_BIGDATA,ACCESS_PARAMETERS=com.oracle.bigdata.credential.name=OS_EXT_OCI
com.oracle.bigdata.fileformat=PARQUET
  ,DEFAULT_DIRECTORY=DATA_PUMP_DIR,LOCATION=https://objectstorage.us-ashburn-1.oraclecloud.com/n/idlxex3qf8sf/b/SparkTest/o/t1.parquet,REJECT_LIMIT=UNLIMITED,connURL=jdbc:oracle:thin:@mammoth_medium,isExternal=true,userName=tpcds]|       |
+----------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------+

-- interval partitioned
describe sparktest.interval_par_demo;

+----------------------------+--------------------------------------------------------------------------+-------+
|col_name                    |data_type                                                                 |comment|
+----------------------------+--------------------------------------------------------------------------+-------+
|START_DATE                  |date                                                                      |       |
|STORE_ID                    |decimal(38,18)                                                            |       |
|INVENTORY_ID                |int                                                                       |       |
|QTY_SOLD                    |smallint                                                                  |       |
|                            |                                                                          |       |
|# Partitioning              |                                                                          |       |
|Part 0                      |START_DATE                                                                |       |
|                            |                                                                          |       |
|# Detailed Table Information|                                                                          |       |
|Name                        |SPARKTEST.INTERVAL_PAR_DEMO                                               |       |
|Table Properties            |[connURL=jdbc:oracle:thin:@mammoth_medium,isExternal=false,userName=tpcds]|       |
+----------------------------+--------------------------------------------------------------------------+-------+

Table Partitioning functionality

We don't support Partition Management from Spark, but users can issue show partitions on Oracle tables. Oracle has many partitioning schemes such as RANGE, LIST, INTERVAL, and HASH. These don't map very well to Spark's concept of a partition.

  • A Range partition can contain a range of partition column values.
  • A List partition can contain an explicit set of column values.
  • An Interval partition can contain any values in an interval of the data type's domain.
  • A Hash partition contains all partition column values that hash to the same value.
  • Further, a partition-value can comprise multiple columns such as PARTITION BY LIST (state, channel)

For these reasons a show partitions Spark command run on an Oracle table typically doesn't show the partition specifications very well, and users shouldn't interpret the output like a regular Spark table's partition specification.

For example, running Spark's show partitions on an Oracle table with the partition scheme PARTITION BY RANGE (TSTAMP) SUBPARTITION BY LIST (deptno) we get

+-----------------------------------+
|partition                          |
+-----------------------------------+
|TSTAMP=null/DEPTNO=%271%27         |
|TSTAMP=null/DEPTNO=%271%27         |

We are not converting the TSTAMP value in Oracle's metadata into a Spark literal. Further we cannot show (or let Spark know) about the range semantics of the first level of partitioning.

In the case of 'TPCDS.STORE_SALES' which is partitioned as PARTITION BY RANGE (SS_SOLD_DATE_SK) the output is

+------------------------------------------+
|partition                                 |
+------------------------------------------+
|SS_SOLD_DATE_SK=2450816.000000000000000000|
|SS_SOLD_DATE_SK=2450846.000000000000000000|
|SS_SOLD_DATE_SK=2450874.000000000000000000|
...

This is better, but we still cannot show that the partitioning has range semantics.

Show Oracle Partitions Command

For these reasons we provide a show oracle partitions <table_id> extension command.

So running show oracle partitions on the first table above outputs:

+-------------------------------------------------------------------------------------------------------------------+
|partition                                                                                                          |
+-------------------------------------------------------------------------------------------------------------------+
|Structure: Partition(kind=RANGE, cols=[TSTAMP]), Sub-Partition(kind=LIST, cols=[DEPTNO])                           |
|0. Partition(name=P01, values=TIMESTAMP' 2010-01-01 00:00:00'), 0. Sub-Partition(name=P01_SP1, values='1')         |
|0. Partition(name=P01, values=TIMESTAMP' 2010-01-01 00:00:00'), 1. Sub-Partition(name=P01_SP2, values='2')         |
|0. Partition(name=P01, values=TIMESTAMP' 2010-01-01 00:00:00'), 2. Sub-Partition(name=P01_SP3, values='3')         |
|0. Partition(name=P01, values=TIMESTAMP' 2010-01-01 00:00:00'), 3. Sub-Partition(name=P01_SP4, values='4')         |
|1. Partition(name=P02, values=TIMESTAMP' 2010-02-01 00:00:00'), 0. Sub-Partition(name=P02_SP1, values='1')         |
...

This clearly shows the partitioning scheme and the partition names and values.

Running show oracle partitions on 'TPCDS.STORE_SALES' outputs:

+--------------------------------------------------------+
|partition                                               |
+--------------------------------------------------------+
|Structure: Partition(kind=RANGE, cols=[SS_SOLD_DATE_SK])|
|0. Partition(name=P0, values=2450816)                   |
|1. Partition(name=P1176, values=2450846)                |
|2. Partition(name=P1177, values=2450874)                |
...

Metadata caching

Oracle metadata is cached in the Spark driver, to minimize metadata access overhead. If the spark.sql.catalog.oracle.metadata_cache_loc parameter is set, then cached metadata will be reused across Spark restarts. This can be set to any local accessible folder on the Spark driver machine. if this parameter is not set, metadata is cached to a new temporary folder every time the Spark cluster starts.