Oracle Catalog
Oracle Catalog is an implementation of Spark's CatalogPlugin interface that provides Spark with an entry point to the catalog information of an Oracle instance.
Oracle Catalog provides TableCatalog, SupportsNamespaces, StagingTableCatalog, and CatalogExtension functionality. It is enabled by setting spark.sql.catalog.oracle. Connection information must also be provided. For example:
spark.sql.catalog.oracle=org.apache.spark.sql.connector.catalog.oracle.OracleCatalog
spark.sql.catalog.oracle.authMethod=ORACLE_WALLET
spark.sql.catalog.oracle.url=jdbc:oracle:thin:@<tns-address>
spark.sql.catalog.oracle.user=<your_demo_user>
spark.sql.catalog.oracle.password=<your_passwd>
spark.sql.catalog.oracle.net.tns_admin=<wallet_location>
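The same settings can be supplied programmatically when building the Spark session. A minimal Scala sketch, assuming the plugin jar is on the classpath; the connection values are placeholders, as above:
import org.apache.spark.sql.SparkSession

// Build a session with the Oracle catalog registered; replace the
// placeholder connection values with your own URL, user, and wallet.
val spark = SparkSession.builder()
  .appName("oracle-catalog-demo")
  .config("spark.sql.catalog.oracle",
    "org.apache.spark.sql.connector.catalog.oracle.OracleCatalog")
  .config("spark.sql.catalog.oracle.authMethod", "ORACLE_WALLET")
  .config("spark.sql.catalog.oracle.url", "jdbc:oracle:thin:@<tns-address>")
  .config("spark.sql.catalog.oracle.user", "<your_demo_user>")
  .config("spark.sql.catalog.oracle.password", "<your_passwd>")
  .config("spark.sql.catalog.oracle.net.tns_admin", "<wallet_location>")
  .getOrCreate()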
The Oracle Catalog enables users to browse schemas, describe tables, query tables, write tables, and create and write external tables.
Oracle Catalog surfaces Oracle tables in Spark as an OracleTable, which provides Spark with StagedTable with SupportsRead with SupportsWrite with SupportsDelete functionality.
See the read-path flow and write-path flow documents for how the Spark engine interacts with components of the Oracle plugin during planning and execution. Supported capabilities (a short usage sketch follows the list):
- Introspect Oracle dictionary
- Query tables
- Oracle function registration
- Write tables
- Insert, Insert Overwrite, and Delete on tables
- Create, CTAS, and replace of external tables stored in OCI object store in Parquet format
- A special bypass code path for Insert and Insert Overwrite on external tables
- Update and Merge operations, when Spark supports them
- View integration (TBD)
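Once the catalog is configured, Oracle tables can be read and written with the usual Spark APIs. A minimal sketch; the table name comes from the examples further below, and staging_view is hypothetical:
// Query an Oracle table through the catalog.
val storeSales = spark.table("oracle.tpcds.store_sales")
storeSales.groupBy("SS_STORE_SK").count().show(5)

// Append rows to an existing Oracle table; staging_view is a
// hypothetical temp view with a matching schema.
spark.sql("INSERT INTO oracle.tpcds.store_sales SELECT * FROM staging_view")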
The schemas accessible to the connected Oracle user are exposed as Spark namespaces. The query used is:
select username, all_shard
from all_users
where ORACLE_MAINTAINED = 'N'
Tables accessible within a namespace are determined by the following query on the ALL_TABLES dictionary view:
select owner, table_name
from ALL_TABLES
where owner in (select username from all_users where ORACLE_MAINTAINED = 'N')
order by OWNER, TABLE_NAME
Table metadata is read using DBMS_METADATA.get_sxml and DBMS_METADATA.get_xml calls. Most metadata is obtained via the get_sxml call; certain physical characteristics are obtained from the get_xml call.
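To inspect the raw metadata that the catalog consumes, the same call can be issued directly over JDBC. A sketch, assuming the Oracle JDBC driver is on the classpath and reusing the placeholder connection values from the configuration above:
import java.sql.DriverManager

// Fetch the SXML description of TPCDS.STORE_SALES, as the catalog
// does internally via DBMS_METADATA.
val conn = DriverManager.getConnection(
  "jdbc:oracle:thin:@<tns-address>", "<your_demo_user>", "<your_passwd>")
try {
  val rs = conn.createStatement().executeQuery(
    "select dbms_metadata.get_sxml('TABLE', 'STORE_SALES', 'TPCDS') from dual")
  while (rs.next()) println(rs.getString(1))
} finally conn.close()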
- Oracle namespaces are flat, so a namespace doesn't have any sub-namespaces.
- You cannot create, alter, or drop Oracle catalog namespaces; such statements are expected to fail (see the sketch below).
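For example, namespace DDL such as the following hypothetical statement is expected to be rejected by the catalog:
// Hypothetical: namespace DDL is not supported, so this should fail.
spark.sql("CREATE NAMESPACE oracle.new_schema")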
show namespaces;
+---------+
|namespace|
+---------+
| ADMIN|
| UWH|
| MIKEV|
|SPARKTEST|
| GGADMIN|
| MASHUPS|
| PERFUSER|
| TPCDS|
+---------+
show current namespace;
+-------+---------+
|catalog|namespace|
+-------+---------+
| oracle| tpcds|
+-------+---------+
describe namespace extended oracle.tpcds;
+--------------+--------------------------------+
|name |value |
+--------------+--------------------------------+
|Namespace Name|tpcds |
|Location |jdbc:oracle:thin:@mammoth_medium|
|Owner |tpcds |
+--------------+--------------------------------+
use TPCDS;
show tables;
+---------+----------------------+
|namespace|tableName |
+---------+----------------------+
|tpcds |WEB_RETURNS |
|tpcds |ITEM |
|tpcds |STORE_SALES |
|tpcds |WAREHOUSE |
|tpcds |STORE |
|tpcds |HOUSEHOLD_DEMOGRAPHICS|
|tpcds |SHIP_MODE |
|tpcds |CATALOG_RETURNS |
|tpcds |REASON |
|tpcds |WEB_SALES |
|tpcds |STORE_RETURNS |
|tpcds |CUSTOMER |
|tpcds |TIME_DIM |
|tpcds |CUSTOMER_ADDRESS |
|tpcds |CUSTOMER_DEMOGRAPHICS |
|tpcds |DATE_DIM |
|tpcds |WEB_PAGE |
|tpcds |PROMOTION |
|tpcds |CATALOG_SALES |
|tpcds |CALL_CENTER |
|tpcds |INVENTORY |
|tpcds |CATALOG_PAGE |
|tpcds |INCOME_BAND |
|tpcds |WEB_SITE |
+---------+----------------------+
use SPARKTEST;
show tables;
+---------+------------------------------+
|namespace|tableName |
+---------+------------------------------+
|SPARKTEST|COMP |
|SPARKTEST|INTERVAL_PAR_DEMO |
|SPARKTEST|SALES_BY_REGION_UNKNOWN_VALUES|
|SPARKTEST|MIXED_CASE_NAME |
|SPARKTEST|SPARK_TEST_T1 |
|SPARKTEST|mixed_case_name |
|SPARKTEST|SALES_BY_REGION_AND_CHANNEL |
|SPARKTEST|Mixed_case_name |
|SPARKTEST|QUARTERLY_REGIONAL_SALES |
|SPARKTEST|SALES_BY_REGION |
|SPARKTEST|HASH_PARTITION_TABLE |
|SPARKTEST|REGIONS |
|SPARKTEST|SALES_RANGE_PARTITION |
|SPARKTEST|COUNTRIES2 |
+---------+------------------------------+
- Both Oracle-managed and external tables are surfaced in Spark.
- You can scan and issue Insert/Insert Overwrite/Delete on these tables.
- You can only create external tables from Spark (see the sketch below).
- All other DDL actions should be done using Oracle SQL.
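A hypothetical CTAS that creates a Parquet external table in OCI object store; the table name, bucket, and location are illustrative only, and the exact options the plugin accepts may differ (see the DML Operations page):
// Hypothetical CTAS into OCI object storage; the location mirrors the
// SPARK_TEST_T1 example shown below.
spark.sql("""
  CREATE TABLE oracle.sparktest.t1_copy
  USING parquet
  LOCATION 'https://objectstorage.us-ashburn-1.oraclecloud.com/n/<namespace>/b/<bucket>/o/t1_copy'
  AS SELECT name, age FROM oracle.sparktest.spark_test_t1
""")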
describe extended tpcds.store_sales;
+----------------------------+--------------------------------------------------------------------------+-------+
|col_name |data_type |comment|
+----------------------------+--------------------------------------------------------------------------+-------+
|SS_SOLD_DATE_SK |decimal(38,18) | |
|SS_SOLD_TIME_SK |decimal(38,18) | |
|SS_ITEM_SK |decimal(38,18) | |
|SS_CUSTOMER_SK |decimal(38,18) | |
|SS_CDEMO_SK |decimal(38,18) | |
|SS_HDEMO_SK |decimal(38,18) | |
|SS_ADDR_SK |decimal(38,18) | |
|SS_STORE_SK |decimal(38,18) | |
|SS_PROMO_SK |decimal(38,18) | |
|SS_TICKET_NUMBER |decimal(38,18) | |
|SS_QUANTITY |decimal(38,18) | |
|SS_WHOLESALE_COST |decimal(38,18) | |
|SS_LIST_PRICE |decimal(38,18) | |
|SS_SALES_PRICE |decimal(38,18) | |
|SS_EXT_DISCOUNT_AMT |decimal(38,18) | |
|SS_EXT_SALES_PRICE |decimal(38,18) | |
|SS_EXT_WHOLESALE_COST |decimal(38,18) | |
|SS_EXT_LIST_PRICE |decimal(38,18) | |
|SS_EXT_TAX |decimal(38,18) | |
|SS_COUPON_AMT |decimal(38,18) | |
|SS_NET_PAID |decimal(38,18) | |
|SS_NET_PAID_INC_TAX |decimal(38,18) | |
|SS_NET_PROFIT |decimal(38,18) | |
| | | |
|# Partitioning | | |
|Part 0 |SS_SOLD_DATE_SK | |
| | | |
|# Detailed Table Information| | |
|Name |TPCDS.STORE_SALES | |
|Table Properties |[connURL=jdbc:oracle:thin:@mammoth_medium,isExternal=false,userName=tpcds]| |
+----------------------------+--------------------------------------------------------------------------+-------+
-- table partitioned by state and channel
describe sparktest.sales_by_region_and_channel;
+----------------------------+--------------------------------------------------------------------------+-------+
|col_name |data_type |comment|
+----------------------------+--------------------------------------------------------------------------+-------+
|DEPT_NUMBER |decimal(38,18) | |
|DEPT_NAME |string | |
|QUARTERLY_SALES |decimal(10,2) | |
|STATE |string | |
|CHANNEL |string | |
| | | |
|# Partitioning | | |
|Part 0 |STATE | |
|Part 1 |CHANNEL | |
| | | |
|# Detailed Table Information| | |
|Name |SPARKTEST.SALES_BY_REGION_AND_CHANNEL | |
|Table Properties |[connURL=jdbc:oracle:thin:@mammoth_medium,isExternal=false,userName=tpcds]| |
+----------------------------+--------------------------------------------------------------------------+-------+
-- external table
describe sparktest.spark_test_t1;
+----------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------+
|col_name |data_type |comment|
+----------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------+
|NAME |string | |
|AGE |bigint | |
| | | |
|# Partitioning | | |
|Not partitioned | | |
| | | |
|# Detailed Table Information| | |
|Name |SPARKTEST.SPARK_TEST_T1 | |
|Table Properties |[ACCESS_DRIVER_TYPE=ORACLE_BIGDATA,ACCESS_PARAMETERS=com.oracle.bigdata.credential.name=OS_EXT_OCI
com.oracle.bigdata.fileformat=PARQUET
,DEFAULT_DIRECTORY=DATA_PUMP_DIR,LOCATION=https://objectstorage.us-ashburn-1.oraclecloud.com/n/idlxex3qf8sf/b/SparkTest/o/t1.parquet,REJECT_LIMIT=UNLIMITED,connURL=jdbc:oracle:thin:@mammoth_medium,isExternal=true,userName=tpcds]| |
+----------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------+
-- interval partitioned
describe sparktest.interval_par_demo;
+----------------------------+--------------------------------------------------------------------------+-------+
|col_name |data_type |comment|
+----------------------------+--------------------------------------------------------------------------+-------+
|START_DATE |date | |
|STORE_ID |decimal(38,18) | |
|INVENTORY_ID |int | |
|QTY_SOLD |smallint | |
| | | |
|# Partitioning | | |
|Part 0 |START_DATE | |
| | | |
|# Detailed Table Information| | |
|Name |SPARKTEST.INTERVAL_PAR_DEMO | |
|Table Properties |[connURL=jdbc:oracle:thin:@mammoth_medium,isExternal=false,userName=tpcds]| |
+----------------------------+--------------------------------------------------------------------------+-------+
We don't support Partition Management from Spark, but users can issue show partitions on Oracle tables. Oracle has many partitioning schemes such as RANGE, LIST, INTERVAL, and HASH. These don't map very well to Spark's concept of a partition.
- A Range partition can contain a range of partition column values.
- A List partition can contain an explicit set of column values.
- An Interval partition can contain any values in an interval of the data type's domain.
- A Hash partition contains all partition column values that hash to the same value.
- Further, a partition value can comprise multiple columns, as in PARTITION BY LIST (state, channel).
For these reasons, a show partitions Spark command run on an Oracle table typically doesn't show the partition specification very well, and users shouldn't interpret the output like a regular Spark table's partition specification.
For example, running Spark's show partitions on an Oracle table with the partition scheme PARTITION BY RANGE (TSTAMP) SUBPARTITION BY LIST (deptno), we get:
+-----------------------------------+
|partition |
+-----------------------------------+
|TSTAMP=null/DEPTNO=%271%27 |
|TSTAMP=null/DEPTNO=%271%27 |
...
We are not converting the TSTAMP value in Oracle's metadata into a Spark literal. Further, we cannot show (or let Spark know about) the range semantics of the first level of partitioning.
In the case of 'TPCDS.STORE_SALES', which is partitioned as PARTITION BY RANGE (SS_SOLD_DATE_SK), the output is:
+------------------------------------------+
|partition |
+------------------------------------------+
|SS_SOLD_DATE_SK=2450816.000000000000000000|
|SS_SOLD_DATE_SK=2450846.000000000000000000|
|SS_SOLD_DATE_SK=2450874.000000000000000000|
...
This is better, but we still cannot show that the partitioning has range semantics.
For these reasons we provide a show oracle partitions <table_id> extension command.
So running show oracle partitions on the first table above outputs:
+-------------------------------------------------------------------------------------------------------------------+
|partition |
+-------------------------------------------------------------------------------------------------------------------+
|Structure: Partition(kind=RANGE, cols=[TSTAMP]), Sub-Partition(kind=LIST, cols=[DEPTNO]) |
|0. Partition(name=P01, values=TIMESTAMP' 2010-01-01 00:00:00'), 0. Sub-Partition(name=P01_SP1, values='1') |
|0. Partition(name=P01, values=TIMESTAMP' 2010-01-01 00:00:00'), 1. Sub-Partition(name=P01_SP2, values='2') |
|0. Partition(name=P01, values=TIMESTAMP' 2010-01-01 00:00:00'), 2. Sub-Partition(name=P01_SP3, values='3') |
|0. Partition(name=P01, values=TIMESTAMP' 2010-01-01 00:00:00'), 3. Sub-Partition(name=P01_SP4, values='4') |
|1. Partition(name=P02, values=TIMESTAMP' 2010-02-01 00:00:00'), 0. Sub-Partition(name=P02_SP1, values='1') |
...
This clearly shows the partitioning scheme and the partition names and values.
Running show oracle partitions on 'TPCDS.STORE_SALES' outputs:
+--------------------------------------------------------+
|partition |
+--------------------------------------------------------+
|Structure: Partition(kind=RANGE, cols=[SS_SOLD_DATE_SK])|
|0. Partition(name=P0, values=2450816) |
|1. Partition(name=P1176, values=2450846) |
|2. Partition(name=P1177, values=2450874) |
...
Oracle metadata is cached in the Spark driver to minimize metadata access overhead. If the spark.sql.catalog.oracle.metadata_cache_loc parameter is set, cached metadata is reused across Spark restarts; it can be set to any locally accessible folder on the Spark driver machine. If this parameter is not set, metadata is cached in a new temporary folder every time the Spark cluster starts.
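For example, to persist the cache, add the following property to the catalog configuration shown at the top of this page (the path is a placeholder):
spark.sql.catalog.oracle.metadata_cache_loc=/path/on/driver/oracle_metadata_cache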