Quick Start Guide
- You have a Spark 3.1 environment.
- Spark Oracle can be run on any type of cluster (embedded, spark-shell, thriftserver; standalone, yarn, kubernetes, mesos...)
- You have access to an Oracle Database. If not, see the section below on setting up an Oracle Free Tier ADW instance.
- Build the spark-oracle package.
- The Oracle user used for the connection must have `plan_table` access (a quick check is sketched after this list).
  - In some cases we use the `DBMS_PARALLEL_EXECUTE` procedure to plan the splitting of pushdown queries.
- For ADW instances:
  - Store the Oracle wallet for the instance on each Spark node.
  - To use external tables, ensure an OCI credential is set up in the database.
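
A quick way to confirm the connection user has `plan_table` access is to run an explain plan. This is a minimal sketch: the `tpcds` user name and the explicit grant are illustrative, and `DBMS_PARALLEL_EXECUTE` is often already executable by every user.

```sql
-- Run as the connection user in sqlplus: if this works, plan_table access is in place.
explain plan for select 1 from dual;
select * from table(dbms_xplan.display);

-- If execute on DBMS_PARALLEL_EXECUTE is not already granted (it usually is),
-- a DBA can grant it explicitly:
-- grant execute on sys.dbms_parallel_execute to tpcds;
```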
Setting up an Oracle Free Tier ADW instance

- Create an Oracle Free Tier Account, if you don't already have one.
- Instantiate an Oracle ADW instance.
- Download and set up the Oracle Instant Client and Instant Client Tools packages. You will use the `sqlplus` and `impdp` executables.
- Download and set up the Oracle wallet for the instance.
- Connect using the `sqlplus` Instant Client and create a user for running the demo. For example:
```sql
create user tpcds identified by <your_chosen_password_here>;
grant CONNECT, RESOURCE, unlimited tablespace to tpcds;

-- To run unit tests, add the additional grants:
grant drop any table to tpcds;
grant create any function to tpcds;
grant create any procedure to tpcds;
```
- Upload your cloud credentials to the instance using the `DBMS_CLOUD.CREATE_CREDENTIAL` procedure.
  - See the Oracle documentation for `DBMS_CLOUD.CREATE_CREDENTIAL` for details.
  - For example:
```sql
begin
  DBMS_CLOUD.CREATE_CREDENTIAL(
    credential_name => 'SPARK_ORA_CRED',
    username        => '<your_oci_username>',
    password        => '<your_oci_password>'
  );
end;
/
```
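
To confirm the credential was created, you can query the `USER_CREDENTIALS` view (available in Autonomous Database); the credential name matches the example above.

```sql
-- SPARK_ORA_CRED should show up in the list.
select credential_name, username from user_credentials;
```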
- Load the tpcds tables and data into your instance.
  - A dump of tpcds scale1 is available in this OCI location.
  - You can copy this to your own bucket.
  - Import the data by running:
```sh
# ensure TNS_ADMIN points to your wallet.
./impdp admin/<your_passwd>@<your_tns_address> directory=data_pump_dir credential=SPARK_ORA_CRED dumpfile=<your_dump_file_location>
```
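
A quick sanity check after the import finishes, run as the `tpcds` user; `store_sales` is one of the standard TPC-DS tables and is assumed to be in the dump.

```sql
-- All tpcds tables should be listed, and the fact tables should contain rows.
select table_name, num_rows from user_tables order by table_name;
select count(*) from store_sales;
```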
The Spark Oracle zip file contains the spark-oracle and Oracle JDBC jars. This zip file can be unzipped into any Spark deployment. To enable the Spark Oracle catalog integration and pushdown functionality, add the following to your Spark configuration:
```properties
# Oracle Catalog
# Enable Spark Oracle extensions
spark.sql.extensions=org.apache.spark.sql.oracle.SparkSessionExtensions
spark.kryo.registrator=org.apache.spark.sql.connector.catalog.oracle.OraKryoRegistrator

# Enable the Oracle Catalog integration
spark.sql.catalog.oracle=org.apache.spark.sql.connector.catalog.oracle.OracleCatalog

# Oracle SQL logging and JDBC fetch size
spark.sql.catalog.oracle.log_and_time_sql.enabled=true
spark.sql.catalog.oracle.log_and_time_sql.log_level=info
spark.sql.catalog.oracle.fetchSize=5000
spark.sql.oracle.enable.querysplitting=false
spark.sql.oracle.querysplit.target=1Mb

# Configure JDBC connection information
# Example for a non-wallet instance
#spark.sql.catalog.oracle.url=jdbc:oracle:thin:@<your_instance_details>
#spark.sql.catalog.oracle.user=tpcds
#spark.sql.catalog.oracle.password=tpcds

# Example config for an ADW instance
#spark.sql.catalog.oracle.authMethod=ORACLE_WALLET
#spark.sql.catalog.oracle.url=jdbc:oracle:thin:@<tns-address>
#spark.sql.catalog.oracle.user=<your_demo_user>
#spark.sql.catalog.oracle.password=<your_passwd>
#spark.sql.catalog.oracle.net.tns_admin=<wallet_location>
```
See the demo script for a session of catalog browsing and execution of tpcds queries.
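
For instance, here is a minimal sketch of browsing the catalog and running a pushdown query, executed via `spark-sql` or `spark.sql(...)` in `spark-shell`. The `tpcds` schema and the TPC-DS table and column names are the ones assumed by this guide and may differ in your database.

```sql
-- Browse the Oracle catalog registered above.
show namespaces in oracle;
show tables in oracle.tpcds;

-- A simple join that Spark Oracle can push down to the database.
select d.d_year, count(*) as cnt
from oracle.tpcds.store_sales ss
join oracle.tpcds.date_dim d
  on ss.ss_sold_date_sk = d.d_date_sk
group by d.d_year
order by d.d_year;
```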
To build the spark-oracle package:

- We use sbt.
- In the root folder of the project, run `sbt clean compile; sbt -DaddOraJars=true universal:packageBin`. This creates `spark-oracle-0.1.0-SNAPSHOT.zip` in the `packaging/spark_extend/target/universal` folder (see the sketch after this list for deploying it).
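
A sketch of deploying the packaged zip and starting a session with it, assuming the zip unpacks its jars under a `lib/` directory (the usual sbt-native-packager layout) and that your catalog settings live in a `spark-oracle.conf` properties file like the one shown above; the paths are placeholders.

```sh
# Unpack the build output next to an existing Spark installation.
unzip packaging/spark_extend/target/universal/spark-oracle-0.1.0-SNAPSHOT.zip -d /opt/spark-oracle

# Start spark-shell with the Spark Oracle configuration and jars on the classpath.
$SPARK_HOME/bin/spark-shell \
  --properties-file spark-oracle.conf \
  --jars $(echo /opt/spark-oracle/*/lib/*.jar | tr ' ' ',')
```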