PXF is an extensible framework that allows a distributed database such as Greenplum or Cloudberry Database to query external data files whose metadata is not managed by the database. PXF includes built-in connectors for accessing data that exists inside HDFS files, Hive tables, HBase tables, JDBC-accessible databases, and more. Users can also create their own connectors to other data storage or processing engines.
This project is forked from greenplum/pxf and customized for Cloudberry Database.
external-table/
: Contains the CloudberryDB extension implementing an External Table protocol handler

fdw/
: Contains the CloudberryDB extension implementing a Foreign Data Wrapper (FDW) for PXF

server/
: Contains the server-side code of PXF along with the PXF Service and all the Plugins

cli/
: Contains the command line interface code for PXF

automation/
: Contains the automation and integration tests for PXF against the various datasources

singlecluster/
: Hadoop testing environment to exercise the PXF automation tests

regression/
: Contains the end-to-end (integration) tests for PXF against the various datasources, utilizing the PostgreSQL testing framework pg_regress

downloads/
: An empty directory that serves as a staging location for CloudberryDB RPMs for the development Docker image
Below are the steps to build and install PXF along with its dependencies, including CloudberryDB and Hadoop.
Note
To start, ensure you have a ~/workspace directory and have cloned the pxf repository and its prerequisites (shown below) under it. (The name workspace is not strictly required but will be used throughout this guide.)
mkdir -p ~/workspace
cd ~/workspace
git clone https://github.com/cloudberrydb/pxf.git
To build PXF, you must have:

- The GCC compiler, the make build system, the unzip package, and maven (for running integration tests)

- An installed Cloudberry Database

  Either download and install the CloudberryDB RPM or build CloudberryDB from source by following the instructions in the CloudberryDB repository. Assuming you have installed CloudberryDB into the /usr/local/cloudberrydb directory, run its environment script:

  source /usr/local/cloudberrydb/greenplum_path.sh

- JDK 1.8 or JDK 11 to compile and run PXF

  Export your JAVA_HOME:

  export JAVA_HOME=<PATH_TO_YOUR_JAVA_HOME>

- Go (1.9 or later)

  To install Go on CentOS, run sudo yum install go. For other platforms, see the Go downloads page. Make sure to export your GOPATH and add go to your PATH. For example:

  export GOPATH=$HOME/go
  export PATH=$PATH:/usr/local/go/bin:$GOPATH/bin

  Once you have installed Go, you will need the ginkgo tool, which runs Go tests. Assuming go is on your PATH, you can run:

  go install github.com/onsi/ginkgo/ginkgo@latest

- cURL (7.29 or later)

  To install the cURL devel package on CentOS 7, run sudo yum install libcurl-devel. Note that CentOS 6 provides an older, unsupported version of cURL (7.19); if you are on CentOS 6, you should install a newer version from source.
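Before building, it can save time to confirm all of these tools are visible on your PATH. The helper below is a hypothetical sketch, not part of the PXF repo:

```shell
# Hypothetical helper (not part of the PXF repo): report which build
# prerequisites are visible on PATH.
check_tool() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "found: $1"
  else
    echo "MISSING: $1"
  fi
}

for tool in gcc make unzip mvn java go curl; do
  check_tool "$tool"
done
```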
PXF uses Makefiles to build its components. The PXF server component uses Gradle, which is wrapped in the Makefile for convenience.
cd ~/workspace/pxf
# Compile & Test PXF
make
# Only run unit tests
make test
To install PXF, first make sure that the user has sufficient permissions in the $GPHOME and $PXF_HOME directories to perform the installation. It is recommended to change ownership to match the installing user. For example, when installing PXF as user gpadmin under /usr/local/cloudberrydb:
export GPHOME=/usr/local/cloudberrydb
export PXF_HOME=/usr/local/pxf
export PXF_BASE=${HOME}/pxf-base
chown -R gpadmin:gpadmin "${GPHOME}" "${PXF_HOME}"
make -C ~/workspace/pxf install
NOTE: if PXF_BASE is not set, it defaults to PXF_HOME; in that case, server configurations, libraries, and other files may get deleted after a PXF re-install.
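The fallback described above can be sketched in shell terms. This is illustrative only; the real resolution happens inside the pxf CLI scripts:

```shell
# Illustrative sketch only (the actual logic lives in the pxf CLI scripts):
# when PXF_BASE is unset, it effectively falls back to PXF_HOME.
PXF_HOME=/usr/local/pxf
unset PXF_BASE
PXF_BASE="${PXF_BASE:-$PXF_HOME}"
echo "$PXF_BASE"   # -> /usr/local/pxf
```

Setting PXF_BASE to a directory outside PXF_HOME (as in the example above) keeps your server configurations safe across re-installs.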
Ensure that PXF is in your path. The following command can be added to your .bashrc:
export PATH=/usr/local/pxf/bin:$PATH
Then you can prepare and start up PXF by doing the following.
pxf prepare
pxf start
If ${HOME}/pxf-base does not exist, pxf prepare will create the directory for you. This command should only need to be run once.
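To confirm the service came up, you can run pxf status, or probe it over HTTP. The port 5888 and the actuator health path below are assumptions based on PXF's default Spring Boot setup; adjust them if your build differs:

```shell
# Probe the PXF service (assumes the default port 5888 and the Spring Boot
# actuator health endpoint; prints a note instead of failing when unreachable)
curl -s http://localhost:5888/actuator/health || echo "PXF not reachable on port 5888"
```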
Note: Local development with PXF requires a running CloudberryDB cluster.
Once the desired changes have been made, there are 2 options to re-install PXF:
- Run
make -sj4 install
to re-install and run tests - Run
make -sj4 install-server
to only re-install the PXF server without running unit tests.
After PXF has been re-installed, you can restart the PXF instance using:
pxf restart
To demonstrate end-to-end functionality, you will need Hadoop installed. All the related Hadoop components (HDFS, Hive, HBase, ZooKeeper, etc.) are bundled into a single artifact named singlecluster.
You can download from here and untar the singlecluster-HDP.tar.gz file, which contains everything needed to run Hadoop.
mv singlecluster-HDP.tar.gz ~/workspace/
cd ~/workspace
tar xzf singlecluster-HDP.tar.gz
Create a symlink using ln -s ~/workspace/singlecluster-HDP ~/workspace/singlecluster and then follow the steps in Setup Hadoop.
While PXF can run on either Java 8 or Java 11, please ensure that you are running Java 8 for HDFS, Hadoop, and related components. Set your Java version by setting your JAVA_HOME to the appropriate location. On a Mac, you can set it like so:
export JAVA_HOME=`/usr/libexec/java_home -v 1.8`
Initialize the default server configurations:
cp ${PXF_HOME}/templates/*-site.xml ${PXF_BASE}/servers/default
Note
Since the Docker container will house the entire single-cluster Hadoop environment, CloudberryDB, and PXF, we recommend that you have at least 4 CPUs and 6GB of memory allocated to Docker. These settings are available under Docker preferences.
The quick and easy way is to download the CloudberryDB RPM from GitHub and move it into the /downloads folder. Then run ./dev/start.bash to get a Docker image with a running CloudberryDB cluster, a Hadoop cluster, and an installed PXF.
Configure, build, and install CloudberryDB. This is needed only the first time you use the container with the CloudberryDB source.
~/workspace/pxf/dev/build_gpdb.bash
sudo mkdir /usr/local/cloudberry-db-devel
sudo chown gpadmin:gpadmin /usr/local/cloudberry-db-devel
~/workspace/pxf/dev/install_gpdb.bash
For subsequent minor changes to the CloudberryDB source, you can simply run:
~/workspace/pxf/dev/install_gpdb.bash
To run all the instructions below, plus the GROUP=smoke tests, in one script:
~/workspace/pxf/dev/smoke_shortcut.sh
Create CloudberryDB Cluster
source /usr/local/cloudberry-db-devel/greenplum_path.sh
make -C ~/workspace/cbdb create-demo-cluster
source ~/workspace/cbdb/gpAux/gpdemo/gpdemo-env.sh
HDFS is needed to demonstrate functionality. You can choose to start additional Hadoop components (Hive/HBase) if you need them.
Set up User Impersonation prior to starting the Hadoop components (this allows the gpadmin user to access Hadoop data).
~/workspace/pxf/dev/configure_singlecluster.bash
Set up and start HDFS
pushd ~/workspace/singlecluster/bin
echo y | ./init-gphd.sh
./start-hdfs.sh
popd
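Once HDFS is up, a quick smoke test is to create and list a directory. The binary location below is an assumption about the singlecluster layout; fall back to whatever hdfs is on your PATH:

```shell
# Smoke-test HDFS (the binary path is an illustrative assumption about the
# singlecluster layout; adjust to wherever hdfs lives for you)
HDFS_BIN=~/workspace/singlecluster/hadoop/bin/hdfs
[ -x "$HDFS_BIN" ] || HDFS_BIN=hdfs
"$HDFS_BIN" dfs -mkdir -p /tmp/pxf_smoke
"$HDFS_BIN" dfs -ls /tmp
```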
Start other optional components based on your needs
pushd ~/workspace/singlecluster/bin
# Start Hive
./start-yarn.sh
./start-hive.sh
# Start HBase
./start-zookeeper.sh
./start-hbase.sh
popd
Minio is an S3-API compatible local storage solution. The development docker image comes with Minio software pre-installed. To start the Minio server, run the following script:
source ~/workspace/pxf/dev/start_minio.bash
After the server starts, you can access the Minio UI at http://localhost:9000 from the host OS. Use admin for the access key and password for the secret key when connecting to your local Minio instance.
The script also sets PROTOCOL=minio so that the automation framework uses the local Minio server when running S3 automation tests. If you later want to run Hadoop HDFS tests instead, unset this variable with the unset PROTOCOL command.
Install PXF Server
# Install PXF
make -C ~/workspace/pxf install
# Start PXF
export PXF_JVM_OPTS="-Xmx512m -Xms256m"
$PXF_HOME/bin/pxf start
Install PXF client (ignore if this is already done)
psql -d template1 -c "create extension pxf"
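With the extension created and HDFS running, you can exercise PXF end to end with a small external table. The file path and table name below are made up for illustration; hdfs:text is the PXF profile for plain text files on HDFS:

```shell
# Put a small CSV on HDFS, then query it through PXF
# (table name and path are illustrative; assumes hdfs and psql are on PATH)
printf '1,alpha\n2,beta\n' | hdfs dfs -put - /tmp/pxf_demo.csv

psql -d template1 <<'SQL'
CREATE EXTERNAL TABLE pxf_demo (id int, name text)
LOCATION ('pxf://tmp/pxf_demo.csv?PROFILE=hdfs:text')
FORMAT 'TEXT' (DELIMITER ',');
SELECT * FROM pxf_demo ORDER BY id;
SQL
```

If the SELECT returns both rows, the database, PXF server, and HDFS are all wired together correctly.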
All tests use a database named pxfautomation.
pushd ~/workspace/pxf/automation
# Initialize default server configs using template
cp ${PXF_HOME}/templates/{hdfs,mapred,yarn,core,hbase,hive}-site.xml ${PXF_BASE}/servers/default
# Run specific tests. Example: Hdfs Smoke Test
make TEST=HdfsSmokeTest
# Run all tests. This will be very time consuming.
make GROUP=gpdb
# To run test(s) against a different storage protocol, set the following variable (e.g. s3)
export PROTOCOL=s3
popd
If you see any HBase failures, try copying pxf-hbase-*.jar to the HBase classpath and restarting HBase:
cp ${PXF_HOME}/lib/pxf-hbase-*.jar ~/workspace/singlecluster/hbase/lib/pxf-hbase.jar
~/workspace/singlecluster/bin/stop-hbase.sh
~/workspace/singlecluster/bin/start-hbase.sh
To deploy your changes to PXF in the development environment:
# $PXF_HOME folder is replaced each time you make install.
# So, if you have any config changes, you may want to back those up.
$PXF_HOME/bin/pxf stop
make -C ~/workspace/pxf install
# Make any config changes you had backed up previously
rm -rf $PXF_HOME/pxf-service
yes | $PXF_HOME/bin/pxf init
$PXF_HOME/bin/pxf start
- Start IntelliJ. Click "Open" and select the directory to which you cloned the pxf repo.
- Select File > Project Structure.
- Make sure you have a JDK (version 1.8) selected.
- In the Project Settings > Modules section, select Import Module, pick the pxf/server directory, and import it as a Gradle module. You may see an error saying that there's no JDK set for Gradle. Just cancel and retry; it goes away the second time.
- Import a second module, giving the pxf/automation directory; select "Import module from external model", pick Maven, then click Finish.
- Restart IntelliJ.
- Check that it worked by running a unit test (automation tests cannot currently be run from IntelliJ) and making sure that imports, variables, and auto-completion function in the two modules.
- Optionally, you can replace ${PXF_TMP_DIR} with ${GPHOME}/pxf/tmp in automation/pom.xml.
- Select Tools > Create Command-line Launcher... to enable starting IntelliJ with the idea command, e.g. cd ~/workspace/pxf && idea .
- In IntelliJ, click Edit Configuration and add a new configuration of type Remote.
- Change the name to PXF Service Boot.
- Change the port number to 2020.
- Save the configuration.
- Restart PXF in DEBUG mode: PXF_DEBUG=true pxf restart
- Debug the new configuration in IntelliJ.
- Run a query in CloudberryDB that uses PXF to debug with IntelliJ.
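Under the hood, debug mode makes the PXF JVM listen for a debugger on port 2020, which is why the Remote configuration above uses that port. The line below is only a sketch of the kind of JDWP agent option involved; the exact flags are managed for you by the pxf scripts when PXF_DEBUG=true:

```shell
# Sketch only: the kind of JVM option that enables remote debugging on
# port 2020 (PXF_DEBUG=true arranges the equivalent via the pxf scripts)
export PXF_JVM_OPTS="${PXF_JVM_OPTS} -agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=2020"
```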
- Download bin_gpdb (from any of the pipelines)
- Download pxf_tarball (from any of the pipelines)
These instructions allow you to run a Kerberized cluster:
docker run --rm -it \
--privileged \
--hostname c6401.ambari.apache.org \
-p 5432:5432 \
-p 5888:5888 \
-p 8000:8000 \
-p 8080:8080 \
-p 8020:8020 \
-p 9000:9000 \
-p 9090:9090 \
-p 50070:50070 \
-w /home/gpadmin/workspace \
-v ~/workspace/cbdb:/home/gpadmin/workspace/gpdb_src \
-v ~/workspace/pxf:/home/gpadmin/workspace/pxf_src \
-v ~/workspace/singlecluster-HDP:/home/gpadmin/workspace/singlecluster \
-v ~/Downloads/bin_cbdb:/home/gpadmin/workspace/bin_cbdb \
-v ~/Downloads/pxf_tarball:/home/gpadmin/workspace/pxf_tarball \
-e CLUSTER_NAME=hdp \
-e NODE=c6401.ambari.apache.org \
-e REALM=AMBARI.APACHE.ORG \
gcr.io/$PROJECT_ID/gpdb-pxf-dev/gpdb6-centos7-test-pxf-hdp2 /bin/bash
# Inside the container run the following command:
pxf_src/concourse/scripts/test_pxf_secure.bash
echo "+----------------------------------------------+"
echo "| Kerberos admin principal: admin/admin@$REALM |"
echo "| Kerberos admin password : admin |"
echo "+----------------------------------------------+"
su - gpadmin
See the CONTRIBUTING file to learn how to contribute to PXF for Cloudberry Database.
Licensed under the Apache License, Version 2.0. See the LICENSE file for details.