This repository has been archived by the owner on Jan 3, 2023. It is now read-only.

[SQL-DS-CACHE-36][POAE7-898]HCFS docs for OAP 1.1 #37

Merged
merged 4 commits into from
Mar 19, 2021
66 changes: 66 additions & 0 deletions docs/HCFS-User-Guide.md
@@ -0,0 +1,66 @@
# HCFS User Guide

* [Prerequisites](#prerequisites)
* [Configurations](#configurations)

## Prerequisites

The HCFS-based Data Source Cache on Spark 3.0.0 requires a working Hadoop cluster with YARN and Spark. Running Spark on YARN requires a binary distribution of Spark that is built with YARN support. The HCFS-based Data Source Cache also requires Plasma and Redis; follow the [OAP-Installation-Guide](OAP-Installation-Guide.md) to install them.

## Configurations

### Spark Configurations

Before you run `$SPARK_HOME/bin/spark-shell`, configure Spark for the integration by adding or updating the following settings in the Spark configuration file `$SPARK_HOME/conf/spark-defaults.conf` on your working node.

```bash
spark.hadoop.fs.cachedFs.impl com.intel.oap.fs.hadoop.cachedfs.CachedFileSystem
# absolute path of the jar on your working node
spark.files /path/to/hcfs-sql-ds-cache-<version>.jar
# path relative to the executor's current dir, where spark.files places the jar; specify only the jar name
spark.executor.extraClassPath ./hcfs-sql-ds-cache-<version>.jar
# absolute path of the jar on your working node
spark.driver.extraClassPath /path/to/hcfs-sql-ds-cache-<version>.jar
```
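
The relationship between the three jar settings can be sketched in a few lines of Python. This is only an illustration, not part of OAP: the helper names `cached_fs_spark_conf` and `render`, and the directory `/opt/oap/jars`, are hypothetical. It assumes the jar shipped through `spark.files` lands in each executor's working directory, so the executor class path only needs the file name.

```python
import posixpath


def cached_fs_spark_conf(jar_path: str) -> dict:
    """Build the spark-defaults.conf entries for a jar at an absolute path.

    Hypothetical helper: it only mirrors the settings listed above.
    """
    if not posixpath.isabs(jar_path):
        raise ValueError("jar_path must be an absolute path on the working node")
    jar_name = posixpath.basename(jar_path)
    return {
        "spark.hadoop.fs.cachedFs.impl":
            "com.intel.oap.fs.hadoop.cachedfs.CachedFileSystem",
        # Absolute path: Spark ships this file to every executor.
        "spark.files": jar_path,
        # Relative path: same file name, resolved in the executor's current dir.
        "spark.executor.extraClassPath": "./" + jar_name,
        # Absolute path: the driver reads the jar from the working node.
        "spark.driver.extraClassPath": jar_path,
    }


def render(conf: dict) -> str:
    """Render the entries as 'key value' lines for spark-defaults.conf."""
    return "\n".join(f"{key} {value}" for key, value in conf.items())


# /opt/oap/jars is a made-up location; <version> is left as a placeholder.
print(render(cached_fs_spark_conf("/opt/oap/jars/hcfs-sql-ds-cache-<version>.jar")))
```

Deriving the executor entry from the same path keeps the jar name in `spark.executor.extraClassPath` in step with the one in `spark.files`.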

### Redis Configuration

Add the following configuration to `$SPARK_HOME/conf/spark-defaults.conf`.

```
spark.hadoop.fs.cachedFs.redis.host $HOST
spark.hadoop.fs.cachedFs.redis.port $PORT
```
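
Before starting Spark, it can help to confirm that the configured Redis host and port accept connections. The following is a minimal sketch using only the Python standard library; the function name `redis_reachable` is hypothetical, and the check only opens a TCP connection, so it does not prove that the listener is actually Redis.

```python
import socket


def redis_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


# Example call, assuming Redis listens on its default port on this machine:
#   redis_reachable("localhost", 6379)
```

Use the same values that you put in place of `$HOST` and `$PORT` above.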

### Configuration for HCFS cache location policy

We provide three HCFS cache location policies; choose the one that best fits your workload.
* default policy
The file block locations consist of cached blocks, plus HDFS blocks if the cached blocks are incomplete.
* cache_over_hdfs
This policy uses cached block locations only if all requested content is cached; otherwise it uses HDFS block locations.
* hdfs_only
This policy ignores cached blocks when finding file block locations.

Add the following configuration to `$SPARK_HOME/conf/spark-defaults.conf`.

```
# one of: default, cache_over_hdfs, hdfs_only
spark.hadoop.fs.cachedFs.blockLocation.policy default
```
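
The difference between the three policies can be sketched as a small decision function. This is a simplified model and not the `CachedFileSystem` implementation: the function name `choose_block_locations`, the host-list representation, and the `fully_cached` flag are all assumptions made for illustration, and the real code works on block ranges rather than whole requests.

```python
from typing import List


def choose_block_locations(policy: str,
                           cached_hosts: List[str],
                           hdfs_hosts: List[str],
                           fully_cached: bool) -> List[str]:
    """Pick the hosts reported as block locations under a given policy.

    cached_hosts: hosts holding cached blocks for the requested range.
    hdfs_hosts:   hosts holding the HDFS blocks for the same range.
    fully_cached: True if the cache covers all requested content.
    """
    if policy == "hdfs_only":
        # Cached blocks are ignored entirely.
        return list(hdfs_hosts)
    if policy == "cache_over_hdfs":
        # All-or-nothing: use the cache only when it covers the whole request.
        return list(cached_hosts) if fully_cached else list(hdfs_hosts)
    if policy == "default":
        if fully_cached:
            return list(cached_hosts)
        # Cache is incomplete: report cached locations plus HDFS locations.
        extra = [host for host in hdfs_hosts if host not in cached_hosts]
        return list(cached_hosts) + extra
    raise ValueError(f"unknown policy: {policy}")


# Partially cached request: the three policies give three different answers.
for name in ("default", "cache_over_hdfs", "hdfs_only"):
    print(name, choose_block_locations(name, ["node1"], ["node2", "node3"], False))
```

In this model, `default` favors cache locality even for partial hits, `cache_over_hdfs` avoids mixing the two sources, and `hdfs_only` schedules as if no cache existed.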

### Configuration for HCFS cache path pattern

We provide HCFS cache path patterns to determine whether a path will be cached.
* whitelist
Paths that match the pattern will be cached. An empty regexp matches everything.
e.g. cachedFs://localhost:9000/dir/
* blacklist
Paths that match the pattern will not be cached. An empty regexp matches nothing, so no path is blacklisted.
e.g. io_data|io_control

Add the following configuration to `$SPARK_HOME/conf/spark-defaults.conf`.

```
spark.hadoop.fs.cachedFs.whiteList.regexp $PATTERN
spark.hadoop.fs.cachedFs.blacklist.regexp $PATTERN
```
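
How the two patterns might interact can be sketched with Python's `re` module. This is an illustration under stated assumptions, not the actual matching code: it assumes the patterns are searched anywhere in the path (not anchored to the whole string) and that the blacklist takes precedence over the whitelist. The function name `should_cache` and the sample paths are hypothetical.

```python
import re


def should_cache(path: str, whitelist: str = "", blacklist: str = "") -> bool:
    """Decide whether a path is cached, given whitelist and blacklist regexps."""
    # An empty blacklist matches nothing, so it is skipped explicitly.
    if blacklist and re.search(blacklist, path):
        return False  # assumption: blacklist wins over whitelist
    # An empty whitelist matches everything: re.search("", path) always succeeds.
    return re.search(whitelist, path) is not None


white = "cachedFs://localhost:9000/dir/"
black = "io_data|io_control"

print(should_cache("cachedFs://localhost:9000/dir/part-0.parquet", white, black))
print(should_cache("cachedFs://localhost:9000/dir/io_data/part-0", white, black))
print(should_cache("cachedFs://localhost:9000/other/part-0.parquet", white, black))
```

Under these assumptions only the first path is cached: the second is excluded by the blacklist and the third does not match the whitelist.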