A cross-platform Pandas-like DataFrame base on Pandas, Spark, and Dask.
View Demo
·
Report Bug
·
Request Feature
Table of Contents
Octopus-DF, which integrates Pandas, Dask, and Spark as the backend computing platforms and exposes the most widely used Pandas-style APIs to users.
Why Octopus-DF?
- Efficient for different data scales. A DataFrame-based algorithm has quite different performances over various platforms with various data scales. It is not efficient to process the DataFrame of different data scales on a single platform. Octopus-DF integrates Pandas, Dask, and Spark, which make it efficient for different data scales.
- Ease to use. Octopus-DF provides Pandas-style API for data analysts to model and develops their problems and programs.
Octopus-DF
You should install python3.6+ environment first.
- Clone the repo
git clone https://github.com/PasaLab/Octopus-DF.git
- Install all dependencies.
cd Octopus-DF pip install –r requirements.txt
- Generate the target package.
python setup.py sdist
- Install the package
pip install Octopus-DataFrame
Octopus-DF is built on Pandas, Spark, and Dask. You need to deploy them in your distributed environment first. For spark, we use Spark on Yarn mode. For Dask, we use dask distributed. Please first check the official documentation to complete the installation and deployment.
To optimize the secondary index, you should install Redis first. The suggested way of installing Redis is compiling it from sources as Redis has no dependencies other than a working GCC compiler and libc. Please check the redis official documentation to complete the installation and deployment.
To optimize the local index, you should install plasma store first. In stand-alone mode, install pyarrow0.11.1
on the local machine (pip install pyarrow==0.11.1
), and use plasma_store –m 1000000000 –s /tmp/plasma &
to open up memory space to store memory objects. The above command opens up 1g of memory space. For details, please refer to official plasma documentation. In cluster mode, install on each machine and start the plasma store process.
When you install and deploy the above dependencies, you need to config Octopus-DF by editing the $HOME/.config/octopus/config.ini
file.
The configuration example is as follows:
[installation]
HADOOP_CONF_DIR = $HADOOP_HOME/etc/hadoop
SPARK_HOME = $SPARK_HOME
PYSPARK_PYTHON = ./ANACONDA/pythonenv/bin/python # python environment
SPARK_YARN_DIST_ARCHIVES = hdfs://host:port/path/to/pythonenv.zip#ANACONDA
SPARK_YARN_QUEUE = root.users.xquant
SPARK_EXECUTOR_CORES = 4
SPARK_EXECUTOR_MEMORY =10g
SPARK_EXECUTOR_INSTANCES = 6
SPARK_MEMORY_STORAGE_FRACTION = 0.2
SPARK_YARN_JARS = hdfs://host:port/path/to/spark/jars # jars for spark runtime, which is equivalent to spark.yarn.jars in spark configuration
SPARK_DEBUG_MAX_TO_STRING_FIELDS = 1000
SPARK_YARN_EXECUTOR_MEMORYOVERHEAD = 10g
For imperative interfaces such as SparkDataFrame and OctDataFrame, using Octopus-DF is like using pandas.
from Octopus.dataframe.core.sparkDataFrame import SparkDataFrame
odf = SparkDataFrame.from_csv("file_path")
print(odf.head(100))
print(odf.loc[0:10:2,:])
print(odf.filter(like='1'))
from Octopus.dataframe.core.octDataFrame import OctDataFrame
# engine_type can be dask, pandas.
odf = OctDataFrame.from_csv("file_path",engine_type="spark")
print(odf.head(100))
print(odf.loc[0:10:2,:])
print(odf.filter(like='1'))
For declarative interfaces such as SymbolDataframe, using Octopus-DF is like using spark. Due to its lazy computation mechanism, you should call compute()
to do the calculation.
from Octopus.dataframe.core.symbolDataFrame import SymbolDataFrame
# engine_type can be dask, pandas.
# If not declared, Octopus-DF will automatically select the best platform.
# Note that this function is experimental, need to be improved.
odf = SymbolDataFrame.from_csv(filie_path,engine_type='spark')
odf1 = odf.iloc[0:int(0.8*M):2,:]
odf2 = odf1.iloc[0:int(0.2*M):1,:]
odf2.compute()
# We can show the exectution plan and scheduler's execution time by show_execution_plan()
odf2.show_execution_plan()
See the open issues for a list of proposed features (and known issues).
Any contributions you make are greatly appreciated.
- Fork the Project
- Create your Feature Branch (
git checkout -b feature/AmazingFeature
) - Commit your Changes (
git commit -m 'Add some AmazingFeature'
) - Push to the Branch (
git push origin feature/AmazingFeature
) - Open a Pull Request
Distributed under the MIT License. See LICENSE
for more information.
Gu Rong - gurong@nju.edu.cn