emrfs-vs-alluxio-benchmark

Uncompromised benchmark to understand the trade-off between EMRFS and Alluxio for Apache Spark applications with Amazon S3 persistence.

Environment

EMR release: emr-5.23.0
EBR root volume size: 10 GB
Region: us-east-1
Master Node: m4.2xlarge + 2 x EBS 32 GB GP2
CORE Nodes: 10 x m4.4xlarge + 4 x EBS 32 GB GP2
Apache Spark 2.4.0
Python 3.6
Alluxio 2.0.0-preview - ASYNC_THROUGH/MUST_CACHE write type - CACHE_PROMOTE read type

Setup

For both scenarios:

launch EMR cluster with the related bootstrap and config files.
Create and attach a EMR notebook in the cluster

For Alluxio, after launch the cluster, you need to give access for the Notebook user:

sudo su alluxio
/opt/alluxio/bin/alluxio fs mkdir /test
/opt/alluxio/bin/alluxio fs chown livy:livy /test

Workload

Input: 1 x Text file separated by pipes (|) generated by TPC-H on S3 - 74 GB - 600 M records - 16 columns - No compression
Output: 1000 x Parquet files on S3 - 27 GB - SNAPPY compression

To generate a similar dataset you can follow these instruction on an EC2 with Amazon AMI

sudo yum install gcc make git -y
cd $HOME
mkdir data
git clone https://github.com/gregrahn/tpch-kit.git
cd tpch-kit/dbgen/
make OS=LINUX
export DSS_PATH=$HOME/data
./dbgen -v -T o -s 100

Benchmark

The benchmark routine measures the time elapsed to write the output (overwrite mode) and to read that.

It runs each routine 10x and extract tha minimum, maximum and average values.

P.S. The first read (From text file) and the first write (without overwrite) are ignored.

P.P.S The target Dataframe was cached to isolate the I/O performance.

Result

Alluxio - ASYNC_THROUGH

Operation	Metric	Time (sec)
write	min	80.2
write	avg	82.8
write	max	86.7
read	min	81.4
read	avg	86.0
read	max	91.6

Alluxio - MUST_CACHE

Operation	Metric	Time (sec)
write	min	79.3
write	avg	84.2
write	max	91.9
read	min	82.4
read	avg	91.0
read	max	95.4

Alluxio - MUST_CACHE - METADATA load disabled

Operation	Metric	Time (sec)
write	min	44.6
write	avg	45.6
write	max	47.8
read	min	80.3
read	avg	83.7
read	max	93.1

EMRFS

Operation	Metric	Time (sec)
write	min	62.2
write	avg	67.4
write	max	77.7
read	min	94.3
read	avg	97.7
read	max	102.7

Conclusion

EMRFS writes faster than Alluxio
Alluxio reads faster than EMRFS
EMRFS provides the data on S3 immediately, Alluxio does not.
Alluxio uses less CPU (Check Ganglia PDF files)
EMRFS uses less memory (Check Ganglia PDF files)

License

These codes/files are licensed under the MIT License.

Name		Name	Last commit message	Last commit date
Latest commit History 9 Commits
alluxio-async-through		alluxio-async-through
alluxio-must-cache-metadata-disabled		alluxio-must-cache-metadata-disabled
alluxio-must-cache		alluxio-must-cache
emrfs		emrfs
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

emrfs-vs-alluxio-benchmark

Environment

Setup

Workload

Benchmark

Result

Conclusion

License

About

Releases

Packages

Languages

License

StevenMih/emrfs-vs-alluxio-benchmark

Folders and files

Latest commit

History

Repository files navigation

emrfs-vs-alluxio-benchmark

Environment

Setup

Workload

Benchmark

Result

Conclusion

License

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages