Add wf #6

Merged
merged 44 commits into from
Jul 6, 2020
6d92e77
Add workflow file
nicolecastillo Jun 22, 2020
2ab8271
Add Dockerfile
nicolecastillo Jun 22, 2020
bd8cf9a
Add scripts
nicolecastillo Jun 22, 2020
ec46aa4
Add scripts
nicolecastillo Jun 23, 2020
c34118b
Update download_dataset.sh
nicolecastillo Jun 23, 2020
4db0a82
Update run_benchmark.sh
nicolecastillo Jun 23, 2020
634eece
Update run_higgs_xlearn.py
nicolecastillo Jun 23, 2020
2ae8ed3
Update run_liblinear.sh
nicolecastillo Jun 23, 2020
288380d
Add files
nicolecastillo Jun 23, 2020
6318d91
Update wf.yml
nicolecastillo Jun 23, 2020
b33a3bf
remove unused
ivotron Jun 23, 2020
6951d32
Add step logic directly to workflow
ivotron Jun 23, 2020
7f621ed
Add step for building image as part of worfklow
ivotron Jun 23, 2020
fe911a2
Install liblinear and csv2libsvm in docker image
ivotron Jun 23, 2020
c23abda
Make scripts executable
ivotron Jun 23, 2020
76ba5bb
Move scripts to parent dir
ivotron Jun 23, 2020
9c2df00
Move Dockerfile to parent folder
ivotron Jun 23, 2020
c92b52a
Make scripts executable
ivotron Jun 23, 2020
0e5ca61
Update path to dockerfile
ivotron Jun 23, 2020
19a0039
Merge pull request #4 from ivotron/wf-tweaks
nicolecastillo Jun 24, 2020
f8c592c
Update Dockerfile
nicolecastillo Jun 24, 2020
6c398a1
Update run_benchmark.sh
nicolecastillo Jun 24, 2020
2bc2c22
Update options in run_higgs_liblinear.py
nicolecastillo Jun 24, 2020
619e871
Update run_higgs_xlearn.py
nicolecastillo Jun 24, 2020
55f1743
Update run_liblinear.sh
nicolecastillo Jun 24, 2020
4ed7824
Update run_xlearn.sh
nicolecastillo Jun 24, 2020
2f0ce61
Add csv2libsvm.py file
nicolecastillo Jun 24, 2020
4d41b10
Update wf.yml
nicolecastillo Jun 24, 2020
33ecdfe
Update csv2libsvm.py
nicolecastillo Jun 24, 2020
fd70eb2
Add README.md file
nicolecastillo Jun 26, 2020
813a5f8
Update README.md
nicolecastillo Jun 26, 2020
d4af7ea
Update README.md
nicolecastillo Jun 26, 2020
28ec617
Update run_benchmark.sh
nicolecastillo Jun 28, 2020
e7220d4
Update show_results.py
nicolecastillo Jun 28, 2020
1e2e33a
fix directories
nicolecastillo Jul 1, 2020
350bd5f
update Dockerfile
nicolecastillo Jul 1, 2020
ad4f7c4
update run_benchmark.sh
nicolecastillo Jul 1, 2020
1d8aa0d
add save figure to show_results.py
nicolecastillo Jul 1, 2020
37a272c
add dir attribute
nicolecastillo Jul 1, 2020
a6490ce
add figures in their respective folder
nicolecastillo Jul 1, 2020
b3021e1
add permissions to created folder
nicolecastillo Jul 1, 2020
317f18b
Update README.md
nicolecastillo Jul 5, 2020
e40a6ad
Update Dockerfile
nicolecastillo Jul 6, 2020
4f21f0e
Add popper folder description
nicolecastillo Jul 6, 2020
6 changes: 5 additions & 1 deletion demo/classification/higgs/README.rst
@@ -5,4 +5,8 @@ You can find the full data from this here (`Link`__)

The data has been produced using Monte Carlo simulations. The first 21 features (columns 2-22) are kinematic properties measured by the particle detectors in the accelerator. The last seven features are functions of the first 21 features; these are high-level features derived by physicists to help discriminate between the two classes. There is an interest in using deep learning methods to obviate the need for physicists to manually develop such features. Benchmark results using Bayesian Decision Trees from a standard physics package and 5-layer neural networks are presented in the original paper. The last 500,000 examples are used as a test set.

Popper
******
The popper folder contains a performance validation test that compares the liblinear and xLearn libraries. Its workflow automatically downloads the data set, runs the benchmark and shows the results on a chart.

.. __: https://archive.ics.uci.edu/ml/datasets/HIGGS
23 changes: 23 additions & 0 deletions demo/classification/higgs/popper/Dockerfile
@@ -0,0 +1,23 @@
FROM python:3.7-slim-buster as base

ENV USER=root

# install build dependencies and python libs to run benchmarks
# (the PyPI package is named scikit-learn; "sklearn" is only a deprecated alias)
RUN apt-get update && \
apt-get install -y cmake g++ git curl && \
rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/* && \
pip install --no-cache-dir scikit-learn pandas==1.0.4

# install liblinear from source
RUN git clone https://github.com/cjlin1/liblinear /opt/liblinear && \
cd /opt/liblinear/python && \
git checkout f41e72c && \
make -j4
ENV PYTHONPATH=/opt/liblinear/python

# install xlearn from source
COPY . /xlearn
RUN cd /xlearn && \
ls -l && \
./build.sh && \
rm -r /xlearn
44 changes: 44 additions & 0 deletions demo/classification/higgs/popper/README.md
@@ -0,0 +1,44 @@
# Performance Validation Workflow with HIGGS
## Using Popper

[Popper](https://github.com/systemslab/popper) is a tool for defining and executing container-native workflows in Docker, as well as other container engines. More details about Popper can be found [here](https://popper.readthedocs.io/).

## Description

This folder contains a `wf.yml` file that defines a Popper workflow for automatically downloading and verifying the complete [HIGGS data set](https://archive.ics.uci.edu/ml/datasets/HIGGS) from UCI (which has 11 million entries), running the benchmark to compare the liblinear library with xLearn and finally generating a report with a chart that shows the results including error bars.

The benchmark measures the performance of each library by running the following tasks five times:
- Load the data set with the help of [Pandas](https://pandas.pydata.org/).
- Train the linear model.
- Predict.

This is an example of how the chart looks:

![report](https://user-images.githubusercontent.com/33427324/86541248-39be6a00-bec0-11ea-8961-132951ac028f.png)
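
The repeat-and-summarize idea described above can be sketched in plain Python. This is only an illustration, not the code the workflow runs (the workflow times each run with `date +%s` in shell scripts); `time_task`, `summarize` and the toy workload are hypothetical names introduced here.

```python
import statistics
import time


def time_task(task, repeats=5):
    """Run `task` `repeats` times and return each wall-clock duration in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        task()
        durations.append(time.perf_counter() - start)
    return durations


def summarize(durations):
    """Return (mean, sample standard deviation) of a list of durations."""
    return statistics.mean(durations), statistics.stdev(durations)


if __name__ == "__main__":
    # Toy workload standing in for "load, train, predict" (an assumption for illustration)
    runs = time_task(lambda: sum(i * i for i in range(100_000)))
    mean, spread = summarize(runs)
    print(f"mean={mean:.4f}s  stdev={spread:.4f}s  runs={len(runs)}")
```

Repeating the measurement is what makes error bars possible: a single run gives one number, while five runs give a mean and a spread around it.
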
### Instructions:

1. Clone the repository.
```
git clone https://github.com/aksnzhy/xlearn.git
```

2. Install [docker](https://docs.docker.com/get-docker/).

3. Install the `popper` tool.
```
curl -sSfL https://raw.githubusercontent.com/getpopper/popper/master/install.sh | sh
```
4. Run the workflow.
```
cd xlearn/
popper run -f demo/classification/higgs/popper/wf.yml
```
To run a single step instead of the whole workflow, add the name of the step at the end of the command, as in the following example.
```
popper run -f demo/classification/higgs/popper/wf.yml prepare-data
```
If a step is failing, you can debug it by opening an interactive shell in its container instead of updating the YAML file and invoking `popper run` again.
```
popper sh -f demo/classification/higgs/popper/wf.yml prepare-data
```
The example above opens a shell inside the container of the `prepare-data` step, where you can run commands by hand. More information can be found [here](https://popper.readthedocs.io/en/latest/sections/getting_started.html#run-your-workflow).
76 changes: 76 additions & 0 deletions demo/classification/higgs/popper/csv2libsvm.py
@@ -0,0 +1,76 @@
#!/usr/bin/env python

"""
Convert CSV file to libsvm format. Works only with numeric variables.
Put -1 as label index (argv[3]) if there are no labels in your file.
Expecting no headers. If present, headers can be skipped with argv[4] == 1.

source: https://stackoverflow.com/questions/23170152/converting-csv-file-to-libsvm-compatible-data-file-using-python

"""

import sys
import csv


def construct_line(label, line, labels_dict):
    """Build one libsvm-formatted line from a label and a list of feature strings."""
    new_line = []
    try:
        # Numeric labels, including forms such as "1.0e+00", are written as numbers
        new_line.append("%g" % float(label))
    except ValueError:
        # Non-numeric labels are mapped to consecutive integer ids
        if label not in labels_dict:
            labels_dict[label] = str(len(labels_dict))
        new_line.append(labels_dict[label])

    for i, item in enumerate(line):
        if item == '' or float(item) == 0.0:
            continue  # empty and zero entries are omitted (sparse format)
        elif item == 'NaN':
            item = "0.0"  # NaN entries are written as an explicit zero
        new_line.append("%s:%s" % (i + 1, item))
    return " ".join(new_line) + "\n"

# ---

input_file = sys.argv[1]
try:
    output_file = sys.argv[2]
except IndexError:
    output_file = input_file + ".out"

try:
    label_index = int(sys.argv[3])
except IndexError:
    label_index = 0

try:
    # argv[4] is a string, so "0" and "False" must be parsed, not tested for truthiness
    skip_headers = sys.argv[4].lower() in ('1', 'true', 'yes')
except IndexError:
    skip_headers = False

labels_dict = {}
with open(input_file, 'rt', newline='') as i, open(output_file, 'wb') as o:
    reader = csv.reader(i)

    if skip_headers:
        next(reader)

    for line in reader:
        if label_index == -1:
            label = '1'
        else:
            label = line.pop(label_index)

        new_line = construct_line(label, line, labels_dict)
        o.write(new_line.encode('utf-8'))
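
For reference, the LIBSVM text format produced by this script holds one example per line: a label followed by space-separated `index:value` pairs with 1-based indices, where zero-valued features are left out. A minimal sketch of that mapping for a single row, independent of the script above (`row_to_libsvm` is a hypothetical helper, and the sample row is made up):

```python
def row_to_libsvm(row, label_index=0):
    """Convert one CSV row (a list of numeric strings) to a LIBSVM-formatted line."""
    values = list(row)
    label = float(values.pop(label_index))
    pairs = []
    for position, text in enumerate(values, start=1):
        if text == "":
            continue  # treat missing cells as absent features
        if float(text) == 0.0:
            continue  # sparse format: zeros are simply left out
        pairs.append(f"{position}:{text}")
    return " ".join([f"{label:g}"] + pairs)


if __name__ == "__main__":
    # Label 1.0, then three features of which the second is zero
    print(row_to_libsvm(["1.0", "0.5", "0", "2.25"]))
```

Note that the feature index keeps counting over skipped zeros, so the third feature is still written with index 3.
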
28 changes: 28 additions & 0 deletions demo/classification/higgs/popper/run_benchmark.sh
@@ -0,0 +1,28 @@
#!/bin/bash
set -ex

timestamp=$(date "+%Y%m%d-%H%M%S")
results_dir="results/$timestamp"
report_file="results/$timestamp/report.csv"

if [ -f "$report_file" ]; then
  rm -f "$report_file"
fi

# Generate the output directory
if [ ! -d "$results_dir" ]; then
  mkdir -p "./$results_dir"
  chmod -R 777 "./$results_dir"
fi

echo time,library >> "$report_file"
# Run the training 5 times
counter=1
while [ $counter -le 5 ]
do
  # each sourced script sets $result to the elapsed time in seconds
  . ./run_xlearn.sh
  echo "$result,xlearn" >> "$report_file"
  . ./run_liblinear.sh
  echo "$result,liblinear" >> "$report_file"
  counter=$(( counter+1 ))
done
6 changes: 6 additions & 0 deletions demo/classification/higgs/popper/run_higgs_liblinear.py
@@ -0,0 +1,6 @@
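from liblinearutil import *

# Read data in LIBSVM format (the file written by the prepare-data step)
y, x = svm_read_problem('libsvm.data')
# Train on the first 8,800,000 examples (80% of the 11 million rows)
# -s 0: L2-regularized logistic regression, -c 4: cost parameter, -B 1: bias term
m = train(y[:8800000], x[:8800000], '-s 0 -c 4 -B 1')
# Predict on the remaining 20%
p_label, p_acc, p_val = predict(y[8800000:], x[8800000:], m)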
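
The hard-coded index 8,800,000 matches an 80/20 split of the 11 million HIGGS rows mentioned in the README, the same ratio as `test_size=0.2` in the xLearn script. A small sketch of that arithmetic, assuming a sequential (unshuffled) split; `split_index` is a hypothetical helper, not part of liblinear:

```python
def split_index(n_rows, train_fraction=0.8):
    """Return the row index that separates training rows from test rows."""
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must be between 0 and 1")
    return int(round(n_rows * train_fraction))


if __name__ == "__main__":
    cut = split_index(11_000_000)
    # Rows [0, cut) are used for training, rows [cut, n_rows) for testing
    print(cut)
```
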
47 changes: 47 additions & 0 deletions demo/classification/higgs/popper/run_higgs_xlearn.py
@@ -0,0 +1,47 @@
# Import libraries
import numpy as np
import pandas as pd
import xlearn as xl
from sklearn.model_selection import train_test_split

# Load dataset
higgs = pd.read_csv("HIGGS.csv", header=None, sep=",")

X = higgs[higgs.columns[1:]]
y = higgs[0]

# Split train and test set
x_train,x_test,y_train,y_test=train_test_split(X,y,test_size=0.2)

# DMatrix transition
xdm_train = xl.DMatrix(x_train, y_train)
xdm_test = xl.DMatrix(x_test, y_test)

# Training task
linear_model = xl.create_linear() # Use linear model
linear_model.setTrain(xdm_train) # Training data
linear_model.setValidate(xdm_test) # Validation data

# param:
#  0. binary classification task
#  1. learning rate: 0.2
#  2. regular lambda: 0.002
#  3. evaluation metric: acc
param = {'task':'binary', 'lr':0.2,
         'lambda':0.002, 'metric':'acc'}

# Start to train
# The trained model will be stored in model_dm.out
linear_model.fit(param, './model_dm.out')

# Prediction task
linear_model.setTest(xdm_test)  # Test data
linear_model.setSigmoid()  # Convert output to 0-1

# Start to predict
# No output path is set, so the predictions are returned as a numpy.ndarray
res = linear_model.predict("./model_dm.out")

print(res)
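
The script calls `setSigmoid()` so that the predictions fall in the 0-1 range. As an illustration of what such a mapping means for a binary task, here is a standard-library sketch of the logistic function followed by a 0.5 threshold. The function names, the threshold and the sample scores are assumptions made for this example; they are not xLearn internals.

```python
import math


def sigmoid(score):
    """Logistic function: maps a real-valued score into the range 0-1."""
    if score >= 0:
        return 1.0 / (1.0 + math.exp(-score))
    # Equivalent form that avoids overflow for large negative scores
    exp_score = math.exp(score)
    return exp_score / (1.0 + exp_score)


def accuracy(scores, labels, threshold=0.5):
    """Fraction of examples whose thresholded probability matches the 0/1 label."""
    predictions = [1 if sigmoid(s) >= threshold else 0 for s in scores]
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


if __name__ == "__main__":
    print(sigmoid(0.0))
    # Made-up raw scores and labels, only to show the call
    print(accuracy([2.0, -1.0, 0.3, -0.2], [1, 0, 0, 0]))
```
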

24 changes: 24 additions & 0 deletions demo/classification/higgs/popper/run_liblinear.sh
@@ -0,0 +1,24 @@
#!/bin/sh
set -ex

# start timing
start=$(date +%s)
start_fmt=$(date +%Y-%m-%d\ %r)
echo "STARTING TIMING RUN AT $start_fmt"

# run benchmark

echo "running benchmark"

python3 run_higgs_liblinear.py

# end timing
end=$(date +%s)
end_fmt=$(date +%Y-%m-%d\ %r)
echo "ENDING TIMING RUN AT $end_fmt"

# report result
result=$(( $end - $start ))
result_name="liblinear"

echo "RESULT,$result_name,$result,$USER,$start_fmt"
25 changes: 25 additions & 0 deletions demo/classification/higgs/popper/run_xlearn.sh
@@ -0,0 +1,25 @@
#!/bin/sh
set -ex

# start timing
start=$(date +%s)
start_fmt=$(date +%Y-%m-%d\ %r)
echo "STARTING TIMING RUN AT $start_fmt"

# run benchmark

echo "running benchmark"

python3 run_higgs_xlearn.py

# end timing
end=$(date +%s)
end_fmt=$(date +%Y-%m-%d\ %r)
echo "ENDING TIMING RUN AT $end_fmt"

# report result
result=$(( $end - $start ))
result_name="xlearn"

echo "RESULT,$result_name,$result,$USER,$start_fmt"

18 changes: 18 additions & 0 deletions demo/classification/higgs/popper/show_results.py
@@ -0,0 +1,18 @@
#!/usr/bin/env python

import glob
import os

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Pick the most recent report; directory names are timestamps, so they sort chronologically
list_reports = sorted(glob.glob("results/*/report.csv"))
latest_report = list_reports[-1]

results = pd.read_csv(latest_report, sep=",")

sns.barplot(x='library', y='time', data=results)
plt.title('Performance of the libraries with HIGGS dataset')
# Save the figure next to the report it was generated from
plt.savefig(os.path.join(os.path.dirname(latest_report), "report.png"))
plt.show()
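
The chart is drawn from `report.csv`, which `run_benchmark.sh` writes with a `time,library` header and one row per run. To inspect the same numbers without plotting, a standard-library sketch like the following computes the mean and sample standard deviation per library. `summarize_report` is a hypothetical helper and the sample timings are made up.

```python
import csv
import io
import statistics
from collections import defaultdict


def summarize_report(csv_text):
    """Return {library: (mean_seconds, stdev_seconds)} from 'time,library' CSV text."""
    times = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        times[row["library"]].append(float(row["time"]))
    summary = {}
    for library, values in times.items():
        # stdev needs at least two samples
        spread = statistics.stdev(values) if len(values) > 1 else 0.0
        summary[library] = (statistics.mean(values), spread)
    return summary


if __name__ == "__main__":
    # Made-up timings, only to show the expected file layout
    sample = "time,library\n10,xlearn\n12,xlearn\n30,liblinear\n34,liblinear\n"
    for library, (mean, spread) in summarize_report(sample).items():
        print(f"{library}: mean={mean:.1f}s stdev={spread:.2f}s")
```
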
45 changes: 45 additions & 0 deletions demo/classification/higgs/popper/wf.yml
@@ -0,0 +1,45 @@
steps:

- id: build-img
  uses: docker://docker:19.03.10
  args:
  - build
  - --tag=xlearn
  - --file=Dockerfile
  - .

- id: download-data
  uses: docker://byrnedo/alpine-curl:0.1.8
  runs: [sh]
  dir: /workspace/demo/classification/higgs/popper
  args:
  - -c
  - |
    set -ex
    if [ ! -f HIGGS.csv ]; then
      curl -LO https://archive.ics.uci.edu/ml/machine-learning-databases/00280/HIGGS.csv.gz
      if [ -f HIGGS.csv.gz ]; then
        gunzip HIGGS.csv.gz
      fi
    fi

- id: prepare-data
  uses: docker://python:3.7
  dir: /workspace/demo/classification/higgs/popper
  runs: [python3]
  args: ['csv2libsvm.py','HIGGS.csv','libsvm.data','0','False']


- id: run-benchmark
  uses: docker://xlearn
  dir: /workspace/demo/classification/higgs/popper
  skip_pull: true
  runs: [sh]
  args: [run_benchmark.sh]

- id: show-results
  uses: docker://jupyter/scipy-notebook:latest
  dir: /workspace/demo/classification/higgs/popper
  runs: [python3]
  args: [show_results.py]
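
The `download-data` step is idempotent: it only fetches and decompresses the archive when `HIGGS.csv` is missing. The same logic can be sketched in Python as follows. This is an illustration, not part of the workflow; `ensure_dataset` and `gunzip` are hypothetical helpers, and the `download` parameter is an assumption added so the logic can be exercised without network access.

```python
import gzip
import os
import shutil
import urllib.request

# Same URL as in the workflow step
HIGGS_URL = "https://archive.ics.uci.edu/ml/machine-learning-databases/00280/HIGGS.csv.gz"


def gunzip(archive_path, target_path):
    """Decompress a .gz file to target_path and delete the archive."""
    with gzip.open(archive_path, "rb") as source, open(target_path, "wb") as target:
        shutil.copyfileobj(source, target)
    os.remove(archive_path)


def ensure_dataset(target_path="HIGGS.csv", url=HIGGS_URL,
                   download=urllib.request.urlretrieve):
    """Download and decompress the data set only if target_path does not exist yet.

    Returns True when a download happened and False when the file was already there.
    """
    if os.path.exists(target_path):
        return False
    archive_path = target_path + ".gz"
    download(url, archive_path)
    gunzip(archive_path, target_path)
    return True
```

Calling `ensure_dataset()` twice in a row downloads at most once, which is why re-running the workflow does not fetch the large file again.
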