nicolecastillo · ivotron · Jul 6, 2020 · Jun 22, 2020 · Jun 22, 2020 · Jun 22, 2020
diff --git a/demo/classification/higgs/README.rst b/demo/classification/higgs/README.rst
@@ -5,4 +5,8 @@ You can find the full data from this here (`Link`__)
 
 The data has been produced using Monte Carlo simulations. The first 21 features (columns 2-22) are kinematic properties measured by the particle detectors in the accelerator. The last seven features are functions of the first 21 features; these are high-level features derived by physicists to help discriminate between the two classes. There is an interest in using deep learning methods to obviate the need for physicists to manually develop such features. Benchmark results using Bayesian Decision Trees from a standard physics package and 5-layer neural networks are presented in the original paper. The last 500,000 examples are used as a test set.
 
-.. __: https://archive.ics.uci.edu/ml/datasets/HIGGS
+Popper
+******
+There is a performance validation test in the ``popper`` folder that compares the liblinear and xLearn libraries. Its workflow automatically downloads the data set, runs the benchmark, and plots the results in a chart.
+
+.. __: https://archive.ics.uci.edu/ml/datasets/HIGGS
diff --git a/demo/classification/higgs/popper/Dockerfile b/demo/classification/higgs/popper/Dockerfile
@@ -0,0 +1,23 @@
+FROM python:3.7-slim-buster as base
+
+ENV USER=root
+
+# install build dependencies and python libs to run benchmarks
+RUN apt-get update && \
+    apt-get install -y cmake g++ git curl && \
+    rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/* && \
+    pip install --no-cache-dir scikit-learn pandas==1.0.4
+
+# install liblinear from source
+RUN git clone https://github.com/cjlin1/liblinear /opt/liblinear && \
+    cd /opt/liblinear/python && \
+    git checkout f41e72c && \
+    make -j4
+ENV PYTHONPATH=/opt/liblinear/python
+
+# install xlearn from source
+COPY . /xlearn
+RUN cd /xlearn && \
+    ls -l && \
+    ./build.sh && \
+    rm -r /xlearn
diff --git a/demo/classification/higgs/popper/README.md b/demo/classification/higgs/popper/README.md
@@ -0,0 +1,44 @@
+# Performance Validation Workflow with HIGGS
+## Using Popper
+
+[Popper](https://github.com/systemslab/popper) is a tool for defining and executing container-native workflows in Docker, as well as other container engines. More details about Popper can be found [here](https://popper.readthedocs.io/).
+
+## Description
+
+This folder contains a `wf.yml` file that defines a Popper workflow for automatically downloading and verifying the complete [HIGGS data set](https://archive.ics.uci.edu/ml/datasets/HIGGS) from UCI (which has 11 million entries), running the benchmark to compare the liblinear library with xLearn, and finally generating a report with a chart that shows the results, including error bars.
+
+The benchmark measures the performance of each library by running the following set of main tasks five times:
+- Load the data set with the help of [Pandas](https://pandas.pydata.org/).
+- Train a linear model.
+- Predict.
+
+This is an example of how the chart looks:
+
+![report](https://user-images.githubusercontent.com/33427324/86541248-39be6a00-bec0-11ea-8961-132951ac028f.png)
+## Instructions
+
+1. Clone the repository.
+```
+git clone https://github.com/aksnzhy/xlearn.git
+```
+
+2. Install [docker](https://docs.docker.com/get-docker/).
+
+3. Install the `popper` tool.
+```
+curl -sSfL https://raw.githubusercontent.com/getpopper/popper/master/install.sh | sh
+```
+4. Run the workflow.
+```
+cd xlearn/
+popper run -f demo/classification/higgs/popper/wf.yml
+```
+To run a single step instead of the whole workflow, add the name of the step at the end of the command, as in the following example.
+```
+popper run -f demo/classification/higgs/popper/wf.yml prepare-data
+```
+If a step is failing, you can debug it by opening an interactive shell instead of updating the YAML file and invoking `popper run` again.
+```
+popper sh -f demo/classification/higgs/popper/wf.yml prepare-data
+```
+The command above opens a shell inside the step's container, where you can run commands manually. More information can be found [here](https://popper.readthedocs.io/en/latest/sections/getting_started.html#run-your-workflow).
diff --git a/demo/classification/higgs/popper/csv2libsvm.py b/demo/classification/higgs/popper/csv2libsvm.py
@@ -0,0 +1,76 @@
+#!/usr/bin/env python
+
+"""
+Convert CSV file to libsvm format. Works only with numeric variables.
+Put -1 as label index (argv[3]) if there are no labels in your file.
+Expecting no headers. If present, headers can be skipped with argv[4] == 1.
+
+source: https://stackoverflow.com/questions/23170152/converting-csv-file-to-libsvm-compatible-data-file-using-python
+
+"""
+
+import sys
+import csv
+import operator
+from collections import defaultdict
+
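+def construct_line(label, line, labels_dict):
+    new_line = []
+    try:
+        # Numeric labels are kept as they are, including floats written
+        # in scientific notation such as "1.000000000000000000e+00".
+        # "%g" normalizes them, so 1.0 becomes "1" and 0.0 becomes "0".
+        new_line.append("%g" % float(label))
+    except ValueError:
+        # Non-numeric labels get one integer id per distinct value.
+        if label not in labels_dict:
+            labels_dict[label] = str(len(labels_dict))
+        new_line.append(labels_dict[label])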
+
+    for i, item in enumerate(line):
+        if item == '' or float(item) == 0.0:
+            continue
+        elif item=='NaN':
+            item="0.0"
+        new_item = "%s:%s" % (i + 1, item)
+        new_line.append(new_item)
+    new_line = " ".join(new_line)
+    new_line += "\n"
+    return new_line
+
+# ---
+
+input_file = sys.argv[1]
+try:
+    output_file = sys.argv[2]
+except IndexError:
+    output_file = input_file+".out"
+
+
+try:
+    label_index = int( sys.argv[3] )
+except IndexError:
+    label_index = 0
+
+try:
+    skip_headers = sys.argv[4].lower() in ("1", "true", "yes")
+except IndexError:
+    skip_headers = False
+
+i = open(input_file, 'rt')
+o = open(output_file, 'wb')
+
+reader = csv.reader(i)
+
+if skip_headers:
+    next(reader)  # discard the header row
+
+labels_dict = {}
+for line in reader:
+    if label_index == -1:
+        label = '1'
+    else:
+        label = line.pop(label_index)
+
+    new_line = construct_line(label, line, labels_dict)
+    o.write(new_line.encode('utf-8'))
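Review note: the per-row conversion done by `construct_line` can be sketched as a small pure function, which makes the sparse LIBSVM output easy to check on a single row. This is a minimal sketch under stated assumptions, not the script's actual code: the name `to_libsvm_line` is hypothetical, and it assumes numeric labels, as in HIGGS.

```python
def to_libsvm_line(label, features):
    """Format one row as 'label idx:value ...' with 1-based feature indices.

    Hypothetical helper for illustration. Zero and empty features are
    omitted, which is what makes the LIBSVM format sparse.
    """
    parts = ["%g" % float(label)]
    for index, item in enumerate(features, start=1):
        if item == "" or float(item) == 0.0:
            continue
        parts.append("%d:%s" % (index, item))
    return " ".join(parts)


# A made-up row in the HIGGS layout: the label first, then the features.
row = ["1.000000000000000000e+00", "0.5", "0", "", "-2.25"]
line = to_libsvm_line(row[0], row[1:])
print(line)
```

Feature indices keep their original column position even when earlier columns are skipped, so the fourth feature is still written with index 4.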
diff --git a/demo/classification/higgs/popper/run_benchmark.sh b/demo/classification/higgs/popper/run_benchmark.sh
@@ -0,0 +1,28 @@
+#!/bin/bash
+set -ex
+
+timestamp=$(date "+%Y%m%d-%H%M%S")
+results_dir="results/$timestamp"
+report_file="results/$timestamp/report.csv"
+
+if [ -f "$report_file" ]; then
+    rm -f "$report_file"
+fi
+
+# Generate the output directory
+if [ ! -d "$results_dir" ]; then
+    mkdir -p "./$results_dir"
+    chmod -R 777 "./$results_dir"
+fi
+
+echo time,library >> "$report_file"
+# Run each benchmark 5 times
+counter=1
+while [ $counter -le 5 ]
+do
+    . ./run_xlearn.sh
+    echo "$result,xlearn" >> "$report_file"
+    . ./run_liblinear.sh
+    echo "$result,liblinear" >> "$report_file"
+    counter=$(( counter+1 ))
+done
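Review note: the loop in this script interleaves the two libraries and appends one `time,library` row per run. The same pattern can be sketched in Python as below. This is only an illustration: `time_runs` and `write_report` are hypothetical names, and the two lambdas are placeholder workloads standing in for the real benchmark scripts.

```python
import csv
import io
import time


def time_runs(tasks, repetitions=5):
    """Run every task `repetitions` times, interleaved, timing each run."""
    rows = []
    for _ in range(repetitions):
        for name, task in tasks.items():
            start = time.perf_counter()
            task()
            rows.append((time.perf_counter() - start, name))
    return rows


def write_report(rows, stream):
    """Write the rows in the same 'time,library' layout as report.csv."""
    writer = csv.writer(stream)
    writer.writerow(["time", "library"])
    writer.writerows(rows)


# Placeholder workloads (assumptions), not the actual training scripts.
tasks = {
    "xlearn": lambda: sum(range(1000)),
    "liblinear": lambda: sum(range(2000)),
}
rows = time_runs(tasks)
buffer = io.StringIO()
write_report(rows, buffer)
```

Interleaving the runs, rather than running one library five times and then the other, spreads any drift in machine load across both libraries.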
diff --git a/demo/classification/higgs/popper/run_higgs_liblinear.py b/demo/classification/higgs/popper/run_higgs_liblinear.py
@@ -0,0 +1,6 @@
+from liblinearutil import *
+
+# Read the data in LIBSVM format (written by the prepare-data step)
+y, x = svm_read_problem('libsvm.data')
+m = train(y[:8800000], x[:8800000], '-s 0 -c 4 -B 1')
+p_label, p_acc, p_val = predict(y[8800000:], x[8800000:], m)
diff --git a/demo/classification/higgs/popper/run_higgs_xlearn.py b/demo/classification/higgs/popper/run_higgs_xlearn.py
@@ -0,0 +1,47 @@
+# Import libraries
+import numpy as np
+import pandas as pd
+import xlearn as xl
+from sklearn.model_selection import train_test_split
+
+# Load dataset
+higgs = pd.read_csv("HIGGS.csv", header=None, sep=",")
+
+X = higgs[higgs.columns[1:]]
+y = higgs[0]
+
+# Split train and test set
+x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
+
+# Convert to DMatrix
+xdm_train = xl.DMatrix(x_train, y_train)
+xdm_test = xl.DMatrix(x_test, y_test)
+
+# Training task
+linear_model = xl.create_linear()  # Use linear model
+linear_model.setTrain(xdm_train)    # Training data
+linear_model.setValidate(xdm_test)  # Validation data
+
+# param:
+#  0. binary classification task
+#  1. learning rate: 0.2
+#  2. regularization lambda: 0.002
+#  3. evaluation metric: acc
+param = {'task':'binary', 'lr':0.2,
+         'lambda':0.002, 'metric':'acc'}
+
+# Start to train
+# The trained model will be stored in model_dm.out
+linear_model.fit(param, './model_dm.out')
+
+# Prediction task
+linear_model.setTest(xdm_test)  # Test data
+linear_model.setSigmoid()  # Convert output to 0-1
+
+# Start to predict
+# No output path is set here, so predict() returns the
+# result as a numpy.ndarray instead of writing it to a file
+res = linear_model.predict("./model_dm.out")
+
+print(res)
+
diff --git a/demo/classification/higgs/popper/run_liblinear.sh b/demo/classification/higgs/popper/run_liblinear.sh
@@ -0,0 +1,24 @@
+#!/bin/sh
+set -ex
+
+# start timing
+start=$(date +%s)
+start_fmt=$(date +%Y-%m-%d\ %r)
+echo "STARTING TIMING RUN AT $start_fmt"
+
+# run benchmark
+
+echo "running benchmark"
+
+python3 run_higgs_liblinear.py 
+
+# end timing
+end=$(date +%s)
+end_fmt=$(date +%Y-%m-%d\ %r)
+echo "ENDING TIMING RUN AT $end_fmt"
+
+# report result
+result=$(( $end - $start ))
+result_name="liblinear"
+
+echo "RESULT,$result_name,$result,$USER,$start_fmt"
diff --git a/demo/classification/higgs/popper/run_xlearn.sh b/demo/classification/higgs/popper/run_xlearn.sh
@@ -0,0 +1,25 @@
+#!/bin/sh
+set -ex
+
+# start timing
+start=$(date +%s)
+start_fmt=$(date +%Y-%m-%d\ %r)
+echo "STARTING TIMING RUN AT $start_fmt"
+
+# run benchmark
+
+echo "running benchmark"
+
+python3 run_higgs_xlearn.py 
+
+# end timing
+end=$(date +%s)
+end_fmt=$(date +%Y-%m-%d\ %r)
+echo "ENDING TIMING RUN AT $end_fmt"
+
+# report result
+result=$(( $end - $start ))
+result_name="xlearn"
+
+echo "RESULT,$result_name,$result,$USER,$start_fmt"
+
diff --git a/demo/classification/higgs/popper/show_results.py b/demo/classification/higgs/popper/show_results.py
@@ -0,0 +1,18 @@
+#!/usr/bin/env python
+
+import matplotlib.pyplot as plt
+import seaborn as sns
+import pandas as pd
+import glob
+
+list_reports = glob.glob("results/*/report.csv")
+dir_list = glob.glob("results/*")
+list_reports.sort()
+dir_list.sort()
+
+results = pd.read_csv(list_reports[-1], sep=",")
+
+sns.barplot(x = 'library', y = 'time', data = results)
+plt.title('Performance of the libraries with HIGGS dataset')
+plt.savefig(dir_list[-1] + "/report.png")
+plt.show()
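Review note: the chart aggregates the five runs per library into one bar with an error bar (by default, seaborn's `barplot` shows the mean with a bootstrapped confidence interval). To inspect the same aggregation numerically, a standard-library sketch could look like the one below. The function name `summarize` is hypothetical, the sample standard deviation is a simpler stand-in for seaborn's interval, and the timings are made up for illustration, not measured results.

```python
import csv
import io
import statistics
from collections import defaultdict


def summarize(report_text):
    """Return {library: (mean, sample standard deviation)} of the run times."""
    times = defaultdict(list)
    for row in csv.DictReader(io.StringIO(report_text)):
        times[row["library"]].append(float(row["time"]))
    return {
        library: (statistics.mean(values), statistics.stdev(values))
        for library, values in times.items()
    }


# Made-up timings in the report.csv layout, purely for illustration.
sample = "time,library\n10,xlearn\n12,xlearn\n20,liblinear\n24,liblinear\n"
summary = summarize(sample)
for library, (mean, spread) in summary.items():
    print("%s: mean=%.2f s, stdev=%.2f s" % (library, mean, spread))
```

Because `run_xlearn.sh` and `run_liblinear.sh` time with `date +%s`, each measurement has one-second resolution, which is worth keeping in mind when reading small differences.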
diff --git a/demo/classification/higgs/popper/wf.yml b/demo/classification/higgs/popper/wf.yml
@@ -0,0 +1,45 @@
+steps:
+
+- id: build-img
+  uses: docker://docker:19.03.10
+  args:
+  - build
+  -   --tag=xlearn
+  -   --file=demo/classification/higgs/popper/Dockerfile
+  -   .
+
+- id: download-data
+  uses: docker://byrnedo/alpine-curl:0.1.8
+  runs: [sh]
+  dir: /workspace/demo/classification/higgs/popper
+  args:
+  - -c
+  - |
+    set -ex
+    if [ ! -f HIGGS.csv ]; then
+      curl -LO https://archive.ics.uci.edu/ml/machine-learning-databases/00280/HIGGS.csv.gz
+      if [ -f HIGGS.csv.gz ]; then
+        gunzip HIGGS.csv.gz
+      fi
+    fi
+
+- id: prepare-data
+  uses: docker://python:3.7
+  dir: /workspace/demo/classification/higgs/popper
+  runs: [python3]
+  args: ['csv2libsvm.py','HIGGS.csv','libsvm.data','0']
+
+
+- id: run-benchmark
+  uses: docker://xlearn
+  dir: /workspace/demo/classification/higgs/popper
+  skip_pull: true
+  runs: [sh]
+  args: [run_benchmark.sh]
+
+- id: show-results
+  uses: docker://jupyter/scipy-notebook:latest
+  dir: /workspace/demo/classification/higgs/popper
+  runs: [python3]
+  args: [show_results.py] 
+