Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

LightGBM model locale fix #53

Merged
merged 32 commits into from
Sep 30, 2020
Merged
Show file tree
Hide file tree
Changes from 15 commits
Commits
Show all changes
32 commits
Select commit Hold shift + click to select a range
8289649
LGBM prototype with build for LGBM which now fails with the latest ve…
Sep 11, 2020
58d2a6d
Test using the custom PoC code to mock read/writes in C++
Sep 14, 2020
6975a2f
Use the latest make-lightgbm with caching
Sep 15, 2020
cc62077
Use new LightGBM with fmt
Sep 15, 2020
d1d5bd8
Use merged make-lightgbm with submodules support
Sep 15, 2020
a8cabc2
Use latest make-lightgbm version
Sep 16, 2020
f48d293
Add model rewrite functional tests
Sep 16, 2020
c26f3ed
Model writes use fmt library
Sep 21, 2020
1448d51
Patched LightGBM version passes all tests independently of locale
Sep 23, 2020
1c953f3
Test model diff with Java and do optional Python report
Sep 24, 2020
5359eb3
Update python scripts
Sep 25, 2020
a3f4b53
Update python usage
Sep 25, 2020
742ee32
Update test code for Codacy
Sep 25, 2020
fb88ca7
Update extend_model.py for Codacy
Sep 25, 2020
36b2637
Add docstrings for extend_model.py
Sep 25, 2020
5b5c1a8
Update pom descriptions
Sep 25, 2020
96ee811
Address review comments from Sheng
Sep 25, 2020
1113b4e
Add final keywords
Sep 28, 2020
69dc536
Move python scripts to test resources
Sep 28, 2020
98edf63
Remove LC_ALL test configuration
Sep 28, 2020
6579fe5
Update docstring
Sep 28, 2020
3191bee
Delete temporary test outputs
Sep 28, 2020
a6d64c1
Please Codacy random cryptographic checks
Sep 28, 2020
9a0ed8f
Move reference models to test resources
Sep 28, 2020
a49c274
Address Sheng's comments
Sep 28, 2020
aa83d29
Update docstrings
Sep 29, 2020
f367470
Remove python diff_utils.py from test. Now it's manual.
Sep 29, 2020
38df851
Delete temporary dir in test
Sep 29, 2020
c835fba
Cleanup unused imports
Sep 29, 2020
c299c51
Remove model logger from test
Sep 29, 2020
fb66fc2
Update pom.xml to use tag
Sep 29, 2020
c603135
Upgrade make-lightgbm to have fix-cache patch.
Sep 29, 2020
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 1 addition & 1 deletion openml-lightgbm/lightgbm-builder/make.sh
Original file line number Diff line number Diff line change
Expand Up @@ -28,5 +28,5 @@ git submodule update --init

echo "Building LightGBM $LIGHTGBM_COMMIT_REF as lightgbmlib $LIGHTGBMLIB_VERSION"
cd make-lightgbm
bash make.sh "$LIGHTGBM_COMMIT_REF" "$LIGHTGBMLIB_VERSION"
bash make.sh "$LIGHTGBM_COMMIT_REF" "$LIGHTGBMLIB_VERSION" --cache

10 changes: 6 additions & 4 deletions openml-lightgbm/lightgbm-builder/pom.xml
Original file line number Diff line number Diff line change
Expand Up @@ -15,7 +15,7 @@

<groupId>com.feedzai.openml.lightgbm</groupId>
<artifactId>lightgbm-lib</artifactId>
<version>3.0.0</version>
<version>3.0.1</version>

<packaging>jar</packaging>
<name>Openml LightGBM lib</name>
Expand All @@ -36,9 +36,11 @@
<url>https://github.com/feedzai/make-lightgbm</url>

<properties>
<lightgbm.repo.url>https://github.com/microsoft/LightGBM</lightgbm.repo.url>
<lightgbm.version>v3.0.0</lightgbm.version>
<lightgbmlib.version>3.0.0</lightgbmlib.version>
<!--<lightgbm.repo.url>https://github.com/microsoft/LightGBM</lightgbm.repo.url>-->
<lightgbm.repo.url>https://github.com/feedzai/LightGBM.git</lightgbm.repo.url>
<lightgbmlib.version>3.0.1</lightgbmlib.version>
<lightgbm.version>v3.0.0</lightgbm.version> <!-- Standard codebase (won't pass the tests when LC_ALL != C) -->
<lightgbm.version>model-locale-fix</lightgbm.version> <!-- Patched version: passes all tests independently of locale -->
</properties>

<!-- jgitver changes the version 2.3.190 to project version, therefore we excluded this project from jgitver -->
Expand Down
111 changes: 111 additions & 0 deletions openml-lightgbm/lightgbm-provider/diff_models.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,111 @@
"""
This script compares two folders with LGBM models.
It compares models with the same name across the two folders.
If a file with the same name is found in both folders it is compared.
Any differences in any of such compared files prints a report and triggers exit(1).
The report focuses on differences in the parameters of each line.

Requirements:
- Python 3
- (Optional) Python's termcolor for best results (`pip install termcolor`)

Usage:
script <reference_models_folder> <new_models_folder>

Exit code:
0 if no differences in matching files were found
1 if any file has differences

Author: Alberto Ferreira
Copyright: 2020, Feedzai
License: Same as the repo.
"""

import argparse
import filecmp
import sys
from pathlib import Path

try:
from termcolor import colored
def warn_missing_libs():
pass
except ModuleNotFoundError:
def warn_missing_libs():
print("Warning: Please install Python's 'termcolor' library to see detailed colored model diff!")
def colored(msg, color):
"Poor replacement for colored."
if color == "red":
return f"Original=<{msg}>"
elif color == "green":
return f"Current=<{msg}>"
else:
return msg


def file_diff_report(a, b, delim=" "):
"""
Reports the file diff by printing different lines and different elements in them.
"""
a_lines = open(a).readlines()
b_lines = open(b).readlines()

if len(a_lines) != len(b_lines):
print("Different file line sizes!")
return

for a_line, b_line in zip(a_lines, b_lines):
a_elems = a_line.split(delim)
b_elems = b_line.split(delim)

if a_elems != b_elems:
print(
colored(f"\n\n\n>>> Different line contents between lines\n\n", "yellow"),
colored(a_line, "red"),
colored(b_line, "green"),
sep=""
)
print(colored("\n>> Different elements in lines\n", "yellow"))
if len(a_elems) != len(b_elems):
print("Different line sizes!")
return
for a_elem, b_elem in zip(a_elems, b_elems):
if a_elem != b_elem:
print(
colored(a_elem, "red"),
"\n",
colored(b_elem, "green"),
sep=""
)

def compare_folders_with_model_files(folder_ref, folder_new):
"""
Compares two folders with lgbm .txt models inside.
If a file with the same name is found in the two folders is compared.
Any difference triggers an exit(1).
"""
dir_ref = Path(folder_ref)
dir_new = Path(folder_new)

diff_filenames = filecmp.dircmp(dir_ref, dir_new).diff_files

if diff_filenames:
warn_missing_libs()

for filename in diff_filenames:
print(colored(f"Found mismatches in file {filename}", "yellow"))
file_diff_report(dir_ref/filename, dir_new/filename)

warn_missing_libs()
sys.exit(1)

if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument("reference_models_folder")
parser.add_argument("new_models_folder")
args = parser.parse_args()

compare_folders_with_model_files(
args.reference_models_folder,
args.new_models_folder
)
9 changes: 7 additions & 2 deletions openml-lightgbm/lightgbm-provider/pom.xml
Original file line number Diff line number Diff line change
Expand Up @@ -25,7 +25,7 @@
<description>OpenML Microsoft LightGBM Machine Learning Model and Classifier provider</description>

<properties>
<lightgbmlib.version>3.0.0</lightgbmlib.version>
<lightgbmlib.version>3.0.1</lightgbmlib.version>
</properties>

<dependencies>
Expand Down Expand Up @@ -90,6 +90,11 @@
<artifactId>commons-csv</artifactId>
<scope>test</scope>
</dependency>
<dependency>
<groupId>commons-io</groupId>
<artifactId>commons-io</artifactId>
<scope>test</scope>
</dependency>
</dependencies>

<build>
Expand Down Expand Up @@ -125,7 +130,7 @@
<artifactId>maven-surefire-plugin</artifactId>
<configuration>
<environmentVariables>
<LC_ALL>C</LC_ALL>
<!--<LC_ALL>C</LC_ALL>-->
</environmentVariables>
</configuration>
</plugin>
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -22,13 +22,21 @@
import com.feedzai.openml.provider.exception.ModelLoadingException;
import org.apache.commons.csv.CSVParser;
import org.apache.commons.csv.CSVRecord;
import org.apache.commons.io.FileUtils;
import org.junit.BeforeClass;
import org.junit.Test;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import java.io.BufferedReader;
import java.io.File;
import java.io.IOException;
import java.io.InputStreamReader;
import java.math.BigDecimal;
import java.math.RoundingMode;
import java.net.URISyntaxException;
import java.nio.file.Path;
import java.nio.file.Paths;

import static org.assertj.core.api.Assertions.assertThat;
import static org.junit.Assert.assertTrue;
Expand All @@ -41,6 +49,11 @@
*/
public class LightGBMBinaryClassificationModelTest {

/**
* Logger for this class.
*/
private static final Logger logger = LoggerFactory.getLogger(LightGBMBinaryClassificationModelTest.class);

/**
* LightGBM's model loader instance. Will be used to load more model instances during tests.
*/
Expand Down Expand Up @@ -236,4 +249,115 @@ private boolean compareDoubles(final double a, final double b, final int exclude
);
}

/**
* Compares model files in the standard_code_models/ and new_code_models/.
* If there are any differences in the models within such folders they are reported
* in a colored report (requires python3 with termcolor installed).
*
* @param referenceModelsFolder Folder with reference models
* @param newModelsFolder Folder with new models to compare to reference
* @return true if there are no differences and all the dependencies are installed, false otherwise.
* @throws IOException In case the process errors.
* @throws InterruptedException In case awaiting for the process to finish fails.
*/
private boolean compareModelFilesAndDoPrettyReport(
final String referenceModelsFolder,
final String newModelsFolder) throws IOException, InterruptedException {

// Check for differences and report in detailed manner!
final Process p = new ProcessBuilder(
"python",
"diff_models.py",
referenceModelsFolder,
newModelsFolder
).start();

// Fetch process output
BufferedReader reader = new BufferedReader(new InputStreamReader(p.getInputStream()));
StringBuilder builder = new StringBuilder();
String line = null;
while ((line = reader.readLine()) != null) {
builder.append(line);
builder.append(System.getProperty("line.separator"));
}
final String processOutput = builder.toString();

p.waitFor();
logger.error(processOutput);

return (p.exitValue() == 0);
}

/**
* Asserts that the two files have the same content.
*
* @param name name of the file to compare for the assert message.
* @param filepath1 path to the first file
* @param filepath2 path to the second file
* @throws IOException Raised in case of failure reading the files
*/
private void assertEqualFileContents(final String name,
final Path filepath1,
final Path filepath2) throws IOException {

File file1 = new File(filepath1.toString());
File file2 = new File(filepath2.toString());

assertThat(FileUtils.contentEquals(file1, file2))
.as(String.format("%s file comparison", name))
.isTrue();
}

/**
* This functional test ensures that LightGBM can read a model file and output one exactly alike the one read in.
* This is to ensure the new code to rewrite the model read/write layers is completely functional.
* The two reference models were generated initially with LightGBM's v3.0.0 code.
* The two generated ones will use the current code in the current locale. There should be no mismatches.
* @throws URISyntaxException
* @throws ModelLoadingException
* @throws IOException
* @throws InterruptedException
*/
@Test
public void testRewriteModel() throws URISyntaxException, ModelLoadingException, IOException, InterruptedException {

final String referenceModelsFolder = "standard_code_models";
final String testOutputModelsFolder = "new_code_models";
final String model1Name = "4f.txt";
final String model2Name = "42f.txt";

// Create the output directory if it doesn't exist:
new File(testOutputModelsFolder).mkdir();

// Rewrite model 4f.txt:
LightGBMSWIG swig = new LightGBMSWIG(
TestResources.getModelFilePath().toString(),
TestSchemas.NUMERICALS_SCHEMA_WITH_LABEL_AT_END,
"");
swig.saveModelToDisk(Paths.get(testOutputModelsFolder, model1Name));

// Rewrite model 42f.txt:
swig = new LightGBMSWIG(
TestResources.getResourcePath("lightgbm_model_42_numericals.txt").toString(),
TestSchemas.NUMERICALS_SCHEMA_WITH_LABEL_AT_END,
"");
swig.saveModelToDisk(Paths.get(testOutputModelsFolder, model2Name));

// Do a detailed report (if Python3+termcolor is available):
compareModelFilesAndDoPrettyReport(
referenceModelsFolder,
testOutputModelsFolder
);

// Compare the rewritten models:
assertEqualFileContents(
model1Name,
Paths.get(referenceModelsFolder, model1Name),
Paths.get(testOutputModelsFolder, model1Name));

assertEqualFileContents(
model1Name,
Paths.get(referenceModelsFolder, model2Name),
Paths.get(testOutputModelsFolder, model2Name));
}
}
Loading