com.upwork.common.rerank:rerank is a library to rerank a set of instances containing features using a pre-configured Weka model. The library is designed to load and manage multiple Weka models to enable comparative evaluation of two or more models.
This library can be used for reranking a set of records or data instances based on a machine learning model built using Weka.
A typical use case is reranking the search results returned by a search engine. For such a use case, the input to the library is a list of data instances, one per search result, and a class label whose score the results are reranked on. Each data instance contains the set of features required by the model. The output is an object composed of the reranked instance ids and useful debugging information.
For details on how to use the library, see the "How to use" section below.
The library loads the model configuration files and the model binary files, optionally caches the model instances in memory, and can execute a model on a set of records to rerank them. It also provides debug information for a deeper dive into the performance of the model.
The model binaries must have the extension `.model`, and the configuration for a model should be a `.json` file containing the specification for the model and its features. The library provides a utility class, `JsonConfigGenerator`, to convert the model specification from the standard Weka .ARFF format into the custom `.json` config this library takes. The library expects all files related to a model to have a name that follows the convention `modelName.extension`. For example, if the model name is `modelZ`, then the model binary must be named `modelZ.model`, the .arff file `modelZ.arff`, and the generated .json file `modelZ.json`. The library doesn't support the `date` or `relation` Weka attribute types at present.
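As a small illustration of the naming convention above, the following sketch derives the three file paths for a given model name. `ModelFiles` and its methods are hypothetical helpers written for this example and are not part of the library.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper (not part of the rerank library): derives the file
// paths implied by the modelName.extension naming convention.
class ModelFiles {

    // Path of the Weka model binary, e.g. <repo>/modelZ.model
    static Path binary(String repoDir, String modelName) {
        return Paths.get(repoDir, modelName + ".model");
    }

    // Path of the source ARFF file, e.g. <repo>/modelZ.arff
    static Path arff(String repoDir, String modelName) {
        return Paths.get(repoDir, modelName + ".arff");
    }

    // Path of the generated JSON config, e.g. <repo>/modelZ.json
    static Path jsonConfig(String repoDir, String modelName) {
        return Paths.get(repoDir, modelName + ".json");
    }

    public static void main(String[] args) {
        String repo = "../sample_models";
        System.out.println(binary(repo, "modelZ"));
        System.out.println(arff(repo, "modelZ"));
        System.out.println(jsonConfig(repo, "modelZ"));
    }
}
```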
The library chose a custom format over the standard .ARFF format so as to be sufficiently extensible for real-time use cases, where each feature may have additional configuration for its data source or for functions that compute it on the fly. That is how we use it at Upwork; however, the current version of the library doesn't expose that extended functionality. We may decide to do so in the future.
You can read more about Weka and ARFF by following the links.
- Java version 1.8 or greater
- Maven version 3.2.1 or greater
Follow the steps below to use the library in an application.
- Add Dependency: Include the following Maven dependency in your pom.xml.

      <dependency>
        <groupId>com.upwork.common.rerank</groupId>
        <artifactId>rerank</artifactId>
        <version>{version}</version>
      </dependency>

- Configure: Set the following properties in a file named config.properties and make it available on the classpath of the application, or set them as system properties programmatically.

      ## Relative or absolute path of the repository where model binaries and their configs are stored.
      rerank.models.repo=../sample_models
      ## Names of the supported models
      rerank.models.supported=rerank_model

  The library uses Archaius for configuration management. The default config.properties file is available at Default Config

- Convert ARFF to JSON: JsonConfigGenerator can be used for this purpose. The usage is as follows.

      java com.upwork.rerank.apputils.JsonConfigGenerator <modelName>

  For example, for a model named modelZ:

      java com.upwork.rerank.apputils.JsonConfigGenerator modelZ

  The above command expects a file modelZ.arff in the model repository directory set by the property rerank.models.repo and generates a file named modelZ.json in the same directory. Open modelZ.json and make sure the name and shortName of the model are as expected. By default they are exactly the same as the @relation name in the Weka .arff file, but they should be modified to appropriately reflect the name of the model. The .json file can optionally be formatted for easier readability.

- Create Instances and Rerank: Create instances of features and use the library to rerank them. A typical usage is as follows.

      // Name of the model used for scoring each of the instances
      String modelName = "modelZ";
      // Class label on which the instances are scored
      String classLabelToRerank = "1";
      // Convert the domain-specific data instances to a list of TInstance
      List<TInstance> instances = getData();
      // Get an instance of the rerank lib
      RerankLib lib = RerankLibFactory.getInstance(2, 10).getRerankLib(modelName);
      // Rerank using the lib
      RerankResultSet rerankResultSet = lib.rerank(instances, classLabelToRerank);

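The Configure step mentions that the properties can be set as system properties instead of through config.properties. A minimal sketch of that option follows, using only the two property names documented above. The class `RerankConfigExample` is hypothetical, and it is an assumption that the properties need to be set before the library is first used.

```java
// Hypothetical example class (not part of the rerank library).
class RerankConfigExample {

    // Sets the two documented properties as JVM system properties.
    // Assumption: call this before the library is first used.
    static void configure(String repoPath, String supportedModels) {
        System.setProperty("rerank.models.repo", repoPath);
        System.setProperty("rerank.models.supported", supportedModels);
    }

    public static void main(String[] args) {
        configure("../sample_models", "rerank_model");
        System.out.println(System.getProperty("rerank.models.repo"));
        System.out.println(System.getProperty("rerank.models.supported"));
    }
}
```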
The project includes a sample application, SampleRerankApplication, which uses a sample model named rerank_model to rerank a set of instances. In an actual application the data, i.e. the List of TInstance, would come from the application at runtime; the sample application instead uses the data already available in the model's .arff file to illustrate the usage. Otherwise, the .arff file is of no use after it has been converted into the .json config and can be discarded.
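To make the idea of reranking on a class label concrete, here is a self-contained sketch that orders instance ids by their score for a chosen label. It does not use the library: `RerankSketch`, `Scored`, and the score values are hypothetical, and the library's own scoring and result types may work differently.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only; none of these types belong to the rerank library.
class RerankSketch {

    // Hypothetical stand-in for a scored instance: an id plus one score per class label.
    static final class Scored {
        final String id;
        final Map<String, Double> scoreByLabel = new HashMap<>();

        Scored(String id, double scoreFor0, double scoreFor1) {
            this.id = id;
            scoreByLabel.put("0", scoreFor0);
            scoreByLabel.put("1", scoreFor1);
        }
    }

    // Returns the instance ids ordered by descending score for the given class label.
    static List<String> rerank(List<Scored> scored, String classLabel) {
        List<Scored> sorted = new ArrayList<>(scored);
        sorted.sort(Comparator.comparingDouble(
                (Scored s) -> s.scoreByLabel.getOrDefault(classLabel, 0.0)).reversed());
        List<String> ids = new ArrayList<>();
        for (Scored s : sorted) {
            ids.add(s.id);
        }
        return ids;
    }

    public static void main(String[] args) {
        List<Scored> results = new ArrayList<>();
        results.add(new Scored("r1", 0.7, 0.3));
        results.add(new Scored("r2", 0.1, 0.9));
        results.add(new Scored("r3", 0.4, 0.6));
        System.out.println(rerank(results, "1")); // prints [r2, r3, r1]
    }
}
```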
The JSON format is illustrated below through the config for the sample model rerank_model included in the project. Every feature can optionally have another attribute named customConfig to hold application-specific custom configuration, e.g. data source specifications.
```
{
  "features": [
    {
      "name": "a",
      "dataType": "numeric"
    },
    {
      "name": "b",
      "dataType": "numeric"
    },
    {
      "name": "c",
      "dataType": "numeric"
    },
    {
      "name": "d",
      "dataType": "numeric"
    },
    {
      "name": "e",
      "dataType": "numeric"
    },
    {
      "name": "f",
      "dataType": "nominal",
      "values": [
        "0",
        "1"
      ],
      "class": true
    }
  ],
  "name": "rerank_model",
  "shortName": "rerank_model"
}
```
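The sketch below shows roughly how an ARFF header maps onto the JSON shape above. It is not the code of JsonConfigGenerator: the class `ArffHeaderToJsonSketch` is hypothetical, it handles only simple `numeric` and nominal `{...}` declarations with unquoted names, and it assumes that the last attribute is the class attribute (a common Weka convention, as in the sample model).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not the library's JsonConfigGenerator.
class ArffHeaderToJsonSketch {

    // Builds a compact JSON config string from the header lines of an ARFF file.
    static String toJson(List<String> arffHeaderLines) {
        String relation = "";
        List<String> features = new ArrayList<>();
        for (String raw : arffHeaderLines) {
            String line = raw.trim();
            String lower = line.toLowerCase();
            if (lower.startsWith("@relation")) {
                relation = line.substring("@relation".length()).trim();
            } else if (lower.startsWith("@attribute")) {
                String[] parts = line.substring("@attribute".length()).trim().split("\\s+", 2);
                if (parts.length == 2) {
                    features.add(feature(parts[0], parts[1].trim()));
                }
            } else if (lower.startsWith("@data")) {
                break; // the header ends where the data section starts
            }
        }
        // Assumption: the last attribute is the class attribute.
        if (!features.isEmpty()) {
            int last = features.size() - 1;
            String f = features.get(last);
            features.set(last, f.substring(0, f.length() - 1) + ",\"class\":true}");
        }
        return "{\"features\":[" + String.join(",", features) + "],"
                + "\"name\":\"" + relation + "\",\"shortName\":\"" + relation + "\"}";
    }

    // Renders one feature; a {v1,v2,...} type is treated as nominal.
    static String feature(String name, String type) {
        if (type.startsWith("{") && type.endsWith("}")) {
            List<String> quoted = new ArrayList<>();
            for (String v : type.substring(1, type.length() - 1).split(",")) {
                quoted.add("\"" + v.trim() + "\"");
            }
            return "{\"name\":\"" + name + "\",\"dataType\":\"nominal\",\"values\":["
                    + String.join(",", quoted) + "]}";
        }
        // Other types are passed through lower-cased (an assumption of this sketch).
        return "{\"name\":\"" + name + "\",\"dataType\":\"" + type.toLowerCase() + "\"}";
    }

    public static void main(String[] args) {
        List<String> header = new ArrayList<>();
        header.add("@relation rerank_model");
        header.add("@attribute a numeric");
        header.add("@attribute f {0,1}");
        header.add("@data");
        System.out.println(toJson(header));
    }
}
```

As noted in the conversion step, the generated name and shortName simply repeat the relation name and may need to be edited afterwards.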
- The RerankLib can load multiple models and caches them according to the arguments supplied when creating it. It uses an LRU cache internally.
- If the data type of the class/target is "numeric", a classLabel does not need to be supplied to the rerank call.
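For readers unfamiliar with LRU caching, the following is a minimal sketch of the eviction behaviour built on the standard `LinkedHashMap`. It only illustrates the general technique: `LruCacheSketch` is a hypothetical class, and the library's internal cache and the meaning of the factory arguments are not shown here.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Generic LRU cache sketch; not the cache implementation used by the library.
class LruCacheSketch<K, V> extends LinkedHashMap<K, V> {
    private final int maxEntries;

    LruCacheSketch(int maxEntries) {
        super(16, 0.75f, true); // accessOrder = true: iteration goes from least to most recently used
        this.maxEntries = maxEntries;
    }

    // Called by LinkedHashMap after each insertion; returning true evicts the eldest entry.
    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > maxEntries;
    }

    public static void main(String[] args) {
        LruCacheSketch<String, String> cache = new LruCacheSketch<>(2);
        cache.put("modelA", "A");
        cache.put("modelB", "B");
        cache.get("modelA");      // modelA becomes the most recently used entry
        cache.put("modelC", "C"); // exceeds capacity, so modelB is evicted
        System.out.println(cache.keySet()); // prints [modelA, modelC]
    }
}
```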
MIT