From Zero to Learning to Rank in Apache Solr

This tutorial describes how to implement a modern learning to rank (LTR) system in Apache Solr. The intended audience is people who have zero Solr experience, but who are comfortable with machine learning and information retrieval concepts. I was one of those people only a couple of months ago, and I found it extremely challenging to get up and running with the Solr materials available on the internet. This is my attempt at writing the tutorial I wish I had discovered.

Setting Up Solr

Firing up a vanilla Solr instance on Fedora is actually pretty straightforward. First, download the Solr source tarball (so, one containing "src") from here and extract it to a reasonable location. Next, cd into the Solr directory:

cd /path/to/solr-<version>/solr

Building Solr requires Apache Ant and Apache Ivy, so we'll have to install those:

sudo dnf install ant ivy

And now we'll build Solr.

ant ivy-bootstrap
ant server

You can confirm Solr is working by running:

bin/solr start

and making sure you see the Solr Admin interface at http://localhost:8983/solr/. You can stop Solr (but don't stop it now) with:

bin/solr stop

Solr Basics

Solr is a search platform, so we only really need to know how to do two things to function: (1) index data and (2) define a ranking model. Solr has a REST-like API, which means we'll be making changes with the curl command. To get going, let's first create a core named test:

bin/solr create -c test

This seemingly simple command actually did a lot of stuff behind the scenes. Specifically, it defined a schema, which tells Solr how documents should be processed (think tokenization, stemming, etc.) and searched (e.g., using the tf-idf vector space model), and it set up a configuration file, which specifies what libraries and handlers Solr will use. A core can be deleted with:

bin/solr delete -c test

OK, let's add some documents. First download this XML file of tweets provided on the Solr in Action GitHub. Take a look inside the XML file. Notice how it's using an <add> tag to tell Solr to add several documents (denoted with <doc> tags) to the index. To actually index the tweets, we run:

bin/post -c test /path/to/tweets.xml

Now, if we go to http://localhost:8983/solr/ (you might have to refresh) and click on the "Core Selector" dropdown on the left hand side, we can select the test core. If we then click on the "Query" tab, the query interface will appear. If we click on the blue "Execute Query" button at the bottom, a JSON document containing information regarding the tweets we just indexed will be displayed. Congratulations, you just ran your first successful query! Specifically, you used the /select RequestHandler to execute the query *:*. The *:* is a special syntax that tells Solr to return everything. The Solr query syntax is not very intuitive, in my opinion, so it's something you'll just have to get used to.
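
If you'd rather issue the same query from code than from the Admin UI, here is a minimal sketch that only builds the /select URL. The helper name build_select_url and its defaults are my own for illustration; q, fl, and rows are standard Solr query parameters, and the host and port are the ones used throughout this tutorial.

```python
from urllib.parse import urlencode, parse_qs, urlsplit

# Same host and port as the rest of this tutorial.
SOLR_URL = "http://localhost:8983/solr"


def build_select_url(core, q="*:*", fl=None, rows=10):
    """Return the /select URL for a query against the given core.

    This is a hypothetical helper: it only builds the URL and does not
    contact Solr.
    """
    params = {"q": q, "rows": rows}
    if fl is not None:
        params["fl"] = fl
    # urlencode escapes characters such as "*" and ":" for us.
    return "{0}/{1}/select?{2}".format(SOLR_URL, core, urlencode(params))


url = build_select_url("test", q="*:*", fl="id,score", rows=5)
print(url)
```

Pasting the printed URL into a browser (or passing it to curl) should return the same kind of JSON document as the "Execute Query" button.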

Defining Features

OK, now that we have a basic Solr instance up and running, let's define some features for our LTR system. As with all machine learning problems, effective feature engineering is critical to success. Standard features in modern LTR models include multiple similarity measures (e.g., cosine similarity of tf-idf vectors or BM25) computed over multiple text fields (e.g., body, title), in addition to other text characteristics (e.g., length) and document characteristics (e.g., age, PageRank). A good starting point is this list of features put together by Microsoft Research for an academic data set. A list of some other commonly used features can be found on slide 32 of these lecture notes.

To start off, we're going to modify /path/to/solr-<version>/solr/server/solr/test/conf/managed-schema so that it includes the text fields that we'll need for our model. First, we'll change the text field so that it is of the text_general type (which is already defined inside managed-schema). The text_general type will allow us to calculate BM25 similarities. Because the text field already exists (it was automatically created when we indexed the tweets), we need to use the replace-field command like so:

curl -X POST -H 'Content-type:application/json' --data-binary '{
  "replace-field" : {
     "name":"text",
     "type":"text_general",
     "indexed":"true",
     "stored":"true",
     "multiValued":"true"}
}' http://localhost:8983/solr/test/schema

I encourage you to take a look inside managed-schema following each change so that you can get a sense for what's happening. Next, we're going to specify a text_tfidf type, which will allow us to calculate tf-idf cosine similarities:

curl -X POST -H 'Content-type:application/json' --data-binary '{
  "add-field-type" : {
     "name":"text_tfidf",
     "class":"solr.TextField",
     "positionIncrementGap":"100",
     "indexAnalyzer":{
        "tokenizer":{
           "class":"solr.StandardTokenizerFactory"},
        "filters":[{
           "class":"solr.StopFilterFactory",
           "ignoreCase":"true",
           "words":"stopwords.txt"},
        {
           "class":"solr.LowerCaseFilterFactory"}]},
     "queryAnalyzer":{
        "tokenizer":{
           "class":"solr.StandardTokenizerFactory"},
        "filters":[{
           "class":"solr.StopFilterFactory",
           "ignoreCase":"true",
           "words":"stopwords.txt"},
        {
           "class":"solr.SynonymGraphFilterFactory",
           "ignoreCase":"true",
           "synonyms":"synonyms.txt"},
        {
           "class":"solr.LowerCaseFilterFactory"}]},
     "similarity":{
           "class":"solr.ClassicSimilarityFactory"}}
}' http://localhost:8983/solr/test/schema

Let's now add a text_tfidf field that will be of the text_tfidf type we just defined:

curl -X POST -H 'Content-type:application/json' --data-binary '{
  "add-field" : {
     "name":"text_tfidf",
     "type":"text_tfidf",
     "indexed":"true",
     "stored":"false",
     "multiValued":"true"}
}' http://localhost:8983/solr/test/schema

Because the contents of the text field and the text_tfidf field are the same (we're just handling them differently), we will tell Solr to copy the contents from text to text_tfidf:

curl -X POST -H 'Content-type:application/json' --data-binary '{
  "add-copy-field" : {
     "source":"text",
     "dest":"text_tfidf"}
}' http://localhost:8983/solr/test/schema
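
All of the schema changes above follow the same pattern: a JSON object with a single command key, POSTed to the core's /schema endpoint. As a rough sketch, the same request can be assembled in Python with the standard library. The helper name schema_request is hypothetical, and the request is built but deliberately not sent, so nothing here touches a running Solr instance.

```python
import json
import urllib.request


def schema_request(core, command, body, base="http://localhost:8983/solr"):
    """Build (but do not send) a Schema API POST request.

    `command` is a Schema API command such as "add-field" or
    "add-copy-field"; `body` is the dictionary of its arguments.
    """
    payload = json.dumps({command: body}).encode("utf-8")
    return urllib.request.Request(
        "{0}/{1}/schema".format(base, core),
        data=payload,
        headers={"Content-type": "application/json"},
        method="POST",
    )


request = schema_request("test", "add-copy-field",
                         {"source": "text", "dest": "text_tfidf"})

# Uncomment to send the request to a running Solr instance:
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode("utf-8"))
```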

We're now ready to re-index our data:

bin/post -c test /path/to/tweets.xml

Learning to Rank

Now that our documents are properly indexed, let's build an LTR model. If you're new to LTR, I recommend checking out this (long) paper by Tie-Yan Liu and this textbook also by Liu. If you're familiar with machine learning, the ideas shouldn't be too difficult to grasp. I also recommend checking out the Solr documentation on LTR, which I'll be linking to throughout this section. Enabling LTR in Solr first requires making some changes to /path/to/solr-<version>/solr/server/solr/test/conf/solrconfig.xml. Copy and paste the text below anywhere between the <config> and </config> tags (at the top and bottom of the file, respectively).

<lib dir="${solr.install.dir:../../../..}/contrib/ltr/lib/" regex=".*\.jar" />
<lib dir="${solr.install.dir:../../../..}/dist/" regex="solr-ltr-\d.*\.jar" />

<queryParser name="ltr" class="org.apache.solr.ltr.search.LTRQParserPlugin"/>

<cache name="QUERY_DOC_FV"
       class="solr.search.LRUCache"
       size="4096"
       initialSize="2048"
       autowarmCount="4096"
       regenerator="solr.search.NoOpRegenerator" />

<transformer name="features" class="org.apache.solr.ltr.response.transform.LTRFeatureLoggerTransformerFactory">
  <str name="fvCacheName">QUERY_DOC_FV</str>
</transformer>

We're now ready to run Solr with LTR enabled. First, stop Solr:

bin/solr stop

and then restart it with the LTR plugin enabled:

bin/solr start -Dsolr.ltr.enabled=true

Next, we need to push the model features and the model specification to Solr. In Solr, LTR features are defined using a JSON formatted file. For our model, we'll save the following features in my_efi_features.json:

[
  {
    "store" : "my_efi_feature_store",
    "name" : "tfidf_sim_a",
    "class" : "org.apache.solr.ltr.feature.SolrFeature",
    "params" : { "q" : "{!dismax qf=text_tfidf}${text_a}" }
  },
  {
    "store" : "my_efi_feature_store",
    "name" : "tfidf_sim_b",
    "class" : "org.apache.solr.ltr.feature.SolrFeature",
    "params" : { "q" : "{!dismax qf=text_tfidf}${text_b}" }
  },
  {
    "store" : "my_efi_feature_store",
    "name" : "bm25_sim_a",
    "class" : "org.apache.solr.ltr.feature.SolrFeature",
    "params" : { "q" : "{!dismax qf=text}${text_a}" }
  },
  {
    "store" : "my_efi_feature_store",
    "name" : "bm25_sim_b",
    "class" : "org.apache.solr.ltr.feature.SolrFeature",
    "params" : { "q" : "{!dismax qf=text}${text_b}" }
  },
  {
    "store" : "my_efi_feature_store",
    "name" : "max_sim",
    "class" : "org.apache.solr.ltr.feature.SolrFeature",
    "params" : { "q" : "{!dismax qf='text text_tfidf'}${text}" }
  },
  {
    "store" : "my_efi_feature_store",
    "name" : "original_score",
    "class" : "org.apache.solr.ltr.feature.OriginalScoreFeature",
    "params" : {}
  }
]

store tells Solr where to store the feature, name is the name of the feature, class specifies which Java class will handle the feature, and params provides additional information about the feature required by its Java class. In the case of a SolrFeature, you need to provide the query. {!dismax qf=text_tfidf}${text_a} tells Solr to search the text_tfidf field with the contents of text_a using the DisMaxQParser. We're using the DisMax parser instead of the seemingly more obvious FieldQParser (e.g., {!field f=text_tfidf}${text_a}) because the FieldQParser automatically converts multi-term queries to "phrases" (i.e., it converts something like "the cat in the hat" into, effectively, "the_cat_in_the_hat", rather than "the", "cat", "in", "the", "hat"). This FieldQParser behavior (which seems like a rather strange default to me) ended up giving me quite a headache, but I eventually found a solution with DisMaxQParser.

{!dismax qf='text text_tfidf'}${text} tells Solr to search both the text and text_tfidf fields with the contents of text and then take the max of those two scores. While this feature doesn't really make sense in this context because we're already using similarities from both fields as features, it demonstrates how such a feature could be implemented. For example, imagine that the documents in your corpus are linked to, at most, five other sources of text data. It might make sense to incorporate that information during a search, and taking the max over multiple similarity scores is one way of doing that.

Finally, OriginalScoreFeature "returns the original score that the document had before performing the reranking". This feature is necessary for returning the results in their original ranking when extracting features (note: OriginalScoreFeature is broken on Solr versions prior to 7.1).
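
Since the five SolrFeature entries differ only in their name, field list, and efi key, the feature file can also be generated rather than typed by hand. The following is just a sketch: the helper solr_feature is my own invention, while the store name, class names, and query strings are the ones from my_efi_features.json above.

```python
import json

STORE = "my_efi_feature_store"


def solr_feature(name, fields, efi_key):
    """Return a SolrFeature definition that searches `fields` with DisMax.

    Hypothetical helper: `fields` is a space-separated list of field names
    and `efi_key` is the external feature information key to substitute.
    """
    # Multiple fields have to be quoted inside the local params.
    qf = "'{0}'".format(fields) if " " in fields else fields
    return {
        "store": STORE,
        "name": name,
        "class": "org.apache.solr.ltr.feature.SolrFeature",
        "params": {"q": "{{!dismax qf={0}}}${{{1}}}".format(qf, efi_key)},
    }


features = [
    solr_feature("tfidf_sim_a", "text_tfidf", "text_a"),
    solr_feature("tfidf_sim_b", "text_tfidf", "text_b"),
    solr_feature("bm25_sim_a", "text", "text_a"),
    solr_feature("bm25_sim_b", "text", "text_b"),
    solr_feature("max_sim", "text text_tfidf", "text"),
    {
        "store": STORE,
        "name": "original_score",
        "class": "org.apache.solr.ltr.feature.OriginalScoreFeature",
        "params": {},
    },
]

# The resulting string can be written to my_efi_features.json.
features_json = json.dumps(features, indent=2)
print(features_json)
```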

To push the features to Solr, we run the following command:

curl -XPUT 'http://localhost:8983/solr/test/schema/feature-store' --data-binary "@/path/to/my_efi_features.json" -H 'Content-type:application/json'

If you ever want to upload new features, you have to first delete the old features with:

curl -XDELETE 'http://localhost:8983/solr/test/schema/feature-store/my_efi_feature_store'

Next, we'll save the following model specification in my_efi_model.json:

{
  "store" : "my_efi_feature_store",
  "name" : "my_efi_model",
  "class" : "org.apache.solr.ltr.model.LinearModel",
  "features" : [
    { "name" : "tfidf_sim_a" },
    { "name" : "tfidf_sim_b" },
    { "name" : "bm25_sim_a" },
    { "name" : "bm25_sim_b" },
    { "name" : "max_sim" },
    { "name" : "original_score" }
  ],
  "params" : {
    "weights" : {
      "tfidf_sim_a" : 0.0,
      "tfidf_sim_b" : 0.0,
      "bm25_sim_a" : 0.0,
      "bm25_sim_b" : 0.0,
      "max_sim" : 0.0,
      "original_score" : 1.0
    }
  }
}

store specifies where the features the model is using are stored, name is the name of the model, class specifies which Java class will implement the model, features is a list of the model features, and params provides additional information required by the model's Java class. To start off with, we'll use the LinearModel, which simply takes a weighted sum of the feature values to generate a score. Here, we assign a weight of 0.0 to each feature except original_score, which is assigned a weight of 1.0. This weighting scheme will ensure the results are returned in their original order. To find better weights, we'll need to extract training data from Solr. I'll go over this topic in more depth in the RankNet section.
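
To make the weighted sum concrete, here is a small sketch of what a linear model computes for a single document. The function name is my own and the feature values are made up for illustration; the weights are the ones from my_efi_model.json.

```python
def linear_model_score(weights, features):
    """Weighted sum of feature values, as a linear LTR model computes it."""
    return sum(weights[name] * features[name] for name in weights)


weights = {
    "tfidf_sim_a": 0.0,
    "tfidf_sim_b": 0.0,
    "bm25_sim_a": 0.0,
    "bm25_sim_b": 0.0,
    "max_sim": 0.0,
    "original_score": 1.0,
}

# Illustrative feature values for one document (not real Solr output).
features = {
    "tfidf_sim_a": 0.35,
    "tfidf_sim_b": 0.0,
    "bm25_sim_a": 0.93,
    "bm25_sim_b": 0.93,
    "max_sim": 1.86,
    "original_score": 1.86,
}

score = linear_model_score(weights, features)
print(score)
```

With these weights, every feature except original_score is multiplied by zero, so the model score equals the original score and the ranking does not change.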

We can push the model to Solr with:

curl -XPUT 'http://localhost:8983/solr/test/schema/model-store' --data-binary "@/path/to/my_efi_model.json" -H 'Content-type:application/json'

And now we're ready to run our first LTR query:

http://localhost:8983/solr/test/query?q=historic north&df=text&rq={!ltr model=my_efi_model efi.text_a=historic efi.text_b=north efi.text='historic north'}&fl=id,score,[features]

You should see something like:

{
  "responseHeader":{
    "status":0,
    "QTime":1,
    "params":{
      "q":"historic north",
      "df":"text",
      "fl":"id,score,[features]",
      "rq":"{!ltr model=my_efi_model efi.text_a=historic efi.text_b=north efi.text='historic north'}"}},
  "response":{"numFound":1,"start":0,"maxScore":1.8617721,"docs":[
      {
        "id":"1",
        "score":1.8617721,
        "[features]":"tfidf_sim_a=0.35304558,tfidf_sim_b=0.0,bm25_sim_a=0.93088603,bm25_sim_b=0.93088603,max_sim=1.8617721,original_score=1.8617721"}]
  }}

Referring back to the request, q=historic north is the query used to fetch the initial results (using BM25 in this case), which are then re-ranked with the LTR model. df=text specifies the default field for Solr to search. rq is where all of the LTR parameters are provided. efi stands for "external feature information", which allows you to specify additional inputs at query time. In this case, we're populating the text_a argument with the term historic, the text_b argument with the term north, and the text argument with the multi-term query 'historic north' (note, this is not being treated as a "phrase"). fl=id,score,[features] tells Solr to include the id, score, and model features in the results. You can verify that the feature values are correct by performing the associated search in the "Query" interface of the Solr Admin UI. For example, typing text_tfidf:historic in the q text box and typing score in the fl text box and then clicking the "Execute Query" button should return a value of 0.35304558.
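
Outside of a browser, the spaces and braces in that request need to be URL-encoded, and the [features] value comes back as a single comma-separated string. Here is a sketch of both steps. The helper names are hypothetical; the parameter names (q, df, rq, fl, efi.*) and the sample feature string are taken from the request and response above.

```python
from urllib.parse import urlencode


def build_ltr_params(query, model, efi, fl="id,score,[features]"):
    """Return the query parameters for an LTR re-ranking request."""
    efi_str = " ".join("efi.{0}='{1}'".format(k, v) for k, v in efi.items())
    rq = "{{!ltr model={0} {1}}}".format(model, efi_str)
    return {"q": query, "df": "text", "rq": rq, "fl": fl}


def parse_features(feature_string):
    """Turn 'name=value,name=value' into a dictionary of floats."""
    pairs = (item.split("=") for item in feature_string.split(","))
    return {name: float(value) for name, value in pairs}


params = build_ltr_params(
    "historic north", "my_efi_model",
    {"text_a": "historic", "text_b": "north", "text": "historic north"})
url = "http://localhost:8983/solr/test/query?" + urlencode(params)

values = parse_features(
    "tfidf_sim_a=0.35304558,tfidf_sim_b=0.0,bm25_sim_a=0.93088603,"
    "bm25_sim_b=0.93088603,max_sim=1.8617721,original_score=1.8617721")
print(url)
print(values)
```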

RankNet

For LTR systems, linear models are generally trained using what's called a "pointwise" approach, which is where documents are considered individually (i.e., the model asks, "Is this document relevant to the query or not?"); however, pointwise approaches are generally not well-suited for LTR problems. RankNet is a neural network that uses a "pairwise" approach, which is where documents with a known relative preference are considered in pairs (i.e., the model asks, "Is document A more relevant than document B for the query or not?"). RankNet is available in Solr as of version 7.3 (you can verify your version of Solr includes RankNet by inspecting /path/to/solr-<version>/solr/dist/solr-ltr-{version}-SNAPSHOT.jar and looking for NeuralNetworkModel.class under /org/apache/solr/ltr/model/). I've also implemented RankNet in Keras for model training. It's worth noting that LambdaMART might be more appropriate for your particular search application. However, RankNet can be trained quickly on a GPU using Keras, which makes it a good solution for search problems where only one document is relevant to any given query. For a nice (technical) overview of RankNet, LambdaRank, and LambdaMART, see this paper by Chris Burges from (at the time) Microsoft Research.
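
As a quick illustration of the pairwise idea, RankNet passes the difference between the two document scores through a sigmoid to get the probability that document A should rank above document B, and trains with a cross-entropy loss on that probability. The sketch below uses plain Python with function names of my own choosing, and the scores are made-up numbers rather than network outputs.

```python
import math


def pairwise_probability(score_a, score_b):
    """Probability that document A should be ranked above document B."""
    return 1.0 / (1.0 + math.exp(-(score_a - score_b)))


def pairwise_loss(score_a, score_b):
    """Cross-entropy loss when A is known to be more relevant than B."""
    return -math.log(pairwise_probability(score_a, score_b))


# Equal scores: the model has no preference between the two documents.
print(pairwise_probability(1.0, 1.0))

# The loss is small when the relevant document scores higher and large
# when the irrelevant one does.
print(pairwise_loss(2.0, 0.0), pairwise_loss(0.0, 2.0))
```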

Unfortunately, the suggested method of feature extraction in Solr is painfully slow (other Solr users seem to agree it could be faster). Even when making the requests in parallel, it took me almost three days to extract features for ~200,000 queries. I think a better approach might be to do something like this, where you index the queries and then calculate the similarities between the "documents" (which consist of the true documents and queries), but this is really something that should be baked into Solr. Anyway, here is some example Python code for extracting features from Solr using queries:

import numpy as np
import requests
import simplejson

# Number of documents to be re-ranked.
RERANK = 50
with open("RERANK.int", "w") as f:
    f.write(str(RERANK))

# Build the query. `row` is assumed to be one record of your query data set
# (with "id", "text_a", and "text_b" entries), and `relevant` (used below) is
# assumed to map each document ID to the IDs of the queries it is relevant to.
q_id = row["id"]
text_a = row["text_a"].strip().lower()
text_b = row["text_b"].strip().lower()
text = " ".join([text_a, text_b])

url = "http://localhost:8983/solr/test/query"
# We only re-rank one document when extracting features because we want to be
# able to compare the LTR model to the BM25 ranking. Setting reRankDocs=1
# ensures the original ranking is maintained.
rq = "{!ltr model=my_efi_model reRankDocs=1 "
rq += "efi.text_a='{0}' efi.text_b='{1}' efi.text='{2}'}}".format(text_a, text_b, text)
params = {"q": text, "df": "text", "rq": rq,
          "fl": "id,score,[features]", "rows": RERANK}

# Get response and check for errors. Passing `params` lets requests URL-encode
# the spaces and braces in the query.
response = requests.get(url, params=params)
try:
    result = simplejson.loads(response.text)
except simplejson.JSONDecodeError:
    result = {"error": "invalid JSON"}

# Log the query ID on failure; no documents are processed in that case.
if "error" in result:
    print(q_id)
docs = result.get("response", {}).get("docs", [])

# Extract the features.
results_features = []
results_targets = []
results_ranks = []
add_data = False

for (rank, document) in enumerate(docs):

    features = document["[features]"].split(",")
    feature_array = []
    for feature in features:
        feature_array.append(feature.split("=")[1])

    feature_array = np.array(feature_array, dtype = "float32")
    results_features.append(feature_array)

    doc_id = document["id"]
    # Check if document is relevant to query.
    if q_id in relevant.get(doc_id, {}):
        results_ranks.append(rank + 1)
        results_targets.append(1)
        add_data = True
    else:
        results_targets.append(0)

if add_data:
    np.save("{0}_X.npy".format(q_id), np.array(results_features))
    np.save("{0}_y.npy".format(q_id), np.array(results_targets))
    np.save("{0}_rank.npy".format(q_id), np.array(results_ranks))
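
I mentioned making the requests in parallel. One way to do that, sketched here with the standard library, is to wrap the per-query extraction in a function and run it over a thread pool. The names extract_in_parallel and fake_extract are hypothetical, and fake_extract is only a stand-in so that the example runs without a Solr instance; in practice you would pass a function containing the extraction code above.

```python
from concurrent.futures import ThreadPoolExecutor


def extract_in_parallel(rows, extract_one, max_workers=8):
    """Apply `extract_one` to every row using a pool of threads.

    Returns a list of (query ID, result or exception) tuples in the same
    order as `rows`, so one failed query does not stop the whole run.
    """
    def safe_extract(row):
        try:
            return row["id"], extract_one(row)
        except Exception as error:
            return row["id"], error

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(safe_extract, rows))


def fake_extract(row):
    # Stand-in for the real Solr request and feature parsing.
    return len(row["text_a"]) + len(row["text_b"])


rows = [
    {"id": "q1", "text_a": "historic", "text_b": "north"},
    {"id": "q2", "text_a": "solr", "text_b": "ltr"},
]
results = extract_in_parallel(rows, fake_extract, max_workers=2)
print(results)
```

Threads are a reasonable fit here because the work is dominated by waiting on HTTP responses rather than by computation.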

We're now ready to train some models. To start off with, we'll pull in the data and evaluate the BM25 rankings on the entire data set.

import glob
import numpy as np

rank_files = glob.glob("*_rank.npy")
suffix_len = len("_rank.npy")

RERANK = int(open("RERANK.int").read())

ranks = []
casenumbers = []
Xs = []
ys = []
for rank_file in rank_files:
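    X = np.load(rank_file[:-suffix_len] + "_X.npy")
    # Skip queries that returned fewer than RERANK documents.
    if X.shape[0] != RERANK:
        print(rank_file[:-suffix_len])
        continue

    # Strip the "_rank.npy" suffix to recover the query ID. Appending after
    # the check keeps casenumbers aligned with ranks, Xs, and ys.
    casenumbers.append(rank_file[:-suffix_len])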

    rank = np.load(rank_file)[0]
    ranks.append(rank)
    y = np.load(rank_file[:-suffix_len] + "_y.npy")
    Xs.append(X)
    ys.append(y)

ranks = np.array(ranks)
total_queries = len(ranks)
print("Total Queries: {0}".format(total_queries))
print("Top 1: {0}".format((ranks == 1).sum() / total_queries))
print("Top 3: {0}".format((ranks <= 3).sum() / total_queries))
print("Top 5: {0}".format((ranks <= 5).sum() / total_queries))
print("Top 10: {0}".format((ranks <= 10).sum() / total_queries))
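
The same four "Top k" lines show up after every model in this tutorial, so it can be convenient to wrap them in a helper. This is only a sketch with a name of my choosing, and the ranks in the example are made up.

```python
def top_k_accuracy(ranks, ks=(1, 3, 5, 10)):
    """Fraction of queries whose relevant document is ranked in the top k."""
    total = len(ranks)
    return {k: sum(1 for rank in ranks if rank <= k) / total for k in ks}


# Made-up ranks of the relevant document for five queries.
example_ranks = [1, 2, 4, 7, 12]
accuracy = top_k_accuracy(example_ranks)
for k, value in accuracy.items():
    print("Top {0}: {1}".format(k, value))
```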

Next, we'll build and evaluate a (pointwise) linear support vector machine.

from scipy.stats import rankdata
from sklearn.svm import LinearSVC

X = np.concatenate(Xs, 0)
y = np.concatenate(ys)

train_per = 0.8
train_cutoff = int(train_per * len(ranks)) * RERANK
train_X = X[:train_cutoff]
train_y = y[:train_cutoff]
test_X = X[train_cutoff:]
test_y = y[train_cutoff:]

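model = LinearSVC()
model.fit(train_X, train_y)
# Signed distance to the separating hyperplane; larger means "more relevant".
# This is the public alternative to the private _predict_proba_lr method and
# produces the same ranking.
all_scores = model.decision_function(test_X)

n_test = int(len(test_y) / RERANK)
new_ranks = []
for i in range(n_test):
    start = i * RERANK
    end = start + RERANK
    scores = all_scores[start:end]
    score_ranks = rankdata(-scores)
    # Index of the relevant document in the original (BM25) ordering.
    rel_index = np.argmax(test_y[start:end])
    new_rank = score_ranks[rel_index]
    new_ranks.append(new_rank)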

new_ranks = np.array(new_ranks)
print("Total Queries: {0}".format(n_test))
print("Top 1: {0}".format((new_ranks == 1).sum() / n_test))
print("Top 3: {0}".format((new_ranks <= 3).sum() / n_test))
print("Top 5: {0}".format((new_ranks <= 5).sum() / n_test))
print("Top 10: {0}".format((new_ranks <= 10).sum() / n_test))

Now we can try out RankNet. First we'll assemble the training data so that each row consists of a relevant document vector concatenated with an irrelevant document vector (for a given query). Because we returned 50 rows in the feature extraction phase, each query will have 49 document pairs in the data set.

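Xs = []
for rank_file in rank_files:
    X = np.load(rank_file[:-suffix_len] + "_X.npy")
    if X.shape[0] != RERANK:
        print(rank_file[:-suffix_len])
        continue

    # Drop the last column (original_score): the RankNet model below is built
    # on the other five features only.
    X = X[:, :-1]

    rank = np.load(rank_file)[0]
    pos_example = X[rank - 1]
    for (i, neg_example) in enumerate(X):
        if i == rank - 1:
            continue
        Xs.append(np.concatenate((pos_example, neg_example)))

X = np.stack(Xs)
# Each row holds two document vectors, so half the width is the feature count.
dim = int(X.shape[1] / 2)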

train_per = 0.8
train_cutoff = int(train_per * len(ranks)) * (RERANK - 1)

train_X = X[:train_cutoff]
test_X = X[train_cutoff:]

Here, we build the model in Keras.

from keras import backend
from keras.callbacks import ModelCheckpoint
from keras.layers import Activation, Add, Dense, Input, Lambda
from keras.models import Model

y = np.ones((train_X.shape[0], 1))

INPUT_DIM = 5
h_1_dim = 64
h_2_dim = h_1_dim // 2
h_3_dim = h_2_dim // 2

# Model.
h_1 = Dense(h_1_dim, activation = "relu")
h_2 = Dense(h_2_dim, activation = "relu")
h_3 = Dense(h_3_dim, activation = "relu")
s = Dense(1)

# Relevant document score.
rel_doc = Input(shape = (INPUT_DIM, ), dtype = "float32")
h_1_rel = h_1(rel_doc)
h_2_rel = h_2(h_1_rel)
h_3_rel = h_3(h_2_rel)
rel_score = s(h_3_rel)

# Irrelevant document score.
irr_doc = Input(shape = (INPUT_DIM, ), dtype = "float32")
h_1_irr = h_1(irr_doc)
h_2_irr = h_2(h_1_irr)
h_3_irr = h_3(h_2_irr)
irr_score = s(h_3_irr)

# Subtract scores.
negated_irr_score = Lambda(lambda x: -1 * x, output_shape = (1, ))(irr_score)
diff = Add()([rel_score, negated_irr_score])

# Pass difference through sigmoid function.
prob = Activation("sigmoid")(diff)

# Build model.
model = Model(inputs = [rel_doc, irr_doc], outputs = prob)
model.compile(optimizer = "adagrad", loss = "binary_crossentropy")

And now to train and test the model.

NUM_EPOCHS = 30
BATCH_SIZE = 32
checkpointer = ModelCheckpoint(filepath = "valid_params.h5", verbose = 1, save_best_only = True)
history = model.fit([train_X[:, :dim], train_X[:, dim:]], y,
                     epochs = NUM_EPOCHS, batch_size = BATCH_SIZE, validation_split = 0.05,
                     callbacks = [checkpointer], verbose = 2)

model.load_weights("valid_params.h5")
get_score = backend.function([rel_doc], [rel_score])
n_test = int(test_X.shape[0] / (RERANK - 1))
new_ranks = []
for i in range(n_test):
    start = i * (RERANK - 1)
    end = start + (RERANK - 1)
    pos_score = get_score([test_X[start, :dim].reshape(1, dim)])[0]
    neg_scores = get_score([test_X[start:end, dim:]])[0]

    scores = np.concatenate((pos_score, neg_scores))
    score_ranks = rankdata(-scores)
    new_rank = score_ranks[0]
    new_ranks.append(new_rank)

new_ranks = np.array(new_ranks)
print("Total Queries: {0}".format(n_test))
print("Top 1: {0}".format((new_ranks == 1).sum() / n_test))
print("Top 3: {0}".format((new_ranks <= 3).sum() / n_test))
print("Top 5: {0}".format((new_ranks <= 5).sum() / n_test))
print("Top 10: {0}".format((new_ranks <= 10).sum() / n_test))

# Compare to BM25.
old_ranks = ranks[-n_test:]
print("Total Queries: {0}".format(n_test))
print("Top 1: {0}".format((old_ranks == 1).sum() / n_test))
print("Top 3: {0}".format((old_ranks <= 3).sum() / n_test))
print("Top 5: {0}".format((old_ranks <= 5).sum() / n_test))
print("Top 10: {0}".format((old_ranks <= 10).sum() / n_test))

If the model's results are satisfactory, we can save the parameters to a JSON file to be pushed to Solr:

import json

weights = model.get_weights()
solr_model = {"store" : "my_efi_feature_store",
              "name" : "my_ranknet_model",
              "class" : "org.apache.solr.ltr.model.NeuralNetworkModel",
              "features" : [
                { "name" : "tfidf_sim_a" },
                { "name" : "tfidf_sim_b" },
                { "name" : "bm25_sim_a" },
                { "name" : "bm25_sim_b" },
                { "name" : "max_sim" }
              ],
              "params": {}}
layers = []
layers.append({"matrix": weights[0].T.tolist(),
               "bias": weights[1].tolist(),
               "activation": "relu"})
layers.append({"matrix": weights[2].T.tolist(),
               "bias": weights[3].tolist(),
               "activation": "relu"})
layers.append({"matrix": weights[4].T.tolist(),
               "bias": weights[5].tolist(),
               "activation": "relu"})
layers.append({"matrix": weights[6].T.tolist(),
               "bias": weights[7].tolist(),
               "activation": "identity"})
solr_model["params"]["layers"] = layers

with open("my_ranknet_model.json", "w") as out:
    json.dump(solr_model, out, indent = 4)

and it's pushed the same as before:

curl -XPUT 'http://localhost:8983/solr/test/schema/model-store' --data-binary "@/path/to/my_ranknet_model.json" -H 'Content-type:application/json'
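
If you want to sanity-check the exported file before trusting it, you can re-implement the forward pass in a few lines. The sketch below reflects my understanding of how the layers are evaluated: each row of matrix produces one output node, which is why the Keras kernels (stored as inputs by outputs) are transposed during the export. The function names and the tiny two-layer network are made up for illustration, and only the "relu" and "identity" activations used above are handled.

```python
def apply_layer(layer, inputs):
    """Compute activation(matrix . inputs + bias) for one exported layer."""
    outputs = []
    for row, bias in zip(layer["matrix"], layer["bias"]):
        total = sum(w * x for w, x in zip(row, inputs)) + bias
        if layer["activation"] == "relu":
            total = max(0.0, total)
        outputs.append(total)
    return outputs


def score_document(layers, features):
    """Run a feature vector through every layer and return the final score."""
    values = features
    for layer in layers:
        values = apply_layer(layer, values)
    return values[0]


# Toy network: two inputs, two hidden nodes, one output (made-up weights).
toy_layers = [
    {"matrix": [[1.0, -1.0], [0.5, 0.5]], "bias": [0.0, 1.0],
     "activation": "relu"},
    {"matrix": [[2.0, 1.0]], "bias": [0.5], "activation": "identity"},
]
toy_score = score_document(toy_layers, [1.0, 3.0])
print(toy_score)
```

Running score_document on solr_model["params"]["layers"] with a feature vector from the training data should give a value close to the score the Keras model assigns to the same vector.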

We can also perform an LTR query like before, except this time we'll use model=my_ranknet_model.

http://localhost:8983/solr/test/query?q=historic north&df=text&rq={!ltr model=my_ranknet_model efi.text_a=historic efi.text_b=north efi.text='historic north'}&fl=id,score,[features]

And there you have it — a modern learning to rank setup in Apache Solr.
