vespa-cloud/vespa-documentation-search


Vespa Documentation Search

Vespa Documentation Search is a Vespa Cloud instance for searching documents in:

This sample app is auto-deployed to Vespa Cloud; see deploy-vespa-documentation-search.yaml.

[Figure: Vespa Documentation Search architecture]

Deployment status:

  • Deploy vespa-documentation-search to Vespa Cloud
  • Vespa Cloud Documentation Search Deployment

Query API

Open API endpoints:

Example requests:

$ curl "https://doc-search.vespa.oath.cloud/document/v1/open/doc/docid/open%2Fen%2Freference%2Fquery-api-reference.html"
$ curl --data-urlencode 'yql=select * from doc where userInput(@userinput)' \
  --data-urlencode 'userinput=vespa ranking is great' \
  https://doc-search.vespa.oath.cloud/search/
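
The same two requests can also be assembled programmatically. The sketch below only builds the URLs and sends nothing. The helper names `document_url` and `search_url` are made up for illustration, the host is the one from the curl examples, and the search request is expressed as GET query parameters rather than the form-encoded body curl sends:

```python
from urllib.parse import quote, urlencode

HOST = "https://doc-search.vespa.oath.cloud"  # host taken from the curl examples


def document_url(namespace, doctype, docid):
    """Build a /document/v1 GET URL; the docid is percent-encoded, slashes included."""
    return f"{HOST}/document/v1/{namespace}/{doctype}/docid/{quote(docid, safe='')}"


def search_url(userinput):
    """Build a /search/ URL that passes the user text through the @userinput reference."""
    params = {
        "yql": "select * from doc where userInput(@userinput)",
        "userinput": userinput,
    }
    return f"{HOST}/search/?{urlencode(params)}"


print(document_url("open", "doc", "open/en/reference/query-api-reference.html"))
print(search_url("vespa ranking is great"))
```

Keeping the user text in a separate `userinput` parameter, as the curl example does, avoids having to escape it inside the YQL string.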

Using these endpoints is a good way to get started with Vespa. To run your own instance, see the GitHub deploy action (use vespa:deploy to deploy to a dev instance), or the quick-start to deploy using Docker.

Refer to getting-started-ranking for example use of the Query API.

Feed your own instance

It is easy to set up your own instance on Vespa Cloud and feed documents from vespa-engine/documentation:

1: Generate the open_index.json feed file: cd vespa-engine/documentation && bundle exec jekyll build -p _plugins-vespafeed. Refer to the vespa_index_generator.rb for how the feed file is generated.

2: Add data plane credentials:

$ pwd; ll *.pem
/Users/myuser/github/vespa-engine/documentation
-rwxr-xr-x@ 1 myuser  staff  3272 Mar 17 09:30 data-plane-private-key.pem
-rwxr-xr-x@ 1 myuser  staff  1696 Mar 17 09:30 data-plane-public-key.pem

3: Set endpoint in _config.yml (get this from the Vespa Cloud Console):

diff --git a/_config.yml b/_config.yml
...
     feed_endpoints:
-        - url: https://vespacloud-docsearch.vespa-team.aws-us-east-1c.z.vespa-app.cloud/
-          indexes:
-              - open_index.json
-        - url: https://vespacloud-docsearch.vespa-team.aws-ap-northeast-1a.z.vespa-app.cloud/
+        - url: https://myinstance.vespacloud-docsearch.mytenant.aws-us-east-1c.dev.z.vespa-app.cloud/
           indexes:

4: Feed open_index.json:

$ ./feed_to_vespa.py

Ranking

The ranking is quite simple, and serves as an introduction to using query features and summary features:

    rank-profile documentation inherits default {
        inputs {
            query(titleWeight) double: 2.0
            query(headersWeight) double: 1.0
            query(contentWeight) double: 1.0
            query(keywordsWeight) double: 10.0
            query(pathWeight) double: 1.0
        }
        first-phase {
            expression {
                query(titleWeight) * bm25(title) +
                query(contentWeight) * bm25(content) +
                query(headersWeight) * bm25(headers) +
                query(pathWeight) * bm25(path) +
                query(keywordsWeight) * bm25(keywords)
            }
        }
        summary-features {
            query(titleWeight)
            query(contentWeight)
            query(headersWeight)
            query(pathWeight)
            fieldMatch(title)
            fieldMatch(content)
            fieldMatch(content).matches
            fieldLength(title)
            fieldLength(content)
            bm25(title)
            bm25(content)
            bm25(headers)
            bm25(path)
            bm25(keywords)
        }
    }

With this, it is easy to experiment with ranking by overriding the query inputs in the request and observing the resulting values in summary-features, like:

doc-search.vespa.oath.cloud/search/?yql=select * from doc where userInput(@userinput)&ranking=documentation&input.query(pathWeight)=10&userinput=vespa ranking is great
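
To see what such an override does to the score, the first-phase expression can be reproduced offline. This is a minimal sketch: the weights are the defaults from the rank profile above, while the bm25 values and the function name `first_phase_score` are invented for illustration (real bm25 values would be read from summary-features in a query response):

```python
# Default weights copied from the inputs block of the rank profile.
DEFAULT_WEIGHTS = {
    "title": 2.0,
    "headers": 1.0,
    "content": 1.0,
    "keywords": 10.0,
    "path": 1.0,
}


def first_phase_score(bm25, overrides=None):
    """Weighted sum of per-field bm25 scores, mirroring the first-phase expression."""
    weights = {**DEFAULT_WEIGHTS, **(overrides or {})}
    return sum(weights[field] * bm25.get(field, 0.0) for field in weights)


# Hypothetical bm25 values for a single hit.
bm25 = {"title": 1.5, "content": 3.0, "headers": 0.5, "path": 2.0, "keywords": 0.0}

print(first_phase_score(bm25))                  # default weights
print(first_phase_score(bm25, {"path": 10.0}))  # same effect as input.query(pathWeight)=10
```

With these made-up values the score moves from 8.5 to 26.5, since only the path term changes (from 1 * 2.0 to 10 * 2.0).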

Pages can also set (comma-separated) keywords in the frontmatter to rank higher for those terms. See approximate-nn-hnsw.md for an example:

---
title: "Approximate Nearest Neighbor Search using HNSW Index"
keywords: "ann, approximate nearest neighbor"
---
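
How the feed generator turns this frontmatter value into the keywords field is not shown here. As an assumption, splitting the comma-separated string could look like the sketch below (`parse_keywords` is a hypothetical helper, not part of the repository):

```python
def parse_keywords(value):
    """Split a comma-separated frontmatter value into trimmed, non-empty keywords."""
    return [keyword.strip() for keyword in value.split(",") if keyword.strip()]


print(parse_keywords("ann, approximate nearest neighbor"))
```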

Document feed automation

Vespa Documentation is stored in GitHub:

Jekyll is used to serve the documentation; it rebuilds at each commit.

A change also triggers GitHub Actions. The Build step in the workflow uses the Jekyll Generator plugin to build a JSON feed, used in the Feed step:

Vespa Cloud secures endpoints using mTLS. Secrets can be stored in GitHub Settings for a repository. Here, the private key secret is accessed in the GitHub Actions workflow that feeds to Vespa Cloud: feed.yml
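
For illustration, an mTLS client can be sketched with the Python standard library. This is not the repository's feed script: the function names are hypothetical, and it assumes the file called data-plane-public-key.pem holds an X.509 certificate and data-plane-private-key.pem the matching private key. The first helper maps a document id of the form `id:<namespace>:<doctype>::<docid>` to its /document/v1 path:

```python
import http.client
import ssl
from urllib.parse import quote


def document_v1_path(document_id):
    """Map 'id:<namespace>:<doctype>::<docid>' to the matching /document/v1 path."""
    scheme, namespace, doctype, _key_values, docid = document_id.split(":", 4)
    if scheme != "id":
        raise ValueError(f"not a document id: {document_id}")
    return f"/document/v1/{namespace}/{doctype}/docid/{quote(docid, safe='')}"


def mtls_connection(host, cert_file, key_file):
    """Open an HTTPS connection that presents a client certificate (mTLS)."""
    context = ssl.create_default_context()
    context.load_cert_chain(certfile=cert_file, keyfile=key_file)
    return http.client.HTTPSConnection(host, 443, context=context)


print(document_v1_path("id:open:doc::open/en/reference/query-api-reference.html"))
```

A caller would pass the endpoint host from the Vespa Cloud Console together with the two .pem files to `mtls_connection`, then issue requests against paths built by `document_v1_path`.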

Document processing

The documents are split into paragraphs for multi-vector ranking, see example in feed-split.py.
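
The actual logic lives in feed-split.py and may differ (for example, it might split on HTML elements). As a rough idea of the technique only, splitting plain text on blank lines could look like this, with `split_paragraphs` as a hypothetical name:

```python
import re


def split_paragraphs(text):
    """Split text on blank lines and drop empty fragments."""
    return [part.strip() for part in re.split(r"\n\s*\n", text) if part.strip()]


sample = "First paragraph.\n\nSecond paragraph,\nstill the second.\n\n\nThird."
for number, paragraph in enumerate(split_paragraphs(sample)):
    print(number, repr(paragraph))
```

Each paragraph can then be fed as its own element, so that ranking can consider the best-matching paragraph rather than the whole page.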

Query integration

Query results are open to the internet. To access Vespa Documentation Search, an AWS Lambda function gets the private key secret from AWS Parameter Store, then adds it to the HTTPS request to Vespa Cloud:

The lambda needs AmazonSSMReadOnlyAccess added to its Role to access the Parameter Store.

Note that JSON-P is used (jsoncallback=); this simplifies the search result page: search.html.
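
JSON-P wraps the JSON response in a call to the function named by the callback parameter, so a page can load results through a script tag. The sketch below illustrates the wrapping and unwrapping. The exact `name(...);` shape is an assumption about the response format, and both helper names are hypothetical:

```python
import json


def wrap_jsonp(callback, payload):
    """Wrap a JSON-serializable payload in a call to the callback function."""
    return f"{callback}({json.dumps(payload)});"


def unwrap_jsonp(text, callback):
    """Recover the JSON payload from a 'callback(...);' response."""
    prefix = callback + "("
    text = text.strip().rstrip(";")
    if not (text.startswith(prefix) and text.endswith(")")):
        raise ValueError("response is not wrapped in the expected callback")
    return json.loads(text[len(prefix):-1])


wrapped = wrap_jsonp("handleResults", {"root": {"children": []}})
print(wrapped)
print(unwrap_jsonp(wrapped, "handleResults"))
```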

Vespa Cloud Development and Deployments

This is a Vespa Cloud application and therefore uses automated deployments.

The feed can contain an array of links from each document. The OutLinksDocumentProcessor is custom Java code that adds an in-link to each target document using the Vespa Document API.

To test this functionality, the VespaDocSystemTest runs for each deployment.
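
The out-link to in-link transformation itself is simple to state. The sketch below shows only the data transformation, in Python rather than Java, with made-up document names. The real processor issues updates through the Vespa Document API instead of building a dictionary:

```python
from collections import defaultdict


def inlinks_from_outlinks(outlinks):
    """Invert {source: [targets]} into {target: [sources]}."""
    inlinks = defaultdict(list)
    for source, targets in outlinks.items():
        for target in targets:
            inlinks[target].append(source)
    return dict(inlinks)


# Hypothetical documents: a.html links to b.html and c.html, b.html links to c.html.
outlinks = {"a.html": ["b.html", "c.html"], "b.html": ["c.html"]}
print(inlinks_from_outlinks(outlinks))
```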

Creating a System Test is also a great way to develop a Vespa application:

  • Use this application as a starting point
  • Create a Vespa Cloud tenant (i.e. account), and set tenant in pom.xml
  • Deploy the application to Vespa Cloud
  • Run the System Test from Maven or an IDE using the Endpoint

Feed grouping examples

cat << EOF | vespa feed -t https://vespacloud-docsearch.vespa-team.aws-us-east-1c.z.vespa-app.cloud -
{"fields": {"customer": "Smith","date": 1157526000,"item": "Intake valve","price": "1000","tax": "0.24"},"put": "id:purchase:purchase::0"}
{"fields": {"customer": "Smith","date": 1157616000,"item": "Rocker arm","price": "1000","tax": "0.12"},"put": "id:purchase:purchase::1"}
{"fields": {"customer": "Smith","date": 1157619600,"item": "Spring","price": "2000","tax": "0.24"},"put": "id:purchase:purchase::2"}
{"fields": {"customer": "Jones","date": 1157709600,"item": "Valve cover","price": "3000","tax": "0.12"},"put": "id:purchase:purchase::3"}
{"fields": {"customer": "Jones","date": 1157702400,"item": "Intake port","price": "5000","tax": "0.24"},"put": "id:purchase:purchase::4"}
{"fields": {"customer": "Brown","date": 1157706000,"item": "Head","price": "8000","tax": "0.12"},"put": "id:purchase:purchase::5"}
{"fields": {"customer": "Smith","date": 1157796000,"item": "Coolant","price": "1300","tax": "0.24"},"put": "id:purchase:purchase::6"}
{"fields": {"customer": "Jones","date": 1157788800,"item": "Engine block","price": "2100","tax": "0.12"},"put": "id:purchase:purchase::7"}
{"fields": {"customer": "Brown","date": 1157792400,"item": "Oil pan","price": "3400","tax": "0.24"},"put": "id:purchase:purchase::8"}
{"fields": {"customer": "Smith","date": 1157796000,"item": "Oil sump","price": "5500","tax": "0.12"},"put": "id:purchase:purchase::9"}
{"fields": {"customer": "Jones","date": 1157875200,"item": "Camshaft","price": "8900","tax": "0.24"},"put": "id:purchase:purchase::10"}
{"fields": {"customer": "Brown","date": 1157878800,"item": "Exhaust valve","price": "1440","tax": "0.12"},"put": "id:purchase:purchase::11"}
{"fields": {"customer": "Brown","date": 1157882400,"item": "Rocker arm","price": "2330","tax": "0.24"},"put": "id:purchase:purchase::12"}
{"fields": {"customer": "Brown","date": 1157875200,"item": "Spring","price": "3770","tax": "0.12"},"put": "id:purchase:purchase::13"}
{"fields": {"customer": "Smith","date": 1157878800,"item": "Spark plug","price": "6100","tax": "0.24"},"put": "id:purchase:purchase::14"}
{"fields": {"customer": "Jones","date": 1157968800,"item": "Exhaust port","price": "9870","tax": "0.12"},"put": "id:purchase:purchase::15"}
{"fields": {"customer": "Brown","date": 1157961600,"item": "Piston","price": "1597","tax": "0.24"},"put": "id:purchase:purchase::16"}
{"fields": {"customer": "Smith","date": 1157965200,"item": "Connection rod","price": "2584","tax": "0.12"},"put": "id:purchase:purchase::17"}
{"fields": {"customer": "Jones","date": 1157968800,"item": "Rod bearing","price": "4181","tax": "0.24"},"put": "id:purchase:purchase::18"}
{"fields": {"customer": "Jones","date": 1157972400,"item": "Crankshaft","price": "6765","tax": "0.12"},"put": "id:purchase:purchase::19"}
EOF
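
To sanity-check grouping results against this data, the expected aggregates can be computed offline. The sketch below sums price per customer for the first six purchases of the feed above. It illustrates the aggregation only and is not a Vespa grouping query. The function name is hypothetical, and price is converted from the string form used in the feed:

```python
from collections import defaultdict

# First six purchases from the feed above (only the fields used here).
purchases = [
    {"customer": "Smith", "price": "1000"},
    {"customer": "Smith", "price": "1000"},
    {"customer": "Smith", "price": "2000"},
    {"customer": "Jones", "price": "3000"},
    {"customer": "Jones", "price": "5000"},
    {"customer": "Brown", "price": "8000"},
]


def sum_price_by_customer(rows):
    """Group rows by customer and sum the price of each group."""
    totals = defaultdict(int)
    for row in rows:
        totals[row["customer"]] += int(row["price"])
    return dict(totals)


print(sum_price_by_customer(purchases))
```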

Simplified node.js Lambda code

'use strict';
const https = require('https')
const AWS = require('aws-sdk')

const publicCert = `-----BEGIN CERTIFICATE-----
MIIFbDCCA1QCCQCTyf46/BIdpDANBgkqhkiG9w0BAQsFADB4MQswCQYDVQQGEwJO
...
NxoOxvYcP8Pnxn8UGILy7sKl3VRQWIMrlOfXK4DEg8EGqeQzlFVScfSdbH0i6gQz
-----END CERTIFICATE-----`;

exports.handler = async (event, context) => {
    console.log('Received event:', JSON.stringify(event, null, 4));
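    // queryStringParameters can be null when the request has no query string
    const params = event.queryStringParameters || {};
    const query = params.query ? params.query : '';
    const jsoncallback = params.jsoncallback;
    // hits and ranking were used below without being defined; the defaults here are assumptions
    const hits = params.hits ? params.hits : '10';
    const ranking = params.ranking ? params.ranking : 'documentation';
    const path = encodeURI(`/search/?jsoncallback=${jsoncallback}&query=${query}&hits=${hits}&ranking=${ranking}`);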

    const ssm = new AWS.SSM();
    const privateKeyParam = await new Promise((resolve, reject) => {
        ssm.getParameter({
            Name: 'ThePrivateKey',
            WithDecryption: true
        }, (err, data) => {
            if (err) { return reject(err); }
            return resolve(data);
        });
    });

    var options = {
        hostname: 'vespacloud-docsearch.vespa-team.aws-us-east-1c.z.vespa-app.cloud',
        port: 443,
        path: path,
        method: 'GET',
        headers: { 'accept': 'application/json' },
        key: privateKeyParam.Parameter.Value,
        cert: publicCert
    }

    var body = '';
    const response = await new Promise((resolve, reject) => {
        const req = https.get(
            options,
            res => {
                res.setEncoding('utf8');
                res.on('data', (chunk) => {body += chunk})
                res.on('end', () => {
                    resolve({
                        statusCode: 200,
                        body: body
                    });
                });
            });
        req.on('error', (e) => {
          reject({
              statusCode: 500,
              body: 'Something went wrong!'
          });
        });
    });
    return response
};