Tutorial: Indexing Wikipedia with Tantivy CLI

tantivy-cli is the project hosting the command line interface for tantivy, a search engine project.

Introduction

In this tutorial, we will create a brand new index containing the articles of English Wikipedia.

Installing the tantivy CLI.

There are a couple of ways to install tantivy-cli.

If you are a Rust programmer, you probably have cargo installed and you can just run `cargo install tantivy-cli`.

Creating the index: `new`

Let's create a directory in which your index will be stored.

    # create the directory
    mkdir wikipedia-index

We will now initialize the index and create its schema. The schema defines the list of your fields, and for each field:

its name
its type (the wizard currently offers Text, u64, i64, f64, Date, Facet and Bytes)
how it should be indexed.

You can find more information about indexing options on tantivy's schema documentation page.

In our case, our documents will contain

a title
a body
a url

We want the title and the body to be tokenized and indexed. We also want to add the term frequencies and term positions to our index. The url only needs to be stored, not indexed.

Running tantivy new will start a wizard that will help you define the schema of the new index.

As with all the other tantivy commands, you have to pass it your index directory via the -i or --index parameter:

    tantivy new -i wikipedia-index

Answer the questions as follows:


    Creating new index 
    Let's define its schema! 



    New field name  ? title
    Choose Field Type (Text/u64/i64/f64/Date/Facet/Bytes) ? Text
    Should the field be stored (Y/N) ? Y
    Should the field be indexed (Y/N) ? Y
    Should the term be tokenized? (Y/N) ? Y
    Should the term frequencies (per doc) be in the index (Y/N) ? Y
    Should the term positions (per doc) be in the index (Y/N) ? Y
    Add another field (Y/N) ? Y
    
    
    
    New field name  ? body
    Choose Field Type (Text/u64/i64/f64/Date/Facet/Bytes) ? Text
    Should the field be stored (Y/N) ? Y
    Should the field be indexed (Y/N) ? Y
    Should the term be tokenized? (Y/N) ? Y
    Should the term frequencies (per doc) be in the index (Y/N) ? Y
    Should the term positions (per doc) be in the index (Y/N) ? Y
    Add another field (Y/N) ? Y
    
    
    
    New field name  ? url
    Choose Field Type (Text/u64/i64/f64/Date/Facet/Bytes) ? Text
    Should the field be stored (Y/N) ? Y
    Should the field be indexed (Y/N) ? N
    Add another field (Y/N) ? N


    [
    {
        "name": "title",
        "type": "text",
        "options": {
            "indexing": "position",
            "stored": true
        }
    },
    {
        "name": "body",
        "type": "text",
        "options": {
            "indexing": "position",
            "stored": true
        }
    },
    {
        "name": "url",
        "type": "text",
        "options": {
            "indexing": "unindexed",
            "stored": true
        }
    }
    ]

After the wizard has finished, a meta.json file should exist at wikipedia-index/meta.json. It is fairly human-readable JSON, so you can check its content.

It contains two sections:

segments (currently empty, but we will change that soon)
schema
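To check the result of the wizard without opening the file by hand, a small script can summarize meta.json. This is a minimal sketch, assuming the file has a top-level `segments` list and a top-level `schema` list whose entries carry a `name`, as in the JSON shown above. The helper name `summarize_meta` is hypothetical and is not part of tantivy.

```python
import json
from pathlib import Path


def summarize_meta(index_dir):
    """Return (segment_count, field_names) read from <index_dir>/meta.json."""
    meta = json.loads(Path(index_dir, "meta.json").read_text(encoding="utf-8"))
    # Assumed layout: "segments" is a list and "schema" is a list of field
    # entries that each have a "name" key, as in the wizard output.
    segments = meta.get("segments", [])
    field_names = [field["name"] for field in meta.get("schema", [])]
    return len(segments), field_names
```

Right after the wizard, `summarize_meta("wikipedia-index")` should report zero segments and the three fields title, body and url.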

Indexing the document: `index`

Tantivy's index command offers a way to index a JSON file. The file must contain one JSON object per line, and the structure of each object must match our schema definition.

    {"body": "some text", "title": "some title", "url": "http://somedomain.com"}
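If you prepare your own corpus, it can help to check each line before indexing it. The following is a rough sketch rather than anything tantivy provides: `check_line` is a hypothetical helper, and it is deliberately strict. This tutorial does not say whether tantivy tolerates missing fields, so treat those reports as warnings.

```python
import json

# Field names from the schema defined in this tutorial.
SCHEMA_FIELDS = frozenset({"title", "body", "url"})


def check_line(line, fields=SCHEMA_FIELDS):
    """Return a list of problems found in one line of the input file."""
    try:
        doc = json.loads(line)
    except json.JSONDecodeError as err:
        return [f"not valid JSON: {err.msg}"]
    if not isinstance(doc, dict):
        return ["line is not a JSON object"]
    present = set(doc)
    problems = [f"missing field: {name}" for name in sorted(fields - present)]
    problems += [f"unknown field: {name}" for name in sorted(present - fields)]
    return problems
```

An empty list means the line is a JSON object with exactly the three expected fields.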

For this tutorial, you can download a corpus of more than 5 million English Wikipedia articles in the right format here: wiki-articles.json (2.34 GB). Make sure to decompress the file. Alternatively, if you have bzcat installed, you can skip this step and read the file while it is still compressed.

    bunzip2 wiki-articles.json.bz2
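To inspect the download before decompressing all of it, Python's standard `bz2` module can read the archive as a stream. This is a minimal sketch, assuming the archive holds one JSON object per line with a `title` field, as described above. The function name `peek_titles` is made up for illustration.

```python
import bz2
import itertools
import json


def peek_titles(path, limit=5):
    """Return the titles of the first `limit` documents in a .bz2 JSON-lines file."""
    with bz2.open(path, "rt", encoding="utf-8") as lines:
        # islice stops after `limit` lines, so only the start of the archive is read.
        return [json.loads(line)["title"] for line in itertools.islice(lines, limit)]
```

For example, `peek_titles("wiki-articles.json.bz2")` returns the first five titles without writing the decompressed file to disk.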

If you are in a rush you can download 100 articles in the right format here (11 MB).

The index command will index your documents. By default it uses 3 threads, with a total buffer of 1 GB split across these threads.

    cat wiki-articles.json | tantivy index -i ./wikipedia-index

You can change the number of threads with the -t parameter, and the total buffer size shared by the threads with the -m parameter. Note that tantivy's memory usage is greater than just this buffer size.
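If you drive indexing from a script, the same invocation can be assembled programmatically. The sketch below uses only the -i, -t and -m flags named above. The helper names `index_command` and `run_index` are hypothetical, and the value given to -m is passed through unchanged because its unit is not stated in this tutorial.

```python
import subprocess


def index_command(index_dir, threads=None, buffer_size=None):
    """Build the argument list for `tantivy index`."""
    cmd = ["tantivy", "index", "-i", str(index_dir)]
    if threads is not None:
        cmd += ["-t", str(threads)]
    if buffer_size is not None:
        # Passed through as-is: the expected unit of -m is an open assumption.
        cmd += ["-m", str(buffer_size)]
    return cmd


def run_index(json_path, index_dir, threads=None, buffer_size=None):
    """Feed a JSON-lines file to `tantivy index` on stdin, like the cat pipeline."""
    with open(json_path, "rb") as source:
        subprocess.run(index_command(index_dir, threads, buffer_size),
                       stdin=source, check=True)
```

Calling `run_index("wiki-articles.json", "./wikipedia-index", threads=8)` would be the scripted counterpart of the pipeline above with eight threads. It requires tantivy to be on your PATH.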

On my computer (8 core Xeon(R) CPU X3450 @ 2.67GHz), on 8 threads, indexing Wikipedia takes around 9 minutes.

While tantivy is indexing, you can peek at the index directory to check what is happening.

    ls ./wikipedia-index

The main file is meta.json.

You should also see a lot of files with a UUID as filename and different extensions. Our index is in fact divided into segments. Each segment acts as an individual smaller index, and its name is simply a UUID.

If you decided to index the complete Wikipedia, you may also see some of these files disappear. Having too many segments can hurt search performance, so tantivy automatically starts merging segments.
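To watch segments appear and merge, you can group the files in the index directory by name. This is a rough sketch, assuming segment files are named `<uuid>.<extension>` as described above. The helper `segment_files` is hypothetical, and it skips meta.json and hidden files.

```python
from collections import defaultdict
from pathlib import Path


def segment_files(index_dir):
    """Map each segment name to the sorted list of extensions found for it."""
    segments = defaultdict(list)
    for path in Path(index_dir).iterdir():
        # Ignore directories, meta.json and hidden files.
        if not path.is_file() or path.name == "meta.json" or path.name.startswith("."):
            continue
        # Split on the first dot: "<uuid>.<extension>" is the assumed naming scheme.
        name, _, extension = path.name.partition(".")
        segments[name].append(extension)
    return {name: sorted(exts) for name, exts in segments.items()}
```

Running `len(segment_files("./wikipedia-index"))` repeatedly during indexing gives a rough segment count, which should grow and then drop when merges happen.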

Serve the search index: `serve`

Tantivy's CLI also embeds a search server. You can run it with the following command.

    tantivy serve -i wikipedia-index

By default, it will serve on port 3000.

You can search for the top 20 most relevant documents for the query Barack Obama by opening the following URL in your browser:

http://localhost:3000/api/?q=barack+obama&nhits=20

By default this query is treated as barack OR obama. You can also search for documents that contain both terms by adding a + sign before each term in your query.

http://localhost:3000/api/?q=%2Bbarack%20%2Bobama&nhits=20

Also, a - sign before a term excludes the documents containing that term.

http://localhost:3000/api/?q=-barack%20%2Bobama&nhits=20

Finally, tantivy handles phrase queries.

http://localhost:3000/api/?q=%22barack%20obama%22&nhits=20
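Percent-encoding the +, space and quote characters by hand is error-prone, so it can be convenient to build these URLs in code. The sketch below uses only the Python standard library and the `q` and `nhits` parameters shown above. The function names are hypothetical, and the response is returned as raw text because its format is not described in this tutorial.

```python
from urllib.parse import quote, urlencode
from urllib.request import urlopen


def search_url(query, nhits=20, base="http://localhost:3000/api/"):
    """Build a search URL, encoding spaces as %20 and + as %2B."""
    params = urlencode({"q": query, "nhits": nhits}, quote_via=quote)
    return f"{base}?{params}"


def search(query, nhits=20):
    """Send the query to a running `tantivy serve` and return the raw response body."""
    with urlopen(search_url(query, nhits)) as response:
        return response.read().decode("utf-8")
```

For instance, `search_url("+barack +obama")` produces the same URL as the "both terms" example above, and `search_url('"barack obama"')` produces the phrase-query URL.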

Search the index via the command line

You may also use the search command to stream all documents matching a specific query. The documents are returned in an unspecified order.

    tantivy search -i wikipedia-index -q "barack obama"

Benchmark the index: `bench`

Tantivy's CLI provides a simple benchmark tool. You can run it with the following command.

    tantivy bench -i wikipedia-index -n 10 -q queries.txt
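The tutorial does not describe the format of queries.txt. The sketch below assumes a plain text file with one query per line, using the same query syntax as the search examples; please verify this against the tantivy-cli source before relying on it. The helper `write_queries` is hypothetical.

```python
from pathlib import Path


def write_queries(path, queries):
    """Write one query per line (assumed format) and return how many were written."""
    cleaned = [query.strip() for query in queries if query.strip()]
    Path(path).write_text("\n".join(cleaned) + "\n", encoding="utf-8")
    return len(cleaned)
```

For example, `write_queries("queries.txt", ["barack obama", "+barack +obama", '"barack obama"'])` creates a three-line file that can be passed to the -q parameter.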

Port server: `port`

You may use the port command to run a server process that is fully controlled via stdin/stdout, for use as a full-text search component in an Erlang/Elixir application. Please also see the Elixir part. Note that the command is meant to be run from within an Erlang/Elixir application:

    tantivy port -i wikipedia-index

The schema must be defined beforehand via tantivy new -i .... The wire protocol uses a split command/completion style, so multiple requests can be executed in parallel and completed out of order. Right now, all writes are executed in order in a dedicated thread, to ensure data integrity and to enable batch committing. All search requests are executed in parallel via Tokio.
