After reading Improving the performance of full-text search on the Dropbox Blog (which is an excellent insight into the company) I was inspired to try building my own full-text search engine.
To build on what I learnt in last year's object oriented module, I'm going to implement this in Java 8.
- Implement an inverted index algorithm
- Measure performance
- Reiterate on original algorithm
Mapper
The Mapper
class takes a path to a file and uses a Tokenizer
instance to generate a map, with a K<Token>, V<Posting>
structure.
class Main {
public static void main(String[] args) {
Mapper mapper = new Mapper("./data/example/example_1.txt");
mapper._printMap();
}
}
This code would output the structure of the generated map.
Term Attributes
'fish' (fileId: 0, frequency: 2)
'one' (fileId: 0, frequency: 1)
'two' (fileId: 0, frequency: 1)
Grouper
The Grouper
class takes a list of Mapper
objects, and generates an inverted index, which is a map of terms to an array of (file ID
, token frequency
) pairs.
Note that terms in the map are sorted alphabetically.
class Main {
public static void main(String[] args) {
Mapper m1 = new Mapper("./data/example_1.txt");
Mapper m2 = new Mapper("./data/example_2.txt");
Grouper grouper = new Grouper(m1, m2);
grouper._printInvertedIndex();
}
}
This code would produce an inverted index with the following structure.
Term Postings
blue (fileId: 0, frequency: 1),
fish (fileId: 0, frequency: 2), (fileId: 1, frequency: 2),
one (fileId: 1, frequency: 1),
red (fileId: 0, frequency: 1),
two (fileId: 1, frequency: 1),
Reducer
The Reducer
class takes the inverted index output by the Grouper
and writes an encoded version of the postings lists to disk.
class Main {
public static void main(String[] args) {
Mapper m1 = new Mapper("./data/example_1.txt");
Mapper m2 = new Mapper("./data/example_2.txt");
Grouper grouper = new Grouper(m1, m2);
Reducer reducer = new Reducer(grouper.getInvertedIndex());
}
}
The Reducer constructor generates an _index0.txt
file.
0:1,
0:2,1:2,
1:1,
0:1,
1:1,
The encoding of this file is simple at the moment.
Lines are ordered alphabetically according to their corresponding token.
Each line contains information for files containing that token only.
The number before the :
corresponds to the file ID (which is assigned by the Mapper
class).
The second number corresponds to the number of times the token appears in that file.
The reducer is capable of distributing the index across multiple files to prevent the creation of a single monolithic file.
In this simple example, that limit is set arbitrarily at 100, meaning if the search space contains greater than 100 tokens then the index will be split across multiple _index<i>.txt
files.
Retriever
The Retriever
class builds a map of term to term location whilst terms are indexed.
This map can then be queried using the get()
method to return the encoded locations of the relevant files.
Retriever.get("fish");
This code returns
0:2,1:2
which indicates that the term "fish" occurs in the file with ID 0
twice and in the file with ID 1
twice.
This result can then be interpreted by the Query
class.