Concepts

Mark Papadakis edited this page May 20, 2017 · 3 revisions

Please check the Query Understanding blog posts; they cover many aspects and tasks related to queries and information retrieval (IR).

Inverted Index

From On Inverted Index Compression for Search Engine Efficiency: For each unique indexed term, the inverted index contains a posting list, where each posting contains occurrence information (e.g. frequencies and positions) for documents that contain the term.
To rank the documents in response to a query, the posting lists for the terms of the query must be traversed, which can be costly, especially for long posting lists.
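The structure described above can be sketched as a toy in-memory index. This is only an illustration of the concept, not Trinity's implementation: the names `Posting` and `ToyInvertedIndex` are hypothetical, tokenization is plain whitespace splitting, and nothing is compressed or persisted.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// One posting: a document that contains the term, plus occurrence information.
struct Posting {
    uint32_t doc_id;
    std::vector<uint32_t> positions; // token offsets of the term in the document

    uint32_t frequency() const { return static_cast<uint32_t>(positions.size()); }
};

// Toy in-memory inverted index: term -> posting list ordered by ascending doc_id.
class ToyInvertedIndex {
public:
    // Documents must be added in strictly increasing doc_id order so that
    // every posting list stays sorted without an explicit sort step.
    void add_document(uint32_t doc_id, const std::string &text) {
        assert(!has_docs_ || doc_id > last_doc_id_);
        last_doc_id_ = doc_id;
        has_docs_ = true;

        std::istringstream tokens(text);
        std::string term;
        uint32_t position = 0;
        while (tokens >> term) {
            auto &list = postings_[term];
            if (list.empty() || list.back().doc_id != doc_id)
                list.push_back(Posting{doc_id, {}});
            list.back().positions.push_back(position++);
        }
    }

    // Returns the posting list for a term, or nullptr if the term is unknown.
    const std::vector<Posting> *lookup(const std::string &term) const {
        const auto it = postings_.find(term);
        return it == postings_.end() ? nullptr : &it->second;
    }

private:
    std::map<std::string, std::vector<Posting>> postings_;
    uint32_t last_doc_id_ = 0;
    bool has_docs_ = false;
};
```

Answering a query then means looking up the posting list of each query term and walking those lists, which is why long lists dominate the cost.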

Different orderings of the postings in the posting lists change both the algorithm for ranked retrieval and the underlying representation of postings in the inverted index, such as how they are compressed. For instance, the postings for each term can be sorted in order of impact, allowing ranked retrieval to be short-circuited once enough documents have been retrieved. See, for example, Twitter, which created a special codec where the posting list is ordered by descending document ID, because they want to consider the most recent tweets first and abort early as soon as they collect K of them.
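As a rough sketch of the short-circuiting idea, assume a single-term query and a posting list that carries a precomputed per-document impact. The names `ImpactPosting`, `sort_by_impact` and `top_k` are made up for this example and are not part of Trinity.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct ImpactPosting {
    uint32_t doc_id;
    float impact; // precomputed contribution of the term to the document's score
};

inline bool higher_impact(const ImpactPosting &a, const ImpactPosting &b) {
    return a.impact > b.impact;
}

// Index time: order the list so that the highest-impact postings come first.
inline void sort_by_impact(std::vector<ImpactPosting> &list) {
    std::stable_sort(list.begin(), list.end(), higher_impact);
}

// Query time (single-term query): stop as soon as k documents are collected.
// `examined` reports how many postings were actually visited.
inline std::vector<uint32_t> top_k(const std::vector<ImpactPosting> &list,
                                   std::size_t k, std::size_t &examined) {
    assert(std::is_sorted(list.begin(), list.end(), higher_impact));
    std::vector<uint32_t> result;
    examined = 0;
    for (const auto &p : list) {
        if (result.size() == k)
            break; // short-circuit: the remaining postings cannot rank higher
        ++examined;
        result.push_back(p.doc_id);
    }
    return result;
}
```

The same loop works for a newest-first list: only the sort key changes, and traversal still stops after K postings instead of reading the whole list.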

However, search engines commonly use the traditional static docid ordering, where each posting list is ordered by ascending (monotonically increasing) document ID, which permits a reduced inverted index size and efficient retrieval.
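One reason ascending order shrinks the index is that only the gaps between consecutive document IDs need to be stored, and small gaps fit in fewer bytes. The sketch below uses generic delta plus variable-byte encoding to show the effect; it is an assumption chosen for illustration and does not describe the codecs Trinity actually ships. The function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Encode strictly ascending doc IDs as variable-byte encoded gaps (deltas).
// Each output byte carries 7 payload bits; the high bit means "more bytes follow".
inline std::vector<uint8_t> encode_doc_ids(const std::vector<uint32_t> &doc_ids) {
    std::vector<uint8_t> out;
    uint32_t prev = 0;
    bool first = true;
    for (const uint32_t id : doc_ids) {
        assert(first || id > prev); // input must be monotonically increasing
        uint32_t gap = id - prev;
        first = false;
        prev = id;
        while (gap >= 0x80) {
            out.push_back(static_cast<uint8_t>((gap & 0x7f) | 0x80));
            gap >>= 7;
        }
        out.push_back(static_cast<uint8_t>(gap));
    }
    return out;
}

// Decode the byte stream back into absolute doc IDs by summing the gaps.
inline std::vector<uint32_t> decode_doc_ids(const std::vector<uint8_t> &bytes) {
    std::vector<uint32_t> out;
    uint32_t prev = 0;
    uint32_t gap = 0;
    unsigned shift = 0;
    for (const uint8_t b : bytes) {
        gap |= static_cast<uint32_t>(b & 0x7f) << shift;
        if (b & 0x80) {
            shift += 7; // continuation byte: keep accumulating this gap
        } else {
            prev += gap;
            out.push_back(prev);
            gap = 0;
            shift = 0;
        }
    }
    return out;
}
```

For the IDs 1000, 1003, 1010 and 1500 the gaps are 1000, 3, 7 and 490, which take 6 bytes in this encoding instead of 16 bytes as raw 32-bit integers.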

Segment

In Trinity, a segment is a self-contained index, with its own terms dictionary (or other logic and data used for resolving terms), posting lists, and other data. Segments can be created in isolation at any time and can be used along with potentially other segments (as index sources) in query executions by the execution engine.

Index Source

An index source provides term resolution and posting list decoders to the execution engine. It may also provide a set of documents that had been updated or deleted as of the time the index source was created. The execution engine deals with index sources. Please see the comments in index_source.h.
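To make the role of the updated/deleted set more concrete, here is a deliberately simplified sketch. It assumes, which the text above does not state, that the set carried by a newer source is used to hide stale copies of those documents in older sources. `ToySource` and `match_term` are hypothetical stand-ins and not the interface declared in index_source.h.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical stand-in for an index source; NOT Trinity's interface.
struct ToySource {
    std::map<std::string, std::vector<uint32_t>> postings; // term -> ascending doc IDs
    std::set<uint32_t> masked; // docs updated or deleted as of this source's creation
};

// `sources` is ordered oldest to newest. A document found in a source is
// dropped if any newer source lists it as updated or deleted (assumed semantics).
inline std::vector<uint32_t> match_term(const std::vector<ToySource> &sources,
                                        const std::string &term) {
    std::vector<uint32_t> hits;
    std::set<uint32_t> masked_by_newer;
    // Walk newest to oldest so the masks of newer sources are known first.
    for (auto it = sources.rbegin(); it != sources.rend(); ++it) {
        const auto found = it->postings.find(term);
        if (found != it->postings.end()) {
            assert(std::is_sorted(found->second.begin(), found->second.end()));
            for (const uint32_t doc : found->second) {
                if (masked_by_newer.count(doc) == 0)
                    hits.push_back(doc);
            }
        }
        masked_by_newer.insert(it->masked.begin(), it->masked.end());
    }
    return hits;
}
```

With an older source holding documents 1, 2 and 3 for a term, and a newer source holding 2 and 7 while marking 2 and 3 as updated or deleted, only 2 and 7 (from the newer source) and 1 (from the older one) are returned.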

Termspaces

There are no persistent, fixed term integer IDs in Trinity. When you are creating segments or other index sources, you are indexing terms. The utility class Trinity::SegmentIndexSession uses integer IDs internally to simplify processing of indexed content, but that (term=>id) relationship is not persisted in any way. When you are executing queries, each distinct term involved in the query is assigned a transient, execution-context-specific integer ID, which is done in order to simplify the execution of the query and is not exposed to the application (however, see the Trinity::query_term_instance comments).
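The idea of transient, per-execution term IDs can be sketched as follows. The class `QueryTermIds` is hypothetical and only mirrors the concept; the ID width and the choice to start numbering at 0 are assumptions of this example, not Trinity's behaviour.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Assigns small integer IDs to the distinct terms of ONE query execution.
// A fresh instance is created per execution, so IDs never outlive the query.
class QueryTermIds {
public:
    // Returns the ID for `term`, allocating the next free ID on first sight.
    uint16_t id_for(const std::string &term) {
        const auto it = ids_.find(term);
        if (it != ids_.end())
            return it->second;
        assert(terms_.size() < UINT16_MAX);
        const auto id = static_cast<uint16_t>(terms_.size());
        ids_.emplace(term, id);
        terms_.push_back(term);
        return id;
    }

    // Maps an ID handed out by this instance back to its term.
    const std::string &term_for(uint16_t id) const {
        assert(id < terms_.size());
        return terms_[id];
    }

    std::size_t distinct_terms() const { return terms_.size(); }

private:
    std::unordered_map<std::string, uint16_t> ids_;
    std::vector<std::string> terms_;
};
```

Because the mapping lives only inside one instance, the same term can receive a different ID in the next query, which is why such IDs cannot be stored in a segment or compared across executions.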

Each segment or index is self-contained; it has its own terms dictionary (a term maps to a term index context, see Trinity::term_index_ctx description).
