Whereas the simple Index can only consume id-content pairs, the Document-Index can process more complex data structures such as JSON.
Technically, a Document-Index is a layer on top of several default indexes. You can create multiple independent Document-Indexes in parallel, and each of them can optionally use the Worker or Persistent model.
FlexSearch Documents also contain these features:
- Document Store including Enrichment
- Multi-Field-Search
- Multi-Tag-Search
- Resolver (Chain Complex Queries)
- Result Highlighting
- Export/Import
- Worker
- Persistent
Document options basically inherit from Index Options, so you can apply most of those options in the top scope of the config (for all fields), per field, or both.
Option | Values | Description | Default |
document | Document Descriptor | Includes any specific information about how the document data should be indexed | (mandatory) |
worker | Boolean, String | Enable a worker distributed model. Read more about it here: Worker Index | false |
Document search options basically inherit from Index Search Options, so you can apply most of those options in the top scope of the config (for all fields), per field, or both.
Option | Values | Description | Default |
index, field | String, Array<String>, Array<SearchOptions> | Sets the document fields which should be searched. When no field is set, all fields will be searched. Custom options per field are also supported. | |
tag | Object<field:tag> | Restricts the results to documents which carry the given tag(s) in the given tag field(s). | |
enrich | Boolean | Enrich IDs from the results with the corresponding documents. | false |
highlight | Highlighting Options, String | Highlight query matches in the result (for Document Indexes only) | false |
merge | Boolean | Merge multiple fields in the result set into one and group results per ID | false |
pluck | String | Pick and apply search to just one field and return a flat result representation | false |
When creating a Document-Index you need to define a document descriptor in the field document. This descriptor includes any specific information about how the document data should be indexed.
Option | Values | Description | Default |
id | String | | "id" |
index | String, Array<String>, Array<FieldOptions> | | |
tag | String, Array<String>, Array<FieldOptions> | | |
store | Boolean, String, Array<String>, Array<FieldOptions> | | false |
You can use all standard Index Options within field options.
Option | Values | Description | Default |
field | String | The field name (colon-separated syntax) | (mandatory) |
filter | Function | | |
custom | Function | | |
Assuming our document has a simple data structure like this:
{
"id": 0,
"content": "some text"
}
An appropriate Document Descriptor always has to define at least 2 things:
- the property id, which describes the location of the document ID within a document item
- the property index (or tag), which contains one or multiple fields from the document that should be indexed for searching
// create a document index
const index = new Document({
document: {
id: "id",
index: "content"
}
});
// add documents to the index
index.add({
id: 0,
content: "some text"
});
As briefly explained above, the field id describes where the ID or unique key lives inside your documents. When not passed, it defaults to the field id at the top-level scope of your data.
The property index takes all fields you would like to have indexed. When selecting just one field, you can pass a string.
The next example adds 2 fields, title and content, to the index:
var docs = [{
id: 0,
title: "Title A",
content: "Body A"
},{
id: 1,
title: "Title B",
content: "Body B"
}];
const index = new Document({
id: "id",
index: ["title", "content"]
});
Add both fields to the document descriptor and pass individual Index-Options for each field:
const index = new Document({
id: "id",
index: [{
field: "title",
tokenize: "forward",
encoder: Charset.LatinAdvanced,
resolution: 9
},{
field: "content",
tokenize: "forward",
encoder: Charset.LatinAdvanced,
resolution: 3
}]
});
Field options inherit from top-level options when passed, e.g.:
const index = new Document({
tokenize: "forward",
encoder: Charset.LatinAdvanced,
resolution: 9,
document: {
id: "id",
index:[{
field: "title"
},{
field: "content",
resolution: 3
}]
}
});
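The inheritance rule can be pictured as a shallow merge in which per-field options override the top-level ones. The following is only an illustrative sketch of that rule and not FlexSearch's internal implementation: resolveFieldOptions is a hypothetical helper, and the encoder is replaced by a placeholder string so the snippet runs without the library.

```javascript
// Hypothetical helper (NOT part of FlexSearch): models the rule
// "field options override top-level options" as a shallow merge.
function resolveFieldOptions(config) {
  // separate the document descriptor from the shared top-level options
  const { document, ...topLevel } = config;
  return document.index.map(function (fieldOptions) {
    // later sources win, so per-field values override the top level
    return Object.assign({}, topLevel, fieldOptions);
  });
}

const resolved = resolveFieldOptions({
  tokenize: "forward",
  encoder: "LatinAdvanced", // placeholder for Charset.LatinAdvanced
  resolution: 9,
  document: {
    id: "id",
    index: [{ field: "title" }, { field: "content", resolution: 3 }]
  }
});

// "title" keeps resolution 9, "content" overrides it with 3;
// both inherit tokenize and encoder from the top level
console.log(resolved);
```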
Assigning the Encoder instance to the top-level configuration shares the encoder with all fields. You should avoid this when the fields don't hold the same type of content (e.g. one field contains terms, another contains numeric IDs).
Assume the documents look more complex (have nested branches etc.), e.g.:
{
"record": {
"id": 0,
"title": "some title",
"content": {
"header": "some text",
"footer": "some text"
}
}
}
Then use the colon-separated notation root:child:child as a name for each field, defining the hierarchy which corresponds to the document:
const index = new Document({
document: {
id: "record:id",
index: [
"record:title",
"record:content:header",
"record:content:footer"
]
}
});
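To make the notation concrete, the sketch below shows how a colon-separated name maps to a value inside the nested document from above. resolvePath is a hypothetical helper written for this illustration; it is not a FlexSearch function and does not claim to mirror how the library walks documents internally.

```javascript
// Hypothetical helper (not part of FlexSearch): follows a
// colon-separated path such as "record:content:header" into a document.
function resolvePath(doc, path) {
  return path.split(":").reduce(function (value, key) {
    // stop descending as soon as a branch is missing
    return value == null ? undefined : value[key];
  }, doc);
}

const doc = {
  record: {
    id: 0,
    title: "some title",
    content: { header: "some text", footer: "some text" }
  }
};

const id = resolvePath(doc, "record:id");                 // 0
const header = resolvePath(doc, "record:content:header"); // "some text"
const missing = resolvePath(doc, "record:author:name");   // undefined
```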
Tip
Only add fields you want to query against. Do not add fields to the index which you only need in the result. For this purpose you can store documents independently of the index (read below).
To query against one or multiple specific fields you have to pass the exact key of the field you have defined in the document descriptor as a field name (with colon syntax):
index.search(query, {
field: [
"record:title",
"record:content:header",
"record:content:footer"
]
});
Same as:
index.search(query, [
"record:title",
"record:content:header",
"record:content:footer"
]);
Using field-specific options:
index.search("some query", [{
field: "record:title",
limit: 100,
suggest: true
},{
field: "record:content:header",
limit: 100,
suggest: false
}]);
You can also perform a search through the same field with different queries:
index.search([{
field: "record:title",
query: "some query",
limit: 100,
suggest: true
},{
field: "record:title",
query: "some other query",
limit: 100,
suggest: true
}]);
You need to follow 2 rules for your documents:
- The document cannot start with an Array at the root. This will introduce sequential data and isn't supported yet. See below for a workaround for such data.
[ // <-- not allowed as document start!
{
"id": 0,
"title": "title"
}
]
- The document ID can't be nested inside an Array. This will introduce sequential data and isn't supported yet. See below for a workaround for such data.
{
"records": [ // <-- not allowed when ID or tag lives inside!
{
"id": 0,
"title": "title"
}
]
}
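Both rules can be checked before documents are added. The following sketch assumes a hypothetical helper checkDocument (not a FlexSearch API) that takes a document and the colon-separated path of its ID and returns null when both rules are met, or a message describing the violated rule.

```javascript
// Hypothetical pre-check (not a FlexSearch API) for the two rules above.
function checkDocument(doc, idPath) {
  // rule 1: the document must not start with an array
  if (Array.isArray(doc)) {
    return "the document must not start with an array at the root";
  }
  // rule 2: no array may appear on the way to the ID
  let value = doc;
  for (const key of idPath.split(":")) {
    value = value == null ? undefined : value[key];
    if (Array.isArray(value)) {
      return "the document ID must not be nested inside an array";
    }
  }
  return value === undefined ? "no ID found at '" + idPath + "'" : null;
}

const ok = checkDocument({ id: 0, title: "title" }, "id"); // null
const rootArray = checkDocument([{ id: 0, title: "title" }], "id");
const nestedId = checkDocument(
  { records: [{ id: 0, title: "title" }] },
  "records:id"
);
// rootArray and nestedId each hold the message of the violated rule
```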
Here is an example of a supported complex document:
{
"meta": {
"tag": "cat",
"id": 0
},
"contents": [
{
"body": {
"title": "some title",
"footer": "some text"
},
"keywords": ["some", "key", "words"]
},
{
"body": {
"title": "some title",
"footer": "some text"
},
"keywords": ["some", "key", "words"]
}
]
}
The corresponding document descriptor (when all fields should be indexed) looks like:
const index = new Document({
document: {
id: "meta:id",
index: [
"contents:body:title",
"contents:body:footer"
],
tag: [
"meta:tag",
"contents:keywords"
]
}
});
Remember that when searching you have to use the same colon-separated string as the key from your field definition.
index.search(query, {
index: "contents:body:title"
});
This example breaks both rules described above:
[ // <-- not allowed as document start!
{
"tag": "cat",
"records": [ // <-- not allowed when ID or tag lives inside!
{
"id": 0,
"body": {
"title": "some title",
"footer": "some text"
},
"keywords": ["some", "key", "words"]
},
{
"id": 1,
"body": {
"title": "some title",
"footer": "some text"
},
"keywords": ["some", "key", "words"]
}
]
}
]
You need to unroll your data within a simple loop before adding to the index.
A workaround for the data structure above could look like:
const index = new Document({
document: {
id: "id",
index: [
"body:title",
"body:footer"
],
tag: [
"tag",
"keywords"
]
}
});
function add(sequential_data){
for(let x = 0, item; x < sequential_data.length; x++){
item = sequential_data[x];
for(let y = 0, record; y < item.records.length; y++){
record = item.records[y];
// append tag to each record
record.tag = item.tag;
// add to index
index.add(record);
}
}
}
// now just use add() helper method as usual:
add([{
// sequential structured data
// take the data example above
}]);
Add a document to the index:
index.add({
id: 0,
title: "Foo",
content: "Bar"
});
Update index:
index.update({
id: 0,
title: "Foo",
content: "Foobar"
});
Remove a document and all its contents from an index, by ID:
index.remove(id);
Or by the document data:
index.remove(doc);
Search through all fields:
index.search(query);
Search through a specific field:
index.search(query, { index: "title" });
Search through a given set of fields:
index.search(query, { index: ["title", "content"] });
Pass custom options and/or queries to each field:
index.search([{
field: "content",
query: "some query",
limit: 100,
suggest: true
},{
field: "content",
query: "some other query",
limit: 100,
suggest: true
}]);
By default, every query is limited to 100 entries. Unbounded queries lead to issues. To adjust the size, set the limit as an option.
You can set the limit and the offset for each query:
index.search(query, { limit: 20, offset: 100 });
You cannot pre-count the size of the result-set. That's a limitation by design of FlexSearch. When you really need a count of all results that you can page through, assign a high enough limit, get back all results, and apply your paging offset manually (this also works on the server side). FlexSearch is fast enough that this isn't an issue.
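Manual paging over a complete result can be sketched as follows. The helper paginate and the sample IDs are assumptions made for this illustration; in practice the flat array would come from a search with a high limit (for example one using pluck).

```javascript
// Hypothetical helper: pages through a flat array of result IDs
// and reports the total count alongside the requested page.
function paginate(ids, page, pageSize) {
  const offset = page * pageSize; // pages are zero-based
  return {
    total: ids.length,
    pages: Math.ceil(ids.length / pageSize),
    items: ids.slice(offset, offset + pageSize)
  };
}

// stand-in for a full result returned by a search with a high limit
const allIds = [10, 11, 12, 13, 14, 15, 16];

const secondPage = paginate(allIds, 1, 3);
// secondPage is { total: 7, pages: 3, items: [13, 14, 15] }
```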
See all available field-search options
Schema of the default result-set:
fields[] => { field, result[] => id }
Schema of an enriched result-set:
fields[] => { field, result[] => { id, doc }}
The top-level scope of the result set is an array of fields to which the query was applied. Each of these fields has a record (object) with 2 properties: field and result. The result is either an array of IDs or is enriched with the stored document data (when the index was created with store: true).
A default non-enriched result set looks like:
[{
field: "title",
result: [0, 1, 2]
},{
field: "content",
result: [3, 4, 5]
}]
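Because the same ID can show up under several fields, consumers often want one list of unique IDs. A minimal sketch, assuming the result-set shape shown above (collectIds is a hypothetical helper, and the sample data deliberately repeats ID 2 in both fields):

```javascript
// Collects the unique IDs from a non-enriched document result set,
// keeping the order in which they first appear.
function collectIds(resultSet) {
  const ids = new Set();
  for (const { result } of resultSet) {
    for (const id of result) ids.add(id);
  }
  return Array.from(ids);
}

const resultSet = [
  { field: "title", result: [0, 1, 2] },
  { field: "content", result: [2, 3, 4] }
];

const uniqueIds = collectIds(resultSet); // [0, 1, 2, 3, 4]
```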
An enriched result set looks like:
[{
field: "title",
result: [
{ id: 0, doc: { /* document */ }},
{ id: 1, doc: { /* document */ }},
{ id: 2, doc: { /* document */ }}
]
},{
field: "content",
result: [
{ id: 3, doc: { /* document */ }},
{ id: 4, doc: { /* document */ }},
{ id: 5, doc: { /* document */ }}
]
}]
Schema of the merged result-set:
result[] => { id, doc, field[] }
By passing the search option merge: true all fields of the result set will be merged (grouped by ID):
[{
id: 1001,
doc: {/* stored document */},
field: ["fieldname-1", "fieldname-2"]
},{
id: 1002,
doc: {/* stored document */},
field: ["fieldname-3"]
}]
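The relation between the per-field and the merged representation can be illustrated by grouping an enriched result set by ID. This is only a sketch of the resulting shape, with a hypothetical helper mergeByID and made-up documents; it is not how FlexSearch performs the merge internally.

```javascript
// Illustrative only: groups an enriched per-field result set by ID,
// producing the shape { id, doc, field[] } described above.
function mergeByID(resultSet) {
  const byId = new Map();
  for (const { field, result } of resultSet) {
    for (const { id, doc } of result) {
      if (!byId.has(id)) byId.set(id, { id, doc, field: [] });
      // remember every field in which this ID matched
      byId.get(id).field.push(field);
    }
  }
  return Array.from(byId.values());
}

const merged = mergeByID([
  { field: "title", result: [{ id: 1001, doc: { title: "A" } }] },
  { field: "content", result: [
    { id: 1001, doc: { title: "A" } },
    { id: 1002, doc: { title: "B" } }
  ]}
]);
// merged is [
//   { id: 1001, doc: { title: "A" }, field: ["title", "content"] },
//   { id: 1002, doc: { title: "B" }, field: ["content"] }
// ]
```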
When using pluck instead of field you can explicitly select just one field and get back a flat representation:
index.search(query, {
pluck: "title",
enrich: true
});
[
{ id: 0, doc: { /* document */ }},
{ id: 1, doc: { /* document */ }},
{ id: 2, doc: { /* document */ }}
]
Like the property index within a document descriptor, just define a property tag:
const index = new Document({
document: {
id: "id",
tag: "species",
index: "content"
}
});
index.add({
id: 0,
species: "cat",
content: "Some content ..."
});
Your data also can include multiple tags as an array:
index.add({
id: 1,
species: ["fish", "dog"],
content: "Some content ..."
});
You can perform a tag-specific search by:
index.search(query, {
tag: { species: "fish" }
});
This only gives you results which were tagged with the given tag.
Use multiple tags when searching:
index.search(query, {
tag: { species: ["cat", "dog"] }
});
This gives you results which were tagged with at least one of the given tags.
Get back all tagged results without passing any query:
index.search({
tag: { species: "cat" }
});
Assume this document schema (a dataset from IMDB):
{
"tconst": "tt0000001",
"titleType": "short",
"primaryTitle": "Carmencita",
"originalTitle": "Carmencita",
"isAdult": 0,
"startYear": "1894",
"endYear": "",
"runtimeMinutes": "1",
"genres": [
"Documentary",
"Short"
]
}
An appropriate document descriptor could look like:
import { Document, Charset } from "flexsearch";
const index = new Document({
encoder: Charset.Normalize,
resolution: 3,
document: {
id: "tconst",
//store: true, // document store
index: [{
field: "primaryTitle",
tokenize: "forward"
},{
field: "originalTitle",
tokenize: "forward"
}],
tag: [
"startYear",
"genres"
]
}
});
The field contents of primaryTitle and originalTitle are tokenized by the forward tokenizer. The field contents of startYear and genres are added as tags.
Get all entries of a specific tag:
const result = index.search({
//enrich: true, // enrich documents
tag: { "genres": "Documentary" },
limit: 1000,
offset: 0
});
Get entries of multiple tags (intersection):
const result = index.search({
//enrich: true, // enrich documents
tag: {
"genres": ["Documentary", "Short"],
"startYear": "1894"
}
});
Combine tags with queries (intersection):
const result = index.search({
query: "Carmen", // forward tokenizer
tag: {
"genres": ["Documentary", "Short"],
"startYear": "1894"
}
});
Alternative declaration:
const result = index.search("Carmen", {
tag: [{
field: "genres",
tag: ["Documentary", "Short"]
},{
field: "startYear",
tag: "1894"
}]
});
Only a document index can have a store. You can use a document index instead of a flat index to get this functionality even when you only store ID-content pairs.
You can define independently which fields should be indexed and which fields should be stored. This way you can index fields which should not be included in the search result.
Do not use a store when: 1. an array of IDs as the result is good enough, or 2. you already have the contents/documents stored elsewhere (outside the index).
When the store attribute was set, you have to include all fields which should be stored explicitly (acts like a whitelist).
When the store attribute was not set, the original document is stored as a fallback.
This will add the whole original content to the store:
const index = new Document({
document: {
index: "content",
store: true
}
});
index.add({ id: 0, content: "some text" });
You can get indexed documents from the store:
var data = index.get(1);
You can update/change store contents directly without changing the index by:
index.set(1, data);
To update the store and also update the index, just use index.update, index.add or index.append.
When you perform a query, whether on a document index or a flat index, you will always get back an array of IDs.
Optionally you can enrich the query results automatically with stored contents by:
index.search(query, { enrich: true });
Your results now look like:
[{
id: 0,
doc: { /* content from store */ }
},{
id: 1,
doc: { /* content from store */ }
}]
When storing documents, you can configure independently what should be indexed and what should be stored. This can reduce the required index space significantly. Indexed fields do not need to be included in the stored data (the ID doesn't need to be kept in the store either). It is recommended to add only those fields to the store which you need in the final result for further processing.
A short example of configuring a document store:
const index = new Document({
document: {
index: "content",
store: ["author", "email"]
}
});
index.add({
id: 0,
author: "Jon Doe",
email: "john@mail.com",
content: "Some content for the index ..."
});
You can query through the contents and will get back the stored values instead:
index.search("some content", { enrich: true });
Your results now look like:
[{
field: "content",
result: [{
id: 0,
doc: {
author: "Jon Doe",
email: "john@mail.com",
}
}]
}]
Neither the "author" nor the "email" field is indexed, whereas the indexed field "content" is not included in the stored data.
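The whitelist behaviour of the store can be modelled as picking the listed fields from each document. The helper pickStored below is hypothetical and only illustrates which parts of the example document end up in the store; it is not FlexSearch code.

```javascript
// Illustrative only: models how a store whitelist limits what is kept.
function pickStored(doc, storeFields) {
  const stored = {};
  for (const field of storeFields) {
    // copy only whitelisted fields that exist on the document
    if (field in doc) stored[field] = doc[field];
  }
  return stored;
}

const stored = pickStored({
  id: 0,
  author: "Jon Doe",
  email: "john@mail.com",
  content: "Some content for the index ..."
}, ["author", "email"]);
// stored is { author: "Jon Doe", email: "john@mail.com" }
```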
You can pass a function to the field option property filter. This function just has to return true if the document should be indexed.
const index = new Document({
document: {
id: "id",
index: [{
// custom field:
field: "somefield",
filter: function(data){
// return false to filter out
// return anything else to keep
return true;
}
}],
tag: [{
field: "city",
filter: function(data){
// return false to filter out
// return anything else to keep
return true;
}
}],
store: [{
field: "anotherfield",
filter: function(data){
// return false to filter out
// return anything else to keep
return true;
}
}]
}
});
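As a more concrete sketch, a filter could skip documents which are flagged as drafts. The field names title and draft are hypothetical, and the snippet assumes that the filter receives the whole document item (as suggested by the data parameter above). The descriptor is kept as a plain object here so the filter function can be checked in isolation without the library.

```javascript
// Assumption: the filter receives the whole document item.
// "title" and "draft" are hypothetical field names.
const descriptor = {
  id: "id",
  index: [{
    field: "title",
    filter: function (data) {
      // index the title only for documents which are not drafts
      return data.draft !== true;
    }
  }]
};

// the filter function can be exercised directly:
const titleFilter = descriptor.index[0].filter;
const keepsPublished = titleFilter({ id: 1, title: "Hello", draft: false }); // true
const skipsDraft = titleFilter({ id: 2, title: "WIP", draft: true });        // false
```

Such a descriptor object would then be passed as the document property when creating the index.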
You can pass a function to the field option property custom to either:
- change and/or extend the original input string
- create a new "virtual" field which is not included in document data
Dataset example:
{
"id": 10001,
"firstname": "John",
"lastname": "Doe",
"city": "Berlin",
"street": "Alexanderplatz",
"number": "1a",
"postal": "10178"
}
You can apply custom fields derived from document data or by any external data:
const index = new Document({
document: {
id: "id",
index: [{
// custom field:
field: "fullname",
custom: function(data){
// return custom string
return data.firstname + " " +
data.lastname;
}
},{
// custom field:
field: "location",
custom: function(data){
return data.street + " " +
data.number + ", " +
data.postal + " " +
data.city;
}
}],
tag: [{
// existing field
field: "city"
},{
// custom field:
field: "category",
custom: function(data){
let tags = [];
// push one or multiple tags
// ....
return tags;
}
}],
store: [{
field: "anotherfield",
custom: function(data){
// return a falsy value to filter out
// return anything else as to keep in store
return data;
}
}]
}
});
A custom function also acts as a filter when it returns false.
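To see which strings the custom fields from above produce for the dataset example, the same function bodies can be called directly. The names buildFullname and buildLocation are only used for this illustration.

```javascript
const data = {
  id: 10001,
  firstname: "John",
  lastname: "Doe",
  city: "Berlin",
  street: "Alexanderplatz",
  number: "1a",
  postal: "10178"
};

// same logic as the custom function of the field "fullname"
function buildFullname(data) {
  return data.firstname + " " + data.lastname;
}

// same logic as the custom function of the field "location"
function buildLocation(data) {
  return data.street + " " + data.number + ", " +
         data.postal + " " + data.city;
}

const indexedFullname = buildFullname(data); // "John Doe"
const indexedLocation = buildLocation(data); // "Alexanderplatz 1a, 10178 Berlin"
```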
Perform a query against the custom field as usual:
const result = index.search({
query: "10178 Berlin Alexanderplatz",
field: "location"
});
const result = index.search({
query: "john doe",
tag: { "city": "Berlin" }
});
When using TypeScript, you can type your document data when creating a Document-Index. This will provide enhanced type checks of your syntax.
Create a schema according to your document data, e.g.:
type doctype = {
id: number,
title: string,
description: string,
tags: string[]
};
Create the document index by assigning the type doctype:
const document = new Document<doctype>({
id: "id",
store: true,
index: [{
field: "title"
},{
field: "description"
}],
tag: "tags"
});