It is possible to add new spaces and search methods. This is done in three steps, which we only outline here. A more detailed description can be found in the following subsections.
In the first step, the user writes the code that implements a
functionality of a method or a space.
In the second step, the user writes a special helper file
containing a method that creates a class or a method.
In this helper file, it is necessary to include
the method/space header.
Because we tend to give the helper file the same name as the name of header for a specific method/space, we should not include method/space headers using quotes (in other words, use only angle brackets). The code with quoes fails to compile under the Visual Studio. Here is an example of a proper include-directive:
#include <method/vptree.h>
In the third step, the user adds
the registration code to either the file
init_spaces.h (for spaces)
or to the file
init_methods.h (for methods).
This step has two sub-steps.
First, the user includes the previously created helper file into either
init_spaces.h
or init_methods.h
.
Second, the function initMethods
or initSpaces
is extended
by adding a macro call that actually registers the space or method in a factory class.
Note that no explicit/manual modification of makefiles (or other configuration files) is required.
However, you have to re-run cmake
each time a new source file is created (addition
of header files does not require a cmake
run).
This is necessary to automatically update makefiles so that they include new source files.
Is is noteworthy that all implementations of methods and spaces
are mostly template classes parameterized by the distance value type.
Recall that the distance function can return an integer (int
),
a single-precision (float
), or a double-precision (double
) real value.
The user may choose to provide specializations for all possible
distance values or decide to focus, e.g., only on integer-valued distances.
The user can also add new applications, which are meant to be
a part of the testing framework/library.
However, adding new applications does require minor editing of the meta-makefile CMakeLists.txt
,
deleting the cmake cache and re-running cmake
(see "Building the main library").
It is also possible to create standalone applications that use the library.
In the following subsections, we consider extension tasks in more detail. For illustrative purposes, we created a zero-functionality space DummySpace, method DummyMethod, and application dummy_app. These zero-functionality examples can also be used as starting points to develop fully functional code.
The main benchmarking utility experiment
parses command line parameters.
Then, it creates a space and a search method using the space and the method factories.
Both search method and spaces can have parameters,
which are passed to the method/space in an instance
of the class AnyParams
.
We consider this in detail in the following sections.
When we create a class representing a search method,
the constructor of the class does not create an index in the memory.
The index is created using either the function CreateIndex
(from scratch)
or the function LoadIndex
(from a previously created index image).
The index can be saved to disk using the function SaveIndex
.
Note, however, that most methods do not support index (de)-serialization.
Depending on parameters passed to the benchmarking utility, two test scenarios are possible. In the first scenario, the user specifies separate data and test files. In the second scenario, a test file is created by bootstrapping: The data set is randomly divided into training and a test set. Then, we call the function RunAll and subsequently Execute for all possible test sets.
The function Execute
is a main workhorse, which creates queries, runs searches,
produces gold standard data, and collects execution statistics.
There are two types of queries: nearest-neighbor and range queries,
which are represented by (template) classes RangeQuery
and KNNQuery
.
Both classes inherit from the class Query
.
Similar to spaces, these template classes are parameterized by the type of the distance value.
Both types of queries are similar in that they implement the Radius
function
and the functions CheckAndAddToResult
.
In the case of the range query, the radius of a query is constant.
However, in the case of the nearest-neighbor query,
the radius typically decreases as we compare the query
with previously unseen data objects (by calling the function CheckAndAddToResult
).
In both cases, the value of the function Radius
can be used to prune unpromising
partitions and data points.
This commonality between the RangeQuery
and KNNQuery
allows us in many cases to carry out a nearest-neighbor query
using an algorithm designed to answer range queries.
Thus, only a single implementation of a search method--that answers queries of both types--can be used in many cases.
A query object proxies distance computations during the testing phase.
Namely, the distance function is accessible through the function
IndexTimeDistance
, which is defined in the class Space
.
During the testing phase, a search method can compute a distance
only by accessing functions Distance
,
DistanceObjLeft
(for left queries) and
DistanceObjRight
for right queries,
which are member functions of the class Query
.
The function Distance
accepts two parameters (i.e., object pointers) and
can be used to compare two arbitrary objects.
The functions DistanceObjLeft
and DistanceObjRight
are used
to compare data objects with the query.
Note that it is a query object memorizes the number of distance computations.
A space is a collection of data objects.
In our library, objects are represented by instances
of the class Object
.
The functionality of this class is limited to
creating new objects and/or their copies as well providing
access to the raw (i.e., unstructured) representation of the data
(through functions data
and datalength
).
We would re-iterate that currently (though this may change in the future releases),
Object
is a very basic class that only keeps a blob of data and blob's size.
For example, the Object
can store an array of single-precision floating point
numbers, but it has no function to obtain the number of elements.
These are the spaces that are responsible for reading objects from files,
interpreting the structure of the data blobs (stored in the Object
),
and computing a distance between two objects.
For dense vector spaces the easiest way to create a new space, is to create a functor (function object class) that computes a distance. Then, this function should be used to instantiate a template VectorSpaceGen. A sample implementation of this approach can be found in sample_standalone_app1.cc. However, as we explain below, additional work is needed if the space should work correctly with all projection methods or any other methods that rely on projections (e.g., OMEDRANK).
To further illustrate the process of developing a new space,
we created a sample zero-functionality space DummySpace
.
It is represented by
the header file
space_dummy.h
and the source file
space_dummy.cc.
The user is encouraged to study these files and read the comments.
Here we focus only on the main aspects of creating a new space.
The sample files include a template class DummySpace
,
which is declared and defined in the namespace similarity
.
It is a direct ancestor of the class Space
.
It is possible to provide the complete implementation of the DummySpace
in the header file. However, this would make compilation slower.
Instead, we recommend to use the mechanism of explicit template instantiation.
To this end, the user should instantiate the template in the source file
for all possible combination of parameters.
In our case, the source file
space_dummy.cc
contains the following lines:
template class SpaceDummy<int>;
template class SpaceDummy<float>;
template class SpaceDummy<double>;
Most importantly, the user needs to implement the function HiddenDistance
,
which computes the distance between objects,
and the function CreateObjFromStr
that creates a data point object from an instance
of a C++ class string
.
For simplicity--even though this is not the most efficient approach--all our spaces create
objects from textual representations. However, this is not a principal limitation,
because a C++ string can hold binary data as well.
Perhaps, the next most important function is ReadNextObjStr
,
which reads a string representation of the next object from a file.
A file is represented by a reference to a subclass of the class DataFileInputState
.
Compared to some older releases, the current Space
API is substantially more complex.
This is necessary to standardize reading/writing of generic objects.
In turn, this has been essential to implementing a generic query server.
The query server accepts data points in the same format as they are stored in a data file.
The above mentioned function CreateObjFromStr
is used for de-serialization
of both the data points stored in a file and query data points passed to the query server.
Additional complexity arises from the need to update space parameters after a space object is created.
This permits a more complex storage model where, e.g., parameters are stored
in a special dedicated header file, while data points are stored elsewhere,
e.g., split among several data files.
To support such functionality, we have a function that opens a data file (OpenReadFileHeader
)
and creates a state object (sub-classed from DataFileInputState
),
which keeps the current file(s) state as well as all space-related parameters.
When we read data points using the function ReadNextObjStr
,
the state object is updated.
The function ReadNextObjStr
may also read an optional external identifier for an object.
When it produces a non-empty identifier it is memorized by the query server and is further
used for query processing (see building and using the query server).
After all data points are read, this state object is supposed to be passed to the Space
object
in the following fashion:
unique_ptr<DataFileInputState>
inpState(space->ReadDataset(dataSet, externIds, fileName, maxNumRec));
space->UpdateParamsFromFile(*inpState);
For a more advanced implementation of the space-related functions, please, see the file space_vector.cc.
In addition to regular data reading/writing functionality, a space may reimplement functions
ReadObjectVectorFromBinData
and WriteObjectVectorBinData
.
The default implementations of these functions save/load the array of Object
instances
in the simple binary format. One drawback of this implementation is that it does not store
external IDs, because no standard space currently uses them.
Remember that the function HiddenDistance
should not be directly accessible
by classes that are not friends of the Space
.
As explained in the previous section,
during the indexing phase,
HiddenDistance
is accessible through the function
Space::IndexTimeDistance
.
During the testing phase, a search method can compute a distance
only by accessing functions Distance
, DistanceObjLeft
, or
DistanceObjRight
, which are member functions of the Query
.
This solution is not perfect, but we are not going to change this for now.
Should we implement a vector space that works properly with projection methods
and classic random projections, we need to define functions GetElemQty
and CreateDenseVectFromObj
.
In the case of a dense vector space, GetElemQty
should return the number of vector elements stored in the object.
For sparse vector spaces, it should return zero. The function CreateDenseVectFromObj
extracts elements stored in a vector.
For dense vector spaces,
it merely copies vector elements to a buffer.
For sparse space vector spaces,
it should do some kind of basic dimensionality reduction.
Currently, we do it via the hashing trick.
Importantly, we need to "tell" the library about the space,
by registering the space in the space factory.
At runtime, the space is created through a helper function.
In our case, it is called CreateDummy
.
The function, accepts only one parameter,
which is a reference to an object of the type AllParams
:
template <typename dist_t>
Space<dist_t>* CreateDummy(const AnyParams& AllParams) {
AnyParamManager pmgr(AllParams);
int param1, param2;
pmgr.GetParamRequired("param1", param1);
pmgr.GetParamRequired("param2", param2);
pmgr.CheckUnused();
return new SpaceDummy<dist_t>(param1, param2);
To extract parameters, the user needs an instance of the class AnyParamManager
(see the above example).
In most cases, it is sufficient to call two functions: GetParamOptional
and
GetParamRequired
.
To verify that no extra parameters are added, it is recommended to call the function CheckUnused
(it fires an exception if some parameters are unclaimed).
This may also help to identify situations where the user misspells
a parameter's name.
Parameter values specified in the commands line are interpreted as strings.
The GetParam*
functions can convert these string values
to integer or floating-point numbers if necessary.
A conversion occurs, if the type of a receiving variable (passed as a second parameter
to the functions GetParam*
) is different from a string.
It is possible to use boolean variables as parameters.
In that, in the command line, one has to specify 1 (for true
) or 0 (for false
).
Note that the function GetParamRequired
raises an exception,
if the request parameter was not supplied in the command line.
The function CreateDummy
is registered in the space factory using a special macro.
This macro should be used for all possible values of the distance function,
for which our space is defined. For example, if the space is defined
only for integer-valued distance function, this macro should be used only once.
However, in our case the space CreateDummy
is defined for integers,
single- and double-precision floating pointer numbers. Thus, we use this macro
three times as follows:
REGISTER_SPACE_CREATOR(int, SPACE_DUMMY, CreateDummy)
REGISTER_SPACE_CREATOR(float, SPACE_DUMMY, CreateDummy)
REGISTER_SPACE_CREATOR(double, SPACE_DUMMY, CreateDummy)
This macro should be placed into the function initSpaces
in the
file init_spaces.h.
Last, but not least we need to add the include-directive
for the helper function, which creates
the class, to the file init_spaces.h
as follows:
#include "factory/space/space_dummy.h"
To conlcude, we recommend to make a Space
object is non-copyable.
This can be done by using our macro DISABLE_COPY_AND_ASSIGN
.
To illustrate the basics of developing a new search method,
we created a sample zero-functionality method DummyMethod
.
It is represented by
the header file
dummy.h
and the source file
dummy.cc.
The user is encouraged to study these files and read the comments.
Here we would omit certain details.
Similar to the space and query classes, a search method is implemented using
a template class, which is parameterized by the distance function value.
Note again that the constructor of the class does not create an index in the memory.
The index is created using either the function CreateIndex
(from scratch)
or the function LoadIndex
(from a previously created index image).
The index can be saved to disk using the function SaveIndex
.
It does not have to be a comprehensive index that contains a copy of the data set.
Instead, it is sufficient to memorize only the index structure itself (because
the data set is always loaded separately).
Also note that most methods do not support index (de)-serialization.
The constructor receives a reference to a space object as well as a reference to an array of data objects.
In some cases, e.g., when we wrap existing methods such as the multiprobe LSH,
we create a copy of the data set (simply because is was easier to write the wrapper this way).
The framework can be informed about such a situation via the virtual function DuplicateData
.
If this function returns true, the framework "knows" that the data was duplicated.
Thus, it can correct an estimate for the memory required by the method.
The function CreateIndex
receives a parameter object.
In our example, the parameter object is used to retrieve the single index-time parameter: doSeqSearch
.
When this parameter value is true, our dummy method carries out a sequential search.
Otherwise, it does nothing useful.
Again, it is recommended to call the function CheckUnused
to ensure
that the user did not enter parameters with incorrect names.
It is also recommended to call the function ResetQueryTimeParams
(this
pointer needs to
be specified explicitly here) to reset query-time parameters after the index is created (or loaded from disk).
Unlike index-time parameters, query-time parameters can be changed without rebuilding the index
by invoking the function SetQueryTimeParams
.
The function SetQueryTimeParams
accepts a constant reference to a parameter object.
The programmer, in turn, creates a parameter manager object to extract actual parameter values.
To this end, two functions are used: GetParamRequired
and GetParamOptional
.
Note that the latter function must be supplied with a mandatory default value for the parameter.
Thus, the parameter value is properly reset to its default value when the user does not specify the parameter value
explicitly (e.g., the parameter specification is omitted when the user invokes the benchmarking utility experiment
)!
There are two search functions each of which receives two parameters.
The first parameter is a pointer to a query (either a range or a k-NN query).
The second parameter is currently unused.
Note again that during the search phase, a search method can
compute a distance only by accessing functions Distance
, DistanceObjLeft
, or
DistanceObjRight
, which are member functions of a query object.
The function IndexTimeDistance
should not be used in a function Search
,
but it can be used in the function CreateIndex
.
If the user attempts to invoke IndexTimeDistance
during the test phase,
the program will terminate.
Finally, we need to "tell" the library about the method,
by registering the method in the method factory,
similarly to registering a space.
At runtime, the method is created through a helper function,
which accepts several parameters.
One parameter is a reference to an object of the type AllParams
.
In our case, the function name is CreateDummy
:
#include <method/dummy.h>
namespace similarity {
template <typename dist_t>
Index<dist_t>* CreateDummy(bool PrintProgress,
const string& SpaceType,
Space<dist_t>& space,
const ObjectVector& DataObjects) {
return new DummyMethod<dist_t>(space, DataObjects);
}
There is an include-directive preceding the creation function, which uses angle brackets. As explained previously, if you opt to using quotes (in the include-directive), the code may not compile under the Visual Studio.
Again, similarly to the case of the space,
the method-creating function CreateDummy
needs
to be registered in the method factory in two steps.
First, we need to include dummy.h
into the file
init_methods.h as follows:
#include "factory/method/dummy.h"
Then, this file is further modified by adding the following lines to the function `initMethods}:
REGISTER_METHOD_CREATOR(float, METH_DUMMY, CreateDummy)
REGISTER_METHOD_CREATOR(double, METH_DUMMY, CreateDummy)
REGISTER_METHOD_CREATOR(int, METH_DUMMY, CreateDummy)
If we want our method to work only with integer-valued distances, we only need the following line:
REGISTER_METHOD_CREATOR(int, METH_DUMMY, CreateDummy)
When adding the method, please, consider expanding
the test utility test_integr
.
This is especially important if for some combination of parameters the method is expected
to return all answers (and will have a perfect recall). Then, if we break the code in the future,
this will be detected by test_integr
.
To create a test case, the user needs to add one or more test cases
to the file
test_integr.cc.
A test case is an instance of the class MethodTestCase
.
It encodes the range of plausible values
for the following performance parameters: the recall,
the number of points closer to the query than the nearest returned point,
and the improvement in the number of distance computations.
Imagine, we need to add an additional benchmarking/testing/etc utility that is built as a part of NMSLIB. First, we would create a hello-world source file dummy_app.cc:
#include <iostream>
using namespace std;
int main(void) {
cout << "Hello world!" << endl;
}
Then we need to modify
the meta-makefile and
re-run cmake
as described in "Building the main library".