- Title: Distinct Attribute
- Start Date: 2021-04-16
- Specification PR: #32
- MeiliSearch Tracking-Issues: milli#168
The distinct attribute is useful for discarding other occurrences of documents that have the same value in the field set as the distinct attribute.
The value of a field whose attribute is set as a distinct attribute will always be unique in the returned documents.
The new search engine, Milli, no longer processes the distinct attribute the way the current MeiliSearch does. This specification aims to make Milli as backward compatible as possible with the current release (v0.20.0), in both API usage and expected search results.
Algolia's distinct feature is based on one attribute, as defined in `attributeForDistinct`.
Adding an attribute to this setting ensures that all search results have a different value in the given attribute's field. This is done at indexing time.
Algolia's distinct functionality enables de-duplication and grouping by accepting a numeric value N for the `distinct` parameter.
Behavior | Distinct value | Description |
---|---|---|
de-duplication | N = 1 | Used to remove similar records from the search result. Only the most relevant record is returned. |
grouping | N > 1 | N records containing the same value for the distinct attribute will be returned. |
`distinct` is silently ignored at query time if `attributeForDistinct` is not defined. It is possible, but not mandatory, to set `distinct` at indexing time via the index settings endpoint.
It is possible to disable the distinct feature at query time by giving a falsy value (`false` or `0`) for the `distinct` parameter. Giving `true` is equivalent to `1`.
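The behavior described above can be sketched in a few lines of Python. This is only an illustration of the semantics, not Algolia's implementation: the function names `normalize_distinct` and `apply_distinct` are hypothetical, the hits are assumed to be already sorted by relevance, and records that lack the attribute are assumed to be kept as-is.

```python
from collections import defaultdict
from typing import Any, Dict, List, Optional, Union

Hit = Dict[str, Any]


def normalize_distinct(value: Union[bool, int]) -> int:
    """Map the query-time `distinct` value to a number N (false -> 0, true -> 1)."""
    n = int(value)  # bool is a subclass of int in Python: True -> 1, False -> 0
    if n < 0:
        raise ValueError("distinct must be a boolean or a non-negative integer")
    return n


def apply_distinct(
    ranked_hits: List[Hit],
    attribute_for_distinct: Optional[str],
    distinct: Union[bool, int] = True,
) -> List[Hit]:
    """Keep at most N hits per value of the distinct attribute.

    `ranked_hits` is assumed to be sorted from most to least relevant.
    """
    n = normalize_distinct(distinct)
    # Silently ignored when no attribute is configured; disabled when N == 0.
    if attribute_for_distinct is None or n == 0:
        return list(ranked_hits)

    kept_per_value: Dict[Any, int] = defaultdict(int)
    result: List[Hit] = []
    for hit in ranked_hits:
        if attribute_for_distinct not in hit:
            result.append(hit)  # assumption: records without the attribute are kept
            continue
        value = hit[attribute_for_distinct]  # assumed to be a hashable scalar
        if kept_per_value[value] < n:
            kept_per_value[value] += 1
            result.append(hit)
    return result
```

With N = 1 this yields de-duplication, and with N > 1 it yields grouping in a flat result list.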
TypeSense's distinct feature uses `group_by` and `group_limit` to achieve de-duplication and grouping at query time.
A field that is set as a parameter of `group_by` must already be declared as a facet.
Parameter | Description |
---|---|
group_by | It is possible to aggregate records into groups by setting multiple fields separated by a comma. E.g. group_by=country,company_name |
group_limit | Control the maximum number of top records returned for groups. By default, TypeSense group_limit parameter is set to 3. |
Using `group_by` adds a nested structure to the search result. Buckets are returned in the `grouped_hits` field.
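To make the nested structure concrete, here is a minimal Python sketch of grouping by several fields with a per-group limit. It is not TypeSense code: the function `group_hits` and the inner keys `group_key` and `hits` are names chosen for illustration, and the input is assumed to be sorted by relevance.

```python
from typing import Any, Dict, List, Sequence, Tuple

Hit = Dict[str, Any]


def group_hits(
    ranked_hits: List[Hit],
    group_by: Sequence[str],
    group_limit: int = 3,  # mirrors the default mentioned in the table above
) -> Dict[str, List[Dict[str, Any]]]:
    """Bucket hits by the combined value of the `group_by` fields.

    Each bucket keeps at most `group_limit` hits, in relevance order.
    """
    buckets: Dict[Tuple[Any, ...], List[Hit]] = {}
    for hit in ranked_hits:
        key = tuple(hit.get(field) for field in group_by)
        bucket = buckets.setdefault(key, [])
        if len(bucket) < group_limit:
            bucket.append(hit)
    # dicts preserve insertion order, so groups appear in order of their best hit
    return {
        "grouped_hits": [
            {"group_key": list(key), "hits": hits} for key, hits in buckets.items()
        ]
    }
```

Setting `group_limit=1` in this sketch reduces grouping to plain de-duplication on the combined key.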
Elasticsearch does not provide this functionality directly. No keyword or operator exists to get de-duplicated or grouped results.
It is difficult to find the information in the official documentation. Many people ask about the distinct feature on Stack Overflow and the Elasticsearch forums.
By adding a terms or composite aggregation to the search, it is possible to get buckets made of documents that share the same value for a specified field.
E.g. a terms aggregation:
```json
{
  "aggs": {
    "distinct_colors": {
      "terms": {
        "field": "color",
        "size": 1000
      }
    }
  }
}
```
The size parameter can be set to define how many term buckets should be returned out of the overall terms list. Term aggregations by default return 10 buckets only.
A composite aggregation makes it possible to paginate over buckets containing a lot of values.
A cardinality aggregation calculates the count of distinct values for a field.
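The following Python sketch emulates, in memory, what the terms and cardinality aggregations compute. It does not call Elasticsearch: `terms_aggregation` and `cardinality` are hypothetical helper names, and the count here is exact, whereas Elasticsearch documents its cardinality aggregation as approximate.

```python
from collections import Counter
from typing import Any, Dict, List

Document = Dict[str, Any]


def terms_aggregation(
    documents: List[Document], field: str, size: int = 10
) -> List[Dict[str, Any]]:
    """Return the `size` most frequent values of `field` as buckets."""
    counts = Counter(doc[field] for doc in documents if field in doc)
    # most_common sorts buckets by descending document count
    return [
        {"key": key, "doc_count": count} for key, count in counts.most_common(size)
    ]


def cardinality(documents: List[Document], field: str) -> int:
    """Count the distinct values of `field` (exact in this sketch)."""
    return len({doc[field] for doc in documents if field in doc})
```

The default of `size=10` mirrors the default number of buckets mentioned above.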
It is also possible to use the `DISTINCT` keyword from the SQL access feature of X-Pack. However, this functionality can only return tabular data.
Let's say that we have two documents with the same `product_id`. Each document represents one color variation.
```json
{
  "hits": [
    {
      "colors": "red",
      "id": 1,
      "label": "t-shirt",
      "product_id": 1
    },
    {
      "colors": "black",
      "id": 2,
      "label": "t-shirt",
      "product_id": 1
    }
  ],
  "offset": 0,
  "limit": 20,
  "nbHits": 2,
  "exhaustiveNbHits": false,
  "processingTimeMs": 1,
  "query": "t-shirt"
}
```
Without setting `product_id` as a distinct attribute, a search with `t-shirt` as a query will return the two documents.
For example, it can be useful to display one result per color variation of a product. But as the number of product variations grows over time, you might want to display only one result as the top search result, mostly for UI concerns.
This is where the distinct attribute is most useful.
Setting `product_id` as a distinct attribute will discard all other documents having the same value for `product_id` from the search result.
MeiliSearch accepts only one value, which is the document attribute used to de-duplicate documents sharing the same value for that attribute.
It is possible to configure the distinct attribute using two endpoints: Update All Settings and the distinct attribute endpoint.
Given this setting:
```json
{
  "distinctAttribute": "product_id"
}
```
MeiliSearch will return hits de-duplicated by `product_id`.
```json
{
  "hits": [
    {
      "colors": "red",
      "id": 1,
      "label": "t-shirt",
      "product_id": 1
    }
  ],
  "offset": 0,
  "limit": 20,
  "nbHits": 1,
  "exhaustiveNbHits": false,
  "processingTimeMs": 1,
  "query": "\"t-shirt\""
}
```
It returns only the most relevant document and discards all the others.
For a search with `"q": "black"` as a parameter, MeiliSearch returns:
```json
{
  "hits": [
    {
      "colors": "black",
      "id": 2,
      "label": "t-shirt",
      "product_id": 1
    }
  ],
  "offset": 0,
  "limit": 20,
  "nbHits": 1,
  "exhaustiveNbHits": false,
  "processingTimeMs": 0,
  "query": "\"black\""
}
```
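The two results above can be reproduced with a small Python sketch. It is an illustration under strong assumptions, not how MeiliSearch or Milli works internally: `naive_search` is a hypothetical stand-in for the real ranking (a case-insensitive substring match that keeps index order), and documents lacking the distinct attribute are assumed to be kept.

```python
from typing import Any, Dict, List, Optional

Document = Dict[str, Any]

DOCUMENTS: List[Document] = [
    {"colors": "red", "id": 1, "label": "t-shirt", "product_id": 1},
    {"colors": "black", "id": 2, "label": "t-shirt", "product_id": 1},
]


def naive_search(documents: List[Document], query: str) -> List[Document]:
    """Stand-in for ranking: keep documents containing the query, in index order."""
    q = query.lower()
    return [d for d in documents if any(q in str(v).lower() for v in d.values())]


def search(
    documents: List[Document], query: str, distinct_attribute: Optional[str] = None
) -> Dict[str, Any]:
    """Search, then keep only the first (most relevant) hit per distinct value."""
    hits = naive_search(documents, query)
    if distinct_attribute is not None:
        seen = set()
        deduplicated: List[Document] = []
        for hit in hits:
            if distinct_attribute in hit:
                value = hit[distinct_attribute]
                if value in seen:
                    continue  # a more relevant document already holds this value
                seen.add(value)
            deduplicated.append(hit)
        hits = deduplicated
    return {"hits": hits, "nbHits": len(hits), "query": query}
```

Calling `search(DOCUMENTS, "t-shirt", "product_id")` keeps only document 1, while `search(DOCUMENTS, "black", "product_id")` keeps only document 2, because it is the only match for that query.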
Milli should be identical to v0.20 concerning API endpoints and returned results.
N/A
N/A
To apply the distinct behavior, the search engine needs to create a database. This database is the same as the one created to apply facets and filters on an attribute. However, users will not pass the attribute into `attributesForFaceting` when setting a distinct attribute. This means the search engine must create this database for the related attribute itself.
It also means the search engine would technically be able to apply filters and facet distribution on the distinct attribute. To avoid confusion, the search engine should prevent users from executing a filter or getting a facet distribution on the distinct attribute. Only the distinct capability should be available for this field.
If users want to filter on that attribute, they will have to add it to `attributesForFaceting` as well.
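The rule above separates two questions: which fields need the facet database, and which fields users may filter on. A minimal Python sketch of that distinction follows. The helper names `faceted_database_fields` and `can_filter_on` are hypothetical, and the settings are modeled as a plain dictionary using the `distinctAttribute` and `attributesForFaceting` keys from this specification.

```python
from typing import Any, Dict, Set

Settings = Dict[str, Any]


def faceted_database_fields(settings: Settings) -> Set[str]:
    """Fields for which the engine must build the facet database."""
    fields = set(settings.get("attributesForFaceting") or [])
    distinct = settings.get("distinctAttribute")
    if distinct is not None:
        fields.add(distinct)  # built implicitly, even if the user did not facet it
    return fields


def can_filter_on(settings: Settings, attribute: str) -> bool:
    """Filtering and facet distribution stay limited to explicitly faceted fields."""
    return attribute in (settings.get("attributesForFaceting") or [])
```

In this sketch, a distinct attribute is present in the database set but still rejected by `can_filter_on` until it is also listed in `attributesForFaceting`.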
Using this new data structure allows interesting future possibilities.
These ideas will be materialized as feature-proposals in the product team backlog. Keep in mind that this specification is related to v0.21.
Since MeiliSearch can only de-duplicate documents matching the distinct attribute, it would be interesting to be able to set the number of most relevant documents to keep before discarding the others.
`groupLimit` could be applied at search time to fit users' needs.
It would probably require adding a nested structure to return the hits for each group clearly.
Given this distinct setting:
```json
{
  "distinctAttributes": ["colors", "size"]
}
```
We could add a `groupBy`-like parameter to let the user choose a specific grouping behavior:
```json
{
  "groupBy": ["colors"]
}
```
or
```json
{
  "groupBy": ["size", "colors"]
}
```
It can be used in combination with `groupLimit` to define the number of most relevant documents to keep for each group before discarding the others.