[ENHANCEMENT] Wildcard support #770

Open
jess-lord opened this issue Feb 25, 2024 · 3 comments
Labels
enhancement New feature or request

Comments

@jess-lord

Is your feature request related to a problem? Please describe.
Marqo 1.4 supported wildcards in the query string, which we relied on to do metadata-only filters and queries.

Describe the solution you'd like
Please support wildcard queries again.

Describe alternatives you've considered
The only alternative for us is to stay on Marqo 1.x.

Additional context
This worked in Marqo 1.4, but 2.2 does not return the records. These are metadata records that have no content.
{"q":"*", "filter":"tag:_summary", "searchMethod":"LEXICAL"}
A workaround here can be to set the query to _summary but that doesn't work for the next example.
{"q":"*", "filter":"NOT topic:(Trolling) AND content:(Trolling)", "searchMethod":"LEXICAL", "searchableAttributes":["content"]}
This used to work but now returns 0 results. I can't set the query to Trolling because I need a literal match on that string (lexical is fuzzy and will return results for permutations like Troll). The records do not have any value set for their topic attribute. Content is a tensor field in a structured index, and it was also configured with the lexical and filter features at index creation.

@jess-lord jess-lord added the enhancement New feature or request label Feb 25, 2024
@farshidz
Collaborator

farshidz commented May 2, 2024

Hi @jess-lord. Thanks for raising this issue. Is your requirement to search based only on a filter, with no query? Or do you intend to use the wildcard as part of a string, e.g. q="somevalue*" for a prefix search?

In the meantime, I believe having q="Trolling" with a filter could in fact give you the desired outcome. Your query might match content=Troll due to linguistic processing (stemming), but the filter will eliminate those results.

Here's an example I just tried

ix.add_documents(
    documents=[
        {
            '_id': '1',
            'title': 'Trolling',
            'topic': 'Fun',
        },
        {
            '_id': '2',
            'title': 'Troll',
            'topic': 'Fun',
        }
    ],
    tensor_fields=[]
)

response = ix.search(q='Trolling', limit=10, search_method="lexical", filter_string='NOT topic:(Trolling) AND title:(Trolling)')

response['hits']

Results:

[{'title': 'Trolling',
  'topic': 'Fun',
  '_id': '1',
  '_score': 0.1823215567939546,
  '_highlights': []}]

As you can see, this didn't return the document with title=Troll.

@jess-lord
Author

@farshidz Thanks for looking into this. I'm looking for exact token matches, so "troll" should match "the troll under the bridge" but not "the trolling of online forums". The use case is to search marqo document content for important keywords that need an exact match. So the filter would target the "content" property of the documents. Maybe a more abstract example is easier:

ix.add_documents(
    documents=[
        {
            '_id': '1',
            'content': 'lorem ipusm abc1 lorem',
            'topic': '',
        },
        {
            '_id': '2',
            'content': 'lorem ipusm abc110 lorem',
            'topic': '',
        }
    ],
    # tensor_fields takes field names as strings
    tensor_fields=['content']
)

In this example my objective is to filter the index for docs whose content contains abc1 and tag all matching results with a topic of genreA, and to tag docs containing abc110 with genreB. When filtering for "abc1" I don't want to get the second document.

@farshidz
Collaborator

farshidz commented Jun 26, 2024

@jess-lord since Marqo 2.7, you can search with q="*" as you did with Marqo 1 (searching using only your filters). However, this doesn't immediately enable exact matching of a token within a string, because:

  • Lexical (inverted) indexes (lexical_search feature) store processed/stemmed tokens
  • Filter indexes (filter feature) can only exact match the full string for efficiency reasons

The best workaround I can think of is to split your text (content in the example above) based on whitespace to create a list and store this as an array<string> field in Marqo (if using an unstructured index, just pass the list as a document field and the type will be inferred). Then searching with q="*" and filter_string="content:abc1" will achieve what you want.
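
A minimal sketch of the splitting step described above, for illustration only. The helper `with_token_field` and the separate field name `content_tokens` are assumptions introduced here (the comment above stores the list in `content` itself); a separate field is used only so the original text stays available for tensor search. The final lines check the intended exact-token semantics locally, without a running Marqo instance, and the commented calls show roughly how the prepared documents would be indexed and filtered.

```python
# Hypothetical helper: not part of Marqo. It adds a list-of-strings field
# holding the whitespace-separated tokens of the source text.
def with_token_field(doc, source_field="content", token_field="content_tokens"):
    """Return a copy of doc with an extra field of whitespace tokens."""
    tokens = doc.get(source_field, "").split()
    return {**doc, token_field: tokens}


docs = [
    {"_id": "1", "content": "lorem ipusm abc1 lorem", "topic": ""},
    {"_id": "2", "content": "lorem ipusm abc110 lorem", "topic": ""},
]
prepared = [with_token_field(d) for d in docs]

# Local stand-in for the exact-match filter: a token matches only if it
# equals a whole list element, so "abc1" does not match "abc110".
exact_abc1 = [d["_id"] for d in prepared if "abc1" in d["content_tokens"]]
print(exact_abc1)  # prints ['1']

# Against a real index (Marqo 2.7+), the calls would look roughly like this
# (assumed field name; a structured index would need the field declared as
# array<string> with the filter feature):
#   ix.add_documents(documents=prepared, tensor_fields=["content"])
#   ix.search(q="*", search_method="lexical",
#             filter_string="content_tokens:abc1")
```

Note that plain whitespace splitting keeps punctuation attached to tokens (for example "abc1," would not equal "abc1"), so some normalisation may be needed before indexing.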


2 participants