Conversation


@sampaccoud sampaccoud commented Aug 7, 2025

Purpose

We want to add fulltext (and semantic in a second phase) search to Docs.

The goal is to enable efficient and scalable search across document content by pushing relevant data to a dedicated search backend, such as OpenSearch. The backend should be pluggable.

Proposal

  • Add indexing logic in a search indexer that can be declared as a backend
  • Implement indexing for the Find backend. See corresponding PR in Find
  • Implement search views as a proxy
  • Implement triggers to update the search index when a document or its accesses change. Synchronization should be done asynchronously, as changing a document or its accesses affects all its descendants...

Fixes #322
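The pluggable-backend idea could be sketched roughly as below. This is a hypothetical illustration, not the PR's actual code: the names `BaseDocumentIndexer`, `get_document_indexer` and `SEARCH_INDEXER_CLASS` come from the discussion in this thread, and a plain settings dict plus stdlib import logic stand in for Django's settings object and `django.utils.module_loading.import_string`.

```python
# Hypothetical sketch of a pluggable search backend (assumed names:
# BaseDocumentIndexer, get_document_indexer, SEARCH_INDEXER_CLASS).
import importlib
from abc import ABC, abstractmethod


class BaseDocumentIndexer(ABC):
    """Interface every search backend must implement."""

    @abstractmethod
    def push(self, documents):
        """Send a batch of serialized documents to the search service."""

    @abstractmethod
    def search(self, text, token, visited, page=1, page_size=20):
        """Return a list of matching document ids."""


def get_document_indexer(settings):
    """Return the configured indexer, or None when indexing is disabled."""
    dotted_path = settings.get("SEARCH_INDEXER_CLASS")
    if not dotted_path:
        # Default case: no search service configured, nothing is indexed.
        return None
    module_path, class_name = dotted_path.rsplit(".", 1)
    indexer_class = getattr(importlib.import_module(module_path), class_name)
    return indexer_class()
```

Callers then only need `get_document_indexer()` and a `None` check, which keeps the "no indexer configured" case the default.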

@sampaccoud sampaccoud requested a review from joehybird August 7, 2025 16:40
@sampaccoud sampaccoud added feature add a new feature backend labels Aug 7, 2025
@joehybird joehybird force-pushed the index-to-search branch 3 times, most recently from 10bfd94 to 5bd6b18 Compare September 8, 2025 12:38

gitguardian bot commented Sep 8, 2025

⚠️ GitGuardian has uncovered 1 secret following the scan of your pull request.

Please consider investigating the findings and remediating the incidents. Failure to do so may lead to compromising the associated services or software components.

🔎 Detected hardcoded secret in your pull request
GitGuardian id: 21473509 | Status: Triggered | Secret: Generic High Entropy Secret | Commit: d528b58 | Filename: env.d/development/common
🛠 Guidelines to remediate hardcoded secrets
  1. Revoke and rotate the secret.

  2. If possible, rewrite git history with git commit --amend and git push --force.


🦉 GitGuardian detects secrets in your source code to help developers and security teams secure the modern development process. You are seeing this because you or someone else with access to this repository has authorized GitGuardian to scan your pull request.

@qbey qbey (Member) left a comment

First review, I know work is still ongoing and I did not read all the tests... :)

Comment on lines 829 to 837
q = serializers.CharField(required=True)

def validate_q(self, value):
    """Ensure the text field is not empty."""

    if len(value.strip()) == 0:
        raise serializers.ValidationError("Text field cannot be empty.")

    return value
Member:

Suggested change:
- q = serializers.CharField(required=True)
- def validate_q(self, value):
-     """Ensure the text field is not empty."""
-     if len(value.strip()) == 0:
-         raise serializers.ValidationError("Text field cannot be empty.")
-     return value
+ q = serializers.CharField(required=True, allow_blank=False)

You may also add trim_whitespace=True

serializer.is_valid(raise_exception=True)

try:
    indexer = FindDocumentIndexer()
Member:

I guess this class should come from settings, because not everyone will have an indexer. I also think this view might fall back to searching locally on title if no indexer is configured.

url = getattr(settings, "SEARCH_INDEXER_QUERY_URL", None)

if not url:
    raise RuntimeError(
Member:

Suggested change:
- raise RuntimeError(
+ raise ImproperlyConfigured(

Returns:
dict: A JSON-serializable dictionary.
"""
url = getattr(settings, "SEARCH_INDEXER_QUERY_URL", None)
Member:

Suggested change:
- url = getattr(settings, "SEARCH_INDEXER_QUERY_URL", None)
+ url = settings.SEARCH_INDEXER_QUERY_URL

Comment on lines 224 to 225
logger.error("HTTPError: %s", e)
logger.error("Response content: %s", response.text) # type: ignore
Member:

Log the error only once

}


@receiver(signals.post_save, sender=DocumentAccess)
Member:

We try to follow "Use signals as a last resort" (Two scoops of Django): is there a problem calling this from the save method?
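For reference, the "call it from the save method" alternative could look like the sketch below. This is a hedged illustration, not the PR's code: the Django `models.Model` base is stubbed out so the example runs standalone, and `trigger_document_indexer` is assumed to be the Celery entry point this PR introduces.

```python
# Sketch: trigger indexing from save() instead of a post_save signal.
class Model:  # stand-in for django.db.models.Model
    def save(self, *args, **kwargs):
        pass


triggered = []


def trigger_document_indexer(instance):
    """Assumed task entry point; in the PR this would enqueue a Celery job."""
    triggered.append(instance.pk)


class DocumentAccess(Model):
    def __init__(self, pk, document_pk):
        self.pk = pk
        self.document_pk = document_pk

    def save(self, *args, **kwargs):
        super().save(*args, **kwargs)
        # Explicit call replaces the post_save receiver: the indexing
        # side effect is visible at the call site and easy to trace.
        trigger_document_indexer(self)
```

The trade-off is that bulk operations bypassing `save()` (e.g. `queryset.update()`) skip the trigger either way, so the choice is mostly about readability.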

Comment on lines 37 to 38
def sortkey(d):
    return d["id"]
push_call_args = [call.args[0] for call in mock_push.call_args_list]

assert len(push_call_args) == 1 # called once but with a batch of docs
assert sorted(push_call_args[0], key=sortkey) == sorted(
Member:

Interesting, actually I think the document sorting should be deterministic, in case we need to run the index command several times => I think we should change the index command to sort documents by creation date or something ^^

We can keep it this way for now, but we surely need to add a comment in the "index" management command.

Contributor:

the sort is by id because the indexation is done in batches: loop + id__gt=prev_batch_last_id
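A minimal sketch of that batch strategy (keyset pagination): each batch resumes after the previous batch's last id, which is why the command sorts by id. An in-memory list stands in for the Document queryset; the real command uses `id__gt=prev_batch_last_id` on a Django queryset.

```python
# Sketch of batched indexing with keyset pagination on id.
def iter_batches(documents, batch_size):
    """Yield id-ordered batches, each starting after the last id of the
    previous batch (the `id__gt=prev_batch_last_id` idea)."""
    last_id = None
    while True:
        batch = sorted(
            (d for d in documents if last_id is None or d["id"] > last_id),
            key=lambda d: d["id"],
        )[:batch_size]
        if not batch:
            return
        yield batch
        last_id = batch[-1]["id"]
```

Because the cursor is the id itself, re-running the command is deterministic even if rows are inserted between runs, as long as ids are monotonically increasing.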

Comment on lines 43 to 45
push_call_args = [call.args[0] for call in mock_push.call_args_list]

assert len(push_call_args) == 1 # called once but with a batch of docs
Member:

Don't you want to simply check with assert_called_once, then use the first value?
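A minimal illustration of this suggestion with pure `unittest.mock`; the `index_all` helper is a toy stand-in for the management command, not the PR's code.

```python
# Sketch: prefer assert_called_once + call_args over rebuilding the call list.
from unittest import mock


def index_all(push, documents):
    """Toy stand-in for the index command: one push() per run."""
    push(list(documents))


mock_push = mock.Mock()
index_all(mock_push, [{"id": 1}, {"id": 2}])

mock_push.assert_called_once()        # called once, with the whole batch
pushed = mock_push.call_args.args[0]  # first positional argument
```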

@joehybird joehybird force-pushed the index-to-search branch 16 times, most recently from 7cfa907 to 7255ec2 Compare September 15, 2025 13:01
Comment on lines 54 to 57
client = APIClient()
client.force_login(user)

response = APIClient().get("/api/v1.0/documents/search/", data={"q": "alpha"})
Member:

Too much APIClient() ^^'

response = client.get


response = APIClient().get("/api/v1.0/documents/search/", data={"q": "alpha"})

assert response.status_code == 401
Member:

Is it the expected status code for misconfiguration?

Comment on lines 70 to 73
client = APIClient()
client.force_login(user)

response = APIClient().get("/api/v1.0/documents/search/")
Member:

Suggested change:
- client = APIClient()
- client.force_login(user)
- response = APIClient().get("/api/v1.0/documents/search/")
+ client = APIClient()
+ client.force_login(user)
+ response = client.get("/api/v1.0/documents/search/")

assert response.status_code == 400
assert response.json() == {"q": ["This field is required."]}

response = APIClient().get("/api/v1.0/documents/search/", data={"q": " "})
Member:

Suggested change:
- response = APIClient().get("/api/v1.0/documents/search/", data={"q": " "})
+ response = client.get("/api/v1.0/documents/search/", data={"q": " "})

except ImproperlyConfigured:
    return drf.response.Response(
        {"detail": "The service is not properly configured."},
        status=status.HTTP_401_UNAUTHORIZED,
Member:

Why a 401?

Contributor:

I agree... why ? A 500 is a better idea 😅

Comment on lines 960 to 964
# Setup indexer configuration to make test working on the CI.
SEARCH_INDEXER_SECRET = "ThisIsAKeyForTest" # noqa
SEARCH_INDEXER_URL = "http://localhost:8081/api/v1.0/documents/index/"
SEARCH_INDEXER_QUERY_URL = "http://localhost:8081/api/v1.0/documents/search/"

Member:

I'm not sure I understand.

My two cents:

  • you should only set the settings in the tests which requires it
  • there must be a way to disable the indexation, because not every instance of docs will have an index, and it must be the default case.

=> I would remove those settings and determine which setting enables the indexation (I think SEARCH_INDEXER_CLASS should be None by default, and if None, then the document must not be indexed). Add new tests on the document save, with SEARCH_INDEXER_CLASS not None, to check the indexation task etc.

@joehybird joehybird force-pushed the index-to-search branch 7 times, most recently from 73c8b89 to 95d0ef6 Compare October 1, 2025 09:58
sampaccoud and others added 13 commits October 2, 2025 16:18
Search in Docs relies on an external project like "La Suite Find".
We need to declare a common external network in order to connect to
the search app and index our documents.
We need content in our demo documents so that we can test
indexing.
Add an indexer that loops across documents in the database, formats them
as JSON objects and indexes them in the remote "Find" micro-service.
On document content or permission changes, start a celery job that will call the
indexation API of the app "Find".

Signed-off-by: Fabre Florian <ffabre@hybird.org>
Signed-off-by: Fabre Florian <ffabre@hybird.org>
Signed-off-by: Fabre Florian <ffabre@hybird.org>
New API view that calls the indexed documents search view
(resource server) of app "Find".

Signed-off-by: Fabre Florian <ffabre@hybird.org>
New SEARCH_INDEXER_CLASS setting to define the indexer service class.
Raise ImproperlyConfigured errors instead of RuntimeError in index service.

Signed-off-by: Fabre Florian <ffabre@hybird.org>
Signed-off-by: Fabre Florian <ffabre@hybird.org>
Filter deleted documents from visited ones.
Set default ordering to the Find API search call (-updated_at)
BaseDocumentIndexer.search now returns a list of document ids instead of models.
Do not call the indexer in signals when SEARCH_INDEXER_CLASS is not defined
or properly configured.

Signed-off-by: Fabre Florian <ffabre@hybird.org>
Only documents without title and content are ignored by the indexer.
Add SEARCH_INDEXER_COUNTDOWN as configurable setting.
Make the search backend creation simpler (only 'get_document_indexer' now).
Allow indexation of deleted documents.

Signed-off-by: Fabre Florian <ffabre@hybird.org>
Generate a fernet key for the OIDC_STORE_REFRESH_TOKEN_KEY in development
settings if not set.

Signed-off-by: Fabre Florian <ffabre@hybird.org>
Add nginx with 'nginx' alias to the 'lasuite-net' network (keycloak calls)
Add celery-dev to the 'lasuite-net' network (Find API calls in jobs)
Set app-dev alias as 'impress' in the 'lasuite-net' network
Add indexer configuration in common settings

Signed-off-by: Fabre Florian <ffabre@hybird.org>
Applies filtering based on request parameter 'q' from `FindDocumentSerializer`.
Depending on the configuration it can be:
- A fulltext search through the opensearch indexation app "find" if the backend is
enabled (see SEARCH_BACKEND_CLASS)
Member:

Suggested change:
- enabled (see SEARCH_BACKEND_CLASS)
+ enabled (see SEARCH_INDEXER_CLASS)

I don't see the SEARCH_BACKEND_CLASS setting

Comment on lines 1067 to 1080
queryset = self.get_queryset()
filterset = DocumentFilter({"title": text}, queryset=queryset)

if not filterset.is_valid():
    raise drf.exceptions.ValidationError(filterset.errors)

queryset = filterset.filter_queryset(queryset).order_by("-updated_at")

return self.get_response_for_queryset(
    queryset,
    context={
        "request": request,
    },
)
Member:

You can maybe extract this into a private method?

Comment on lines 1082 to 1100
queryset = models.Document.objects.all()

# Retrieve the documents ids from Find.
results = indexer.search(
    text=text,
    token=access_token,
    visited=get_visited_document_ids_of(queryset, user),
    page=serializer.validated_data.get("page", 1),
    page_size=serializer.validated_data.get("page_size", 20),
)

queryset = queryset.filter(pk__in=results).order_by("-updated_at")

return self.get_response_for_queryset(
    queryset,
    context={
        "request": request,
    },
)
Member:

Same here ?

Comment on lines +52 to +55
indexer = get_document_indexer()

if indexer is None:
    return
Member:

Can this be done earlier, to abort the task before checking the debounce?

Contributor:

we can do an indexer_debounce_lock(document.pk) > 1 to skip the job start in the trigger_document_indexer method. But I'm not sure about the side effects, I have to think about it.
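A rough sketch of that debounce-before-enqueue idea. The real `indexer_debounce_lock` helper and its storage backend are assumptions here: a plain dict stands in for the cache, and resetting the counter when the task actually runs is omitted.

```python
# Sketch: count pending triggers per document and only enqueue the first.
_pending = {}


def indexer_debounce_lock(document_pk):
    """Increment and return the pending-trigger count for a document
    (assumed helper; the PR presumably backs this with a shared cache)."""
    _pending[document_pk] = _pending.get(document_pk, 0) + 1
    return _pending[document_pk]


def trigger_document_indexer(document_pk, enqueue):
    """Enqueue an indexing task only for the first pending trigger."""
    if indexer_debounce_lock(document_pk) > 1:
        return  # a task is already pending for this document
    # In the real code the task would reset the counter when it runs.
    enqueue(document_pk)
```

The side effect to think about is exactly the omitted reset: if the counter is never cleared (or cleared too early), later legitimate changes could be skipped or double-indexed.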


access_qs = models.DocumentAccess.objects.filter(
    document__path__in=ancestor_paths
).values("document__path", "user__sub", "team")
Member:

Suggested change
).values("document__path", "user__sub", "team")
).values("document__path", "user__sub", "team").iterator()

As it is used only in the task, maybe it would be interesting to use .iterator()?

@@ -0,0 +1,89 @@
"""Trigger document indexation using celery task."""
Member:

You are mixing find and search indexers in the PR. IMO keeping search_indexer here makes sense, since that is what you are doing (for the module name).

Contributor:

good point 👍

str(no_title_doc.path): {"users": [user.sub]},
}

with mock.patch.object(FindDocumentIndexer, "push") as mock_push:
Member:

Instead of mocking the push, why not use responses?

count = 0

while True:
documents_batch = list(
Member:

To force the queryset execution?

Comment on lines 293 to 303
try:
    response = requests.post(
        self.indexer_url,
        json=data,
        headers={"Authorization": f"Bearer {self.indexer_secret}"},
        timeout=10,
    )
    response.raise_for_status()
except requests.exceptions.HTTPError as e:
    logger.error("HTTPError: %s", e)
    raise
Member:

Suggested change:
- try:
-     response = requests.post(
-         self.indexer_url,
-         json=data,
-         headers={"Authorization": f"Bearer {self.indexer_secret}"},
-         timeout=10,
-     )
-     response.raise_for_status()
- except requests.exceptions.HTTPError as e:
-     logger.error("HTTPError: %s", e)
-     raise
+ response = requests.post(
+     self.indexer_url,
+     json=data,
+     headers={"Authorization": f"Bearer {self.indexer_secret}"},
+     timeout=10,
+ )
+ response.raise_for_status()

I would let the exception propagate without catching it. Sentry is configured to handle it.

Comment on lines 273 to 284
try:
    response = requests.post(
        self.search_url,
        json=data,
        headers={"Authorization": f"Bearer {token}"},
        timeout=10,
    )
    response.raise_for_status()
    return response.json()
except requests.exceptions.HTTPError as e:
    logger.error("HTTPError: %s", e)
    raise
Member:

Same here, no try/catch needed IMO

Set a valid Fernet key for OIDC_STORE_REFRESH_TOKEN_KEY in env.d/development/common
Add .gitguardian.yaml configuration to ignore this key.

Signed-off-by: Fabre Florian <ffabre@hybird.org>
Rename FindDocumentIndexer as SearchIndexer
Rename FindDocumentSerializer as SearchDocumentSerializer
Rename package core.tasks.find as core.tasks.search
Remove logs on http errors in SearchIndexer
Factorise some code in search API view.

Signed-off-by: Fabre Florian <ffabre@hybird.org>
Labels: backend, feature (add a new feature)

Successfully merging this pull request may close: Full-Blown search feature

4 participants