Conversation


@sampaccoud sampaccoud commented Aug 7, 2025

Purpose

We want to add fulltext (and semantic in a second phase) search to Docs.

The goal is to enable efficient and scalable search across document content by pushing relevant data to a dedicated search backend, such as OpenSearch. The backend should be pluggable.

Proposal

  • Add indexing logic in a search indexer that can be declared as a backend
  • Implement indexing for the Find backend. See corresponding PR in Find
  • Implement search views as a proxy
  • Implement triggers to update the search index when a document or its accesses change. Synchronization should be done asynchronously, as changing a document or its accesses affects all its descendants...

Fixes #322
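The pluggable-backend idea could be sketched roughly as below. This is a hypothetical illustration, not the PR's actual code: the names `BaseDocumentIndexer`, `get_document_indexer` and `SEARCH_INDEXER_CLASS` come from the discussion in this thread, and a plain settings dict plus stdlib import logic stand in for Django's settings object and `django.utils.module_loading.import_string`.

```python
# Hypothetical sketch of a pluggable search backend (assumed names:
# BaseDocumentIndexer, get_document_indexer, SEARCH_INDEXER_CLASS).
import importlib
from abc import ABC, abstractmethod


class BaseDocumentIndexer(ABC):
    """Interface every search backend must implement."""

    @abstractmethod
    def push(self, documents):
        """Send a batch of serialized documents to the search service."""

    @abstractmethod
    def search(self, text, token, visited, page=1, page_size=20):
        """Return a list of matching document ids."""


def get_document_indexer(settings):
    """Return the configured indexer, or None when indexing is disabled."""
    dotted_path = settings.get("SEARCH_INDEXER_CLASS")
    if not dotted_path:
        # Default case: no search service configured, nothing is indexed.
        return None
    module_path, class_name = dotted_path.rsplit(".", 1)
    indexer_class = getattr(importlib.import_module(module_path), class_name)
    return indexer_class()
```

Callers then only need `get_document_indexer()` and a `None` check, which keeps the "no indexer configured" case the default.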

@sampaccoud sampaccoud requested a review from joehybird August 7, 2025 16:40
@sampaccoud sampaccoud added feature add a new feature backend labels Aug 7, 2025
@joehybird joehybird force-pushed the index-to-search branch 3 times, most recently from 10bfd94 to 5bd6b18 Compare September 8, 2025 12:38

gitguardian bot commented Sep 8, 2025

⚠️ GitGuardian has uncovered 1 secret following the scan of your pull request.

Please consider investigating the findings and remediating the incidents. Failure to do so may lead to compromising the associated services or software components.

🔎 Detected hardcoded secret in your pull request
GitGuardian id: 21473509 | Status: Triggered | Secret: Generic High Entropy Secret | Commit: d528b58 | Filename: env.d/development/common
🛠 Guidelines to remediate hardcoded secrets
  1. Revoke and rotate the secret.

  2. If possible, rewrite git history with git commit --amend and git push --force.


🦉 GitGuardian detects secrets in your source code to help developers and security teams secure the modern development process. You are seeing this because you or someone else with access to this repository has authorized GitGuardian to scan your pull request.

@qbey qbey (Member) left a comment

First review, I know work is still ongoing and I did not read all the tests... :)

Comment on lines 829 to 837
q = serializers.CharField(required=True)

def validate_q(self, value):
    """Ensure the text field is not empty."""

    if len(value.strip()) == 0:
        raise serializers.ValidationError("Text field cannot be empty.")

    return value
Member:

Suggested change:
- q = serializers.CharField(required=True)
- def validate_q(self, value):
-     """Ensure the text field is not empty."""
-     if len(value.strip()) == 0:
-         raise serializers.ValidationError("Text field cannot be empty.")
-     return value
+ q = serializers.CharField(required=True, allow_blank=False)

You may also add trim_whitespace=True

serializer.is_valid(raise_exception=True)

try:
    indexer = FindDocumentIndexer()
Member:

I guess this class should come from settings, because not everyone will have an indexer. I also think this view might fall back to searching locally on title if no indexer is configured.

url = getattr(settings, "SEARCH_INDEXER_QUERY_URL", None)

if not url:
    raise RuntimeError(
Member:

Suggested change:
- raise RuntimeError(
+ raise ImproperlyConfigured(

Returns:
dict: A JSON-serializable dictionary.
"""
url = getattr(settings, "SEARCH_INDEXER_QUERY_URL", None)
Member:

Suggested change:
- url = getattr(settings, "SEARCH_INDEXER_QUERY_URL", None)
+ url = settings.SEARCH_INDEXER_QUERY_URL

Comment on lines 224 to 225
logger.error("HTTPError: %s", e)
logger.error("Response content: %s", response.text) # type: ignore
Member:

Log the error only once

}


@receiver(signals.post_save, sender=DocumentAccess)
Member:

We try to follow "Use signals as a last resort" (Two scoops of Django): is there a problem calling this from the save method?
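For reference, the "call it from the save method" alternative could look like the sketch below. This is a hedged illustration, not the PR's code: the Django `models.Model` base is stubbed out so the example runs standalone, and `trigger_document_indexer` is assumed to be the Celery entry point this PR introduces.

```python
# Sketch: trigger indexing from save() instead of a post_save signal.
class Model:  # stand-in for django.db.models.Model
    def save(self, *args, **kwargs):
        pass


triggered = []


def trigger_document_indexer(instance):
    """Assumed task entry point; in the PR this would enqueue a Celery job."""
    triggered.append(instance.pk)


class DocumentAccess(Model):
    def __init__(self, pk, document_pk):
        self.pk = pk
        self.document_pk = document_pk

    def save(self, *args, **kwargs):
        super().save(*args, **kwargs)
        # Explicit call replaces the post_save receiver: the indexing
        # side effect is visible at the call site and easy to trace.
        trigger_document_indexer(self)
```

The trade-off is that bulk operations bypassing `save()` (e.g. `queryset.update()`) skip the trigger either way, so the choice is mostly about readability.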

Comment on lines 37 to 38
def sortkey(d):
    return d["id"]
push_call_args = [call.args[0] for call in mock_push.call_args_list]

assert len(push_call_args) == 1 # called once but with a batch of docs
assert sorted(push_call_args[0], key=sortkey) == sorted(
Member:

Interesting, actually I think the document sorting should be deterministic, in case we need to run the index command several times => I think we should change the index command to sort documents by creation date or something ^^

We can keep it this way for now, but we surely need to add a comment in the "index" management command.

Contributor:

the sort is by id because the indexation is done in batches: loop + id__gt=prev_batch_last_id
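A minimal sketch of that batch strategy (keyset pagination): each batch resumes after the previous batch's last id, which is why the command sorts by id. An in-memory list stands in for the Document queryset; the real command uses `id__gt=prev_batch_last_id` on a Django queryset.

```python
# Sketch of batched indexing with keyset pagination on id.
def iter_batches(documents, batch_size):
    """Yield id-ordered batches, each starting after the last id of the
    previous batch (the `id__gt=prev_batch_last_id` idea)."""
    last_id = None
    while True:
        batch = sorted(
            (d for d in documents if last_id is None or d["id"] > last_id),
            key=lambda d: d["id"],
        )[:batch_size]
        if not batch:
            return
        yield batch
        last_id = batch[-1]["id"]
```

Because the cursor is the id itself, re-running the command is deterministic even if rows are inserted between runs, as long as ids are monotonically increasing.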

Comment on lines 43 to 45
push_call_args = [call.args[0] for call in mock_push.call_args_list]

assert len(push_call_args) == 1 # called once but with a batch of docs
Member:

Don't you want to simply check with assert_called_once, then use the first value?
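A minimal illustration of this suggestion with pure `unittest.mock`; the `index_all` helper is a toy stand-in for the management command, not the PR's code.

```python
# Sketch: prefer assert_called_once + call_args over rebuilding the call list.
from unittest import mock


def index_all(push, documents):
    """Toy stand-in for the index command: one push() per run."""
    push(list(documents))


mock_push = mock.Mock()
index_all(mock_push, [{"id": 1}, {"id": 2}])

mock_push.assert_called_once()        # called once, with the whole batch
pushed = mock_push.call_args.args[0]  # first positional argument
```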

@joehybird joehybird force-pushed the index-to-search branch 16 times, most recently from 7cfa907 to 7255ec2 Compare September 15, 2025 13:01
Comment on lines 54 to 57
client = APIClient()
client.force_login(user)

response = APIClient().get("/api/v1.0/documents/search/", data={"q": "alpha"})
Member:

Too much APIClient() ^^'

response = client.get


response = APIClient().get("/api/v1.0/documents/search/", data={"q": "alpha"})

assert response.status_code == 401
Member:

Is it the expected status code for misconfiguration?

Comment on lines 70 to 73
client = APIClient()
client.force_login(user)

response = APIClient().get("/api/v1.0/documents/search/")
Member:

Suggested change:
- client = APIClient()
- client.force_login(user)
- response = APIClient().get("/api/v1.0/documents/search/")
+ client = APIClient()
+ client.force_login(user)
+ response = client.get("/api/v1.0/documents/search/")

assert response.status_code == 400
assert response.json() == {"q": ["This field is required."]}

response = APIClient().get("/api/v1.0/documents/search/", data={"q": " "})
Member:

Suggested change:
- response = APIClient().get("/api/v1.0/documents/search/", data={"q": " "})
+ response = client.get("/api/v1.0/documents/search/", data={"q": " "})

except ImproperlyConfigured:
    return drf.response.Response(
        {"detail": "The service is not properly configured."},
        status=status.HTTP_401_UNAUTHORIZED,
Member:

Why a 401?

Contributor:

I agree... why ? A 500 is a better idea 😅

Comment on lines 960 to 964
# Setup indexer configuration to make test working on the CI.
SEARCH_INDEXER_SECRET = "ThisIsAKeyForTest" # noqa
SEARCH_INDEXER_URL = "http://localhost:8081/api/v1.0/documents/index/"
SEARCH_INDEXER_QUERY_URL = "http://localhost:8081/api/v1.0/documents/search/"

Member:

I'm not sure I understand.

My two cents:

  • you should only set the settings in the tests which requires it
  • there must be a way to disable the indexation, because not every instance of docs will have an index, and it must be the default case.

=> I would remove those settings and determine which setting enables the indexation (I think SEARCH_INDEXER_CLASS should be None by default, and if None, then the document must not be indexed). Add new tests on the document save, with SEARCH_INDEXER_CLASS not None, to check the indexation task etc.

@joehybird joehybird force-pushed the index-to-search branch 7 times, most recently from 73c8b89 to 95d0ef6 Compare October 1, 2025 09:58
sampaccoud and others added 13 commits October 2, 2025 16:18
Search in Docs relies on an external project like "La Suite Find".
We need to declare a common external network in order to connect to
the search app and index our documents.
We need content in our demo documents so that we can test
indexing.
Add an indexer that loops across documents in the database, formats them
as JSON objects and indexes them in the remote "Find" micro-service.
On document content or permission changes, start a celery job that will call the
indexation API of the app "Find".

Signed-off-by: Fabre Florian <ffabre@hybird.org>
Signed-off-by: Fabre Florian <ffabre@hybird.org>
Signed-off-by: Fabre Florian <ffabre@hybird.org>
New API view that calls the indexed documents search view
(resource server) of app "Find".

Signed-off-by: Fabre Florian <ffabre@hybird.org>
New SEARCH_INDEXER_CLASS setting to define the indexer service class.
Raise ImproperlyConfigured errors instead of RuntimeError in index service.

Signed-off-by: Fabre Florian <ffabre@hybird.org>
Signed-off-by: Fabre Florian <ffabre@hybird.org>
Filter deleted documents from visited ones.
Set default ordering to the Find API search call (-updated_at)
BaseDocumentIndexer.search now returns a list of document ids instead of models.
Do not call the indexer in signals when SEARCH_INDEXER_CLASS is not defined
or properly configured.

Signed-off-by: Fabre Florian <ffabre@hybird.org>
Only documents without title and content are ignored by the indexer.
Add SEARCH_INDEXER_COUNTDOWN as configurable setting.
Make the search backend creation simpler (only 'get_document_indexer' now).
Allow indexation of deleted documents.

Signed-off-by: Fabre Florian <ffabre@hybird.org>
Generate a fernet key for the OIDC_STORE_REFRESH_TOKEN_KEY in development
settings if not set.

Signed-off-by: Fabre Florian <ffabre@hybird.org>
Add nginx with 'nginx' alias to the 'lasuite-net' network (keycloak calls)
Add celery-dev to the 'lasuite-net' network (Find API calls in jobs)
Set app-dev alias as 'impress' in the 'lasuite-net' network
Add indexer configuration in common settings

Signed-off-by: Fabre Florian <ffabre@hybird.org>
Applies filtering based on request parameter 'q' from `FindDocumentSerializer`.
Depending on the configuration it can be:
- A fulltext search through the opensearch indexation app "find" if the backend is
enabled (see SEARCH_BACKEND_CLASS)
Member:

Suggested change:
- enabled (see SEARCH_BACKEND_CLASS)
+ enabled (see SEARCH_INDEXER_CLASS)

I don't see the SEARCH_BACKEND_CLASS setting

Comment on lines 1067 to 1080
queryset = self.get_queryset()
filterset = DocumentFilter({"title": text}, queryset=queryset)

if not filterset.is_valid():
    raise drf.exceptions.ValidationError(filterset.errors)

queryset = filterset.filter_queryset(queryset).order_by("-updated_at")

return self.get_response_for_queryset(
    queryset,
    context={
        "request": request,
    },
)
Member:

You can maybe extract this into a private method?

Comment on lines 1082 to 1100
queryset = models.Document.objects.all()

# Retrieve the documents ids from Find.
results = indexer.search(
    text=text,
    token=access_token,
    visited=get_visited_document_ids_of(queryset, user),
    page=serializer.validated_data.get("page", 1),
    page_size=serializer.validated_data.get("page_size", 20),
)

queryset = queryset.filter(pk__in=results).order_by("-updated_at")

return self.get_response_for_queryset(
    queryset,
    context={
        "request": request,
    },
)
Member:

Same here ?

Comment on lines +52 to +55
indexer = get_document_indexer()

if indexer is None:
    return
Member:

Can this be done earlier, to abort the task before checking the debounce?

Contributor:

we can do an indexer_debounce_lock(document.pk) > 1 to skip the job start in the trigger_document_indexer method. But I'm not sure about the side effects, I have to think about it.
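A rough sketch of that debounce-before-enqueue idea. The real `indexer_debounce_lock` helper and its storage backend are assumptions here: a plain dict stands in for the cache, and resetting the counter when the task actually runs is omitted.

```python
# Sketch: count pending triggers per document and only enqueue the first.
_pending = {}


def indexer_debounce_lock(document_pk):
    """Increment and return the pending-trigger count for a document
    (assumed helper; the PR presumably backs this with a shared cache)."""
    _pending[document_pk] = _pending.get(document_pk, 0) + 1
    return _pending[document_pk]


def trigger_document_indexer(document_pk, enqueue):
    """Enqueue an indexing task only for the first pending trigger."""
    if indexer_debounce_lock(document_pk) > 1:
        return  # a task is already pending for this document
    # In the real code the task would reset the counter when it runs.
    enqueue(document_pk)
```

The side effect to think about is exactly the omitted reset: if the counter is never cleared (or cleared too early), later legitimate changes could be skipped or double-indexed.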


access_qs = models.DocumentAccess.objects.filter(
    document__path__in=ancestor_paths
).values("document__path", "user__sub", "team")
Member:

Suggested change
).values("document__path", "user__sub", "team")
).values("document__path", "user__sub", "team").iterator()

As it is used only in the task, maybe it would be interesting to use .iterator()?

@@ -0,0 +1,89 @@
"""Trigger document indexation using celery task."""
Member:

You are mixing find and search indexers in the PR. IMO keeping search_indexer here makes sense, since that is what you are doing (for the module name).

Contributor:

good point 👍

str(no_title_doc.path): {"users": [user.sub]},
}

with mock.patch.object(FindDocumentIndexer, "push") as mock_push:
Member:

Instead of mocking the push, why not use responses?

count = 0

while True:
documents_batch = list(
Member:

To force the queryset execution?

Comment on lines 293 to 303
try:
    response = requests.post(
        self.indexer_url,
        json=data,
        headers={"Authorization": f"Bearer {self.indexer_secret}"},
        timeout=10,
    )
    response.raise_for_status()
except requests.exceptions.HTTPError as e:
    logger.error("HTTPError: %s", e)
    raise
Member:

Suggested change:
- try:
-     response = requests.post(
-         self.indexer_url,
-         json=data,
-         headers={"Authorization": f"Bearer {self.indexer_secret}"},
-         timeout=10,
-     )
-     response.raise_for_status()
- except requests.exceptions.HTTPError as e:
-     logger.error("HTTPError: %s", e)
-     raise
+ response = requests.post(
+     self.indexer_url,
+     json=data,
+     headers={"Authorization": f"Bearer {self.indexer_secret}"},
+     timeout=10,
+ )
+ response.raise_for_status()

I would let the exception propagate without catching it. Sentry is configured to handle it.

Comment on lines 273 to 284
try:
    response = requests.post(
        self.search_url,
        json=data,
        headers={"Authorization": f"Bearer {token}"},
        timeout=10,
    )
    response.raise_for_status()
    return response.json()
except requests.exceptions.HTTPError as e:
    logger.error("HTTPError: %s", e)
    raise
Member:

Same here, no try/catch needed IMO

Set a valid Fernet key for OIDC_STORE_REFRESH_TOKEN_KEY in env.d/development/common
Add .gitguardian.yaml configuration to ignore this key.

Signed-off-by: Fabre Florian <ffabre@hybird.org>
Rename FindDocumentIndexer as SearchIndexer
Rename FindDocumentSerializer as SearchDocumentSerializer
Rename package core.tasks.find as core.tasks.search
Remove logs on http errors in SearchIndexer
Factorise some code in search API view.

Signed-off-by: Fabre Florian <ffabre@hybird.org>
Labels: backend, feature (add a new feature)

Successfully merging this pull request may close: Full-Blown search feature

4 participants