Index to search #1276

Conversation
Force-pushed from 89fd00e to 526d757
Force-pushed from 10bfd94 to 5bd6b18
| GitGuardian id | GitGuardian status | Secret | Commit | Filename |
|---|---|---|---|---|
| 21473509 | Triggered | Generic High Entropy Secret | d528b58 | env.d/development/common |
🛠 Guidelines to remediate hardcoded secrets
- Revoke and rotate the secret.
- If possible, rewrite git history with `git commit --amend` and `git push --force`.
🦉 GitGuardian detects secrets in your source code to help developers and security teams secure the modern development process. You are seeing this because you or someone else with access to this repository has authorized GitGuardian to scan your pull request.
Force-pushed from 5bd6b18 to e966594
First review, I know work is still ongoing and I did not read all the tests... :)
src/backend/core/api/serializers.py (Outdated)
```python
q = serializers.CharField(required=True)

def validate_q(self, value):
    """Ensure the text field is not empty."""
    if len(value.strip()) == 0:
        raise serializers.ValidationError("Text field cannot be empty.")
    return value
```
```diff
-q = serializers.CharField(required=True)
-
-def validate_q(self, value):
-    """Ensure the text field is not empty."""
-    if len(value.strip()) == 0:
-        raise serializers.ValidationError("Text field cannot be empty.")
-    return value
+q = serializers.CharField(required=True, allow_blank=False)
```
You may also add `trim_whitespace=True`.
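For reference, a minimal plain-Python sketch (not the actual DRF internals) of how `allow_blank=False` combined with `trim_whitespace=True` behaves: the value is trimmed first, then blank input is rejected, which makes the custom `validate_q` method unnecessary.

```python
def run_charfield_validation(value, allow_blank=False, trim_whitespace=True):
    """Approximate DRF CharField behaviour for blank/whitespace-only input."""
    if trim_whitespace:
        value = value.strip()
    if value == "" and not allow_blank:
        raise ValueError("This field may not be blank.")
    return value
```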
src/backend/core/api/viewsets.py (Outdated)
```python
serializer.is_valid(raise_exception=True)

try:
    indexer = FindDocumentIndexer()
```
I guess this class should come from settings, because not everyone will have an indexer. I also think this view might fall back to searching locally on title if no indexer is configured.
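The suggested behaviour could be sketched as a small dispatch helper (hypothetical names, not code from the PR): use the configured indexer when present, otherwise fall back to a local title search.

```python
def search_documents(query, indexer=None, local_title_search=None):
    # Hypothetical sketch: prefer the configured indexer; fall back to a
    # plain title search when no indexer is configured.
    if indexer is not None:
        return indexer.search(query)
    return local_title_search(query)
```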
```python
url = getattr(settings, "SEARCH_INDEXER_QUERY_URL", None)

if not url:
    raise RuntimeError(
```
```diff
-raise RuntimeError(
+raise ImproperlyConfigured(
```
```python
Returns:
    dict: A JSON-serializable dictionary.
"""
url = getattr(settings, "SEARCH_INDEXER_QUERY_URL", None)
```
```diff
-url = getattr(settings, "SEARCH_INDEXER_QUERY_URL", None)
+url = settings.SEARCH_INDEXER_QUERY_URL
```
```python
logger.error("HTTPError: %s", e)
logger.error("Response content: %s", response.text)  # type: ignore
```
Log the error only once
src/backend/core/models.py (Outdated)
```python
}

@receiver(signals.post_save, sender=DocumentAccess)
```
We try to follow "Use signals as a last resort" (Two Scoops of Django): is there a problem calling this from the `save` method?
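The alternative the reviewer suggests, calling the indexer trigger from `save()` instead of a `post_save` signal, can be sketched in plain Python (hypothetical names, standing in for the Django model and the Celery task trigger):

```python
calls = []

def trigger_document_indexer(document):
    # Stand-in for the Celery task trigger used in the PR.
    calls.append(document)

class BaseModel:
    def save(self):
        self.saved = True

class Document(BaseModel):
    def save(self):
        # Explicit call from save() rather than a post_save signal,
        # following "use signals as a last resort".
        super().save()
        trigger_document_indexer(self)
```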
```python
def sortkey(d):
    return d["id"]
```
```python
push_call_args = [call.args[0] for call in mock_push.call_args_list]

assert len(push_call_args) == 1  # called once but with a batch of docs
assert sorted(push_call_args[0], key=sortkey) == sorted(
```
Interesting. Actually I think the document sorting should be deterministic, in case we need to run the index command several times => I think we should change the index command to sort documents by creation date or something ^^
We can keep it this way for now, but we surely need to add a comment in the "index" management command.
The sort is by id because the indexation is done by batch: loop + `id__gt=prev_batch_last_id`.
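The batching pattern described above (keyset pagination on `id`) can be sketched as follows; `fetch_batch` is a hypothetical stand-in for the ORM query with `id__gt=prev_batch_last_id`:

```python
def iter_batches(fetch_batch, batch_size=3):
    """Yield batches using keyset pagination on a sortable id."""
    last_id = None
    while True:
        batch = fetch_batch(after_id=last_id, limit=batch_size)
        if not batch:
            return
        yield batch
        last_id = batch[-1]["id"]  # next batch filters id > last_id

# Fake data source standing in for Document.objects.order_by("id")
docs = [{"id": i} for i in range(1, 8)]

def fetch_batch(after_id, limit):
    rows = [d for d in docs if after_id is None or d["id"] > after_id]
    return rows[:limit]

batches = list(iter_batches(fetch_batch, batch_size=3))
```

This is why the id ordering matters: any other, non-deterministic order would break the `id__gt` cursor between batches.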
```python
push_call_args = [call.args[0] for call in mock_push.call_args_list]

assert len(push_call_args) == 1  # called once but with a batch of docs
```
Don't you want to simply check with `assert_called_once`, then use the first call's value?
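The reviewer's suggestion with `unittest.mock`, roughly:

```python
from unittest import mock

mock_push = mock.Mock()
mock_push([{"id": 2}, {"id": 1}])  # the indexer pushes a single batch

mock_push.assert_called_once()           # fails if called 0 or >1 times
batch = mock_push.call_args.args[0]      # first positional arg of that call
```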
Force-pushed from 7cfa907 to 7255ec2
```python
client = APIClient()
client.force_login(user)

response = APIClient().get("/api/v1.0/documents/search/", data={"q": "alpha"})
```
Too many `APIClient()` instances ^^' Use `response = client.get(...)`.
```python
response = APIClient().get("/api/v1.0/documents/search/", data={"q": "alpha"})

assert response.status_code == 401
```
Is it the expected status code for misconfiguration?
```python
client = APIClient()
client.force_login(user)

response = APIClient().get("/api/v1.0/documents/search/")
```
```diff
 client = APIClient()
 client.force_login(user)
-response = APIClient().get("/api/v1.0/documents/search/")
+response = client.get("/api/v1.0/documents/search/")
```
```python
assert response.status_code == 400
assert response.json() == {"q": ["This field is required."]}

response = APIClient().get("/api/v1.0/documents/search/", data={"q": " "})
```
```diff
-response = APIClient().get("/api/v1.0/documents/search/", data={"q": " "})
+response = client.get("/api/v1.0/documents/search/", data={"q": " "})
```
src/backend/core/api/viewsets.py (Outdated)
```python
except ImproperlyConfigured:
    return drf.response.Response(
        {"detail": "The service is not properly configured."},
        status=status.HTTP_401_UNAUTHORIZED,
```
Why a 401?
I agree... why ? A 500 is a better idea 😅
src/backend/impress/settings.py (Outdated)
```python
# Setup indexer configuration to make test working on the CI.
SEARCH_INDEXER_SECRET = "ThisIsAKeyForTest"  # noqa
SEARCH_INDEXER_URL = "http://localhost:8081/api/v1.0/documents/index/"
SEARCH_INDEXER_QUERY_URL = "http://localhost:8081/api/v1.0/documents/search/"
```
I'm not sure I understand.
My two cents:
- you should only set these settings in the tests which require them
- there must be a way to disable indexation, because not every instance of Docs will have an index, and that must be the default case.
=> I would remove those settings and determine which setting enables indexation (I think SEARCH_INDEXER_CLASS should be None by default, and if None, the document must not be indexed). Add new tests on document save, with SEARCH_INDEXER_CLASS not None, to check the indexation task etc.
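The `SEARCH_INDEXER_CLASS`-gated behaviour could look roughly like this (hypothetical helper; in a Django project `django.utils.module_loading.import_string` would normally do the import):

```python
from importlib import import_module

def get_document_indexer(dotted_path):
    """Return the indexer class named by the setting, or None when disabled."""
    if not dotted_path:  # SEARCH_INDEXER_CLASS defaults to None: no indexation
        return None
    module_path, _, class_name = dotted_path.rpartition(".")
    return getattr(import_module(module_path), class_name)
```

Callers then skip indexation entirely whenever the helper returns `None`, which makes "no indexer" the default case.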
Force-pushed from 7255ec2 to ee1105f
Force-pushed from 73c8b89 to 95d0ef6
Search in Docs relies on an external project like "La Suite Find". We need to declare a common external network in order to connect to the search app and index our documents.
We need content in our demo documents so that we can test indexing.
Add an indexer that loops across documents in the database, formats them as JSON objects and indexes them in the remote "Find" micro-service.
On document content or permission changes, start a celery job that will call the indexation API of the app "Find". Signed-off-by: Fabre Florian <ffabre@hybird.org>
Signed-off-by: Fabre Florian <ffabre@hybird.org>
Signed-off-by: Fabre Florian <ffabre@hybird.org>
New API view that calls the indexed documents search view (resource server) of app "Find". Signed-off-by: Fabre Florian <ffabre@hybird.org>
New SEARCH_INDEXER_CLASS setting to define the indexer service class. Raise ImproperlyConfigured errors instead of RuntimeError in index service. Signed-off-by: Fabre Florian <ffabre@hybird.org>
Signed-off-by: Fabre Florian <ffabre@hybird.org>
Filter deleted documents from visited ones. Set default ordering for the Find API search call (-updated_at). BaseDocumentIndexer.search now returns a list of document ids instead of models. Do not call the indexer in signals when SEARCH_INDEXER_CLASS is not defined or properly configured. Signed-off-by: Fabre Florian <ffabre@hybird.org>
Only documents without title and content are ignored by the indexer.
Add SEARCH_INDEXER_COUNTDOWN as a configurable setting. Make the search backend creation simpler (only 'get_document_indexer' now). Allow indexation of deleted documents. Signed-off-by: Fabre Florian <ffabre@hybird.org>
Generate a fernet key for the OIDC_STORE_REFRESH_TOKEN_KEY in development settings if not set. Signed-off-by: Fabre Florian <ffabre@hybird.org>
Force-pushed from 95d0ef6 to 8a483a7
Add nginx with 'nginx' alias to the 'lasuite-net' network (keycloak calls) Add celery-dev to the 'lasuite-net' network (Find API calls in jobs) Set app-dev alias as 'impress' in the 'lasuite-net' network Add indexer configuration in common settings Signed-off-by: Fabre Florian <ffabre@hybird.org>
Force-pushed from 83064ed to d4d279c
src/backend/core/api/viewsets.py (Outdated)
```python
Applies filtering based on request parameter 'q' from `FindDocumentSerializer`.
Depending of the configuration it can be:
- A fulltext search through the opensearch indexation app "find" if the backend is
  enabled (see SEARCH_BACKEND_CLASS)
```
```diff
-enabled (see SEARCH_BACKEND_CLASS)
+enabled (see SEARCH_INDEXER_CLASS)
```
I don't see a `SEARCH_BACKEND_CLASS` setting.
src/backend/core/api/viewsets.py (Outdated)
```python
queryset = self.get_queryset()
filterset = DocumentFilter({"title": text}, queryset=queryset)

if not filterset.is_valid():
    raise drf.exceptions.ValidationError(filterset.errors)

queryset = filterset.filter_queryset(queryset).order_by("-updated_at")

return self.get_response_for_queryset(
    queryset,
    context={
        "request": request,
    },
)
```
Maybe you can extract this into a private method?
src/backend/core/api/viewsets.py (Outdated)
```python
queryset = models.Document.objects.all()

# Retrieve the documents ids from Find.
results = indexer.search(
    text=text,
    token=access_token,
    visited=get_visited_document_ids_of(queryset, user),
    page=serializer.validated_data.get("page", 1),
    page_size=serializer.validated_data.get("page_size", 20),
)

queryset = queryset.filter(pk__in=results).order_by("-updated_at")

return self.get_response_for_queryset(
    queryset,
    context={
        "request": request,
    },
)
```
Same here ?
```python
indexer = get_document_indexer()

if indexer is None:
    return
```
Can this be done earlier, to abort the task before checking the debounce?
We can do an `indexer_debounce_lock(document.pk) > 1` check to skip the job start in the `trigger_document_indexer` method. But I'm not sure about the side effects, I have to think about it.
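A rough sketch of the debounce idea being discussed (hypothetical, counter-based; the real implementation would presumably use a shared cache or Redis lock rather than process memory):

```python
from collections import Counter

_pending = Counter()

def indexer_debounce_lock(pk):
    # Hypothetical: count pending indexation triggers for this document.
    _pending[pk] += 1
    return _pending[pk]

def trigger_document_indexer(pk):
    # Skip scheduling when another indexation is already pending for pk.
    if indexer_debounce_lock(pk) > 1:
        return False
    return True
```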
```python
access_qs = models.DocumentAccess.objects.filter(
    document__path__in=ancestor_paths
).values("document__path", "user__sub", "team")
```
```diff
-).values("document__path", "user__sub", "team")
+).values("document__path", "user__sub", "team").iterator()
```
As it is used only in the task, maybe it would be interesting to use `.iterator()`?
```diff
@@ -0,0 +1,89 @@
+"""Trigger document indexation using celery task."""
```
You are mixing `find` and `search` indexers in the PR. IMO keeping `search_indexer` here makes sense, this is what you are doing (for the module name).
good point 👍
```python
    str(no_title_doc.path): {"users": [user.sub]},
}

with mock.patch.object(FindDocumentIndexer, "push") as mock_push:
```
Instead of mocking `push`, why not use `responses`?
```python
count = 0

while True:
    documents_batch = list(
```
To force the queryset execution ?
```python
try:
    response = requests.post(
        self.indexer_url,
        json=data,
        headers={"Authorization": f"Bearer {self.indexer_secret}"},
        timeout=10,
    )
    response.raise_for_status()
except requests.exceptions.HTTPError as e:
    logger.error("HTTPError: %s", e)
    raise
```
```diff
-try:
-    response = requests.post(
-        self.indexer_url,
-        json=data,
-        headers={"Authorization": f"Bearer {self.indexer_secret}"},
-        timeout=10,
-    )
-    response.raise_for_status()
-except requests.exceptions.HTTPError as e:
-    logger.error("HTTPError: %s", e)
-    raise
+response = requests.post(
+    self.indexer_url,
+    json=data,
+    headers={"Authorization": f"Bearer {self.indexer_secret}"},
+    timeout=10,
+)
+response.raise_for_status()
```
I would let the exception propagate without catching it. Sentry is configured to handle it.
```python
try:
    response = requests.post(
        self.search_url,
        json=data,
        headers={"Authorization": f"Bearer {token}"},
        timeout=10,
    )
    response.raise_for_status()
    return response.json()
except requests.exceptions.HTTPError as e:
    logger.error("HTTPError: %s", e)
    raise
```
Same here, no try/catch needed IMO
Set a valid Fernet key for OIDC_STORE_REFRESH_TOKEN_KEY in env.d/development/common. Add .gitguardian.yaml configuration to ignore this key. Signed-off-by: Fabre Florian <ffabre@hybird.org>
Force-pushed from d4d279c to d528b58
Rename FindDocumentIndexer as SearchIndexer. Rename FindDocumentSerializer as SearchDocumentSerializer. Rename package core.tasks.find as core.tasks.search. Remove logs on HTTP errors in SearchIndexer. Factorise some code in the search API view. Signed-off-by: Fabre Florian <ffabre@hybird.org>
Force-pushed from 7f8c9f3 to 51cdf23
Purpose
We want to add fulltext (and semantic in a second phase) search to Docs.
The goal is to enable efficient and scalable search across document content by pushing relevant data to a dedicated search backend, such as OpenSearch. The backend should be pluggable.
Proposal
Fixes #322