Add section linking for the search result #5829
Conversation
Yea, I think we likely need to have an approach where sections take the place of the existing content -- we don't want to index the same data twice. I believe ES has the ability to do this reasonably well.
That means we will be modifying/upgrading to remove the
Which feature of ES are you talking about?
I would like to continue work on this PR only, if that's okay?
Perhaps. We still need to know what the headers are, but could group them with the section. I have reached out to my friend at Elastic to ask his opinion on this. I will update here with the approach he suggests.
I went ahead and tried to index each section as a separate document in ES. I think headers are mostly titles of the page ... right?
@dojutsu-user Great. That is how we used to do it, and it worked ok. So that might be a path forward 👍
@dojutsu-user if you could push up the code here or somewhere else, I can try and take a look.
@ericholscher |
@ericholscher Edit: It might not be very clean and robust because I was just testing things out.
Makes sense. I think this is a good way to test at least. If we can provide good results with this, we can definitely move to this approach in the short term. I want to think a bit more about how to combine this, along with the SphinxDomain objects that I want to return in search as well.
@ericholscher
Where do we want to show these results?
Assuming that we want to show Sphinx domain results at both of these places, we need to make an API endpoint which returns results from both indexes -- something like
One other thing that we need to discuss is how we will be showing this to the users?
In both cases, I hadn't given thought to showing results from Sphinx Domain in the full-page search UI, so the extension is not prepared for that now. But it will be. cc: @davidfischer
Yea, I'd love to return SphinxDomain results to users in the same API results. I believe the approach we've discussed with Nested queries can work for both.
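For reference, here is a minimal sketch of what a combined nested query could look like with elasticsearch-dsl. The field names, index name, and overall shape are assumptions for illustration, not the final implementation.

```python
# Sketch only: match the top-level page title and, via a nested query, the
# indexed sections. inner_hits asks ES to return the matching sections so
# the frontend can link straight to them.
from elasticsearch_dsl import Q, Search


def build_search(query_text):
    section_query = Q(
        'nested',
        path='sections',
        inner_hits={},
        query=Q(
            'multi_match',
            query=query_text,
            fields=['sections.title', 'sections.content'],
        ),
    )
    title_query = Q('multi_match', query=query_text, fields=['title'])
    # 'page' is a placeholder index name.
    return Search(index='page').query(
        Q('bool', should=[title_query, section_query])
    )
```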
@@ -446,9 +446,6 @@ def USE_PROMOS(self):  # noqa
'settings': {
    'number_of_shards': 2,
    'number_of_replicas': 0,
    "index": {
        "sort.field": ["project", "version"]
This was giving an error -- https://discuss.elastic.co/t/unable-to-create-index-sorting/166019
Did this work in testing?
@ericholscher
Turns out that I was wrong earlier and it is working nicely. |
This looks great. I think we'll need to index some more data in order to generate the links. We need to know the
I didn't follow.
Lots of good changes here, this is getting close, but has a few more nits.
'<div>' +
'<%= section_content[i] %>' +
'</div>' +
'<% } %>';
These strings are really cumbersome, is there not a good way to do multi-line strings in JS? :(
There are, but eslint is not allowing any: https://travis-ci.org/readthedocs/readthedocs.org/jobs/556276376#L593
@davidfischer thoughts here? Is it too fancy and new?
}

// preparing domain_content
domain_content.append(domain._source.type_display + " -- ");
// domain_content = type_display --
domain_content = domain._source.type_display + " -- ";
This feels weird. Should this be a template also?
I have reduced the `if`s here. It is less complicated and more readable now.
@@ -1282,13 +1282,8 @@ def fileify(version_pk, commit, build):
    except Exception:
        log.exception('Failed during ImportedFile creation')

    try:
        _update_intersphinx_data(version, path, commit, build)
I think we should keep this as 3 functions called from here, instead of one. That will make it more robust against exceptions. So we should have:
- `_create_imported_files`
- `_create_intersphinx_data`
- `_sync_imported_files`
And refactor the code to match.
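A rough sketch of that shape, assuming the three function names above and roughly the existing arguments (the exact signatures are guesses):

```python
# Sketch, not the final code: each step gets its own try/except so a failure
# in one does not stop the others from running.
import logging

log = logging.getLogger(__name__)


def fileify(version_pk, commit, build):
    version = _get_version(version_pk)  # hypothetical helper for illustration

    try:
        _create_imported_files(version, commit, build)
    except Exception:
        log.exception('Failed during ImportedFile creation')

    try:
        _create_intersphinx_data(version, commit, build)
    except Exception:
        log.exception('Failed during SphinxDomain objects creation')

    try:
        _sync_imported_files(version, build)
    except Exception:
        log.exception('Failed during ImportedFile syncing')
```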
readthedocs/projects/tasks.py
Outdated
        _create_intersphinx_data(version, path, commit, build)
    except Exception:
        log.exception('Failed during SphinxDomain objects creation')

    # Index new HTMLFiles to elasticsearch
This is where the new `_sync_imported_files` function should start.
readthedocs/search/views.py
Outdated
# if the control comes in this block,
# that implies that there was PageSearch
# that implies that there was a PageSearch
pass
We should log this with `log.exception` instead of `pass`, or fix this logic. We shouldn't hit this in a normal case.
I have changed the logic. Turns out that a simple `if` was sufficient. 😃 I have still wrapped the code in a `try ... except` block to log any exceptions (if they occur) to help with debugging later.
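As a loose sketch of that pattern (the names here are placeholders, not the actual view code):

```python
import logging

log = logging.getLogger(__name__)


def attach_section_links(results):
    # Only do the extra work when there actually are results.
    if results:
        try:
            for hit in results:
                pass  # build the section link for each hit here
        except Exception:
            # Log instead of silently passing, to help debugging later.
            log.exception('Error while building links for search results')
    return results
```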
{% endfor %}
{% endwith %}
{% else %}
{{ inner_hit.source.content|slice:"100" }} ...
This `100` should be a constant also, instead of a random string.
Latest changes look good 👍
@@ -25,6 +25,8 @@

{% block content %}

{% trans "100" as MAX_SUBSTRING_LIMIT %}
I don't think we need `trans` here; I think we can use `with`: https://docs.djangoproject.com/en/2.2/ref/templates/builtins/#with
domains = inner_hits.domains or []
all_results = itertools.chain(sections, domains)

sorted_results = [
@stsewd Here, if I use a generator expression, `test_search_works_with_title_query` and `test_search_works_with_sections_query` fail. I can't find the reason, though. For now, I have changed it to a list comprehension.
So, I wasn't able to run the tests because there is an import error. My guess is that when the generator gets evaluated, the `inner_hits` object has changed. You can confirm this if you make a copy of `inner_hits.sections` and `inner_hits.domains` before assigning them. Also, I'd just leave the list comprehension, since we don't know when the generator gets evaluated by Django REST Framework.
I will add comments there to avoid any confusion in the future.
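For context, a sketch of what the suggestion amounts to; the sort key is an assumption:

```python
# Copy the inner_hits lists up front and return a plain list rather than a
# generator, so the result does not depend on when Django REST Framework
# evaluates it.
import itertools


def sort_inner_hits(inner_hits):
    sections = list(inner_hits.sections or [])
    domains = list(inner_hits.domains or [])
    all_results = itertools.chain(sections, domains)
    return sorted(
        all_results,
        key=lambda hit: hit.meta.score,  # assumed: sort by relevance score
        reverse=True,
    )
```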
fields = ['title^10', 'headers^5', 'content']

_outer_fields = ['title^4']
_section_fields = ['sections.title^3', 'sections.content']
Added the boosters.
They are working fine.
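As a quick illustration of what those boost markers do (query text is a placeholder): `title^4` weights a title match more heavily than an unboosted field, and the section boosts apply to the query that gets wrapped inside the nested query.

```python
from elasticsearch_dsl import Q

_outer_fields = ['title^4']
_section_fields = ['sections.title^3', 'sections.content']


def outer_query(query_text):
    # Boosted match against the page-level fields.
    return Q('multi_match', query=query_text, fields=_outer_fields)


def section_query(query_text):
    # Boosted match against the section fields; meant to be wrapped in a
    # 'nested' query with path='sections'.
    return Q('multi_match', query=query_text, fields=_section_fields)
```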
readthedocs/search/tests/test_api.py
Outdated
assert res['project'] == 'docs'

# def test_doc_search_filter_by_version(self, api_client, project):
#     """Test Doc search result are filtered according to version"""
Commented out the other tests.
I will update them.
Looks like a good direction. I haven't given it a full review quite yet, since it looks like there was a good amount of refactoring?
@@ -23,3 +23,6 @@ def test_h2_parsing(self):
        'You can use Slumber'
    ))
    self.assertEqual(data['title'], 'Read the Docs Public API')

    for section in data['sections']:
        self.assertFalse('\n' in section['content'])
This could probably use a comment. Likely it should also test for a length before doing this, otherwise this check could be running on 0 sections.
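One way that could look, as a sketch extending the quoted test (not the final code):

```python
# Guard against the loop silently passing when no sections were parsed, and
# explain why newlines should never appear in a section's content.
self.assertTrue(len(data['sections']) > 0)

# Section content is flattened to a single line during parsing, so it
# should not contain newlines.
for section in data['sections']:
    self.assertNotIn('\n', section['content'])
```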
readthedocs/search/tests/test_api.py
Outdated
# assert data[0]['project'] == subproject.slug
# # Check the link is the subproject document link
# document_link = subproject.get_docs_url(version_slug=version.slug)
# assert document_link in data[0]['link']
Why are these all commented out?
I haven't worked on them yet.
readthedocs/search/tests/utils.py
Outdated
elif data_type.startswith('sections'):

    # generates query from section title
    if data_type.endswith('title'):
Why are we using `endswith` and `startswith` here, instead of just comparing the strings directly?
Sounds more pythonic.
I will update the PR.
</a>
</li>
{% endfor %}
{% with "100" as MAX_SUBSTRING_LIMIT %}
Why did this file change so much?
Every line is indented one more level.
Also I have corrected the indentation of the whole file.
👍 Merged into the feature branch as a base
This is a WIP.
Related PR in readthedocs-sphinx-search -- readthedocs/readthedocs-sphinx-search#19