
Add section linking for the search result #5829

Merged
merged 70 commits into readthedocs:gsoc-19-indoc-search on Jul 12, 2019
Conversation

dojutsu-user
Member

@dojutsu-user dojutsu-user commented Jun 19, 2019

This is a WIP.

Related PR in readthedocs-sphinx-search -- readthedocs/readthedocs-sphinx-search#19

@dojutsu-user dojutsu-user added the PR: work in progress Pull request is not ready for full review label Jun 19, 2019
@ericholscher
Member

Yea, I think we likely need to have an approach where sections take the place of the existing content -- we don't want to index the same data twice. I believe ES has the ability to do this reasonably well.

@dojutsu-user
Member Author

dojutsu-user commented Jun 19, 2019

@ericholscher

Yea, I think we likely need to have an approach where sections take the place of the existing content -- we don't want to index the same data twice

That means we will be modifying/upgrading to remove the headers field completely and indexing each section as a separate document... right?

I believe ES has the ability to do this reasonably well.

Which feature of ES are you talking about?

@dojutsu-user
Member Author

I would like to continue working on this PR, if that's okay?

@ericholscher
Member

That means we will be modifying/upgrading to remove the headers field completely and indexing each section as a separate document... right?

Perhaps. We still need to know what the headers are, but could group them with the section. I have reached out to my friend at Elastic to ask his opinion on this. I will update here with the approach he suggests.

@dojutsu-user
Member Author

@ericholscher

We still need to know what the headers are

I went ahead and tried to index each section as a separate document in ES. I think headers are mostly titles of the page... right?
So I had documents with -- project, version, title, section_title, section_id, section_content and other required fields. It works well with the full page search UI. I haven't tested it with the search results page, but I think it will work.
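As a minimal sketch of the shape being described (the helper name `build_section_doc` and the sample values are illustrative, not Read the Docs code):

```python
# Hypothetical sketch: each section indexed as its own ES document,
# using the field names listed in the comment above.
def build_section_doc(project, version, title, section_title,
                      section_id, section_content):
    """Build the body for one section-level search document."""
    return {
        'project': project,
        'version': version,
        'title': title,                  # page title (the "header")
        'section_title': section_title,
        'section_id': section_id,        # HTML anchor id of the section
        'section_content': section_content,
    }

doc = build_section_doc('docs', 'latest', 'Read the Docs Public API',
                        'Sponsors', 'sponsors', 'You can use Slumber')
```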

@ericholscher
Member

@dojutsu-user Great. That is how we used to do it, and it worked ok. So that might be a path forward 👍

@ericholscher
Member

@dojutsu-user if you could push up the code here or somewhere else, I can try and take a look.

@dojutsu-user
Member Author

@ericholscher
Pushing the code... just 1 min.

@dojutsu-user
Member Author

dojutsu-user commented Jun 21, 2019

@ericholscher
Pushed the code.
Many tests are going to fail for this.

Edit: It might not be very clean and robust because I was just testing things out.

@ericholscher
Member

Makes sense. I think this is a good way to test at least. If we can provide good results with this, we can definitely move to this approach in the short term. I want to think a bit more about how to combine this, along with the SphinxDomain objects that I want to return in search as well.

@dojutsu-user
Member Author

dojutsu-user commented Jun 21, 2019

@ericholscher
I think results from the Sphinx domain are not shown to the user right now, are they?

SphinxDomain objects that I want to return in search as well.

Where do we want to show these results?

  • Full page search ui
  • In the search results page
  • Both

Assuming that we want to show Sphinx domain results in both of these places -- we need to make an API endpoint which returns results from both indexes -- something like AllSearch but with only two indexes.

One other thing that we need to discuss is how we will be showing this to the users:

  • Either we give the user the choice to select one of the result types (facets).
    • In this case, for the full page search UI, we can have a dropdown or something similar next to the input field and it will look good. But I can't think of how we would give this choice to the user on the search results page.
  • Or we show results from both indexes.
    • We don't have to think much about the UI/UX in this case -- as we will be showing results from both indexes, everything will remain the same, with the inclusion of extra results.

In both cases -- I hadn't considered showing results from the Sphinx domain in the full page search UI, so the extension is not prepared for that yet. But it will be.

cc: @davidfischer

@ericholscher
Member

Yea, I'd love to return SphinxDomain results to users in the same API results. I believe the approach we've discussed with Nested queries can work for both sections and sphinx_domains on the Page document. We should test this and see how it works.
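A rough sketch of what such a query body might look like as raw Elasticsearch JSON (the nested paths `sections` and `domains` and their subfields are assumptions taken from this discussion, not the final mapping):

```python
def nested_page_query(term):
    """Combine nested queries over hypothetical 'sections' and
    'domains' objects on the page document, returning matching
    inner objects via inner_hits."""
    def nested(path, fields):
        return {
            'nested': {
                'path': path,
                'inner_hits': {},  # ask ES for the matching inner objects
                'query': {
                    'multi_match': {'query': term, 'fields': fields},
                },
            }
        }
    return {
        'query': {
            'bool': {
                'should': [
                    nested('sections', ['sections.title', 'sections.content']),
                    nested('domains', ['domains.name', 'domains.docstrings']),
                ],
            }
        }
    }
```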

@@ -446,9 +446,6 @@ def USE_PROMOS(self): # noqa
'settings': {
'number_of_shards': 2,
'number_of_replicas': 0,
"index": {
"sort.field": ["project", "version"]

@ericholscher
Member

Did this work in testing?

@dojutsu-user
Member Author

dojutsu-user commented Jun 25, 2019

@ericholscher
No, not in the way we want.
We need to discuss it a little more.

@dojutsu-user
Member Author

dojutsu-user commented Jun 25, 2019

Turns out that I was wrong earlier and it is working nicely.
Here is the sample result for query sponsors (ignore the value of link)
https://pastebin.com/ZuZq4cfp
What are your thoughts on this approach?
@ericholscher @davidfischer

@ericholscher
Member

This looks great. I think we'll need to index some more data in order to generate the links. We need to know the id attribute of the H2 section so we can properly generate a link to it.

@dojutsu-user
Member Author

dojutsu-user commented Jun 25, 2019

I didn't follow.
I mentioned to ignore the value of link because I have rtd-test as a subproject of template, so the link is not simple.
For generating links -- we can have the full_path (#5821 -- after that issue is closed) and then we just add #section-id to it and we have the link.
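The link construction described here is essentially path plus anchor; a trivial sketch (helper name is illustrative):

```python
def section_link(full_path, section_id):
    """Append a section's HTML anchor id to the page's full path."""
    return f'{full_path}#{section_id}'
```

For example, `section_link('/en/latest/api.html', 'sponsors')` gives `/en/latest/api.html#sponsors`.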

Member

@ericholscher ericholscher left a comment

Lots of good changes here, this is getting close, but has a few more nits.

'<div>' +
'<%= section_content[i] %>' +
'</div>' +
'<% } %>';
Member

These strings are really cumbersome, is there not a good way to do multi-line strings in JS? :(

Member

@davidfischer thoughts here? Is it too fancy and new?

}

// preparing domain_content
domain_content.append(domain._source.type_display + " -- ");
// domain_content = type_display --
domain_content = domain._source.type_display + " -- ";
Member

This feels weird. Should this be a template also?

Member Author

I have reduced the ifs here.
It is less complicated and more readable now.

@@ -1282,13 +1282,8 @@ def fileify(version_pk, commit, build):
except Exception:
log.exception('Failed during ImportedFile creation')

try:
_update_intersphinx_data(version, path, commit, build)
Member

I think we should keep this as 3 functions called from here, instead of one. That will make it more robust to issues of exceptions. So we should have:

  • _create_imported_files
  • _create_intersphinx_data
  • _sync_imported_files

And refactor the code to match
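The suggested structure can be sketched as follows. The stubs are placeholders standing in for the real task functions, and `fileify_sketch` and its arguments are illustrative, not the actual readthedocs code:

```python
import logging

log = logging.getLogger(__name__)
calls = []  # records execution order so the sketch can be checked

# Placeholder stubs for the three suggested functions.
def _create_imported_files(version, commit, build):
    calls.append('files')

def _create_intersphinx_data(version, path, commit, build):
    calls.append('intersphinx')

def _sync_imported_files(version, build):
    calls.append('sync')

def fileify_sketch(version, path, commit, build):
    """Each step is isolated in its own try/except, so a failure
    in one does not prevent the others from running."""
    try:
        _create_imported_files(version, commit, build)
    except Exception:
        log.exception('Failed during ImportedFile creation')
    try:
        _create_intersphinx_data(version, path, commit, build)
    except Exception:
        log.exception('Failed during SphinxDomain objects creation')
    try:
        _sync_imported_files(version, build)
    except Exception:
        log.exception('Failed during ImportedFile syncing')

fileify_sketch('latest', '/tmp/objects.inv', 'abc123', 42)
```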

_create_intersphinx_data(version, path, commit, build)
except Exception:
log.exception('Failed during SphinxDomain objects creation')

# Index new HTMLFiles to elasticsearch
Member

This is where the new _sync_imported_files function should start.

# if the control comes in this block,
# that implies that there was PageSearch
# that implies that there was a PageSearch
pass
Member

We should log this with log.exception instead of pass, or fix this logic. We shouldn't hit this in a normal case.

Member Author

I have changed the logic.
Turns out that a simple if was sufficient. 😃
I have still wrapped the code in a try... except block to log any exceptions (if any occur) to help with debugging later.

{% endfor %}
{% endwith %}
{% else %}
{{ inner_hit.source.content|slice:"100" }} ...
Member

This 100 should be a constant also, instead of a random string.

Member

@ericholscher ericholscher left a comment

Latest changes look good 👍

@@ -25,6 +25,8 @@

{% block content %}

{% trans "100" as MAX_SUBSTRING_LIMIT %}
Member

I don't think we need trans here, I think we can use with: https://docs.djangoproject.com/en/2.2/ref/templates/builtins/#with
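A sketch of the suggested form, mirroring the fragment from the diff (variable name and filter taken from the diff under review):

```django
{# {% with %} scopes the constant without marking it for translation #}
{% with 100 as MAX_SUBSTRING_LIMIT %}
    {{ inner_hit.source.content|slice:MAX_SUBSTRING_LIMIT }} ...
{% endwith %}
```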

domains = inner_hits.domains or []
all_results = itertools.chain(sections, domains)

sorted_results = [
Member Author

@dojutsu-user dojutsu-user Jul 11, 2019

@stsewd
Here, if I use a generator expression, test_search_works_with_title_query and test_search_works_with_sections_query will fail.
I can't find the reason, though. For now, I have changed them to list comprehensions.

Member

So, I wasn't able to run the test because there is an import error. My guess is that when the generator gets evaluated, the object inner_hits has changed. You can confirm this if you make a copy of inner_hits.sections and inner_hits.domains before assigning them.

Also, I'd just leave the list comprehension, since we don't know when the generator gets evaluated by Django REST Framework.

Member Author

I will add comments there to avoid any confusion in the future.
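The pitfall being discussed can be shown in a few lines: a generator expression is consumed lazily, so if the underlying data is mutated before evaluation, the results change (the sample list here is illustrative):

```python
# Lazy: nothing is consumed until list() runs, and by then the
# underlying list (think inner_hits.sections) has been emptied.
sections = ['intro', 'api', 'sponsors']
lazy = (s.upper() for s in sections)
sections.clear()                      # simulate inner_hits changing
empty_result = list(lazy)             # the data was gone at evaluation time

# Eager: a list comprehension snapshots the data immediately.
sections = ['intro', 'api', 'sponsors']
eager = [s.upper() for s in sections]
sections.clear()
```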

fields = ['title^10', 'headers^5', 'content']

_outer_fields = ['title^4']
_section_fields = ['sections.title^3', 'sections.content']
Member Author

Added the boosters.
They are working fine.
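For illustration, the `^` boost syntax from the diff plugged into a multi_match body (in the real query the `sections.*` fields would sit inside a nested clause, omitted here for brevity):

```python
# Per-field boosting: title matches weigh 4x, section titles 3x,
# section content has the default weight of 1.
_outer_fields = ['title^4']
_section_fields = ['sections.title^3', 'sections.content']

query_body = {
    'multi_match': {
        'query': 'sponsors',
        'fields': _outer_fields + _section_fields,
    }
}
```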

assert res['project'] == 'docs'

# def test_doc_search_filter_by_version(self, api_client, project):
# """Test Doc search result are filtered according to version"""
Member Author

Commented out the other tests.
I will update them.

@ericholscher ericholscher changed the base branch from master to gsoc-19-indoc-search July 12, 2019 14:52
Member

@ericholscher ericholscher left a comment

Looks like a good direction. I haven't given it a full review quite yet, since it looks like there was a good amount of refactoring?

@@ -23,3 +23,6 @@ def test_h2_parsing(self):
'You can use Slumber'
))
self.assertEqual(data['title'], 'Read the Docs Public API')

for section in data['sections']:
self.assertFalse('\n' in section['content'])
Member

This could probably use a comment. Likely it should also test for a length before doing this, otherwise this check could be running on 0 sections.
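The guarded version might look like this (helper name and sample data are illustrative):

```python
def check_sections_have_no_newlines(data):
    """Fail loudly if the parser returned no sections, so the
    newline check below cannot pass vacuously on an empty list."""
    assert len(data['sections']) > 0, 'parser returned no sections'
    for section in data['sections']:
        # section content should be joined into a single line
        assert '\n' not in section['content']

check_sections_have_no_newlines(
    {'sections': [{'title': 'API', 'content': 'You can use Slumber'}]}
)
```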

# assert data[0]['project'] == subproject.slug
# # Check the link is the subproject document link
# document_link = subproject.get_docs_url(version_slug=version.slug)
# assert document_link in data[0]['link']
Member

Why are these all commented out?

Member Author

I haven't worked on them yet.

elif data_type.startswith('sections'):

# generates query from section title
if data_type.endswith('title'):
Member

why are we using endswith and startswith here, instead of just string checking?

Member Author

Sounds more pythonic.
I will update the PR.
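The prefix/suffix checks under discussion, with a field path like the one from the diff:

```python
# A dotted field path coming from the inner hits, e.g. 'sections.title'.
data_type = 'sections.title'

is_section = data_type.startswith('sections')  # it is a section field
targets_title = data_type.endswith('title')    # the query targets the title
```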

</a>
</li>
{% endfor %}
{% with "100" as MAX_SUBSTRING_LIMIT %}
Member

Why did this file change so much?

Member Author

Every line is indented one more level.

Member Author

Also I have corrected the indentation of the whole file.

@ericholscher ericholscher merged commit d526249 into readthedocs:gsoc-19-indoc-search Jul 12, 2019
@ericholscher
Member

👍 Merged into the feature branch as a base

@dojutsu-user dojutsu-user deleted the search-section-linking branch July 12, 2019 15:48