
Add section linking for the search result #5829

Merged
merged 70 commits into readthedocs:gsoc-19-indoc-search on Jul 12, 2019
Conversation

dojutsu-user
Member

@dojutsu-user dojutsu-user commented Jun 19, 2019

This is a WIP.

Related PR in readthedocs-sphinx-search -- readthedocs/readthedocs-sphinx-search#19

@dojutsu-user dojutsu-user added the PR: work in progress Pull request is not ready for full review label Jun 19, 2019
@ericholscher
Member

Yea, I think we likely need to have an approach where sections take the place of the existing content -- we don't want to index the same data twice. I believe ES has the ability to do this reasonably well.

@dojutsu-user
Member Author

dojutsu-user commented Jun 19, 2019

@ericholscher

Yea, I think we likely need to have an approach where sections take the place of the existing content -- we don't want to index the same data twice

That means we will be modifying/upgrading to remove the headers field completely and indexing each section as a separate document... right?

I believe ES has the ability to do this reasonably well.

Which feature of ES are you talking about?

@dojutsu-user
Member Author

I would like to continue working on this PR, if that's okay?

@ericholscher
Member

That means we will be modifying/upgrading to remove the headers field completely and indexing each section as a separate document... right?

Perhaps. We still need to know what the headers are, but could group them with the section. I have reached out to my friend at Elastic to ask his opinion on this. I will update here with the approach he suggests.

@dojutsu-user
Member Author

@ericholscher

We still need to know what the headers are

I went ahead and tried to index each section as a separate document in ES. I think headers are mostly titles of the page... right?
So I had documents with -- project, version, title, section_title, section_id, section_content and other required fields. It works well with the full page search UI. I haven't tested it with the search results page, but I think it will work.
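As a minimal sketch of the shape being described (the helper name `build_section_doc` and the sample values are illustrative, not Read the Docs code):

```python
# Hypothetical sketch: each section indexed as its own ES document,
# using the field names listed in the comment above.
def build_section_doc(project, version, title, section_title,
                      section_id, section_content):
    """Build the body for one section-level search document."""
    return {
        'project': project,
        'version': version,
        'title': title,                  # page title (the "header")
        'section_title': section_title,
        'section_id': section_id,        # HTML anchor id of the section
        'section_content': section_content,
    }

doc = build_section_doc('docs', 'latest', 'Read the Docs Public API',
                        'Sponsors', 'sponsors', 'You can use Slumber')
```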

@ericholscher
Member

@dojutsu-user Great. That is how we used to do it, and it worked ok. So that might be a path forward 👍

@ericholscher
Member

@dojutsu-user if you could push up the code here or somewhere else, I can try and take a look.

@dojutsu-user
Member Author

@ericholscher
Pushing the code... just 1 min.

@dojutsu-user
Member Author

dojutsu-user commented Jun 21, 2019

@ericholscher
Pushed the code.
Many tests are going to fail for this.

Edit: It might not be very clean and robust because I was just testing things out.

@ericholscher
Member

Makes sense. I think this is a good way to test at least. If we can provide good results with this, we can definitely move to this approach in the short term. I want to think a bit more about how to combine this, along with the SphinxDomain objects that I want to return in search as well.

@dojutsu-user
Member Author

dojutsu-user commented Jun 21, 2019

@ericholscher
I think results from the Sphinx domain are not shown to the user right now, are they?

SphinxDomain objects that I want to return in search as well.

Where do we want to show these results?

  • Full page search ui
  • In the search results page
  • Both

Assuming that we want to show Sphinx domain results in both of these places -- we need to make an API endpoint which returns results from both indexes -- something like AllSearch but with only two indexes.

One other thing that we need to discuss is how we will be showing this to the users:

  • Either we give the user the choice to select one of the result types (facets).
    • In this case, for the full page search UI, we can have a dropdown or something similar next to the input field and it will look good. But I can't think of how we would give this choice to the user on the search results page.
  • Or we show results from both indexes.
    • We don't have to think much about the UI/UX in this case -- as we will be showing results from both indexes, everything will remain the same, with the inclusion of extra results.

In both cases -- I hadn't considered showing results from the Sphinx domain in the full page search UI, so the extension is not prepared for that yet. But it will be.

cc: @davidfischer

@ericholscher
Member

Yea, I'd love to return SphinxDomain results to users in the same API results. I believe the approach we've discussed with Nested queries can work for both sections and sphinx_domains on the Page document. We should test this and see how it works.
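A rough sketch of what such a query body might look like as raw Elasticsearch JSON (the nested paths `sections` and `domains` and their subfields are assumptions taken from this discussion, not the final mapping):

```python
def nested_page_query(term):
    """Combine nested queries over hypothetical 'sections' and
    'domains' objects on the page document, returning matching
    inner objects via inner_hits."""
    def nested(path, fields):
        return {
            'nested': {
                'path': path,
                'inner_hits': {},  # ask ES for the matching inner objects
                'query': {
                    'multi_match': {'query': term, 'fields': fields},
                },
            }
        }
    return {
        'query': {
            'bool': {
                'should': [
                    nested('sections', ['sections.title', 'sections.content']),
                    nested('domains', ['domains.name', 'domains.docstrings']),
                ],
            }
        }
    }
```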

@@ -446,9 +446,6 @@ def USE_PROMOS(self): # noqa
'settings': {
'number_of_shards': 2,
'number_of_replicas': 0,
"index": {
"sort.field": ["project", "version"]

@ericholscher
Member

Did this work in testing?

@dojutsu-user
Member Author

dojutsu-user commented Jun 25, 2019

@ericholscher
No, not in the way we want.
We need to discuss it a little more.

@dojutsu-user
Member Author

dojutsu-user commented Jun 25, 2019

Turns out that I was wrong earlier and it is working nicely.
Here is the sample result for query sponsors (ignore the value of link)
https://pastebin.com/ZuZq4cfp
What are your thoughts on this approach?
@ericholscher @davidfischer

@ericholscher
Member

This looks great. I think we'll need to index some more data in order to generate the links. We need to know the id attribute of the H2 section so we can properly generate a link to it.

@dojutsu-user
Member Author

dojutsu-user commented Jun 25, 2019

I didn't follow.
I mentioned to ignore the value of link because I have rtd-test as a subproject of template, so the link is not simple.
For generating links -- we can have the full_path (#5821 -- after that issue is closed) and then we just add #section-id to it and we have the link.
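The link construction described here is essentially path plus anchor; a trivial sketch (helper name is illustrative):

```python
def section_link(full_path, section_id):
    """Append a section's HTML anchor id to the page's full path."""
    return f'{full_path}#{section_id}'
```

For example, `section_link('/en/latest/api.html', 'sponsors')` gives `/en/latest/api.html#sponsors`.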

Member

@ericholscher ericholscher left a comment

Lots of good changes here, this is getting close, but has a few more nits.

'<div>' +
'<%= section_content[i] %>' +
'</div>' +
'<% } %>';
Member

These strings are really cumbersome, is there not a good way to do multi-line strings in JS? :(

Member

@davidfischer thoughts here? Is it too fancy and new?

}

// preparing domain_content
domain_content.append(domain._source.type_display + " -- ");
// domain_content = type_display --
domain_content = domain._source.type_display + " -- ";
Member

This feels weird. Should this be a template also?

Member Author

I have reduced the ifs here.
It is less complicated and more readable now.

@@ -1282,13 +1282,8 @@ def fileify(version_pk, commit, build):
except Exception:
log.exception('Failed during ImportedFile creation')

try:
_update_intersphinx_data(version, path, commit, build)
Member

I think we should keep this as 3 functions called from here, instead of one. That will make it more robust to issues of exceptions. So we should have:

  • _create_imported_files
  • _create_intersphinx_data
  • _sync_imported_files

And refactor the code to match
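The suggested structure can be sketched as follows. The stubs are placeholders standing in for the real task functions, and `fileify_sketch` and its arguments are illustrative, not the actual readthedocs code:

```python
import logging

log = logging.getLogger(__name__)
calls = []  # records execution order so the sketch can be checked

# Placeholder stubs for the three suggested functions.
def _create_imported_files(version, commit, build):
    calls.append('files')

def _create_intersphinx_data(version, path, commit, build):
    calls.append('intersphinx')

def _sync_imported_files(version, build):
    calls.append('sync')

def fileify_sketch(version, path, commit, build):
    """Each step is isolated in its own try/except, so a failure
    in one does not prevent the others from running."""
    try:
        _create_imported_files(version, commit, build)
    except Exception:
        log.exception('Failed during ImportedFile creation')
    try:
        _create_intersphinx_data(version, path, commit, build)
    except Exception:
        log.exception('Failed during SphinxDomain objects creation')
    try:
        _sync_imported_files(version, build)
    except Exception:
        log.exception('Failed during ImportedFile syncing')

fileify_sketch('latest', '/tmp/objects.inv', 'abc123', 42)
```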

_create_intersphinx_data(version, path, commit, build)
except Exception:
log.exception('Failed during SphinxDomain objects creation')

# Index new HTMLFiles to elasticsearch
Member

This is where the new _sync_imported_files function should start.

# if the control comes in this block,
# that implies that there was PageSearch
# that implies that there was a PageSearch
pass
Member

We should log this with log.exception instead of pass, or fix this logic. We shouldn't hit this in a normal case.

Member Author

I have changed the logic.
Turns out that a simple if was sufficient. 😃
I have still wrapped the code in a try... except block to log any exceptions (if any occur) to help with debugging later.

{% endfor %}
{% endwith %}
{% else %}
{{ inner_hit.source.content|slice:"100" }} ...
Member

This 100 should be a constant also, instead of a random string.

Member

@ericholscher ericholscher left a comment

Latest changes look good 👍

@@ -25,6 +25,8 @@

{% block content %}

{% trans "100" as MAX_SUBSTRING_LIMIT %}
Member

I don't think we need trans here, I think we can use with: https://docs.djangoproject.com/en/2.2/ref/templates/builtins/#with
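A sketch of the suggested form, mirroring the fragment from the diff (variable name and filter taken from the diff under review):

```django
{# {% with %} scopes the constant without marking it for translation #}
{% with 100 as MAX_SUBSTRING_LIMIT %}
    {{ inner_hit.source.content|slice:MAX_SUBSTRING_LIMIT }} ...
{% endwith %}
```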

domains = inner_hits.domains or []
all_results = itertools.chain(sections, domains)

sorted_results = [
Member Author

@dojutsu-user dojutsu-user Jul 11, 2019

@stsewd
Here, if I use a generator expression, test_search_works_with_title_query and test_search_works_with_sections_query will fail.
I can't find the reason, though. For now, I have changed them to list comprehensions.

Member

So, I wasn't able to run the test because there is an import error. My guess is that when the generator gets evaluated, the object inner_hits has changed. You can confirm this if you make a copy of inner_hits.sections and inner_hits.domains before assigning them.

Also, I'd just leave the list comprehension, since we don't know when the generator gets evaluated by Django REST Framework.

Member Author

I will add comments there to avoid any confusion in the future.
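The pitfall being discussed can be shown in a few lines: a generator expression is consumed lazily, so if the underlying data is mutated before evaluation, the results change (the sample list here is illustrative):

```python
# Lazy: nothing is consumed until list() runs, and by then the
# underlying list (think inner_hits.sections) has been emptied.
sections = ['intro', 'api', 'sponsors']
lazy = (s.upper() for s in sections)
sections.clear()                      # simulate inner_hits changing
empty_result = list(lazy)             # the data was gone at evaluation time

# Eager: a list comprehension snapshots the data immediately.
sections = ['intro', 'api', 'sponsors']
eager = [s.upper() for s in sections]
sections.clear()
```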

fields = ['title^10', 'headers^5', 'content']

_outer_fields = ['title^4']
_section_fields = ['sections.title^3', 'sections.content']
Member Author

Added the boosters.
They are working fine.
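For illustration, the `^` boost syntax from the diff plugged into a multi_match body (in the real query the `sections.*` fields would sit inside a nested clause, omitted here for brevity):

```python
# Per-field boosting: title matches weigh 4x, section titles 3x,
# section content has the default weight of 1.
_outer_fields = ['title^4']
_section_fields = ['sections.title^3', 'sections.content']

query_body = {
    'multi_match': {
        'query': 'sponsors',
        'fields': _outer_fields + _section_fields,
    }
}
```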

assert res['project'] == 'docs'

# def test_doc_search_filter_by_version(self, api_client, project):
# """Test Doc search result are filtered according to version"""
Member Author

Commented out the other tests.
I will update them.

@ericholscher ericholscher changed the base branch from master to gsoc-19-indoc-search July 12, 2019 14:52
Member

@ericholscher ericholscher left a comment

Looks like a good direction. I haven't given it a full review quite yet, since it looks like there was a good amount of refactoring?

@@ -23,3 +23,6 @@ def test_h2_parsing(self):
'You can use Slumber'
))
self.assertEqual(data['title'], 'Read the Docs Public API')

for section in data['sections']:
self.assertFalse('\n' in section['content'])
Member

This could probably use a comment. Likely it should also test for a length before doing this, otherwise this check could be running on 0 sections.
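The guarded version might look like this (helper name and sample data are illustrative):

```python
def check_sections_have_no_newlines(data):
    """Fail loudly if the parser returned no sections, so the
    newline check below cannot pass vacuously on an empty list."""
    assert len(data['sections']) > 0, 'parser returned no sections'
    for section in data['sections']:
        # section content should be joined into a single line
        assert '\n' not in section['content']

check_sections_have_no_newlines(
    {'sections': [{'title': 'API', 'content': 'You can use Slumber'}]}
)
```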

# assert data[0]['project'] == subproject.slug
# # Check the link is the subproject document link
# document_link = subproject.get_docs_url(version_slug=version.slug)
# assert document_link in data[0]['link']
Member

Why are these all commented out?

Member Author

I haven't worked on them yet.

elif data_type.startswith('sections'):

# generates query from section title
if data_type.endswith('title'):
Member

why are we using endswith and startswith here, instead of just string checking?

Member Author

Sounds more pythonic.
I will update the PR.
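The prefix/suffix checks under discussion, with a field path like the one from the diff:

```python
# A dotted field path coming from the inner hits, e.g. 'sections.title'.
data_type = 'sections.title'

is_section = data_type.startswith('sections')  # it is a section field
targets_title = data_type.endswith('title')    # the query targets the title
```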

</a>
</li>
{% endfor %}
{% with "100" as MAX_SUBSTRING_LIMIT %}
Member

Why did this file change so much?

Member Author

Every line is indented one more level.

Member Author

Also I have corrected the indentation of the whole file.

@ericholscher ericholscher merged commit d526249 into readthedocs:gsoc-19-indoc-search Jul 12, 2019
@ericholscher
Member

👍 Merged into the feature branch as a base

@dojutsu-user dojutsu-user deleted the search-section-linking branch July 12, 2019 15:48