Optimize json parsing #6160

dojutsu-user · 2019-09-10T11:29:50Z

Related #6130

This PR removes the PyQuery and lxml and instead uses the selectolax which is much faster than lxml (benchmark). Parsing logic is exactly same with some small improvements.

Results:
Before

$ time python manage.py reindex_elasticsearch
real	0m24.272s
user	0m20.457s
sys	0m0.337s

After

$ time python manage.py reindex_elasticsearch
real	0m14.907s
user	0m10.274s
sys	0m0.336s

davidfischer

There's two pieces here:

Swapping out pyquery/lxml for selectolax/modest
Swapping out the stdlib json for orjson

In both cases, we are looking to swap to more performant libraries. One thing I'd like to see if where is this performance benefit coming from. Was JSON parsing the slow part and orjson is making a huge difference or was HTML parsing the slow part and selectolax is improving things overall?

My biggest concern about this is that selectolax has a single developer even though it does appear to be actively developed. The modest engine has a bit more developer backing. Orjson I've used in the past where JSON parsing speed was really important and it worked reasonably well.

readthedocs/search/parse_json.py

dojutsu-user · 2019-09-11T07:07:05Z

@davidfischer
I think these results might give some insights

# with `json` and `selectolax`
real	0m20.142s
user	0m14.789s
sys	0m0.405s

# with `orjson` and `selectolax`
real	0m17.934s
user	0m13.506s
sys	0m0.412s

# with `json` and `pyquery`
real	0m31.838s
user	0m25.688s
sys	0m0.414s

# with `orjson` and `pyquery`
real	0m26.549s
user	0m21.444s
sys	0m0.314s

I believe both orjson and selectolax are responsible for performance benefits.

humitos

I do not have too much experience with this code. Although, it seems a good change and I think that it's easy to revert if we find out that it does not behave better in the next deploy (reindex) than the current implementation.

My only concern about this PR is that it seems that we are changing some small logic on it as well (I'm not sure, though). If that's the case, I'd recommend to make this changes in different PR.

Also, if we care too much about having different implementations, this could be refactored to use a class with a specific API so we can swap from one to another. I suppose it's not worth the effort at this point, though.

humitos · 2019-09-11T14:50:58Z

readthedocs/search/parse_json.py

    try:
-        body('.toctree-wrapper').remove()
+        nodes_to_be_removed += body.css('.toctree-wrapper') + body.css('.contents.local.topic')


It seems there is a logic change here. Are we removing an extra node here?

Yes, we are removing extra node here.
Earlier, I thought that all the Table of Contents are under a div having class .toctree-wrapper but latest I realised that pages which have Table of Contents describing the contents in the same page (like: SEO Guide) are having different css class -- contents local topic.
So I included that also.

humitos · 2019-09-11T14:53:24Z

readthedocs/search/parse_json.py

                    break

-            text = parse_content(next_p.text(), remove_first_line=True)
+            text = parse_content(next_p.text(), remove_first_line=False)


Also here. It seems we are changing the logic.

Yes, when working on it again, I find that few lines were getting removed from indexing because of this.

Should I target these things in another PR?

I would say "yes".

If we first apply the logic changes and after that we apply the migration, we can easily come back to the previous (pyquery, etc) implementation with just one revert if we ever want.

rushter · 2019-09-26T21:24:11Z

Hey, I'm the author of selectolax.

I will be glad to help you or add new features for your particular need.
Just ping me if you will have any questions or ideas on how to speed up your particular use-case even more by introducing new features in selectolax.

ericholscher · 2019-09-30T14:32:40Z

I'd like to move this forward. I think I'm OK with merging it with the logic changes, since it's already done. In the future we should try and separate these though.

ericholscher

👍 on getting this in. Having faster indexing makes everything better.

dojutsu-user · 2019-09-30T15:04:43Z

Thanks @rushter for your help. ❤️

dojutsu-user added 7 commits September 9, 2019 23:54

use orjson

1299648

rewrite parse_json with selectolax

92daa71

remove pyquery and add selectolax

4b14841

Merge branch 'master' into optimize-json-parsing

30cb91f

add orjson to requirements.txt

9fe15a9

remove pyqery from test

e3f6152

remove simple analyzer

892ce48

dojutsu-user requested review from ericholscher and a team September 10, 2019 11:31

dojutsu-user added the PR: work in progress Pull request is not ready for full review label Sep 10, 2019

dojutsu-user mentioned this pull request Sep 10, 2019

Implemented orjson #6150

Closed

nightlark mentioned this pull request Sep 10, 2019

Add documentation test to Travis GMLC-TDC/HELICS#803

Closed

davidfischer reviewed Sep 10, 2019

View reviewed changes

readthedocs/search/parse_json.py Outdated Show resolved Hide resolved

import orjson as json

1f47947

humitos reviewed Sep 11, 2019

View reviewed changes

ericholscher approved these changes Sep 30, 2019

View reviewed changes

dojutsu-user removed the PR: work in progress Pull request is not ready for full review label Sep 30, 2019

ericholscher merged commit 37cabe3 into readthedocs:master Oct 9, 2019

dojutsu-user deleted the optimize-json-parsing branch October 13, 2019 07:48

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Optimize json parsing #6160

Optimize json parsing #6160

dojutsu-user commented Sep 10, 2019 •

edited

Loading

davidfischer left a comment

dojutsu-user commented Sep 11, 2019

humitos left a comment

humitos Sep 11, 2019

dojutsu-user Sep 11, 2019

humitos Sep 11, 2019

dojutsu-user Sep 11, 2019 •

edited

Loading

dojutsu-user Sep 11, 2019

humitos Sep 11, 2019 •

edited

Loading

rushter commented Sep 26, 2019

ericholscher commented Sep 30, 2019

ericholscher left a comment

dojutsu-user commented Sep 30, 2019

Optimize json parsing #6160

Optimize json parsing #6160

Conversation

dojutsu-user commented Sep 10, 2019 • edited Loading

davidfischer left a comment

Choose a reason for hiding this comment

dojutsu-user commented Sep 11, 2019

humitos left a comment

Choose a reason for hiding this comment

humitos Sep 11, 2019

Choose a reason for hiding this comment

dojutsu-user Sep 11, 2019

Choose a reason for hiding this comment

humitos Sep 11, 2019

Choose a reason for hiding this comment

dojutsu-user Sep 11, 2019 • edited Loading

Choose a reason for hiding this comment

dojutsu-user Sep 11, 2019

Choose a reason for hiding this comment

humitos Sep 11, 2019 • edited Loading

Choose a reason for hiding this comment

rushter commented Sep 26, 2019

ericholscher commented Sep 30, 2019

ericholscher left a comment

Choose a reason for hiding this comment

dojutsu-user commented Sep 30, 2019

dojutsu-user commented Sep 10, 2019 •

edited

Loading

dojutsu-user Sep 11, 2019 •

edited

Loading

humitos Sep 11, 2019 •

edited

Loading