Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize json parsing #6160

Merged
merged 8 commits into from
Oct 9, 2019
Merged

Optimize json parsing #6160

merged 8 commits into from
Oct 9, 2019

Conversation

dojutsu-user
Copy link
Member

@dojutsu-user dojutsu-user commented Sep 10, 2019

Related #6130

This PR removes the PyQuery and lxml and instead uses the selectolax which is much faster than lxml (benchmark). Parsing logic is exactly same with some small improvements.

Results:
Before

$ time python manage.py reindex_elasticsearch
real	0m24.272s
user	0m20.457s
sys	0m0.337s

After

$ time python manage.py reindex_elasticsearch
real	0m14.907s
user	0m10.274s
sys	0m0.336s

@dojutsu-user dojutsu-user requested review from ericholscher and a team September 10, 2019 11:31
@dojutsu-user dojutsu-user added the PR: work in progress Pull request is not ready for full review label Sep 10, 2019
@dojutsu-user dojutsu-user mentioned this pull request Sep 10, 2019
Copy link
Contributor

@davidfischer davidfischer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's two pieces here:

  • Swapping out pyquery/lxml for selectolax/modest
  • Swapping out the stdlib json for orjson

In both cases, we are looking to swap to more performant libraries. One thing I'd like to see if where is this performance benefit coming from. Was JSON parsing the slow part and orjson is making a huge difference or was HTML parsing the slow part and selectolax is improving things overall?

My biggest concern about this is that selectolax has a single developer even though it does appear to be actively developed. The modest engine has a bit more developer backing. Orjson I've used in the past where JSON parsing speed was really important and it worked reasonably well.

readthedocs/search/parse_json.py Outdated Show resolved Hide resolved
@dojutsu-user
Copy link
Member Author

@davidfischer
I think these results might give some insights

# with `json` and `selectolax`
real	0m20.142s
user	0m14.789s
sys	0m0.405s

# with `orjson` and `selectolax`
real	0m17.934s
user	0m13.506s
sys	0m0.412s

# with `json` and `pyquery`
real	0m31.838s
user	0m25.688s
sys	0m0.414s

# with `orjson` and `pyquery`
real	0m26.549s
user	0m21.444s
sys	0m0.314s

I believe both orjson and selectolax are responsible for performance benefits.

Copy link
Member

@humitos humitos left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not have too much experience with this code. Although, it seems a good change and I think that it's easy to revert if we find out that it does not behave better in the next deploy (reindex) than the current implementation.

My only concern about this PR is that it seems that we are changing some small logic on it as well (I'm not sure, though). If that's the case, I'd recommend to make this changes in different PR.

Also, if we care too much about having different implementations, this could be refactored to use a class with a specific API so we can swap from one to another. I suppose it's not worth the effort at this point, though.

try:
body('.toctree-wrapper').remove()
nodes_to_be_removed += body.css('.toctree-wrapper') + body.css('.contents.local.topic')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems there is a logic change here. Are we removing an extra node here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we are removing extra node here.
Earlier, I thought that all the Table of Contents are under a div having class .toctree-wrapper but latest I realised that pages which have Table of Contents describing the contents in the same page (like: SEO Guide) are having different css class -- contents local topic.
So I included that also.

break

text = parse_content(next_p.text(), remove_first_line=True)
text = parse_content(next_p.text(), remove_first_line=False)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also here. It seems we are changing the logic.

Copy link
Member Author

@dojutsu-user dojutsu-user Sep 11, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, when working on it again, I find that few lines were getting removed from indexing because of this.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should I target these things in another PR?

Copy link
Member

@humitos humitos Sep 11, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would say "yes".

If we first apply the logic changes and after that we apply the migration, we can easily come back to the previous (pyquery, etc) implementation with just one revert if we ever want.

@rushter
Copy link

rushter commented Sep 26, 2019

Hey, I'm the author of selectolax.

I will be glad to help you or add new features for your particular need.
Just ping me if you will have any questions or ideas on how to speed up your particular use-case even more by introducing new features in selectolax.

@ericholscher
Copy link
Member

I'd like to move this forward. I think I'm OK with merging it with the logic changes, since it's already done. In the future we should try and separate these though.

Copy link
Member

@ericholscher ericholscher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 on getting this in. Having faster indexing makes everything better.

@dojutsu-user
Copy link
Member Author

Thanks @rushter for your help. ❤️

@dojutsu-user dojutsu-user removed the PR: work in progress Pull request is not ready for full review label Sep 30, 2019
@ericholscher ericholscher merged commit 37cabe3 into readthedocs:master Oct 9, 2019
@dojutsu-user dojutsu-user deleted the optimize-json-parsing branch October 13, 2019 07:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants