-
-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Optimize json parsing #6160
Optimize json parsing #6160
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's two pieces here:
- Swapping out
pyquery
/lxml
forselectolax
/modest
- Swapping out the stdlib
json
fororjson
In both cases, we are looking to swap to more performant libraries. One thing I'd like to see if where is this performance benefit coming from. Was JSON parsing the slow part and orjson
is making a huge difference or was HTML parsing the slow part and selectolax
is improving things overall?
My biggest concern about this is that selectolax has a single developer even though it does appear to be actively developed. The modest engine has a bit more developer backing. Orjson I've used in the past where JSON parsing speed was really important and it worked reasonably well.
@davidfischer # with `json` and `selectolax`
real 0m20.142s
user 0m14.789s
sys 0m0.405s
# with `orjson` and `selectolax`
real 0m17.934s
user 0m13.506s
sys 0m0.412s
# with `json` and `pyquery`
real 0m31.838s
user 0m25.688s
sys 0m0.414s
# with `orjson` and `pyquery`
real 0m26.549s
user 0m21.444s
sys 0m0.314s I believe both |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do not have too much experience with this code. Although, it seems a good change and I think that it's easy to revert if we find out that it does not behave better in the next deploy (reindex) than the current implementation.
My only concern about this PR is that it seems that we are changing some small logic on it as well (I'm not sure, though). If that's the case, I'd recommend to make this changes in different PR.
Also, if we care too much about having different implementations, this could be refactored to use a class with a specific API so we can swap from one to another. I suppose it's not worth the effort at this point, though.
try: | ||
body('.toctree-wrapper').remove() | ||
nodes_to_be_removed += body.css('.toctree-wrapper') + body.css('.contents.local.topic') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems there is a logic change here. Are we removing an extra node here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, we are removing extra node here.
Earlier, I thought that all the Table of Contents
are under a div having class .toctree-wrapper
but latest I realised that pages which have Table of Contents
describing the contents in the same page (like: SEO Guide) are having different css class -- contents local topic
.
So I included that also.
break | ||
|
||
text = parse_content(next_p.text(), remove_first_line=True) | ||
text = parse_content(next_p.text(), remove_first_line=False) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also here. It seems we are changing the logic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, when working on it again, I find that few lines were getting removed from indexing because of this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should I target these things in another PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would say "yes".
If we first apply the logic changes and after that we apply the migration, we can easily come back to the previous (pyquery, etc) implementation with just one revert if we ever want.
Hey, I'm the author of selectolax. I will be glad to help you or add new features for your particular need. |
I'd like to move this forward. I think I'm OK with merging it with the logic changes, since it's already done. In the future we should try and separate these though. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 on getting this in. Having faster indexing makes everything better.
Thanks @rushter for your help. ❤️ |
Related #6130
This PR removes the
PyQuery
andlxml
and instead uses the selectolax which is much faster thanlxml
(benchmark). Parsing logic is exactly same with some small improvements.Results:
Before
$ time python manage.py reindex_elasticsearch real 0m24.272s user 0m20.457s sys 0m0.337s
After
$ time python manage.py reindex_elasticsearch real 0m14.907s user 0m10.274s sys 0m0.336s