
Optimize json parsing for faster indexing in elasticsearch #6130

Closed
dojutsu-user opened this issue Sep 3, 2019 · 7 comments
Labels
Improvement (Minor improvement to code) · Needed: design decision (A core team decision is required)

Comments

dojutsu-user commented Sep 3, 2019

Optimizing the JSON parsing will result in faster indexing and re-indexing:
https://github.com/readthedocs/readthedocs.org/blob/master/readthedocs/search/parse_json.py

@dojutsu-user added the Improvement (Minor improvement to code) and Accepted (Accepted issue on our roadmap) labels on Sep 3, 2019

humitos commented Sep 3, 2019

I'm lacking context here. What are the ideas around this?

dojutsu-user (author) commented:

I don't currently have any specific optimization ideas.
To give some context on this issue: when we index projects, the JSON files are parsed and the relevant data is sent to Elasticsearch. I believe the JSON parsing can be made faster.
That would reduce the time it takes for a single project to be indexed into Elasticsearch, and therefore the time needed to index all projects. A minimal sketch of the flow being described is below.
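For illustration only, here is a minimal sketch of that flow (parse a page's JSON file, send the extracted fields to Elasticsearch), assuming a hypothetical file name, index name, and field set rather than Read the Docs' actual parse_json.py code:

```python
import json

from elasticsearch import Elasticsearch  # assumes the elasticsearch-py client is installed

es = Elasticsearch("http://localhost:9200")  # hypothetical local cluster

# Hypothetical Sphinx-generated JSON page file; the real files live per project.
with open("page.fjson") as f:
    page = json.loads(f.read())  # this parsing step is what the issue proposes to speed up

# Index the extracted fields; the index name and fields are illustrative only.
es.index(
    index="readthedocs-pages",
    body={"title": page.get("title"), "body": page.get("body")},
)
```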

stsewd commented Sep 3, 2019

We use the json package from the std lib

data = json.loads(file_contents)

not sure how we can improve that

ericholscher commented:

I imagine it's the HTML parsing with pyquery that is the slow part.


dojutsu-user commented Sep 6, 2019

I think most of it is due to the HTML parsing, but we could also use a faster JSON library for deserialization.
For example: https://github.com/ijl/orjson
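To illustrate the suggestion (not a change that was made), swapping the stdlib call for orjson would look roughly like the sketch below; the file name is hypothetical:

```python
import orjson  # third-party, Rust-backed JSON library

# Hypothetical page file; orjson.loads accepts bytes or str and returns
# the same Python objects as json.loads.
with open("page.fjson", "rb") as f:
    data = orjson.loads(f.read())
```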

@humitos added the Needed: design decision (A core team decision is required) label and removed the Accepted (Accepted issue on our roadmap) label on Sep 9, 2019

dojutsu-user commented Sep 9, 2019

@ericholscher

I imagine it's the HTML parsing with pyquery that is the slow part.

pyquery by default uses html as its parser:
https://github.com/gawel/pyquery/blob/2351446c56d5c3df93e3fd81a2b828cf00e3b648/pyquery/pyquery.py#L225-L226

It is quite slow. I believe that if we use html5_parser instead, this might get faster (see the sketch after this comment).

Edit: html5_parser doesn't seem to work as expected. 😕
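For reference, the change being floated here (which, per the edit above, didn't end up working as expected) would be roughly the following: build the lxml tree with html5_parser and hand it to PyQuery, instead of letting PyQuery parse the raw HTML itself. This is a hedged sketch with made-up sample HTML, not the patch that was actually tried:

```python
from html5_parser import parse  # C-accelerated HTML5 parser producing lxml nodes
from pyquery import PyQuery

html = "<html><body><h1>Title</h1><p>Some content</p></body></html>"

# Current behaviour: PyQuery parses the string with its default (lxml html) parser.
doc_default = PyQuery(html)

# Proposed alternative: parse with html5_parser, then wrap the resulting lxml tree.
root = parse(html)
doc_html5 = PyQuery(root)

print(doc_default("h1").text())  # "Title"
print(doc_html5("h1").text())    # "Title"
```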

ericholscher commented:

Believe we fixed this already 👍
