
Optimize json parsing for faster indexing in elasticsearch #6130

Closed
dojutsu-user opened this issue Sep 3, 2019 · 7 comments
Labels
Improvement (Minor improvement to code) · Needed: design decision (A core team decision is required)

Comments

dojutsu-user commented Sep 3, 2019

Optimizing the JSON parsing will result in faster indexing and re-indexing:
https://github.com/readthedocs/readthedocs.org/blob/master/readthedocs/search/parse_json.py

@dojutsu-user added the Improvement (Minor improvement to code) and Accepted (Accepted issue on our roadmap) labels on Sep 3, 2019

humitos commented Sep 3, 2019

I'm lacking context here. What are the ideas around this?

dojutsu-user (author) commented:

I don't currently have any specific optimization ideas.
To give some context on this issue: when we index projects, the JSON files are parsed and the relevant data is sent to Elasticsearch. I believe the JSON parsing can be made faster.
That would reduce the time it takes for a single project to be indexed into Elasticsearch, and therefore the time needed to index all projects. A minimal sketch of the flow being described is below.
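For illustration only, here is a minimal sketch of that flow (parse a page's JSON file, send the extracted fields to Elasticsearch), assuming a hypothetical file name, index name, and field set rather than Read the Docs' actual parse_json.py code:

```python
import json

from elasticsearch import Elasticsearch  # assumes the elasticsearch-py client is installed

es = Elasticsearch("http://localhost:9200")  # hypothetical local cluster

# Hypothetical Sphinx-generated JSON page file; the real files live per project.
with open("page.fjson") as f:
    page = json.loads(f.read())  # this parsing step is what the issue proposes to speed up

# Index the extracted fields; the index name and fields are illustrative only.
es.index(
    index="readthedocs-pages",
    body={"title": page.get("title"), "body": page.get("body")},
)
```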

stsewd commented Sep 3, 2019

We use the json package from the std lib

data = json.loads(file_contents)

not sure how we can improve that

ericholscher commented:

I imagine it's the HTML parsing with pyquery that is the slow part.


dojutsu-user commented Sep 6, 2019

I think most of it is due to the HTML parsing, but we could also use a faster JSON library for deserialization.
For example: https://github.com/ijl/orjson
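To illustrate the suggestion (not a change that was made), swapping the stdlib call for orjson would look roughly like the sketch below; the file name is hypothetical:

```python
import orjson  # third-party, Rust-backed JSON library

# Hypothetical page file; orjson.loads accepts bytes or str and returns
# the same Python objects as json.loads.
with open("page.fjson", "rb") as f:
    data = orjson.loads(f.read())
```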

@humitos added the Needed: design decision (A core team decision is required) label and removed the Accepted (Accepted issue on our roadmap) label on Sep 9, 2019

dojutsu-user commented Sep 9, 2019

@ericholscher

I imagine it's the HTML parsing with pyquery that is the slow part.

pyquery by default uses html as its parser:
https://github.com/gawel/pyquery/blob/2351446c56d5c3df93e3fd81a2b828cf00e3b648/pyquery/pyquery.py#L225-L226

It is quite slow. I believe that if we use html5_parser instead, this might get faster (see the sketch after this comment).

Edit: html5_parser doesn't seem to work as expected. 😕
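For reference, the change being floated here (which, per the edit above, didn't end up working as expected) would be roughly the following: build the lxml tree with html5_parser and hand it to PyQuery, instead of letting PyQuery parse the raw HTML itself. This is a hedged sketch with made-up sample HTML, not the patch that was actually tried:

```python
from html5_parser import parse  # C-accelerated HTML5 parser producing lxml nodes
from pyquery import PyQuery

html = "<html><body><h1>Title</h1><p>Some content</p></body></html>"

# Current behaviour: PyQuery parses the string with its default (lxml html) parser.
doc_default = PyQuery(html)

# Proposed alternative: parse with html5_parser, then wrap the resulting lxml tree.
root = parse(html)
doc_html5 = PyQuery(root)

print(doc_default("h1").text())  # "Title"
print(doc_html5("h1").text())    # "Title"
```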

ericholscher commented:

Believe we fixed this already 👍
