Speed up scrapers #8

Merged — 3 commits merged into main from speed-up-scrapers on Oct 1, 2022
Conversation

smcgivern (Collaborator)
This significantly speeds up the scrapers.

Perform fewer Pandas appends

pandas-dev/pandas#35407 says:

Unless I'm mistaken, users are always better off building up a list of values and passing them to the constructor, or building up a list of NDFrames followed by a single concat.

So let's do that. It is significantly faster: https://stackoverflow.com/a/56746204
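As a minimal sketch of the difference (the column names and values below are made up for illustration, not taken from the scraper):

```python
import pandas as pd

rows = [{"Team": "Australia", "Runs": 354}, {"Team": "England", "Runs": 289}]

# Slow: grow the dataframe one row at a time; every concat allocates a new frame
slow = pd.DataFrame(columns=["Team", "Runs"])
for row in rows:
    slow = pd.concat([slow, pd.DataFrame([row])], ignore_index=True)

# Fast: accumulate plain Python rows, then call the constructor once
fast = pd.DataFrame(rows)
```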

Using py-spy (https://github.com/benfred/py-spy) for flamegraphs and good old time for overall timings, I put together the following comparison.

Before (uncached, for women's Test team records only):

$ rm -r data/*
$ sudo time py-spy record -o before.svg -- python update_data.py
py-spy> Sampling process 100 times a second. Press Control-C to exit.

# snip
Scraping page 4
https://stats.espncricinfo.com/ci/engine/stats/index.html?class=8;filter=advanced;orderby=start;page=4;size=200;template=results;type=team;view=innings
All done!

py-spy> Stopped sampling because process exited
py-spy> Wrote flamegraph data to 'before.svg'. Samples: 1235 Errors: 0
       28.53 real        10.93 user         1.65 sys

[flamegraph: before.svg]

After:

$ rm -r data/*
$ sudo time py-spy record -o list.svg -- python update_data.py
py-spy> Sampling process 100 times a second. Press Control-C to exit.

# snip
Scraping page 4
https://stats.espncricinfo.com/ci/engine/stats/index.html?class=8;filter=advanced;orderby=start;page=4;size=200;template=results;type=team;view=innings
All done!

py-spy> Stopped sampling because process exited
py-spy> Wrote flamegraph data to 'list.svg'. Samples: 395 Errors: 0
       20.91 real         4.06 user         1.41 sys

[flamegraph: list.svg]

For a full update, before:

$ sudo time py-spy record -o before-2.svg -- python update_data.py
py-spy> Sampling process 100 times a second. Press Control-C to exit.
# snip
py-spy> Stopped sampling because process exited
py-spy> Wrote flamegraph data to 'before-2.svg'. Samples: 21656 Errors: 0
      626.02 real       194.27 user        31.50 sys

[flamegraph: before-2.svg]

And after:

$ sudo time py-spy record -o list-2.svg -- python update_data.py
py-spy> Sampling process 100 times a second. Press Control-C to exit.
# snip
py-spy> Stopped sampling because process exited
py-spy> Wrote flamegraph data to 'list-2.svg'. Samples: 8958 Errors: 0
      492.66 real        85.26 user        26.58 sys

[flamegraph: list-2.svg]

If we then add the cache (#6) into the equation:

      226.39 real        73.39 user        11.82 sys

This needed an upgrade to Pandas 1.4 in order to work with the nullable integer types: the upgrade picks up pandas-dev/pandas#43949, which fixed pandas-dev/pandas#25472.
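For context, a minimal sketch of the kind of read that relies on the nullable integer support (the file path and column name are hypothetical):

```python
import pandas as pd

# "Int64" (capital I) is pandas' nullable integer dtype: missing values stay
# <NA> instead of forcing the column to float. Reading a whole file this way
# could hit pandas-dev/pandas#25472 before the fix picked up in Pandas 1.4.
df = pd.read_csv("data/example.csv", dtype={"Runs": "Int64"})
```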

Selectolax

The second commit uses Selectolax to speed up HTML parsing. https://rushter.com/blog/python-fast-html-parser/ has a good overview, and the change itself is very straightforward.
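As a rough sketch of what the parsing looks like with Selectolax (the CSS selectors are assumptions about Cricinfo's table markup, not copied from the scraper):

```python
from selectolax.parser import HTMLParser

def parse_rows(html: str) -> list[list[str]]:
    """Return the cell text for each data row in the results table."""
    tree = HTMLParser(html)
    rows = []
    for tr in tree.css("table.engineTable tr.data1"):  # selector is an assumption
        rows.append([td.text(strip=True) for td in tr.css("td")])
    return rows
```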

Combining all these changes gives a big improvement: with caching, more efficient dataframe construction, and a faster HTML parser, we go from 626 seconds (just over 10 minutes) to 155 seconds (just over 2.5 minutes)!

$ sudo time py-spy record -o selectolax-cached.svg -- python update_data.py
py-spy> Sampling process 100 times a second. Press Control-C to exit.
# snip
py-spy> Stopped sampling because process exited
py-spy> Wrote flamegraph data to 'selectolax-cached.svg'. Samples: 2618 Errors: 0
      155.07 real        21.05 user         8.74 sys

[flamegraph: selectolax-cached.svg]

We now spend about half our time just writing the CSVs!

@smcgivern self-assigned this Sep 18, 2022
This speeds up the scrapers significantly (a clean run of the women's
Test team records takes two thirds of the time it took before) by
changing the appending approach used to build up the intermediate
dataframes.

Before, we were constructing a single-row dataframe for each row in the
HTML table. We were then appending those on to a page dataframe (with up
to 200 rows), and finally appending _that_ to the overall dataframe.

This is slow, and it's known to be slow in the Pandas community. See
https://stackoverflow.com/a/56746204 for an example. We can improve this
by replacing all the intermediate dataframes with regular Python lists.
Then, once a page is complete, we construct a single dataframe from that
list of lists, convert it to the correct types, and finally concat that
with the overall dataframe.
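
Roughly, the per-page flow now looks like this (the columns and dtypes are illustrative, not the scraper's actual schema):

```python
import pandas as pd

COLUMNS = ["Team", "Score", "Overs"]  # illustrative schema

def add_page(overall: pd.DataFrame, page_rows: list[list[str]]) -> pd.DataFrame:
    # Build one dataframe for the whole page from plain Python lists...
    page = pd.DataFrame(page_rows, columns=COLUMNS)
    # ...convert the numeric column, keeping missing values as <NA> via the
    # nullable Int64 dtype rather than falling back to float...
    page["Score"] = pd.to_numeric(page["Score"], errors="coerce").astype("Int64")
    # ...and concat onto the overall dataframe once per page.
    return pd.concat([overall, page], ignore_index=True)
```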
Switching to Selectolax for HTML parsing is much faster: https://rushter.com/blog/python-fast-html-parser/

It's almost a drop-in replacement, so this is a big win.
I'm not sure why Cricinfo has both SWA and SWZ here.
@smcgivern merged commit 63e6fe5 into main on Oct 1, 2022
@smcgivern deleted the speed-up-scrapers branch on October 1, 2022 at 19:53
Successfully merging this pull request may close these issues.

read_csv fails with TypeError: object cannot be converted to an IntegerDtype yet succeeds when reading chunks