Speed up scrapers #8

Merged — 3 commits merged into main from speed-up-scrapers on Oct 1, 2022
Conversation

smcgivern (Collaborator)
This significantly speeds up the scrapers.

Perform fewer Pandas appends

pandas-dev/pandas#35407 says:

Unless I'm mistaken, users are always better off building up a list of values and passing them to the constructor, or building up a list of NDFrames followed by a single concat.

So let's do that. It is significantly faster: https://stackoverflow.com/a/56746204
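As a minimal sketch of the difference (the column names and values below are made up for illustration, not taken from the scraper):

```python
import pandas as pd

rows = [{"Team": "Australia", "Runs": 354}, {"Team": "England", "Runs": 289}]

# Slow: grow the dataframe one row at a time; every concat allocates a new frame
slow = pd.DataFrame(columns=["Team", "Runs"])
for row in rows:
    slow = pd.concat([slow, pd.DataFrame([row])], ignore_index=True)

# Fast: accumulate plain Python rows, then call the constructor once
fast = pd.DataFrame(rows)
```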

Using py-spy (https://github.com/benfred/py-spy) for flamegraphs and good old time for overall timings, I put together the following comparison.

Before (uncached, for women's Test team records only):

$ rm -r data/*
$ sudo time py-spy record -o before.svg -- python update_data.py
py-spy> Sampling process 100 times a second. Press Control-C to exit.

# snip
Scraping page 4
https://stats.espncricinfo.com/ci/engine/stats/index.html?class=8;filter=advanced;orderby=start;page=4;size=200;template=results;type=team;view=innings
All done!

py-spy> Stopped sampling because process exited
py-spy> Wrote flamegraph data to 'before.svg'. Samples: 1235 Errors: 0
       28.53 real        10.93 user         1.65 sys

[flamegraph: before.svg]

After:

$ rm -r data/*
$ sudo time py-spy record -o list.svg -- python update_data.py
py-spy> Sampling process 100 times a second. Press Control-C to exit.

# snip
Scraping page 4
https://stats.espncricinfo.com/ci/engine/stats/index.html?class=8;filter=advanced;orderby=start;page=4;size=200;template=results;type=team;view=innings
All done!

py-spy> Stopped sampling because process exited
py-spy> Wrote flamegraph data to 'list.svg'. Samples: 395 Errors: 0
       20.91 real         4.06 user         1.41 sys

[flamegraph: list.svg]

For a full update, before:

$ sudo time py-spy record -o before-2.svg -- python update_data.py
py-spy> Sampling process 100 times a second. Press Control-C to exit.
# snip
py-spy> Stopped sampling because process exited
py-spy> Wrote flamegraph data to 'before-2.svg'. Samples: 21656 Errors: 0
      626.02 real       194.27 user        31.50 sys

[flamegraph: before-2.svg]

And after:

$ sudo time py-spy record -o list-2.svg -- python update_data.py
py-spy> Sampling process 100 times a second. Press Control-C to exit.
# snip
py-spy> Stopped sampling because process exited
py-spy> Wrote flamegraph data to 'list-2.svg'. Samples: 8958 Errors: 0
      492.66 real        85.26 user        26.58 sys

[flamegraph: list-2.svg]

If we then add the cache (#6) into the equation:

      226.39 real        73.39 user        11.82 sys

This needed an upgrade to Pandas 1.4 in order to work with the nullable integer types: the upgrade picks up pandas-dev/pandas#43949, which fixed pandas-dev/pandas#25472.
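For context, a minimal sketch of the kind of read that relies on the nullable integer support (the file path and column name are hypothetical):

```python
import pandas as pd

# "Int64" (capital I) is pandas' nullable integer dtype: missing values stay
# <NA> instead of forcing the column to float. Reading a whole file this way
# could hit pandas-dev/pandas#25472 before the fix picked up in Pandas 1.4.
df = pd.read_csv("data/example.csv", dtype={"Runs": "Int64"})
```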

Selectolax

The second commit uses Selectolax to speed up HTML parsing. https://rushter.com/blog/python-fast-html-parser/ has a good overview, and the change itself is very straightforward.
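As a rough sketch of what the parsing looks like with Selectolax (the CSS selectors are assumptions about Cricinfo's table markup, not copied from the scraper):

```python
from selectolax.parser import HTMLParser

def parse_rows(html: str) -> list[list[str]]:
    """Return the cell text for each data row in the results table."""
    tree = HTMLParser(html)
    rows = []
    for tr in tree.css("table.engineTable tr.data1"):  # selector is an assumption
        rows.append([td.text(strip=True) for td in tr.css("td")])
    return rows
```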

Combining all these changes gives a big improvement: with caching, more efficient dataframe construction, and a faster HTML parser, we go from 626 seconds (just over 10 minutes) to 155 seconds (just over 2.5 minutes)!

$ sudo time py-spy record -o selectolax-cached.svg -- python update_data.py
py-spy> Sampling process 100 times a second. Press Control-C to exit.
# snip
py-spy> Stopped sampling because process exited
py-spy> Wrote flamegraph data to 'selectolax-cached.svg'. Samples: 2618 Errors: 0
      155.07 real        21.05 user         8.74 sys

[flamegraph: selectolax-cached.svg]

We now spend about half our time just writing the CSVs!

@smcgivern self-assigned this Sep 18, 2022
This speeds up the scrapers significantly (a clean run of the women's
Test team records takes two thirds of the time it took before) by
changing the appending approach used to build up the intermediate
dataframes.

Before, we were constructing a single-row dataframe for each row in the
HTML table. We were then appending those on to a page dataframe (with up
to 200 rows), and finally appending _that_ to the overall dataframe.

This is slow, and it's known to be slow in the Pandas community. See
https://stackoverflow.com/a/56746204 for an example. We can improve this
by replacing all the intermediate dataframes with regular Python lists.
Then, once a page is complete, we construct a single dataframe from that
list of lists, convert it to the correct types, and finally concat that
with the overall dataframe.
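
Roughly, the per-page flow now looks like this (the columns and dtypes are illustrative, not the scraper's actual schema):

```python
import pandas as pd

COLUMNS = ["Team", "Score", "Overs"]  # illustrative schema

def add_page(overall: pd.DataFrame, page_rows: list[list[str]]) -> pd.DataFrame:
    # Build one dataframe for the whole page from plain Python lists...
    page = pd.DataFrame(page_rows, columns=COLUMNS)
    # ...convert the numeric column, keeping missing values as <NA> via the
    # nullable Int64 dtype rather than falling back to float...
    page["Score"] = pd.to_numeric(page["Score"], errors="coerce").astype("Int64")
    # ...and concat onto the overall dataframe once per page.
    return pd.concat([overall, page], ignore_index=True)
```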
Switching to Selectolax for HTML parsing is much faster: https://rushter.com/blog/python-fast-html-parser/

It's almost a drop-in replacement, so this is a big win.
I'm not sure why Cricinfo has both SWA and SWZ here.
@smcgivern merged commit 63e6fe5 into main on Oct 1, 2022
@smcgivern deleted the speed-up-scrapers branch on October 1, 2022 at 19:53
Successfully merging this pull request may close these issues.

read_csv fails with TypeError: object cannot be converted to an IntegerDtype yet succeeds when reading chunks