This significantly speeds up the scrapers.
Perform fewer Pandas appends
pandas-dev/pandas#35407 (the deprecation of `DataFrame.append`) recommends accumulating rows in a list and constructing the DataFrame once, rather than appending row by row. So let's do that. It is significantly faster: https://stackoverflow.com/a/56746204
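A minimal sketch of the pattern, with invented data and column names rather than the scraper's actual code:

```python
import pandas as pd

scraped = [("England", 123), ("Australia", 98)]

# Slow: growing a DataFrame row by row copies the whole frame on
# every iteration, so this is quadratic in the number of rows.
df = pd.DataFrame(columns=["team", "runs"])
for team, runs in scraped:
    df = pd.concat(
        [df, pd.DataFrame([{"team": team, "runs": runs}])],
        ignore_index=True,
    )

# Fast: accumulate plain dicts and build the DataFrame exactly once.
rows = [{"team": team, "runs": runs} for team, runs in scraped]
df = pd.DataFrame(rows)
```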
Using py-spy (https://github.com/benfred/py-spy) for flamegraphs and good old `time` for overall timings, I created this comparison.

Before (uncached, for women's Test team records only):
After:
For a full update, before:
And after:
If we then add the cache (#6) into the equation:
This needed an upgrade to Pandas 1.4 in order to work with the nullable integer types: the upgrade picks up pandas-dev/pandas#43949, which fixed pandas-dev/pandas#25472.
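For context, a hedged sketch of what the nullable integer types look like (the column names here are illustrative, not the scraper's actual schema): the `Int64` extension dtype keeps missing values as `pd.NA` instead of silently coercing the column to `float64`.

```python
import pandas as pd  # pandas >= 1.4 for the fixes referenced above

rows = [{"matches": 10, "runs": 345}, {"matches": 3, "runs": None}]

# "Int64" (capital I) is the nullable extension dtype: the missing
# value becomes pd.NA and the column stays an integer column.
df = pd.DataFrame(rows).astype({"matches": "Int64", "runs": "Int64"})
print(df.dtypes)   # matches: Int64, runs: Int64
print(df["runs"])  # 345, <NA>
```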
Selectolax
The second commit uses Selectolax to speed up HTML parsing. https://rushter.com/blog/python-fast-html-parser/ has a good overview, and the switch is very straightforward.
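A minimal sketch of the Selectolax API (the HTML snippet is made up):

```python
from selectolax.parser import HTMLParser

html = "<table><tr><td>England</td><td>123</td></tr></table>"

# HTMLParser wraps a fast C HTML engine; .css() runs a CSS selector
# and returns matching nodes, whose .text() gives the inner text.
tree = HTMLParser(html)
cells = [node.text() for node in tree.css("td")]
print(cells)  # ['England', '123']
```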
Combining all these changes gives a big improvement: with caching, more efficient dataframe construction, and a faster HTML parser, we go from 626 seconds (just over 10 minutes) to 155 seconds (just over 2.5 minutes), roughly a 4× speedup!
We now spend about half our time just writing the CSVs!