Optimize netloc extraction #284

Merged
6 commits merged on May 16, 2023

Conversation

elliotwutingfeng
Contributor

SCHEME_RE can be replaced with an if-else equivalent for a tangible speed improvement.

Benchmarks of this optimization, together with the improvements in #283:

Python 3.10, Linux x64, Ryzen 7 5800X

import tldextract

%timeit tldextract.extract("")
%timeit tldextract.extract("com")
%timeit tldextract.extract("example\u3002com")
%timeit tldextract.extract("subdomain\uff0eexample\uff61com")
%timeit tldextract.extract("a\u3002very\uff0elong\uff61subdomain\u3002example\uff0ecom")
%timeit tldextract.extract("an\uff61even\u3002longer\uff0eand\uff61complex\u3002subdomain\uff0eexample\uff61com")
%timeit tldextract.extract("https://a\u3002b\uff0ec\uff61d\u3002e\uff0ef\uff61g\u3002h\uff0ei\uff61j\u3002k\uff0el\uff61m\u3002n\uff0eoo\uff61pp\u3002qqq\uff0errrr\uff61ssssss\u3002tttttttt\uff0euuuuuuuuuuu\uff61vvvvvvvvvvvvvvv\u3002wwwwwwwwwwwwwwwwwwwwww\uff0exxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\uff61yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy\u3002zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz.tw")
%timeit tldextract.extract("\u3002\u3002\u3002\u3002\u3002\u3002\u3002\u3002\u3002\u3002\u3002\u3002\u3002\u3002\u3002\u3002\u3002\u3002\u3002\u3002\uff0e\uff0e\uff0e\uff0e\uff0e\uff0e\uff0e\uff0e\uff0e\uff0e\uff0e\uff0e\uff0e\uff0e\uff0e\uff0e\uff0e\uff0e\uff0e\uff0e\uff0e\uff0e\uff0e\uff0e\uff0e\uff0e\uff0e\uff0e\uff0e\uff0e\uff0e\uff0e\uff0e\uff61\uff61\uff61\uff61\uff61\uff61\uff61\uff61\uff61\uff61\uff61\uff61\uff61\uff61\uff61\uff61\uff61\uff61\uff61\uff61\uff61\uff61\uff61\uff61\uff61\uff61\uff61\uff61\uff61\uff61\uff61\uff61\uff61\uff61\uff61\uff61\uff61\uff61\uff61\uff61\uff61\uff61\uff61\uff61\uff61\uff61\uff61\uff61\uff61\uff61\uff61\uff61\uff61\uff61\uff61\uff61\uff61\uff61\uff61\uff61\uff61\uff61\uff61\uff61\uff61\uff61\uff61\uff61\uff61\uff61\uff61\uff61\uff61")
1.72 µs ± 14.4 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
1.75 µs ± 15.7 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
2.34 µs ± 9.41 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
2.41 µs ± 4.97 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
2.6 µs ± 4.92 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
2.67 µs ± 18.9 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
3.76 µs ± 13.4 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
4.49 µs ± 21.2 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
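
The `%timeit` lines above are IPython magics. Outside IPython, similar measurements can be taken with the standard-library `timeit` module. The helper below is a minimal sketch, not part of this PR: `time_per_call` is a hypothetical name, and it reports the best run rather than the mean ± std. dev. that `%timeit` prints.

```python
import timeit


def time_per_call(func, arg, number=100_000, repeat=7):
    """Return the best observed time per call of func(arg), in seconds.

    Hypothetical helper for illustration; it is not part of tldextract.
    """
    timer = timeit.Timer(lambda: func(arg))
    # Each entry in runs is the total time for `number` calls.
    runs = timer.repeat(repeat=repeat, number=number)
    return min(runs) / number


if __name__ == "__main__":
    # Stand-in workload; with tldextract installed, pass tldextract.extract
    # and one of the benchmark strings above instead.
    seconds = time_per_call(str.lower, "EXAMPLE.COM")
    print(f"{seconds * 1e6:.3f} µs per call")
```

Taking the minimum tends to be less sensitive to background noise than the mean, so the numbers will not match `%timeit` output exactly.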

Changes

  • Replaced SCHEME_RE with faster if-else equivalent
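
For illustration, the kind of replacement described above might look like the sketch below. It assumes SCHEME_RE matches an optional RFC 3986-style scheme followed by `//`; the pattern and both function names are assumptions made for this example and are not copied from the merged code.

```python
import re
import string

# Assumed pattern: an optional scheme (letter, then letters/digits/+/./-)
# and a colon, followed by "//", anchored at the start of the string.
SCHEME_RE = re.compile(r"^([a-zA-Z][a-zA-Z0-9+.-]*:)?//")
SCHEME_CHARS = frozenset(string.ascii_letters + string.digits + "+-.")


def strip_scheme_regex(url: str) -> str:
    """Regex version: drop a leading 'scheme://' or '//' if present."""
    return SCHEME_RE.sub("", url, count=1)


def strip_scheme_if_else(url: str) -> str:
    """If-else version of the same check, without the regex engine."""
    slashes = url.find("//")
    if slashes == 0:
        # Scheme-relative URL such as "//example.com".
        return url[2:]
    if slashes < 2 or url[slashes - 1] != ":":
        # No "//" at all, or it is not preceded by "<scheme>:".
        return url
    scheme = url[: slashes - 1]
    if scheme[0] not in string.ascii_letters or not SCHEME_CHARS.issuperset(scheme):
        # Characters before ":" do not form a valid scheme.
        return url
    return url[slashes + 2 :]
```

Both functions return `"example.com/path"` for `"https://example.com/path"` and leave `"example.com"` unchanged. The if-else version relies only on `str.find`, slicing, and set membership, which is one plausible source of the speedup on short inputs.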

Owner

@john-kurkowski john-kurkowski left a comment

Nice!

I took your code in a slightly different direction: one public function in the module calling one private utility function, rather than two public functions with one wrapping the other.

Review thread on tldextract/remote.py (resolved)
@john-kurkowski john-kurkowski merged commit ad27cca into john-kurkowski:master May 16, 2023
@john-kurkowski
Owner

Thank you!

@elliotwutingfeng elliotwutingfeng deleted the scheme branch May 16, 2023 19:27
bmwiedemann pushed a commit to bmwiedemann/openSUSE that referenced this pull request May 21, 2023
https://build.opensuse.org/request/show/1088132
by user mia + dimstar_suse
- Update to 3.4.4:
Bugfixes
  * Honor private domains flag on self, not only when passed to
    __call__
    #gh/john-kurkowski/tldextract#289
- Changes in 3.4.3:
Bugfixes
  * Speed up 10-15% over all inputs
  * Refactor suffix_index() to use a trie
    #gh/john-kurkowski/tldextract#285
Docs
  * Adopt PEP257 doc style
- Changes in 3.4.2:
Bugfixes
  * Speed up 10-40% on "average" inputs, and even more on
    pathological inputs, like long subdomains
  * Optimize suffix_index(): search from right to left
    #gh/john-kurkowski/tldextract#283
  * Optimize netloc extraction: switch from regex to if/else
    #gh/john-kurkowski/tldextract#284