Switch List implementation to use Hash-based lookup #133
Merged
The Hash doesn't require manual reindexing when new rules are added. Moreover, the Hash-based algorithm has almost O(1) lookup time. More precisely, the lookup time is O(k), where k is the number of parts in the input string:

find("example.com") -> k = 2
find("www.example.com") -> k = 3
find("www.subdomain.example.com") -> k = 4

It's fair to assume that the average number of parts is 3, and hostnames longer than 5 parts are quite uncommon.

Note that the Hash-based lookup is highly influenced by the underlying Hash implementation provided by the programming language. A Perfect Hash would be preferable in terms of lookup time, as it offers true O(1) lookup (whereas a dynamic Hash is O(1) on average). However, a Perfect Hash requires computing a perfect hashing function, and it would not allow the flexibility of adding or removing rules at runtime.
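The O(k) behaviour described above can be illustrated with a minimal sketch. The `SKETCH_RULES` table and the `find_rules` helper are hypothetical names invented for this example, and wildcard and exception rules are deliberately left out; this is not the gem's actual implementation.

```ruby
# Hypothetical, simplified rule table: the keys are rule values.
# The real list stores richer entries and many more rules.
SKETCH_RULES = {
  "com"          => true,
  "uk"           => true,
  "co.uk"        => true,
  "blogspot.com" => true,
}.freeze

# Returns every rule value matching +name+, probing the Hash once per
# candidate suffix. A name with k parts therefore needs k probes.
def find_rules(name, rules = SKETCH_RULES)
  matches = []
  candidate = nil
  # Walk the parts right to left, growing the candidate suffix:
  # "com", then "example.com", then "www.example.com".
  name.split(".").reverse_each do |part|
    candidate = candidate ? "#{part}.#{candidate}" : part
    matches << candidate if rules.key?(candidate)
  end
  matches
end

find_rules("www.example.com")   # => ["com"]
find_rules("www.example.co.uk") # => ["uk", "co.uk"]
```

Each probe is an average O(1) Hash access, so the total cost depends on the number of parts in the name rather than on the number of rules in the list.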
➜ publicsuffix-ruby git:(thesis-hash) ✗ ruby benchmarks/bm_parts.rb
Warming up --------------------------------------
tokenizer1 26.384k i/100ms
tokenizer2 26.571k i/100ms
tokenizer3 32.293k i/100ms
tokenizer4 27.595k i/100ms
Calculating -------------------------------------
tokenizer1 310.488k (± 6.6%) i/s - 1.557M in 5.035961s
tokenizer2 308.801k (± 8.3%) i/s - 1.541M in 5.027643s
tokenizer3 378.716k (± 5.3%) i/s - 1.905M in 5.045422s
tokenizer4 305.493k (± 9.6%) i/s - 1.518M in 5.018550s

Comparison:
tokenizer3: 378716.5 i/s
tokenizer1: 310488.3 i/s - 1.22x slower
tokenizer2: 308800.6 i/s - 1.23x slower
tokenizer4: 305493.5 i/s - 1.24x slower
After I finally realized why the benchmarks were still using the old code, and fixed the issue in 5ed8d00, here are the new benchmarks that compare the existing implementation with the new Hash-based lookup.

Using the naive indexing:

➜ publicsuffix-ruby git:(master) ruby benchmarks/bm_find.rb
Rehearsal -------------------------------------------------------------
NAME_SHORT 1.550000 0.010000 1.560000 ( 1.563616)
NAME_SHORT (noprivate) 2.060000 0.020000 2.080000 ( 2.117548)
NAME_MEDIUM 1.720000 0.020000 1.740000 ( 1.760489)
NAME_MEDIUM (noprivate) 2.430000 0.020000 2.450000 ( 2.649166)
NAME_LONG 1.630000 0.000000 1.630000 ( 1.643268)
NAME_LONG (noprivate) 2.210000 0.020000 2.230000 ( 2.262352)
NAME_WILD 0.600000 0.000000 0.600000 ( 0.601043)
NAME_WILD (noprivate) 1.320000 0.070000 1.390000 ( 1.475682)
NAME_EXCP 0.940000 0.060000 1.000000 ( 1.071000)
NAME_EXCP (noprivate) 1.120000 0.010000 1.130000 ( 1.136978)
IAAA 0.690000 0.000000 0.690000 ( 0.694769)
IAAA (noprivate) 1.010000 0.010000 1.020000 ( 1.011105)
IZZZ 0.560000 0.000000 0.560000 ( 0.569191)
IZZZ (noprivate) 0.900000 0.000000 0.900000 ( 0.895128)
PAAA 7.310000 0.090000 7.400000 ( 8.036596)
PAAA (noprivate) 7.910000 0.080000 7.990000 ( 8.450394)
PZZZ 1.060000 0.000000 1.060000 ( 1.109186)
PZZZ (noprivate) 1.390000 0.010000 1.400000 ( 1.411946)
JP 50.590000 0.390000 50.980000 ( 52.698865)
JP (noprivate) 49.840000 0.230000 50.070000 ( 50.385524)
IT 9.440000 0.020000 9.460000 ( 9.502403)
IT (noprivate) 9.940000 0.030000 9.970000 ( 10.008055)
COM 8.610000 0.030000 8.640000 ( 8.657849)
COM (noprivate) 9.330000 0.130000 9.460000 ( 9.700029)
-------------------------------------------------- total: 175.410000sec

user system total real
NAME_SHORT 1.580000 0.000000 1.580000 ( 1.588811)
NAME_SHORT (noprivate) 2.000000 0.010000 2.010000 ( 2.024544)
NAME_MEDIUM 1.960000 0.020000 1.980000 ( 2.012659)
NAME_MEDIUM (noprivate) 2.150000 0.020000 2.170000 ( 2.193273)
NAME_LONG 1.660000 0.000000 1.660000 ( 1.666938)
NAME_LONG (noprivate) 2.010000 0.000000 2.010000 ( 2.018177)
NAME_WILD 0.600000 0.000000 0.600000 ( 0.601061)
NAME_WILD (noprivate) 0.920000 0.000000 0.920000 ( 0.920315)
NAME_EXCP 0.700000 0.010000 0.710000 ( 0.708406)
NAME_EXCP (noprivate) 1.260000 0.010000 1.270000 ( 1.298971)
IAAA 0.810000 0.010000 0.820000 ( 0.829160)
IAAA (noprivate) 1.180000 0.000000 1.180000 ( 1.207569)
IZZZ 0.640000 0.010000 0.650000 ( 0.646752)
IZZZ (noprivate) 1.020000 0.000000 1.020000 ( 1.037327)
PAAA 6.180000 0.020000 6.200000 ( 6.227082)
PAAA (noprivate) 6.970000 0.050000 7.020000 ( 7.089971)
PZZZ 0.930000 0.000000 0.930000 ( 0.937254)
PZZZ (noprivate) 1.310000 0.010000 1.320000 ( 1.324235)
JP 47.930000 0.200000 48.130000 ( 48.440196)
JP (noprivate) 48.440000 0.260000 48.700000 ( 49.110888)
IT 9.660000 0.090000 9.750000 ( 9.874755)
IT (noprivate) 9.950000 0.070000 10.020000 ( 10.163920)
COM 7.930000 0.020000 7.950000 ( 7.986893)
COM (noprivate) 8.170000 0.010000 8.180000 ( 8.186619)

Using Hash:

➜ publicsuffix-ruby git:(thesis-hash) ruby benchmarks/bm_find.rb
Rehearsal -------------------------------------------------------------
NAME_SHORT 0.310000 0.000000 0.310000 ( 0.363447)
NAME_SHORT (noprivate) 0.360000 0.000000 0.360000 ( 0.402509)
NAME_MEDIUM 0.320000 0.000000 0.320000 ( 0.317237)
NAME_MEDIUM (noprivate) 0.410000 0.000000 0.410000 ( 0.413092)
NAME_LONG 0.400000 0.000000 0.400000 ( 0.396608)
NAME_LONG (noprivate) 0.510000 0.000000 0.510000 ( 0.510915)
NAME_WILD 0.390000 0.000000 0.390000 ( 0.393804)
NAME_WILD (noprivate) 0.510000 0.010000 0.520000 ( 0.507487)
NAME_EXCP 0.400000 0.000000 0.400000 ( 0.401723)
NAME_EXCP (noprivate) 0.520000 0.000000 0.520000 ( 0.525549)
IAAA 0.240000 0.000000 0.240000 ( 0.244243)
IAAA (noprivate) 0.360000 0.000000 0.360000 ( 0.359558)
IZZZ 0.250000 0.000000 0.250000 ( 0.249716)
IZZZ (noprivate) 0.360000 0.000000 0.360000 ( 0.356862)
PAAA 0.440000 0.000000 0.440000 ( 0.445464)
PAAA (noprivate) 0.590000 0.000000 0.590000 ( 0.591834)
PZZZ 0.450000 0.000000 0.450000 ( 0.446044)
PZZZ (noprivate) 0.520000 0.000000 0.520000 ( 0.524458)
JP 0.320000 0.000000 0.320000 ( 0.327063)
JP (noprivate) 0.430000 0.000000 0.430000 ( 0.430906)
IT 0.270000 0.000000 0.270000 ( 0.265015)
IT (noprivate) 0.340000 0.000000 0.340000 ( 0.345299)
COM 0.250000 0.000000 0.250000 ( 0.244028)
COM (noprivate) 0.340000 0.010000 0.350000 ( 0.343862)
---------------------------------------------------- total: 9.310000sec

user system total real
NAME_SHORT 0.220000 0.000000 0.220000 ( 0.221509)
NAME_SHORT (noprivate) 0.320000 0.000000 0.320000 ( 0.329044)
NAME_MEDIUM 0.290000 0.000000 0.290000 ( 0.296088)
NAME_MEDIUM (noprivate) 0.390000 0.000000 0.390000 ( 0.393592)
NAME_LONG 0.420000 0.000000 0.420000 ( 0.419251)
NAME_LONG (noprivate) 0.500000 0.000000 0.500000 ( 0.499873)
NAME_WILD 0.420000 0.000000 0.420000 ( 0.421002)
NAME_WILD (noprivate) 0.480000 0.000000 0.480000 ( 0.485180)
NAME_EXCP 0.400000 0.000000 0.400000 ( 0.401010)
NAME_EXCP (noprivate) 0.510000 0.000000 0.510000 ( 0.506889)
IAAA 0.250000 0.000000 0.250000 ( 0.257035)
IAAA (noprivate) 0.350000 0.000000 0.350000 ( 0.352895)
IZZZ 0.250000 0.000000 0.250000 ( 0.250804)
IZZZ (noprivate) 0.350000 0.010000 0.360000 ( 0.352272)
PAAA 0.440000 0.000000 0.440000 ( 0.444238)
PAAA (noprivate) 0.540000 0.000000 0.540000 ( 0.549019)
PZZZ 0.440000 0.000000 0.440000 ( 0.449137)
PZZZ (noprivate) 0.550000 0.000000 0.550000 ( 0.559688)
JP 0.330000 0.000000 0.330000 ( 0.337413)
JP (noprivate) 0.450000 0.010000 0.460000 ( 0.458545)
IT 0.240000 0.000000 0.240000 ( 0.247337)
IT (noprivate) 0.350000 0.000000 0.350000 ( 0.351233)
COM 0.260000 0.000000 0.260000 ( 0.261882)
COM (noprivate) 0.340000 0.000000 0.340000 ( 0.347857)
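For readers who want to reproduce the general shape of this comparison without the gem, here is a rough sketch using `Benchmark.bmbm` from Ruby's standard library, which prints the same rehearsal and real sections. The synthetic data and the `scan_lookup` and `hash_lookup` helpers are assumptions made for illustration; the linear scan is only a stand-in and does not reproduce the previous indexing code.

```ruby
require "benchmark"

# Hypothetical data: 10_000 synthetic rule values, not the real list.
BENCH_VALUES = Array.new(10_000) { |i| "suffix#{i}.example" }.freeze
BENCH_INDEX  = BENCH_VALUES.each_with_object({}) { |v, h| h[v] = true }.freeze

# Linear scan: the cost grows with the number of stored rules.
def scan_lookup(value)
  BENCH_VALUES.include?(value)
end

# Hash probe: average constant time, independent of the list size.
def hash_lookup(value)
  BENCH_INDEX.key?(value)
end

if __FILE__ == $PROGRAM_NAME
  needle = BENCH_VALUES.last # worst case for the linear scan
  Benchmark.bmbm(12) do |x|
    x.report("array scan") { 1_000.times { scan_lookup(needle) } }
    x.report("hash probe") { 1_000.times { hash_lookup(needle) } }
  end
end
```

Absolute timings depend on the machine and the Ruby version, so only the relative difference between the two reports is meaningful.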
Using the naive indexing:

➜ publicsuffix-ruby git:(master) ruby test/profilers/execution_profiler.rb
Total allocated: 204162 bytes (4420 objects)
Total retained: 0 bytes (0 objects)

allocated memory by gem
-----------------------------------
204002 publicsuffix-ruby/lib
160 other

allocated memory by class
-----------------------------------
177036 String
18416 Array
2560 Hash
2134 Regexp
1168 RubyVM::Env
1120 MatchData
800 Proc
576 Enumerator::Lazy
96 Enumerator::Generator
96 Enumerator::Yielder
80 PublicSuffix::Domain
80 PublicSuffix::Rule::Wildcard

allocated objects by gem
-----------------------------------
4416 publicsuffix-ruby/lib
4 other

allocated objects by class
-----------------------------------
4332 String
32 Array
16 Hash
10 Proc
10 RubyVM::Env
4 Enumerator::Lazy
4 MatchData
4 Regexp
2 Enumerator::Generator
2 Enumerator::Yielder
2 PublicSuffix::Domain
2 PublicSuffix::Rule::Wildcard

retained memory by gem
-----------------------------------
NO DATA

retained memory by file
-----------------------------------
NO DATA

retained memory by location
-----------------------------------
NO DATA

retained memory by class
-----------------------------------
NO DATA

retained objects by gem
-----------------------------------
NO DATA

retained objects by file
-----------------------------------
NO DATA

retained objects by location
-----------------------------------
NO DATA

retained objects by class
-----------------------------------
NO DATA

Using Hash:

➜ publicsuffix-ruby git:(thesis-hash) ruby test/profilers/execution_profiler.rb
Total allocated: 15170 bytes (160 objects)
Total retained: 0 bytes (0 objects)

allocated memory by gem
-----------------------------------
15010 publicsuffix-ruby/lib
160 other

allocated memory by class
-----------------------------------
8076 String
2560 Hash
2134 Regexp
1120 Array
1120 MatchData
80 PublicSuffix::Domain
80 PublicSuffix::Rule::Wildcard

allocated objects by gem
-----------------------------------
156 publicsuffix-ruby/lib
4 other

allocated objects by class
-----------------------------------
108 String
24 Array
16 Hash
4 MatchData
4 Regexp
2 PublicSuffix::Domain
2 PublicSuffix::Rule::Wildcard

retained memory by gem
-----------------------------------
NO DATA

retained memory by file
-----------------------------------
NO DATA

retained memory by location
-----------------------------------
NO DATA

retained memory by class
-----------------------------------
NO DATA

retained objects by gem
-----------------------------------
NO DATA

retained objects by file
-----------------------------------
NO DATA

retained objects by location
-----------------------------------
NO DATA

retained objects by class
-----------------------------------
NO DATA
When the rule is stored, we can remove the value from the Rule, as the value is effectively the key of the Hash.

➜ publicsuffix-ruby git:(before) ruby test/profilers/initialization_profiler.rb
Total allocated: 5882690 bytes (52219 objects)
Total retained: 1375819 bytes (24188 objects)

➜ publicsuffix-ruby git:(before) ruby test/profilers/execution_profiler.rb
Total allocated: 15170 bytes (160 objects)
Total retained: 0 bytes (0 objects)

➜ publicsuffix-ruby git:(after) ✗ ruby test/profilers/initialization_profiler.rb
Total allocated: 6205130 bytes (60280 objects)
Total retained: 1052404 bytes (16127 objects)

➜ publicsuffix-ruby git:(after) ✗ ruby test/profilers/execution_profiler.rb
Total allocated: 15330 bytes (164 objects)
Total retained: 0 bytes (0 objects)

compared to master

➜ publicsuffix-ruby git:(master) ruby test/profilers/initialization_profiler.rb
Total allocated: 6525758 bytes (72086 objects)
Total retained: 1020387 bytes (19234 objects)

➜ publicsuffix-ruby git:(master) ruby test/profilers/execution_profiler.rb
Total allocated: 204162 bytes (4420 objects)
Total retained: 0 bytes (0 objects)

Execution time is unchanged.
➜ publicsuffix-ruby git:(before) ruby test/benchmarks/bm_find.rb
user system total real
NAME_SHORT 0.260000 0.000000 0.260000 ( 0.262684)
NAME_SHORT (noprivate) 0.370000 0.010000 0.380000 ( 0.372534)
NAME_MEDIUM 0.330000 0.000000 0.330000 ( 0.335683)
NAME_MEDIUM (noprivate) 0.490000 0.000000 0.490000 ( 0.494590)
NAME_LONG 0.510000 0.010000 0.520000 ( 0.519750)
NAME_LONG (noprivate) 0.590000 0.000000 0.590000 ( 0.594626)
NAME_WILD 0.480000 0.000000 0.480000 ( 0.490432)
NAME_WILD (noprivate) 0.580000 0.010000 0.590000 ( 0.594776)
NAME_EXCP 0.460000 0.000000 0.460000 ( 0.470119)
NAME_EXCP (noprivate) 0.590000 0.010000 0.600000 ( 0.601316)
IAAA 0.300000 0.000000 0.300000 ( 0.305301)
IAAA (noprivate) 0.400000 0.000000 0.400000 ( 0.410586)
IZZZ 0.280000 0.000000 0.280000 ( 0.283711)
IZZZ (noprivate) 0.400000 0.010000 0.410000 ( 0.408137)
PAAA 0.490000 0.000000 0.490000 ( 0.501869)
PAAA (noprivate) 0.600000 0.000000 0.600000 ( 0.612187)
PZZZ 0.510000 0.010000 0.520000 ( 0.519206)
PZZZ (noprivate) 0.590000 0.000000 0.590000 ( 0.600264)
JP 0.390000 0.000000 0.390000 ( 0.404432)
JP (noprivate) 0.540000 0.010000 0.550000 ( 0.558351)
IT 0.290000 0.000000 0.290000 ( 0.298931)
IT (noprivate) 0.410000 0.000000 0.410000 ( 0.420742)
COM 0.290000 0.010000 0.300000 ( 0.300935)
COM (noprivate) 0.400000 0.000000 0.400000 ( 0.409309)

➜ publicsuffix-ruby git:(after) ✗ ruby test/benchmarks/bm_find.rb
user system total real
NAME_SHORT 0.320000 0.000000 0.320000 ( 0.320201)
NAME_SHORT (noprivate) 0.430000 0.000000 0.430000 ( 0.443678)
NAME_MEDIUM 0.380000 0.000000 0.380000 ( 0.388169)
NAME_MEDIUM (noprivate) 0.490000 0.010000 0.500000 ( 0.491073)
NAME_LONG 0.480000 0.000000 0.480000 ( 0.483376)
NAME_LONG (noprivate) 0.620000 0.010000 0.630000 ( 0.634896)
NAME_WILD 0.570000 0.020000 0.590000 ( 0.628489)
NAME_WILD (noprivate) 0.700000 0.030000 0.730000 ( 0.769070)
NAME_EXCP 0.580000 0.020000 0.600000 ( 0.618683)
NAME_EXCP (noprivate) 0.740000 0.030000 0.770000 ( 0.799244)
IAAA 0.410000 0.030000 0.440000 ( 0.474761)
IAAA (noprivate) 0.550000 0.040000 0.590000 ( 0.645329)
IZZZ 0.380000 0.020000 0.400000 ( 0.432898)
IZZZ (noprivate) 0.520000 0.020000 0.540000 ( 0.579073)
PAAA 0.680000 0.040000 0.720000 ( 0.760276)
PAAA (noprivate) 0.720000 0.020000 0.740000 ( 0.773864)
PZZZ 0.700000 0.040000 0.740000 ( 0.782113)
PZZZ (noprivate) 0.650000 0.010000 0.660000 ( 0.664647)
JP 0.470000 0.000000 0.470000 ( 0.478473)
JP (noprivate) 0.580000 0.010000 0.590000 ( 0.589827)
IT 0.360000 0.000000 0.360000 ( 0.379309)
IT (noprivate) 0.450000 0.010000 0.460000 ( 0.471794)
COM 0.330000 0.010000 0.340000 ( 0.334253)
COM (noprivate) 0.530000 0.030000 0.560000 ( 0.592813)
Using the new benchmarks introduced in dec53e6, the allocation is clearly lower even during execution time.

➜ publicsuffix-ruby git:(master) ✗ ruby test/profilers/find_profiler.rb
Total allocated: 31472 bytes (691 objects)
Total retained: 0 bytes (0 objects)

➜ publicsuffix-ruby git:(master) ✗ ruby test/profilers/domain_profiler.rb
Total allocated: 37410 bytes (744 objects)
Total retained: 0 bytes (0 objects)

vs

➜ publicsuffix-ruby git:(thesis-hash) ruby test/profilers/find_profiler.rb
Total allocated: 1264 bytes (22 objects)
Total retained: 0 bytes (0 objects)

➜ publicsuffix-ruby git:(thesis-hash) ruby test/profilers/domain_profiler.rb
Total allocated: 7202 bytes (75 objects)
Total retained: 0 bytes (0 objects)
.new now takes all parameters, as you would create a completely new instance when you have the data. A new method called .build is used to create a new Rule from a rule content.
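The split between `.new` and `.build` can be pictured with a small sketch. The `SketchRule` class, its attributes, and the way `length` is derived are assumptions made for illustration, not the gem's actual code.

```ruby
# Hypothetical, simplified rule class; names and attributes are assumptions.
class SketchRule
  attr_reader :value, :length, :private

  # .new receives every attribute explicitly, for callers that already
  # have the data at hand.
  def initialize(value:, length:, private: false)
    @value   = value
    @length  = length
    @private = private
  end

  # .build derives the attributes from the raw rule content.
  def self.build(content, private: false)
    new(value: content, length: content.count(".") + 1, private: private)
  end
end

rule = SketchRule.build("co.uk")
rule.length # => 2
```

Keeping the parsing in `.build` lets `.new` stay a plain constructor, which is convenient when an instance is recreated from values that were already computed.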
Better distinguish between a Rule (public API) and an Entry (internal API).
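One way to picture the Rule and Entry distinction, together with the earlier note that the value is effectively the Hash key, is the sketch below. `SketchEntry`, `SketchPublicRule`, and `SketchList` are hypothetical names, and the stored fields are an assumption rather than the gem's actual layout.

```ruby
# Hypothetical names throughout; a sketch of the idea, not the gem's code.
# The internal entry omits the value, because the Hash key already holds it.
SketchEntry      = Struct.new(:type, :length, :private)
# The public object handed back to callers carries the value again.
SketchPublicRule = Struct.new(:type, :value, :length, :private)

class SketchList
  def initialize
    @entries = {}
  end

  # Store only what the key does not already tell us.
  def add(value, type: :normal, private: false)
    @entries[value] = SketchEntry.new(type, value.count(".") + 1, private)
    self
  end

  # Rebuild the public object on demand, reusing the Hash key as the value.
  def find(value)
    entry = @entries[value]
    entry && SketchPublicRule.new(entry.type, value, entry.length, entry.private)
  end
end

list = SketchList.new.add("co.uk")
list.find("co.uk").value # => "co.uk"
```

The trade-off is that a public object is allocated on each successful lookup in exchange for not storing the value twice in the list.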
It doesn't support keyword arguments with no default, and proper memory profiling.
I tested it on my app. The gem now loads 3 times faster 👏 cc @burke
Thanks for the feedback @casperisfine. I have some more research going on to use a modification of a Trie or a DAFSA to reduce the memory allocation. That said, I'm quite happy with the speed right now.
weppos added a commit that referenced this pull request on Aug 4, 2017
roback added a commit to twingly/twingly-url that referenced this pull request on Feb 9, 2018
Unfortunately it doesn't look like this fixes any of our issues, but since it made the profiling run a bit faster (and the fact that the tests didn't break) I made a PR of this anyway. (Profiling total run: 1.6663s -> 1.4801s). Some related links: * weppos/publicsuffix-ruby#130 * weppos/publicsuffix-ruby#133 * sporkmonger/addressable#267
This is a major refactoring of the internals of the List implementation (the way the list is stored) and of the find operation algorithm. The goal was to decrease the memory footprint and increase the speed of the lookup. This is part of a study and research I am conducting about data structures and algorithms. The various commits contain extra information about the various changes and optimizations.