- 
                Notifications
    You must be signed in to change notification settings 
- Fork 13.9k
speed up parsing for short ipv4s in std::net::Ipv4Addr::from_str #97118
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| Hey! It looks like you've submitted a new PR for the library teams! If this PR contains changes to any  Examples of  
 | 
| Thanks for the pull request, and welcome! The Rust team is excited to review your changes, and you should hear from @kennytm (or someone else) soon. Please see the contribution instructions for more information. | 
| i'd also like to suggest we move both of these length checks into  | 
| 
 @JohnCSimon why isn't it ready for review? | 
| From your comment it looked like you still had more to add to this PR - my mistake, then. | 
| makes sense, I was hoping to get feedback on the comment prior to moving the checks. I didn't see a reason for this check to be in the place that it is. | 
| r? libs | 
| Can you say more about "the happy case" here? It looks like too short IPs previously went through a longer attempt to parse, but my (maybe wrong) assumption is that such IPs are fairly rare in any data set of IPs. Are we expecting users to attempt parsing and then fallback on some alternative? 
 Presuming this is the same behavior-wise (probably the case, but would need to check call sites), that seems reasonable to me. | 
| 
 That's right, the happy case is where a too short IP is parsed and the parser exits early. I think IPs with 0 < length < 7 are probably more rare than IPs with length == 0. I was motivated to look at this because we were processing a very large dataset and the question came up whether to do a length check before attempting parsing. Since the parser was slow for the too short IPs, we added the length check. Our dataset had cases of length==0 and 0 < length < 7. If we kept the assumption that such IPs are fairly rare, could we say that the patch is still worthwhile? The criterion output shows a small but supposably measurable performance decline for parsing valid IPs. 
 I would expect this to happen sometimes. In my case, we were provided with a third-party string which may/may not be a valid IP and then writing it in a structured format to a storage archive. Somewhat related, I took a peek at how  
 I believe it is. This  | 
| OK, this PR seems fine then -- let's update it to adjust read_ipv4_addr then. | 
| Hey! It looks like you've submitted a new PR for the library teams! If this PR contains changes to any  Examples of  
 | 
      
        
              This comment has been minimized.
        
        
      
    
  This comment has been minimized.
| some tests appear to be failing, looking into it. Also, tested the case of an empty string and the stats are different than above (which tested "1.1.1" .. 0<length<7)  | 
| I think tests fail with  | 
| this could/should fail too (ipv6s can be shorter, I think), but it's one example of how we could be consistent in doing length checks across the  @rustbot author | 
| The job  Click to see the possible cause of the failure (guessed by this bot) | 
| ☔ The latest upstream changes (presumably #101143) made this pull request unmergeable. Please resolve the merge conflicts. | 
This aims to improve parsing speed for short and empty strings.
There is a proof of concept based on Criterion here: https://github.com/nkconnor/rust-ipv4-optimization/blob/main/benches/ipv4_fromstr_benchmark.rs
On my machine it offers about a 15X speed-up in the happy case (
len() < 7). The parser today takes a more similar duration for"1.1.1"as it does for"1.1.1.1".Optimizing for this case is especially helpful for situations where input is a non-null string type. For example, a CSV library may offer non-null string type for an IP Address column. IP Address is a very common element in "big data" and this optimization is likely to be appreciated in such contexts.
Criterion Output:
Main Parse OK time: [21.145 ns 21.181 ns 21.222 ns]
Found 4 outliers among 100 measurements (4.00%)
1 (1.00%) low mild
2 (2.00%) high mild
1 (1.00%) high severe
Main Parse <7 time: [15.268 ns 15.286 ns 15.304 ns]
Found 2 outliers among 100 measurements (2.00%)
2 (2.00%) high mild
Patch Parse OK time: [21.309 ns 21.506 ns 21.843 ns]
Found 13 outliers among 100 measurements (13.00%)
4 (4.00%) high mild
9 (9.00%) high severe
Patch Parse <7 time: [998.09 ps 999.02 ps 1.0001 ns]