Case Study 2016 05 01

Andrew Gallant edited this page May 1, 2016 · 1 revision

This document is an analysis of users of the regex crate (as reported by its reverse dependencies on crates.io) as of May 1, 2016. The main goal of writing this document is to collect data points on how the API is used.

Methods

I installed cargo-crusader and modified it to only download reverse dependencies. It downloaded the sources of 199 crates. The list of crates and their versions is in [Case-Study-2015-05-01-Crate-List].

Most analysis done below is with simple invocations of find and grep, followed by manual verification of results. As a result, some things may be missed or counted incorrectly.

The regex! macro

Thankfully, it seems like most folks have gotten the word that regex! is slow. Only 14 crates are using it:

$ find ./ -name Cargo.toml | xargs grep -l regex_macros | wc -l
14

Misuse of captures

captures can be significantly more expensive than find. Therefore, one should only use captures if one intends to extract the locations of capturing groups. find is sufficient when one only needs the location of the entire match.

The documentation for captures notes this. It's not clear what else we might do to mitigate misuse.

To find misuse, I ran the following and inspected the output by hand:

$ find ./ -name '*.rs' | sort | xargs grep -C 5 --color=always '\.captures('

The total number of instances of captures calls is 231.

Misuse of captures can be noticed easily if the return value of captures is only used to access information about the entire match.

Crates with misuse: aerial, avro, nix-netconfig, substudy, twig

The small number of instances of misuse is great news. The vast majority of uses of captures are followed immediately by a call to at(n)/pos(n)/name("...") where n is non-zero.
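The manual inspection described above could be narrowed with a rough file-level filter. The following is only a sketch: the function name captures_candidates is hypothetical, and it assumes that misuse shows up as at(0) or pos(0) with no access to a non-zero or named group anywhere in the same file. It will miss cases and report false positives, so every hit still needs to be checked by hand.

```shell
# Hypothetical helper: list .rs files under $1 that call .captures( but only
# ever ask for group 0 (the entire match). Hits are candidates, not verdicts.
captures_candidates() {
    find "$1" -name '*.rs' | sort | while IFS= read -r f; do
        # Skip files with no captures call at all.
        grep -q '\.captures(' "$f" || continue
        # Skip files that access a non-zero or named group somewhere.
        if grep -qE '\.(at|pos)\([1-9]|\.name\(' "$f"; then
            continue
        fi
        # Report files that do ask for group 0.
        if grep -qE '\.(at|pos)\(0\)' "$f"; then
            printf '%s\n' "$f"
        fi
    done
}
```

Running captures_candidates ./ from the directory holding the downloaded sources would print one candidate file per line.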

caps.at(n).unwrap() vs. caps[n]

Many uses of a Captures value call caps.at(n).unwrap(), where the unwrap is justified because there is contextual knowledge that the regex matched and that capturing group n must have been part of the match. Using caps.at(n).unwrap() in this context is equivalent to caps[n], modulo a lifetime constraint where the latter is more limited than the former. Nevertheless, it feels like most uses of at(n).unwrap() could be replaced with indexing.

It's not clear what to do about this other than make the availability of indexing more prominent in the documentation. However, this also increases the probability of users running into non-intuitive lifetime problems. Namely, the string returned by the indexing operation is bound to the lifetime of the Captures value instead of to the original haystack, like it is when using at(n).

is_match vs find

It is also possible to misuse find when is_match would work. This is harder to grep for in the sources since find is a very common function name. It is promising that there are many uses of is_match, which suggests folks are aware of it and use it.

The performance penalty for using find where is_match would do is generally much smaller than the penalty for using captures where find would do.
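One way to cut down the noise from other functions named find is to look only for calls whose result is immediately reduced to a boolean. The sketch below assumes the misuse is written as .find(...).is_some() or .is_none() on a single line; the function name find_as_bool is hypothetical. It cannot tell a Regex apart from an iterator or a string, so the output is a list of places to look at, not a count of misuse.

```shell
# Hypothetical helper: print file:line:text for .find(...) calls whose result
# is only tested with is_some()/is_none(), which is_match answers directly.
find_as_bool() {
    # /dev/null guarantees grep always prints the file name prefix.
    # "|| true" keeps the exit status at 0 when nothing matches.
    find "$1" -name '*.rs' | sort |
        xargs grep -nE '\.find\([^()]*\)\.is_(some|none)\(\)' /dev/null || true
}
```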

Compilation in a loop

This is a really hard one to detect, but can potentially be a performance killer. The docs have been updated recently to emphasize use of lazy_static to amortize the cost of compilation.

Skimming through uses of Regex::new, it's not clear how widespread the problem is. In particular, judging each call requires more general information about the control flow of the program and how often the call site is executed.
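A grep cannot answer the control flow question, but it can at least shrink the set of files worth reading. The sketch below lists files that call Regex::new and never mention lazy_static!. This is an assumption-laden triage step (the function name is hypothetical): a file on the list may compile its regex exactly once, and a file off the list may still compile another regex in a loop.

```shell
# Hypothetical helper: list .rs files under $1 that call Regex::new but never
# use lazy_static!. These are files to review, not confirmed problems.
regex_new_without_lazy_static() {
    find "$1" -name '*.rs' | sort | while IFS= read -r f; do
        if grep -q 'Regex::new' "$f" && ! grep -q 'lazy_static!' "$f"; then
            printf '%s\n' "$f"
        fi
    done
}
```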

Summary

I think this case study wasn't so illuminating. Some usage patterns are too hard for humans to analyze in bulk, and developing the tools to do it seems tricky.

I spent a lot of time looking at the patterns that make up calls to Regex::new, but mostly with an eye toward coming up with a better set of benchmarks. I hope to write more on that in the future.