Case Study 2016 05 01
This document is an analysis of users of the regex crate (as reported by its reverse dependencies on crates.io) as of May 1, 2016. The main goal of writing this document is to collect data points on how the API is used.
I installed cargo-crusader and modified it to only download reverse dependencies. It downloaded the sources of 199 crates. The list of crates and their versions is in [Case-Study-2015-05-01-Crate-List].
Most of the analysis below was done with simple invocations of `find` and `grep`, followed by manual verification of results. As a result, some things may be missed or counted incorrectly.
Thankfully, it seems like most folks have gotten the word that `regex!` is slow. Only 14 crates are using it:
$ find ./ -name Cargo.toml | xargs grep -l regex_macros | wc -l
14
`captures` can be significantly more expensive than `find`. Therefore, one should only use `captures` if one intends to extract the location of capturing groups. `find` can be used when one only needs the location of the entire match.
The documentation for `captures` notes this. It's not clear what else we might do to mitigate misuse.
To find misuse, I ran the following and inspected the output by hand:
$ find ./ -name '*.rs' | sort | xargs grep -C 5 --color=always '\.captures('
The total number of `captures` calls is 231.
Misuse of `captures` can be noticed easily if the return value of `captures` is only used to access information about the entire match.
Crates with misuse: aerial, avro, nix-netconfig, substudy, twig
The small number of instances of misuse is great news. The vast majority of uses of `captures` are followed immediately by a call to `at(n)`/`pos(n)`/`name("...")` where `n` is non-zero.
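The manual inspection described above could be narrowed with a second filter. The following is a minimal sketch, not the command used for this study: the helper name `captures_group0_candidates` is hypothetical, and the 5-line window is an assumption borrowed from the `-C 5` context above. It prints lines within 5 lines after a `.captures(` call that access group 0 through `.at(0)` or `.pos(0)`, which is only a hint of possible misuse and still needs manual verification.

```shell
# Hypothetical helper (not part of the original study): print lines that
# follow a `.captures(` call within 5 lines and access group 0.
# $1 = directory to search (defaults to the current directory).
captures_group0_candidates() {
    # /dev/null forces grep to prefix every line with its file name,
    # even when only one .rs file is found.
    find "${1:-.}" -name '*.rs' | sort |
        xargs grep -A 5 '\.captures(' /dev/null |
        grep -E '\.(at|pos)\(0\)'
}
```

Run from the directory holding the downloaded crates, `captures_group0_candidates ./` lists candidate lines prefixed by file name. Like the pipelines above, it assumes file names without whitespace.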
Many uses of a `Captures` value call `caps.at(n).unwrap()`, where the `unwrap` is justified because there is contextual knowledge that the regex matched and that capturing group `n` must have been part of the match. Using `caps.at(n).unwrap()` in this context is equivalent to `caps[n]`, modulo a lifetime constraint where the latter is more limited than the former. Nevertheless, it feels like most uses of `at(n).unwrap()` could be replaced with indexing.
It's not clear what to do about this other than make the availability of indexing more prominent in the documentation. However, this also increases the probability of users running into non-intuitive lifetime problems. Namely, the string returned by the indexing operation is bound to the lifetime of the `Captures` value instead of to the original haystack, like it is when using `at(n)`.
It is also possible to misuse `find` when `is_match` would work. This is harder to `grep` for in the sources since `find` is a very common function name. It is promising that there are many uses of `is_match`, which suggests folks are aware of it and use it.
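To put rough numbers on this, textual call sites can be counted with the same kind of pipeline. This is a sketch under assumptions: the helper name `count_method_calls` is hypothetical, and the counts are purely textual, so a count for `find` also includes unrelated methods such as iterator `find` calls, which is exactly why it is hard to search for.

```shell
# Hypothetical helper (not part of the original study): count textual
# occurrences of `.<method>(` across all .rs files under a directory.
# $1 = directory to search, $2 = method name (e.g. is_match).
count_method_calls() {
    # grep -o prints one line per match, so wc -l counts call sites
    # rather than matching lines.
    find "$1" -name '*.rs' | sort |
        xargs grep -o "\.$2(" /dev/null |
        wc -l
}
```

For example, `count_method_calls ./ is_match` gives a usable estimate, while `count_method_calls ./ find` overcounts for the reason given above.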
The performance penalty between `is_match` and `find` is generally much smaller than the penalty between `find` and `captures`.
Recompiling the same regex repeatedly (for example, calling `Regex::new` inside a loop) is really hard to detect, but can potentially be a performance killer. The docs have been updated recently to emphasize use of `lazy_static` to amortize the cost of compilation.
Skimming through uses of `Regex::new`, it's not clear how widespread the problem is. In particular, judging each call requires more general information about the control flow of the program and how often the call is executed.
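A coarse first pass is still possible with `grep`. The sketch below, whose helper name `regex_new_without_lazy_static` is hypothetical, lists source files that call `Regex::new` but never mention `lazy_static`. It cannot see control flow, so a listed file may compile its regex only once, and a file that does mention `lazy_static` may still recompile another regex in a loop. It only shortens the list of files to read by hand.

```shell
# Hypothetical helper (not part of the original study): print .rs files
# that call Regex::new but never mention lazy_static.
# $1 = directory to search (defaults to the current directory).
regex_new_without_lazy_static() {
    find "${1:-.}" -name '*.rs' | sort |
        xargs grep -l 'Regex::new' /dev/null |
        while IFS= read -r file; do
            # Keep the file only if lazy_static appears nowhere in it.
            grep -q 'lazy_static' "$file" || printf '%s\n' "$file"
        done
}
```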
I think this case study wasn't so illuminating. Some usage patterns are too hard for humans to analyze in bulk, and developing the tools to do it seems tricky.
I spent a lot of time looking at the patterns that make up calls to `Regex::new`, but mostly with an eye toward coming up with a better set of benchmarks. I hope to write more on that in the future.