stats: convert tag extractor regexs to Re2 #14519

Merged
merged 26 commits into from
Jan 14, 2021
Changes from 21 commits
Commits
1215e11
stats: convert tag extractor regexs to Re2
rojkov Dec 18, 2020
f242af4
extend tag values to include dash and simplify regex for stat prefixes
rojkov Dec 28, 2020
b6865ad
drop optional matches since stat name is not muated until all extract…
rojkov Dec 28, 2020
230baaf
fix clang_tidy check
rojkov Dec 28, 2020
baee204
add reference results for the benchmark
rojkov Dec 28, 2020
12d7cc4
fix check_spelling_pedantic.py
rojkov Dec 28, 2020
c5561f6
fix clang_tidy check once again
rojkov Dec 28, 2020
bdf4545
update regex for tag values
rojkov Dec 29, 2020
b224e44
make regexes expandable with configurable patterns for sections
rojkov Dec 29, 2020
f2fb755
Merge remote-tracking branch 'upstream/master' into re2
rojkov Dec 29, 2020
78aefc7
simplify regex expansion
rojkov Dec 30, 2020
75b8697
Merge remote-tracking branch 'upstream/master' into re2
rojkov Dec 30, 2020
4c5e2d6
revert tag_extractor test to old tag values
rojkov Dec 31, 2020
6f66047
removed unused TagNameValues::addRegex()
rojkov Dec 31, 2020
b04be99
restore square brackets in comment
rojkov Dec 31, 2020
95aa344
Merge remote-tracking branch 'upstream/master' into re2
rojkov Dec 31, 2020
a833dcc
add comments to regexes
rojkov Jan 5, 2021
cc4d575
Merge remote-tracking branch 'upstream/master' into re2
rojkov Jan 5, 2021
9288659
add a comment explaining HTTP_CONN_MANAGER_PREFIX regex
rojkov Jan 5, 2021
1ecbac0
log tag extractor usage
rojkov Jan 8, 2021
f7c9045
put perf debug output under ifdef
rojkov Jan 8, 2021
d4d9553
make output more greppable with TagStats
rojkov Jan 11, 2021
7e4d7f2
Make expandRegex()'s comment more descriptive
rojkov Jan 11, 2021
618a203
amend comments and unify usage of symbol classes in regexes
rojkov Jan 12, 2021
0ad2460
simplify macros
rojkov Jan 12, 2021
910cb04
put destructor under explicit ifdef
rojkov Jan 14, 2021
112 changes: 63 additions & 49 deletions source/common/config/well_known_names.cc
@@ -1,8 +1,30 @@
#include "common/config/well_known_names.h"

#include "absl/strings/str_replace.h"

namespace Envoy {
namespace Config {

namespace {

// Replaces regex placeholders with actual regexes.
std::string expandRegex(const std::string& regex) {
return absl::StrReplaceAll(
regex, {// Regex to look for either IPv4 or IPv6 addresses.
{"<ADDRESS>",
R"((?:\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}_\d+|\[[_aAbBcCdDeEfF[:digit:]]+\]_\d+))"},
// Cipher names can contain alphanumerics with dashes and
// underscores.
{"<CIPHER>", R"([0-9A-Za-z_-]+)"},
// A generic name can contain any character except dots.
{"<NAME>", R"([^\.]+)"},
// Route names may contain dots in addition to alphanumerics and
// dashes with underscores.
{"<ROUTE_CONFIG_NAME>", R"([\w-\.]+)"}});
}

} // namespace

TagNameValues::TagNameValues() {
// Note: the default regexes are defined below in the order that they will typically be matched
// (see the TagExtractor class definition for an explanation of the iterative matching process).
@@ -24,107 +46,99 @@ TagNameValues::TagNameValues() {
// - Typical * notation will be used to denote an arbitrary set of characters.

// *_rq(_<response_code>)
addRegex(RESPONSE_CODE, "_rq(_(\\d{3}))$", "_rq_");
addRe2(RESPONSE_CODE, R"(_rq(_(\d{3}))$)", "_rq_");

// *_rq_(<response_code_class>)xx
addRegex(RESPONSE_CODE_CLASS, "_rq_(\\d)xx$", "_rq_");
addRe2(RESPONSE_CODE_CLASS, R"(_rq_((\d))xx$)", "_rq_");

// http.[<stat_prefix>.]dynamodb.table.[<table_name>.]capacity.[<operation_name>.](__partition_id=<last_seven_characters_from_partition_id>)
addRegex(DYNAMO_PARTITION_ID,
"^http(?=\\.).*?\\.dynamodb\\.table(?=\\.).*?\\."
"capacity(?=\\.).*?(\\.__partition_id=(\\w{7}))$",
".dynamodb.table.");
addRe2(DYNAMO_PARTITION_ID,
R"(^http\.<NAME>\.dynamodb\.table\.<NAME>\.capacity\.<NAME>(\.__partition_id=(\w{7}))$)",
".dynamodb.table.");

// http.[<stat_prefix>.]dynamodb.operation.(<operation_name>.)<base_stat> or
// http.[<stat_prefix>.]dynamodb.table.[<table_name>.]capacity.(<operation_name>.)[<partition_id>]
addRegex(DYNAMO_OPERATION,
"^http(?=\\.).*?\\.dynamodb.(?:operation|table(?="
"\\.).*?\\.capacity)(\\.(.*?))(?:\\.|$)",
".dynamodb.");
addRe2(DYNAMO_OPERATION,
Contributor

I think maybe the two options listed above are a holdover from when the stat extraction was serial and this regex would have to run either before or after the previous one. I wonder if this could be simplified now that every extractor runs on the original stat-name, rather than being chained.

This is not related to the RE2 conversion and I'm fine to leave this as a TODO. But I'm bringing it up now because it might make it easier to inspect the RE pattern.

R"(^http\.<NAME>\.dynamodb.(?:operation|table\.<NAME>\.capacity)(\.(<NAME>))(?:\.|$))",
".dynamodb.");

// mongo.[<stat_prefix>.]collection.[<collection>.]callsite.(<callsite>.)query.<base_stat>
addRegex(MONGO_CALLSITE,
R"(^mongo(?=\.).*?\.collection(?=\.).*?\.callsite\.((.*?)\.).*?query.\w+?$)",
".collection.");
addRe2(MONGO_CALLSITE, R"(^mongo\.<NAME>\.collection\.<NAME>\.callsite\.((<NAME>)\.)query\.)",
".collection.");

// http.[<stat_prefix>.]dynamodb.table.(<table_name>.) or
// http.[<stat_prefix>.]dynamodb.error.(<table_name>.)*
addRegex(DYNAMO_TABLE, R"(^http(?=\.).*?\.dynamodb.(?:table|error)\.((.*?)\.))", ".dynamodb.");
addRe2(DYNAMO_TABLE, R"(^http\.<NAME>\.dynamodb.(?:table|error)\.((<NAME>)\.))", ".dynamodb.");

// mongo.[<stat_prefix>.]collection.(<collection>.)query.<base_stat>
addRegex(MONGO_COLLECTION, R"(^mongo(?=\.).*?\.collection\.((.*?)\.).*?query.\w+?$)",
".collection.");
addRe2(MONGO_COLLECTION, R"(^mongo\.<NAME>\.collection\.((<NAME>)\.).*?query\.)", ".collection.");

// mongo.[<stat_prefix>.]cmd.(<cmd>.)<base_stat>
addRegex(MONGO_CMD, R"(^mongo(?=\.).*?\.cmd\.((.*?)\.)\w+?$)", ".cmd.");
addRe2(MONGO_CMD, R"(^mongo\.<NAME>\.cmd\.((<NAME>)\.))", ".cmd.");

// cluster.[<route_target_cluster>.]grpc.[<grpc_service>.](<grpc_method>.)<base_stat>
addRegex(GRPC_BRIDGE_METHOD, R"(^cluster(?=\.).*?\.grpc(?=\.).*\.((.*?)\.)\w+?$)", ".grpc.");
// cluster.[<route_target_cluster>.]grpc.[<grpc_service>.](<grpc_method>.)*
addRe2(GRPC_BRIDGE_METHOD, R"(^cluster\.<NAME>\.grpc\.<NAME>\.((<NAME>)\.))", ".grpc.");

// http.[<stat_prefix>.]user_agent.(<user_agent>.)<base_stat>
addRegex(HTTP_USER_AGENT, R"(^http(?=\.).*?\.user_agent\.((.*?)\.)\w+?$)", ".user_agent.");
// http.[<stat_prefix>.]user_agent.(<user_agent>.)*
addRe2(HTTP_USER_AGENT, R"(^http\.<NAME>\.user_agent\.((<NAME>)\.))", ".user_agent.");

// vhost.[<virtual host name>.]vcluster.(<virtual_cluster_name>.)<base_stat>
addRegex(VIRTUAL_CLUSTER, R"(^vhost(?=\.).*?\.vcluster\.((.*?)\.)\w+?$)", ".vcluster.");
// vhost.[<virtual host name>.]vcluster.(<virtual_cluster_name>.)*
addRe2(VIRTUAL_CLUSTER, R"(^vhost\.<NAME>\.vcluster\.((<NAME>)\.))", ".vcluster.");

// http.[<stat_prefix>.]fault.(<downstream_cluster>.)<base_stat>
addRegex(FAULT_DOWNSTREAM_CLUSTER, R"(^http(?=\.).*?\.fault\.((.*?)\.)\w+?$)", ".fault.");
// http.[<stat_prefix>.]fault.(<downstream_cluster>.)*
addRe2(FAULT_DOWNSTREAM_CLUSTER, R"(^http\.<NAME>\.fault\.((<NAME>)\.))", ".fault.");

// listener.[<address>.]ssl.cipher.(<cipher>)
addRegex(SSL_CIPHER, R"(^listener(?=\.).*?\.ssl\.cipher(\.(.*?))$)");
addRe2(SSL_CIPHER, R"(^listener\..*?\.ssl\.cipher(\.(<CIPHER>))$)");

// cluster.[<cluster_name>.]ssl.ciphers.(<cipher>)
addRegex(SSL_CIPHER_SUITE, R"(^cluster(?=\.).*?\.ssl\.ciphers(\.(.*?))$)", ".ssl.ciphers.");
addRe2(SSL_CIPHER_SUITE, R"(^cluster\.<NAME>\.ssl\.ciphers(\.(<CIPHER>))$)", ".ssl.ciphers.");

// cluster.[<route_target_cluster>.]grpc.(<grpc_service>.)*
addRegex(GRPC_BRIDGE_SERVICE, R"(^cluster(?=\.).*?\.grpc\.((.*?)\.))", ".grpc.");
addRe2(GRPC_BRIDGE_SERVICE, R"(^cluster\.<NAME>\.grpc\.((<NAME>)\.))", ".grpc.");

// tcp.(<stat_prefix>.)<base_stat>
addRegex(TCP_PREFIX, R"(^tcp\.((.*?)\.)\w+?$)");
addRe2(TCP_PREFIX, R"(^tcp\.((<NAME>)\.))");

// udp.(<stat_prefix>.)<base_stat>
addRegex(UDP_PREFIX, R"(^udp\.((.*?)\.)\w+?$)");
addRe2(UDP_PREFIX, R"(^udp\.((<NAME>)\.))");

// auth.clientssl.(<stat_prefix>.)<base_stat>
addRegex(CLIENTSSL_PREFIX, R"(^auth\.clientssl\.((.*?)\.)\w+?$)");
// auth.clientssl.(<stat_prefix>.)*
addRe2(CLIENTSSL_PREFIX, R"(^auth\.clientssl\.((<NAME>)\.))");

// ratelimit.(<stat_prefix>.)<base_stat>
addRegex(RATELIMIT_PREFIX, R"(^ratelimit\.((.*?)\.)\w+?$)");
// ratelimit.(<stat_prefix>.)*
addRe2(RATELIMIT_PREFIX, R"(^ratelimit\.((<NAME>)\.))");

// cluster.(<cluster_name>.)*
addRe2(CLUSTER_NAME, "^cluster\\.(([^\\.]+)\\.).*");
addRe2(CLUSTER_NAME, R"(^cluster\.((<NAME>)\.))");

// listener.[<address>.]http.(<stat_prefix>.)*
addRegex(HTTP_CONN_MANAGER_PREFIX, R"(^listener(?=\.).*?\.http\.((.*?)\.))", ".http.");
// The <address> part can be anything here (.*?) for the sake of a simpler
// internal state of the regex which performs better.
addRe2(HTTP_CONN_MANAGER_PREFIX, R"(^listener\..*?\.http\.((<NAME>)\.))", ".http.");

// http.(<stat_prefix>.)*
addRegex(HTTP_CONN_MANAGER_PREFIX, "^http\\.((.*?)\\.)");
addRe2(HTTP_CONN_MANAGER_PREFIX, R"(^http\.((<NAME>)\.))");

// listener.(<address>.)*
addRegex(LISTENER_ADDRESS,
R"(^listener\.(((?:[_.[:digit:]]*|[_\[\]aAbBcCdDeEfF[:digit:]]*))\.))");
addRe2(LISTENER_ADDRESS, R"(^listener\.((<ADDRESS>)\.))");

// vhost.(<virtual host name>.)*
addRegex(VIRTUAL_HOST, "^vhost\\.((.*?)\\.)");
addRe2(VIRTUAL_HOST, R"(^vhost\.((<NAME>)\.))");

// mongo.(<stat_prefix>.)*
addRegex(MONGO_PREFIX, "^mongo\\.((.*?)\\.)");
addRe2(MONGO_PREFIX, R"(^mongo\.((<NAME>)\.))");

// http.[<stat_prefix>.]rds.(<route_config_name>.)<base_stat>
addRegex(RDS_ROUTE_CONFIG, R"(^http(?=\.).*?\.rds\.((.*?)\.)\w+?$)", ".rds.");
addRe2(RDS_ROUTE_CONFIG, R"(^http\.<NAME>\.rds\.((<ROUTE_CONFIG_NAME>)\.)\w+?$)", ".rds.");
Member

Out of curiosity, why not drop the base_stat at the end in this case?

Member Author

<ROUTE_CONFIG_NAME> can contain dots, and there is no other anchor after it except EOL. Unfortunately, this makes the regex slower.
Added a comment to explain it to my future self.
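
The difference between a dot-free value and a dotted one can be sketched as follows. This is only an illustration: it uses `std::regex` as a stand-in for RE2, the helper `extractValue()` and the two pattern constants are hypothetical names, and the placeholders are written out by hand rather than expanded by the PR's `expandRegex()`.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: returns the second capture group of the first match,
// or an empty string when the pattern does not match.
std::string extractValue(const std::string& stat_name, const std::string& pattern) {
  const std::regex re(pattern);
  std::smatch match;
  if (!std::regex_search(stat_name, match, re)) {
    return "";
  }
  return match[2].str();
}

// <NAME>-style value ([^.]+): the value cannot contain dots, so the dot that
// follows it is enough to delimit the tag value.
const char kNameStyle[] = R"(^http\.[^.]+\.rds\.(([^.]+)\.))";

// <ROUTE_CONFIG_NAME>-style value: dots are allowed inside the value, and the
// trailing base stat (\w+?$) is kept as the end-of-line anchor.
const char kRouteStyle[] = R"(^http\.[^.]+\.rds\.(([\w.-]+)\.)\w+?$)";
```

With a stat name such as `http.pfx.rds.route.config.a.update_success` (an invented example), the `<NAME>`-style pattern stops at `route`, while the anchored pattern yields `route.config.a`.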


// listener_manager.(worker_<id>.)*
addRegex(WORKER_ID, R"(^listener_manager\.((worker_\d+)\.))", "listener_manager.worker_");
}

void TagNameValues::addRegex(const std::string& name, const std::string& regex,
const std::string& substr) {
descriptor_vec_.emplace_back(Descriptor{name, regex, substr, Regex::Type::StdRegex});
addRe2(WORKER_ID, R"(^listener_manager\.((worker_\d+)\.))", "listener_manager.worker_");
}

void TagNameValues::addRe2(const std::string& name, const std::string& regex,
const std::string& substr) {
descriptor_vec_.emplace_back(Descriptor{name, regex, substr, Regex::Type::Re2});
descriptor_vec_.emplace_back(Descriptor{name, expandRegex(regex), substr, Regex::Type::Re2});
}

} // namespace Config
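
A minimal sketch of the placeholder expansion that `addRe2()` now performs, assuming only the standard library is available. The PR itself uses `absl::StrReplaceAll`; the function name `expandPlaceholders()` and the hand-rolled find/replace loop are substitutes for illustration, and only two of the four placeholders are included.

```cpp
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the PR's expandRegex(): replaces each placeholder
// with its pattern. Unlike absl::StrReplaceAll this runs one placeholder at a
// time; the result is the same here because no replacement pattern contains
// another placeholder.
std::string expandPlaceholders(std::string regex) {
  static const std::vector<std::pair<std::string, std::string>> replacements = {
      // A generic name can contain any character except dots.
      {"<NAME>", R"([^\.]+)"},
      // Cipher names: alphanumerics, dashes and underscores.
      {"<CIPHER>", R"([0-9A-Za-z_-]+)"},
  };
  for (const auto& replacement : replacements) {
    const std::string& placeholder = replacement.first;
    const std::string& pattern = replacement.second;
    std::string::size_type pos = 0;
    while ((pos = regex.find(placeholder, pos)) != std::string::npos) {
      regex.replace(pos, placeholder.size(), pattern);
      pos += pattern.size(); // Skip past the inserted text.
    }
  }
  return regex;
}
```

For example, `^tcp\.((<NAME>)\.)` expands to `^tcp\.(([^\.]+)\.)`, which is the form stored in the descriptor.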
1 change: 0 additions & 1 deletion source/common/config/well_known_names.h
@@ -129,7 +129,6 @@ class TagNameValues {
const std::vector<Descriptor>& descriptorVec() const { return descriptor_vec_; }

private:
void addRegex(const std::string& name, const std::string& regex, const std::string& substr = "");
void addRe2(const std::string& name, const std::string& regex, const std::string& substr = "");

// Collection of tag descriptors.
14 changes: 11 additions & 3 deletions source/common/stats/tag_extractor_impl.cc
@@ -26,7 +26,9 @@ bool regexStartsWithDot(absl::string_view regex) {

TagExtractorImplBase::TagExtractorImplBase(absl::string_view name, absl::string_view regex,
absl::string_view substr)
: name_(name), prefix_(std::string(extractRegexPrefix(regex))), substr_(substr) {}
: name_(name), prefix_(std::string(extractRegexPrefix(regex))), substr_(substr) {
PERF_TAG_COUNTERS_INIT(counters_);
}

std::string TagExtractorImplBase::extractRegexPrefix(absl::string_view regex) {
std::string prefix;
@@ -90,6 +92,7 @@ bool TagExtractorStdRegexImpl::extractTag(absl::string_view stat_name, std::vect

if (substrMismatch(stat_name)) {
PERF_RECORD(perf, "re-skip", name_);
PERF_TAG_SKIPPED_INC(counters_);
return false;
}

@@ -113,9 +116,11 @@ bool TagExtractorStdRegexImpl::extractTag(absl::string_view stat_name, std::vect
std::string::size_type end = remove_subexpr.second - stat_name.begin();
remove_characters.insert(start, end);
PERF_RECORD(perf, "re-match", name_);
PERF_TAG_MATCHED_INC(counters_);
return true;
}
PERF_RECORD(perf, "re-miss", name_);
PERF_TAG_MISSED_INC(counters_);
return false;
}

@@ -129,15 +134,16 @@ bool TagExtractorRe2Impl::extractTag(absl::string_view stat_name, std::vector<Ta

if (substrMismatch(stat_name)) {
PERF_RECORD(perf, "re2-skip", name_);
PERF_TAG_SKIPPED_INC(counters_);
return false;
}

// remove_subexpr is the first submatch. It represents the portion of the string to be removed.
re2::StringPiece remove_subexpr, value_subexpr;

// The regex must match and contain one or more subexpressions (all after the first are ignored).
if (re2::RE2::FullMatch(re2::StringPiece(stat_name.data(), stat_name.size()), regex_,
&remove_subexpr, &value_subexpr) &&
if (re2::RE2::PartialMatch(re2::StringPiece(stat_name.data(), stat_name.size()), regex_,
&remove_subexpr, &value_subexpr) &&
!remove_subexpr.empty()) {

// value_subexpr is the optional second submatch. It is usually inside the first submatch
@@ -155,9 +161,11 @@ bool TagExtractorRe2Impl::extractTag(absl::string_view stat_name, std::vector<Ta
remove_characters.insert(start, end);

PERF_RECORD(perf, "re2-match", name_);
PERF_TAG_MATCHED_INC(counters_);
return true;
}
PERF_RECORD(perf, "re2-miss", name_);
PERF_TAG_MISSED_INC(counters_);
return false;
}
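
The switch from `FullMatch` to `PartialMatch` above is what lets the new patterns drop their trailing `.*` or `\w+?$` parts: a prefix-only pattern no longer has to cover the whole stat name. A rough sketch of the difference, using `std::regex_search` and `std::regex_match` as stand-ins for the two RE2 calls (an assumption for illustration; the `Extraction` struct and both helpers are hypothetical):

```cpp
#include <regex>
#include <string>

struct Extraction {
  bool matched;
  std::string removed; // First capture group: the portion to remove.
  std::string value;   // Second capture group: the tag value.
};

// Unanchored search, comparable in spirit to RE2::PartialMatch.
Extraction extractPartial(const std::string& stat_name, const std::regex& re) {
  std::smatch m;
  if (!std::regex_search(stat_name, m, re)) {
    return {false, "", ""};
  }
  return {true, m[1].str(), m[2].str()};
}

// Whole-string match, comparable in spirit to RE2::FullMatch.
Extraction extractFull(const std::string& stat_name, const std::regex& re) {
  std::smatch m;
  if (!std::regex_match(stat_name, m, re)) {
    return {false, "", ""};
  }
  return {true, m[1].str(), m[2].str()};
}
```

With the prefix-only pattern `^tcp\.(([^.]+)\.)` and an invented stat name `tcp.ingress.downstream_cx_total`, the partial variant extracts `ingress`, while the full-match variant rejects the name because the base stat is left unmatched.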

39 changes: 39 additions & 0 deletions source/common/stats/tag_extractor_impl.h
@@ -4,6 +4,10 @@
#include <regex>
#include <string>

#ifdef ENVOY_PERF_ANNOTATION
#include <fmt/core.h>
#endif

#include "envoy/stats/tag_extractor.h"

#include "common/common/regex.h"
@@ -14,6 +18,39 @@
namespace Envoy {
namespace Stats {

// To check if a tag extractor is actually used you can run
// bazel test //test/... --test_output=streamed --define=perf_annotation=enabled
#ifdef ENVOY_PERF_ANNOTATION

struct Counters {
uint32_t skipped_{};
uint32_t matched_{};
uint32_t missed_{};
};

#define PERF_TAG_COUNTERS(var) \
~TagExtractorImplBase() override { \
Member

Nit: putting the whole destructor here is a bit magical, I think I would prefer it under an explicit #ifdef

Member Author

done

std::cout << fmt::format("Stats for {} tag extractor: skipped {}, matched {}, missing {}", \
name_, var->skipped_, var->matched_, var->missed_) \
<< std::endl; \
} \
std::unique_ptr<Counters> var

#define PERF_TAG_COUNTERS_INIT(var) var = std::make_unique<Counters>()
Member

Nit: wrap the expansion in parens or do/while for safety.

Contributor

Out of curiosity, what's the scenario where this would parse unexpectedly?

Usually I'd use do/while in a macro if there were multiple statements in the expansion.
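
For context, the multi-statement case mentioned here can be sketched like this. The macros and functions below are invented for illustration and are not part of the PR; the single-statement macros under review do not have this problem.

```cpp
// Two statements in one expansion: only the first is governed by an
// unbraced `if` at the call site.
#define INC_BOTH_UNSAFE(a, b) ++(a); ++(b)

// Wrapping the expansion in do/while(0) turns it into a single statement.
#define INC_BOTH_SAFE(a, b) do { ++(a); ++(b); } while (0)

// Returns the second counter after a conditional increment of both.
int secondCounterUnsafe(bool cond) {
  int x = 0;
  int y = 0;
  if (cond)
    INC_BOTH_UNSAFE(x, y); // Expands to `++(x); ++(y);`, so ++(y) always runs.
  return y;
}

int secondCounterSafe(bool cond) {
  int x = 0;
  int y = 0;
  if (cond)
    INC_BOTH_SAFE(x, y); // Both increments are skipped when cond is false.
  return y;
}
```

With `cond == false`, the unsafe variant still increments the second counter, while the safe variant leaves it at zero.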

Member Author

What if I make the macros non-parameterized as they are used only once?

Contributor

Yes that makes sense. You could also have

#define PERF_TAG_INC(member) ++counters->member

and then you only need one macro rather than one each for skipped/missed/matched, but I am not too concerned about this; whatever you prefer.

Member Author

Thanks! This simplifies the code a bit. Updated.
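
A minimal sketch of the single-macro approach suggested in this thread, under stated assumptions: `SKETCH_PERF_ANNOTATION` is a hypothetical switch standing in for `ENVOY_PERF_ANNOTATION`, `SketchExtractor` is an invented class, and the member is named `counters_` as in the diff. The code that was actually merged may differ.

```cpp
#include <cstdint>
#include <memory>

struct Counters {
  uint32_t skipped_{};
  uint32_t matched_{};
  uint32_t missed_{};
};

// Hypothetical switch; defined here so the counters are compiled in.
#define SKETCH_PERF_ANNOTATION

#ifdef SKETCH_PERF_ANNOTATION
// One macro covers all three counters; the member name is the argument.
#define PERF_TAG_INC(member) ++(counters_->member)
#else
#define PERF_TAG_INC(member)
#endif

// Invented class that mimics the skip/match/miss flow of extractTag().
class SketchExtractor {
public:
  SketchExtractor() : counters_(std::make_unique<Counters>()) {}

  bool extract(bool substr_mismatch, bool regex_matches) {
    if (substr_mismatch) {
      PERF_TAG_INC(skipped_);
      return false;
    }
    if (regex_matches) {
      PERF_TAG_INC(matched_);
      return true;
    }
    PERF_TAG_INC(missed_);
    return false;
  }

  const Counters& counters() const { return *counters_; }

private:
  std::unique_ptr<Counters> counters_;
};
```

When the switch is not defined, each `PERF_TAG_INC(...)` call expands to nothing and leaves only an empty statement, so the call sites need no `#ifdef` of their own.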

#define PERF_TAG_SKIPPED_INC(var) var->skipped_++
#define PERF_TAG_MISSED_INC(var) var->missed_++
#define PERF_TAG_MATCHED_INC(var) var->matched_++

#else

#define PERF_TAG_COUNTERS(var)
#define PERF_TAG_COUNTERS_INIT(var)
#define PERF_TAG_SKIPPED_INC(var)
#define PERF_TAG_MISSED_INC(var)
#define PERF_TAG_MATCHED_INC(var)

#endif

class TagExtractorImplBase : public TagExtractor {
public:
/**
@@ -62,6 +99,8 @@ class TagExtractorImplBase : public TagExtractor {
const std::string name_;
const std::string prefix_;
const std::string substr_;

PERF_TAG_COUNTERS(counters_);
};

class TagExtractorStdRegexImpl : public TagExtractorImplBase {
19 changes: 19 additions & 0 deletions test/common/stats/BUILD
@@ -231,6 +231,25 @@ envoy_cc_test(
],
)

envoy_cc_benchmark_binary(
name = "tag_extractor_impl_benchmark",
srcs = [
"tag_extractor_impl_speed_test.cc",
],
external_deps = [
"benchmark",
],
deps = [
"//source/common/stats:tag_producer_lib",
"@envoy_api//envoy/config/metrics/v3:pkg_cc_proto",
],
)

envoy_benchmark_test(
name = "tag_extractor_impl_benchmark_test",
benchmark_binary = "tag_extractor_impl_benchmark",
)

envoy_cc_test(
name = "thread_local_store_test",
srcs = ["thread_local_store_test.cc"],