-
Notifications
You must be signed in to change notification settings - Fork 846
Hostdb restructuring #7874
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Hostdb restructuring #7874
Conversation
427cccb to
22e0bda
Compare
| } | ||
|
|
||
| inline bool | ||
| ResolveInfo::mark_active_server_alive() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
from the comments:
/** Mark the active target as alive.
*
* @return @c true if the target changed state.
*/
bool mark_active_server_alive();
how this function could return true? should the mark_up() play a role in here?
08c3e53 to
ee34782
Compare
f7a9428 to
7221fd1
Compare
|
@rob05c says he is going to take a look at it |
5e56c8a to
330a203
Compare
|
There is an issue with this PR and the Carp plugin. |
| ========= | ||
|
|
||
| The primary operation for HostDB is to resolve a fully qualified domain name ("FQDN"). As noted each | ||
| FQDN is associated with a single record. Each record has an array of items. When a resolution |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure I understand "FQDN is associated with a single record." FQDNs are associated with a record of each type, e.g. an FQDN can have an A and an AAAA record, right? Or does "record" here refer to a code structure, not a DNS Record?
| ``ResolveInfo`` may contain a reference to a HostDB record, which preserves the record even if it is | ||
| replaced due to DNS queries in other transactions. The record is not required as the resolution | ||
| information can be supplied directly without DNS or HostDB, e.g. a plugin sets the upstream address | ||
| explicitly. The ``resolved_p`` flag indicates if the current information is valid and ready to be |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't fully understand what it means for resolved_p to be false, i.e. what it means for it to be "ready to be used," or why a plugin or core would set that. Docs explaining that might be helpful
| provides a noticeable simplification of the code. | ||
|
|
||
| * Single and multiple address results are treated identically - a singleton is simply a multiple | ||
| of size 1. This yeilds a major simplification of the implementation. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Typo "yields"
| * | ||
| * @param v The variable to update. | ||
| */ | ||
| ConfigDuration(V &v) : _var(&v) {} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
explicit?
| if (query_name.empty()) { | ||
| if (is_byname()) { | ||
| Debug("hostdb", "lookup_done() failed for '%.*s'", hash.host_len, hash.host_name); | ||
| Debug("hostdb", "lookup_done() failed for '%.*s'", int(hash.host_name.size()), hash.host_name.data()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Constructor style-casts are not Leif-legal.
| } else { | ||
| break; | ||
| } | ||
| }; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
stray ;
| HostDBInfo *live[rr.count()]; | ||
| for (auto &target : rr) { | ||
| // skip dead upstreams. | ||
| if (rr[i].is_dead(now, fail_window)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this supposed to be target.is_dead ?
rob05c
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I have some questions about things I don't understand, but I generally agree with this.
I went over all the code. I don't fully understand it, hostdb is very complex and it would take considerably longer than I have right now to get a complete understanding. But, the structure looks fine to me, and I didn't see any obvious bugs.
I also ran it, as well as added debugs to see what functions are called and when during requests. It all seems to work without issue, on successes and NXDOMAINs.
| inline HostDBInfo & | ||
| HostDBInfo::operator=(HostDBInfo const &that) | ||
| { | ||
| if (this != &that) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it normal to assign a HostDBInfo to itself?
| return true; // it's alive and so is valid for selection. | ||
| } | ||
| // Success means this is a zombie and this thread updated the failure time. | ||
| return (t0 + fail_window < now) && last_failure.compare_exchange_strong(t0, now); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So this CAS will fail if another thread updated underneath us? What should the caller do if it gets a failure?
| } else if (t_state.hdr_info.client_request.version_get() == HTTPVersion(1, 0)) { | ||
| t_state.dns_info.http_version = HTTP_1_0; | ||
| } else { | ||
| t_state.dns_info.http_version = HTTP_1_1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can this ever be H2? I don't really understand this, we're setting the http_version of the DNS record to the version the client used? Shouldn't it be the server?
| SMDebug("dns", "[HttpTransact::HandleRequest] Skipping DNS lookup, provided by plugin"); | ||
| } else if (t_state.dns_info.looking_up == ResolveInfo::ORIGIN_SERVER && t_state.http_config_param->no_dns_forward_to_parent && | ||
| t_state.parent_result.result != PARENT_UNDEFINED) { | ||
| t_state.dns_info.resolved_p = true; // seems dangerous - where's the IP address? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you explain "dangerous?" Is this an actual bug? Or only if anything ever gets dns_info's address without checking existence?
| s->host_db_info.app.http_data.http_version = HTTP_1_0; | ||
| s->server_info.http_version = HTTP_1_0; | ||
| s->server_info.keep_alive = HTTP_NO_KEEPALIVE; | ||
| s->server_info.http_version = HTTP_1_0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like this can just be removed, it's duplicated 2 lines up
| "Generic storage for HostDBApplicationInfo is smaller than the union storage."); | ||
| static ts_time hostdb_hostfile_update_timestamp{TS_TIME_ZERO}; | ||
| static char hostdb_filename[PATH_NAME_MAX] = DEFAULT_HOST_DB_FILENAME; | ||
| int hostdb_max_count = DEFAULT_HOST_DB_SIZE; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we swallow this ugliness to make sure write access to global vars like hostdb_max_count is limited to HostDB.cc?
namespace ts
{
template <typename T, class Tag>
struct GlobalAccess;
template <typename T, class Tag, T Iv = T()>
class Global
{
public:
Global() : _v{Iv} {}
using Type = T;
T operator()() const
{
return _v;
}
friend struct GlobalAccess<T, Tag>;
private:
T _v;
};
} // end namespace ts
// in HostDB.h
struct HostDBMaxCountTag {};
using HostDBMaxCount = ts::Global<int, HostDBMaxCountTag>;
HostDBMaxCount hostdb_max_count;
// test
int f()
{
return hostdb_max_count();
}
// in HostDB.cc
namespace ts
{
template <>
struct GlobalAccess<int, HostDBMaxCountTag>
{
static int &ref()
{
return hostdb_max_count._v;
}
};
}
// test
void g(int i)
{
ts::GlobalAccess<int, HostDBMaxCountTag>::ref() = i;
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why would that be better than
namespace { int hostdb_max_count = DEFAULT_HOST_DB_SIZE; }
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes that's better as long as hostdb_max_count doesn't have to be used outside of HostDB.cc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The idea of the above is to give read-only access outside of HostDB.cc.
|
[approve ci] |
| // number of partitions | ||
| REC_ReadConfigInt32(hostdb_partitions, "proxy.config.hostdb.partitions"); | ||
| // how often to sync hostdb to disk | ||
| REC_EstablishStaticConfigInt32(hostdb_sync_frequency, "proxy.config.cache.hostdb.sync_frequency"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change results in hostdb_sync_frequency always being 0, so no host.db file is ever written.
bneradt
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like a good change to me. I have some suggestions, but they're minor.
| SRV, ///< SRV record. | ||
| HOST ///< Hostname (reverse DNS) | ||
| }; | ||
| char const *name_of(HostDBType t); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe this can be a constexpr function.
| /// - 1.1 tweak HostDBApplicationInfo::http_data. | ||
| return ts::VersionNumber(1, 1); | ||
| } | ||
| HostDBInfo &operator=(HostDBInfo const &that); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use self_type for HostDBInfo?
| /** Select this target. | ||
| * | ||
| * @param now Current time. | ||
| * @param fail_window Failure window. | ||
| * @return Status of the selection. | ||
| * |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's flesh out the description of the function, describing the intention of its use, and the semantics of the boolean returned (I think true means it's alive, but false means dead or zombie).
| /// Target has failed and is still in the blocked time window. | ||
| bool is_dead(ts_time now, ts_seconds fail_window); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably good to add an is_zombie function as well.
|
|
||
| bool is_ip_fail_timeout() const; | ||
|
|
||
| void refresh_ip(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably good to document what this function does.
| struct HostDBCache { | ||
| int start(int flags = 0); | ||
| // Map to contain all of the host file overrides, initialize it to empty | ||
| Ptr<RefCountedHostsFileMap> hosts_file_ptr; | ||
| std::shared_ptr<HostFileMap> host_file; | ||
| std::shared_mutex host_file_mutex; | ||
|
|
||
| // TODO: make ATS call a close() method or something on shutdown (it does nothing of the sort today) | ||
| RefCountCache<HostDBInfo> *refcountcache = nullptr; | ||
| RefCountCache<HostDBRecord> *refcountcache = nullptr; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since we're changing HostDBCache, should the HOST_DB_CACHE_MAJOR_VERSION and HOST_DB_CACHE_MINOR_VERSION be updated?
I'm guessing not since they are never referenced. :) Can we simply remove those defines?
| Each info tracks several status values for its corresponding upstream. These are | ||
|
|
||
| * HTTP version | ||
| * Last failure time | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The HostDBInfo also has the IP address or SRV name, depending upon its type. That maybe should be mentioned here too.
| window has passed or the single connection succeeds. A successful connection clears the last file | ||
| time and the info becomes alive. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"last file time" -> "last fail time"
| State information for the outbound connection has been moved to a refurbished ``DNSInfo`` class | ||
| named ``ResolveInfo``. As much as possible relevant state information has been moved from the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Referencing DNSInfo is helpful to us as reviewers, but probably not after this PR is merged in. Probably reword the first sentence to:
State information for the output connection is stored in an instance of a
class named ``ResolveInfo``.
| Currently if an upstream is marked down connections are still permitted, the only change is the | ||
| number of retries. This has caused operational problems where dead systems are flooded with requests | ||
| which, despite the timeouts, accumulate in ATS until ATS runs out of memory (there were instances of |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These sentences use down and dead interchangeably. #7283 records the desire to generally unify and clarify the wording, but for new docs let's make the wording consistent within the doc.
0830c0c to
630a2e8
Compare
The default duration of time_since_epoch() is std::chrono::high_resolution_clock::duration, which will not generally be seconds. hostDB.refcountcache->put expects the epoch count to be a number of seconds. This explicitly casts to seconds so we get that expected value. This also makes some other std::chrono time updates. Co-authored-by: bneradt <bneradt@yahooinc.com>
This is a major restructuring of the HostDB internals, which should (externally) be backwards compatible. The original goal of this work was to provide better control of sending requests to down upstreams. That turned out to require rewriting the internal data structures to support atomic operations on shared state (otherwise it is not possible to have fine grained control of requests to the same upstream). Additional work was done to be in compliance with the long term L7 Routing strategy as defined by the L7 Working group.
This has been tested in production but more should be done.
The documentation needs a lot of work.