Skip to content

Conversation

@SolidWallOfCode
Copy link
Member

This is a major restructuring of the HostDB internals, which should (externally) be backwards compatible. The original goal of this work was to provide better control of sending requests to down upstreams. That turned out to require rewriting the internal data structures to support atomic operations on shared state (otherwise it is not possible to have fine grained control of requests to the same upstream). Additional work was done to be in compliance with the long term L7 Routing strategy as defined by the L7 Working group.

This has been tested in production but more should be done.

The documentation needs a lot of work.

@SolidWallOfCode SolidWallOfCode added this to the 10.0.0 milestone May 24, 2021
@SolidWallOfCode SolidWallOfCode self-assigned this May 24, 2021
}

inline bool
ResolveInfo::mark_active_server_alive()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

from the comments:

/** Mark the active target as alive.   
 *  
 * @return @c true if the target changed state.   
 */  
 bool mark_active_server_alive();

how this function could return true? should the mark_up() play a role in here?

@SolidWallOfCode SolidWallOfCode force-pushed the hostdb-restructure branch 2 times, most recently from 08c3e53 to ee34782 Compare June 3, 2021 17:31
@SolidWallOfCode SolidWallOfCode force-pushed the hostdb-restructure branch 2 times, most recently from f7a9428 to 7221fd1 Compare June 10, 2021 21:10
@SolidWallOfCode SolidWallOfCode marked this pull request as ready for review June 11, 2021 06:03
@bryancall
Copy link
Contributor

@rob05c says he is going to take a look at it

@bryancall
Copy link
Contributor

There is an issue with this PR and the Carp plugin.

=========

The primary operation for HostDB is to resolve a fully qualified domain name ("FQDN"). As noted each
FQDN is associated with a single record. Each record has an array of items. When a resolution
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure I understand "FQDN is associated with a single record." FQDNs are associated with a record of each type, e.g. an FQDN can have an A and an AAAA record, right? Or does "record" here refer to a code structure, not a DNS Record?

``ResolveInfo`` may contain a reference to a HostDB record, which preserves the record even if it is
replaced due to DNS queries in other transactions. The record is not required as the resolution
information can be supplied directly without DNS or HostDB, e.g. a plugin sets the upstream address
explicitly. The ``resolved_p`` flag indicates if the current information is valid and ready to be
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't fully understand what it means for resolved_p to be false, i.e. what it means for it to be "ready to be used," or why a plugin or core would set that. Docs explaining that might be helpful

provides a noticeable simplification of the code.

* Single and multiple address results are treated identically - a singleton is simply a multiple
of size 1. This yeilds a major simplification of the implementation.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo "yields"

*
* @param v The variable to update.
*/
ConfigDuration(V &v) : _var(&v) {}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

explicit?

if (query_name.empty()) {
if (is_byname()) {
Debug("hostdb", "lookup_done() failed for '%.*s'", hash.host_len, hash.host_name);
Debug("hostdb", "lookup_done() failed for '%.*s'", int(hash.host_name.size()), hash.host_name.data());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Constructor style-casts are not Leif-legal.

} else {
break;
}
};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

stray ;

HostDBInfo *live[rr.count()];
for (auto &target : rr) {
// skip dead upstreams.
if (rr[i].is_dead(now, fail_window)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this supposed to be target.is_dead ?

Copy link
Member

@rob05c rob05c left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I have some questions about things I don't understand, but I generally agree with this.

I went over all the code. I don't fully understand it, hostdb is very complex and it would take considerably longer than I have right now to get a complete understanding. But, the structure looks fine to me, and I didn't see any obvious bugs.

I also ran it, as well as added debugs to see what functions are called and when during requests. It all seems to work without issue, on successes and NXDOMAINs.

inline HostDBInfo &
HostDBInfo::operator=(HostDBInfo const &that)
{
if (this != &that) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it normal to assign a HostDBInfo to itself?

return true; // it's alive and so is valid for selection.
}
// Success means this is a zombie and this thread updated the failure time.
return (t0 + fail_window < now) && last_failure.compare_exchange_strong(t0, now);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this CAS will fail if another thread updated underneath us? What should the caller do if it gets a failure?

} else if (t_state.hdr_info.client_request.version_get() == HTTPVersion(1, 0)) {
t_state.dns_info.http_version = HTTP_1_0;
} else {
t_state.dns_info.http_version = HTTP_1_1;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this ever be H2? I don't really understand this, we're setting the http_version of the DNS record to the version the client used? Shouldn't it be the server?

SMDebug("dns", "[HttpTransact::HandleRequest] Skipping DNS lookup, provided by plugin");
} else if (t_state.dns_info.looking_up == ResolveInfo::ORIGIN_SERVER && t_state.http_config_param->no_dns_forward_to_parent &&
t_state.parent_result.result != PARENT_UNDEFINED) {
t_state.dns_info.resolved_p = true; // seems dangerous - where's the IP address?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you explain "dangerous?" Is this an actual bug? Or only if anything ever gets dns_info's address without checking existence?

s->host_db_info.app.http_data.http_version = HTTP_1_0;
s->server_info.http_version = HTTP_1_0;
s->server_info.keep_alive = HTTP_NO_KEEPALIVE;
s->server_info.http_version = HTTP_1_0;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like this can just be removed, it's duplicated 2 lines up

"Generic storage for HostDBApplicationInfo is smaller than the union storage.");
static ts_time hostdb_hostfile_update_timestamp{TS_TIME_ZERO};
static char hostdb_filename[PATH_NAME_MAX] = DEFAULT_HOST_DB_FILENAME;
int hostdb_max_count = DEFAULT_HOST_DB_SIZE;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we swallow this ugliness to make sure write access to global vars like hostdb_max_count is limited to HostDB.cc?

namespace ts
{
template <typename T, class Tag>
struct GlobalAccess;

template <typename T, class Tag, T Iv = T()>
class Global
{
public:
  Global() : _v{Iv} {}

  using Type = T;

  T operator()() const
  {
    return _v;
  }

  friend struct GlobalAccess<T, Tag>;

private:
  T _v;
};

} // end namespace ts

// in HostDB.h
struct HostDBMaxCountTag {};
using HostDBMaxCount = ts::Global<int, HostDBMaxCountTag>;
HostDBMaxCount hostdb_max_count;

// test
int f()
{
  return hostdb_max_count();
}

// in HostDB.cc
namespace ts
{
template <>
struct GlobalAccess<int, HostDBMaxCountTag>
{
  static int &ref()
  {
    return hostdb_max_count._v;
  }
};
}

// test
void g(int i)
{
  ts::GlobalAccess<int, HostDBMaxCountTag>::ref() = i;
}

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why would that be better than

namespace { int hostdb_max_count = DEFAULT_HOST_DB_SIZE; }

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes that's better as long as hostdb_max_count doesn't have to be used outside of HostDB.cc.

Copy link
Contributor

@ywkaras ywkaras Nov 11, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The idea of the above is to give read-only access outside of HostDB.cc.

@jrushford jrushford self-requested a review August 13, 2021 13:50
@bryancall
Copy link
Contributor

[approve ci]

// number of partitions
REC_ReadConfigInt32(hostdb_partitions, "proxy.config.hostdb.partitions");
// how often to sync hostdb to disk
REC_EstablishStaticConfigInt32(hostdb_sync_frequency, "proxy.config.cache.hostdb.sync_frequency");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change results in hostdb_sync_frequency always being 0, so no host.db file is ever written.

Copy link
Contributor

@bneradt bneradt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like a good change to me. I have some suggestions, but they're minor.

SRV, ///< SRV record.
HOST ///< Hostname (reverse DNS)
};
char const *name_of(HostDBType t);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this can be a constexpr function.

/// - 1.1 tweak HostDBApplicationInfo::http_data.
return ts::VersionNumber(1, 1);
}
HostDBInfo &operator=(HostDBInfo const &that);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use self_type for HostDBInfo?

Comment on lines +132 to +137
/** Select this target.
*
* @param now Current time.
* @param fail_window Failure window.
* @return Status of the selection.
*
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's flesh out the description of the function, describing the intention of its use, and the semantics of the boolean returned (I think true means it's alive, but false means dead or zombie).

Comment on lines +129 to +130
/// Target has failed and is still in the blocked time window.
bool is_dead(ts_time now, ts_seconds fail_window);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably good to add an is_zombie function as well.


bool is_ip_fail_timeout() const;

void refresh_ip();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably good to document what this function does.

Comment on lines 164 to +171
struct HostDBCache {
int start(int flags = 0);
// Map to contain all of the host file overrides, initialize it to empty
Ptr<RefCountedHostsFileMap> hosts_file_ptr;
std::shared_ptr<HostFileMap> host_file;
std::shared_mutex host_file_mutex;

// TODO: make ATS call a close() method or something on shutdown (it does nothing of the sort today)
RefCountCache<HostDBInfo> *refcountcache = nullptr;
RefCountCache<HostDBRecord> *refcountcache = nullptr;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we're changing HostDBCache, should the HOST_DB_CACHE_MAJOR_VERSION and HOST_DB_CACHE_MINOR_VERSION be updated?

https://github.com/apache/trafficserver/pull/7874/files#diff-944e3e1e73047c1013f2944bd7d5f4eb750efdee052d429ddfdbb799f319d1dbR88-R90

I'm guessing not since they are never referenced. :) Can we simply remove those defines?

Comment on lines +40 to +44
Each info tracks several status values for its corresponding upstream. These are

* HTTP version
* Last failure time

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The HostDBInfo also has the IP address or SRV name, depending upon its type. That maybe should be mentioned here too.

Comment on lines +57 to +58
window has passed or the single connection succeeds. A successful connection clears the last file
time and the info becomes alive.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"last file time" -> "last fail time"

Comment on lines +80 to +81
State information for the outbound connection has been moved to a refurbished ``DNSInfo`` class
named ``ResolveInfo``. As much as possible relevant state information has been moved from the
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Referencing DNSInfo is helpful to us as reviewers, but probably not after this PR is merged in. Probably reword the first sentence to:

State information for the output connection is stored in an instance of a 
class named ``ResolveInfo``. 

Comment on lines +137 to +139
Currently if an upstream is marked down connections are still permitted, the only change is the
number of retries. This has caused operational problems where dead systems are flooded with requests
which, despite the timeouts, accumulate in ATS until ATS runs out of memory (there were instances of
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These sentences use down and dead interchangeably. #7283 records the desire to generally unify and clarify the wording, but for new docs let's make the wording consistent within the doc.

SolidWallOfCode and others added 2 commits April 11, 2022 18:47
The default duration of time_since_epoch() is
std::chrono::high_resolution_clock::duration, which will not generally
be seconds. hostDB.refcountcache->put expects the epoch count to be a
number of seconds. This explicitly casts to seconds so we get that
expected value.

This also makes some other std::chrono time updates.

Co-authored-by: bneradt <bneradt@yahooinc.com>
@SolidWallOfCode SolidWallOfCode deleted the hostdb-restructure branch June 23, 2022 20:42
@zwoop zwoop removed this from the 10.0.0 milestone Jan 17, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants