
Storing client data in tdb #1145

Closed
wants to merge 2,430 commits into from

Conversation

@avbelov23 (Contributor):

Requires #1115

Changes:

  • Transfer tfw_current_timestamp() to the library
  • Keep client data in TDB (client_db and client_tbl_size options) after the last client connection is closed, and keep using it for some time afterwards (client_lifetime option)
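
A hypothetical configuration fragment showing how these options might fit together. The option names come from this PR; the database path and size values are purely illustrative placeholders, not defaults from the code:

```
# Store client accounting data in TDB (path and size are placeholders):
client_db /opt/tempesta/db/client.tdb;
client_tbl_size 16777216;

# Keep the data for an hour after the last client connection closes:
client_lifetime 3600;
```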

aleksostapenko and others added 30 commits March 20, 2018 02:30
Fix #903: add logging for all error responses.
Fix releasing of requests and responses on client disconnects
Resolve issues from static analyzer report.
Fix #978: mark started flag as false after failed start on reload.
tfw_http_msg_alloc_resp_light() is an inlined function, so overriding
its return code is not possible. Set a stap probe on the __tfw_http_msg_alloc()
function and check its arguments in order to override the return value only for
tfw_http_msg_alloc_resp_light().
Message parsing in deproxy was improved: garbage after the end of
headers is no longer treated as body. Manually append garbage
to the string representation of the message.
Fix some issues in functional tests
@vankoven (Contributor) left a comment:

The PR is not fully done and contains race conditions; it can't be merged in its current state.

lib/common.h Outdated
/**
* Tempesta kernel library
*
* Copyright (C) 2015-2018 Tempesta Technologies, Inc.
Contributor:

Bad copy-paste. The file has just been created, so the year 2019 must be stated, not the range 2015-2018.

@@ -29,6 +29,7 @@
#include "server.h"
#include "str.h"
#include "vhost.h"
#include "lib/common.h"
Contributor:

http.h is a big and messy header, in my opinion. Please include the lib/common.h header only in the translation units where it's really required: http.c, cache.c, client.c.


# TAG: client_lifetime
#
# Client life time in seconds. This is the time during which the clients with
Contributor:

Not a very accurate description. Something like: 'Lifetime of client accounting data after the last client connection was closed. The accounting data is used for Frang limits.'

UPD: Ah, I see in tfw_client_obtain() that you actually mean client_lifetime to be the lifetime of the TfwClient structure from its creation. This is incorrect and buggy behaviour; please see my comment at tfw_client_obtain().

# client_lifetime 3600;
#
# Default:
# client_lifetime 0;
Contributor:

I know that it was copied from sess_lifetime, which also has 0 as the default value. But both seem to be dangerous defaults and must be redefined in real-world installations. Any thoughts?

Contributor Author:

We need to choose a reasonable value.

ent = (TfwClientEntry *)iter.rec->data;
cli = &ent->cli;
if (!memcmp_fast(&cli->addr, &addr, sizeof(cli->addr)) &&
curr_time < ent->expires) {
Contributor:

ent->expires is initialized at TfwClient entry initialisation, and you skip expired entries here. This behaviour can be correct, but it's not secure:

  1. client_lifetime is configured to, say, 5 seconds.
  2. A bot opens a connection.
  3. A new TfwClient entry is created during tfw_client_obtain().
  4. The bot waits 6 seconds.
  5. The bot opens a new connection.
  6. Now the bot has two associated TfwClient entries, so Frang limits are two times higher for it.
  7. The bot can open as many connections as it wants and pass all the Frang limits.

Setting bigger client_lifetime limits doesn't fix the problem; it just requires more time to exploit the vulnerability.

We shouldn't treat a TfwClient as expired while it has live connections (TfwClient->conn_users != 0). The expiration time should be evaluated only after all the connections have been closed. Thus we need to re-arm the expiration timestamp on the last tfw_client_put() operation.

spin_lock_bh(&cli->conn_lock);

if (list_empty(&cli->conn_list))
TFW_INC_STAT_BH(clnt.online);
Contributor:

We discussed it in the chat yesterday, but we missed a race condition here. A connection is added into conn_list only when the socket reaches the established state, while a new TfwClient instance can be created during Frang callbacks. When a client spawns multiple connections, the race may happen and clnt.online can be increased more than once.

return NULL;
}

len = sizeof(*cli);
rec = tdb_entry_alloc(ip_client_db, key, &len);
if (!rec) {
Contributor:

Unnecessary curly brace.

return NULL;
}

len = sizeof(*cli);
rec = tdb_entry_alloc(ip_client_db, key, &len);
Contributor:

Hm, on the previous review I missed a race condition here. Here is the scenario:

  1. The client spawns more than one connection simultaneously.
  2. tfw_client_obtain() is called simultaneously for conn1 and conn2.
  3. The lookup fails for both conn1 and conn2.
  4. Two different TfwClient instances are created in the same TDB bucket.

Probably we need a new interface for TDB which finds an existing entry or creates a new one in a single action (without releasing intermediate TDB locks). Any thoughts?

@avbelov23 (Contributor Author), Jan 11, 2019:

On the previous review there were locks. You said to remove them.

Contributor:

Those were the old locks used when TfwClients were stored in a hash table. Those locks (a hashtable of spinlocks) weren't synchronised with TDB and looked like a workaround standing aside: every time a new entry was created in TDB, a spinlock from the hashtable was picked and locked, then a TDB search was performed, which also uses some locking of its own.

I would like to see a more elegant solution. Unlike cache.c, in client.c we still need to insert a new entry if the TDB lookup fails. Having a TDB interface capable of doing this in one action seems to be the cleaner solution.

@krizhanovsky had more ideas on future TDB evolution, let's involve him in this discussion.

Contributor:

I agree with @ikoveshnikov that the hash of locks cli_hash looks fishy, and we don't need it because TDB provides internal synchronization for parallel selectors & updaters and/or synchronization patterns, so please remove cli_hash.

The race occurs because we do 2 operations, a lookup and then an insert if the lookup fails, while we need the operations coupled and atomic. This is a frequent pattern, e.g. in #500 we need to look up and insert an in-progress cache entry. Actually, tdb_htrie_insert() was written with this assumption and it already has some code for lookup and insert on failure.

For now, please just add a new function tdb_rec_get_alloc() which calls tdb_rec_get() and then tdb_entry_alloc() if the former fails, under one global spinlock (it could even be system-wide, not per table). I've added a requirement to implement the function properly in #515; right now we need the lock just not to crash in simple scenarios.

TfwClient *c;
CliHashBucket *hb = &cli_hash[i];
int r = 0;
/*TfwClient *c;
Contributor:

Dead code is commented out. Without this function the PR can't be merged into master, since closing active client connections won't be performed on shutdown.

Contributor:

Yes, and IIRC we discussed that this logic (traversal of all the entries) should be implemented on the TDB side in this PR. However, tfw_client_for_each() is called only to destroy all clients on Tempesta FW shutdown, and this is not what we actually want now: Tempesta may shut down for many reasons, e.g. server maintenance, while clients have expire timeouts and can live independently of system restarts. I.e., if we have a new client with an expire time of 5 minutes, why should we delete the client record on a system reset if we just created it? So I believe we should just remove the whole client deletion logic now and leave the deletion to garbage collection in #515.

@avbelov23 (Contributor Author), Jan 31, 2019:

This code is not associated with deleting clients. It is associated with closing all connections of all clients. Without traversing all the records and closing all connections, shutdown will crash. We need tdb_for_each_rec. @krizhanovsky said on December 26th in Slack that he would implement the function himself in #515.

int i;
if (tfw_runstate_is_reconfig())
return;
if (ip_client_db) {
Contributor:

Unnecessary curly brace.

if (init)
init(cli);

atomic64_inc(&act_cli_n);
Contributor:

The variable was required for tfw_cli_wait_release().

@avbelov23 (Contributor Author):

DONE

@krizhanovsky (Contributor) left a comment:

The pull request is incomplete and significant adjustments are required.

lib/common.h Outdated
* Temple Place - Suite 330, Boston, MA 02111-1307, USA.
*/
#ifndef __COMMON_H__
#define __COMMON_H__
Contributor:

I'm not sure that we need a new file for just this 3-line function; usually we just copy and paste such small code. However, I don't mind too much, and if @ikoveshnikov and @aleksostapenko are fine with the file, I'm fine too.

If you leave the file, then use __LIB_COMMON_H__ - we use the LIB prefix for headers under /lib/.

Contributor:

I guess the new file may be left, for similar common stuff in the future.

HLIST_HEAD_INIT,
}
};
CliHashBucket cli_hash[CLI_HASH_SZ];
Contributor:

Please get rid of all the hash rudiments.

typedef struct {
time_t expires;
TfwClient cli;
} TfwClientEntry;
Contributor:

I see that cache_cfg members are unaligned, but in recent code we align structure members with tabs.

@@ -58,26 +67,18 @@ static atomic64_t act_cli_n = ATOMIC64_INIT(0);
void
Contributor:

Please add a TODO for #515 (employ an eviction strategy for the table) so that we won't forget about this place. We'll need to move the expires member of TfwClientEntry into a TDB record to allow TDB to perform eviction internally.

hlist_del(&cli->hentry);

spin_unlock(cli->hb_lock);
ent = (TfwClientEntry *)((void *)cli - offsetof(TfwClientEntry, cli));
Contributor:

I'd say that TfwClientEntry inherits from TfwClient, so cli should be the first member of TfwClientEntry; then we don't need the offsetof expression here, and we won't need to update the code when we move expires to TDB-internal entry descriptors.

if (!(cli = kmem_cache_alloc(cli_cache, GFP_ATOMIC | __GFP_ZERO))) {
spin_unlock(&hb->lock);
if (unlikely(!ss_active())) {
TFW_DBG("reject allocation of new client after shutdown\n");
return NULL;
}
Contributor:

The requirements from #1115:

Next, the hash key for searching a client must be calculated from User-Agent plus IP address;
otherwise, if there is no User-Agent, from the IP address only.

Currently we use the network IP address as the address of a client; however, if a client works
through a forward proxy, the proxy can pass the client's IP address as the first item of X-Forwarded-For,
so if the header is present, we shall reinsert the TfwClient in TDB with a different key.

The reinsert operation must be implemented on the TDB layer as a new routine tdb_entry_reinsert()
accepting the current and new keys. The function must call tdb_htrie_insert(), copying the data from
the previous location, with an empty tdb_htrie_delete() left as a TODO for #515.

These aren't implemented.

}

static int
tfw_client_lifetime(TfwCfgSpec *cs, TfwCfgEntry *ce)
Contributor:

We use the tfw_cfgop_ prefix for configuration handlers.

TFW_DBG("new client: cli=%p\n", cli);
TFW_DBG_ADDR("client address", &cli->addr, TFW_NO_PORT);

found:
if (!atomic_read(&cli->conn_users)) {
atomic64_inc(&act_cli_n);
TFW_INC_STAT_BH(clnt.online);
Contributor:

Why do we need to do this after found:, and not before? It seems that before found: there is no need for an additional check of cli->conn_users.

Contributor Author:

The client now appears online when it is allocated or fetched from TDB and conn_users is zero.

Contributor:

Yeah, missed that.

TFW_DBG("new client: cli=%p\n", cli);
TFW_DBG_ADDR("client address", &cli->addr, TFW_NO_PORT);

found:
if (!atomic_read(&cli->conn_users)) {
atomic64_inc(&act_cli_n);
Contributor:

The act_cli_n counter (as well as the whole tfw_cli_wait_release() function) was added to resolve issue #1034 (an explanatory comment is in #1118 (comment), p. 3). Since the client deletion logic is removed in this PR and client instances are persistent entries in TDB for now, the functionality for controlling client release has become redundant.
Besides, it seems that this functionality is not needed for eviction of client instances from TDB either (in the context of the future #515 implementation): we evict only expired clients, which in turn become expired only after their reference counts drop to zero.
So, act_cli_n and tfw_cli_wait_release() can be safely removed in this PR.
