Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

vhost selection by sticky sessions #1266

Merged
merged 2 commits into from
Jul 1, 2019

Conversation

i-rinat
Copy link
Contributor

@i-rinat i-rinat commented Jun 5, 2019

The patchset adds vhost caching in HTTP sessions. We now save vhost after the first request in a session is processed, and then reuse it for other requests in the session, thus avoiding the selection procedure which may be quite costly.

(fixes #1043)

@@ -3353,6 +3353,20 @@ tfw_http_req_process(TfwConn *conn, const TfwFsmData *data)
TFW_INC_STAT_BH(clnt.msgs_otherr);
}

if (req->sess) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

req must not be accessed after tfw_http_send_resp()|tfw_cache_process(). Request can be sent out there or even destroyed.

Saving vhost value should take place right after vhost is determined for the first time (same function, line 3219):

-	if (!req->vhost)
+	if (!req->vhost) {
		req->vhost = tfw_http_tbl_vhost((TfwMsg *)req, &block);
+		tfw_http_sess_pin_vhost(req);
+	}
	if (req->vhost)
		req->location = tfw_location_match(req->vhost,
						   &req->uri_path);

And I would like to move session-specific operations to `http_sess.c_

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved code to http_sess.c. Since tfw_http_sess_pin_vhost() is now part of API for sessions and can be used by some other code, it now also checks whenever sess->vhost was non-NULL to track references to vhost more accurately. Also removed outdated comments that I forgot to remove in the initial submission.

@i-rinat i-rinat force-pushed the ri-vhost-selection-by-sticky-session branch from acc6c36 to 7b5c049 Compare June 6, 2019 15:11
@i-rinat
Copy link
Contributor Author

i-rinat commented Jun 6, 2019

Changed code to address review comments. However there is still functional tests failure — investigating that.

Copy link
Contributor

@vankoven vankoven left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally I am agree with the PR, but I have a little concerns about grace shutdown mode. If there is no such issue, the PR can be merged.

Here in the possible bad pointer dereference scenario:

  • Vhost and related server groups (main and backup) are removed from the configuration, grace shutdown is enabled, sticky session failovering is enabled.
  • All servers from primary (main) server group has finished grace shutdown routine, i.e. servers and server group structures was expired. But some servers from backup group still has active sessions.
  • Suddenly one of the backup servers closes all connections. Since the server on the grace shutdown, we don't reconnect to it.
  • A new request is received in a session that is pinned to that server.
  • We try to find a new matching connection:
    srv_conn = tfw_vhost_get_srv_conn(msg);

    vhost stored in the request - is vhost from the previous configuration. But main server group was already destroyed, thus vhost->main_sg becomes invalid.

Probable solution: expire session if the server was removed from the configuration even if the session failovering is enabled.

if (unlikely(test_bit(TFW_CFG_B_DEL, &srv->flags))) {
TFW_LOG("sticky sched: server %s"
" was removed, set session expired\n",
addr_str);
tfw_http_sess_set_expired(sess);

@i-rinat i-rinat force-pushed the ri-vhost-selection-by-sticky-session branch 2 times, most recently from f5ccca0 to ff8d374 Compare June 26, 2019 21:00
@i-rinat
Copy link
Contributor Author

i-rinat commented Jun 26, 2019

There was a bug in the previous version of the patchset: reference to a vhost was stored in a session, which through indirection held references to server groups. After configuration reload that references were keeping server group structures in zombie state — not dead, but also not functional, since schedulers were removed from them. That caused 502 responses emitted until client closed their connection.

The issue was resolved by introducing a flag into TfwVhost. It's set during reconfiguration (in tfw_cfgop_vhosts_list_free()) and checked when we are about to reuse cached vhost in tfw_http_sess_obtain().

@i-rinat
Copy link
Contributor Author

i-rinat commented Jun 26, 2019

thus vhost->main_sg becomes invalid

@ikoveshnikov,
As far as I understand the code, TfwVhost holds a reference to main_sg, indirectly, through instances of TfwLocation. So if TfwVhost instance is alive, so are dependent instances of TfwSrvGroup. That's why I think this is not an issue.

What may happen is zombie TfwSrvGroup, which holds memory, but has ->sched_data cleared. It's not functional, but still safe to access, so no crashes, but misbehavior instead. I believe that issue was addressed in recent changes: by checking if TfwVhost instance was abandoned (TFW_VHOST_B_INVALID bit in flags).

Copy link
Contributor

@vankoven vankoven left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good except one minor notice.

@@ -128,6 +128,10 @@ enum {
TFW_D_CACHE_PURGE_INVALIDATE,
};

enum {
TFW_VHOST_B_INVALID = 0,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would request another name here. Something marked as 'invalid' usually means that something is wrong with the internal state object. 'Invalid' objects cant pass configuration phase and exist in running configuration. Name like TFW_VHOST_B_REMOVED sound more accurate and it's very clear why this instance exist and why it's not usable.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed to TFW_VHOST_B_REMOVED. It does look better, indeed.

write_lock(&sess->lock);
sess->vhost = NULL;
write_unlock(&sess->lock);
tfw_vhost_put(cached_vhost);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like suspicious optimisation. Same HTTP session can appear in multiple requests that can be processed on multiple CPUs at the same time. Imagine the situation: two threads are taking the read lock and find out that the vhost is removed; then both decides to drop the reference; since the reference is cached during read lock the reference would be dropped twice, causing use-after-free errors. I suggest following change:

-				TfwVhost *cached_vhost = sess->vhost;
				read_unlock(&sess->lock);
				write_lock(&sess->lock);
-				sess->vhost = NULL;
+				if (sess->vhost) {
+					tfw_vhost_put(sess->vhost);
+					sess->vhost = NULL;
+				}
				write_unlock(&sess->lock);
-				tfw_vhost_put(cached_vhost);

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed code according to your suggestion. Thanks!


tfw_vhost_get(sess->vhost);
req->vhost = sess->vhost;
read_unlock(&sess->lock);
goto found;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something seemed strange to me here, I was thinking about possible configuration updates like scheduler changes. vhost name changes and other changes that can spoil moving session from old vhost instance to a new one. I took me some time to recheck this, but everything looks good here.

i-rinat added 2 commits July 1, 2019 14:37
TfwStickyConn lost its members over the time, and now there are only two:
a pointer to TfwSrvConn and a lock to protect that pointer.
Caching vhost in the session helps to skip vhost selection procedure which
may be quite costly.
@i-rinat i-rinat force-pushed the ri-vhost-selection-by-sticky-session branch from ff8d374 to 7450d7c Compare July 1, 2019 13:41
@i-rinat i-rinat merged commit 9f7ee98 into master Jul 1, 2019
@i-rinat i-rinat deleted the ri-vhost-selection-by-sticky-session branch July 1, 2019 13:52
Copy link
Contributor

@vankoven vankoven left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Fast sticky session scheduler is broken
2 participants