
roachtest: measure connections/node and memory consumption #25404

Closed
tbg opened this issue May 10, 2018 · 9 comments
Labels
A-kv-client Relating to the KV client and the KV interface. C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)


@tbg
Member

tbg commented May 10, 2018

Repurposing this issue somewhat. We want to give baseline expectations on how many client connections per node a cluster can handle. Certainly this will depend on the workload, the configured sql memory, and background operations in the cluster (memory is really the limiting factor here), but we should at least obtain (and uphold) ballpark estimates for simple workloads, such as

  1. SELECT 1 (i.e. basically idling).
  2. kv (with small data)

@knz outlined the following approach for measuring basic per-connection memory usage:

  1. create a 1-node cluster
  2. reduce background activity to a minimum:
    1. disable cluster metrics
    2. disable GC
  3. restart the node.
  4. wait for idle memory usage to stabilize (measure)
  5. connect 10 clients (leave session idle).
  6. wait for memory usage to stabilize. Measure.
  7. connect 10 more clients (leave session idle).
  8. repeat steps 6-7 until memory or file descriptors are exhausted
  9. plot the line
  10. linear regression

Doing this on a single-node cluster is a good first step, but we should also do this on real clusters to account for their increased memory consumption due to coordination.

Original post below.


Reported on Gitter by user @erichocean. My working assumption is that this is the lack of memory management in the storage layer, something that we should try to verify via heap profiles. This is the same cluster as in #25403.

I've currently got at least one node with ~2000 connections, but if I push much beyond that, they start crashing (which is how I saw that ranges became unavailable).

I really need about 50000 connections with an open transaction at once. They don't conflict with each other.
That would be closer to 3500 per node for me.

They are killed by the OOM manager in the Linux kernel.
But these are big nodes, 32GiB each of RAM

There's one table with a row fetched for update, then 5-6 rows fetched from other tables that are read-only and rarely updated (by another connection), and then another table that stores one row for the request, and a child table with rows for each log we capture. It's all wrapped in a begin/commit loop with appropriate retry logic.
UUID on the row for update as a primary key, same for the request and request logs.
At worst, one of the 5-6 rows that are fetched could get updated in the middle of the transaction, causing a retry.
But that's rare.
We have ~20 million of these rows we fetch for processing, and the goal is to be processing about 10000/sec in terms of throughput.
Latency is irrelevant for us.
I currently have 15 machines that do the processing, and they are idle 80% of the time because I can't get enough connections to the database open with my current setup.
So my throughput is being limited by Cockroach latency.
And I can't hide it by opening more connections. :(

I am on 2.0, I've got 15 nodes, and each node has 4/8 3.4GHz cores, 32GiB of RAM, two SSDs (each set up as a store), 10GigE networking, and runs nothing but cockroach.
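The transaction shape described in the report (one row locked for update, a handful of read-only lookups, two inserts, all wrapped in a begin/commit loop with retry logic) could be sketched as a client-side retry wrapper. Everything here is hypothetical: `SerializationError` stands in for whatever driver error carries SQLSTATE 40001, and the fake transaction simulates one conflicting update mid-transaction.

```python
import random
import time

class SerializationError(Exception):
    """Stands in for a driver error carrying SQLSTATE 40001."""

def run_with_retries(txn_fn, max_retries=10):
    """Run txn_fn, retrying with backoff when the database asks for a restart."""
    for attempt in range(max_retries):
        try:
            return txn_fn()
        except SerializationError:
            # Exponential backoff with jitter before retrying the transaction.
            time.sleep(min(0.001 * 2 ** attempt, 1.0) * random.random())
    raise RuntimeError("transaction gave up after max retries")

# Fake transaction: fails once (as when one of the 5-6 read rows is
# updated mid-transaction), then commits on the retry.
attempts = {"n": 0}
def fake_txn():
    attempts["n"] += 1
    if attempts["n"] == 1:
        raise SerializationError()
    return "committed"

print(run_with_retries(fake_txn))  # committed
```

In a real client, `txn_fn` would open a transaction, issue the SELECT ... FOR UPDATE and the inserts, and commit; the wrapper only handles the restart loop.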

@tbg tbg added the A-kv-client Relating to the KV client and the KV interface. label May 10, 2018
@tbg tbg self-assigned this May 10, 2018
@tbg
Member Author

tbg commented May 19, 2018

@knz from the SQL perspective, approximately how much memory do you expect an idle sql client conn to consume?

@knz
Contributor

knz commented May 19, 2018

Back in 2016 when I chose the upfront allocation constant (baseSQLMemoryBudget) I had it calibrated at under 5KB, so I picked 10KB to be safe.

I see the value has since been bumped to 21K -- apparently even the simplest SQL queries use up to 10K (I'm not entirely surprised by that), and the initial bump from the baseline 10K to 20K by the first query in every session used to be logged, which made the log file too chatty.

Since then we have reduced the logging, so perhaps we could reduce the base allocation (it needs to be calibrated again).

@knz
Contributor

knz commented May 19, 2018 via email

@erichocean

erichocean commented May 21, 2018

Our target is 3K active connections per node, but with relatively small datasets per connection (i.e. these aren't big, sweeping transactions that touch a lot of rows). If we could do 5MiB/connection, that'd be ~15GiB for all 3K connections, and roughly half the available RAM on the node.

I'd expect an idle connection to consume in the tens to hundreds of KiB of RAM.
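A quick check of the arithmetic above, assuming the stated budget of 5 MiB per connection across 3,000 connections on a 32 GiB node:

```python
MIB = 1024 ** 2
GIB = 1024 ** 3

per_conn = 5 * MIB
total = 3_000 * per_conn

print(f"total: {total / GIB:.1f} GiB")                    # 14.6 GiB, i.e. ~15
print(f"fraction of 32 GiB: {total / (32 * GIB):.0%}")    # 46%, roughly half
```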

@knz
Contributor

knz commented May 21, 2018

Last time I checked, that's indeed where we are.

@erichocean

@knz how can I measure that on my cluster? I'm getting frequent node crashes (kernel OOM killer) with well under 2000 connections on a 32GiB node.

@tbg
Member Author

tbg commented May 28, 2018

ping @knz

@knz
Contributor

knz commented May 28, 2018

How to measure per-connection memory usage?

  1. create a 1-node cluster
  2. reduce background activity to a minimum:
    1. disable cluster metrics
    2. disable GC
  3. restart the node.
  4. wait for idle memory usage to stabilize (measure)
  5. connect 10 clients (leave session idle).
  6. wait for memory usage to stabilize. Measure.
  7. connect 10 more clients (leave session idle).
  8. repeat steps 6-7 until memory or file descriptors are exhausted
  9. plot the line
  10. linear regression

Then you have it: memory usage per client, plus the baseline.
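The "wait for memory usage to stabilize" steps (4 and 6) could be automated with a small polling helper: sample until several consecutive readings agree within a tolerance. `sample_fn` below is a fake returning canned readings; on a real node it might read the cockroach process RSS from /proc or a metrics endpoint.

```python
def wait_until_stable(sample_fn, tolerance, needed=3, max_polls=100):
    """Return the first reading after `needed` consecutive stable samples."""
    prev = sample_fn()
    stable = 0
    for _ in range(max_polls):
        cur = sample_fn()
        stable = stable + 1 if abs(cur - prev) <= tolerance else 0
        if stable >= needed:
            return cur
        prev = cur
    raise TimeoutError("memory usage did not stabilize")

# Fake sampler: RSS (in MiB) ramps up after connecting clients, then settles.
readings = iter([500, 900, 1200, 1300, 1300, 1300, 1300])
fake_sample = lambda: next(readings) * 1024 * 1024

print(wait_until_stable(fake_sample, tolerance=10 * 1024 * 1024))  # 1363148800
```

The tolerance and sample count would need tuning on a real cluster, since Go garbage collection makes the RSS curve noisy.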

@tbg tbg changed the title sql,storage: OOM killer terminates nodes when too many connections are open roachtest: measure connections/node May 29, 2018
@tbg tbg changed the title roachtest: measure connections/node roachtest: measure connections/node and memory consumption May 29, 2018
@tbg tbg added the C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) label Jul 22, 2018
@tbg tbg added this to the 2.1 milestone Jul 22, 2018
@petermattis petermattis removed this from the 2.1 milestone Oct 5, 2018
@tbg
Member Author

tbg commented Oct 11, 2018

Folding this into #10320.
