
roachtest: measure connections/node and memory consumption #25404

Closed
tbg opened this issue May 10, 2018 · 9 comments
Labels
A-kv-client Relating to the KV client and the KV interface. C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)


@tbg
Member

tbg commented May 10, 2018

Repurposing this issue somewhat. We want to give baseline expectations on how many client connections per node a cluster can handle. Certainly this will depend on the workload, the configured sql memory, and background operations in the cluster (memory is really the limiting factor here), but we should at least obtain (and uphold) ballpark estimates for simple workloads, such as

  1. SELECT 1 (i.e. basically idling).
  2. kv (with small data)

@knz outlined the following approach for measuring basic per-connection memory usage:

  1. create a 1-node cluster
  2. reduce background activity to a minimum:
    1. disable cluster metrics
    2. disable GC
  3. restart the node.
  4. wait for idle memory usage to stabilize (measure)
  5. connect 10 clients (leave session idle).
  6. wait for memory usage to stabilize. Measure.
  7. connect 10 more clients (leave session idle).
  8. repeat steps 6-7 until memory or file descriptors are exhausted
  9. plot the line
  10. linear regression

Doing this on a single-node cluster is a good first step, but we should also do this on real clusters to account for their increased memory consumption due to coordination.

Original post below.


Reported on Gitter by user @erichocean. My working assumption is that this is the lack of memory management in the storage layer, something that we should try to verify via heap profiles. This is the same cluster as in #25403.

I've currently got at least one node with ~2000 connections, but if I push much beyond that, they start crashing (which is how I saw that ranges became unavailable).

I really need about 50000 connections with an open transaction at once. They don't conflict with each other.
That would be closer to 3500 per node for me.

They are killed by the OOM manager in the Linux kernel.
But these are big nodes, 32GiB each of RAM

There's one table with a row fetched for update, then 5-6 rows fetched from other tables that are read-only and rarely updated (by another connection), and then another table that stores one row for the request, and a child table with rows for each log we capture. It's all wrapped in a begin/commit loop with appropriate retry logic.
UUID on the row for update as a primary key, same for the request and request logs.
At worst, one of the 5-6 rows that are fetched could get updated in the middle of the transaction, causing a retry.
But that's rare.
We have ~20 million of these rows we fetch for processing, and the goal is to be processing about 10000/sec in terms of throughput.
Latency is irrelevant for us.
I currently have 15 machines that do the processing, and they are idle 80% of the time because I can't get enough connections to the database open with my current setup.
So my throughput is being limited by Cockroach latency.
And I can't hide it by opening more connections. :(

I am on 2.0, I've got 15 nodes, and each node has 4/8 3.4GHz cores, 32GiB of RAM, two SSDs (each set up as a store), 10GigE networking, and runs nothing but cockroach.
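The transaction shape described in the report (one row locked for update, a handful of read-only lookups, two inserts, all wrapped in a begin/commit loop with retry logic) could be sketched as a client-side retry wrapper. Everything here is hypothetical: `SerializationError` stands in for whatever driver error carries SQLSTATE 40001, and the fake transaction simulates one conflicting update mid-transaction.

```python
import random
import time

class SerializationError(Exception):
    """Stands in for a driver error carrying SQLSTATE 40001."""

def run_with_retries(txn_fn, max_retries=10):
    """Run txn_fn, retrying with backoff when the database asks for a restart."""
    for attempt in range(max_retries):
        try:
            return txn_fn()
        except SerializationError:
            # Exponential backoff with jitter before retrying the transaction.
            time.sleep(min(0.001 * 2 ** attempt, 1.0) * random.random())
    raise RuntimeError("transaction gave up after max retries")

# Fake transaction: fails once (as when one of the 5-6 read rows is
# updated mid-transaction), then commits on the retry.
attempts = {"n": 0}
def fake_txn():
    attempts["n"] += 1
    if attempts["n"] == 1:
        raise SerializationError()
    return "committed"

print(run_with_retries(fake_txn))  # committed
```

In a real client, `txn_fn` would open a transaction, issue the SELECT ... FOR UPDATE and the inserts, and commit; the wrapper only handles the restart loop.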

@tbg tbg added the A-kv-client Relating to the KV client and the KV interface. label May 10, 2018
@tbg tbg self-assigned this May 10, 2018
@tbg
Member Author

tbg commented May 19, 2018

@knz from the SQL perspective, approximately how much memory do you expect an idle sql client conn to consume?

@knz
Contributor

knz commented May 19, 2018

Back in 2016 when I chose the upfront allocation constant (baseSQLMemoryBudget) I had it calibrated at under 5KB, so I picked 10KB to be safe.

I see the value has since been bumped to 21K -- apparently even the simplest SQL queries use up to 10K (I'm not entirely surprised by that), and the initial bump from the baseline 10K to 20K by the first query in every session used to be logged, which made the log file too chatty.

Since then we have reduced the logging, so perhaps we could reduce the base allocation (it needs to be calibrated again).

@knz
Contributor

knz commented May 19, 2018 via email

@erichocean

erichocean commented May 21, 2018

Our target is 3K active connections per node, but with relatively small datasets per connection (i.e. these aren't big, sweeping transactions that touch a lot of rows). If we could do 5MiB/connection, that'd be ~15GiB for all 3K connections, and roughly half the available RAM on the node.

I'd expect an idle connection to consume in the tens to hundreds of KiB of RAM.
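A quick check of the arithmetic above, assuming the stated budget of 5 MiB per connection across 3,000 connections on a 32 GiB node:

```python
MIB = 1024 ** 2
GIB = 1024 ** 3

per_conn = 5 * MIB
total = 3_000 * per_conn

print(f"total: {total / GIB:.1f} GiB")                    # 14.6 GiB, i.e. ~15
print(f"fraction of 32 GiB: {total / (32 * GIB):.0%}")    # 46%, roughly half
```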

@knz
Contributor

knz commented May 21, 2018

Last time I checked, that's indeed where we are.

@erichocean

@knz how can I measure that on my cluster? I'm getting frequent node crashes (kernel OOM killer) with well under 2000 connections on a 32GiB node.

@tbg
Member Author

tbg commented May 28, 2018

ping @knz

@knz
Contributor

knz commented May 28, 2018

How to measure per-connection memory usage?

  1. create a 1-node cluster
  2. reduce background activity to a minimum:
    1. disable cluster metrics
    2. disable GC
  3. restart the node.
  4. wait for idle memory usage to stabilize (measure)
  5. connect 10 clients (leave session idle).
  6. wait for memory usage to stabilize. Measure.
  7. connect 10 more clients (leave session idle).
  8. repeat steps 6-7 until memory or file descriptors are exhausted
  9. plot the line
  10. linear regression

Then you have it: memory usage per client, plus the baseline.
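The "wait for memory usage to stabilize" steps (4 and 6) could be automated with a small polling helper: sample until several consecutive readings agree within a tolerance. `sample_fn` below is a fake returning canned readings; on a real node it might read the cockroach process RSS from /proc or a metrics endpoint.

```python
def wait_until_stable(sample_fn, tolerance, needed=3, max_polls=100):
    """Return the first reading after `needed` consecutive stable samples."""
    prev = sample_fn()
    stable = 0
    for _ in range(max_polls):
        cur = sample_fn()
        stable = stable + 1 if abs(cur - prev) <= tolerance else 0
        if stable >= needed:
            return cur
        prev = cur
    raise TimeoutError("memory usage did not stabilize")

# Fake sampler: RSS (in MiB) ramps up after connecting clients, then settles.
readings = iter([500, 900, 1200, 1300, 1300, 1300, 1300])
fake_sample = lambda: next(readings) * 1024 * 1024

print(wait_until_stable(fake_sample, tolerance=10 * 1024 * 1024))  # 1363148800
```

The tolerance and sample count would need tuning on a real cluster, since Go garbage collection makes the RSS curve noisy.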

@tbg tbg changed the title sql,storage: OOM killer terminates nodes when too many connections are open roachtest: measure connections/node May 29, 2018
@tbg tbg changed the title roachtest: measure connections/node roachtest: measure connections/node and memory consumption May 29, 2018
@tbg tbg added the C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) label Jul 22, 2018
@tbg tbg added this to the 2.1 milestone Jul 22, 2018
@petermattis petermattis removed this from the 2.1 milestone Oct 5, 2018
@tbg
Member Author

tbg commented Oct 11, 2018

Folding this into #10320.
