Provide performance benchmarks with and without Proxy #1871

Open
Stono opened this issue Jul 13, 2023 · 14 comments
Labels
priority: p2 Moderately-important priority. Fix may not be included in next release. type: feature request ‘Nice-to-have’ improvement, new feature or different behavior or design.

Comments

@Stono

Stono commented Jul 13, 2023

Bug Description

Hello!
I've been investigating some reports from our users about latency spikes. I've narrowed the investigation down to the case where an application receives a large burst of requests and its connection pool grows very quickly. What I'm observing is that the latency of parallel connections to Cloud SQL (via the Cloud SQL Proxy) increases linearly with the number of connections being made, and this affects every connection attempt, not just the first.

I have a test setup that will connect, perform a query, then disconnect.

For example, here are the timings (cumulative milliseconds from the start of the attempt) for a single connection attempt:

{
  "query": 71,
  "connect": 70,
  "durationMs": 72
}

If I make 5 connection attempts in parallel:

[
  {
    "query": 109,
    "connect": 108,
    "durationMs": 110
  },
  {
    "query": 112,
    "connect": 110,
    "durationMs": 112
  },
  {
    "query": 118,
    "connect": 117,
    "durationMs": 119
  },
  {
    "query": 119,
    "connect": 116,
    "durationMs": 120
  },
  {
    "query": 124,
    "connect": 123,
    "durationMs": 124
  }
]

And here are 10:

[
  {
    "query": 147,
    "connect": 146,
    "durationMs": 147
  },
  {
    "query": 156,
    "connect": 155,
    "durationMs": 156
  },
  {
    "query": 160,
    "connect": 157,
    "durationMs": 160
  },
  {
    "query": 161,
    "connect": 159,
    "durationMs": 161
  },
  {
    "query": 163,
    "connect": 161,
    "durationMs": 163
  },
  {
    "query": 174,
    "connect": 173,
    "durationMs": 174
  },
  {
    "query": 180,
    "connect": 177,
    "durationMs": 180
  },
  {
    "query": 183,
    "connect": 179,
    "durationMs": 183
  },
  {
    "query": 184,
    "connect": 178,
    "durationMs": 184
  },
  {
    "query": 185,
    "connect": 182,
    "durationMs": 185
  }
]

The pattern continues the more connections I make.
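One rough way to put a number on the "linear increase" is to average the `connect` timings at each concurrency level and fit a straight line through them. The sketch below is illustrative only: the helpers `mean` and `olsSlope` and the `runs` table are hypothetical names, not part of the original test script, and the data are just the figures copied from the 1x, 5x and 10x runs above. Three points is a small sample, so treat the slope as indicative at best.

```typescript
// Arithmetic mean of a non-empty array of numbers.
const mean = (xs: number[]): number => xs.reduce((sum, x) => sum + x, 0) / xs.length

// Ordinary least-squares slope of y regressed on x.
const olsSlope = (points: { x: number; y: number }[]): number => {
  const mx = mean(points.map((p) => p.x))
  const my = mean(points.map((p) => p.y))
  let sxy = 0
  let sxx = 0
  for (const { x, y } of points) {
    sxy += (x - mx) * (y - my)
    sxx += (x - mx) ** 2
  }
  return sxy / sxx
}

// "connect" timings (ms) copied from the runs above: [concurrency, per-attempt values].
const runs: [number, number[]][] = [
  [1, [70]],
  [5, [108, 110, 117, 116, 123]],
  [10, [146, 155, 157, 159, 161, 173, 177, 179, 178, 182]],
]

// One point per concurrency level: x = parallel attempts, y = mean connect time.
const points = runs.map(([concurrency, connects]) => ({ x: concurrency, y: mean(connects) }))
const msPerExtraConnection = olsSlope(points)

console.log(points, msPerExtraConnection.toFixed(1))
```

With these numbers the mean connect time goes from 70 ms to 114.8 ms to 166.7 ms, a fitted slope of roughly 10.7 ms per additional concurrent connection.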

Example code (or command)

This is the crude bit of TypeScript I wrote to test this:


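import { Client } from 'pg'

// Minimal stand-in for the requireEnv helper used in the original script.
const requireEnv = (name: string): string => {
  const value = process.env[name]
  if (value === undefined) {
    throw new Error(`Missing required environment variable ${name}`)
  }
  return value
}

// All timings are cumulative milliseconds measured from the start of each attempt.
const timings: { connect: number; query: number; durationMs: number }[] = []

// One attempt: open a fresh connection, run a trivial query, then disconnect.
const connectionTest = async (): Promise<void> => {
  const start = Date.now()
  const pgClient = new Client({
    host: 'postgres',
    port: 5432,
    database: 'istio_test',
    user: requireEnv('AT_POSTGRES_USERNAME')
  })
  await pgClient.connect()
  const connect = Date.now() - start
  const results = await pgClient.query('SELECT 1 + 1 AS solution')
  if (results.rows[0].solution.toString().trim() !== '2') {
    throw new Error('Did not get the expected response from cloudsql')
  }
  const query = Date.now() - start
  await pgClient.end()
  const durationMs = Date.now() - start
  timings.push({ query, connect, durationMs })
}

// The original ran inside an HTTP handler and read the count from req.query.iterations.
const runConnectionTests = async (iterations: number): Promise<void> => {
  const promises: Promise<void>[] = []
  for (let i = 0; i < iterations; i += 1) {
    promises.push(connectionTest()) // all attempts are started before any is awaited
  }
  await Promise.all(promises)
}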

This uses the pg library. However, all of our users are on Java, so this isn't a client-library issue.

Steps to reproduce?

Code sample provided above

Environment

  1. OS type and version: Rocky Linux 8
  2. Cloud SQL Proxy version (./cloud-sql-proxy --version): 2.5.0
  3. Proxy invocation command (for example, ./cloud-sql-proxy --port 5432 INSTANCE_CONNECTION_NAME): --private-ip --prometheus --http-address 0.0.0.0 --http-port 9739 --auto-iam-authn, termination period: 30

Additional Details

  • I have reproduced this locally using a local Cloud SQL Proxy and gcloud default credentials to connect to the instance.
  • I have reproduced this issue without using IAM (username and password instead).
@Stono Stono added the type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns. label Jul 13, 2023
@Stono
Author

Stono commented Jul 13, 2023

Been doing a bit more testing:

Via proxy (SSL), IAM enabled:

1x connection: {"connect":181,"query":211,"durationMs":216}
50x connections: {"connect":793.9,"query":825.68,"durationMs":828}

Via proxy (SSL), username/password:

1x: {"connect":152,"query":183,"durationMs":188}
50x: {"connect":406.48,"query":436.56,"durationMs":437.86}

No proxy, username/password, no SSL:

1x: {"connect":145,"query":169,"durationMs":169}
50x: {"connect":299.44,"query":329.9,"durationMs":329.9}

This is starting to look more like a Cloud SQL problem than a proxy problem?

@Stono
Author

Stono commented Jul 13, 2023

Thought I'd try against local pg14 Docker instances:

1x: {"connect":30,"query":32,"durationMs":37}
50x: {"connect":88.18,"query":91.74,"durationMs":99.04}
100x: {"connect":115.23,"query":121.95,"durationMs":132.03}

So now I'm questioning my test script... I'm going to try with a different pg client and eventually a different language.

One interesting observation regardless: at 50 concurrent connections, IAM is about 2x as slow as username/password for new connections.
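The "2x" figure depends on the concurrency level. A quick way to check it is to take the ratios directly from the numbers posted earlier in this thread; the snippet below only does that arithmetic, and the `connectMs` object and `slowdown` helper are illustrative names. Note that the "no proxy" run also had SSL disabled, so the proxy-vs-direct ratio mixes the cost of TLS with the cost of the extra hop.

```typescript
// "connect" figures (ms) posted above: a single connection vs 50 concurrent connections.
const connectMs = {
  proxyIam: { single: 181, fifty: 793.9 },
  proxyPassword: { single: 152, fifty: 406.48 },
  directPasswordNoSsl: { single: 145, fifty: 299.44 },
}

// How many times slower setup `a` is than setup `b`.
const slowdown = (a: number, b: number): number => a / b

const iamVsPasswordSingle = slowdown(connectMs.proxyIam.single, connectMs.proxyPassword.single)
const iamVsPasswordFifty = slowdown(connectMs.proxyIam.fifty, connectMs.proxyPassword.fifty)
const proxyVsDirectFifty = slowdown(connectMs.proxyPassword.fifty, connectMs.directPasswordNoSsl.fifty)

console.log(
  iamVsPasswordSingle.toFixed(2),
  iamVsPasswordFifty.toFixed(2),
  proxyVsDirectFifty.toFixed(2),
)
```

This gives about 1.19x for a single connection but about 1.95x at 50 concurrent connections, so in these measurements the IAM penalty grows with concurrency; the proxy (with TLS) vs direct (without TLS) ratio at 50 connections is about 1.36x.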

@enocom enocom added priority: p1 Important issue which blocks shipping the next release. Will be fixed prior to next release. type: feature request ‘Nice-to-have’ improvement, new feature or different behavior or design. and removed type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns. labels Jul 13, 2023
@enocom enocom changed the title Parallel connection attempts cause a lot of latency Provide performance benchmarks with and without Proxy Jul 13, 2023
@enocom
Member

enocom commented Jul 13, 2023

Thanks for the response @Stono.

We don't provide performance benchmarks, but increasingly see a need for it. Let's repurpose this issue for our team to provide some baseline numbers for comparison.

Generally, the Proxy will introduce some overhead, as there are a few extra hops to your database (localhost -> proxy -> server-side proxy -> localhost db). Separately, we recommend scaling the Proxy with more CPU for more throughput and more memory for more connections.

Finally, if you're writing an app in Node.js, we do have a connector now: https://github.com/GoogleCloudPlatform/cloud-sql-nodejs-connector. That will eliminate some of the latency and is worth trying out.

@Stono
Author

Stono commented Jul 13, 2023

Hey @enocom, thanks for the response. I'm pretty sure the latency we're observing now is not the proxy anyway. Even testing against a local Postgres instance with no proxy, I observe similar behaviours: if you fire a batch of connection requests at Postgres in parallel, they all take a long time to connect. Adding TLS and Workload Identity on Cloud SQL exacerbates that.

That said, providing latency numbers for the proxy is always welcome, to help folks make informed decisions, particularly for latency-sensitive applications! You'll probably see what I'm seeing if you start benchmarking the proxy yourselves. Try 1, then 10, then 50 concurrent connections.

For context, we use the proxy as a language-agnostic way to wrap up connecting to Cloud SQL on our internal platform (so the apps don't need to worry about it, and they're written in Python, Java, Node, Bash, whatever, haha). It works really nicely; I actually much prefer this approach to getting people to instrument their code with libraries that need to be kept up to date (we have circa 600 apps).
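A minimal sketch of the 1, 10 and 50 concurrent-connection sweep suggested in the comment above, assuming the caller supplies a `connectOnce` function (hypothetical; for example, one that performs the pg connect, query and end from the test script earlier in this issue). Each level fires all of its attempts at once and times each attempt individually; levels run one after another so they don't overlap. `measureLevel`, `sweep` and `LevelResult` are illustrative names, not an existing API.

```typescript
// Latency summary for one concurrency level.
interface LevelResult {
  concurrency: number
  meanMs: number
  maxMs: number
}

// Fire `concurrency` attempts at the same time and time each attempt individually.
const measureLevel = async (
  connectOnce: () => Promise<void>,
  concurrency: number,
): Promise<LevelResult> => {
  const timeOne = async (): Promise<number> => {
    const start = Date.now()
    await connectOnce()
    return Date.now() - start
  }
  const latencies = await Promise.all(Array.from({ length: concurrency }, () => timeOne()))
  const total = latencies.reduce((sum, ms) => sum + ms, 0)
  return { concurrency, meanMs: total / latencies.length, maxMs: Math.max(...latencies) }
}

// Run each level in turn (never overlapping) so one level's load does not skew the next.
const sweep = async (
  connectOnce: () => Promise<void>,
  levels: number[] = [1, 10, 50],
): Promise<LevelResult[]> => {
  const results: LevelResult[] = []
  for (const level of levels) {
    results.push(await measureLevel(connectOnce, level))
  }
  return results
}
```

Keeping the connection logic behind `connectOnce` makes it easy to run the same sweep against the Proxy, a direct private-IP connection, or a local Postgres, which is the comparison this thread keeps coming back to.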

@enocom
Member

enocom commented Jul 13, 2023

This is definitely an area where we'd like to provide more guidance to help folks make their own measurements and compare against what we expect customers to see.

I'll update here when we have more on the topic.

@honDhan

honDhan commented Jul 14, 2023

Hi! Wanted to also mention that performance benchmarks would be appreciated.
Our company currently deploys on GKE with each pod having our application container, and cloud sql proxy as a sidecar.
Some questions I've had when looking into potential solutions:

  • Is it better to use the Python client library, or an app+sidecar? Not sure whether the Python overhead is lower than the k8s traffic overhead.
  • Is there a performance benefit in upgrading to v2? We currently use v1.

Maybe my questions can help the team think through which kinds of benchmarks are useful. Thanks in advance!

@enocom
Member

enocom commented Jul 14, 2023

In-process connectors will provide a better user experience and will likely be faster, given they don't have to make the localhost hop and go straight to the database's proxy server.

We haven't done any formal benchmarking of v1 vs v2, but v2 will start up faster. V2 offers a bunch of additional benefits, like support for tracing (as shown above), Prometheus support, and others. So we do recommend upgrading.

@bobintornado

bobintornado commented Jul 25, 2023

@enocom thanks for the info!
Is there a "Why upgrade to v2" page available somewhere? Many thanks in advance!

@enocom
Member

enocom commented Jul 25, 2023

We have a migration guide, but don't explicitly compare v1 and v2 other than here.

In addition to the feature list in the README, v2 will start up faster and continue to get new features. v1 meanwhile will continue to get security updates, but new features will land in v2.

@Stono
Author

Stono commented Jul 26, 2023

Just sharing this here as I found it interesting: https://twitter.com/BdKozlovski/status/1684098236426878976?t=SWYsfn24ltvFSyEOKHjjEQ&s=19

Cloudflare uses Postgres at scale; they point out how expensive connections are and how they use https://www.pgbouncer.org to mitigate that.

It led me down a rabbit hole, wondering if the Cloud SQL Proxy could implement such a pattern. Probably scope creep, but it would be cool.
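The core idea behind PgBouncer-style pooling is that the expensive connection handshake is paid once and the connection is then reused. The sketch below is a toy, in-process illustration of that idea under simplifying assumptions (no health checks, timeouts or eviction); it is not how PgBouncer or the Proxy work internally, `TinyPool` and its methods are hypothetical names, and in a real Node.js app the `Pool` class from `pg` already provides this.

```typescript
// Toy pool: reuse idle connections, open new ones up to `max`, otherwise queue callers.
class TinyPool<T> {
  private idle: T[] = []
  private waiters: ((conn: T) => void)[] = []
  private created = 0
  private readonly open: () => Promise<T>
  private readonly max: number

  // `open` stands in for the expensive connect step (TCP, TLS, authentication).
  constructor(open: () => Promise<T>, max: number) {
    this.open = open
    this.max = max
  }

  async acquire(): Promise<T> {
    // Cheapest path: hand back a connection that is already open.
    if (this.idle.length > 0) return this.idle.pop() as T
    // Under the cap: pay the connect cost once for a new connection.
    if (this.created < this.max) {
      this.created += 1
      try {
        return await this.open()
      } catch (err) {
        this.created -= 1 // a failed open must not permanently consume a slot
        throw err
      }
    }
    // At capacity: wait until another caller releases a connection.
    return new Promise<T>((resolve) => this.waiters.push(resolve))
  }

  release(conn: T): void {
    // Prefer handing the connection straight to a queued caller; otherwise park it.
    const next = this.waiters.shift()
    if (next !== undefined) next(conn)
    else this.idle.push(conn)
  }
}
```

Because a released connection goes straight to a queued caller when there is one, a burst of requests turns into a queue for at most `max` connections rather than a burst of new connection handshakes.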

@enocom
Member

enocom commented Aug 1, 2023

For what it's worth, a person can run the Proxy behind pgbouncer. And we have an example here of how to do that in a basic way.

@enocom enocom assigned ruyadorno and unassigned enocom Aug 7, 2023
@enocom enocom added priority: p2 Moderately-important priority. Fix may not be included in next release. and removed priority: p1 Important issue which blocks shipping the next release. Will be fixed prior to next release. labels Feb 22, 2024
@enocom enocom assigned enocom and unassigned ruyadorno Feb 22, 2024
@enocom enocom assigned jackwotherspoon and unassigned enocom May 1, 2024
@kazysgurskas

Hello, sorry for bumping an old issue, but I suspect this is still relevant.

Environment:

gcr.io/cloud-sql-connectors/cloud-sql-proxy:2.8.0-alpine
running on GKE as a sidecar pattern
main application - PHP 8.2 running Laravel ^11 framework, mysql PDO library

Switching to cloud-sql-proxy and IAM authentication introduced a consistent increase of ~60ms in PHP processing time. Considering that PHP processing time with a direct connection via private IP was under 30ms, that is quite concerning.

I tried using a UNIX socket instead of TCP, but did not notice any difference.

Is there anything to be explored here, considering there's no native Cloud SQL connector for PHP?

@enocom
Member

enocom commented Aug 27, 2024

Have you considered connecting directly with private IP?

EDIT: I see you're already using private IP. In general, if you're running a latency sensitive workload, you'll always get the best performance from a direct connection (fewer components involved).

@kazysgurskas

> Have you considered connecting directly with private IP?
>
> EDIT: I see you're already using private IP. In general, if you're running a latency sensitive workload, you'll always get the best performance from a direct connection (fewer components involved).

Yeah, I'm aware of that, the reason I opted for cloud-sql-proxy is the IAM auth and security posture.


7 participants