source-postgres: profile all tests #31298

Closed
postamar opened this issue Oct 11, 2023 · 2 comments
Labels
connectors/source/postgres team/db-dw-sources Backlog for Database and Data Warehouse Sources team

Comments

@postamar
Contributor

postamar commented Oct 11, 2023

In Q4a we have the goal of bringing the CI for source-postgres down to under 15 minutes.

In order to do this, we need to collect metrics on what the CI is doing. @alafanechere is currently building a dashboard tracking end-to-end latency. On my end, I need to look more granularly at how time is spent in unit tests and integration tests. Gradle scans (for instance https://scans.gradle.com/s/6ejg436oz6kt2/tests/overview) are useful, but for each long-running test I'd like to understand why it runs long: is it actually doing work, or is it waiting for things to happen? For instance, CdcPostgresSourceTest waits 1 minute for Debezium to emit records.

The definition of done for this task is a top-10 list of the tests that waste the most time.
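
As a starting point for that list, here is a minimal sketch of how per-test timings could be collected; the module path, task name, and report location are assumptions about the project layout rather than anything confirmed here. Both the published build scan and the local HTML report break durations down per test.

```sh
# Assumed module path and task name; adjust to the actual project layout.
./gradlew :airbyte-integrations:connectors:source-postgres:test --scan
# Per-test durations also land in the local HTML report (opened here on macOS):
open airbyte-integrations/connectors/source-postgres/build/reports/tests/test/index.html
```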

@postamar postamar added team/db-dw-sources Backlog for Database and Data Warehouse Sources team connectors/source/postgres labels Oct 11, 2023
@prateekmukhedkar
Contributor

"CdcPostgresSourceTest waits 1 minute for debezium to emit records" -> this may be related to Debezium engine shutdown.

@postamar
Contributor Author

I posted the results of this in Slack and I'll repost them here:

https://airbytehq-team.slack.com/archives/C05GA80UVRD/p1698095298318999

I did some JVM profiling on source-postgres today.

Findings
There are three pieces of low-hanging fruit that stand out, starting with the lowest-hanging:

  1. Waiting for testcontainers. It's worse than I thought. See for yourselves in the flame graphs: click on the little 🔍 icon in the top left and type "org/testcontainers" in the regex modal. I think this is also the easiest thing to fix (just create fewer of them; see also the aside below).
  2. Waiting for the connector container itself. This was surprising to me. I knew Java connectors were slow to start but I didn't realize how much that compounded. Hit 🔍 again and type "readBytes" to visualize it. Speeding this up probably involves slimming down the CDK dependencies, which involves a bit of toil.
  3. Waiting for Debezium. No surprise there. I don't know how low-hanging this fruit is, but waiting a whole minute during every test is the biggest item in our critical path, so it deserves some investigation at least.

We should be in a much, much better place after these are dealt with!
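
As an aside on item 1, and not necessarily the fix proposed there (which is simply to create fewer containers), a related lever is Testcontainers' opt-in container reuse. A minimal sketch, with the assumption that tests also mark their containers with .withReuse(true):

```sh
# Opt into Testcontainers container reuse on the machine running the tests;
# containers must additionally be created with .withReuse(true) in test code.
echo "testcontainers.reuse.enable=true" >> ~/.testcontainers.properties
```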

Methodology
I made pretty flame graphs profiling wall-clock time with async-profiler, using -e wall -t -d 3600 -i 100000 -I 'org/junit/jupiter/*'. The tests I chose are the ones listed in the spreadsheet I made a couple of weeks ago.
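
For anyone reproducing this, a minimal sketch of how the profiler can be attached; the path to profiler.sh and the use of jps to find the test worker JVM are assumptions about the local setup, and <pid> is a placeholder.

```sh
# Find the PID of the running test worker JVM (its displayed name varies by Gradle version).
jps -l
# Attach with the same options as above and dump a wall-clock flame graph.
./profiler.sh -e wall -t -d 3600 -i 100000 -I 'org/junit/jupiter/*' \
    -f flamegraph.html <pid>
```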

https://airbytehq-team.slack.com/archives/C05GA80UVRD/p1698181410463169

I did some JVM profiling on source-postgres again, this time on what happens when we just run --spec.

Findings
There's no getting around the fact that Java program startup is slow, but there are a few things we can do that readily improve performance in tests:

  1. Put fewer jars on the classpath. We spend a significant amount of time unzipping jars. We already knew this had to be done in the CDK for other good reasons.
  2. Bumping to slf4j 2.x leads to some improvements by avoiding some expensive classloader nonsense when setting up the logger, which is itself a considerable portion of the startup time.
  3. We have a fancy log4j config which is important in prod but expensive to set up in tests. It can be bypassed entirely by passing another config via JAVA_OPTS="-Dlog4j.configurationFile=zzz", which in turn can be passed to the connector container in integration tests by setting JOB_DEFAULT_ENV_JAVA_OPTS, yay.
  4. We can similarly pass JIT tuning parameters better suited to the short-lived Java processes we see in integration tests. For instance, -XX:TieredStopAtLevel=1 tells the JVM not to bother compiling things too thoroughly. (Items 3 and 4 are combined in the sketch after this list.)
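
Combined, items 3 and 4 amount to something like the following, a minimal sketch assuming the environment variable is picked up by the integration test harness as described above:

```sh
# Bypass the expensive log4j setup and lower JIT effort for the short-lived
# connector processes launched during integration tests.
export JOB_DEFAULT_ENV_JAVA_OPTS="-Dlog4j.configurationFile=zzz -XX:TieredStopAtLevel=1"
```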

The main action item here is to clean up dependencies, which it turns out have a significant impact on test time (because of connector startup latencies).
In addition, there are a few cute tweaks in the form of CLI flags and properties which are of more marginal utility but which are easy, transparent and low-risk in a testing context. Without removing any dependencies, these alone take me from 600ms down to 350ms on my MacBook, but I'm not sure this can be improved much further just by passing useful flags.

Methodology
Same as yesterday's post, I use async-profiler. First I build ./gradlew ...:source-postgres:distTar, then I go into ..../source-postgres/build/distributions and run tar xf source-postgres*. Then I run JAVA_OPTS="-XX:TieredStopAtLevel=1 -Dlog4j.configurationFile=zzz -agentpath:/Users/postamar/dev/async-profiler-2.9-macos/build/libasyncProfiler.so=start,event=cpu,file=profile.html,interval=100000" source-postgres-0.50.32/bin/source-postgres --spec. Note that this is a CPU sample and not a wall-clock sample, but as far as the main thread is concerned the flame graph is basically the same.

flamegraphs.zip
