Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Avoid locking common records and remove transaction annotation #2036

Merged
merged 6 commits into from
Jul 15, 2022

Conversation

collado-mike
Copy link
Collaborator

@collado-mike collado-mike commented Jul 14, 2022

Problem

The OpenLineage API implementation is unnecessarily wrapped in a @Transaction annotation, which means that under heavy write load, there's a lot of unnecessary contention. For example, we lock the Namespace record in the upsertNamespaceRow method, even though we don't really update the namespace. Since the whole request is wrapped in a transaction, the namespace record is locked for the entire time we write new jobs, runs, and datasets. If many jobs for the same namespace run simultaneously, each OpenLineage request will wait until the namespace record is unlocked, forming a queue of requests, rather than handling each in parallel. This also happens for JobContexts and RunArgs, as oftentimes, many jobs will run with no arguments, no source code location, and no SQL facet. Many completely unrelated jobs will block while awaiting a lock on an empty JobContext and/or empty RunArgs.

I updated each of these upsert queries to do nothing on conflict (they're really immutable records) to avoid locking the impacted record.

Moreover, I don't see a reason for this API to be wrapped in a transaction at all. These changes don't need to be made atomically. OpenLineage requests are mostly idempotent, so if only partial database updates are applied and the same request made twice, there's no duplication in the data. Jobs, Runs, Datasets, JobContexts, and RunArgs created in the first request will be found and used in the second. Foreign key constraints require us to always write a new record before updating an existing one to reference it, so there's no opportunity to leave the database in an inconsistent state.

This screenshot shows the database under heavy write load - note that waits are dominated by transactionid (Waiting for a transaction to finish) and the primary contender in this case is the NamespaceDao.

Screen Shot 2022-07-14 at 1 41 37 PM

Note: All database schema changes require discussion. Please link the issue for context.

Checklist

  • You've signed-off your work
  • Your changes are accompanied by tests (if relevant)
  • Your change contains a small diff and is self-contained
  • You've updated any relevant documentation (if relevant)
  • You've updated the CHANGELOG.md with details about your change under the "Unreleased" section (if relevant, depending on the change, this may not be necessary)
  • You've versioned your .sql database schema migration according to Flyway's naming convention (if relevant)
  • You've included a header in any source code files (if relevant)

Signed-off-by: Michael Collado <collado.mike@gmail.com>
Signed-off-by: Michael Collado <collado.mike@gmail.com>
Signed-off-by: Michael Collado <collado.mike@gmail.com>
Signed-off-by: Michael Collado <collado.mike@gmail.com>
Signed-off-by: Michael Collado <collado.mike@gmail.com>
@collado-mike collado-mike requested a review from wslulciuc July 14, 2022 20:46
@codecov
Copy link

codecov bot commented Jul 14, 2022

Codecov Report

Merging #2036 (0d1f82d) into main (cdee331) will increase coverage by 0.03%.
The diff coverage is 100.00%.

❗ Current head 0d1f82d differs from pull request most recent head dfb0aee. Consider uploading reports for the commit dfb0aee to get more accurate results

@@             Coverage Diff              @@
##               main    #2036      +/-   ##
============================================
+ Coverage     78.89%   78.92%   +0.03%     
- Complexity     1014     1018       +4     
============================================
  Files           197      199       +2     
  Lines          5548     5557       +9     
  Branches        421      421              
============================================
+ Hits           4377     4386       +9     
  Misses          724      724              
  Partials        447      447              
Impacted Files Coverage Δ
api/src/main/java/marquez/db/OpenLineageDao.java 95.78% <ø> (ø)
api/src/main/java/marquez/db/JobContextDao.java 100.00% <100.00%> (ø)
api/src/main/java/marquez/db/JobVersionDao.java 89.85% <100.00%> (+0.46%) ⬆️
api/src/main/java/marquez/db/NamespaceDao.java 91.30% <100.00%> (+0.82%) ⬆️
api/src/main/java/marquez/db/RunArgsDao.java 100.00% <100.00%> (ø)

📣 Codecov can now indicate which changes are the most critical in Pull Requests. Learn more

default NamespaceRow upsertNamespaceRow(
UUID uuid, Instant now, String name, String currentOwnerName) {
doUpsertNamespaceRow(uuid, now, name, currentOwnerName);
return findNamespaceByName(name).orElseThrow();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor: The RETURNING * was to avoid inserting, then querying for the namespace in two separate queries. But, also understand the cost of always updating updated_at to trigger the modified row to be returned. At some point, I'd love to flush out the usage of updated_at for all of our tables.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately, the RETURNING * only returns if the record was modified - if we want to DO NOTHING to avoid locking the record, we have to execute a separate query 🤷🏽‍♂️

Copy link
Member

@wslulciuc wslulciuc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great analysis and write up @collado-mike on the DB performance! Who knew such "minor" changes could improve query performance so much. As for transactions, I agree. Originally, we wanted to keep the write path for Marquez in sync with metadata collected via OL. Now, I do think we'll have to wrap some DB calls in a transaction, but let's add them as we encounter data inconsistencies.

@wslulciuc wslulciuc enabled auto-merge (squash) July 15, 2022 23:16
@wslulciuc wslulciuc merged commit 78eeae0 into main Jul 15, 2022
@wslulciuc wslulciuc deleted the namespace_no_conflict branch July 15, 2022 23:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants