
Upsert optimization #325

Merged 8 commits from upsert-optimization into hyperledger:main on Nov 19, 2021
Conversation

peterbroadhurst (Contributor) commented:

One of the earliest findings from @shorsher in #316 is that messages can be injected into FireFly by a parallel-loading test runner significantly faster than batches can be processed.

This is not surprising, since message injection currently runs in parallel while batch processing is sequential. Parallelizing the back-end is something we can look at later.

However, the most immediately actionable finding from log analysis was just how many DB operations we were doing.

In a simple laptop test, each PostgreSQL request (within a single TX) took 3-5ms, so that adds up fast!

We found with the first example analyzed:

  • There were 120 messages in the batch
  • Overall processing of the batch took about 4 seconds
  • The DB operations to load the messages (x120) + data (x120) into the database took about 3 of those 4 seconds
  • Each DB operation was an upsert, meaning:
    • Message ops took 4 DB calls - SELECT message, UPDATE message, DELETE datarefs, INSERT datarefs
    • Data ops took 2 DB calls - SELECT data, UPDATE data
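
As a back-of-envelope check, assuming ~4ms per DB call: 120 message upserts x 4 calls + 120 data upserts x 2 calls = 720 calls, and 720 x ~4ms ≈ 2.9s, which lines up with the ~3 of 4 seconds observed.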

So the first improvement in the hardening phase is to reduce these calls, by optimizing for the case we know we're in.
These optimizations are implemented in SQL, but in a way that should allow a similar optimization in NoSQL databases once those are implemented. In summary (a code sketch follows the list):

  • Add a node hint to batches, so we know which node sent them
  • Change the message and data signatures on the DB layer to take an UpsertOptimization option:
    • UpsertOptimizationSkip - no optimization - same as today
    • UpsertOptimizationNew - optimize for INSERT
    • UpsertOptimizationExisting - optimize for UPDATE
  • Remove the options for:
    • allowExisting - the logic should always allow for this
    • allowHashUpdate - now always false - the hash of a message or data item cannot change on any code path
  • If we are the sending node, optimize for UPDATE:
    • We use the hash as a key when updating, and check that a row was actually updated
    • For the optimization to work we require the DB to return the number of rows updated (which we believe is well supported)
    • Note that for messages the datarefs are fixed, so on the optimized path we don't touch them on update
  • If we are not the sending node, optimize for INSERT:
    • Just attempt it, and fall back to error checking if it fails
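
To make the pattern concrete, here is a minimal sketch in Go. This is illustrative only, assuming simplified table, column, and type names rather than the actual FireFly DB layer:

// A minimal sketch of the optimized upsert pattern described above.
package sqlcommon

import (
	"context"
	"database/sql"
)

// UpsertOptimization is the hint passed into the DB layer (names match the
// options listed above).
type UpsertOptimization int

const (
	UpsertOptimizationSkip     UpsertOptimization = iota // no optimization - same as today
	UpsertOptimizationNew                                // optimize for INSERT
	UpsertOptimizationExisting                           // optimize for UPDATE
)

// Message is a hypothetical minimal shape, for illustration only.
type Message struct {
	ID        string
	Hash      string
	Confirmed int64
}

type store struct{ db *sql.DB }

// upsertMessageFallback is the pre-existing upsert path (SELECT, then INSERT
// or UPDATE, plus dataref recreation) - only hit when the hint was wrong.
func (s *store) upsertMessageFallback(ctx context.Context, tx *sql.Tx, msg *Message) error {
	// ... original multi-statement upsert, elided for brevity
	return nil
}

// UpsertMessage attempts a single-statement fast path first, based on the
// optimization hint, and falls back to the full upsert if that fails.
func (s *store) UpsertMessage(ctx context.Context, tx *sql.Tx, msg *Message, opt UpsertOptimization) error {
	switch opt {
	case UpsertOptimizationExisting:
		// Optimistic UPDATE keyed on id+hash; RowsAffected tells us whether
		// a row really matched - the DB capability this optimization relies on.
		res, err := tx.ExecContext(ctx,
			`UPDATE messages SET confirmed = $1 WHERE id = $2 AND hash = $3`,
			msg.Confirmed, msg.ID, msg.Hash)
		if err == nil {
			if n, _ := res.RowsAffected(); n == 1 {
				return nil // fast path: 1 DB call, datarefs left untouched
			}
		}
	case UpsertOptimizationNew:
		// Optimistic INSERT; a conflict drops us to the fallback path.
		if _, err := tx.ExecContext(ctx,
			`INSERT INTO messages (id, hash, confirmed) VALUES ($1, $2, $3)`,
			msg.ID, msg.Hash, msg.Confirmed); err == nil {
			return nil // fast path: 1 DB call
		}
	}
	return s.upsertMessageFallback(ctx, tx, msg)
}

The caller derives the hint from the batch's node field: the sending node (which already wrote the records locally) passes UpsertOptimizationExisting, every other node passes UpsertOptimizationNew, and the full upsert is only hit when the hint turns out to be wrong.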

codecov-commenter commented Nov 17, 2021

Codecov Report

Merging #325 (5ab6131) into main (7e4fa6a) will increase coverage by 0.01%.
The diff coverage is 100.00%.


@@             Coverage Diff             @@
##             main      #325      +/-   ##
===========================================
+ Coverage   99.98%   100.00%   +0.01%     
===========================================
  Files         232       232              
  Lines       12574     12631      +57     
===========================================
+ Hits        12572     12631      +59     
+ Misses          1         0       -1     
+ Partials        1         0       -1     
Impacted Files Coverage Δ
pkg/fftypes/batch.go 100.00% <ø> (ø)
internal/assets/token_transfer.go 100.00% <100.00%> (ø)
internal/batch/batch_manager.go 100.00% <100.00%> (ø)
internal/batch/batch_processor.go 100.00% <100.00%> (ø)
internal/broadcast/definition.go 100.00% <100.00%> (ø)
internal/broadcast/message.go 100.00% <100.00%> (ø)
internal/data/blobstore.go 100.00% <100.00%> (ø)
internal/data/data_manager.go 100.00% <100.00%> (ø)
internal/database/sqlcommon/batch_sql.go 100.00% <100.00%> (ø)
internal/database/sqlcommon/config_record_sql.go 100.00% <100.00%> (ø)
... and 30 more


Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@peterbroadhurst marked this pull request as ready for review November 17, 2021 03:05
@shorsher (Member) commented Nov 19, 2021

Here are some comparison results for context. All benchmarks were run on my MBP against a CLI-generated two-node FireFly environment, using PostgreSQL as the DB:

Before #325, we were consistently seeing > 1 minute latency between the last sent message and 0 pending transactions.

Requests      [total, rate, throughput]         3000, 50.02, 49.98
Duration      [total, attack, wait]             1m0s, 59.979s, 41.133ms
Latencies     [min, mean, 50, 90, 95, 99, max]  11.924ms, 54.866ms, 28.605ms, 72.405ms, 111.312ms, 798.671ms, 1.515s
Bytes In      [total, mean]                     1775629, 591.88
Bytes Out     [total, mean]                     372000, 124.00
Success       [ratio]                           100.00%
Status Codes  [code:count]                      202:3000
Error Set:
Elapsed time between last sent message and 0 pending transactions: 1m21.009812988s

Requests      [total, rate, throughput]         3000, 50.02, 49.76
Duration      [total, attack, wait]             1m0s, 59.98s, 303.142ms
Latencies     [min, mean, 50, 90, 95, 99, max]  13.387ms, 175.692ms, 85.487ms, 292.843ms, 1.018s, 1.437s, 1.947s
Bytes In      [total, mean]                     1775678, 591.89
Bytes Out     [total, mean]                     372000, 124.00
Success       [ratio]                           100.00%
Status Codes  [code:count]                      202:3000
Error Set:
Elapsed time between last sent message and 0 pending transactions: 1m59.015816118s

With #325:

Duration      [total, attack, wait]             59.991s, 59.977s, 14.409ms
Latencies     [min, mean, 50, 90, 95, 99, max]  9.681ms, 42.796ms, 21.667ms, 62.05ms, 105.421ms, 650.269ms, 1.123s
Bytes In      [total, mean]                     1775694, 591.90
Bytes Out     [total, mean]                     372000, 124.00
Success       [ratio]                           100.00%
Status Codes  [code:count]                      202:3000
Error Set:
Elapsed time between last sent message and 0 pending transactions: 15.009402704s

Requests      [total, rate, throughput]         3000, 50.02, 49.96
Duration      [total, attack, wait]             1m0s, 59.977s, 64.636ms
Latencies     [min, mean, 50, 90, 95, 99, max]  9.85ms, 22.499ms, 19.238ms, 36.131ms, 46.965ms, 71.564ms, 167.495ms
Bytes In      [total, mean]                     1775656, 591.89
Bytes Out     [total, mean]                     372000, 124.00
Success       [ratio]                           100.00%
Status Codes  [code:count]                      202:3000
Error Set:
Elapsed time between last sent message and 0 pending transactions: 10.005748929s

Requests      [total, rate, throughput]         3000, 50.02, 50.01
Duration      [total, attack, wait]             59.994s, 59.977s, 16.459ms
Latencies     [min, mean, 50, 90, 95, 99, max]  9.671ms, 32.417ms, 18.36ms, 33.254ms, 45.816ms, 597.197ms, 1.161s
Bytes In      [total, mean]                     1775683, 591.89
Bytes Out     [total, mean]                     372000, 124.00
Success       [ratio]                           100.00%
Status Codes  [code:count]                      202:3000
Error Set:
Elapsed time between last sent message and 0 pending transactions: 10.007021619s

Batch processing time was consistently reduced to <= 1 second, down from ~4 seconds.

[2021-11-17T10:29:40.179-05:00]  INFO -> BatchPinComplete batch=f805835c-bc87-41ba-9b4b-e1ce2056369d txn=0xcbc2e712857072b33e50ef40bcd5394699dc4c05645800a49fd0a89a49bd078e signingIdentity=0x447dd9c1170480748ab4fe9717bdc5e96667e55d pid=57135 role=event-manager
......
[2021-11-17T10:29:40.391-05:00]  INFO <- BatchPinComplete batch=f805835c-bc87-41ba-9b4b-e1ce2056369d txn=0xcbc2e712857072b33e50ef40bcd5394699dc4c05645800a49fd0a89a49bd078e signingIdentity=0x447dd9c1170480748ab4fe9717bdc5e96667e55d pid=57135 role=event-manager

🚀 🚀
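
As an aside, the report format above matches the text output of the vegeta load-testing tool. A hypothetical driver for a similar run, using vegeta's Go library, might look like the following (the FireFly URL, port, and payload are assumptions about the test setup, not details taken from this PR):

package main

import (
	"fmt"
	"net/http"
	"time"

	vegeta "github.com/tsenart/vegeta/v12/lib"
)

func main() {
	// 50 req/s for 60s, matching the rate/duration in the reports above.
	rate := vegeta.Rate{Freq: 50, Per: time.Second}
	duration := 60 * time.Second
	targeter := vegeta.NewStaticTargeter(vegeta.Target{
		Method: "POST",
		URL:    "http://localhost:5000/api/v1/namespaces/default/messages/broadcast",
		Body:   []byte(`{"data":[{"value":"load-test"}]}`),
		Header: http.Header{"Content-Type": []string{"application/json"}},
	})
	attacker := vegeta.NewAttacker()

	// Collect per-request results into aggregate metrics.
	var metrics vegeta.Metrics
	for res := range attacker.Attack(targeter, rate, duration, "firefly-broadcast") {
		metrics.Add(res)
	}
	metrics.Close()

	// Print the same headline numbers as the reports above.
	fmt.Printf("mean=%s p99=%s success=%.2f%%\n",
		metrics.Latencies.Mean, metrics.Latencies.P99, metrics.Success*100)
}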

// This is a performance critical function, as we stream data into the database for every message, in every batch.
//
// First attempt the operation based on the optimization passed in.
// The expectation is that this will practically hit of the time, as only recovery paths
Review comment (Contributor):

nit: this sentence feels like it's missing a word (same with the other places it was copied)


if existing {
if recreateDatarefs {
Review comment (Contributor):

The comment a few lines down indicates we "Run through the ones in the message, finding ones that already exist". I don't think that's accurate anymore (we always insert all the refs if we get here, and might delete them all beforehand... but it's never a partial update).
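
For reference, a sketch of the behavior described, continuing the illustrative names from the earlier snippet (the messages_data table and its columns are assumptions, not the real schema):

// On the un-optimized update path, datarefs are fully deleted and fully
// re-inserted - there is never a partial merge with existing rows.
func replaceDatarefs(ctx context.Context, tx *sql.Tx, msgID string, dataIDs []string) error {
	// Delete every existing ref for this message in one statement - no
	// diffing against what is already there.
	if _, err := tx.ExecContext(ctx,
		`DELETE FROM messages_data WHERE message_id = $1`, msgID); err != nil {
		return err
	}
	// Re-insert the full set from the message, preserving order; always
	// all-or-nothing, never a partial update of existing rows.
	for i, dataID := range dataIDs {
		if _, err := tx.ExecContext(ctx,
			`INSERT INTO messages_data (message_id, data_id, data_idx) VALUES ($1, $2, $3)`,
			msgID, dataID, i); err != nil {
			return err
		}
	}
	return nil
}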

@awrichar (Contributor) left a comment:

Looks really good. Couldn't see any issues with the functionality. Left a few small notes on wording.

Also note that db migrations 44-45 were claimed by #317, so this will need to be updated to use 46.

@peterbroadhurst merged commit 5bb28cf into hyperledger:main Nov 19, 2021
@peterbroadhurst deleted the upsert-optimization branch November 19, 2021 16:35