
Upsert optimization #325

Merged 8 commits from upsert-optimization into hyperledger:main on Nov 19, 2021
Conversation

peterbroadhurst (Contributor) commented:

One of the earliest findings from @shorsher in #316 is that messages can be injected into FireFly by a parallel-loading test runner significantly faster than batches can be processed.

This is not surprising, since message injection currently runs in parallel while batch processing is sequential. Parallelizing the back-end is something we can look at later.

However, the most immediately actionable finding from log analysis was just how many DB operations we were doing.

In a simple laptop test, each PostgreSQL request (within a single TX) took 3-5ms, so that adds up fast!

We found with the first example analyzed:

  • There were 120 messages in the batch
  • Overall processing of the batch took about 4 seconds
  • The DB operations to load the messages (x120) + data (x120) into the database took about 3 of those 4 seconds
  • Each DB operation was an upsert, meaning:
    • Message ops took 4 DB calls - SELECT message, UPDATE message, DELETE datarefs, INSERT datarefs
    • Data ops took 2 DB calls - SELECT data, UPDATE data
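
As a back-of-envelope check, assuming ~4ms per DB call: 120 message upserts x 4 calls + 120 data upserts x 2 calls = 720 calls, and 720 x ~4ms ≈ 2.9s, which lines up with the ~3 of 4 seconds observed.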

So the first improvement in the hardening phase is to reduce these calls, by optimizing for the case we know we're in.
These optimizations are implemented in SQL, but in a way that should allow a similar optimization in NoSQL databases once those are implemented. In summary (a code sketch follows the list):

  • Add a node hint to batches, so we know which node sent them
  • Change the message and data signatures on the DB layer to take an UpsertOptimization option:
    • UpsertOptimizationSkip - no optimization - same as today
    • UpsertOptimizationNew - optimize for INSERT
    • UpsertOptimizationExisting - optimize for UPDATE
  • Remove the options for:
    • allowExisting - the logic should always allow for this
    • allowHashUpdate - now always false - the hash of a message or data item cannot change on any code path
  • If we are the sending node, optimize for UPDATE:
    • We use the hash as a key when updating, and check that a row was actually updated
    • For the optimization to work we require the DB to return the number of rows updated (which we believe is well supported)
    • Note that for messages the datarefs are fixed, so on the optimized path we don't touch them on update
  • If we are not the sending node, optimize for INSERT:
    • Just attempt it, and fall back to error checking if it fails
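
To make the pattern concrete, here is a minimal sketch in Go. This is illustrative only, assuming simplified table, column, and type names rather than the actual FireFly DB layer:

// A minimal sketch of the optimized upsert pattern described above.
package sqlcommon

import (
	"context"
	"database/sql"
)

// UpsertOptimization is the hint passed into the DB layer (names match the
// options listed above).
type UpsertOptimization int

const (
	UpsertOptimizationSkip     UpsertOptimization = iota // no optimization - same as today
	UpsertOptimizationNew                                // optimize for INSERT
	UpsertOptimizationExisting                           // optimize for UPDATE
)

// Message is a hypothetical minimal shape, for illustration only.
type Message struct {
	ID        string
	Hash      string
	Confirmed int64
}

type store struct{ db *sql.DB }

// upsertMessageFallback is the pre-existing upsert path (SELECT, then INSERT
// or UPDATE, plus dataref recreation) - only hit when the hint was wrong.
func (s *store) upsertMessageFallback(ctx context.Context, tx *sql.Tx, msg *Message) error {
	// ... original multi-statement upsert, elided for brevity
	return nil
}

// UpsertMessage attempts a single-statement fast path first, based on the
// optimization hint, and falls back to the full upsert if that fails.
func (s *store) UpsertMessage(ctx context.Context, tx *sql.Tx, msg *Message, opt UpsertOptimization) error {
	switch opt {
	case UpsertOptimizationExisting:
		// Optimistic UPDATE keyed on id+hash; RowsAffected tells us whether
		// a row really matched - the DB capability this optimization relies on.
		res, err := tx.ExecContext(ctx,
			`UPDATE messages SET confirmed = $1 WHERE id = $2 AND hash = $3`,
			msg.Confirmed, msg.ID, msg.Hash)
		if err == nil {
			if n, _ := res.RowsAffected(); n == 1 {
				return nil // fast path: 1 DB call, datarefs left untouched
			}
		}
	case UpsertOptimizationNew:
		// Optimistic INSERT; a conflict drops us to the fallback path.
		if _, err := tx.ExecContext(ctx,
			`INSERT INTO messages (id, hash, confirmed) VALUES ($1, $2, $3)`,
			msg.ID, msg.Hash, msg.Confirmed); err == nil {
			return nil // fast path: 1 DB call
		}
	}
	return s.upsertMessageFallback(ctx, tx, msg)
}

The caller derives the hint from the batch's node field: the sending node (which already wrote the records locally) passes UpsertOptimizationExisting, every other node passes UpsertOptimizationNew, and the full upsert is only hit when the hint turns out to be wrong.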

codecov-commenter commented Nov 17, 2021

Codecov Report

Merging #325 (5ab6131) into main (7e4fa6a) will increase coverage by 0.01%.
The diff coverage is 100.00%.


@@             Coverage Diff             @@
##             main      #325      +/-   ##
===========================================
+ Coverage   99.98%   100.00%   +0.01%     
===========================================
  Files         232       232              
  Lines       12574     12631      +57     
===========================================
+ Hits        12572     12631      +59     
+ Misses          1         0       -1     
+ Partials        1         0       -1     
Impacted Files Coverage Δ
pkg/fftypes/batch.go 100.00% <ø> (ø)
internal/assets/token_transfer.go 100.00% <100.00%> (ø)
internal/batch/batch_manager.go 100.00% <100.00%> (ø)
internal/batch/batch_processor.go 100.00% <100.00%> (ø)
internal/broadcast/definition.go 100.00% <100.00%> (ø)
internal/broadcast/message.go 100.00% <100.00%> (ø)
internal/data/blobstore.go 100.00% <100.00%> (ø)
internal/data/data_manager.go 100.00% <100.00%> (ø)
internal/database/sqlcommon/batch_sql.go 100.00% <100.00%> (ø)
internal/database/sqlcommon/config_record_sql.go 100.00% <100.00%> (ø)
... and 30 more


Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@peterbroadhurst marked this pull request as ready for review November 17, 2021 03:05
@shorsher (Member) commented Nov 19, 2021

Here are some comparison results for context. All benchmarks were run on my MBP against a CLI-generated two-node FireFly environment, using PostgreSQL as the DB:

Before #325, we were consistently seeing > 1 minute latency between the last sent message and 0 pending transactions.

Requests      [total, rate, throughput]         3000, 50.02, 49.98
Duration      [total, attack, wait]             1m0s, 59.979s, 41.133ms
Latencies     [min, mean, 50, 90, 95, 99, max]  11.924ms, 54.866ms, 28.605ms, 72.405ms, 111.312ms, 798.671ms, 1.515s
Bytes In      [total, mean]                     1775629, 591.88
Bytes Out     [total, mean]                     372000, 124.00
Success       [ratio]                           100.00%
Status Codes  [code:count]                      202:3000
Error Set:
Elapsed time between last sent message and 0 pending transactions: 1m21.009812988s

Requests      [total, rate, throughput]         3000, 50.02, 49.76
Duration      [total, attack, wait]             1m0s, 59.98s, 303.142ms
Latencies     [min, mean, 50, 90, 95, 99, max]  13.387ms, 175.692ms, 85.487ms, 292.843ms, 1.018s, 1.437s, 1.947s
Bytes In      [total, mean]                     1775678, 591.89
Bytes Out     [total, mean]                     372000, 124.00
Success       [ratio]                           100.00%
Status Codes  [code:count]                      202:3000
Error Set:
Elapsed time between last sent message and 0 pending transactions: 1m59.015816118s

With #325:

Duration      [total, attack, wait]             59.991s, 59.977s, 14.409ms
Latencies     [min, mean, 50, 90, 95, 99, max]  9.681ms, 42.796ms, 21.667ms, 62.05ms, 105.421ms, 650.269ms, 1.123s
Bytes In      [total, mean]                     1775694, 591.90
Bytes Out     [total, mean]                     372000, 124.00
Success       [ratio]                           100.00%
Status Codes  [code:count]                      202:3000
Error Set:
Elapsed time between last sent message and 0 pending transactions: 15.009402704s

Requests      [total, rate, throughput]         3000, 50.02, 49.96
Duration      [total, attack, wait]             1m0s, 59.977s, 64.636ms
Latencies     [min, mean, 50, 90, 95, 99, max]  9.85ms, 22.499ms, 19.238ms, 36.131ms, 46.965ms, 71.564ms, 167.495ms
Bytes In      [total, mean]                     1775656, 591.89
Bytes Out     [total, mean]                     372000, 124.00
Success       [ratio]                           100.00%
Status Codes  [code:count]                      202:3000
Error Set:
Elapsed time between last sent message and 0 pending transactions: 10.005748929s

Requests      [total, rate, throughput]         3000, 50.02, 50.01
Duration      [total, attack, wait]             59.994s, 59.977s, 16.459ms
Latencies     [min, mean, 50, 90, 95, 99, max]  9.671ms, 32.417ms, 18.36ms, 33.254ms, 45.816ms, 597.197ms, 1.161s
Bytes In      [total, mean]                     1775683, 591.89
Bytes Out     [total, mean]                     372000, 124.00
Success       [ratio]                           100.00%
Status Codes  [code:count]                      202:3000
Error Set:
Elapsed time between last sent message and 0 pending transactions: 10.007021619s

Batch processing time was consistently reduced to <= 1 second, down from ~4 seconds.

[2021-11-17T10:29:40.179-05:00]  INFO -> BatchPinComplete batch=f805835c-bc87-41ba-9b4b-e1ce2056369d txn=0xcbc2e712857072b33e50ef40bcd5394699dc4c05645800a49fd0a89a49bd078e signingIdentity=0x447dd9c1170480748ab4fe9717bdc5e96667e55d pid=57135 role=event-manager
......
[2021-11-17T10:29:40.391-05:00]  INFO <- BatchPinComplete batch=f805835c-bc87-41ba-9b4b-e1ce2056369d txn=0xcbc2e712857072b33e50ef40bcd5394699dc4c05645800a49fd0a89a49bd078e signingIdentity=0x447dd9c1170480748ab4fe9717bdc5e96667e55d pid=57135 role=event-manager

🚀 🚀
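
As an aside, the report format above matches the text output of the vegeta load-testing tool. A hypothetical driver for a similar run, using vegeta's Go library, might look like the following (the FireFly URL, port, and payload are assumptions about the test setup, not details taken from this PR):

package main

import (
	"fmt"
	"net/http"
	"time"

	vegeta "github.com/tsenart/vegeta/v12/lib"
)

func main() {
	// 50 req/s for 60s, matching the rate/duration in the reports above.
	rate := vegeta.Rate{Freq: 50, Per: time.Second}
	duration := 60 * time.Second
	targeter := vegeta.NewStaticTargeter(vegeta.Target{
		Method: "POST",
		URL:    "http://localhost:5000/api/v1/namespaces/default/messages/broadcast",
		Body:   []byte(`{"data":[{"value":"load-test"}]}`),
		Header: http.Header{"Content-Type": []string{"application/json"}},
	})
	attacker := vegeta.NewAttacker()

	// Collect per-request results into aggregate metrics.
	var metrics vegeta.Metrics
	for res := range attacker.Attack(targeter, rate, duration, "firefly-broadcast") {
		metrics.Add(res)
	}
	metrics.Close()

	// Print the same headline numbers as the reports above.
	fmt.Printf("mean=%s p99=%s success=%.2f%%\n",
		metrics.Latencies.Mean, metrics.Latencies.P99, metrics.Success*100)
}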

// This is a performance critical function, as we stream data into the database for every message, in every batch.
//
// First attempt the operation based on the optimization passed in.
// The expectation is that this will practically hit of the time, as only recovery paths
Review comment (Contributor):

nit: this sentence feels like it's missing a word (same with the other places it was copied)


if existing {
if recreateDatarefs {
Review comment (Contributor):

The comment a few lines down indicates we "Run through the ones in the message, finding ones that already exist". I don't think that's accurate anymore (we always insert all the refs if we get here, and might delete them all beforehand... but it's never a partial update).
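
For reference, a sketch of the behavior described, continuing the illustrative names from the earlier snippet (the messages_data table and its columns are assumptions, not the real schema):

// On the un-optimized update path, datarefs are fully deleted and fully
// re-inserted - there is never a partial merge with existing rows.
func replaceDatarefs(ctx context.Context, tx *sql.Tx, msgID string, dataIDs []string) error {
	// Delete every existing ref for this message in one statement - no
	// diffing against what is already there.
	if _, err := tx.ExecContext(ctx,
		`DELETE FROM messages_data WHERE message_id = $1`, msgID); err != nil {
		return err
	}
	// Re-insert the full set from the message, preserving order; always
	// all-or-nothing, never a partial update of existing rows.
	for i, dataID := range dataIDs {
		if _, err := tx.ExecContext(ctx,
			`INSERT INTO messages_data (message_id, data_id, data_idx) VALUES ($1, $2, $3)`,
			msgID, dataID, i); err != nil {
			return err
		}
	}
	return nil
}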

@awrichar (Contributor) left a comment:

Looks really good. Couldn't see any issues with the functionality. Left a few small notes on wording.

Also note that db migrations 44-45 were claimed by #317, so this will need to be updated to use 46.

@peterbroadhurst merged commit 5bb28cf into hyperledger:main Nov 19, 2021
@peterbroadhurst deleted the upsert-optimization branch November 19, 2021 16:35