Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve throughput #59

Open
wants to merge 3 commits into
base: master
Choose a base branch
from
Open

Improve throughput #59

wants to merge 3 commits into from

Conversation

labianchin
Copy link
Collaborator

@labianchin labianchin commented Apr 15, 2019

  1. Encode JDBC ResultSet into ByteBuffer using directBinaryEncoder, write using appendEncoded(). This avoids a bit of copying bytes between buffers.
  2. Use a BlockingQueue to asynchronously read from JDBC and write to file.

Early experiments show that it can improve throughput by ~ 15 ~ 30 %.

master/#60 (https://travis-ci.org/spotify/dbeam/builds/520639708#L1545):

scenario    records  writeElapsedMs  msPerMillionRows  bytesWritten  kBps
deflate1t5  1000000  10288           10288             43722188      4249.823
deflate1t5  1000000  9884            9884              43722188      4423.531
deflate1t5  1000000  9819            9819              43722188      4452.814
||query     1000000  9888            8570              61220677      6191.411
||query     1000000  9835            8620              52475971      5335.635
||query     1000000  9874            8380              52475971      5314.56

#61 (just encode to binary, no multi threading) (https://travis-ci.org/spotify/dbeam/builds/520647554#L1524):

scenario    records  writeElapsedMs  msPerMillionRows  bytesWritten  kBps
deflate1t5  1000000  9674            9674              53403623      5520.324
deflate1t5  1000000  9407            9407              53403623      5677.008
deflate1t5  1000000  9396            9396              53403623      5683.655
||query     1000000  9484            8580              53411427      5631.74
||query     1000000  9596            8705              74772355      7792.033
||query     1000000  9800            9130              53411427      5450.145

This PR (https://travis-ci.org/spotify/dbeam/builds/520648264#L1539):

scenario    records  writeElapsedMs  msPerMillionRows  bytesWritten  kBps
deflate1t5  1000000  8464            8464              53406242      6309.811
deflate1t5  1000000  7791            7791              53406242      6854.863
deflate1t5  1000000  7741            7741              53406242      6899.139
||query     1000000  8110            7350              74773099      9219.864
||query     1000000  8510            7075              74773099      8786.498
||query     1000000  8204            7330              64093112      7812.422

@labianchin labianchin force-pushed the encodeManually branch 5 times, most recently from f022b8b to 8b42101 Compare April 16, 2019 08:29
@anish749
Copy link
Contributor

The results look promising ..

@labianchin labianchin force-pushed the encodeManually branch 3 times, most recently from 3d6007a to bdeec3b Compare July 8, 2019 13:24
@codecov
Copy link

codecov bot commented Jul 8, 2019

Codecov Report

Merging #59 into master will decrease coverage by 0.66%.
The diff coverage is 81.81%.

@@             Coverage Diff              @@
##             master      #59      +/-   ##
============================================
- Coverage     89.69%   89.02%   -0.67%     
- Complexity      177      181       +4     
============================================
  Files            22       23       +1     
  Lines           679      711      +32     
  Branches         51       53       +2     
============================================
+ Hits            609      633      +24     
- Misses           47       54       +7     
- Partials         23       24       +1

@codecov
Copy link

codecov bot commented Aug 5, 2019

Codecov Report

Merging #59 into master will decrease coverage by 0.46%.
The diff coverage is 84.61%.

@@             Coverage Diff              @@
##             master      #59      +/-   ##
============================================
- Coverage     90.08%   89.62%   -0.47%     
  Complexity      230      230              
============================================
  Files            25       26       +1     
  Lines           908      925      +17     
  Branches         65       65              
============================================
+ Hits            818      829      +11     
- Misses           59       65       +6     
  Partials         31       31              

@labianchin labianchin force-pushed the encodeManually branch 3 times, most recently from eed0043 to cc47f5c Compare August 5, 2019 14:23
@anish749
Copy link
Contributor

I don't remember, but did you also try separating out the read and writes into two different DoFns? Does that move the thread management to Beam?

@labianchin
Copy link
Collaborator Author

I don't remember, but did you also try separating out the read and writes into two different DoFns? Does that move the thread management to Beam?

I remember trying on earlier versions of DBeam to have two phases: read JDBC and write to Avro. Problem was that Beam was waiting for the read bundle to complete, serialize and then write to Avro, which was very inefficient. If we found a way to "stream" from different DoFns (including the fanout: one jdbc query fans out to millions of records), then we could rely on Beam for that.

@anish749
Copy link
Contributor

thanks, I remember now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants