Faker gets 250% faster #20741

evantahler · 2022-12-20T23:59:24Z

After the platform performance research and the failure to use Faker within the Airbyte platform to generate 1TB of data, I spent some time looking into the bottlenecks of our Python CDK sources. Just like our Java sources, the limiting factors are:

Parallelization of CPU-intensive tasks
JSON parsing is slow

This PR explores solutions to these problems:

Use a WorkerPool to generate the fake records in parallel.
- The code had to be rearchitected to become thread-safe
- The number of worker "threads" is now available within the connector's configuration (default 4)
In those parallel workers, also generate the AirbyteRecord JSON in parallel, ofloading that work from the main thread.
- A stub AirbyteMessageWithCachedJSON message was created that renders its own JSON string upon intilization

How much faster did we get? The speedups in this PR add up to a 250% speedup for a 100,000 user sync (20s to 7s)!

Notes:

Testing (SAT) now needs to run with only one thread or else you can't predict the order of the records returned

sherifnada

I spent some time looking into the bottlenecks of our Python CDK sources

Do you have some graphs or numbers you can share about CPU hotspots?

do you have a good understanding of why this speed up happened?
My understanding of Python is that due to the GIL, multi-threading doesn’t help on CPU-bound operations like Faker (so threads are mostly useful to speed up blocking I/O bound operations). This is just my mental model which is subject to change when brought into contact with real world experience.

evantahler · 2022-12-21T22:51:40Z

@sherifnada this code uses multi-process code, and your mental model is right! I initially started with mutli-threading, and indeed, there's wasn't a speed up. WorkerPool actually makes a few different python process and shares data with IPC, and that's how we win.

evantahler · 2022-12-21T23:37:16Z

/test connector=connectors/source-faker

🕑 connectors/source-faker https://github.com/airbytehq/airbyte/actions/runs/3753636010
✅ connectors/source-faker https://github.com/airbytehq/airbyte/actions/runs/3753636010
Python tests coverage:

Name                                               Stmts   Miss  Cover
----------------------------------------------------------------------
source_faker/__init__.py                               2      0   100%
source_faker/source.py                                17      3    82%
source_faker/streams.py                              191     53    72%
source_faker/utils.py                                 18      6    67%
source_faker/airbyte_message_with_cached_json.py       8      4    50%
----------------------------------------------------------------------
TOTAL                                                236     66    72%
	 Name                                                 Stmts   Miss  Cover   Missing
	 ----------------------------------------------------------------------------------
	 source_acceptance_test/base.py                          12      4    67%   16-19
	 source_acceptance_test/config.py                       140      5    96%   87, 93, 238, 242-243
	 source_acceptance_test/conftest.py                     208     92    56%   36, 42-44, 49, 54, 77, 83, 89-91, 110, 115-117, 123-125, 131-132, 137-138, 143, 149, 158-167, 173-178, 193, 217, 248, 254, 262-267, 275-280, 288-301, 306-312, 319-330, 337-353
	 source_acceptance_test/plugin.py                        69     25    64%   22-23, 31, 36, 120-140, 144-148
	 source_acceptance_test/tests/test_core.py              402    115    71%   53, 58, 93-104, 109-116, 120-121, 125-126, 308, 346-363, 376-387, 391-396, 402, 435-440, 478-485, 528-530, 533, 598-606, 618-621, 626, 682-683, 689, 692, 728-738, 751-776
	 source_acceptance_test/tests/test_incremental.py       158     14    91%   52-59, 64-77, 240
	 source_acceptance_test/utils/asserts.py                 39      2    95%   62-63
	 source_acceptance_test/utils/common.py                  94     10    89%   16-17, 32-38, 72, 75
	 source_acceptance_test/utils/compare.py                 62     23    63%   21-51, 68, 97-99
	 source_acceptance_test/utils/connector_runner.py       133     33    75%   24-27, 46-47, 50-54, 57-58, 73-75, 78-80, 83-85, 88-90, 93-95, 124-125, 159-161, 208
	 source_acceptance_test/utils/json_schema_helper.py     107     13    88%   30-31, 38, 41, 65-68, 96, 120, 192-194
	 ----------------------------------------------------------------------------------
	 TOTAL                                                 1603    336    79%

Build Passed

Test summary info:

=========================== short test summary info ============================
SKIPPED [1] ../usr/local/lib/python3.9/site-packages/source_acceptance_test/tests/test_core.py:386: Backward compatibility tests are disabled for version 1.0.0.
======================== 30 passed, 1 skipped in 37.87s ========================

airbyte-integrations/connectors/source-faker/source_faker/airbyte_message_with_cached_json.py

airbyte-integrations/connectors/source-faker/source_faker/streams.py

airbyte-integrations/connectors/source-faker/source_faker/spec.json

airbyte-integrations/connectors/source-faker/source_faker/streams.py

airbyte-integrations/connectors/source-faker/source_faker/airbyte_message_with_cached_json.py

airbyte-integrations/connectors/source-faker/source_faker/streams.py

evantahler · 2022-12-22T19:40:07Z

Docs for how to allocate more resources to a connector when deployed: https://docs.airbyte.com/operator-guides/configuring-connector-resources#configuring-connector-specific-requirements

evantahler · 2022-12-22T22:44:58Z

airbyte-integrations/connectors/source-faker/source_faker/purchase_generator.py

+    def prepare(self):
+        """
+        Note: the instances of the mimesis generators need to be global.
+        Yes, they *should* be able to be instance variables on this class, which should only instantiated once-per-worker, but that's not quite the case:
+        * relying only on prepare as a pool initializer fails because we are calling the parent process's method, not the fork
+        * Calling prepare() as part of generate() (perhaps checking if self.person is set) and then `print(self, current_process()._identity, current_process().pid)` reveals multiple object IDs in the same process, resetting the internal random counters
+        """
+
+        seed_with_offset = self.seed
+        if self.seed is not None and len(current_process()._identity) > 0:
+            seed_with_offset = self.seed + current_process()._identity[0]
+
+        global dt
+        global numeric


@sherifnada I was able to address about 1/2 of your comments. For the rest, I added docs :D

evantahler · 2022-12-22T23:30:58Z

/test connector=connectors/source-faker

🕑 connectors/source-faker https://github.com/airbytehq/airbyte/actions/runs/3761891204
✅ connectors/source-faker https://github.com/airbytehq/airbyte/actions/runs/3761891204
Python tests coverage:

Name                                               Stmts   Miss  Cover
----------------------------------------------------------------------
source_faker/__init__.py                               2      0   100%
source_faker/streams.py                              126      1    99%
source_faker/source.py                                17      3    82%
source_faker/utils.py                                 18      6    67%
source_faker/airbyte_message_with_cached_json.py       8      4    50%
source_faker/user_generator.py                        28     16    43%
source_faker/purchase_generator.py                    55     41    25%
----------------------------------------------------------------------
TOTAL                                                254     71    72%
	 Name                                                 Stmts   Miss  Cover   Missing
	 ----------------------------------------------------------------------------------
	 source_acceptance_test/base.py                          12      4    67%   16-19
	 source_acceptance_test/config.py                       140      5    96%   87, 93, 238, 242-243
	 source_acceptance_test/conftest.py                     208     92    56%   36, 42-44, 49, 54, 77, 83, 89-91, 110, 115-117, 123-125, 131-132, 137-138, 143, 149, 158-167, 173-178, 193, 217, 248, 254, 262-267, 275-280, 288-301, 306-312, 319-330, 337-353
	 source_acceptance_test/plugin.py                        69     25    64%   22-23, 31, 36, 120-140, 144-148
	 source_acceptance_test/tests/test_core.py              402    115    71%   53, 58, 93-104, 109-116, 120-121, 125-126, 308, 346-363, 376-387, 391-396, 402, 435-440, 478-485, 528-530, 533, 598-606, 618-621, 626, 682-683, 689, 692, 728-738, 751-776
	 source_acceptance_test/tests/test_incremental.py       158     14    91%   52-59, 64-77, 240
	 source_acceptance_test/utils/asserts.py                 39      2    95%   62-63
	 source_acceptance_test/utils/common.py                  94     10    89%   16-17, 32-38, 72, 75
	 source_acceptance_test/utils/compare.py                 62     23    63%   21-51, 68, 97-99
	 source_acceptance_test/utils/connector_runner.py       133     33    75%   24-27, 46-47, 50-54, 57-58, 73-75, 78-80, 83-85, 88-90, 93-95, 124-125, 159-161, 208
	 source_acceptance_test/utils/json_schema_helper.py     107     13    88%   30-31, 38, 41, 65-68, 96, 120, 192-194
	 ----------------------------------------------------------------------------------
	 TOTAL                                                 1603    336    79%

Build Passed

Test summary info:

=========================== short test summary info ============================
SKIPPED [1] ../usr/local/lib/python3.9/site-packages/source_acceptance_test/tests/test_core.py:386: Backward compatibility tests are disabled for version 1.0.0.
======================== 30 passed, 1 skipped in 37.60s ========================

* [faker] decouple stream state * add PR # * commit Stream instantiate changes * fixup expected record * skip backward test for this version too * Apply suggestions from code review Co-authored-by: Augustin <augustin@airbyte.io> * lint * Create realistic datasets of 10GB, 100GB, and 1TB in size (#20558) * Faker CSV Streaming utilities * readme * don't do a final pipe to jq or you will run out or ram * doc * Faker gets 250% faster (#20741) * Faker is 250% faster * threads in spec + lint * pass tests * revert changes to record helper * cleanup * update expected_records * bump default records-per-slice to 1k * enforce unique email addresses * cleanup * more comments * `parallelism` and pass tests * update expected records * cleanup notes * update readme * update expected records * auto-bump connector version Co-authored-by: Augustin <augustin@airbyte.io> Co-authored-by: Octavia Squidington III <octavia-squidington-iii@users.noreply.github.com>

Faker is 250% faster

abf9b49

octavia-squidington-iv added area/connectors Connector related issues CDK Connector Development Kit connectors/source/faker labels Dec 20, 2022

evantahler added 2 commits December 20, 2022 16:12

threads in spec + lint

77de03f

pass tests

e1200af

evantahler requested a review from sherifnada December 21, 2022 00:29

sherifnada reviewed Dec 21, 2022

View reviewed changes

revert changes to record helper

1660099

octavia-squidington-iv removed the CDK Connector Development Kit label Dec 21, 2022

cleanup

104d7dd

evantahler requested a review from sherifnada December 21, 2022 23:23

evantahler changed the title ~~Faker is 250% faster~~ Faker gets 250% faster Dec 21, 2022

update expected_records

5becbc1

evantahler added 2 commits December 21, 2022 15:37

Merge branch 'evan/faker-hydra' into evan/faker-hydra-multiprocess

5c490a2

bump default records-per-slice to 1k

4c60b07

evantahler marked this pull request as ready for review December 21, 2022 23:52

evantahler requested a review from a team December 21, 2022 23:52

enforce unique email addresses

f2f8e6f

sherifnada suggested changes Dec 22, 2022

View reviewed changes

evantahler added 3 commits December 22, 2022 13:43

cleanup

da4c02e

more comments

1def34c

parallelism and pass tests

ebbe63e

evantahler requested a review from sherifnada December 22, 2022 22:40

evantahler added 2 commits December 22, 2022 14:42

update expected records

310605b

cleanup notes

64a5045

evantahler commented Dec 22, 2022

View reviewed changes

evantahler mentioned this pull request Dec 22, 2022

Create test data sets of varying sizes (10GB/100GB/1TB) #20410

Closed

sherifnada approved these changes Dec 24, 2022

View reviewed changes

evantahler merged commit fe328c3 into evan/faker-hydra Jan 2, 2023

evantahler deleted the evan/faker-hydra-multiprocess branch January 2, 2023 22:33

evantahler mentioned this pull request Jan 3, 2023

Increase source-faker Resource Requirements #20976

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Faker gets 250% faster #20741

Faker gets 250% faster #20741

evantahler commented Dec 20, 2022 •

edited

Loading

sherifnada left a comment •

edited

Loading

evantahler commented Dec 21, 2022

evantahler commented Dec 21, 2022 •

edited by github-actions bot

Loading

evantahler commented Dec 22, 2022

evantahler Dec 22, 2022

evantahler commented Dec 22, 2022 •

edited by github-actions bot

Loading

Faker gets 250% faster #20741

Faker gets 250% faster #20741

Conversation

evantahler commented Dec 20, 2022 • edited Loading

sherifnada left a comment • edited Loading

Choose a reason for hiding this comment

evantahler commented Dec 21, 2022

evantahler commented Dec 21, 2022 • edited by github-actions bot Loading

Build Passed

evantahler commented Dec 22, 2022

evantahler Dec 22, 2022

Choose a reason for hiding this comment

evantahler commented Dec 22, 2022 • edited by github-actions bot Loading

Build Passed

evantahler commented Dec 20, 2022 •

edited

Loading

sherifnada left a comment •

edited

Loading

evantahler commented Dec 21, 2022 •

edited by github-actions bot

Loading

evantahler commented Dec 22, 2022 •

edited by github-actions bot

Loading