-
Notifications
You must be signed in to change notification settings - Fork 4.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Faker gets 250% faster #20741
Faker gets 250% faster #20741
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I spent some time looking into the bottlenecks of our Python CDK sources
Do you have some graphs or numbers you can share about CPU hotspots?
do you have a good understanding of why this speed up happened?
My understanding of Python is that due to the GIL, multi-threading doesn’t help on CPU-bound operations like Faker (so threads are mostly useful to speed up blocking I/O bound operations). This is just my mental model which is subject to change when brought into contact with real world experience.
@sherifnada this code uses multi-process code, and your mental model is right! I initially started with mutli-threading, and indeed, there's wasn't a speed up. WorkerPool actually makes a few different python process and shares data with IPC, and that's how we win. |
/test connector=connectors/source-faker
Build PassedTest summary info:
|
airbyte-integrations/connectors/source-faker/source_faker/airbyte_message_with_cached_json.py
Show resolved
Hide resolved
airbyte-integrations/connectors/source-faker/source_faker/streams.py
Outdated
Show resolved
Hide resolved
airbyte-integrations/connectors/source-faker/source_faker/streams.py
Outdated
Show resolved
Hide resolved
airbyte-integrations/connectors/source-faker/source_faker/spec.json
Outdated
Show resolved
Hide resolved
airbyte-integrations/connectors/source-faker/source_faker/spec.json
Outdated
Show resolved
Hide resolved
airbyte-integrations/connectors/source-faker/source_faker/streams.py
Outdated
Show resolved
Hide resolved
airbyte-integrations/connectors/source-faker/source_faker/streams.py
Outdated
Show resolved
Hide resolved
airbyte-integrations/connectors/source-faker/source_faker/airbyte_message_with_cached_json.py
Outdated
Show resolved
Hide resolved
airbyte-integrations/connectors/source-faker/source_faker/streams.py
Outdated
Show resolved
Hide resolved
airbyte-integrations/connectors/source-faker/source_faker/streams.py
Outdated
Show resolved
Hide resolved
Docs for how to allocate more resources to a connector when deployed: https://docs.airbyte.com/operator-guides/configuring-connector-resources#configuring-connector-specific-requirements |
def prepare(self): | ||
""" | ||
Note: the instances of the mimesis generators need to be global. | ||
Yes, they *should* be able to be instance variables on this class, which should only instantiated once-per-worker, but that's not quite the case: | ||
* relying only on prepare as a pool initializer fails because we are calling the parent process's method, not the fork | ||
* Calling prepare() as part of generate() (perhaps checking if self.person is set) and then `print(self, current_process()._identity, current_process().pid)` reveals multiple object IDs in the same process, resetting the internal random counters | ||
""" | ||
|
||
seed_with_offset = self.seed | ||
if self.seed is not None and len(current_process()._identity) > 0: | ||
seed_with_offset = self.seed + current_process()._identity[0] | ||
|
||
global dt | ||
global numeric |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@sherifnada I was able to address about 1/2 of your comments. For the rest, I added docs :D
/test connector=connectors/source-faker
Build PassedTest summary info:
|
* [faker] decouple stream state * add PR # * commit Stream instantiate changes * fixup expected record * skip backward test for this version too * Apply suggestions from code review Co-authored-by: Augustin <augustin@airbyte.io> * lint * Create realistic datasets of 10GB, 100GB, and 1TB in size (#20558) * Faker CSV Streaming utilities * readme * don't do a final pipe to jq or you will run out or ram * doc * Faker gets 250% faster (#20741) * Faker is 250% faster * threads in spec + lint * pass tests * revert changes to record helper * cleanup * update expected_records * bump default records-per-slice to 1k * enforce unique email addresses * cleanup * more comments * `parallelism` and pass tests * update expected records * cleanup notes * update readme * update expected records * auto-bump connector version Co-authored-by: Augustin <augustin@airbyte.io> Co-authored-by: Octavia Squidington III <octavia-squidington-iii@users.noreply.github.com>
After the platform performance research and the failure to use Faker within the Airbyte platform to generate 1TB of data, I spent some time looking into the bottlenecks of our Python CDK sources. Just like our Java sources, the limiting factors are:
This PR explores solutions to these problems:
AirbyteMessageWithCachedJSON
message was created that renders its own JSON string upon intilizationHow much faster did we get? The speedups in this PR add up to a 250% speedup for a 100,000 user sync (20s to 7s)!
Notes: