Fix: Resolve conflicting schema issues by removing dependency on PyArrow #67

Merged
merged 14 commits into from
Mar 7, 2024

@aaronsteers aaronsteers commented Feb 25, 2024

Resolves: #89

We've discovered through user testing that the PyArrow dependency, and PyArrow itself, are inherently fragile when it comes to JSON schemas with variable typing. PyArrow excels at handling large volumes of typed data very quickly, but there's no way to process ragged and potentially conflicting types without reverting JSON types to strings.

Parquet, similarly, has problems rendering JSON structures without child nodes. So, something like "extra_properties: {}" would cause a runtime error.
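
To make the record shapes in question concrete, here is a minimal sketch (not code from this PR) that uses only the standard library. The helper name `conflicting_fields` and the sample records are hypothetical. It reports fields whose JSON type changes from one record to the next, which is the case a strictly typed columnar writer cannot represent without falling back to strings.

```python
import json
from collections import defaultdict


def conflicting_fields(jsonl_lines):
    """Return {field: sorted type names} for fields seen with more than one JSON type."""
    seen = defaultdict(set)
    for line in jsonl_lines:
        record = json.loads(line)
        for key, value in record.items():
            seen[key].add(type(value).__name__)
    return {key: sorted(types) for key, types in seen.items() if len(types) > 1}


# Hypothetical records: `id` changes type between records, and the first
# `extra_properties` is an object with no child nodes.
sample = [
    '{"id": 1, "extra_properties": {}}',
    '{"id": "a-2", "extra_properties": {"color": "red"}}',
]
print(conflicting_fields(sample))
```

Here only `id` is flagged, since `extra_properties` is an object in both records even though one of them is empty.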

We've recently moved our file writer to a JSONL file writer, in order to better handle ragged, non-populated objects, and unpredictable types. However, we still have a dependency on pyarrow, which is problematic for some users' workloads.

This PR will remove the pyarrow dependency and replace it with a flow that simply writes JSONL lines to local files as quickly as possible.
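
As a rough illustration of the idea, and not the implementation in this PR, the sketch below writes one JSON document per line with the standard library only. The function names, file name, and sample records are assumptions made for the example. Because each line is serialized independently, ragged records, conflicting types, and empty objects need no shared schema at write time.

```python
import json
import tempfile
from pathlib import Path


def write_jsonl(records, path):
    """Write each record as one JSON document per line; return the record count."""
    count = 0
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")
            count += 1
    return count


def read_jsonl(path):
    """Read the file back, parsing one JSON document per non-empty line."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Hypothetical ragged records: differing keys, a type change on `id`,
# and an object with no child nodes.
records = [
    {"id": 1, "extra_properties": {}},
    {"id": "a-2", "name": "widget"},
    {"id": 3, "extra_properties": {"color": "red"}, "tags": ["x", "y"]},
]

out_path = Path(tempfile.mkdtemp()) / "batch_0.jsonl"
written = write_jsonl(records, out_path)
print(f"Wrote {written} records to {out_path}")
```

Reading the file back with `read_jsonl(out_path)` returns the same records, including the empty `extra_properties` object.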

@aaronsteers aaronsteers changed the title from "(Stub) Removing dependency on PyArrow" to "Refactor: (Stub) Removing dependency on PyArrow" on Feb 25, 2024
@aaronsteers aaronsteers marked this pull request as draft February 25, 2024 17:02
@aaronsteers

Performance snapshot, prior to changes (approx 15K records per second):

AJ-Steers-MacBook-Pro---GGHWM7QWPJ:airbyte-lib-private-beta ajsteers$ poetry run python ./examples/run_faker.py

Enter the value for secret 'DUMMY_SECRET': 
Installing Faker source...
Faker source installed.
Connection check succeeded for `source-faker`.
Started `source-faker` read operation at 09:23:55...

                                                    Read Progress                                                     

Started reading at 17:23:55.                                                                                          

Read 500,100 records over 33 seconds (15,154.5 records / second).                                                     

Wrote 500,100 records over 53 batches.                                                                                

Finished reading at 17:24:28.                                                                                         

Started finalizing streams at 17:24:28.                                                                               

Finalized 53 batches over 3 seconds.                                                                                  

Completed 3 out of 3 streams:                                                                                         

 • purchases                                                                                                          
 • products                                                                                                           
 • users                                                                                                              

Completed writing at 17:24:32. Total time elapsed: 36 seconds                                                         

──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Completed `source-faker` read operation at 09:24:32.
Stream products: 100 records
Stream users: 250000 records
Stream purchases: 250000 records
AJ-Steers-MacBook-Pro---GGHWM7QWPJ:airbyte-lib-private-beta ajsteers$ 

@aaronsteers aaronsteers changed the title from "Refactor: (Stub) Removing dependency on PyArrow" to "Refactor: (Draft) Removing dependency on PyArrow" on Mar 1, 2024
@aaronsteers

Refactoring is functionally complete, although there is still some cleanup to do.

Performance is boosted by approximately 30% - from 15K records per second to 20K records per second.
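
As a quick check on that figure, the throughput can be recomputed from the record count and read durations reported in the two run logs in this thread (500,100 records over 33 and 25 seconds). The helper name is only for illustration, and because the logged durations are rounded to whole seconds the result is approximate.

```python
def records_per_second(records, seconds):
    """Throughput as records divided by elapsed seconds."""
    return records / seconds


before = records_per_second(500_100, 33)  # run before the refactor
after = records_per_second(500_100, 25)   # run after the refactor
speedup_pct = (after / before - 1) * 100

print(f"{before:,.1f} -> {after:,.1f} records/s (+{speedup_pct:.0f}%)")
# prints: 15,154.5 -> 20,004.0 records/s (+32%)
```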

AJ-Steers-MacBook-Pro---GGHWM7QWPJ:PyAirbyte ajsteers$ poetry run python ./examples/run_faker.py

Enter the value for secret 'DUMMY_SECRET': 
Installing Faker source...
Faker source installed.
Connection check succeeded for `source-faker`.
Started `source-faker` read operation at 10:51:43...
Completed `source-faker` read operation at 10:52:13.
Stream products: 100 records
Stream users: 250000 records
Stream purchases: 250000 records

                                               Read Progress                                               

Started reading at 18:51:43.                                                                               

Read 500,100 records over 25 seconds (20,004.0 records / second).                                          

Wrote 500,100 records over 51 batches.                                                                     

Finished reading at 18:52:09.                                                                              

Started finalizing streams at 18:52:09.                                                                    

Finalized 51 batches over 3 seconds.                                                                       
...
AJ-Steers-MacBook-Pro---GGHWM7QWPJ:PyAirbyte ajsteers$ 

@aaronsteers aaronsteers marked this pull request as ready for review March 7, 2024 21:16
@aaronsteers

@bindipankhudi - Tests are passing locally and lint cleanup is complete.

I am taking another pass through the code myself but wanted to update you on the latest status.

@aaronsteers aaronsteers changed the title from "Refactor: (Draft) Removing dependency on PyArrow" to "Refactor: Removing dependency on PyArrow" on Mar 7, 2024
@bindipankhudi

bindipankhudi commented Mar 7, 2024 via email

Review thread on airbyte/_batch_handles.py (outdated, resolved)
@aaronsteers aaronsteers changed the title from "Refactor: Removing dependency on PyArrow" to "Fix: Resolve conflicting schema issues by removing dependency on PyArrow" on Mar 7, 2024

@bindipankhudi bindipankhudi left a comment

Quite some refactoring. Looks great and cleaner than before!

@aaronsteers aaronsteers merged commit ce2bcf4 into main Mar 7, 2024
11 checks passed
@aaronsteers aaronsteers deleted the aj/refactor/remove-pyarrow branch March 7, 2024 23:12
Development

Successfully merging this pull request may close these issues.

🐛 Bug: Failure when subsequent records have fundamentally incompatible schemas
2 participants