Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Google Sheets Incremental Sync #3936

Closed
wallies opened this issue Jun 8, 2021 · 9 comments
Closed

Google Sheets Incremental Sync #3936

wallies opened this issue Jun 8, 2021 · 9 comments

Comments

@wallies
Copy link
Contributor

wallies commented Jun 8, 2021

Tell us about the new connector you’d like to have

  • Which source and which destination?
    Google Sheets Incremental Sync is documented as coming soon. Do we have a date on this?
    With Incremental sync or full sync is it possible to create new files in S3 based on volume or time windows

  • Do you need a specific version of the underlying data source e.g: you specifically need support for an older version of the API or DB? No

Describe the context around this new connector

  • Which team in your company wants this integration, what for? This helps us understand the use case.
    Data Integration team

  • How often do you want to run syncs?
    Full syncs and incremental syncs happen based on time windows or volume

  • If this is an API source connector, which entities/endpoints do you need supported?

Describe the alternative you are considering or using

What are you considering doing if you don’t have this integration through Airbyte?
Looking at rudderstack

@wallies wallies added area/connectors Connector related issues new-connector labels Jun 8, 2021
@sherifnada
Copy link
Contributor

@wallies how would you expect incremental sync to work with Sheets? Specifically, would you expect that an entire row is re-synced if any cell changes? I'm assuming new rows would be replicated as well.

@wallies
Copy link
Contributor Author

wallies commented Jun 8, 2021

@sherifnada We currently sync from datasources not supported into google sheets like survicate. I would expect if any row or cell changes or new rows added that this would sync, instead of syncing the entire sheet which could be thousands of rows.

@sherifnada sherifnada added lang/python type/enhancement New feature or request and removed new-connector labels Jun 8, 2021
@sherifnada
Copy link
Contributor

Implementation note: we should use this opportunity to explore moving the connector to use the CDK

@annalvova05
Copy link
Contributor

annalvova05 commented Jul 27, 2021

Google Sheets API does not have the ability to support full incremental sync.

Incremental SYNC OPTIONS:

Option 1

(API side) sync only new rows:
https://developers.google.com/sheets/api/reference/rest/v4/spreadsheets.values/get
stream_state = {‘row_number’: ‘<last_row_number>’}

Pros:

  • API side filtering

Cons:

  • does not sync changed rows

Option 2

(API side) sync with filter based on user-specified cursor_field (column):
https://developers.google.com/sheets/api/reference/rest/v4/spreadsheets.values/batchGetByDataFilter
stream_state = {‘user_specified_cursor_field‘: ‘<max_value_in_cursor_column’}

Pros:

  • sync new and updated records

Cons:

  • client-side filtering (we still read all the data)
  • user must specify cursor_field which is not a natural or easy part of many analysts' workflow

Option 3

(Client side) directly compare file revisions:
https://developers.google.com/drive/api/v3/reference/revisions/list

stream_state = {‘file_revision’: ‘<last_file_revision>’}

Pros:

  • sync new and updated records

Cons:

  • сlient-side filtering
  • performance issues when compare big google sheets.

@sherifnada
Copy link
Contributor

sherifnada commented Jul 27, 2021

@wallies do you have any preferences or feedback on the options above? I think we're leaning towards option 1 but it still comes with asterisks i.e: if you rearrange rows for any reason then you might incur data loss.

@wallies
Copy link
Contributor Author

wallies commented Jul 27, 2021

@sherifnada I was thinking Option 1 would be a good place to start. Although thinking about it more, what I was actually thinking when raising this was more Option 2, as we have partials in google sheets, that get updated, so would be better to sync on a modified date, which is a column.

@sherifnada
Copy link
Contributor

@oustynova oustynova moved this to Ready for implementation in GL Roadmap Dec 15, 2021
@oustynova oustynova moved this from Ready for implementation to Backlog in GL Roadmap Dec 17, 2021
@sherifnada sherifnada moved this from Backlog to Ready for implementation in GL Roadmap Dec 23, 2021
@midavadim midavadim moved this from Ready for implementation to On hold in GL Roadmap Dec 24, 2021
@midavadim
Copy link
Contributor

Changed status to "on Hold" becuase the implementation approach is not defined.

@sherifnada sherifnada moved this from On hold to Ready for implementation in GL Roadmap Jan 12, 2022
@sherifnada sherifnada moved this from Ready for implementation to Blocked in GL Roadmap Jan 13, 2022
@igrankova igrankova moved this to Backlog (unscoped) in GL Roadmap Feb 2, 2022
@sherifnada
Copy link
Contributor

We're closing this issue as the feasibility study above indicated it's not possible to implement incremental syncs reliably. Given the scale of a typical spreadsheet, full refresh syncs are usually fine to pick up new records/deletes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
No open projects
Archived in project
Development

No branches or pull requests

7 participants
@wallies @midavadim @sherifnada @annalvova05 @oustynova @igrankova and others