Epic: Implement GitHub integration sync #1368

joshsmith opened this issue Jan 12, 2018 · 5 comments
joshsmith commented Jan 12, 2018

Problem

We need to reliably and concurrently sync tasks and comments to and from GitHub in a non-blocking way.

By providing a timestamp-based concurrency control system we can use a known algorithm to make our GitHub integration more robust.

More importantly, we will be able to unblock our other objectives. We cannot proceed with onboarding projects or volunteers unless GitHub sync is stable, since our overall strategy depends on us connecting volunteers to tasks.

Tasks

In scope

  • Clean up and decouple existing modules. The goal is to flatten them as much as possible, making it easier to introduce a queue system
  • Add TaskSyncOperation model
  • Add CommentSyncOperation model
  • Create a TaskSyncOperation when issue webhook is received
  • Create a TaskSyncOperation when pull request webhook is received
  • Create a TaskSyncOperation when the task is created/updated from the client
  • Create a CommentSyncOperation when issue comment webhook is received
  • Create a CommentSyncOperation when the comment is created/updated from the client
  • Consider timestamps from GitHub to be the latest - i.e. don’t be pessimistic (due to second-level granularity) https://platform.github.community/t/timestamp-granularity/4663
  • Define proposal for the queuing system
  • Add an admin dashboard for the operations

Out of scope

  • Add back pressure for rate limits
  • Respond to 304 Not Modified for both GET and PATCH
  • find_or_create vs create_or_update (we should probably change to find_or_create) → XLinker
  • Add fetch step after receiving the webhook
  • Provide queue feedback to the user for the task
  • Provide queue feedback to the user for the comment
  • Figure out whether users only see what they’re allowed to see (the primary concern is installations)
  • Double-check timestamp when processing
  • Figure out if an atomic step system is feasible, where we would not need operations and each record update could instead be executed safely on its own.
  • Think about breaking apart sync steps into their own “operations” vs Ecto.Multi transactions

Outline

We would have a sync operation for each type of internal record we want to sync. For example:

  • TaskSyncOperation
  • CommentSyncOperation

Every sync operation record, regardless of type, would have a:

  • direction - :inbound | :outbound
  • github_app_installation_id - the id of the app installation for this sync
  • github_updated_at - the last updated at timestamp for the resource on GitHub
  • canceled_by - the id of the SyncOperation that canceled this one
  • duplicate_of - the id of the SyncOperation that this is a duplicate of
  • dropped_for - the id of the SyncOperation that this was dropped in favor of
  • state
    • :queued - waiting to be processed
    • :processing - currently being processed; limited to one per instance of the synced record, e.g. comment_id
    • :completed - successfully synced
    • :errored - should be paired with a reason for the error
    • :canceled - another operation supersedes this one, so we should not process it
    • :dropped - this operation was outdated and was dropped
    • :duplicate - another operation already existed that matched the timestamp for this one
    • :disabled - we received the operation but cannot sync it because the repo no longer syncs to a project

Then each type would have type-specific fields, e.g. a CommentSyncOperation would have:

  • comment_id - the id of our comment record
  • github_comment_id - the id of our cached record for the external resource
  • github_comment_external_id - the id of the resource from the external provider (GitHub)
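
As a sketch only (Python used here as neutral pseudocode, since the actual implementation would be an Ecto schema; the `error_reason` field is my own hypothetical addition to pair a reason with `:errored`), the shared and type-specific fields of a CommentSyncOperation might look like:

```python
from dataclasses import dataclass
from datetime import datetime
from enum import Enum
from typing import Optional

class State(Enum):
    QUEUED = "queued"          # waiting to be processed
    PROCESSING = "processing"  # at most one per synced record instance
    COMPLETED = "completed"
    ERRORED = "errored"
    CANCELED = "canceled"
    DROPPED = "dropped"
    DUPLICATE = "duplicate"
    DISABLED = "disabled"

@dataclass
class CommentSyncOperation:
    # fields shared by every sync operation type
    direction: str                       # "inbound" | "outbound"
    github_app_installation_id: int
    github_updated_at: datetime          # last updated at on GitHub
    state: State = State.QUEUED
    canceled_by: Optional[int] = None    # id of the superseding operation
    duplicate_of: Optional[int] = None   # id of the timestamp-matching operation
    dropped_for: Optional[int] = None    # id of the operation this was dropped for
    error_reason: Optional[str] = None   # hypothetical: reason paired with ERRORED
    # type-specific fields
    comment_id: Optional[int] = None                   # our comment record
    github_comment_id: Optional[int] = None            # our cached external record
    github_comment_external_id: Optional[int] = None   # GitHub's own id
```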

If the event is due to the resource being created, there will not be a conflict: a resource created from our own clients has no external GitHub ID yet, and likewise an event coming in from an external provider has no internal record yet. I'm not yet clear on whether we should do any conflict checking for these event types, but my guess is no; they should likely jump straight to :processing.

When an event comes in from GitHub we should (using a github_comment as our example):

  • delegate to the proper sync operation table for the particular resource (in our example this would be comment_sync_operations)
  • check if there are any operations for the github_comment_external_id where:
    • the github_updated_at is after our operation's last updated timestamp (limit 1)
      • if yes, set state to :dropped and stop processing, set dropped_for to the id of the operation in the limit 1 query
    • the github_updated_at timestamp for the relevant record is equal to our operation's last updated timestamp (limit 1)
      • if yes, set state to :duplicate and stop processing, set duplicate_of to the id of the operation in the limit 1
    • the modified_at timestamp for the relevant record is after our operation's last updated timestamp
      • if yes, set state to :dropped and stop processing, set dropped_for to the id of the operation in the limit 1 query
  • check if there are any :queued operations for the integration_external_id where:
    • github_updated_at is before our operation's last updated timestamp
      • if yes, set state of those operations to :canceled and set canceled_by to the id of this event
  • check if there is any other :queued operation or :processing operation for the integration_external_id
    • if yes, set state to :queued
  • when :processing, check again to see if we can proceed, then create or update the comment through the relationship on the record for comment_id
  • when :completed, kick off process to look for next :queued item where the github_updated_at timestamp is the oldest
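
The checks above could be sketched as follows (an illustrative Python function over in-memory dicts; the real system would run each check as a `limit 1` SQL query, the record-level `modified_at` check is omitted here, and `classify_incoming` is a hypothetical name, not from the codebase):

```python
def classify_incoming(op, existing_ops):
    """Decide the initial state of an incoming sync operation.

    `op` and each entry of `existing_ops` are dicts carrying at least
    `id`, `github_comment_external_id`, `github_updated_at` (any
    comparable value), and `state`. Mutates and returns `op`.
    """
    same = [e for e in existing_ops
            if e["github_comment_external_id"] == op["github_comment_external_id"]]

    # A newer operation already exists -> drop this one.
    newer = [e for e in same if e["github_updated_at"] > op["github_updated_at"]]
    if newer:
        op["state"], op["dropped_for"] = "dropped", newer[0]["id"]
        return op

    # An operation with an identical timestamp exists -> duplicate.
    same_ts = [e for e in same if e["github_updated_at"] == op["github_updated_at"]]
    if same_ts:
        op["state"], op["duplicate_of"] = "duplicate", same_ts[0]["id"]
        return op

    # Cancel queued operations that are older than this one.
    for e in same:
        if e["state"] == "queued" and e["github_updated_at"] < op["github_updated_at"]:
            e["state"], e["canceled_by"] = "canceled", op["id"]

    # Queue behind any remaining queued/processing op; otherwise process now.
    busy = any(e["state"] in ("queued", "processing") for e in same)
    op["state"] = "queued" if busy else "processing"
    return op
```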

Within the logic that updates the given record, we would also need to check whether the record's updated timestamp is after the operation's timestamp. If it is, we need to bubble up the changeset validation error and mark the operation as :dropped per the above.
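
That guard reduces to a single timestamp comparison (hypothetical helper name, sketched in Python):

```python
def can_apply(record_modified_at, op_github_updated_at):
    """True if the local record has not been modified after the operation's
    GitHub timestamp; when False, the changeset error should bubble up and
    the operation should be marked :dropped."""
    return record_modified_at <= op_github_updated_at
```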

Some upsides of the approaches above that I wanted to document, in no particular order:

  • The tracking above generates some implicit audit trails that will be helpful for debugging.
  • Any unique-per-record queued operations can be run in parallel without issue, i.e. we can run operations for %Comment{id: 1} and %Comment{id: 2} without any conflict.
  • We can avoid "thundering herd" problems when the system receives back pressure by having control over precisely how the queue is processed.
  • We can use this in conjunction with rate limiting to only process the number of events we have in the queue for the given rate limit and defer further processing until after the rate limit has expired.

joshsmith commented Jan 12, 2018

A thought on the queue: perhaps we could spawn a task every time an operation record is enqueued or processed. The task looks for operation records to process and exits if none can currently be processed.

We would want to process every operation that does not have another operation for the same record currently being processed.

By way of example, assume the following operations:

  1. github_issue_github_id: 1 - processing
  2. github_issue_github_id: 1 - queued
  3. github_issue_github_id: 2 - queued
  4. github_issue_github_id: 2 - queued

If we were to spawn a task now, we would get the following:

  1. github_issue_github_id: 1 - processing
  2. github_issue_github_id: 1 - queued
  3. github_issue_github_id: 2 - processing
  4. github_issue_github_id: 2 - queued

Assume 1 now moves to processed. Our task would then move 2 to processing while 4 remains queued:

  1. github_issue_github_id: 1 - processed
  2. github_issue_github_id: 1 - processing
  3. github_issue_github_id: 2 - processing
  4. github_issue_github_id: 2 - queued

Later still, 3 moves to processed, so then 4 moves to processing:

  1. github_issue_github_id: 1 - processed
  2. github_issue_github_id: 1 - processing
  3. github_issue_github_id: 2 - processed
  4. github_issue_github_id: 2 - processing

And so on.

The queue is therefore asynchronous across resources but synchronous per resource.
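
That per-resource constraint can be sketched as a scheduling function (illustrative Python; `next_processable` is a hypothetical name, and the real task would re-run this as a query each time an operation is enqueued or finishes):

```python
def next_processable(ops):
    """Return ids of queued operations that may move to :processing.

    At most one operation per github_issue_github_id may be processing at a
    time, so a queued op is eligible only if no other op for the same
    resource is in flight. `ops` is assumed already ordered (oldest
    github_updated_at first), so earlier queued ops win.
    """
    busy = {o["github_issue_github_id"] for o in ops if o["state"] == "processing"}
    picked = []
    for o in ops:
        rid = o["github_issue_github_id"]
        if o["state"] == "queued" and rid not in busy:
            picked.append(o["id"])
            busy.add(rid)  # only one per resource may start
    return picked
```

Run against the four-operation example above, this promotes operation 3 while leaving 2 and 4 queued behind their in-flight siblings.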


joshsmith commented Jan 12, 2018

Tightly grouped events happen fairly frequently, particularly when issues/pull requests are labeled, assigned, etc at time of creation:

[Screenshot not reproduced: multiple "simultaneous" events occurring on the same pull request.]

We'll likely need to consider some type of ordering-system that is not timestamp-based solely, but some combination of the timestamp, event, and data that changed. In the short-term, it can probably be resolved by dropping those events and simply using the fetch of the remote data to make our changes.

joshsmith changed the title from Implement GitHub integration sync to Epic: Implement GitHub integration sync on Jan 12, 2018
joshsmith commented

Might I suggest a quick name change here: instead of TaskSyncTransaction we could call this a TaskSyncOperation. Thoughts?


begedin commented Jan 15, 2018

Operation means less overlap and confusion when discussing actual transactions, so I'm all for it.

joshsmith commented

I've updated the bits above to use operation over transaction everywhere, except the Ecto.Multi case.
