This repository has been archived by the owner on Dec 4, 2024. It is now read-only.

Add deduplication of source page views #76

Merged (3 commits) on Feb 10, 2022

Conversation

MarkMacArdle
Contributor

Description & motivation

Segment's source table for page views in your warehouse may contain multiple rows for a single page_view_id. I'm seeing rows that share an id but whose received_at timestamps differ by up to a week. This was causing two issues:

  • Unique tests on the primary keys for segment_web_page_views and segment_web_page_views_sessionized would fail
  • Subsequent runs of the package would fail because the merge performed by an incremental run raised an error. Downstream tables would then stop updating (running with --full-refresh would allow an update).

Adding deduping CTEs fixes these problems. I'm using BigQuery.
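
For readers who want to see the idea, here is a minimal sketch of a deduping CTE. It is not the package's actual SQL: it runs against an in-memory SQLite table so it can be tried anywhere, the table name `pages`, the `path` column, and the sample rows are hypothetical, and keeping the earliest `received_at` per `page_view_id` is an assumption about which duplicate to keep.

```python
import sqlite3

# Hypothetical sample rows: (page_view_id, received_at, path).
ROWS = [
    ("pv-1", "2022-01-01 10:00:00", "/home"),
    ("pv-1", "2022-01-07 09:30:00", "/home"),  # same id, received days later
    ("pv-2", "2022-01-02 12:00:00", "/pricing"),
]

# Number the rows within each page_view_id and keep only the first one.
# Ordering by received_at (earliest wins) is an assumption for this sketch.
DEDUPE_SQL = """
with source as (
    select * from pages
),
numbered as (
    select
        *,
        row_number() over (
            partition by page_view_id
            order by received_at
        ) as row_num
    from source
),
deduped as (
    select page_view_id, received_at, path
    from numbered
    where row_num = 1
)
select * from deduped order by page_view_id
"""


def dedupe(rows):
    """Load rows into an in-memory table and return one row per page_view_id."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "create table pages (page_view_id text, received_at text, path text)"
    )
    conn.executemany("insert into pages values (?, ?, ?)", rows)
    result = conn.execute(DEDUPE_SQL).fetchall()
    conn.close()
    return result


deduped = dedupe(ROWS)
```

With one row per id, a unique test on the key can pass and a merge keyed on that id no longer sees several source rows for the same target row. The same `row_number()` pattern should carry over to BigQuery, though the exact syntax used in this PR may differ.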

Checklist

  • I have verified that these changes work locally
  • [N/A] I have updated the README.md (if applicable)
  • I have added tests & descriptions to my models (and macros if applicable)

Contributor

@joellabes left a comment

Seems reasonable! Thanks for submitting the improvement @MarkMacArdle 🙏

@joellabes merged commit 4731ffe into dbt-labs:master on Feb 10, 2022