This repository was archived by the owner on Dec 4, 2024. It is now read-only.

Merge pull request #45 from fishtown-analytics/refactor/checklist
Refactor/checklist
Claire Carroll authored Apr 28, 2020
2 parents d756a94 + 287a73e commit 428056f
Showing 19 changed files with 139 additions and 87 deletions.
1 change: 1 addition & 0 deletions .github/CODEOWNERS
@@ -0,0 +1 @@
* @clrcrl
58 changes: 58 additions & 0 deletions .github/issue_template/bug_report.md
@@ -0,0 +1,58 @@
---
name: Bug report
about: Report a bug or an issue you've found with this package
title: ''
labels: bug, triage
assignees: ''

---

### Describe the bug
<!---
A clear and concise description of what the bug is. You can also use the issue title to do this
--->

### Steps to reproduce
<!---
In as much detail as possible, please provide steps to reproduce the issue. Sample data that triggers the issue, example model code, etc is all very helpful here.
--->

### Expected results
<!---
A clear and concise description of what you expected to happen.
--->

### Actual results
<!---
A clear and concise description of what actually happened.
--->

### Screenshots and log output
<!---
If applicable, add screenshots or log output to help explain your problem.
--->

### System information
**The contents of your `packages.yml` file:**

**Which database are you using dbt with?**
- [ ] postgres
- [ ] redshift
- [ ] bigquery
- [ ] snowflake
- [ ] other (specify: ____________)


**The output of `dbt --version`:**
```
<output goes here>
```

**The operating system you're using:**

**The output of `python --version`:**

### Additional context
<!---
Add any other context about the problem here.
--->
20 changes: 20 additions & 0 deletions .github/issue_template/feature_request.md
@@ -0,0 +1,20 @@
---
name: Feature request
about: Suggest an idea for dbt
title: ''
labels: enhancement, triage
assignees: ''

---

### Describe the feature
A clear and concise description of what you want to happen.

### Describe alternatives you've considered
A clear and concise description of any alternative solutions or features you've considered.

### Additional context
Is this feature database-specific? Which database(s) is/are relevant? Please include any other relevant context here.

### Who will this benefit?
What kind of use case will this feature be useful for? Please be specific and provide examples; this will help us prioritize properly.
8 changes: 8 additions & 0 deletions .github/pull_request_template.md
@@ -0,0 +1,8 @@
## Description & motivation
<!---
Describe your changes, and why you're making them.
-->

## Checklist
- [ ] I have verified that these changes work locally
- [ ] I have updated the README.md (if applicable)
58 changes: 13 additions & 45 deletions README.md
@@ -1,16 +1,16 @@
### Segment Sessionization
# dbt-segment
This [dbt package](https://docs.getdbt.com/docs/package-management):
* Performs "user stitching" to tie all events associated with a cookie to the same user_id
* Transforms pageviews into sessions ("sessionization")

This package requires [dbt](https://www.getdbt.com/) >= 0.12.2.

### Installation instructions
## Installation instructions

1. Include this package in your `packages.yml` -- check [here](https://hub.getdbt.com/fishtown-analytics/segment/latest/)
for installation instructions.
2. Include the following in your `dbt_project.yml` directly within your
`models:` block (making sure to handle indenting appropriately):
2. Run `dbt deps`
3. Include the following in your `dbt_project.yml` directly within your
`models:` block (making sure to handle indenting appropriately). **Update the value to point to your segment page views table**.

```YAML
# dbt_project.yml
@@ -28,7 +28,7 @@ You may have to do some pre-processing in an upstream model to get it into this
Similarly, if you need to union multiple sources, de-duplicate records, or filter
out bad records, do this in an upstream model.

3. Optionally configure extra parameters – see [dbt_project.yml](dbt_project.yml)
4. Optionally configure extra parameters by adding them to your own `dbt_project.yml` file – see [dbt_project.yml](dbt_project.yml)
for more details:
```yaml
# dbt_project.yml
@@ -40,47 +40,15 @@ models:
segment_page_views_table: "{{ source('segment', 'pages') }}"
segment_sessionization_trailing_window: 3
segment_inactivity_cutoff: 30 * 60
segment_pass_through_columns: []=
segment_pass_through_columns: []

```
4. Execute `dbt seed` -- this project includes a CSV that must be seeded for it
5. Execute `dbt seed` -- this project includes a CSV that must be seeded for
the package to run successfully.
5. Execute `dbt run` – the Segment models will get built as part of your run!
6. Execute `dbt run` – the Segment models will get built as part of your run!

### Database support
These package can be used on Redshift, Snowflake, and BigQuery.
## Database support
This package has been tested on Redshift, Snowflake, and BigQuery.

### Description of model
#### segment_web_page_views

This is a base model for Segment's web page views table. It does some straightforward renaming and parsing of Segment raw data in this table.

#### segment_web_user_stitching

This model performs "user stitching" on top of web event data. User stitching is the process of tying all events associated with a cookie to the same user_id, and solves a common problem in event analytics that users are only identified part way through their activity stream. This model returns a single user_id for every anonymous_id, and is later joined in to build a `blended_user_id` field, that acts as the primary user identifier for all sessions.

#### segment_web_page_views__sessionized

The purpose of this model is to assign a `session_id` to page views. The business logic of how this is done is that any period of inactivity of 30 minutes or more resets the session, and any subsequent page views are assigned a new `session_id`.

#### segment_web_sessions__initial

This model performs the aggregation of page views into sessions. The `session_id` having already been calculated in `segment_web_page_views__sessionized`, this model simply calls a bunch of window functions to grab the first or last value of a given field and store it at the session level.

#### segment_web_sessions__stitched

This model joins initial session data with user stitching to get the field `blended_user_id`, the id for a user across all devices that they can be identified on. This logic is broken out from other models because, while incremental, it will frequently need to be rebuilt from scratch: this is because the user stitching process can change the `blended_user_id` values for historical sessions.

It is recommended to typically run this model in its default configuration (incrementally) but on some regular basis to do a `dbt run --full-refresh --models segment_web_sessions__stitched+` so that this model and downstream models get rebuilt.

#### segment_web_sessions

The purpose of this model is to expose a single web session, derived from Segment web events. Sessions are the most common way that analysis of web visitor behavior is conducted, and although Segment doesn't natively output session data, this model uses standard logic to create sessions out of page view events.

A session is meant to represent a single instance of web activity where a user is actively browsing a website. In this case, we are demarcating sessions by 30 minute windows of inactivity: if there is 30 minutes of inactivity between two page views, the second page view begins a new session. Additionally, page views across different devices will always be tied to different sessions.

The logic implemented in this particular model is responsible for incrementally calculating a user's session number; the core sessionization logic is done in upstream models.

### Contributing ###

Additional contributions to this repo are very welcome! Please submit PRs to master. All PRs should only include functionality that is contained within all Segment deployments; no implementation-specific details should be included.
### Contributing
Additional contributions to this repo are very welcome! Check out [this post](https://discourse.getdbt.com/t/contributing-to-a-dbt-package/657) on the best workflow for contributing to a package. All PRs should only include functionality that is contained within all Segment deployments; no implementation-specific details should be included.
14 changes: 7 additions & 7 deletions analysis/mode_queries/audience_overview.sql
@@ -8,16 +8,16 @@
-#}

with source as (

select * from {{ref('segment_web_sessions')}}

)

, final as (

select
date_trunc({% raw %}'{{date_part}}'{% endraw %}, session_start_tstamp)::date as period,

count(*) as sessions,
count(distinct blended_user_id) as distinct_users,
sum(page_views) as page_views,
@@ -29,12 +29,12 @@ with source as (
sum(case when session_number > 1 then 1 else 0 end) as repeat_sessions

from source

where session_start_tstamp >= '{% raw %}{{start_date}}{% endraw %}'
and session_start_tstamp < '{% raw %}{{end_date}}{% endraw %}'

group by 1

)

select * from final
2 changes: 1 addition & 1 deletion data/referrer_mapping.csv
@@ -1802,4 +1802,4 @@ social,Tumblr,t.umblr.com
search,Sogou,sogou.com
search,Sogou,m.sogou.com
social,UISDC,uisdc.com
social,UISDC,hao.uisdc.com
social,UISDC,hao.uisdc.com
16 changes: 9 additions & 7 deletions dbt_project.yml
@@ -2,7 +2,7 @@ name: 'segment'
version: '1.0'

source-paths: ["models"]
analysis-paths: ["analysis"]
analysis-paths: ["analysis"]
test-paths: ["tests"]
data-paths: ["data"]
macro-paths: ["macros"]
@@ -12,21 +12,23 @@ clean-targets:
- "target"
- "dbt_modules"

require-dbt-version: ">=0.14.0"

models:
vars:
# location of raw data table
segment_page_views_table:
# number of trailing hours to re-sessionize for.
# events can come in late and we want to still be able to incorporate
segment_page_views_table:

# number of trailing hours to re-sessionize for.
# events can come in late and we want to still be able to incorporate
# them into the definition of a session without needing a full refresh.
segment_sessionization_trailing_window: 3
segment_sessionization_trailing_window: 3

# sessionization inactivity cutoff: if there is a gap in page view times
# that exceeds this number of seconds, the subsequent page view will
# start a new session.
segment_inactivity_cutoff: 30 * 60

# If there are extra columns you wish to pass through this package,
# define them here. Columns will be included in the `segment_web_sessions`
# model as `first_<column>` and `last_<column>`. Extremely useful when
2 changes: 1 addition & 1 deletion integration_tests/data/example_segment_pages.csv
@@ -1,2 +1,2 @@
id,anonymous_id,user_id,received_at,sent_at,timestamp,url,path,title,search,referrer,context_campaign_source,context_campaign_medium,context_campaign_name,context_campaign_term,context_campaign_content,context_ip,context_user_agent
ajs-9527e97fc714abe02a23b3d24cf09a36,507f191e810c19729de860ea,97980cfea0067,2015-02-23 22:28:55,2015-02-23 22:28:55,2015-02-23 22:28:55,https://www.getdbt.com/,/product,Product,?name=ferret,https://www.google.com/,facebook,social,autumn_collection_2015,foo,bar,8.8.8.8,"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/40.0.2214.115 Safari/537.36"
ajs-9527e97fc714abe02a23b3d24cf09a36,507f191e810c19729de860ea,97980cfea0067,2015-02-23 22:28:55,2015-02-23 22:28:55,2015-02-23 22:28:55,https://www.getdbt.com/,/product,Product,?name=ferret,https://www.google.com/,facebook,social,autumn_collection_2015,foo,bar,8.8.8.8,"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/40.0.2214.115 Safari/537.36"
2 changes: 0 additions & 2 deletions integration_tests/dbt_project.yml
@@ -4,8 +4,6 @@ version: '1.0'

profile: 'integration_tests'

require-dbt-version: ">=0.15.0"

models:
segment:
vars:
1 change: 0 additions & 1 deletion integration_tests/packages.yml
@@ -1,3 +1,2 @@

packages:
- local: ../
2 changes: 1 addition & 1 deletion models/base/docs.md
@@ -4,4 +4,4 @@

This is a base model for Segment's web page views table. It does some straightforward renaming and parsing of Segment raw data in this table.

{% enddocs %}
{% enddocs %}
2 changes: 1 addition & 1 deletion models/base/schema.yml
@@ -7,4 +7,4 @@ models:
- name: page_view_id
tests:
- unique
- not_null
- not_null
18 changes: 9 additions & 9 deletions models/base/segment_web_page_views.sql
@@ -1,17 +1,17 @@
with source as (

select * from {{var('segment_page_views_table')}}

),

renamed as (

select

id as page_view_id,
anonymous_id,
user_id,

received_at as received_at_tstamp,
sent_at as sent_at_tstamp,
timestamp as tstamp,
@@ -21,7 +21,7 @@ renamed as (
path as page_url_path,
title as page_title,
search as page_url_query,

referrer,
replace(
{{ dbt_utils.get_url_host('referrer') }},
@@ -43,19 +43,19 @@ renamed as (
{{ dbt_utils.split_part(dbt_utils.split_part('context_user_agent', "'('", 2), "' '", 1) }},
';', '')
end as device

{% if var('segment_pass_through_columns') != [] %}
,
{{ var('segment_pass_through_columns') | join (", ")}}

{% endif %}

from source

),

final as (

select
*,
case
@@ -69,4 +69,4 @@ final as (

)

select * from final
select * from final
6 changes: 2 additions & 4 deletions models/sessionization/docs.md
@@ -22,7 +22,7 @@ The implementation of this logic is rather involved, and requires multiple CTEs.

{% docs segment_web_sessions__initial %}

This model performs the aggregation of page views into sessions. The `session_id` having already been calculated in `segment_web_page_views__sessionized`, this model simply calls a bunch of window functions to grab the first or last value of a given field and store it at the session level.
This model performs the aggregation of page views into sessions. The `session_id` having already been calculated in `segment_web_page_views__sessionized`, this model simply calls a bunch of window functions to grab the first or last value of a given field and store it at the session level.

{% enddocs %}

@@ -42,12 +42,10 @@ It is recommended to typically run this model in its default configuration (incr

{% docs segment_web_sessions %}

The purpose of this model is to expose a single web session, derived from Segment web events. Sessions are the most common way that analysis of web visitor behavior is conducted, and although Segment doesn't natively output session data, this model uses standard logic to create sessions out of page view events.
The purpose of this model is to expose a single web session, derived from Segment web events. Sessions are the most common way that analysis of web visitor behavior is conducted, and although Segment doesn't natively output session data, this model uses standard logic to create sessions out of page view events.

A session is meant to represent a single instance of web activity where a user is actively browsing a website. In this case, we are demarcating sessions by 30 minute windows of inactivity: if there is 30 minutes of inactivity between two page views, the second page view begins a new session. Additionally, page views across different devices will always be tied to different sessions.

The logic implemented in this particular model is responsible for incrementally calculating a user's session number; the core sessionization logic is done in upstream models.

{% enddocs %}
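
As a rough illustration of the inactivity rule described in the `segment_web_sessions` docs above, here is a minimal Python sketch. It is not the package's SQL implementation. It assumes page views arrive as `(anonymous_id, tstamp)` tuples, treats a gap of exactly 30 minutes as starting a new session, and uses `anonymous_id` as a stand-in for a device. The function name `sessionize` is hypothetical.

```python
from datetime import datetime, timedelta

# Mirrors the package default `segment_inactivity_cutoff: 30 * 60` (seconds).
INACTIVITY_CUTOFF = timedelta(seconds=30 * 60)


def sessionize(page_views):
    """Assign a per-visitor session number to each page view.

    `page_views` is an iterable of (anonymous_id, tstamp) tuples.
    Returns (anonymous_id, tstamp, session_number) tuples, sorted by
    visitor and then by time.
    """
    result = []
    last_seen = {}       # anonymous_id -> timestamp of the previous page view
    session_number = {}  # anonymous_id -> current session number

    for anonymous_id, tstamp in sorted(page_views):
        previous = last_seen.get(anonymous_id)
        # Assumption: a gap equal to the cutoff also starts a new session.
        if previous is None or tstamp - previous >= INACTIVITY_CUTOFF:
            session_number[anonymous_id] = session_number.get(anonymous_id, 0) + 1
        last_seen[anonymous_id] = tstamp
        result.append((anonymous_id, tstamp, session_number[anonymous_id]))

    return result
```

Called with three page views for one visitor at 12:00, 12:10 and 12:45, the first two share a session and the third starts a new one, because the 35 minute gap exceeds the cutoff. Keying the state by `anonymous_id` is what keeps page views from different devices in different sessions in this sketch.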


2 changes: 1 addition & 1 deletion models/sessionization/schema.yml
@@ -41,4 +41,4 @@ models:
description: ''
tests:
- unique
- not_null
- not_null
2 changes: 1 addition & 1 deletion models/sessionization/segment_web_sessions__stitched.sql
@@ -43,4 +43,4 @@ joined as (

)

select * from joined
select * from joined