Transforms for session calculation #54110

Open
graphaelli opened this issue Mar 24, 2020 · 4 comments
Labels
:ml/Transform Transform

Comments

@graphaelli
Member

Describe the feature:

Transform support/example for calculating sessions from event streams with unaligned windows, e.g. per-device web sessions from click streams.

We're interested in generating sessions from click streams for APM. Session ids would not be typical in that kind of data; more likely it would be tuples of (timestamp, deviceid). The expectation is that the facts used to build the stream can arrive at any time, and that a session is bounded by either a period of inactivity or a maximum period of activity, e.g. 30 minutes of inactivity or 12 hours of continuous activity. There are other constraints we can impose if needed.

An example of a single sessionization might look like:

Given
PUT clicks
{
  "mappings": {
    "properties": {
      "@timestamp": {
        "type": "date"
      },
      "device": {
        "type": "keyword"
      }
    }
  }
}

# a: 2 sessions
# b: 1 session
# c: 1 session
PUT clicks/_bulk
{"index":{}}
{"@timestamp":"2020-01-01T12:00:00Z","device":"a"}
{"index":{}}
{"@timestamp":"2020-01-01T12:00:00Z","device":"b"}
{"index":{}}
{"@timestamp":"2020-01-01T12:10:30Z","device":"a"}
{"index":{}}
{"@timestamp":"2020-01-01T13:00:00Z","device":"c"}
{"index":{}}
{"@timestamp":"2020-01-01T13:00:00Z","device":"a"}
{"index":{}}
{"@timestamp":"2020-01-01T13:01:00Z","device":"c"}
{"index":{}}
{"@timestamp":"2020-01-01T13:02:00Z","device":"c"}

Calculate 30 minute sessions with:

GET clicks/_search
{
  "size": 0,
  "aggs": {
    "devices": {
      "terms": {
        "field": "device",
        "size": 10
      },
      "aggs": {
        "count": {
          "bucket_script": {
            "buckets_path": {
              "count": "permin._bucket_count"
            },
            "script": "params.count"
          }
        },
        "permin": {
          "date_histogram": {
            "field": "@timestamp",
            "calendar_interval": "minute",
            "min_doc_count": 1,
            "keyed": false
          },
          "aggs": {
            "t": {
              "min": {
                "field": "@timestamp"
              }
            },
            "diff": {
              "serial_diff": {
                "buckets_path": "t",
                "lag": 1
              }
            },
            "sessions": {
              "bucket_selector": {
                "buckets_path": {
                  "diff": "diff"
                },
                "script": "params.diff == null || params.diff > 1800000"
              }
            }
          }
        }
      }
    }
  }
}

This does not factor in a maximum session length. I'd expect to run this periodically, with a lookback to the beginning of the day/month/30-day period, and restate all sessions in that time frame.
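For comparison, here is a minimal Python sketch of the same sessionization done client-side, with the maximum session length added. It is not part of the aggregation above and works on raw clicks rather than per-minute buckets. The function name `sessionize`, the tuple layout, and the rule that a click at exactly the gap or the maximum length still belongs to the current session are all assumptions made for illustration.

```python
from collections import Counter, defaultdict
from datetime import datetime, timedelta


def sessionize(clicks, gap=timedelta(minutes=30), max_length=timedelta(hours=12)):
    """Group (timestamp, device) tuples into per-device sessions.

    A new session starts when the time since the previous click exceeds
    `gap`, or when the click would stretch the session past `max_length`.
    Returns (device, start, end, click_count) tuples.
    """
    by_device = defaultdict(list)
    for ts, device in clicks:
        by_device[device].append(ts)

    sessions = []
    for device, stamps in by_device.items():
        stamps.sort()  # clicks may arrive in any order
        start = prev = stamps[0]
        count = 1
        for ts in stamps[1:]:
            if ts - prev > gap or ts - start > max_length:
                sessions.append((device, start, prev, count))
                start, count = ts, 0
            prev = ts
            count += 1
        sessions.append((device, start, prev, count))
    return sessions


def parse(value):
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")


# The same documents as the _bulk request above.
CLICKS = [
    (parse("2020-01-01T12:00:00Z"), "a"),
    (parse("2020-01-01T12:00:00Z"), "b"),
    (parse("2020-01-01T12:10:30Z"), "a"),
    (parse("2020-01-01T13:00:00Z"), "c"),
    (parse("2020-01-01T13:00:00Z"), "a"),
    (parse("2020-01-01T13:01:00Z"), "c"),
    (parse("2020-01-01T13:02:00Z"), "c"),
]

per_device = Counter(device for device, *_ in sessionize(CLICKS))
print(dict(per_device))  # a: 2, b: 1, c: 1
```

Restating sessions for a lookback period would then amount to re-running this over all clicks in that period.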

cc @joshdevins

@graphaelli graphaelli added the :ml/Transform Transform label Mar 24, 2020
@elasticmachine
Collaborator

Pinging @elastic/ml-core (:ml/Transform)

@joshdevins
Member

joshdevins commented Mar 26, 2020

I can echo the requirements that @graphaelli has laid out for the search metrics use-case. Search behaviour (queries and clicks) follows very similar patterns and we would have similar needs. We would, however, sessionize already aggregated/summarized queries from click streams into search sessions. For example, given (q queries, c clicks, qs query summaries, s sessions):

Aggregate per-query metrics based on click streams after a search query:

q1, c1, c2 -> qs1
q2, c3, c4, c5 -> qs2

Aggregate query summaries into sessions:

qs1, qs2 -> s1

The semantics would be very similar. For example, sessions should consist of queries/query summaries that are no more than 30 minutes apart, from a single user.
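The two stages could be sketched roughly as follows in Python. The event shape `(timestamp, kind, event_id)`, the function names, and the choice to measure the 30-minute gap between query timestamps (rather than from the last click) are assumptions for illustration; the timestamps are made up and only the ids mirror the example above.

```python
from datetime import datetime, timedelta


def summarize_queries(events):
    """Stage 1: fold each query and the clicks that follow it into one summary.

    `events` is a time-ordered list of (timestamp, kind, event_id) tuples
    for a single user, where kind is "query" or "click".
    """
    summaries = []
    for ts, kind, event_id in events:
        if kind == "query":
            summaries.append({"query": event_id, "timestamp": ts, "clicks": []})
        elif summaries:  # a click before any query has nothing to attach to
            summaries[-1]["clicks"].append(event_id)
    return summaries


def sessionize_summaries(summaries, gap=timedelta(minutes=30)):
    """Stage 2: group consecutive summaries that are no more than `gap` apart."""
    sessions = []
    for summary in summaries:
        if sessions and summary["timestamp"] - sessions[-1][-1]["timestamp"] <= gap:
            sessions[-1].append(summary)
        else:
            sessions.append([summary])
    return sessions


# Illustrative timestamps only.
t0 = datetime(2020, 1, 1, 12, 0)
events = [
    (t0, "query", "q1"),
    (t0 + timedelta(minutes=1), "click", "c1"),
    (t0 + timedelta(minutes=2), "click", "c2"),
    (t0 + timedelta(minutes=5), "query", "q2"),
    (t0 + timedelta(minutes=6), "click", "c3"),
    (t0 + timedelta(minutes=7), "click", "c4"),
    (t0 + timedelta(minutes=8), "click", "c5"),
]

summaries = summarize_queries(events)     # qs1, qs2
sessions = sessionize_summaries(summaries)  # one session holding both summaries
```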

I could imagine a kind of sessionize transform that just requires setting the time field, the time gap of inactivity, and the user/device/pivot field.
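To make the idea concrete, here is a rough Python sketch of what those three settings might drive. `SessionizeConfig` and `assign_sessions` are hypothetical names, not an existing Elasticsearch transform option, and documents are assumed to be dicts whose time field already holds `datetime` values.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from itertools import groupby


@dataclass(frozen=True)
class SessionizeConfig:
    """Hypothetical settings; not an existing transform configuration."""
    time_field: str   # e.g. "@timestamp"
    group_by: str     # user/device/pivot field
    gap: timedelta    # inactivity gap that closes a session


def assign_sessions(docs, config):
    """Return copies of `docs` with a zero-based per-group `session` number."""
    ordered = sorted(docs, key=lambda d: (d[config.group_by], d[config.time_field]))
    out = []
    for _, group in groupby(ordered, key=lambda d: d[config.group_by]):
        session, prev = 0, None
        for doc in group:
            ts = doc[config.time_field]
            if prev is not None and ts - prev > config.gap:
                session += 1
            prev = ts
            out.append({**doc, "session": session})
    return out


docs = [
    {"@timestamp": datetime(2020, 1, 1, 12, 0, 0), "device": "a"},
    {"@timestamp": datetime(2020, 1, 1, 13, 0, 0), "device": "a"},
    {"@timestamp": datetime(2020, 1, 1, 12, 10, 30), "device": "a"},
]
config = SessionizeConfig("@timestamp", "device", timedelta(minutes=30))
tagged = assign_sessions(docs, config)
```

Because the gap is just a setting, the same documents can be re-sessionized with a different value without changing how they were indexed.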

@hendrikmuhs

@joshdevins I think as long as you own the application layer, you can introduce a UUID to build sessions afterwards.
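A minimal sketch of that application-side approach, assuming events are seen in time order per user and that a 30-minute gap starts a new session. `SessionTagger` is a hypothetical helper, not part of any client library.

```python
import uuid
from datetime import datetime, timedelta


class SessionTagger:
    """Hands out a session id per user at write time (hypothetical helper)."""

    def __init__(self, gap=timedelta(minutes=30)):
        self.gap = gap
        self._last = {}  # user -> (time of last event, current session id)

    def tag(self, user, ts):
        """Return the session id to store on the event seen at `ts`."""
        last = self._last.get(user)
        if last is None or ts - last[0] > self.gap:
            session_id = str(uuid.uuid4())  # first event, or gap exceeded
        else:
            session_id = last[1]
        self._last[user] = (ts, session_id)
        return session_id


tagger = SessionTagger()
t0 = datetime(2020, 1, 1, 12, 0)
first = tagger.tag("u1", t0)
second = tagger.tag("u1", t0 + timedelta(minutes=10))  # same session as `first`
third = tagger.tag("u1", t0 + timedelta(minutes=60))   # gap exceeded, new id
```

Once every event carries such an id, building sessions afterwards reduces to grouping by that field. The gap is fixed when the event is written, though, and late or out-of-order events are not handled.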

For the devices use case, it sounds like we have no control over data creation and therefore need the described solution to build sessions.

I wonder if this requires something on the aggregation layer.

@joshdevins
Member

joshdevins commented Mar 27, 2020

@hendrikmuhs Definitely possible, but sometimes not practical. It can also be useful to change the "timeout" value and determine session semantics at read time rather than at write time. So I suspect the guidance will indeed be to do this in the application/client, but there is still value in being able to do it after the fact.

4 participants